hpc-class: Using OpenMP or -parallel


Parallelism beyond a single node (2 CPUs on hpc-class)
requires the use of MPI, but MPI requires major changes
to an existing program.  Parallelism within a single 2 CPU
node can be obtained in two ways: with automatic
parallelism (the -parallel compiler option) or with OpenMP
(the -openmp compiler option).

The simplest way to get parallel execution is to add -parallel
to your compile command.  Then issue
setenv OMP_NUM_THREADS 2
./a.out

Another way to obtain parallelism is OpenMP, which is used to
express parallelism on a shared memory machine.
Since each of the nodes on hpc-class is a shared memory
machine with 2 processors, OpenMP can be used to run a
program on both processors of a node.
It requires changes to the program, but not nearly as many as
MPI. (The gains are generally smaller than with MPI, but greater
than with automatic parallelism.)

For example, placing the OpenMP directive
!$OMP PARALLEL DO
just before

do j=2,n-1
 do i=2,m-1
 a(i,j)=(b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)+4.d0*b(i,j))/6.d0
 enddo
enddo

signals to an OpenMP compiler that the j loop can be performed on multiple
processors.

To run the program, issue

setenv OMP_NUM_THREADS 2
./a.out

and the program will run with two "threads", which can run one on
each of the two processors.  Everything runs on just one thread until
the above directive is reached.  The iterations of the j loop are then
divided between the two threads, so each performs about half the work.

Without the -openmp flag on the compilation step, the directive is
treated as an ordinary comment and ignored.
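
One way to confirm that OpenMP was actually enabled at compile time,
and that the OMP_NUM_THREADS setting took effect, is to have the
program report the size of its thread team.  The following is a
minimal sketch in C.  The function name team_size is hypothetical and
used only for illustration.  The sketch relies on the _OPENMP macro,
which the OpenMP standard requires a compiler to define when OpenMP is
enabled, and on the OpenMP library routine omp_get_num_threads.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of threads in a parallel
   region, or 1 when the file is compiled without OpenMP support. */
int team_size(void)
{
    int count = 1;   /* value kept when OpenMP is not enabled */
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Only the master thread writes the shared variable. */
        #pragma omp master
        count = omp_get_num_threads();
    }
#endif
    return count;
}
```

Compiled with OpenMP enabled and run with OMP_NUM_THREADS set to 2,
this would normally be expected to return 2.  Compiled without the
flag, the pragmas are ignored and it returns 1.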

In C and C++, OpenMP directives are written as pragmas (lines
beginning with #pragma omp) rather than as special comments.
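
As an illustration, the Fortran loop above might be written in C as
sketched below.  This is an assumed translation, not code from
hpc-class: the function name smooth and the row-by-row storage of the
grids in one-dimensional arrays are choices made for the example.
Because C stores arrays by rows, the outer (parallel) loop runs over
the first index here.  Unlike in Fortran, the inner loop index is not
made private automatically in C, so it is listed in a private clause.

```c
#include <assert.h>

/* Hypothetical C counterpart of the Fortran loop above.
   a and b are m-by-n grids stored row by row, so element (i,j)
   is at index i*n + j.  Only interior points of a are written. */
void smooth(double *a, const double *b, int m, int n)
{
    int i, j;

    /* The stencil needs at least one interior point. */
    assert(m >= 3 && n >= 3);

    /* i is private automatically as the parallel loop index;
       j must be declared private explicitly. */
    #pragma omp parallel for private(j)
    for (i = 1; i < m - 1; i++) {
        for (j = 1; j < n - 1; j++) {
            a[i*n + j] = (b[i*n + j + 1] + b[i*n + j - 1]
                        + b[(i - 1)*n + j] + b[(i + 1)*n + j]
                        + 4.0 * b[i*n + j]) / 6.0;
        }
    }
}
```

If the file is compiled without OpenMP enabled, the pragma is ignored
and the loop runs on one thread with the same result.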

In general, OpenMP programs run fastest when most of the operations
are on data which is "private" (each thread has its own copy) rather
than "shared" (one copy is visible to all threads). See the OpenMP
standard for the full meaning of private and shared data.
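
The distinction can be sketched in C as follows.  The function name
sum_squares is hypothetical.  In this example the input array x is
shared and only read, the scratch variable t is private so that each
thread has its own copy, and total is combined across threads with a
reduction clause rather than being updated as a shared variable.

```c
#include <assert.h>

/* Hypothetical example: returns the sum of squares of x[0..n-1]. */
double sum_squares(const double *x, int n)
{
    int i;
    double t;            /* scratch value, private to each thread */
    double total = 0.0;  /* per-thread partial sums, combined at the end */

    assert(n >= 0);

    /* x is shared but only read.  Without private(t), the threads
       would overwrite each other's scratch value. */
    #pragma omp parallel for private(t) reduction(+:total)
    for (i = 0; i < n; i++) {
        t = x[i];
        total += t * t;
    }
    return total;
}
```

Each thread works almost entirely on its own copies of t and total,
which is the pattern the paragraph above recommends.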

The Intel compilers on hpc-class implement OpenMP 2.0 except for the 
workshare directive.