The performance of an HPF program's communications on a particular parallel system is influenced by several factors: the amount of communication the program requires for computation and for overhead, the system's latency and bandwidth where communication is required, and the number and power of the optimizations the compiler performs to improve or eliminate communications. Latency is the minimum time required for a communication between two processors. Bandwidth is the maximum rate at which a message can be sent from one processor to another. Optimizations group communications to eliminate unnecessary communications, minimize the effects of latency, and maximize the bandwidth used per communication. This chapter covers some of the communication optimizations that pghpf performs; other factors, such as system configuration, latency, and bandwidth, are useful to keep in mind when considering a program's performance on a parallel system.
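As a rough illustration of why grouping communications matters (a Python sketch with assumed latency and bandwidth figures, not measurements of any particular system), communication time for a message of n bytes is often modeled as latency + n/bandwidth:

```python
# Simple linear communication-cost model: t(n) = latency + n / bandwidth.
# The latency and bandwidth values below are illustrative assumptions.

LATENCY = 50e-6        # 50 microseconds per message (assumed)
BANDWIDTH = 100e6      # 100 MB/s (assumed)

def comm_time(nbytes, messages=1):
    """Time to move nbytes split across the given number of messages."""
    return messages * LATENCY + nbytes / BANDWIDTH

total = 1_000_000  # one megabyte of data to communicate

many_small = comm_time(total, messages=1000)  # 1000 messages of 1 KB each
one_large  = comm_time(total, messages=1)     # a single 1 MB message

# Grouping communications pays the per-message latency once instead of
# 1000 times, so the single large message is far cheaper.
print(f"1000 small messages: {many_small*1e3:.1f} ms")
print(f"1 large message:     {one_large*1e3:.1f} ms")
```

Under this model the bandwidth term is identical in both cases; only the latency term changes, which is why eliminating and grouping communications is so effective.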
Table 6-1: Communication Primitives - General Case

Left Hand Side   Right Hand Side   Communication Primitive Name
i                i                 No communication required
i                i+c               Overlap Shift (optimized)
i                s*i+c             Copy Section (optimized)
i                j                 Unstructured (permute section)
v(i)             i                 Gather/Scatter
i                v(i)              Gather/Scatter
unknown          unknown           Scalarization (scalar communications)

c: compile time constant
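The Gather/Scatter rows of Table 6-1 cover subscripts expressed through an indirection array v. Their semantics can be sketched as follows (a Python model for illustration, not pghpf runtime code):

```python
# Illustrative semantics of the Gather/Scatter entries in Table 6-1.
# v is an indirection array; Python's 0-based indexing is used here.

def gather(b, v):
    # a(i) = b(v(i)): a right-hand-side indirection gathers elements of b.
    return [b[vi] for vi in v]

def scatter(a, b, v):
    # a(v(i)) = b(i): a left-hand-side indirection scatters elements of b.
    for i, vi in enumerate(v):
        a[vi] = b[i]
    return a

b = [10, 20, 30, 40]
v = [2, 0, 3, 1]
print(gather(b, v))       # [30, 10, 40, 20]
```

Because the target processor of each element depends on runtime values in v, these patterns require the more expensive scheduled communication discussed later in this chapter.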
The following program is an example of the first case in Table 6-1; because A and B have identical distributions, the assignment B=A requires no communication:

PROGRAM TEST45
INTEGER I, A(100), B(100)
!HPF$ DISTRIBUTE (*):: A,B
B=A
END
In the first stage of the overlap shift communication, the compiler determines that a computation involving the array B requires an overlap shift area in the positive direction (pghpf also permits negative overlap shift areas). A portion of B is then allocated with the extra overlap location(s).

PROGRAM TEST_OVERLAP
INTEGER I, A(8), B(8)
!HPF$ DISTRIBUTE (BLOCK):: A,B
FORALL(I=1:7) A(I)=B(I+1)
END
Figure 6-2 Sample Overlap Shift Optimization
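Conceptually, the overlap shift gives each processor a halo copy of its neighbor's boundary element so that B(I+1) is available locally. The sketch below models this idea in Python (the block_split and add_overlap helpers are illustrative, not pghpf internals):

```python
# Conceptual model of an overlap (halo) shift for A(I) = B(I+1) with a
# BLOCK distribution over 2 processors: each local block of B gets one
# extra "overlap" element holding its right neighbor's first element.

def block_split(arr, nproc):
    # Split arr into nproc equal BLOCK-distributed pieces.
    n = len(arr) // nproc
    return [arr[p*n:(p+1)*n] for p in range(nproc)]

def add_overlap(blocks):
    # Append one overlap element from the right neighbor to each block;
    # the last processor has no right neighbor, so nothing is appended.
    out = []
    for p, blk in enumerate(blocks):
        halo = blocks[p+1][:1] if p + 1 < len(blocks) else []
        out.append(blk + halo)
    return out

b = [10, 20, 30, 40, 50, 60, 70, 80]
blocks = add_overlap(block_split(b, 2))
# After the halo exchange, each processor computes A(I) = B(I+1) locally.
a_local = [[blk[i+1] for i in range(len(blk) - 1)] for blk in blocks]
print(a_local)   # [[20, 30, 40, 50], [60, 70, 80]]
```

One small boundary exchange per processor replaces element-by-element communication, which is why the shift pattern is optimized.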
The following sample code uses the copy section communication optimization.
PROGRAM TEST_SECTION
INTEGER SCALAR_VAL, I, A(100), B(100)
!HPF$ DISTRIBUTE (BLOCK)::A,B
READ *, SCALAR_VAL
FORALL(I=1:100) A(I)=B(SCALAR_VAL*I+1)
END
Indirection arrays generally require expensive scheduling. By careful programming, one can reduce the number of schedules generated. For example, consider the following code segment:
!hpf$ distribute (block, block) :: FR, FI
do i = 1, nproc
   FR(i,:) = FR(i, v(:))
enddo
do i = 1, nproc
   FI(i,:) = FI(i, v(:))
enddo

The compiler generates two communication schedules for FR(i,v(:)) and FI(i,v(:)), because schedules are not reused across loops. However, if the code is written as follows, the compiler will reuse the first communication schedule for the second array assignment:

do i = 1, nproc
   FR(i,:) = FR(i, v(:))
   FI(i,:) = FI(i, v(:))
enddo

Since the value of v is not changed between the statements and its second use is in the same loop, pghpf generates one schedule and reuses it for the second communication, reducing the overhead required for producing the communication schedule.
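The benefit of schedule reuse can be modeled as computing the index mapping once and applying it to both arrays (a Python sketch of the idea, not the pghpf implementation):

```python
def build_schedule(v):
    # Building the communication schedule (here, simply resolving the
    # indirection into a reusable index list) is the expensive step.
    return list(v)

def apply_schedule(row, schedule):
    # Applying a precomputed schedule is cheap and can be repeated for
    # every array that uses the same indirection.
    return [row[j] for j in schedule]

v = [3, 1, 0, 2]
fr = [1.0, 2.0, 3.0, 4.0]
fi = [5.0, 6.0, 7.0, 8.0]

sched = build_schedule(v)            # one schedule ...
fr = apply_schedule(fr, sched)       # ... reused for FR
fi = apply_schedule(fi, sched)       # ... and for FI
print(fr, fi)   # [4.0, 2.0, 1.0, 3.0] [8.0, 6.0, 5.0, 7.0]
```

In the fused loop, the schedule built for FR is still valid for FI because v has not changed between the two assignments.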
Another technique that allows the compiler to optimize this type of communication is indirection array alignment. When an indirection array is used on the right-hand side, it is better to align it with the left-hand-side array, or to replicate it. Similarly, when an indirection array is used on the left-hand side, it is better to align it with one of the right-hand-side arrays.
The compiler also recognizes patterns within FORALL statements and constructs as scatter operations. For example, the statement:

A(V) = A(V) + B

generates a call to an internal SUM_SCATTER operation, which is similar to the SUM_SCATTER routine found in the HPF library.
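The effect of such a sum-scatter can be sketched as follows (a Python model of the semantics, not the pghpf runtime routine); note that when V contains duplicate indices, the contributions accumulate:

```python
def sum_scatter(a, b, v):
    # Models A(V) = A(V) + B: each b(i) is added into a(v(i)).
    # Duplicate indices in v accumulate their contributions, which is
    # what distinguishes a sum-scatter from a plain scatter assignment.
    for i, vi in enumerate(v):
        a[vi] += b[i]
    return a

a = [0, 0, 0]
b = [1, 2, 3]
v = [1, 1, 2]             # duplicate index 1: b[0] and b[1] both land there
print(sum_scatter(a, b, v))   # [0, 3, 3]
```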
Another optimization a programmer may use is generating indirection arrays to reduce the use of expensive scalar communications. With the compiler option -Minfo, pghpf provides a diagnostic message when it scalarizes a FORALL. For example:
4, forall is scalarized: complicated communication

If a FORALL construct uses an array index with complicated subscripts, it may be better to put the complicated subscript computation into an indirection array. The following two code segments show how this is accomplished:
forall(i=1:n) FR(i) = FR(i/2+2*v(i))

This code could be replaced to use an indirection array, as shown:

forall(i=1:N) indx(i) = i/2+2*v(i)
forall(i=1:N) fr(i) = fr(indx(i))

Here pghpf will not scalarize the FORALL statements in the second example, since the index is a simple indirection and does not combine a complicated computation with an indirection.
Parallel speedup measures the decrease in program execution time as more processors are added. If T(i) is the time to execute the program with i processors, then perfect speedup occurs when T(i) = T(1)/i.
Another measure of speedup that may be used is the comparison of a program's parallel execution time with the execution time of an optimized sequential version of the program.
Another way to measure the efficiency of compiler-generated code for a parallel program is to compare it against a hand-optimized, parallel version of the same program.
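As a worked illustration, parallel speedup S(i) = T(1)/T(i) and efficiency S(i)/i can be computed from measured run times (the timings below are assumed examples, not measurements):

```python
def speedup(t1, ti):
    # Speedup with i processors: S(i) = T(1) / T(i).
    return t1 / ti

def efficiency(t1, ti, i):
    # Efficiency: S(i) / i; 1.0 means perfect speedup, T(i) = T(1)/i.
    return speedup(t1, ti) / i

# Assumed example timings: 100 s on 1 processor, 30 s on 4 processors.
print(speedup(100.0, 30.0))           # ~3.33
print(efficiency(100.0, 30.0, 4))     # ~0.83
```

Substituting an optimized sequential time or a hand-optimized parallel time for T(1), as described above, yields the alternative measures of speedup and code quality.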