5 Data Distribution

This chapter describes some of the techniques the HPF language and the pghpf compiler use to distribute data among the processors of a parallel system, and notes some data distribution limitations in the current version of pghpf. The pghpf compiler distributes data and, with the assistance of the pghpf runtime library, generates any communication a computation requires. The communication generated for a particular expression depends on the data type, the distribution and alignment specifications, and the data each computation must access. Data distribution is based on the layout specified in the HPF program, the design of the parallel computer system, and the number and arrangement of processors used. Some of the mapping of data to processors is specified by the programmer; the rest is determined by the compiler.

5.1 Run-time Model - Overview

The pghpf compiler targets an SPMD programming model. In the SPMD model, each processor executes the same program but operates on different data. This is implemented by loading the same program image onto each processor. Each processor then allocates and operates on its own local portion of each distributed array, according to the distributions, the array sizes, and the number of processors, as determined at runtime. Special attention is required to address the unique communication characteristics of many parallel systems. The pghpf runtime library handles HPF data distribution tasks in a generic manner so that HPF programs work on both distributed-memory and shared-memory systems (some parallel systems use shared memory, others use distributed memory, and there are also hybrid systems; the lower levels of the pghpf runtime library are customized for the different parallel architectures). Figure 5-1, "A Distributed Memory Parallel System", shows a conceptual view of a parallel system running an HPF program.

The pghpf runtime library takes into account the communication to be performed and is optimized at two levels: the transport-independent level, where efficient communication is generated based on the type and pattern of data access each computation requires, and the transport-dependent level, where the library's communication is carried out using the system's communication mechanism and hardware. To generate efficient code, the compiler must manage data locality, parallelism, and communication. This chapter describes the principles of data distribution that HPF and the pghpf compiler use; the HPF programmer needs to be aware of some of these details in order to write efficient parallel code.

Figure 5-1 A Distributed Memory Parallel System

5.2 HPF Data Distribution

The data distribution phase of the pghpf compiler has two important tasks that map data to a parallel system's memory and enable computations on that data:
  1. The compiler propagates an efficient data distribution for the program's variables.
  2. Each computation is partitioned according to the specified data distribution, and non-local values are communicated as necessary.
The following sections describe these tasks in more detail.

5.2.1 Propagating Data Distribution

The pghpf compiler distributes data for three types of variables:
  1. Variables with a user-specified distribution: the compiler uses the HPF directives as a guide for distributing the data.
  2. Variables without distribution directives: the data is replicated across all processors.
  3. Compiler-created temporaries: these are distributed according to their required usage.

Default Distribution

With the compiler's default distribution, all unspecified data (data without an explicit HPF distribution) is replicated among the available processors. For example, if the integer array BARRAY is used in a program and no HPF directives are supplied for distributing or aligning BARRAY, the default distribution is used and BARRAY is replicated. PROG1 and PROG2 in Example 5-1 show the default distribution. In PROG1, the compiler generates code using the default distribution because BARRAY is declared without a distribution; PROG2 shows an equivalent user-specified distribution in which BARRAY is also replicated.

! distribution directives not supplied - replication
        PROGRAM PROG1
        INTEGER BARRAY(100)


! default distribution directive supplied - replication
        PROGRAM PROG2
        INTEGER BARRAY(100)
!HPF$   DISTRIBUTE BARRAY(*)
Example 5-1 A default distribution

Explicit HPF Distribution

As described in Chapters 4 and 5 of the High Performance Fortran Handbook, pghpf distributes data according to the supplied HPF directives. The ALIGN and DISTRIBUTE directives allow data to be distributed over processors in a variety of patterns. For example, the following code represents a distribution where a computation is partitioned over the available processors. With the given ALIGN directive, X(I) is stored on the same processor as Y(I+1), so this computation involves no communication.
		REAL X(15), Y(16)
!HPF$	DISTRIBUTE Y(BLOCK)
!HPF$	ALIGN X(I) WITH Y(I+1)
		FORALL(I=1:15) X(I)=Y(I+1)
The next example is very similar, but uses a CYCLIC distribution. A block distribution divides data into chunks that are as evenly sized as possible, one chunk per available processor. A cyclic distribution divides data among the processors in a round-robin fashion: each processor gets one element from each group of n elements, where n is the number of processors.
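
The mapping is easy to see with a small worked example. The following program is an illustrative sketch, not taken from the HPF examples in this section; the array size, number of processors, and owner formulas are assumptions that match HPF's default layout for one-dimensional BLOCK and CYCLIC distributions. It prints the 0-based owning processor of each of 16 elements on 4 processors under each distribution.

      PROGRAM OWNERS
! Owner formulas (assumed): BLOCK gives each processor a contiguous
! chunk of CEILING(N/P) elements; CYCLIC deals elements out round-robin.
      INTEGER, PARAMETER :: N = 16, P = 4
      INTEGER I, IBLK, ICYC
      DO I = 1, N
        IBLK = (I-1) / ((N+P-1)/P)
        ICYC = MOD(I-1, P)
        PRINT *, I, IBLK, ICYC    ! element, block owner, cyclic owner
      END DO
      END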

Figure 5-2 shows block and cyclic distributions for a one-dimensional array. Depending on the computation performed, different data distributions may be advantageous. For this computation, a CYCLIC distribution would involve communication for each element computed.

		REAL X(15), Y(16)
!HPF$	DISTRIBUTE Y(CYCLIC)
!HPF$	ALIGN X(I) WITH Y(I+1)
		FORALL(I=1:15) X(I)=Y(I+1)
In the next example, a similar distribution partitions the computation over the available processors (for this example, assume two processors, called processor one and processor two). Because of the alignment specified by these ALIGN and DISTRIBUTE directives, the computation involves communication: when I is 8, the value of Y(9) must be communicated to processor one so it can be assigned to X(8), since X(8) is stored on processor one and Y(9) is stored on processor two.
		REAL X(15), Y(16)
!HPF$	DISTRIBUTE Y(BLOCK)
!HPF$	ALIGN X(I) WITH Y(I)
		FORALL(I=1:15) X(I)=Y(I+1)
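
With the assumption used in the text above (16 elements distributed over two processors), the block layout is:

        processor one:  Y(1:8),  X(1:8)
        processor two:  Y(9:16), X(9:15)

Only the iteration I=8 references an off-processor value, Y(9); every other iteration is satisfied locally.
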
The following example shows an erroneous distribution that programmers should avoid. According to the HPF specification, the ALIGN dummy index variable (I in this example) must produce a valid alignment for every possible subscript of the aligned array (X in this example). When I ranges over all possible subscripts of X, 1 to 16 in this example, the alignment refers to Y(16+1), which does not exist. This error produces a runtime error.
		REAL X(16), Y(16)
!HPF$	DISTRIBUTE Y(BLOCK)
!HPF$	ALIGN X(I) WITH Y(I+1)
		FORALL(I=1:15) X(I)=Y(I+1)
This code produces the following runtime error message:
0: __hpf_setaln: invalid alignment
1: __hpf_setaln: invalid alignment
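
One way to correct the example, shown in the following sketch, is simply to enlarge Y so that Y(I+1) exists for every possible subscript of X (declaring X(15) instead would also make the alignment valid):

        REAL X(16), Y(17)
!HPF$   DISTRIBUTE Y(BLOCK)
!HPF$   ALIGN X(I) WITH Y(I+1)
        FORALL(I=1:15) X(I)=Y(I+1)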
For more details on different data distributions and examples showing more HPF data mapping directives, refer to Chapter 4 of The High Performance Fortran Handbook.

Figure 5-2 Block and Cyclic Distribution

Distributing Allocatable Arrays

Allocatable arrays can be distributed in a manner similar to standard arrays (arrays without the ALLOCATABLE attribute). The directives that determine the distribution and alignment of an allocatable array are evaluated on entry to the array's scoping unit and are used throughout that scoping unit when the array is created, although the array may later be realigned or redistributed.

When using allocatable arrays, it is important to keep in mind that an object can be aligned only with an object that already exists. Thus, in the following example, the order of the ALLOCATE statements is correct; the reverse ordering, with B allocated before A, would produce a runtime alignment error.

        REAL, ALLOCATABLE:: A(:), B(:)
!HPF$   ALIGN B(I) WITH A(I)
!HPF$   DISTRIBUTE A(BLOCK)
        ALLOCATE (A(16))
        ALLOCATE (B(16))
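
For contrast, the incorrect ordering described above would look like the following sketch; the error is reported at the first ALLOCATE, because B is aligned with A and A does not yet exist:

        ALLOCATE (B(16))   ! runtime alignment error: A is not yet allocated
        ALLOCATE (A(16))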

Distribution of Procedure Arguments

The distribution of procedure arguments is described in detail in Chapter 5 of The High Performance Fortran Handbook. An important principle in HPF is that the alignment an argument has when a procedure is called is restored when the procedure returns, regardless of how the argument is distributed within the procedure. Thus, the compiler may need to redistribute the variable on entry to the procedure and again on exit.
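
The following sketch illustrates this principle; the subroutine, array size, and distributions are illustrative assumptions, not taken from the text. The argument A is BLOCK-distributed in the caller, while the dummy X is declared CYCLIC, so the argument is remapped on entry to SUB and its original mapping is restored on return (the explicit interface is required because the call may remap the argument):

        PROGRAM CALLER
        INTERFACE
          SUBROUTINE SUB(X)
            REAL X(100)
!HPF$       DISTRIBUTE X(CYCLIC)
          END SUBROUTINE SUB
        END INTERFACE
        REAL A(100)
!HPF$   DISTRIBUTE A(BLOCK)
        A = 0.0
        CALL SUB(A)     ! A is remapped to CYCLIC for the call and is
                        ! BLOCK-distributed again after SUB returns
        END

        SUBROUTINE SUB(X)
        REAL X(100)
!HPF$   DISTRIBUTE X(CYCLIC)
        X = X + 1.0
        END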

Distribution of Compiler Created Temporaries

The pghpf compiler creates a distribution for each compiler-created temporary variable; temporaries are distributed according to their required usage. The compiler creates temporaries for several reasons, for example to hold the intermediate results of array expressions and of transformational intrinsics (see the example below). Temporaries and user variables are distributed in exactly the same way, and the use of temporaries is transparent from the HPF programmer's point of view (although the temporaries are visible in the Fortran 77 intermediate code).

The algorithm pghpf uses to determine the distribution of temporaries takes into account the statement in which the temporary is used. Temporaries are allocated before the statement in which they are used and deallocated immediately after that statement. For example, the following array assignment:

INTEGER, DIMENSION(1000)      :: A, D
INTEGER, DIMENSION(1000,1000) :: B, C
A = SUM(B,DIM=1) + MATMUL(C,D)
would generate intermediate code using temporary arrays.
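
Conceptually, and only as a sketch (this is not the actual intermediate code pghpf produces; tmp1 and tmp2 are hypothetical names), the statement might be decomposed as follows, with each temporary distributed to match its use and freed immediately after the statement:

        INTEGER, DIMENSION(1000) :: tmp1, tmp2
        tmp1 = SUM(B, DIM=1)    ! temporary holding the intrinsic result
        tmp2 = MATMUL(C, D)     ! temporary holding the intrinsic result
        A = tmp1 + tmp2         ! temporaries deallocated after this statement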

For this class of temporaries, distribution is based on how the temporary is used. If a temporary is used as an argument to an intrinsic, the compiler tries to determine its distribution from the other arguments of the intrinsic; otherwise, it tries to assign a distribution based on the value assigned to the temporary; if neither applies, the temporary is replicated across all processors.

Numerous factors, including array alignment, array distribution, array subsection usage, and argument usage, need to be taken into account when determining a temporary's distribution. For example, consider the following:

A(1:m:3) = SUM(B(1:n:2,:) + C(:,1:n:4), dim = 2)
The section of A is passed directly to the SUM intrinsic to receive the result, but a temporary is needed to hold the argument to SUM, that is, the value of B(1:n:2,:) + C(:,1:n:4). The distribution of that temporary reflects two possibly conflicting goals: minimizing communication in the B+C expression, and minimizing communication in the SUM computation and in the assignment to A.

5.2.2 Computation Partitioning

Computations are partitioned when pghpf applies the owner-computes rule. Under this rule, the computation is partitioned according to the distribution of the data being assigned: each processor computes the values assigned to the elements it owns, so iterations are localized based on the left-hand side (lhs) of an array assignment statement.

The bounds of a FORALL statement are localized according to the array elements owned by the left-hand side.

For BLOCK partitioned dimensions, the loop bounds are adjusted to index the slice of data owned by the current processor.
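
For example, a statement such as FORALL (I=1:N) X(I) = Y(I), with X and Y identically aligned and BLOCK-distributed, might be localized as in the following conceptual sketch (this is not actual generated code; MY_LO and MY_HI stand for the local bounds the runtime computes from the distribution):

        DO I = MAX(1, MY_LO), MIN(N, MY_HI)   ! only the locally owned slice
          X(I) = Y(I)
        END DO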

For CYCLIC partitioning, two loops are required. The outer loop iterates over the cycles of the data, and the inner loop iterates over the data items in the cycle.
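
As a conceptual sketch of the same assignment under a CYCLIC(K) distribution over P processors (again not actual generated code; MY_ID is an assumed 0-based processor number, and plain CYCLIC corresponds to K = 1):

        DO CYC = 1 + MY_ID*K, N, P*K          ! outer loop: one iteration per local cycle
          DO I = CYC, MIN(CYC+K-1, N)         ! inner loop: data items within the cycle
            X(I) = Y(I)
          END DO
        END DO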