D.3.1. FMS NUMA Parameter NUMAFL

Data Type

Integer

Default Value

Description

This is the master flag for activating all the Non-Uniform Memory Access (NUMA) directives. You must set this parameter in the FMS License File to activate any of the NUMA technology. The following values are available:

0, Do not use NUMA technology.
1, Use NUMA technology for large problems.
2, Use NUMA technology for all problems. This option uses NUMA technology even when the problem is so small that it uses fewer processors than are available. It is mostly useful for debugging.

The NUMA directives discussed below are a powerful set of programming tools for achieving maximum performance. These are the directives which FMS uses internally to achieve peak performance. However, you may also use these directives in your own application.

Many of the NUMA features used by FMS work best when the machine is dedicated to FMS applications. This may be a single application or multiple applications where each one is explicitly placed. Running more than one application on a group of processors, or letting the operating system schedule other tasks on top of an FMS NUMA job, may significantly degrade performance.

The fundamental building block of a NUMA machine is the "node". Within the node, all processing cores share a common memory with high-speed uniform access. A node may be a single multi-core processor or multiple processors, each having multiple cores. Programming at this level is similar to Symmetric MultiProcessor (SMP) programming. The FMS Parameter NPNODE defines the number of processors per node. It defaults to a value appropriate for your machine.

The number of nodes, NUMNOD, is obtained by dividing the total number of processing cores being used, MAXCPU, by the number of processors per node, NPNODE. FMS computes this parameter automatically.
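
The relationship among these three parameters can be illustrated with a short sketch. This is not FMS code: the function name is hypothetical, and the requirement that MAXCPU be an exact multiple of NPNODE is an assumption made for the sketch, since this section does not say how a remainder is handled.

```python
def number_of_nodes(maxcpu, npnode):
    """Return NUMNOD = MAXCPU / NPNODE.

    Assumption: MAXCPU is an exact multiple of NPNODE. The handling of a
    remainder is not described in this section, so the sketch rejects it.
    """
    if maxcpu <= 0 or npnode <= 0:
        raise ValueError("MAXCPU and NPNODE must be positive")
    if maxcpu % npnode != 0:
        raise ValueError("MAXCPU is not a multiple of NPNODE")
    return maxcpu // npnode


# 16 processing cores in use, 4 per node
print(number_of_nodes(maxcpu=16, npnode=4))  # prints 4
```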

In order to achieve peak performance on a NUMA machine, the software must maximize references to local memory and minimize references to remote memory. This requires close coordination between which processor runs the thread and where the memory is placed. In addition, the software must be designed to distribute the data in a fashion similar to programming techniques used on distributed memory computers.

To achieve peak performance, FMS implements the following features:

Memory Placement

When FMS places memory using NUMA directives, the requested memory is dealt out round robin among the nodes using a stride that is specified by the FMS Parameter MAXLMD. For example, the records used to hold matrix data use a stride that evenly distributes the record among the number of nodes being used for the problem. If you allocate memory for your application using one of the FMS memory management routines FMSIMG, FMSRMG or FMSCMG, and make the call from the parent, your memory will be distributed according to the specified value of MAXLMD. If you call the FMS memory management routines to allocate memory from a child thread, all the memory requested will be allocated on the node where that thread is running. You may also allocate memory and place it on each node with a single call from the parent using FMSILG, FMSRLG or FMSCLG. This provides a simple, yet effective way for you to control where your data resides on the NUMA machine.
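
The round-robin placement described above can be pictured with the following sketch. It is a model only and does not call FMS. The function names are hypothetical, nodes are numbered from 0 for convenience, and the stride is assumed to be expressed in the same units as the allocation size; this section does not state the units of MAXLMD.

```python
def node_for_offset(offset, stride, numnod):
    """Return the 0-based node holding the given offset when blocks of
    `stride` units are dealt out round robin among `numnod` nodes."""
    return (offset // stride) % numnod


def node_layout(total, stride, numnod):
    """Return a list giving the amount of memory placed on each node."""
    counts = [0] * numnod
    for start in range(0, total, stride):
        chunk = min(stride, total - start)  # the last block may be short
        counts[node_for_offset(start, stride, numnod)] += chunk
    return counts


# A small stride wraps around the nodes several times.
print(node_layout(total=1000, stride=100, numnod=4))  # [300, 300, 200, 200]

# A stride of total/numnod spreads one record evenly, one block per node.
print(node_layout(total=1000, stride=250, numnod=4))  # [250, 250, 250, 250]
```

The second call corresponds to the matrix record case, where the stride is chosen so that each node receives an equal share of the record.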

For some applications, a single distribution of data will be optimal for the entire job. Most FMS applications, however, go through different phases, with each phase requiring a different data distribution. For example, an application may form matrix data, factor, solve and process results as distinct phases. FMS includes the Parameter MDWHEN that controls when the memory is placed. When the NUMA flag is set, memory is placed as it is required. When the NUMA flag is not set, all the memory is allocated once at the beginning.

Thread Binding

FMS includes options for attaching each thread to an individual processing core or any processing core on the node. The NUMAFX Parameter controls these options. When the NUMA flag is set, this parameter automatically defaults to the optimum value for each machine.

The threads are placed in increasing order, starting on processing core MYCPU1 (1 is the first CPU core in the machine), and continuing for MAXCPU processing cores. You may obtain the thread number MYCPU of a subroutine you have running in parallel by calling FMSIGT('MYNODE', MYCPU). Knowing the thread number and how the memory was distributed, you can determine what calculations should be performed on the local data.
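
As a rough sketch of this placement, the following maps a thread number to a processing core and a node. It does not call FMS. The function name is hypothetical, and two assumptions are made that this section does not confirm: thread numbers start at 1, and nodes are consecutive groups of NPNODE cores beginning with core 1.

```python
def thread_placement(thread, mycpu1, npnode):
    """Return (core, node) for a thread, both numbered from 1.

    Assumptions: `thread` is 1-based, and node k owns cores
    (k - 1) * npnode + 1 through k * npnode.
    """
    core = mycpu1 + thread - 1        # threads occupy increasing cores
    node = (core - 1) // npnode + 1   # group cores into nodes
    return core, node


# MAXCPU = 8 threads, starting on core MYCPU1 = 5, with NPNODE = 4
for thread in range(1, 9):
    core, node = thread_placement(thread, mycpu1=5, npnode=4)
    print(thread, core, node)
```

With these values, threads 1 through 4 land on cores 5 through 8 (node 2) and threads 5 through 8 land on cores 9 through 12 (node 3).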

One simple case is filling a matrix. First, you call the routine that fills the matrix in parallel using FMSPAR. At the beginning of the subroutine, you determine which thread you are by calling FMSIGT('MYNODE', MYCPU). Next, you call one or more of the FMS memory management routines FMSIMG, FMSRMG or FMSCMG to allocate memory for your data. Because these routines are being called from a thread, they will automatically allocate the memory on the local node. After performing your part of the computation, you return the memory and return from your subroutine.
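
The overall shape of this pattern is sketched below. It is a structural illustration only: it uses a standard thread pool in place of FMSPAR, passes the thread number as an argument in place of the FMSIGT call, and uses an ordinary list in place of memory obtained from the FMS memory management routines. All function names are hypothetical, the row partitioning is one possible choice, and plain Python threads provide no NUMA locality.

```python
from concurrent.futures import ThreadPoolExecutor


def row_range(thread, nthreads, nrows):
    """Return the contiguous block of rows owned by a 1-based thread."""
    base, extra = divmod(nrows, nthreads)
    start = (thread - 1) * base + min(thread - 1, extra)
    count = base + (1 if thread <= extra else 0)
    return range(start, start + count)


def fill_rows(thread, nthreads, nrows, ncols):
    """Work done by one thread: allocate, fill, compute, release."""
    rows = row_range(thread, nthreads, nrows)
    # Stand-in for a thread-local allocation: a buffer for this thread's rows.
    local = [[float(i * ncols + j) for j in range(ncols)] for i in rows]
    # This thread's part of the computation (here, a simple sum).
    partial = sum(sum(row) for row in local)
    # The local buffer is released when the function returns.
    return partial


def parallel_fill(nthreads, nrows, ncols):
    """Run fill_rows on every thread and combine the partial results."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [pool.submit(fill_rows, t, nthreads, nrows, ncols)
                   for t in range(1, nthreads + 1)]
        return sum(f.result() for f in futures)


print(parallel_fill(nthreads=4, nrows=10, ncols=3))  # prints 435.0
```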