Check the OpenMP version
#include <stdio.h>

int main() {
#ifdef _OPENMP
    printf("OpenMP version: %d\n", _OPENMP);
#else
    printf("OpenMP is not supported.\n");
#endif
    return 0;
}
If your system supports OpenMP, you will see output similar to:
OpenMP version: 201511
“201511” means that this OpenMP specification was released in November 2015.
Next, check the official OpenMP website for the correspondence between versions and release dates; you will find that OpenMP 4.5 was released in November 2015.
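To avoid looking the date up each time, a small helper can translate the macro directly; this is a sketch based on the release dates I am aware of (openmp_version is my own illustrative function, not a standard API):

#include <stdio.h>

// Map the _OPENMP date macro (format yyyymm) to a specification version.
const char *openmp_version(int date) {
    switch (date) {
        case 200505: return "2.5";
        case 200805: return "3.0";
        case 201107: return "3.1";
        case 201307: return "4.0";
        case 201511: return "4.5";
        case 201811: return "5.0";
        case 202011: return "5.1";
        case 202111: return "5.2";
        default:     return "unknown";
    }
}

int main() {
#ifdef _OPENMP
    printf("OpenMP %s (_OPENMP = %d)\n", openmp_version(_OPENMP), _OPENMP);
#else
    printf("OpenMP is not supported.\n");
#endif
    return 0;
}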
Key concepts in OpenMP
construct: construct = pragma + structured block
region: the code inside the structured block
The end of a region has an implicit barrier, i.e. every thread must have finished executing the code in the construct before any thread moves on
clause: behavior that defines or modifies the construct (see the annotated sketch after this list)
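To make the three terms concrete, here is a minimal annotated sketch (the num_threads(4) choice is purely illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    // construct = the pragma + the structured block that follows it;
    // num_threads(4) is a clause modifying the construct's behavior
    #pragma omp parallel num_threads(4)
    {
        // region: the code executed inside the structured block
        printf("hello from thread %d\n", omp_get_thread_num());
    }   // implicit barrier: every thread finishes before execution continues
    return 0;
}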
Understand the pragma
This directive is the most important element of OpenMP.
On the one hand, it is called a compiler directive, because such commands instruct the compiler to perform particular operations while compiling.
On the other hand, it is similar to the #define command, which is called a preprocessor directive.
Distinguishing parallel and for
#include <iostream>
#include <omp.h>

int main() {
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        for (int i = 0; i < 3; i++) {
            int thread_id = omp_get_thread_num();
            std::cout << "thread_id: " << thread_id << " , i: " << i << std::endl;
        }
    }
    #pragma omp barrier
    std::cout << std::endl;

    // equivalent to #pragma omp parallel for
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 10; i++) {
            int thread_id = omp_get_thread_num();
            std::cout << "thread_id: " << thread_id << " , i: " << i << std::endl;
        }
        #pragma omp barrier
    }
    return 0;
}
The output is:
thread_id: 1 , i: 0
thread_id: 1 , i: 1
thread_id: 1 , i: 2
thread_id: 0 , i: 0
thread_id: 0 , i: 1
thread_id: 0 , i: 2
thread_id: 0 , i: 0
thread_id: 0 , i: 1
thread_id: 0 , i: 2
thread_id: 0 , i: 3
thread_id: 0 , i: 4
thread_id: 1 , i: 5
thread_id: 1 , i: 6
thread_id: 1 , i: 7
thread_id: 1 , i: 8
thread_id: 1 , i: 9
As we can see, the code inside parallel is executed in full by every thread, while the iterations of for are divided among the threads.
Common pitfall
Testing shows that braces cannot be added arbitrarily: if a pair of braces is added for #pragma omp for ordered in the code below, execution fails with the error “‘ordered’ region must be closely nested inside a loop region with an ‘ordered’ clause”.
#pragma omp for ordered
for (int i = 0; i < 100; i++) {
    #pragma omp ordered
    {
        sum += thread_id;
    }
}
Construct
sections
The parallelism achieved with only #pragma omp parallel is effectively SIMD: in the region specified by parallel, every thread sees and executes the same code, and only the data being computed differs, selected via the thread id.
However, parallelism is not limited to SIMD; it can also be realized as MIMD. This is exactly what sections accomplishes: the code in each section is executed by one thread (note that this does not mean two sections are necessarily executed by different threads; the actual assignment may depend on the thread scheduling policy), and since the code in each section can be different, this realizes MIMD.
This claim can be verified with the following program:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

int main() {
    #pragma omp parallel
    {
        int threadid = omp_get_thread_num();
        printf("thread %d\n", threadid);
        #pragma omp sections
        {
            #pragma omp section
            {
                printf("section 0 - thread %d\n", threadid);
                sleep(1);
            }
            #pragma omp section
            {
                printf("section 1 - thread %d\n", threadid);
                sleep(1);
            }
            #pragma omp section
            {
                printf("section 2 - thread %d\n", threadid);
                sleep(1);
            }
        }
    }
    return 0;
}
Synchronization
barrier
There is an implicit barrier at the end of parallel, for, sections, and single. For the worksharing constructs (for, sections, single) the nowait clause removes this implicit barrier, e.g. #pragma omp for nowait.
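A minimal sketch of nowait (the array size N and the loop bodies are my own illustration): dropping the barrier after the first loop is safe here only because both loops use the same static schedule, so each thread reads exactly the a[i] values it wrote itself.

#include <stdio.h>
#include <omp.h>

#define N 8

int main() {
    int a[N], b[N];
    #pragma omp parallel
    {
        // nowait removes the implicit barrier at the end of this loop,
        // so a thread that finishes early can start the next loop at once
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++)
            a[i] = i;
        // same static schedule => each thread gets the same iterations,
        // so b[i] only depends on an a[i] written by the same thread
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = a[i] * 2;
    }
    for (int i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}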
ordered
A precondition for using the ordered construct is that the enclosing parallel loop carries the ordered clause, i.e. #pragma omp for ordered and #pragma omp ordered below: the former is a clause, the latter is a construct.
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel for ordered
    for (int i = 0; i < 10; ++i) {
        // Some parallel work
        int tid = omp_get_thread_num();
        #pragma omp ordered
        {
            printf("Thread %d processed iteration %d\n", tid, i);
        }
    }
    return 0;
}
In the program above, the loop will execute in iteration order.
int sum_serial = 0;
#pragma omp parallel num_threads(100)
{
    int thread_id = omp_get_thread_num();
    #pragma omp for ordered
    for (int i = 0; i < 100; i++) {
        #pragma omp ordered
        {
            sum_serial += thread_id;
        }
    }
}
std::cout << sum_serial << std::endl;

int sum_parallel = 0;
#pragma omp parallel num_threads(100)
{
    int thread_id = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 100; i++) {
        sum_parallel += thread_id;
    }
}
std::cout << sum_parallel << std::endl;
The scenario to implement is: within a large parallel region, we want a small piece of code to execute serially, strictly in order of thread number. Besides ordered, this can also be implemented simply with an incrementing counter and a lock, as sketched after this discussion.
The code above gives two implementations. In theory, the second one may have multiple threads modify sum_parallel at the same time and thus produce a wrong result. The first one does test as correct, though the two seemingly repetitive ordered (#pragma omp for ordered and #pragma omp ordered) are confusing at first: the clause on the loop enables ordering, while the construct marks the exact block that must run in order.
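For completeness, a minimal sketch of the counter-plus-lock idea mentioned above (the turn counter, the lock, and num_threads(4) are my own illustration, not from the original): each thread repeatedly takes the lock and checks whether the shared counter has reached its own id; only then does it run its serial part and advance the counter.

#include <stdio.h>
#include <omp.h>

int main() {
    int sum = 0;
    int turn = 0;          // shared counter: id of the thread whose turn it is
    omp_lock_t lock;
    omp_init_lock(&lock);
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();
        int done = 0;
        while (!done) {    // busy-wait until it is this thread's turn
            omp_set_lock(&lock);
            if (turn == thread_id) {
                sum += thread_id;   // the serial part, strictly in thread order
                turn++;             // hand over to the next thread
                done = 1;
            }
            omp_unset_lock(&lock);
        }
    }
    omp_destroy_lock(&lock);
    printf("sum = %d\n", sum);      // 0 + 1 + 2 + 3 = 6
    return 0;
}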
critical
critical protects a code section so that at any time it can be executed by only one thread, thereby avoiding data races. While one thread is executing a region specified by critical, other threads that want to execute the same region must wait.
#include <stdio.h>
#include <omp.h>

int main() {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        #pragma omp critical
        {
            sum += i;
            printf("Thread %d added %d, sum now is %d\n", omp_get_thread_num(), i, sum);
        }
    }
    printf("Final sum is %d\n", sum);
    return 0;
}
atomic
#include <stdio.h>
#include <omp.h>

int main() {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        #pragma omp atomic
        sum += i;
        // Note: printf is not atomic, so the output may still be interleaved
        printf("Thread %d added %d, sum now is %d\n", omp_get_thread_num(), i, sum);
    }
    printf("Final sum is %d\n", sum);
    return 0;
}
The logic of this program is the same as that of the critical example; atomic is typically cheaper, but it applies only to simple update expressions such as sum += i, whereas critical can protect an arbitrary block.
task
The common execution pattern of task is: one thread creates the tasks, and then each thread takes charge of a task; which task a given thread executes depends on the scheduling policy in the actual scenario.
This construct effectively creates a task pool; threads are the computing resources, picking tasks from the pool to execute.
Note that, as actual tests show, the thread that initially created the tasks may also end up executing one of the tasks itself.
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("Thread %d: Creating tasks\n", omp_get_thread_num());
            #pragma omp task
            {
                printf("Thread %d: Executing task 1\n", omp_get_thread_num());
            }
            #pragma omp task
            {
                printf("Thread %d: Executing task 2\n", omp_get_thread_num());
            }
        }
    }
    return 0;
}
When the number of threads is set to 4, the output of the above program is:
Thread 0: Creating tasks
Thread 2: Executing task 1
Thread 3: Executing task 2
Clause
Data-Sharing Attributes
shared: different threads share the same data
private: each thread has its own copy of the data
firstprivate(): like private, but each thread's copy is initialized with the value the variable had before entering the construct
lastprivate(): like private, but the value from the sequentially last iteration (or the last section) is copied back to the original variable (see the sketch below)
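A minimal sketch contrasting firstprivate and lastprivate (the variables x and last are my own illustration): every thread's private x starts from the original value 10, and last receives the value written by the sequentially final iteration.

#include <stdio.h>
#include <omp.h>

int main() {
    int x = 10;      // copied into each thread's private x by firstprivate
    int last = -1;   // will receive the value from the last iteration
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < 4; i++) {
        x += i;      // modifies the thread-private copy only
        last = i;    // lastprivate copies back the value of iteration i == 3
    }
    printf("x = %d, last = %d\n", x, last);  // prints x = 10, last = 3
    return 0;
}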
How to handle arrays?
We can use the array section syntax, for example:
[lower bound : length : stride]
[lower bound : length]
[: length : stride]
(To be honest, I am puzzled by this last form, which does not specify the start position.)
#pragma omp parallel firstprivate(vptr[0:1000:1])