Seven Concurrency Models - Data Parallelism

Seven Concurrency Models
Ch7. Data Parallelism
아꿈사
Cecil

SIMD (Single Instruction, Multiple Data)
Performs the same operation on a large amount of data in parallel
e.g., image processing
A GPU is a powerful data-parallel processor

GPGPU Programming
GPUs outperform CPUs at number crunching
However, implementation details differ from one GPU to another
An abstraction library is used to hide these differences
(e.g., OpenCL)

Ways to Implement Data Parallelism
Pipelining, multiple ALUs (arithmetic logic units) …

Pipelining example: multiplying two numbers
A multiplication takes several steps, and there are many numbers to multiply
e.g., 1,000 multiplications, each taking 5 steps
Non-pipelined: 5,000 cycles; pipelined: about 1,005 cycles
[Figure: the pairs a1*b1 through a9*b9 moving through a five-stage multiplication pipeline]
So multiplying a thousand pairs of numbers takes a whisker over a thousand clock cycles, not the five thousand we might expect from the fact that multiplying a single pair takes five clock cycles.
The component within a CPU that performs operations such as multiplication is commonly known as the arithmetic logic unit, or ALU.
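The cycle counts in the pipelining example can be checked with a small sketch. It assumes an idealized pipeline with no stalls, where the first result appears after all stages have run and one further result completes on each following cycle; the function names are hypothetical. Under that assumption 1,000 five-stage multiplications take 1,004 cycles, which matches the slide's figure of roughly a thousand to within a cycle.

```c
#include <stddef.h>

/* Cycles needed when each operation must finish all of its stages
   before the next operation may start. */
size_t sequential_cycles(size_t operations, size_t stages) {
  return operations * stages;
}

/* Cycles needed with an idealized pipeline (assumption: no stalls).
   The first result takes `stages` cycles; every later result
   completes one cycle after the previous one. */
size_t pipelined_cycles(size_t operations, size_t stages) {
  if (operations == 0) return 0;
  return stages + (operations - 1);
}
```

For example, `sequential_cycles(1000, 5)` evaluates to 5000 and `pipelined_cycles(1000, 5)` to 1004.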

Multiple-ALU example: multiplying two numbers
Connecting multiple ALUs to a wide memory bus parallelizes operations applied to large amounts of data
[Figure 12: Large Amounts of Data Parallelized with Multiple ALUs. From the 16-element arrays a and b, the elements a9 to a12 and b9 to b12 are fetched together and multiplied in parallel to produce c9 to c12.]
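The multiple-ALU idea can be imitated on a single core as a rough sketch. The lane count `LANES` and the function name are assumptions made for illustration; real hardware would perform the multiplications of one step at the same time, whereas this code only counts how many such steps would be needed.

```c
#include <stddef.h>

#define LANES 4 /* hypothetical number of ALUs working side by side */

/* Computes c[i] = a[i] * b[i] for all i, handling LANES elements per step.
   Returns the number of steps taken. */
size_t multiply_in_lanes(const float *a, const float *b, float *c, size_t n) {
  size_t steps = 0;
  for (size_t base = 0; base < n; base += LANES) {
    /* In hardware these multiplications would happen simultaneously;
       the inner loop merely imitates that sequentially. */
    for (size_t lane = 0; lane < LANES && base + lane < n; ++lane)
      c[base + lane] = a[base + lane] * b[base + lane];
    ++steps;
  }
  return steps;
}
```

With 16 elements and 4 lanes, the function returns 4, one step per row of the figure.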
manufacturer then provides its own compilers and drivers that allow that
program to be compiled and run on its hardware.
Our First OpenCL Program
To parallelize our array multiplication task with OpenCL, we need to divide
it up into work-items that will then be executed in parallel.
Work-Items
If you’re used to writing parallel code, you will be used to worrying about the

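Steps for Writing an OpenCL Program
1. Write the kernel (which describes how each work-item is processed)
2. Create the context in which the kernel runs, together with a command queue
3. Compile the kernel
4. Create buffers for the input and output data
5. Enqueue a command that runs the kernel once for each work-item
6. Read back the results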

First OpenCL program
(array multiplication)
[Figure 13: Work Items for Pairwise Multiplication. Work-items 0 through 1023 each read one element of inputA and inputB and write one element of output.]

__kernel void multiply_arrays(__global const float* inputA,
                              __global const float* inputB,
                              __global float* output) {
  int i = get_global_id(0);  // index of this work-item
  output[i] = inputA[i] * inputB[i];
}
Kernel for parallel processing
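To see what running the kernel once per work-item amounts to, here is a plain-C stand-in that needs no OpenCL runtime. It is only a sketch of the execution model: the function names are hypothetical, the loop runs sequentially, and a real device is free to execute the work-items in any order and in parallel.

```c
#include <stddef.h>

/* Plain-C stand-in for the kernel body: `gid` plays the role of
   get_global_id(0). */
void multiply_arrays_item(size_t gid, const float *inputA,
                          const float *inputB, float *output) {
  output[gid] = inputA[gid] * inputB[gid];
}

/* Stand-in for a one-dimensional launch with `global_size` work-items:
   the kernel body is invoked once for every global ID. */
void run_multiply_arrays(size_t global_size, const float *inputA,
                         const float *inputB, float *output) {
  for (size_t gid = 0; gid < global_size; ++gid)
    multiply_arrays_item(gid, inputA, inputB, output);
}
```

Because each work-item writes only its own element of `output`, the work-items are independent, which is what makes the task easy to parallelize.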

// Create the context
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
// Create the command queue
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
// Compile the kernel (read_source is a helper, not shown here, that loads the file into a heap-allocated string)
char* source = read_source("multiply_arrays.cl");
cl_program program = clCreateProgramWithSource(context, 1,
    (const char**)&source, NULL, NULL);
free(source);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "multiply_arrays", NULL);

// Create the buffers
#define NUM_ELEMENTS 1024
cl_float a[NUM_ELEMENTS], b[NUM_ELEMENTS];
random_fill(a, NUM_ELEMENTS);
random_fill(b, NUM_ELEMENTS);
cl_mem inputA = clCreateBuffer(context, CL_MEM_READ_ONLY |
    CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * NUM_ELEMENTS, a, NULL);
cl_mem inputB = clCreateBuffer(context, CL_MEM_READ_ONLY |
    CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * NUM_ELEMENTS, b, NULL);
cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    sizeof(cl_float) * NUM_ELEMENTS, NULL, NULL);
// Fill an array with random values in [0, 1]
void random_fill(cl_float array[], size_t size) {
  for (size_t i = 0; i < size; ++i)
    array[i] = (cl_float)rand() / RAND_MAX;
}

// Execute the work-items
clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &inputB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);
size_t work_units = NUM_ELEMENTS;
clEnqueueNDRangeKernel(queue, kernel, 1 /* dimensions */, NULL, &work_units, NULL,
                       0, NULL, NULL);
// Read the results (CL_TRUE makes the read blocking)
cl_float results[NUM_ELEMENTS];
clEnqueueReadBuffer(queue, output, CL_TRUE, 0, sizeof(cl_float)
                    * NUM_ELEMENTS, results, 0, NULL, NULL);
// Clean up
clReleaseMemObject(inputA);
clReleaseMemObject(inputB);
clReleaseMemObject(output);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);

Performance comparison with CPU code
// Sequential version used for comparison
for (int i = 0; i < NUM_ELEMENTS; ++i)
  results[i] = a[i] * b[i];
Results on the author's MacBook Pro (100,000 multiplications)
GPU program: 43,000 ns
CPU program: 400,000 ns
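Taking the two quoted timings at face value, the speedup can be worked out as a simple ratio. This is only arithmetic on the numbers above; nothing is assumed about what the measurements include.

```latex
\text{speedup} = \frac{t_{\text{CPU}}}{t_{\text{GPU}}}
               = \frac{400{,}000\ \text{ns}}{43{,}000\ \text{ns}}
               \approx 9.3
```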

OpenCL
Multiple dimensions and work-groups

Multidimensional example: matrix multiplication
// Sequential version
#define WIDTH_OUTPUT WIDTH_B
#define HEIGHT_OUTPUT HEIGHT_A
float a[HEIGHT_A][WIDTH_A] = «initialize a»;
float b[HEIGHT_B][WIDTH_B] = «initialize b»;
float r[HEIGHT_OUTPUT][WIDTH_OUTPUT];
for (int j = 0; j < HEIGHT_OUTPUT; ++j) {
  for (int i = 0; i < WIDTH_OUTPUT; ++i) {
    float sum = 0.0f;
    for (int k = 0; k < WIDTH_A; ++k) {
      sum += a[j][k] * b[k][i];  // row j of a times column i of b
    }
    r[j][i] = sum;
  }
}

__kernel void matrix_multiplication(uint widthA, __global const float* inputA,
                                    __global const float* inputB, __global float* output) {
  int i = get_global_id(0);  // column of the output element
  int j = get_global_id(1);  // row of the output element
  int outputWidth = get_global_size(0);
  int outputHeight = get_global_size(1);
  int widthB = outputWidth;
  float total = 0.0f;
  for (uint k = 0; k < widthA; ++k) {
    total += inputA[j * widthA + k] * inputB[k * widthB + i];
  }
  output[j * outputWidth + i] = total;
}
Kernel for parallel matrix multiplication
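The flattened indexing used by the matrix kernel can be tried out without a GPU. The following plain-C sketch imitates a two-dimensional launch with nested loops; the function names are hypothetical, and the parameters `i`, `j`, and `output_width` stand in for `get_global_id(0)`, `get_global_id(1)`, and `get_global_size(0)`.

```c
#include <stddef.h>

/* Plain-C stand-in for one work-item of the matrix kernel.
   Matrices are stored row by row in one-dimensional arrays. */
void matrix_multiplication_item(size_t i, size_t j, size_t output_width,
                                size_t widthA, const float *inputA,
                                const float *inputB, float *output) {
  size_t widthB = output_width;
  float total = 0.0f;
  for (size_t k = 0; k < widthA; ++k)
    total += inputA[j * widthA + k] * inputB[k * widthB + i];
  output[j * output_width + i] = total;
}

/* Stand-in for a two-dimensional launch of size
   {output_width, output_height}: one call per output element. */
void run_matrix_multiplication(size_t output_width, size_t output_height,
                               size_t widthA, const float *inputA,
                               const float *inputB, float *output) {
  for (size_t j = 0; j < output_height; ++j)
    for (size_t i = 0; i < output_width; ++i)
      matrix_multiplication_item(i, j, output_width, widthA,
                                 inputA, inputB, output);
}
```

Multiplying the 2-row, 3-column matrix {1, 2, 3, 4, 5, 6} by the 3-row, 2-column matrix {7, 8, 9, 10, 11, 12} this way fills the output with {58, 64, 139, 154}.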

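// Execute the work-items
size_t work_units[] = {WIDTH_OUTPUT, HEIGHT_OUTPUT};
clEnqueueNDRangeKernel(queue, kernel, 2 /* dimensions */, NULL, work_units,
                       NULL, 0, NULL, NULL);
Performance comparison for the matrix operation
Results on the author's MacBook Pro (a 200x400 matrix times a 300x200 matrix; the sizes only fit together when read as width x height)
The GPU program was about 20 times faster than the sequential program.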

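Platform model
[Figure 14: The OpenCL Platform Model. A host is connected to devices; each device contains compute units, and each compute unit contains processing elements.]
Work-item
- Runs on a processing element
Work-group
- A collection of work-items
- Runs on a single compute unit
- Its work-items share local memory

Memory model
A work-item executing a kernel has access to four different memory regions:
1. Global memory: accessible to all work-items on the device
2. Constant memory: a region of global memory that remains constant while a kernel executes
3. Local memory: memory allocated to a work-group
4. Private memory: memory accessible to a single work-item only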

Data-parallel reduce example: min
// Sequential version
cl_float acc = FLT_MAX;
for (int i = 0; i < NUM_VALUES; ++i)
  acc = fmin(acc, values[i]);

// Reduce within a single work-group
__kernel void find_minimum(__global const float* values, __global float* result,
                           __local float* scratch) {
  int i = get_global_id(0);
  int n = get_global_size(0);
  scratch[i] = values[i];
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int j = n / 2; j > 0; j /= 2) {  // perform the reduce
    if (i < j)
      scratch[i] = min(scratch[i], scratch[i + j]);
    barrier(CLK_LOCAL_MEM_FENCE);  // every work-item must finish this step before the next
  }
  if (i == 0)
    *result = scratch[0];
}
// Execution: argument 2 reserves local (scratch) memory for the work-group
clSetKernelArg(kernel, 2, sizeof(cl_float) * NUM_VALUES, NULL);
The kernel works in three steps:
1. It copies the array from global to local (scratch) memory.
2. It performs the reduce (the for loop).
3. It copies the result to global memory (the final if statement).
The reduce operation proceeds by building a reduce tree, much like the one seen with Clojure's reducers.
After each loop iteration, half the work-items become inactive: only work-items for which i < j is true perform any work. This is why the number of elements in the array is assumed to be a power of two, so that the array can be halved repeatedly. The loop exits when only a single work-item remains active.
Each active work-item performs a min() between its own value, scratch[i], and the value at scratch[i + j].
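The halving pattern of the reduce can be followed step by step in plain C. This is a sketch only: the function name is hypothetical, the length is assumed to be a power of two (and at least 1), and the inner loop runs the "active work-items" one after another, whereas on a GPU they would run together with a barrier between steps.

```c
#include <stddef.h>

/* Tree-style minimum over scratch[0..n-1].
   Assumes n is a power of two and n >= 1; scratch is overwritten. */
float reduce_min_tree(float *scratch, size_t n) {
  for (size_t j = n / 2; j > 0; j /= 2) {
    /* Only indices i < j are active in this step. Each one combines its
       own slot with the slot j positions further along. */
    for (size_t i = 0; i < j; ++i)
      if (scratch[i + j] < scratch[i])
        scratch[i] = scratch[i + j];
  }
  return scratch[0];
}
```

Running the steps sequentially gives the same result as running them in parallel here, because within one step an active index `i` writes only `scratch[i]` and reads only slots that no other active index writes.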

// Reduce across multiple work-groups
__kernel void find_minimum(__global const float* values, __global float* results,
                           __local float* scratch) {
  int i = get_local_id(0);
  int n = get_local_size(0);
  scratch[i] = values[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int j = n / 2; j > 0; j /= 2) {
    if (i < j)
      scratch[i] = min(scratch[i], scratch[i + j]);
    barrier(CLK_LOCAL_MEM_FENCE);  // every work-item must finish this step before the next
  }
  if (i == 0)
    results[get_group_id(0)] = scratch[0];  // one partial result per work-group
}
// Execution
size_t work_units[] = {NUM_VALUES};
size_t workgroup_size[] = {WORKGROUP_SIZE};
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, work_units, workgroup_size,
                       0, NULL, NULL);
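The multi-group kernel leaves one partial minimum per work-group in `results`, so something still has to combine them. The plain-C sketch below imitates both stages. The function names are hypothetical, the input length is assumed to be a multiple of the group size, and combining the partial results in a simple host-side loop is an assumption made for illustration rather than something the slides specify.

```c
#include <float.h>
#include <stddef.h>

/* Stage 1 (stand-in for the kernel): each group reduces its own slice of
   `values` and stores one partial minimum in results[group].
   Assumes num_values is a multiple of group_size. */
void group_minima(const float *values, size_t num_values,
                  size_t group_size, float *results) {
  size_t num_groups = num_values / group_size;
  for (size_t group = 0; group < num_groups; ++group) {
    float m = FLT_MAX;
    for (size_t local = 0; local < group_size; ++local) {
      float v = values[group * group_size + local];
      if (v < m) m = v;
    }
    results[group] = m;
  }
}

/* Stage 2 (assumed host-side step): combine the partial minima. */
float final_minimum(const float *results, size_t num_groups) {
  float m = FLT_MAX;
  for (size_t g = 0; g < num_groups; ++g)
    if (results[g] < m) m = results[g];
  return m;
}
```

With eight values and a group size of four, stage 1 produces two partial minima and stage 2 reduces those two to the overall minimum.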
[Figure 15: Extending the Reduce across Multiple Work-Groups]
[Figure 16: The Local ID with a Work-Group. Labels: group id 0 through group id n, local id 0, global id 0, local size, global size.]
The kernel above makes use of local IDs.
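The relationship between the IDs in Figure 16 can be written down directly. This sketch assumes a one-dimensional range with no global offset and a global size that is a multiple of the local size; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* The position of a work-item expressed as (group, position within group). */
typedef struct {
  size_t group_id;
  size_t local_id;
} work_item_ids;

/* Splits a global ID into its group ID and local ID. */
work_item_ids split_global_id(size_t global_id, size_t local_size) {
  work_item_ids ids = { global_id / local_size, global_id % local_size };
  return ids;
}

/* Rebuilds the global ID: global = group * local_size + local. */
size_t join_ids(work_item_ids ids, size_t local_size) {
  return ids.group_id * local_size + ids.local_id;
}
```

For instance, with a local size of 64, global ID 70 belongs to group 1 at local ID 6.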

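Summary
• Strengths
  • Ideal when numerical computation over large amounts of data is needed
  • A GPU is a powerful parallel processor that is also power-efficient
• Weaknesses
  • Can be applied to non-numerical problems, but doing so is difficult
  • Optimizing OpenCL kernels is very hard
  • Copying data between the host program and the device can take a lot of time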

References
• Paul Butcher, 7가지 동시성 모델 (translated by 임백준). Mapo-gu, Seoul: 한빛미디어, 2016.
