Setting Up a PyOpenCL Environment (and More) on a Cluster GPU Instance

Continuing from "Setting Up a PyCUDA Environment on a Cluster GPU Instance", let's also install PyOpenCL and the GPU Computing SDK code samples. First, install PyOpenCL with pip.

CPLUS_INCLUDE_PATH=/usr/local/cuda/include PATH=/opt/local/bin:$PATH pip install pyopencl

That was just a single command (the CPLUS_INCLUDE_PATH setting lets the build find the OpenCL headers that ship with the CUDA Toolkit). Run a sample program to confirm that it works.
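For example, a minimal smoke test like the following (my own sketch, not part of the original post) should list the two Tesla M2050 devices if the installation succeeded:

# check_pyopencl.py -- enumerate OpenCL platforms and devices visible to PyOpenCL
import pyopencl as cl

for platform in cl.get_platforms():
    print("Platform: %s (%s)" % (platform.name, platform.version))
    for device in platform.get_devices():
        print("  Device: %s, %d MB global memory"
              % (device.name, device.global_mem_size // (1024 * 1024)))
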
Next up is the GPU Computing SDK code samples. The preinstalled Toolkit is currently version 3.1, so be aware that the newer 3.2 code samples will not build with make if you download them instead.

# wget http://developer.download.nvidia.com/compute/cuda/3_1/sdk/gpucomputingsdk_3.1_linux.run
# sh gpucomputingsdk_3.1_linux.run

If you have followed exactly the same steps as me up to this point, you can simply press Enter through the questions asked during installation. A few libraries are needed for compilation, so install them with yum.

# yum install libGLU-devel libXi-devel libXmu-devel freeglut-devel

After that, as long as make finishes without errors, you are all set.

# cd NVIDIA_GPU_Computing_SDK/C
# make

Let's run deviceQuery. You should see the information for both of the Tesla M2050 cards that the Cluster GPU Instance carries.

# bin/linux/release/deviceQuery
bin/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 2817982464 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

Device 1: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 2817982464 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla M2050, Device = Tesla M2050


PASSED

Press <Enter> to Quit...
-----------------------------------------------------------

OpenCL is next. Build this one with make as well, and if nothing goes wrong, try running oclDeviceQuery.

# cd ../OpenCL
# make
# bin/linux/release/oclDeviceQuery
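
Finally, to check that PyOpenCL can actually compile and run a kernel on these GPUs, a vector-add along the lines of the classic PyOpenCL demo can be used. The code below is my own example, written against the PyOpenCL API of that era (newer releases replace enqueue_read_buffer with cl.enqueue_copy):

# vector_add.py -- build and run a trivial OpenCL kernel through PyOpenCL
import numpy as np
import pyopencl as cl

a = np.random.rand(50000).astype(np.float32)
b = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()   # picks (or asks for) one of the two Tesla M2050s
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
""").build()

prg.add(queue, a.shape, None, a_buf, b_buf, dest_buf)

result = np.empty_like(a)
cl.enqueue_read_buffer(queue, dest_buf, result).wait()  # cl.enqueue_copy() in newer PyOpenCL
print("max error: %g" % np.abs(result - (a + b)).max())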