Setting up PyOpenCL and more on a Cluster GPU Instance
Continuing from "Setting up a PyCUDA environment on a Cluster GPU Instance", let's also install PyOpenCL and the GPU Computing SDK code samples. First, install PyOpenCL with pip.
CPLUS_INCLUDE_PATH=/usr/local/cuda/include PATH=/opt/local/bin:$PATH pip install pyopencl
That took a single command. Run a sample program or two to confirm that it works.
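As one possible check, the following minimal sketch lists the OpenCL platforms and devices that PyOpenCL can see. The helper name `list_opencl_devices` is hypothetical and not part of PyOpenCL. It assumes only the standard `pyopencl.get_platforms()` and `Platform.get_devices()` calls, and it returns an empty list rather than failing when PyOpenCL or an OpenCL platform is missing.

```python
def list_opencl_devices():
    """Return (platform name, device name) pairs, or [] if OpenCL is unusable.

    Hypothetical helper for a quick sanity check; not part of PyOpenCL itself.
    """
    try:
        import pyopencl as cl
    except ImportError:
        return []  # PyOpenCL is not installed in this environment

    pairs = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                pairs.append((platform.name, device.name))
    except cl.Error:
        return []  # no usable OpenCL platform or device was found
    return pairs


if __name__ == "__main__":
    found = list_opencl_devices()
    if not found:
        print("No OpenCL devices visible from PyOpenCL")
    for platform_name, device_name in found:
        print("%s / %s" % (platform_name, device_name))
```

On a Cluster GPU Instance set up as described here, one would expect both Tesla M2050 cards to appear in the list, although the exact platform and device strings depend on the installed driver.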
Next come the GPU Computing SDK code samples. At the time of writing, version 3.1 of the Toolkit is preinstalled, so be aware that a newer package such as the 3.2 code samples will not build with make even if you download it.
# wget http://developer.download.nvidia.com/compute/cuda/3_1/sdk/gpucomputingsdk_3.1_linux.run
# sh gpucomputingsdk_3.1_linux.run
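Because the SDK version has to match the preinstalled Toolkit, it can help to check the Toolkit release before picking a package. The sketch below is only an illustration: it parses the "release X.Y" text that `nvcc --version` prints and compares it with an SDK version. The function names and the sample version string are assumptions made for this example.

```python
import re


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sdk_matches_toolkit(sdk_version, version_text):
    """Return True when the SDK (major, minor) equals the Toolkit release."""
    return parse_nvcc_release(version_text) == tuple(sdk_version)


if __name__ == "__main__":
    # Illustrative sample text (an assumption); in practice, paste the real
    # output of `nvcc --version` here.
    sample = "Cuda compilation tools, release 3.1, V0.2.1221"
    print(parse_nvcc_release(sample))
    print(sdk_matches_toolkit((3, 1), sample))
    print(sdk_matches_toolkit((3, 2), sample))
```

With a 3.1 Toolkit, the 3.1 SDK matches and the 3.2 SDK does not, which corresponds to the build failure mentioned above.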
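If you have followed exactly the same steps as I have up to this point, you can simply press Enter at each of the installer's prompts. The build needs a few libraries, so install them with yum.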
# yum install libGLU-devel libXi-devel libXmu-devel freeglut-devel
After that, all is well as long as make finishes without errors.
# cd NVIDIA_GPU_COMPUTING_SDK/C
# make
Let's run deviceQuery. You should see information for the two Tesla M2050 cards installed in the Cluster GPU Instance.
# bin/linux/release/deviceQuery
bin/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 2817982464 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

Device 1: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 2817982464 bytes
  Number of multiprocessors:                     14
  Number of cores:                               448
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla M2050, Device = Tesla M2050

PASSED

Press <Enter> to Quit...
-----------------------------------------------------------
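If you want to check this result from a script instead of reading it by eye, the summary line near the end of the output is the easiest part to parse. The following is a minimal sketch; the function name `parse_device_query_summary` is hypothetical, and it assumes the comma-separated "key = value" layout shown in the output above.

```python
def parse_device_query_summary(line):
    """Return (NumDevs, [device names]) from the deviceQuery summary line."""
    num_devs = None
    devices = []
    for field in line.split(","):
        if "=" not in field:
            continue  # e.g. the leading "deviceQuery" token has no value
        key, value = [part.strip() for part in field.split("=", 1)]
        if key == "NumDevs":
            num_devs = int(value)
        elif key == "Device":
            devices.append(value)
    return num_devs, devices


if __name__ == "__main__":
    summary = ("deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, "
               "CUDA Runtime Version = 3.10, NumDevs = 2, "
               "Device = Tesla M2050, Device = Tesla M2050")
    num_devs, devices = parse_device_query_summary(summary)
    print(num_devs)   # 2
    print(devices)    # ['Tesla M2050', 'Tesla M2050']
```

Comparing `num_devs` with `len(devices)` is a cheap consistency check that both cards were actually reported.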
Next is OpenCL. Build it with make as well and, if nothing goes wrong, run oclDeviceQuery.
# cd ../OpenCL
# make
# bin/linux/release/oclDeviceQuery