Parallel Computing

With Python



Jason Champion

@Xangis

github.com/Xangis


Why GPGPU?


Supercomputing with high performance per watt.


Physics simulations, medical imaging, oil and gas exploration.


Weather, particle systems, gravity.

Gigaflops Performance

(Approximate)

  • Intel Core 2 Duo: 20 GFLOPS
  • Intel i5, AMD Phenom II x4: 50 GFLOPS
  • Intel i7: 100 GFLOPS
  • NVIDIA Geforce 630M: 300 GFLOPS
  • NVIDIA Geforce 640, 650M, 750M: 700 GFLOPS
  • Intel Xeon Phi: 1000-1200 GFLOPS
  • NVIDIA GTX 760, AMD Radeon 6950: 2200 GFLOPS
  • NVIDIA Geforce GTX Titan: 4500 GFLOPS
  • AMD Radeon 7990: 8200 GFLOPS (face melt!)

Speed of the FASTEST Supercomputer

1993: 60 GFLOPS (~ Intel Core i5)

1995: 220 GFLOPS (~ modern dual Xeon)

1997: 1.3 TFLOPS (~ Intel Xeon Phi)

1999: 2.4 TFLOPS (~ Radeon 6950)

2002: 35 TFLOPS (~ quad Radeon 7990)

2005: 280 TFLOPS

2008: 1 PFLOPS

2010: 10 PFLOPS

2013: 33 PFLOPS

A $100 million 1998-era supercomputer can be had for $200.

Toys!

Intel Xeon Phi


Toys!

NVIDIA Geforce GTX Titan

Toys!

AMD Radeon 7990

Before GPGPU


Win32 Threads

The "dinner from a diaper" of parallelism. In C.

Being replaced by C++ AMP, but still Windows-only.


Pthreads

"Old reliable" works great for non-GPU uses once you know it.


OpenMP

Makes life easy in the multicore CPU world.

(If you're doing multicore work in not-Python, it's well worth learning.)

Early GPGPU



No dedicated computing language or APIs.

People used programmable graphics shaders to perform calculations.

Slightly less fun than writing assembly language.

NVIDIA CUDA


Modified C programming language.


First general-purpose GPU computing API.


Only runs on NVIDIA hardware.
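
A minimal sketch of driving a CUDA kernel from Python with PyCUDA (illustrative, not code from this talk; assumes a working CUDA install). The kernel itself is still CUDA C:

import numpy as np
import pycuda.autoinit  # creates a context on the default GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# CUDA C source, compiled at runtime by nvcc.
mod = SourceModule("""
__global__ void double_them(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= 2.0f;
}
""")

a = np.random.randn(256).astype(np.float32)
double_them = mod.get_function("double_them")
# InOut copies the array to the GPU and back automatically.
double_them(cuda.InOut(a), block=(256, 1, 1), grid=(1, 1))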

OpenCL


More complex than CUDA.


General-purpose C-based computing API.


Standardized by the Khronos Group (the same group behind OpenGL, with a similar API).


Runs on CPUs and GPUs.


PyOpenCL was used for the poclbm OpenCL bitcoin miner.
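
For comparison, a minimal PyOpenCL sketch of a vector add (again illustrative, not from this talk). The same kernel source runs on a CPU or a GPU, whichever device the context picks:

import numpy as np
import pyopencl as cl

a = np.random.rand(50000).astype(np.float32)
b = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()  # CPU or GPU, whatever is available
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# OpenCL C source, compiled at runtime for the chosen device.
prg = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)
out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)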

Python for Supercomputing


Fortran, C, and assembly consistently benchmark as the fastest programming languages.


The global interpreter lock prevents OpenMP-style, CPU-based thread parallelism from working well in Python.
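
A quick way to see the GIL in action (a toy benchmark, not from this talk): four CPU-bound threads take about as long as running the work four times in a row, while four processes actually run in parallel.

import time
from threading import Thread
from multiprocessing import Process

def burn():
    # Pure-Python CPU work; a thread holds the GIL the whole time.
    total = 0
    for i in range(10**7):
        total += i

if __name__ == "__main__":
    for cls in (Thread, Process):
        start = time.time()
        workers = [cls(target=burn) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(cls.__name__, time.time() - start)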


Python is widely known as being slow, so why use it for supercomputing?

Python for Supercomputing


It's optimized for the developer, and programmer time often costs more than CPU time.


Python has awesome libraries, especially for data visualization.


Fast prototyping.


Can easily (-ish) port to C if more speed is needed after the idea is proven, but you won't need to.

Prerequisites


AMD:

The AMD APP SDK

NVIDIA:

The NVIDIA CUDA SDK

Intel:

The Intel Xeon Phi SDK


Get them from the manufacturers' websites (see Resources).


NumPy, SciPy (pip handles these)

Installing PyOpenCL


For the lazy Linux user:

sudo apt-get install python-pyopencl


It's on PyPI:

https://pypi.python.org/pypi/pyopencl

(pip install pyopencl)


Apple + AMD users are out of luck.
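
Once it's installed, a quick sanity check is to list the platforms and devices PyOpenCL can see:

import pyopencl as cl

for platform in cl.get_platforms():
    print(platform.name)
    for device in platform.get_devices():
        print("   ", device.name)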

Pitfalls / Hassles


Ubuntu 13.10 with NVIDIA Optimus (Bumblebee) is not awesome. One of the many reasons Linus gave NVIDIA the finger.

Installing PyCUDA


For the lazy Linux user:

sudo apt-get install python-pycuda


It's on PyPI:

https://pypi.python.org/pypi/pycuda

(pip install pycuda)
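
Same idea for PyCUDA, checking that it can see your card:

import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(dev.name(), "-", dev.total_memory() // (1024 * 1024), "MB")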

Trivial Example:

Mandelbrot Set Fractal

PyCUDA


[Source in Console]


From:

http://craneium.net/index.php?option=com_content&view=category&layout=blog&id=37&Itemid=97
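
The linked source isn't reproduced here, but the core of a PyCUDA Mandelbrot renderer looks roughly like this (a sketch of the technique, not the code from that page; one GPU thread computes the escape time for one pixel):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

W, H, MAX_ITER = 512, 512, 255

mod = SourceModule("""
__global__ void mandelbrot(unsigned char *out, int w, int h, int max_iter)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Map the pixel to a point c in the complex plane.
    float cr = -2.0f + 3.0f * x / w;
    float ci = -1.5f + 3.0f * y / h;
    float zr = 0.0f, zi = 0.0f;
    int i;
    for (i = 0; i < max_iter && zr * zr + zi * zi < 4.0f; i++) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
    }
    out[y * w + x] = (unsigned char)i;  // escape time as a gray value
}
""")

image = np.empty(W * H, dtype=np.uint8)
kernel = mod.get_function("mandelbrot")
kernel(cuda.Out(image), np.int32(W), np.int32(H), np.int32(MAX_ITER),
       block=(16, 16, 1), grid=(W // 16, H // 16))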

Less Trivial Examples:

N-body Gravity Simulator


CUDA


Ocean Simulator

OpenCL


[See video]

Cryptocurrency on the GPU:

Don't Do It!

GPUs are much better for cryptographic calculations than CPUs.


Bitcoin mining doesn't use GPUs anymore; it uses custom ASIC hardware.


Litecoin and Feathercoin still mine reasonably well on GPUs.


Your return on investment, including hardware amortization and electricity costs, will be break-even at best.


Just buy the coins on an exchange. Or create a better cryptocurrency that doesn't depend on environment-damaging power use.

OpenGL Interop


Your data is already on the GPU.  Why not just render it?


Memory copies are suddenly not a problem.


That's what's happening with the ocean simulator.
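
With PyOpenCL, the sharing looks roughly like this (a sketch under assumptions: an OpenGL context is already current, `vbo` stands in for an existing GL vertex buffer handle, and some platforms need extra context properties):

import pyopencl as cl
from pyopencl.tools import get_gl_sharing_context_properties

# An OpenCL context that shares memory with the current GL context.
# (Some platforms also want a PLATFORM entry in the properties list.)
ctx = cl.Context(properties=get_gl_sharing_context_properties())
queue = cl.CommandQueue(ctx)

# Wrap the existing GL vertex buffer; no host round-trip needed.
coords = cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(vbo))  # vbo: hypothetical GL handle

cl.enqueue_acquire_gl_objects(queue, [coords])
# ... run kernels that read and write `coords` here ...
cl.enqueue_release_gl_objects(queue, [coords])
queue.finish()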

Pitfalls


You have to know where your memory is.


You have to learn a good deal about GPU architecture to use it well.


You have to think way too much about what memory you're using and what data you're copying where.


The actual GPU programs are still in C.


Did I mention thinking too much about memory?

Resources


Lots of books; pretty much all of them focus on C:

CUDA by Example

The CUDA Handbook

OpenCL Programming Guide


Website of Andreas Klöckner, creator of PyCUDA and PyOpenCL:

http://mathema.tician.de/software/

Good documentation and examples on the wiki.


This presentation and trivial test apps for CUDA and OpenCL:

https://slid.es/xangis/parallel-computing/ ; http://github.com/Xangis

More Resources


NVIDIA CUDA SDK:

https://developer.nvidia.com/cuda-downloads

LOTS of great code samples in the SDK.


AMD APP SDK (OpenCL):

http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/


Intel OpenCL SDK:

http://software.intel.com/en-us/vcsource/tools/opencl-sdk

