When in 2010 I lived in Singapore I crossed my life path with some great guys working in the field of High-Performance Computing: Łukasz Orłowski, Marek T. Michalewicz, Iain Bell from Quadrant Capital. They both pointed my attention towards GPU computations utilizing Nvidia CUDA architecture. There was one problem. Everything was wrapped up with C syntax with a promise to do it more efficiently in C++. Soon.

So I waited and studied C/C++ at least at the level allowing me to understand some CUDA codes. Years were passing by until the day when I discovered an article of Mark Harris, NumbaPro: High-Performance Python with CUDA Acceleration, delivering Python-friendly CUDA solutions to all my nightmares involving C/C++ coding. At this point you just need to understand one thing: not every quant or algo trader is a fan of C/C++ the same as some people prefer Volvo to Audi, including myself ;)

Let’s have a sincere look at what the game is about. It is more than tempting to put your hands on a piece of code that allows you to speed-up some quantitative computations in Python making use of a new library.

**Accelerate your Python**

I absolutely love Continuum Analytics for the mission they stand for: making Python language easily accessible and used by everyone worldwide! It is a great language with a great syntax, easy to pick up, easy to be utilised in the learning process of the fundamentals of programming. Thanks to them, now you can download and install Python’s distribution of Anaconda for Windows, Mac OS X, or Linux just in few minutes (see my earlier post on Setting up Python for Quantitative Analysis in OS X 10.10 Yosemite as an example).

When you visit their webpage you can spot Anaconda’s Add-Ons, three additional software packages to their Python distribution. Among them, they offer *Accelerate* module containing **NumbaPro** library. Once you read the description, and I quote,

Accelerateis an add-on to Continuum’s free enterprise Python distribution, Anaconda. It opens up the full capabilities of your GPU or multi-core processor to Python.Accelerateincludes two packages that can be added to your Python installation:NumbaProandMKL Optimizations. MKL Optimizations makes linear algebra, random number generation, Fourier transforms, and many other operations run faster and in parallel.NumbaProbuilds fast GPU and multi-core machine code from easy-to-read Python and NumPy code with a Python-to-GPU compiler.

NumbaPro Features

– NumbaPro compiler targets multi-core CPU and GPUs directly from

simple Python syntax

– Easily move vectorized NumPy functions to the GPU

– Multiple CUDA device support

– Bindings for CUDA libraries, including cuBlas, cuRand, cuSparse, and cuFFT

– Support for array slicing and fast array math

– Use multiple threads without worrying about the GIL

– Supported on NVIDIA CUDA-enabled GPUs with compute capability 2.0

or above on Intel/AMD (x86) processors.

your blood pressure increases and the level of endorphins skyrockets. Why? Simply because of the promise to do some tasks faster utilising GPU in a parallel mode! If you are new to GPU or CUDA I recommend you to read some well written posts on Mike’s website, for instance, Installing Nvidia CUDA on Mac OSX for GPU-based Parallel Computing or Monte Carlo Simulations in CUDA – Barrier Option Pricing. You will grasp the essence of what is all about. In general, much ado about CUDA is still around making use of your GPU and proving this extra upmhhh in speedup. If you have any quantitative problem in mind and it can be executed in the parallel mode, NumbaPro is a tool you need to look at but – not every engine sounds the same. Hold on till the end of this post. It will be worth it.

**Selling the Speed**

When you approach a new concept or a new product and someone tries to sell it to you, he needs to impress you to win your attention and boost your curiosity. Imagine for a moment that you have no idea about GPU or CUDA and you want to add two vectors. In Python you can do it as follows:

import numpy as np from timeit import default_timer as timer def VectorAdd(a,b,c): for i in xrange(a.size): c[i]=a[i]+b[i] def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() VectorAdd(A,B,C) totaltime=timer()-start print("\nCPU time: %g sec" % totaltime) if __name__ == '__main__': main() |

We all know that once you run the code, Python does not compile it, it goes line by line and interprets what it reads. So, we aim at adding two vectors, $A$ and $B$, containing 32 millions of elements. I get $C=A+B$ matrix on my MacBook Pro (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, NVIDIA GeForce GT 650M 1GB) after:

CPU time: 9.89753 sec |

Can we do it better? With **NumbaPro** the required changes to the code itself are minor. All we need to add is a function decorator that tells how and where the function should be executed. In fact, what NumbaPro does is that it “compiles” *VectorAdd* function on-the-go and deploys computations to the GPU unit:

import numpy as np from timeit import default_timer as timer from numbapro import vectorize @vectorize(["float32(float32,float32)"], target="gpu") def VectorAdd(a,b): return a+b def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() C=VectorAdd(A,B) totaltime=timer()-start print("\nGPU time: %g sec" % totaltime) if __name__ == '__main__': main() |

We get

GPU time: 0.286101 sec |

i.e. 34.6x speed-up. Not bad, right?! Not bad if you’re a sale person indeed! But, hey, what’s that?:

import numpy as np from timeit import default_timer as timer def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() C=A+B totaltime=timer()-start print("\nCPU time: %g sec" % totaltime) if __name__ == '__main__': main() |

Run it to discover that:

CPU time: 0.0592878 sec |

i.e. 4.82x faster than using GPU. Oh, boy! CUDA:NumPy (0:1).

**Perfect Pitch**

When Nvidia introduced CUDA among some exemplary C codes utilising CUDA programming we could find an immortal **Black-Scholes model** for **option pricing**. In this Nobel-prize winning solution, we derive a call option price for non-dividend-paying underlying stock:

$$

C(S,t) = N(d_1)S – N(d_2)Ke^{-r(T-t)}

$$

where $(T-t)$ is the time to maturity (scalar), $r$ is the risk free rate (scalar), $S$ is the spot price of the underlying asset, $K$ is the strike price, and $\sigma$ is the volatility of returns of the underlying asset. $N(\dot)$ is the cumulative distribution function (cnd) of the standard normal distribution and has an analytical form. A classical way to code it in Python is:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | import numpy as np import time RISKFREE = 0.02 VOLATILITY = 0.30 def cnd(d): A1 = 0.31938153 A2 = -0.356563782 A3 = 1.781477937 A4 = -1.821255978 A5 = 1.330274429 RSQRT2PI = 0.39894228040143267793994605993438 K = 1.0 / (1.0 + 0.2316419 * np.abs(d)) ret_val = (RSQRT2PI * np.exp(-0.5 * d * d) * (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))) return np.where(d > 0, 1.0 - ret_val, ret_val) def black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears, Riskfree, Volatility): S = stockPrice X = optionStrike T = optionYears R = Riskfree V = Volatility sqrtT = np.sqrt(T) d1 = (np.log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT) d2 = d1 - V * sqrtT cndd1 = cnd(d1) cndd2 = cnd(d2) expRT = np.exp(- R * T) callResult[:] = (S * cndd1 - X * expRT * cndd2) def randfloat(rand_var, low, high): return (1.0 - rand_var) * low + rand_var * high def main (*args): OPT_N = 4000000 iterations = 10 if len(args) >= 2: iterations = int(args[0]) callResult = np.zeros(OPT_N) stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0) optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0) optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0) time0 = time.time() for i in range(iterations): black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears, RISKFREE, VOLATILITY) time1 = time.time() print("Time: %f msec per option" % ((time1-time0)/iterations/OPT_N*1000)) if __name__ == "__main__": import sys main(*sys.argv[1:]) |

what returns

Time: 0.000192 msec per option |

The essence of this code is to derive 4 million *independent* results based on feeding the * function with random stock prices, option strike prices, and times to maturity. They enter the game under cover as row vectors with randomised values (see lines #45-47). Anaconda Accelerate’s CUDA solution for the same code is:*

import numpy as np import math import time from numba import * from numbapro import cuda from blackscholes import black_scholes # save the previous code as # black_scholes.py RISKFREE = 0.02 VOLATILITY = 0.30 A1 = 0.31938153 A2 = -0.356563782 A3 = 1.781477937 A4 = -1.821255978 A5 = 1.330274429 RSQRT2PI = 0.39894228040143267793994605993438 @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True) def cnd_cuda(d): K = 1.0 / (1.0 + 0.2316419 * math.fabs(d)) ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) * (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))) if d > 0: ret_val = 1.0 - ret_val return ret_val @cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:], double, double)) def black_scholes_cuda(callResult, putResult, S, X, T, R, V): # S = stockPrice # X = optionStrike # T = optionYears # R = Riskfree # V = Volatility i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x if i >= S.shape[0]: return sqrtT = math.sqrt(T[i]) d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT) d2 = d1 - V * sqrtT cndd1 = cnd_cuda(d1) cndd2 = cnd_cuda(d2) expRT = math.exp((-1. * R) * T[i]) callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2) def randfloat(rand_var, low, high): return (1.0 - rand_var) * low + rand_var * high def main (*args): OPT_N = 4000000 iterations = 10 callResultNumpy = np.zeros(OPT_N) putResultNumpy = -np.ones(OPT_N) stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0) optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0) optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0) callResultNumba = np.zeros(OPT_N) putResultNumba = -np.ones(OPT_N) callResultNumbapro = np.zeros(OPT_N) putResultNumbapro = -np.ones(OPT_N) # Numpy ---------------------------------------------------------------- time0 = time.time() for i in range(iterations): black_scholes(callResultNumpy, putResultNumpy, stockPrice, optionStrike, optionYears, RISKFREE, VOLATILITY) time1 = time.time() dtnumpy = ((1000 * (time1 - time0)) / iterations)/OPT_N print("\nNumpy Time %f msec per option") % (dtnumpy) # CUDA ----------------------------------------------------------------- time0 = time.time() blockdim = 1024, 1 griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1 stream = cuda.stream() d_callResult = cuda.to_device(callResultNumbapro, stream) d_putResult = cuda.to_device(putResultNumbapro, stream) d_stockPrice = cuda.to_device(stockPrice, stream) d_optionStrike = cuda.to_device(optionStrike, stream) d_optionYears = cuda.to_device(optionYears, stream) time2 = time.time() for i in range(iterations): black_scholes_cuda[griddim, blockdim, stream]( d_callResult, d_putResult, d_stockPrice, d_optionStrike, d_optionYears, RISKFREE, VOLATILITY) d_callResult.to_host(stream) d_putResult.to_host(stream) stream.synchronize() time3 = time.time() dtcuda = ((1000 * (time3 - time2)) / iterations)/OPT_N print("Numbapro CUDA Time %f msec per option (speed-up %.1fx) \n") % (dtcuda, dtnumpy/dtcuda) # print(callResultNumbapro) if __name__ == "__main__": import sys main(*sys.argv[1:]) |

returning

Numpy Time 0.000186 msec per option Numbapro CUDA Time 0.000024 msec per option (speed-up 7.7x) |

In order to understand why CUDA wins over NumPy this time is not so difficult. First we have a programmable analytical form of the problem. We deploy it to GPU and perform exhaustive calculations involving *cnd_cuda* function for the estimation of the cumulative distribution function of the standard normal distribution. Splitting the task into many concurrently running threats on GPU reduces the time. Again, it’s possible because all option prices can be computed independently. CUDA:NumPy (1:1).

**Multiplied Promises**

In finance, the concept of **portfolio optimization** is well established (see my ebook on that, Applied Portfolio Optimization with Risk Management, as an example). The idea standing behind is to find such a vector of weights, $w$, for all assets that the derived estimated portfolio risk ($\sigma_P$) and return ($\mu_P$) meets our needs or expectations.

An alternative (but not greatly recommended) approach would involve optimization through randomisation of $w$ vectors. We could generate a big number of them, say $N$, in order to obtain:

$$

\mu_{P,i} = m w_i^T \ \ \ \mbox{and} \ \ \ \sigma_{P,i} = w_iM_2w_i^T

$$ for $i=1,…,N$. Here, $m$ is a row-vector holding estimated expected returns for all assets in portfolio $P$ and based on return-series we end up with $M_2$ covariance matrix $(MxM$ where $M$ is a number of assets in $P$). In the first case, we aim at the multiplication of row-vector with a transposed row-vector of weight whereas for the latter we perform the multiplication of the row-vector with the square matrix (as the first operation). If $N$ is really big, say a couple of millions, the computation could be somehow accelerated using GPU. Therefore, we need a code for matrix multiplication on GPU in Python.

Let’s consider a more advanced concept of ($K\times K$)$\times$($K\times K$) matrix multiplication. If that works faster, than our **random portfolios** problem should be even faster. Continuum Analytics provides with a ready-to-use solution:

import numpy as np from numbapro import cuda import numba from timeit import default_timer as timer from numba import float32 bpg = 32 tpb = 32 n = bpg * tpb shared_mem_size = (tpb, tpb) griddim = bpg, bpg blockdim = tpb, tpb @numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])") def naive_matrix_mult(A, B, C): x, y = cuda.grid(2) if x >= n or y >= n: return C[y, x] = 0 for i in range(n): C[y, x] += A[y, i] * B[i, x] @numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])") def optimized_matrix_mult(A, B, C): # Declare shared memory sA = cuda.shared.array(shape=shared_mem_size, dtype=float32) sB = cuda.shared.array(shape=shared_mem_size, dtype=float32) tx = cuda.threadIdx.x ty = cuda.threadIdx.y x, y = cuda.grid(2) acc = 0 for i in range(bpg): if x < n and y < n: # Prefill cache sA[ty, tx] = A[y, tx + i * tpb] sB[ty, tx] = B[ty + i * tpb, x] # Synchronize all threads in the block cuda.syncthreads() if x < n and y < n: # Compute product for j in range(tpb): acc += sA[ty, j] * sB[j, tx] # Wait until all threads finish the computation cuda.syncthreads() if x < n and y < n: C[y, x] = acc # Prepare data on the CPU A = np.array(np.random.random((n, n)), dtype=np.float32) B = np.array(np.random.random((n, n)), dtype=np.float32) print "(%d x %d) x (%d x %d)" % (n, n, n, n) # Prepare data on the GPU dA = cuda.to_device(A) dB = cuda.to_device(B) dC = cuda.device_array_like(A) # Time the unoptimized version s = timer() naive_matrix_mult[griddim, blockdim](dA, dB, dC) numba.cuda.synchronize() e = timer() unopt_ans = dC.copy_to_host() tcuda_unopt = e - s # Time the optimized version s = timer() optimized_matrix_mult[griddim, blockdim](dA, dB, dC) numba.cuda.synchronize() e = timer() opt_ans = dC.copy_to_host() tcuda_opt = e - s assert np.allclose(unopt_ans, opt_ans) print "CUDA without shared memory:", "%.2f" % tcuda_unopt, "s" print "CUDA with shared memory :", "%.2f" % tcuda_opt, "s" s = timer() np.dot(A,B) e = timer() npt=e-s print "NumPy dot product :", "%.2f" % npt, "s" |

what returns

(1024 x 1024) x (1024 x 1024) CUDA without shared memory: 0.76 s CUDA with shared memory : 0.25 s NumPy dot product : 0.06 s |

and leads to CUDA:NumPy (1:2) score of the game. The natural questions arise. Is it about the matrix size? Maybe it is too simple problem we try to solve it with an improper tool? Or the way how we approach matrix allocation and deployment to GPU itself?

The last question made me digging deeper. In Python you can create one big matrix holding a number of smaller matrices. The following code tries to perform 4 million $(2\times 2)$ matrix multiplications where matrix $B$ is randomised every single time (see our random portfolio problem). In Anaconda Accelerate we achieve it as follows:

import numbapro import numba.cuda import numpy as np from timeit import default_timer as timer # Use the builtin matrix_multiply in NumPy for CPU test import numpy.core.umath_tests as ut @numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m, n),(n, p)->(m, p)', target='gpu') def batch_matrix_mult(a, b, c): for i in range(c.shape[0]): for j in range(c.shape[1]): tmp = 0 for n in range(a.shape[1]): tmp += a[i, n] * b[n, j] c[i, j] = tmp def main(): n = 4000000 dim = 2 sK=0 KN=10 for K in range(KN): # Matrix Multiplication: c = a x b a = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) c = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) # NUMPY ------------------------------------------------------------- start = timer() b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) d=ut.matrix_multiply(a, b) np_time=timer()-start # CUDA -------------------------------------------------------------- dc = numba.cuda.device_array_like(c) da = numba.cuda.to_device(a) start = timer() b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) db = numba.cuda.to_device(b) batch_matrix_mult(da, db, out=dc) numba.cuda.synchronize() dc.copy_to_host(c) cuda_time=timer()-start sK += np_time/cuda_time del da, db print("\nThe average CUDA speed-up: %.5fx") % (sK/KN) if __name__ == '__main__': main() |

leading us to

The average CUDA speed-up: 0.79003x |

i.e. deceleration. Playing with the sizes of matrices and their number may result in error caused by GPU memory required for matrix $A$ and $B$ allocation on GPU. It seems we have CUDA:NumPy (1:3).

**Black Magic in Black Box**

I approached NumbaPro solution as a complete rookie. I spent a considerable amount of time searching for Anaconda Accelerate’s GPU codes demonstrating massive speed-ups as promised. I found different fragments in different places across the Web. With the best method known among all beginners, namely, copy and paste, I re-ran what I found. Then modified and re-ran again. And again, and again. I felt the need, the need for speed! But failed finding my tail wind.

This post may demonstrate my lack of understanding of what is going on or reveal a blurred picture standing behind: a magic that works if you know all the tricks. I hope that at least you enjoyed the show!