# Benchmarking Python, Cython, and Cython+MPI

Updated on July 12, 2012

## Introduction

For better or worse, I'm primarily a Python programmer. However, I occasionally run into a severe need for speed. Luckily, I'm not alone and many great features and libraries are available to help get my code "closer to metal". In this article, I'll detail the performance of pure Python, Cython, and distributed Cython (via mpi4py) in numerical processing. The machine I'm running on is a Toshiba laptop with a quad-core i7 (8 threads).

## The Test: Naïve Numerical Integration

To test the approaches, I'll calculate the Riemann integral of sin(x2) over [0,1]. I'll vary the number of sub-intervals from 10,000,000 to 100,000,000.

## First Approach: Pure Python

For the base case, we'll use pure Python with the math module's sine function. Code is shown below.

## Pure Python Implementation

```import time, math

def f(x):
return math.sin(x**2)

def integrate(a,b,N):
s = 0
dx = (b-a)/float(N)
for i in range(N):
s += f(a + i*dx)
return s

for n in range(10000000,100000001,10000000):
start = time.time()
integrate(0,1,n)
tme = time.time() - start
print '%s,%s'%(n, tme)```

## Second Approach: Cython

Cython is an excellent and simple method for writing and calling C-functions in Python. Even if you have never programmed C before, you should be able to pick up the basics quickly and immediate add value to your more computationally intense tasks (assuming they can't be solved via Numpy).

To implement, I wrote a "a.pyx" file with C definitions and directly call C's math library. To build a Python-accessible module, I've supplied a setup.py file. To compile, run the following at the command-line:

`python setup.py build_ext --inplace`

## Cython File, a.pyx

```cdef extern from "math.h":
double sin(double)

cdef double f(double x) except *:
return sin(x**2)

def integrate(double a, double b, int N):
cdef int i
cdef double s, dx
s = 0
dx = (b-a)/N
for i in range(N):
s += f(a + i*dx)
return s*dx
```

## Cython Build File, setup.py

```from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("a", ["a.pyx"])]

setup(
name = 'Cython Example',
cmdclass = {'build_ext': build_ext},
ext_modules = ext_modules
)
```

## Cython Testing Script

```import a as App # integration is in "a.so"
import time

for n in range(10000000,100000001,10000000):
start = time.time()
v = App.integrate(0,1,n)
tme = time.time() - start
print '%s,%s,%s'%(n, tme, v)
```

## Third Approach: Cython with MPI

MPI stands for Message Passing Interface and is a standard library for running parallel tasks on high performance clusters and supercomputers. In this case, I'll use it to distribute the integration workload across 7 threads on my laptop. I'll use one thread as the "master"; the seven "slaves" will carry out integration over a sub-interval of the domain.

The specific MPI Python library to be used is called mpi4py, written by Lisandro Dalcin. MPI is very powerful, but involves a significant amount of I/O overhead, so we should expect relative performance to improve as the work volume increases.

## Cython + MPI: Master Script

```import time, sys, numpy
from mpi4py import MPI

PROC = 7 # seven slaves

for n in range(10000000,100000001,10000000):

start = time.time()

# we include the establishment/closing of connections within the timer
comm = MPI.COMM_SELF.Spawn(sys.executable, args=['slave.py'], maxprocs=PROC)
N = numpy.array(int(n/PROC+.5), 'i')
comm.Bcast([N, MPI.INT], root=MPI.ROOT)
AREA = numpy.array(0.0, 'd')
comm.Reduce(None, [AREA, MPI.DOUBLE], op=MPI.SUM, root=MPI.ROOT)
comm.Disconnect()

tme = time.time() - start

print '%s,%s'%(n, tme)
```

## Cython + MPI: Slave Script

```#!/usr/bin/env python
from mpi4py import MPI
import a as App # again, we use Cython integration from "a.so"
import numpy

comm = MPI.Comm.Get_parent()
size = comm.Get_size()
rank = comm.Get_rank()
N = numpy.array(0, dtype='i')
comm.Bcast([N, MPI.INT], root=0)
AREA = numpy.array(App.integrate(rank*0.1428571, (rank+1)*0.1428571, N), dtype='d')
comm.Reduce([AREA, MPI.DOUBLE], None, op=MPI.SUM, root=0)
comm.Disconnect()
```

## Analysis & Conclusion

We can see from the figures above that pure Python is, predictably, the slowest. For a little more effort, we can implement via Cython and gain a massive performance increase (a full order of magnitude). Finally, for large interval values, message passing can squeeze out even more performance. However, for smaller problems, the overhead introduced by setting up communication can be expensive.

Overall, while clearly a toy problem, I hope this illustrates how versatile Python can be and some of the lesser known tools for boosting performance.

9

0

31

3

5

0