# Benchmarking Python, Cython, and Cython+MPI

## Introduction

For better or worse, I'm primarily a Python programmer. However, I occasionally run into a severe need for speed. Luckily, I'm not alone and many great features and libraries are available to help get my code "closer to metal". In this article, I'll detail the performance of pure Python, Cython, and distributed Cython (via mpi4py) in numerical processing. The machine I'm running on is a Toshiba laptop with a quad-core i7 (8 threads).

## The Test: Naïve Numerical Integration

To test the approaches, I'll calculate the Riemann integral of sin(x^{2}) over [0,1]. I'll vary the number of sub-intervals from 10,000,000 to 100,000,000.

## First Approach: Pure Python

For the base case, we'll use pure Python with the math module's sine function. Code is shown below.

## Pure Python Implementation

import time, math def f(x): return math.sin(x**2) def integrate(a,b,N): s = 0 dx = (b-a)/float(N) for i in range(N): s += f(a + i*dx) return s for n in range(10000000,100000001,10000000): start = time.time() integrate(0,1,n) tme = time.time() - start print '%s,%s'%(n, tme)

## Second Approach: Cython

Cython is an excellent and simple method for writing and calling C-functions in Python. Even if you have never programmed C before, you should be able to pick up the basics quickly and immediate add value to your more computationally intense tasks (assuming they can't be solved via Numpy).

To implement, I wrote a "a.pyx" file with C definitions and directly call C's math library. To build a Python-accessible module, I've supplied a setup.py file. To compile, run the following at the command-line:

python setup.py build_ext --inplace

## Cython File, a.pyx

cdef extern from "math.h": double sin(double) cdef double f(double x) except *: return sin(x**2) def integrate(double a, double b, int N): cdef int i cdef double s, dx s = 0 dx = (b-a)/N for i in range(N): s += f(a + i*dx) return s*dx

## Cython Build File, setup.py

from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext ext_modules = [Extension("a", ["a.pyx"])] setup( name = 'Cython Example', cmdclass = {'build_ext': build_ext}, ext_modules = ext_modules )

## Cython Testing Script

import a as App # integration is in "a.so" import time for n in range(10000000,100000001,10000000): start = time.time() v = App.integrate(0,1,n) tme = time.time() - start print '%s,%s,%s'%(n, tme, v)

## Third Approach: Cython with MPI

MPI stands for Message Passing Interface and is a standard library for running parallel tasks on high performance clusters and supercomputers. In this case, I'll use it to distribute the integration workload across 7 threads on my laptop. I'll use one thread as the "master"; the seven "slaves" will carry out integration over a sub-interval of the domain.

The specific MPI Python library to be used is called mpi4py, written by Lisandro Dalcin. MPI is very powerful, but involves a significant amount of I/O overhead, so we should expect relative performance to improve as the work volume increases.

## Cython + MPI: Master Script

import time, sys, numpy from mpi4py import MPI PROC = 7 # seven slaves for n in range(10000000,100000001,10000000): start = time.time() # we include the establishment/closing of connections within the timer comm = MPI.COMM_SELF.Spawn(sys.executable, args=['slave.py'], maxprocs=PROC) N = numpy.array(int(n/PROC+.5), 'i') comm.Bcast([N, MPI.INT], root=MPI.ROOT) AREA = numpy.array(0.0, 'd') comm.Reduce(None, [AREA, MPI.DOUBLE], op=MPI.SUM, root=MPI.ROOT) comm.Disconnect() tme = time.time() - start print '%s,%s'%(n, tme)

## Cython + MPI: Slave Script

#!/usr/bin/env python from mpi4py import MPI import a as App # again, we use Cython integration from "a.so" import numpy comm = MPI.Comm.Get_parent() size = comm.Get_size() rank = comm.Get_rank() N = numpy.array(0, dtype='i') comm.Bcast([N, MPI.INT], root=0) AREA = numpy.array(App.integrate(rank*0.1428571, (rank+1)*0.1428571, N), dtype='d') comm.Reduce([AREA, MPI.DOUBLE], None, op=MPI.SUM, root=0) comm.Disconnect()

## Performance Comparison

## Performance Comparison - Log Time

## Analysis & Conclusion

We can see from the figures above that pure Python is, predictably, the slowest. For a little more effort, we can implement via Cython and gain a massive performance increase (a full order of magnitude). Finally, for large interval values, message passing can squeeze out even more performance. However, for smaller problems, the overhead introduced by setting up communication can be expensive.

Overall, while clearly a toy problem, I hope this illustrates how versatile Python can be and some of the lesser known tools for boosting performance.