Make Your Program Faster!

There are many methods your disposal to improve the run-time performance of your program. While using pure python is typically fast to write and implement, some actions can lead your program to become prohibitively slow when performing many complicated actions.

Test case: Trig Operations on a 1D Array

We are going to perform a fairly basic action. We will define a function that takes two 1D arrays, computes the sine of all values in the first array, the cosine of all values in the second array, adds them togther, and returns them. We will also learn how to use %timeit and %prun to determine the performance of our functions.

Setup

In [1]:
array_1 = []
array_2 = []
for i in range(1000000):
    array_1.append(i)
    array_2.append(i*i)
In [2]:
print array_2[100:110]
[10000, 10201, 10404, 10609, 10816, 11025, 11236, 11449, 11664, 11881]

We now have two, 1D arrays, filled with 1 million values each. We will use these arrays in all of our upcoming tests.

Pure Python

Let's put together a pure python program that does our required task in the most straightforward and naive (and poorly optimized) way:

In [3]:
from math import sin,cos

def pure_python_func_v1(A,B):
    result = []
    A_sin = []
    B_cos = []
    for x in A:
        A_sin.append(sin(x))
    for x in B:
        B_cos.append(cos(x))
    for i in range(len(A)):
        result.append(A_sin[i]+B_cos[i])
    
    return result

We will use the ipython magic command %timeit to gauge how fast this particular function runs:

In [4]:
%timeit pure_python_func_v1(array_1,array_2)
1 loops, best of 3: 513 ms per loop

That's not great. If we start working with larger datasets, this operation can easily bottleneck our analysis and waste precious cpu time. There's a few methods we can use to reduce time immediately. First, we should get rid of extraneous loops:

In [5]:
def pure_python_func_v2(A,B):
    result = []
    for i in range(len(A)):
        result.append(sin(A[i])+cos(B[i]))
    
    return result
In [6]:
%timeit pure_python_func_v2(array_1,array_2)
1 loops, best of 3: 322 ms per loop

We're using inherent symmetries in the task at hand to reduce extra function calls. Since the dimensions are the same between the A, B, and result, we can eliminate extra loops. We can also remove redundant function reevaluations by "removing the dots"

In [7]:
def pure_python_func_v3(A,B):
    result = []
    append = result.append
    for i in range(len(A)):
        append(sin(A[i])+cos(B[i]))
    
    return result
In [8]:
%timeit pure_python_func_v3(array_1,array_2)
1 loops, best of 3: 271 ms per loop

Pure Python with list comprehension and map

Python optimizes list operations when we use list comprehension ([func(i) for i in my_list]). This removes some of the list creation overhead, and thus can reduce our total runtime. We have to be careful when using these optimizations, however. Consider the following:

In [9]:
def list_comp_func_v1(A,B):
    return [sin(x)+cos(y) for x,y in zip(A,B)]
In [10]:
%timeit list_comp_func_v1(array_1,array_2)
1 loops, best of 3: 342 ms per loop

Oh no, this is slower than the last iteration! I lied to you! What's going on in this simple-looking line of code? Let's take a look at a breakdown of two of our functions using prun:

In [11]:
%prun pure_python_func_v3(array_1,array_2)
 
    3000005 function calls in 0.777 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.461    0.461    0.771    0.771 <ipython-input-8-775bf4226fba>:1(pure_python_func_v3)
  1000000    0.118    0.000    0.118    0.000 {math.cos}
  1000000    0.112    0.000    0.112    0.000 {math.sin}
  1000000    0.071    0.000    0.071    0.000 {method 'append' of 'list' objects}
        1    0.009    0.009    0.009    0.009 {range}
        1    0.007    0.007    0.777    0.777 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

After running prun, we can see that there are 3 calls made in the loop body (math.cos, math.sin, and append), along with a a single range call, among others.

In [12]:
%prun list_comp_func_v1(array_1,array_2)
 
    2000004 function calls in 0.640 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.323    0.323    0.634    0.634 <ipython-input-10-6fcd029b7b31>:1(list_comp_func_v1)
  1000000    0.105    0.000    0.105    0.000 {math.cos}
        1    0.105    0.105    0.105    0.105 {zip}
  1000000    0.101    0.000    0.101    0.000 {math.sin}
        1    0.006    0.006    0.640    0.640 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

For our list comprehension, we remove both the range call and 1,000,000 calls to append. But we add a call to zip, which creates an entire new object of 1,000,000 pairs of tuples before the internal loop starts. We don't need to create that entire object before we start, instead we should just generate pairs as we need them:

In [13]:
from itertools import izip
def list_comp_func_v2(A,B):
    return [sin(x)+cos(y) for x,y in izip(A,B)]
In [14]:
%timeit list_comp_func_v2(array_1,array_2)
1 loops, best of 3: 239 ms per loop

Excellent! This is substantially faster than our for-loop methods. Let's see what prun says:

In [15]:
%prun list_comp_func_v2(array_1,array_2)
 
    2000003 function calls in 0.520 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.307    0.307    0.512    0.512 <ipython-input-14-86d19d4a71c8>:2(list_comp_func_v2)
  1000000    0.104    0.000    0.104    0.000 {math.cos}
  1000000    0.101    0.000    0.101    0.000 {math.sin}
        1    0.008    0.008    0.520    0.520 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

We no longer have a call to zip. Instead, our call to izip is a generator function; it only produces iterables as we need them. This greatly reduces runtime when we are iterating over a large number of values. In fact, we could use a similar trick when generating array_1 and array_2 back at the top of this page. Instead of calling range, we could call xrange (a generator function). Try it out!

As one last foray into pure python, let's take a look at map and lambda:

In [16]:
def map_func_v1(A,B):
    return map(lambda x,y:sin(x)+cos(y), A, B)
In [17]:
%timeit map_func_v1(array_1,array_2)
1 loops, best of 3: 312 ms per loop
In [18]:
%prun list_comp_func_v1(array_1,array_2)
 
    2000004 function calls in 0.660 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.342    0.342    0.654    0.654 <ipython-input-10-6fcd029b7b31>:1(list_comp_func_v1)
  1000000    0.108    0.000    0.108    0.000 {math.cos}
  1000000    0.105    0.000    0.105    0.000 {math.sin}
        1    0.099    0.099    0.099    0.099 {zip}
        1    0.006    0.006    0.660    0.660 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

lambda is the temporary function keyword. In our case, we make a function that takes in two variables (x,y), and returns sin(x)+cos(y). The map keyword applies this function to pairwise elements of the arrays A and B. If our operation is too complex to easily be converted into list comprehension syntax, using map and lambda can still improve performance over a simple for-loop.

Using Numpy

Numpy is optimized for mathematical operations on arrays of values. Numpy's core is written in C, which means the numpy lists are actually C-arrays. They are static and typed, unlike almost all objects in pure Python. Numpy core math functions are optimized to work extremely quickly over data arrays, and their inputs can be arrays of any number of dimensions. This is called vectorization, and the core numpy math functions are ufuncs (universal functions) (http://docs.scipy.org/doc/numpy/reference/ufuncs.html).

In [19]:
import numpy as np

def numpy_func_v1(A,B):
    return np.sin(A)+np.cos(B)
In [20]:
%timeit numpy_func_v1(array_1,array_2)
10 loops, best of 3: 139 ms per loop
In [21]:
array_1_np = np.asarray(array_1) # convert the lists to numpy arrays
array_2_np = np.asarray(array_2)
%timeit numpy_func_v1(array_1_np,array_2_np)
10 loops, best of 3: 46.8 ms per loop

This is a significant improvement, a factor of ~6 over the best pure python we could manage. Keep in mind that numpy is optimized for vectorization, not for-loops! Look at what happens when we loop over numpy arrays:

In [22]:
def numpy_func_v2(A,B):
    result = np.empty(len(A))
    for i in range(len(A)):
        result[i]= np.sin(A[i])+np.cos(B[i])
    return result
In [23]:
%timeit numpy_func_v2(array_1_np,array_2_np)
1 loops, best of 3: 2.17 s per loop

That's horrible. Numpy is not to be used like pure python!

Even More!

There are even more packages available to speed up code. Try them out, but make sure to use tools like timeit and prun to find and fix obvious bottlenecks and flaws in your code before you jump straight into powerful optimization tools! Here's an example of another tool, numexpr.

In [24]:
import numexpr as ne
In [25]:
def numexpr_func(A,B):
    return ne.evaluate('sin(A)+cos(B)')
In [26]:
%timeit numexpr_func(array_1_np,array_2_np)
100 loops, best of 3: 8.95 ms per loop

How do I get this and how does it work? Check it out here: https://github.com/pydata/numexpr