There are many methods at your disposal for improving the run-time performance of your program. Pure Python is typically fast to write, but a program can become prohibitively slow when it performs many complicated operations.
We are going to perform a fairly basic task. We will define a function that takes two 1D arrays, computes the sine of all values in the first array and the cosine of all values in the second array, adds them together, and returns the result.
We will also learn how to use %timeit and %prun to measure the performance of our functions.
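If you are working outside IPython, the standard-library timeit module provides the same kind of measurement as %timeit. A minimal sketch (the statement and repetition count here are purely illustrative):

```python
import timeit

# Time a small statement; number controls how many times it is executed.
elapsed = timeit.timeit('sum(range(100))', number=1000)
print(elapsed)  # total seconds for all 1000 runs
```

Unlike %timeit, the module does not pick the repetition count for you, so choose `number` large enough that the total time swamps timer resolution.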
```python
array_1 = []
array_2 = []
for i in range(1000000):
    array_1.append(i)
    array_2.append(i*i)
print(array_2[100:110])
```
We now have two 1D arrays, each filled with one million values. We will use these arrays in all of our upcoming tests.
Let's put together a pure python program that does our required task in the most straightforward and naive (and poorly optimized) way:
```python
from math import sin, cos

def pure_python_func_v1(A, B):
    result = []
    A_sin = []
    B_cos = []
    for x in A:
        A_sin.append(sin(x))
    for x in B:
        B_cos.append(cos(x))
    for i in range(len(A)):
        result.append(A_sin[i] + B_cos[i])
    return result
```
We will use the IPython magic command %timeit to gauge how fast this particular function runs:

```python
%timeit pure_python_func_v1(array_1, array_2)
```
That's not great. If we start working with larger datasets, this operation can easily bottleneck our analysis and waste precious CPU time. There are a few methods we can use to reduce the runtime immediately. First, we should get rid of extraneous loops:
```python
def pure_python_func_v2(A, B):
    result = []
    for i in range(len(A)):
        result.append(sin(A[i]) + cos(B[i]))
    return result
```

```python
%timeit pure_python_func_v2(array_1, array_2)
```
Since A, B, and the result all have the same length, we can collapse the three loops into one and eliminate the extra passes over the data. We can also avoid repeated attribute lookups inside the loop by "removing the dots":
```python
def pure_python_func_v3(A, B):
    result = []
    append = result.append  # look up the bound method once, outside the loop
    for i in range(len(A)):
        append(sin(A[i]) + cos(B[i]))
    return result
```

```python
%timeit pure_python_func_v3(array_1, array_2)
```
map

Python optimizes list operations when we use list comprehensions ([func(i) for i in my_list]). This removes some of the list-creation overhead, and thus can reduce our total runtime. We have to be careful when using these optimizations, however. Consider the following:
```python
def list_comp_func_v1(A, B):
    return [sin(x) + cos(y) for x, y in zip(A, B)]
```

```python
%timeit list_comp_func_v1(array_1, array_2)
```
Oh no, this is slower than the last iteration! I lied to you! What's going on in this simple-looking line of code? Let's take a look at a breakdown of two of our functions using %prun:
```python
%prun pure_python_func_v3(array_1, array_2)
```

```
        3000005 function calls in 0.777 seconds

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.461    0.461    0.771    0.771 <ipython-input-8-775bf4226fba>:1(pure_python_func_v3)
  1000000    0.118    0.000    0.118    0.000 {math.cos}
  1000000    0.112    0.000    0.112    0.000 {math.sin}
  1000000    0.071    0.000    0.071    0.000 {method 'append' of 'list' objects}
        1    0.009    0.009    0.009    0.009 {range}
        1    0.007    0.007    0.777    0.777 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
After running %prun, we can see that there are three calls made in the loop body (math.cos, math.sin, and append), along with a single range call, among others.
```python
%prun list_comp_func_v1(array_1, array_2)
```

```
        2000004 function calls in 0.640 seconds

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.323    0.323    0.634    0.634 <ipython-input-10-6fcd029b7b31>:1(list_comp_func_v1)
  1000000    0.105    0.000    0.105    0.000 {math.cos}
        1    0.105    0.105    0.105    0.105 {zip}
  1000000    0.101    0.000    0.101    0.000 {math.sin}
        1    0.006    0.006    0.640    0.640 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
For our list comprehension, we removed both the range call and the 1,000,000 calls to append. But we added a call to zip, which builds an entirely new list of 1,000,000 tuples before the internal loop even starts. We don't need to create that entire object up front; instead, we should generate pairs only as we need them:
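To see what "generating pairs as we need them" means, here is a tiny hand-written generator that behaves like izip on lists. This is an illustrative sketch of the idea, not the library's actual implementation:

```python
def lazy_pairs(A, B):
    # Yield one (a, b) pair at a time instead of building a full list of tuples.
    n = min(len(A), len(B))
    for i in range(n):
        yield (A[i], B[i])

pairs = lazy_pairs([1, 2, 3], [4, 5, 6])
print(next(pairs))  # (1, 4)
```

Each pair exists only for the duration of one loop iteration, so memory use stays constant no matter how long the inputs are.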
```python
from itertools import izip

def list_comp_func_v2(A, B):
    return [sin(x) + cos(y) for x, y in izip(A, B)]
```

```python
%timeit list_comp_func_v2(array_1, array_2)
```
Excellent! This is substantially faster than our for-loop methods. Let's see what %prun says:
```python
%prun list_comp_func_v2(array_1, array_2)
```

```
        2000003 function calls in 0.520 seconds

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.307    0.307    0.512    0.512 <ipython-input-14-86d19d4a71c8>:2(list_comp_func_v2)
  1000000    0.104    0.000    0.104    0.000 {math.cos}
  1000000    0.101    0.000    0.101    0.000 {math.sin}
        1    0.008    0.008    0.520    0.520 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
We no longer have a call to zip. Instead, izip returns an iterator: it produces pairs only as we need them. This greatly reduces runtime when we are iterating over a large number of values. In fact, we could use a similar trick when generating array_1 and array_2 back at the top of this page: instead of calling range, we could call xrange, its lazy equivalent. Try it out! (Note that in Python 3, zip and range are already lazy, and itertools.izip no longer exists.)
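For example, the setup lists from the top of the page can be built as list comprehensions over a lazy range, avoiding the million explicit append calls. A sketch of the idea (written with range so it runs on Python 3; on Python 2, substitute xrange):

```python
N = 1000000
# Comprehensions over a lazy range: values are generated one at a time,
# and no intermediate index list is materialized first.
array_1 = [i for i in range(N)]       # on Python 2: xrange(N)
array_2 = [i * i for i in range(N)]
print(array_2[100:110])
```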
As one last foray into pure Python, let's take a look at map and lambda:
```python
def map_func_v1(A, B):
    return map(lambda x, y: sin(x) + cos(y), A, B)
```

```python
%timeit map_func_v1(array_1, array_2)
```
```python
%prun list_comp_func_v1(array_1, array_2)
```

```
        2000004 function calls in 0.660 seconds

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.342    0.342    0.654    0.654 <ipython-input-10-6fcd029b7b31>:1(list_comp_func_v1)
  1000000    0.108    0.000    0.108    0.000 {math.cos}
  1000000    0.105    0.000    0.105    0.000 {math.sin}
        1    0.099    0.099    0.099    0.099 {zip}
        1    0.006    0.006    0.660    0.660 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
lambda is the keyword for defining small anonymous functions. In our case, we make a function that takes two variables (x, y) and returns sin(x)+cos(y). The map function applies this function to pairwise elements of the arrays A and B. If our operation is too complex to convert easily into list comprehension syntax, map and lambda can still improve performance over a simple for-loop.
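A small, self-contained illustration of the pattern (note: on Python 3, map returns a lazy iterator, so we wrap it in list; on Python 2 it returns a list directly):

```python
from math import sin, cos

A = [0.0, 1.0, 2.0]
B = [0.0, 1.0, 2.0]
# The two-argument lambda is applied to A[i], B[i] pairwise.
result = list(map(lambda x, y: sin(x) + cos(y), A, B))
print(result[0])  # sin(0) + cos(0) == 1.0
```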
NumPy is optimized for mathematical operations on arrays of values. NumPy's core is written in C, which means NumPy arrays are actually C arrays: they are fixed in size and type, unlike almost all objects in pure Python. NumPy's core math functions are optimized to work extremely quickly over data arrays, and their inputs can be arrays of any number of dimensions. This is called vectorization, and these core math functions are ufuncs (universal functions) (http://docs.scipy.org/doc/numpy/reference/ufuncs.html).
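The "any number of dimensions" point is easy to check: the same ufunc expression works unchanged on a 2D array. A tiny sketch:

```python
import numpy as np

# Ufuncs apply elementwise regardless of array shape.
M = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])
out = np.sin(M) + np.cos(M)
print(out.shape)  # (2, 3)
```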
```python
import numpy as np

def numpy_func_v1(A, B):
    return np.sin(A) + np.cos(B)
```

```python
%timeit numpy_func_v1(array_1, array_2)
```
```python
array_1_np = np.asarray(array_1)  # convert the lists to numpy arrays
array_2_np = np.asarray(array_2)
```

```python
%timeit numpy_func_v1(array_1_np, array_2_np)
```
This is a significant improvement: roughly a factor of six over the best pure Python we could manage. Keep in mind that NumPy is optimized for vectorization, not for for-loops! Look at what happens when we loop over NumPy arrays:
```python
def numpy_func_v2(A, B):
    result = np.empty(len(A))
    for i in range(len(A)):
        result[i] = np.sin(A[i]) + np.cos(B[i])
    return result
```

```python
%timeit numpy_func_v2(array_1_np, array_2_np)
```
That's horrible. NumPy is not meant to be used like pure Python!
There are even more packages available to speed up code. Try them out, but make sure to use tools like %timeit and %prun to find and fix obvious bottlenecks and flaws in your code before you jump straight into powerful optimization tools! Here's an example of another tool, numexpr.
```python
import numexpr as ne

def numexpr_func(A, B):
    return ne.evaluate('sin(A) + cos(B)')
```

```python
%timeit numexpr_func(array_1_np, array_2_np)
```
How do I get this and how does it work? Check it out here: https://github.com/pydata/numexpr