# Translating the Julia Benchmarks into Cython (part 1)
Hi everyone!
If you've ever heard anything about the Julia programming language, you've probably heard that it's fast. They have some pretty impressive microbenchmarks on their website to show this. In comparison, the Python programming language seems really slow: up to 100x slower on this specific benchmark set, in fact. But can we make Python faster?
In this post (my first one on Hashnode, by the way), I'll go over each benchmark in this set and try to make Python fast by sprinkling some Cython love over it. With Cython, we'll effectively translate Python into C and then compile it into a Python extension module.
But first, let's plan out our steps, starting with some preliminary comparisons.
## Time required to run each benchmark
Below I've gathered the time required to run each benchmark (try the code out from the original repository). All times are in milliseconds and are based on the minimum time out of five runs. I've ordered rows by the time Python took to run them.
| Benchmark | Python (original, ms) | Julia (original, ms) |
| --- | ---: | ---: |
| iteration_pi_sum | 430.432 | 6.096 |
| matrix_statistics | 62.421 | 7.265 |
| print_to_file | 35.892 | 14.598 |
| matrix_multiply | 32.873 | 17.522 |
| recursion_quicksort | 7.845 | 0.331 |
| userfunc_mandelbrot | 4.579 | 0.051 |
| recursion_fibonacci | 2.323 | 0.036 |
| parse_integers | 1.107 | 0.110 |
Versions used:
- Python 3.8.10
- Julia 1.6.0
As you can see, Julia posts some impressive numbers. The slowest Python benchmark is the `iteration_pi_sum` example, so let's take a look at it first.
## The `iteration_pi_sum` example
This microbenchmark calculates an approximation to the following series:
$$ \sum_{k=1}^{\infty} \frac{1}{k^2} $$
Only the first 10000 terms are considered, and the calculation is repeated 500 times. (The full series converges to $\pi^2/6 \approx 1.6449$, a fact known as the Basel problem.)
First, here's the Python version:
```python
def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum
```
And here's the Julia one:
```julia
function pisum()
    sum = 0.0
    for j = 1:500
        sum = 0.0
        for k = 1:10000
            sum += 1.0/(k*k)
        end
    end
    sum
end
```
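To reproduce the timing methodology described above (minimum of five runs, reported in milliseconds), a minimal harness along the following lines can be used. This `timeit`-based driver is my own sketch, not the benchmark suite's actual runner:

```python
import math
import timeit

def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum

# Minimum wall time out of five runs, reported in milliseconds,
# matching the methodology used for the table above.
best_ms = min(timeit.repeat(pisum, number=1, repeat=5)) * 1000.0
print(f"pisum() = {pisum():.6f} ({best_ms:.3f} ms)")
```

As a sanity check, `pisum()` should return roughly $\pi^2/6 \approx 1.644934$, since that is the limit of the series.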
The two versions are nearly direct translations of each other, yet the Python one runs over 70x slower. What's going on?
The thing is that, while Julia infers types and can produce very efficient machine code, Python makes an object out of everything. And that adds a lot of indirection and overhead.
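You can get a feel for this boxing overhead from plain Python: even a bare float is a full heap object carrying a reference count and a type pointer, not an 8-byte machine double. A quick check (the exact size may vary with CPython version and platform):

```python
import sys

x = 1.0
# A C double occupies 8 bytes; a CPython float object additionally
# carries a reference count and a type pointer (typically 24 bytes
# in total on a 64-bit build).
print(sys.getsizeof(x))
```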
In fact, you can look at the exact assembly code generated by Julia (by running `@code_native pisum()` in the REPL):
```asm
    .text
    movabsq   $.rodata.cst8, %rcx
    movl      $1, %eax
    vmovsd    (%rcx), %xmm1         # xmm1 = mem[0],zero
    nopw      %cs:(%rax,%rax)
L32:
    vxorpd    %xmm0, %xmm0, %xmm0
    movl      $1, %ecx
    nopl      (%rax)
L48:
    ; Below we do the `k * k` multiplication
    movq      %rcx, %rdx
    imulq     %rcx, %rdx
    ; Increment the inner loop counter (where's the outer one?)
    incq      %rcx
    ; Some conversion and float division
    vcvtsi2sd %rdx, %xmm3, %xmm2
    vdivsd    %xmm2, %xmm1, %xmm2
    ; Update the sum accumulator
    vaddsd    %xmm2, %xmm0, %xmm0
    ; Is the *inner* loop done yet?
    cmpq      $10001, %rcx          # imm = 0x2711
    jne       L48
    leaq      1(%rax), %rcx
    ; Is the *outer* loop done yet?
    cmpq      $500, %rax            # imm = 0x1F4
    movq      %rcx, %rax
    jne       L32
    retq
```
That's pretty small. (Comments above are mine.)
From Python to C code through Cython
Cython can translate any Python code into C code. It might not produce the best code, since it won't type everything for free, but it's a good start. In a Jupyter notebook, all we have to do is load the `cython` extension using `%load_ext cython` and decorate a cell with `%%cython`:
```python
# %%
%load_ext cython

# %%
%%cython
def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum
```
When benchmarked, the above code takes around 246 ms, over 40% faster than the pure Python version, but it is still over 40x slower than Julia! What did we do wrong?
Let's take a look at the generated C code. We can get an annotated version of it using `%%cython -a` in a Jupyter notebook, but here I'll go through the generated C code directly (hold your breath). All comments below are my own:
```c
// Observe that we're returning a Python object, not a C double.
static PyObject *__pyx_pf_5pisum_pisum(
    CYTHON_UNUSED PyObject *__pyx_self
) {
    // The `sum` is a Python object, not a C double, & starts as `NULL`.
    // In fact, all variables seem to be initialized to `NULL`.
    PyObject *__pyx_v_sum = NULL;
    // Cython realizes `j` is unused
    CYTHON_UNUSED long __pyx_v_j;
    // And `k`, although it's used, is a Python object, not an integer
    PyObject *__pyx_v_k = NULL;
    // `__pyx_r` is the default return variable
    PyObject *__pyx_r = NULL;
    // Boring housekeeping stuff below
    __Pyx_RefNannyDeclarations
    long __pyx_t_1;
    long __pyx_t_2;
    PyObject *__pyx_t_3 = NULL;
    PyObject *__pyx_t_4 = NULL;
    int __pyx_lineno = 0;
    const char *__pyx_filename = NULL;
    int __pyx_clineno = 0;
    __Pyx_RefNannySetupContext("pisum", 0);
    // Now `sum` is initialized to zero.
    // Observe we keep track of reference counting as well.
    __Pyx_INCREF(__pyx_float_0_0);
    __pyx_v_sum = __pyx_float_0_0;
    // A dummy variable is used for the loop counter.
    // It is then assigned to `j`.
    for (__pyx_t_1 = 1; __pyx_t_1 < 0x1F5; __pyx_t_1+=1) {
        __pyx_v_j = __pyx_t_1;
        // More reference counting and zero assignment
        __Pyx_INCREF(__pyx_float_0_0);
        __Pyx_DECREF_SET(__pyx_v_sum, __pyx_float_0_0);
        // Here comes the second loop
        for (__pyx_t_2 = 1; __pyx_t_2 < 0x2711; __pyx_t_2+=1) {
            // Right in the **hottest** part of the code: a lot of work.
            // For instance, converting a C long to a Python integer.
            // This includes error checking and reference counting.
            __pyx_t_3 = __Pyx_PyInt_From_long(__pyx_t_2);
            if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 5, __pyx_L1_error)
            __Pyx_GOTREF(__pyx_t_3);
            __Pyx_XDECREF_SET(__pyx_v_k, __pyx_t_3);
            __pyx_t_3 = 0;
            // The code below is equivalent to `sum += 1.0 / (k * k)`.
            // Let that sink in.
            __pyx_t_3 = PyNumber_Multiply(__pyx_v_k, __pyx_v_k);
            if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 6, __pyx_L1_error)
            __Pyx_GOTREF(__pyx_t_3);
            __pyx_t_4 = __Pyx_PyFloat_DivideCObj(__pyx_float_1_0,
                                                 __pyx_t_3, 1.0, 0, 1);
            if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 6, __pyx_L1_error)
            __Pyx_GOTREF(__pyx_t_4);
            __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
            __pyx_t_3 = PyNumber_InPlaceAdd(__pyx_v_sum, __pyx_t_4);
            if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 6, __pyx_L1_error)
            __Pyx_GOTREF(__pyx_t_3);
            __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
            __Pyx_DECREF_SET(__pyx_v_sum, __pyx_t_3);
            __pyx_t_3 = 0;
        }
    }
    // Some juggling to return the result
    __Pyx_XDECREF(__pyx_r);
    __Pyx_INCREF(__pyx_v_sum);
    __pyx_r = __pyx_v_sum;
    goto __pyx_L0;
    // Finally the function exit code.
    // This is accessed through `goto` statements.
    // There are two ways to exit here, one is to raise an exception:
    __pyx_L1_error:;
    __Pyx_XDECREF(__pyx_t_3);
    __Pyx_XDECREF(__pyx_t_4);
    __Pyx_AddTraceback("pisum.pisum", __pyx_clineno, __pyx_lineno,
                       __pyx_filename);
    __pyx_r = NULL;
    // The other way is to return the Python object:
    __pyx_L0:;
    __Pyx_XDECREF(__pyx_v_sum);
    __Pyx_XDECREF(__pyx_v_k);
    __Pyx_XGIVEREF(__pyx_r);
    __Pyx_RefNannyFinishContext();
    return __pyx_r;
}
```
Well, that's long and boring. And it has too many calls to the Python C API. What can we do with this? Well, one way is to type stuff.
## Adding C types
The Julia microbenchmarks include a C version of the same code, so before going any further, let's look at it. Maybe we can use it to help us type things in Cython.
```c
double pisum() {
    double sum = 0.0;
    for (int j=0; j<500; ++j) {
        sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
    }
    return sum;
}
```
Pretty straightforward, right? How could we tell Cython to do things like this? We can:

- Ensure the function returns a C `double`
- Ensure the accumulator (`sum`) is a C `double` as well
- Ensure both loop counters (`j` and `k`) are C integers
All of these can be done in four lines of code. See the updated Cython code below:
```python
# %%
%%cython
import cython

def pisum() -> cython.double:  # <1>
    sum: cython.double = 0.0  # <2>
    j: cython.int  # <3>
    k: cython.int
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum
```
Now the function runs in under 7 ms! This is within 10% of the Julia version. But how?
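Incidentally, the same typing can be expressed with Cython's classic `cdef` syntax instead of pure-Python annotations. The variant below is a sketch, assuming the same `%%cython` notebook setup, and should compile to essentially the same C code:

```cython
# %%
%%cython
def pisum():
    cdef double sum = 0.0
    cdef int j, k
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum
```

The annotation style has the advantage that the cell remains valid Python, so it still runs (slowly) without Cython.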
Let's take a look at the generated C code. Again, comments are mine:
```c
// We still get a Python object as the return value
static PyObject *__pyx_pf_5pisum_pisum(
    CYTHON_UNUSED PyObject *__pyx_self
) {
    // But now the accumulator is a C double, as expected
    double __pyx_v_sum;
    // The loop counters are C integers.
    // And `j` is marked as unused as before.
    CYTHON_UNUSED int __pyx_v_j;
    int __pyx_v_k;
    // Everything else is pretty much the same here
    PyObject *__pyx_r = NULL;
    __Pyx_RefNannyDeclarations
    int __pyx_t_1;
    int __pyx_t_2;
    int __pyx_t_3;
    PyObject *__pyx_t_4 = NULL;
    int __pyx_lineno = 0;
    const char *__pyx_filename = NULL;
    int __pyx_clineno = 0;
    __Pyx_RefNannySetupContext("pisum", 0);
    // Look ma, no reference counting here!
    __pyx_v_sum = 0.0;
    // Now the loops
    for (__pyx_t_1 = 1; __pyx_t_1 < 0x1F5; __pyx_t_1+=1) {
        __pyx_v_j = __pyx_t_1;
        __pyx_v_sum = 0.0;
        for (__pyx_t_2 = 1; __pyx_t_2 < 0x2711; __pyx_t_2+=1) {
            __pyx_v_k = __pyx_t_2;
            // Cython prepared a temporary for us here
            __pyx_t_3 = (__pyx_v_k * __pyx_v_k);
            // It then does the division and checks for division by zero.
            if (unlikely(__pyx_t_3 == 0)) {
                PyErr_SetString(PyExc_ZeroDivisionError, "float division");
                __PYX_ERR(0, 10, __pyx_L1_error)
            }
            __pyx_v_sum = (__pyx_v_sum + (1.0 / __pyx_t_3));
        }
    }
    // First reference count decrement happens for the return value 🤯
    __Pyx_XDECREF(__pyx_r);
    __pyx_t_4 = PyFloat_FromDouble(__pyx_v_sum);
    if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 11, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_4);
    __pyx_r = __pyx_t_4;
    __pyx_t_4 = 0;
    goto __pyx_L0;
    // Exceptions are still checked at exit as before
    __pyx_L1_error:;
    __Pyx_XDECREF(__pyx_t_4);
    __Pyx_AddTraceback("pisum.pisum", __pyx_clineno, __pyx_lineno,
                       __pyx_filename);
    __pyx_r = NULL;
    __pyx_L0:;
    __Pyx_XGIVEREF(__pyx_r);
    __Pyx_RefNannyFinishContext();
    return __pyx_r;
}
```
That's considerably smaller code. The only remaining overhead in the inner loop is the division-by-zero check, but as the numbers show, it's pretty much negligible. And although there's still quite a bit of code at the beginning and end of the function, it sits outside the hot path.
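If we wanted to shave off even that check, Cython's `cdivision` compiler directive switches division to C semantics, which should remove the `ZeroDivisionError` test from the generated code. A sketch, with the caveat that under `cdivision` a division by zero would no longer raise an exception:

```cython
# %%
%%cython
import cython

@cython.cdivision(True)  # C division semantics: no zero-division check
def pisum() -> cython.double:
    sum: cython.double = 0.0
    j: cython.int
    k: cython.int
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum
```

Here `k` starts at 1, so the check can never fire anyway; this is purely about trimming the generated code.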
So there you have it. It seems that we can make Python run as fast as C or Julia with very little additional Cython code. In fact, all we've done is use Cython and add type annotations to only four lines of pure Python code. This led to a 60x speedup in this particular microbenchmark.
## The end
Thanks for reading this far. I hope you enjoyed this post!
In upcoming posts, I'll go over Cython versions of the other benchmarks in sequence and see if I can improve their running times as well. Of course, we might end up learning a thing or two about Cython in the process.
If you have any questions, comments, or suggestions, please let me know!