Translating the Julia Benchmarks into Cython (part 1)

Hi everyone! 👋

If you've ever heard anything about the Julia programming language, you've probably heard that it's fast. The Julia website features some pretty impressive microbenchmarks to show this, and next to them the Python programming language looks really slow: up to 100x slower on this benchmark set, in fact. But can we make Python faster?

In this post (my first one on Hashnode 🎉, by the way), I'll be going over each benchmark in this set and trying to make Python fast by sprinkling some Cython love over it. With Cython, we'll effectively translate Python into C and then compile it into a Python extension module.

But first, let's plan out our steps, starting with some preliminary comparisons.

Time required to run each benchmark

Below I've gathered the time required to run each benchmark (you can try the code out from the original repository). All times are in milliseconds and are the minimum over five runs. I've ordered the rows by the time Python took, slowest first.

| Benchmark | Python (original, ms) | Julia (original, ms) |
| --- | --- | --- |
| iteration_pi_sum | 430.432 | 6.096 |
| matrix_statistics | 62.421 | 7.265 |
| print_to_file | 35.892 | 14.598 |
| matrix_multiply | 32.873 | 17.522 |
| recursion_quicksort | 7.845 | 0.331 |
| userfunc_mandelbrot | 4.579 | 0.051 |
| recursion_fibonacci | 2.323 | 0.036 |
| parse_integers | 1.107 | 0.110 |

Versions used:

  • Python 3.8.10
  • Julia 1.6.0
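
For reference, here's roughly how minimum-of-five timings like these can be gathered with the standard library. This is only a sketch of the methodology (the bench helper is my own name); the original repository ships its own harness:

import timeit

def bench(func, runs=5):
    # Return the minimum wall-clock time over `runs` runs, in ms
    return min(timeit.repeat(func, repeat=runs, number=1)) * 1e3

# Usage, assuming `pisum` is defined:
# print(f"{bench(pisum):.3f} ms")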

As you can see, Julia has some impressive numbers. Since the slowest Python benchmark seems to be the iteration_pi_sum example, let's take a look at it first.

The iteration_pi_sum example

This microbenchmark calculates an approximation to the following series:

$$ \sum_{k=1}^{\infty} \frac{1}{k^2} $$

Only the first 10000 terms are considered, and the calculation is repeated 500 times.
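
For context, this is the Basel problem: the full series converges to $\pi^2/6$, which is where pisum gets its name.

$$ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} \approx 1.6449 $$

Since the tail beyond term $N$ is roughly $1/N$, stopping after 10000 terms yields about 1.64483.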

First, here's the Python version:

def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0/(k*k)
    return sum

And here's the Julia one:

function pisum()
    sum = 0.0
    for j = 1:500
        sum = 0.0
        for k = 1:10000
            sum += 1.0/(k*k)
        end
    end
    sum
end

The two are pretty much direct translations of each other, yet the Python version runs over 70x slower. What's going on?

The thing is that, while Julia infers types and can produce very efficient machine code, Python makes an object out of everything. And that adds a lot of indirection and overhead.
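
A quick way to see the cost of that in plain CPython (the size below is for a typical 64-bit build):

import sys

# Even a float is a full heap object carrying a type pointer and a
# reference count, not a bare 8-byte machine double:
print(sys.getsizeof(1.0))  # 24

# So every `sum += 1.0 / (k * k)` allocates and reference-counts
# fresh int and float objects instead of staying in CPU registers.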

On the Julia side, you can look at the exact assembly code it generates by running @code_native pisum() in the REPL:

    .text
    movabsq    $.rodata.cst8, %rcx
    movl    $1, %eax
    vmovsd    (%rcx), %xmm1                   # xmm1 = mem[0],zero
    nopw    %cs:(%rax,%rax)
L32:
    vxorpd    %xmm0, %xmm0, %xmm0
    movl    $1, %ecx
    nopl    (%rax)
L48:
; Below we do the `k * k` multiplication
    movq    %rcx, %rdx
    imulq    %rcx, %rdx
; Increment the inner loop counter (where's the outer one? 😆)
    incq    %rcx
; Some conversion and float division
    vcvtsi2sd    %rdx, %xmm3, %xmm2
    vdivsd    %xmm2, %xmm1, %xmm2
; Update the sum accumulator
    vaddsd    %xmm2, %xmm0, %xmm0
; Is the *inner* loop done yet?
    cmpq    $10001, %rcx                    # imm = 0x2711
    jne    L48
    leaq    1(%rax), %rcx
; Is the *outer* loop done yet?
    cmpq    $500, %rax                      # imm = 0x1F4
    movq    %rcx, %rax
    jne    L32
    retq

That's pretty small. (Comments above are mine.)

From Python to C code through Cython

Cython can translate almost any Python code into C. It won't produce the best possible code on its own, since it can't type everything for free, but it's a good start. In a Jupyter notebook, all we have to do is load the cython extension using %load_ext cython and start a cell with the %%cython magic:

# %%
%load_ext cython

# %%
%%cython
def pisum():
    sum = 0.0
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum
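
To put a number on it, we can use IPython's %timeit magic in a fresh cell (just a quick check; the table numbers above come from the benchmark harness):

# %%
%timeit pisum()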

When benchmarked, the above code takes around 246 ms, over 40% faster than the pure Python version, but it is still over 40x slower than Julia! What did we do wrong?

Let's take a look at the generated C code. We can get an annotated version of it using %%cython -a in a Jupyter notebook, but here I'll go through the generated C code directly (hold your breath). All comments below are my own:

// Observe that we're returning a Python object, not a C double.
static PyObject *__pyx_pf_5pisum_pisum(
    CYTHON_UNUSED PyObject *__pyx_self
  ) {
  // The `sum` is a Python object, not a C double, & starts as `NULL`.
  // In fact, all variables seem to be initialized to `NULL`.
  PyObject *__pyx_v_sum = NULL;
  // Cython realizes `j` is unused
  CYTHON_UNUSED long __pyx_v_j;
  // And `k`, although it's used, is a Python object, not an integer
  PyObject *__pyx_v_k = NULL;
  // `__pyx_r` is the default return variable
  PyObject *__pyx_r = NULL;
  // Boring housekeeping stuff below
  __Pyx_RefNannyDeclarations
  long __pyx_t_1;
  long __pyx_t_2;
  PyObject *__pyx_t_3 = NULL;
  PyObject *__pyx_t_4 = NULL;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __Pyx_RefNannySetupContext("pisum", 0);

  // Now `sum` is initialized to zero.
  // Observe we keep track of reference counting as well.
  __Pyx_INCREF(__pyx_float_0_0);
  __pyx_v_sum = __pyx_float_0_0;

  // A dummy variable is used for the loop counter.
  // It is then assigned to `j`.
  for (__pyx_t_1 = 1; __pyx_t_1 < 0x1F5; __pyx_t_1+=1) {
    __pyx_v_j = __pyx_t_1;

    // More reference counting and zero assignment
    __Pyx_INCREF(__pyx_float_0_0);
    __Pyx_DECREF_SET(__pyx_v_sum, __pyx_float_0_0);

    // Here comes the second loop
    for (__pyx_t_2 = 1; __pyx_t_2 < 0x2711; __pyx_t_2+=1) {
      // Right in the **hottest** part of the code: a lot of work.
      // For instance, converting a C long to a Python integer.
      // This includes error checking and reference counting.
      __pyx_t_3 = __Pyx_PyInt_From_long(__pyx_t_2);
      if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 5, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_3);
      __Pyx_XDECREF_SET(__pyx_v_k, __pyx_t_3);
      __pyx_t_3 = 0;

      // The code below is equivalent to `sum += 1.0 / (k * k)`.
      // Let that sink in.
      __pyx_t_3 = PyNumber_Multiply(__pyx_v_k, __pyx_v_k);
      if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 6, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_3);
      __pyx_t_4 = __Pyx_PyFloat_DivideCObj(__pyx_float_1_0,
        __pyx_t_3, 1.0, 0, 1);
      if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 6, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_4);
      __Pyx_DECREF(__pyx_t_3); __pyx_t_3 = 0;
      __pyx_t_3 = PyNumber_InPlaceAdd(__pyx_v_sum, __pyx_t_4);
      if (unlikely(!__pyx_t_3)) __PYX_ERR(0, 6, __pyx_L1_error)
      __Pyx_GOTREF(__pyx_t_3);
      __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
      __Pyx_DECREF_SET(__pyx_v_sum, __pyx_t_3);
      __pyx_t_3 = 0;
    }
  }

  // Some juggling to return the result
  __Pyx_XDECREF(__pyx_r);
  __Pyx_INCREF(__pyx_v_sum);
  __pyx_r = __pyx_v_sum;
  goto __pyx_L0;

  // Finally the function exit code.
  // This is accessed through `goto` statements.
  // There are two ways to exit here, one is to raise an exception:
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_3);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_AddTraceback("pisum.pisum", __pyx_clineno, __pyx_lineno,
    __pyx_filename);
  __pyx_r = NULL;

  // The other way is to return the Python object:
  __pyx_L0:;
  __Pyx_XDECREF(__pyx_v_sum);
  __Pyx_XDECREF(__pyx_v_k);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

Well, that's long and boring, and it makes a lot of calls to the Python C API. What can we do about it? One way is to add some types.

Adding C types

The Julia microbenchmarks include a C version of the same code, so before going any further, let's look at it. Maybe we can use it to help us type things in Cython.

double pisum() {
    double sum = 0.0;
    for (int j=0; j<500; ++j) {
        sum = 0.0;
        for (int k=1; k<=10000; ++k) {
            sum += 1.0/(k*k);
        }
    }
    return sum;
}

Pretty straightforward, right? How could we tell Cython to do things like this? We can:

  1. Ensure the function returns a C double
  2. Ensure the accumulator (sum) is a C double as well
  3. Ensure both loop counters (j and k) are C integers

All of these can be done in four lines of code. See the updated Cython code below:

# %%
%%cython
import cython

def pisum() -> cython.double:  # <1>
    sum: cython.double = 0.0  # <2>
    j: cython.int  # <3>
    k: cython.int
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum

Now the function runs in under 7 ms! 🎉 This is within 10% of the Julia version. But how?

Let's take a look at the generated C code. Again, comments are mine:

// We still get a Python object as the return value
static PyObject *__pyx_pf_5pisum_pisum(
    CYTHON_UNUSED PyObject *__pyx_self
  ) {
  // But now the accumulator is a C double, as expected
  double __pyx_v_sum;
  // The loop counters are C integers.
  // And `j` is marked as unused as before.
  CYTHON_UNUSED int __pyx_v_j;
  int __pyx_v_k;
  // Everything else is pretty much the same here
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  int __pyx_t_1;
  int __pyx_t_2;
  int __pyx_t_3;
  PyObject *__pyx_t_4 = NULL;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __Pyx_RefNannySetupContext("pisum", 0);

  // Look ma, no reference counting here!
  __pyx_v_sum = 0.0;

  // Now the loops
  for (__pyx_t_1 = 1; __pyx_t_1 < 0x1F5; __pyx_t_1+=1) {
    __pyx_v_j = __pyx_t_1;
    __pyx_v_sum = 0.0;
    for (__pyx_t_2 = 1; __pyx_t_2 < 0x2711; __pyx_t_2+=1) {
      __pyx_v_k = __pyx_t_2;

      // Cython prepared a temporary for us here
      __pyx_t_3 = (__pyx_v_k * __pyx_v_k);
      // It then does the division and checks for division by zero.
      if (unlikely(__pyx_t_3 == 0)) {
        PyErr_SetString(PyExc_ZeroDivisionError, "float division");
        __PYX_ERR(0, 10, __pyx_L1_error)
      }
      __pyx_v_sum = (__pyx_v_sum + (1.0 / __pyx_t_3));
    }
  }

  // First reference count decrement happens for the return value 🤯
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_4 = PyFloat_FromDouble(__pyx_v_sum);
  if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 11, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_4);
  __pyx_r = __pyx_t_4;
  __pyx_t_4 = 0;
  goto __pyx_L0;

  // Exceptions are still checked at exit as before
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_AddTraceback("pisum.pisum", __pyx_clineno, __pyx_lineno,
    __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

We've got considerably smaller code here. The only remaining overhead in the inner loop is the division-by-zero check, and as the numbers suggest, it is essentially negligible. So although there's still a fair amount of code at the beginning and end of the function, none of it sits on the hot path.
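
If we wanted to shave off even that last check, Cython's cdivision directive switches to bare C division semantics, which should skip the ZeroDivisionError test entirely. A minimal sketch (I haven't timed this variant; note that dividing by zero would then follow C behavior instead of raising):

# %%
%%cython
import cython

@cython.cdivision(True)  # no ZeroDivisionError check; C semantics
def pisum() -> cython.double:
    sum: cython.double = 0.0
    j: cython.int
    k: cython.int
    for j in range(1, 501):
        sum = 0.0
        for k in range(1, 10001):
            sum += 1.0 / (k * k)
    return sum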

So there you have it. It seems we can make Python run about as fast as C or Julia with very little extra Cython code: all we've done is compile with Cython and add type annotations to four lines of otherwise pure Python. That alone yielded a roughly 60x speedup on this particular microbenchmark.

The end

Thanks for reaching this point. I hope you enjoyed this post!

In future posts, I'll go over the Cython versions of the other benchmarks in sequence and see if I can improve their run times as well. Of course, we might end up learning a thing or two about Cython in the process.

If you have any questions, comments, or suggestions, please let me know!
