I keep forgetting to blog, so here goes. The vector expression compiler has gained a good number of specializations, such as
– strided (with a preference to C or Fortran order)
– inner dimension contiguous (C/Fortran)
– contiguous (C/Fotran)
– tiled (in the first two or last two dimensions)
Cython also takes care of correctly broadcasting operands, while avoiding recomputation when possible at compile time to avoid combinatorial code explosion (since you won’t know until runtime which dimensions have extent 1).
It also takes care of overlapping memory and detects read after writes in constant time (by checking pointers only, so it may assume a read after write to be safe, even if that’s not the case). This avoid temporaries in expressions like ‘a[…] = a ** 2’.
The expression compiler currently supports only native C datatypes, so a partial mapping from the Cython AST to the minivect AST is used to support operations on objects with types like complex numbers or Python objects. This gives code generation some challenge, since it needs to capture the generated code for Cython expressions and inject it in the right place in the code generated by minivect, which then returns the code for the entire function back to Cython.
The C compilers are able to autovectorize the contiguous specializations, but mixing strided loads with contiguous loads is not vectorized by most C compilers. So the next thing on the todo list is to create explicitly vectorized specializations.
There is also auto-tuning code for the tiling blocksize and the OpenMP parallel section size. Tile sizes are often between 128 and 1024 bytes, whereas the minimum data size for OpenMP to be beneficial is reasonably large (32786 with GCC 4.7 on my machine). It uses this size to compare with the data size of the operands and the number of computations in the expression in the OpenMP ‘if’ clause.
So anyway, below are some preliminary benchmarks. There are three expressions tested, the C code is compiled with icc (compiled with gcc gives similar results) and the fortran code with gfortran (4.2, so not really fair) and ifort (v12). The expressions are increasingly pathological. The Cython expressions run single threaded (multi-threaded, expression 1 runs in less than 0.9s). The specialization for these expressions are not auto-vectorized yet, so they could be improved still.
cython expr1 1.89504504204
cython expr2 4.28015899658
cython expr3 5.62392616272
numexpr expr1 5.28812909126
numexpr threaded expr1 1.47357201576
normal loop expr1 4.85790300369
numpy expr1 5.54052686691
fortran ifort expr1 2.55721211433411
fortran ifort expr2 5.47224402427673
fortran ifort expr3: Segmentation fault
fortran gfortran expr1: 5.84615400014445
fortran gfortran expr2: 14.4724319996312
fortran gfortran expr3: 21.1943530002609
As you can see, ifort is pretty good, but we’re still a good deal faster, and there is still room for improvement. After generating explicitly vectorized C code, we intend to generate vectorized LLVM code at runtime if the runtime data size is sufficiently large to make this a beneficial endeavor.
At runtime the code can detect exactly which operands are inner contiguous and generate a vectorized load/store for those and a strided load/store for the strided operands. It can then perform vectorized computations on the loaded vectors. It may even be able to generate optimal code depending on the respective alignments and their respective offsets from alignment boundaries.
No mapping is provided yet for elemental function calls and reductions are not supported yet. For now the focus is on performance, and these issues will be addressed later. Another thing to implement is also an nditer approach, where operands are sorted into an agreeable elementwise C-style order, which means we can lose all static Fortran specializations, reducing the amount of generated code.
More benchmarks will follow next week or the week after.