LLVM backend

A lot has happened in the meantime. Some things I’ve been working on:

  • An LLVM code generation backend
  • Optimizing broadcasting (Fortran’s SPREAD) through loop-invariant code motion where possible
  • Finding optimal tiling parameters
  • Optimal strength reduction for index calculation
  • Ways to further eliminate array temporaries
  • Explicit vectorization for the C backend, for SSE2 and AVX (xmmintrin.h and smmintrin.h)
  • A lazy NumPy evaluation demo that uses the LLVM backend (https://github.com/markflorisson88/minivect/blob/master/demo/lazy_numpy.py#L141); a conceptual sketch follows this list
  • Some other stuff like unit tests using XPath, generating Graphviz files, etc
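
To give an idea of what the lazy evaluation demo does, here is a minimal conceptual sketch (the class and method names are hypothetical, not the demo's actual API): arithmetic builds an expression graph instead of computing immediately, and evaluation would normally hand that graph to the LLVM backend; here it simply falls back to NumPy.

import numpy as np

class LazyArray:
    # Hypothetical sketch: operations build an expression graph instead of
    # computing right away; evaluate() walks the graph. The real demo would
    # compile the graph with the LLVM backend, here we just call into NumPy.
    def __init__(self, value=None, op=None, operands=()):
        self.value, self.op, self.operands = value, op, operands

    def __add__(self, other):
        return LazyArray(op=np.add, operands=(self, other))

    def __mul__(self, other):
        return LazyArray(op=np.multiply, operands=(self, other))

    def evaluate(self):
        if self.op is None:
            return self.value
        return self.op(*[operand.evaluate() for operand in self.operands])

a = LazyArray(np.arange(10.0))
b = LazyArray(np.ones(10))
print((a + b * b).evaluate())    # only evaluated here, in one go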

Progress Update

I keep forgetting to blog, so here goes. The vector expression compiler has gained a good number of specializations, such as

– strided (with a preference for C or Fortran order)
– inner dimension contiguous (C/Fortran)
– contiguous (C/Fortran)
– tiled (in the first two or last two dimensions)

Cython also takes care of correctly broadcasting operands, while avoiding recomputation where possible. Broadcasting is resolved at runtime rather than through compile-time specializations, which would lead to combinatorial code explosion (since you won’t know until runtime which dimensions have extent 1).
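
For reference, this is the broadcasting behaviour the generated code has to reproduce at runtime; the snippet below is plain NumPy, not minivect output:

import numpy as np

a = np.arange(12.0).reshape(3, 4)   # shape (3, 4)
b = np.arange(4.0).reshape(1, 4)    # shape (1, 4): extent 1 in the first dimension

# The extent-1 dimension of b is broadcast over a's first dimension.
# Which dimensions have extent 1 is only known at runtime.
c = a + b
assert c.shape == (3, 4)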

It also takes care of overlapping memory and detects read-after-write in constant time (by checking pointers only, so it may conservatively assume a read after write even when there is none). This avoids temporaries in expressions like ‘a[...] = a ** 2’.
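
A rough sketch of such a constant-time check (a hypothetical helper, not minivect's actual code): compare base data pointers, treat an exact alias as safe for elementwise evaluation, and be conservative about everything else.

import numpy as np

def needs_temporary(lhs, rhs):
    # Hypothetical sketch: a constant-time, pointer/bounds based check.
    # An exact alias (same data pointer and strides) is safe to evaluate
    # elementwise, so 'a[...] = a ** 2' gets no temporary; any other
    # potential overlap conservatively gets one.
    lhs_ptr = lhs.__array_interface__['data'][0]
    rhs_ptr = rhs.__array_interface__['data'][0]
    exact_alias = lhs_ptr == rhs_ptr and lhs.strides == rhs.strides
    return not exact_alias and np.may_share_memory(lhs, rhs)

a = np.arange(10.0)
assert not needs_temporary(a, a)         # a[...] = a ** 2: no temporary
assert needs_temporary(a[1:], a[:-1])    # shifted overlap: temporary needed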

The expression compiler currently supports only native C data types, so a partial mapping from the Cython AST to the minivect AST is used to support operations on types such as complex numbers or Python objects. This makes code generation somewhat challenging: Cython needs to capture the code generated for its own expressions and inject it at the right place in the code generated by minivect, which then returns the code for the entire function back to Cython.

The C compilers are able to autovectorize the contiguous specializations, but mixing strided loads with contiguous loads is not vectorized by most C compilers. So the next thing on the todo list is to create explicitly vectorized specializations.

There is also auto-tuning code for the tiling blocksize and the OpenMP parallel section size. Tile sizes are often between 128 and 1024 bytes, whereas the minimum data size for OpenMP to be beneficial is reasonably large (32786 with GCC 4.7 on my machine). This threshold is compared against the data size of the operands and the number of computations in the expression in the OpenMP ‘if’ clause.
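
The auto-tuning runs on the generated C code, but the idea can be sketched in a few lines of Python (the function below is illustrative only): time a tiled kernel for a handful of candidate block sizes and keep the fastest; the same timing approach yields the minimum data size used in the OpenMP ‘if’ clause.

import time
import numpy as np

def tune_tile_size(candidates=(128, 256, 512, 1024)):
    # Illustrative auto-tuner: time a tiled elementwise kernel for several
    # candidate block sizes (in bytes) and return the fastest one.
    a = np.random.rand(2048, 2048)
    b = np.empty_like(a)
    best, best_time = None, float('inf')
    for blocksize in candidates:
        tile = max(1, blocksize // a.itemsize)   # block size in elements
        start = time.perf_counter()
        for i in range(0, a.shape[0], tile):
            for j in range(0, a.shape[1], tile):
                b[i:i + tile, j:j + tile] = a[i:i + tile, j:j + tile] ** 2
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = blocksize, elapsed
    return best

print(tune_tile_size())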

So anyway, below are some preliminary benchmarks. Three expressions are tested; the C code is compiled with icc (gcc gives similar results) and the Fortran code with gfortran (4.2, so not really fair) and ifort (v12). The expressions are increasingly pathological. The Cython expressions run single-threaded (multi-threaded, expression 1 runs in less than 0.9s). The specializations for these expressions are not auto-vectorized yet, so they could still be improved.

cython expr1: 1.89504504204
cython expr2: 4.28015899658
cython expr3: 5.62392616272
numexpr expr1: 5.28812909126
numexpr threaded expr1: 1.47357201576
normal loop expr1: 4.85790300369
numpy expr1: 5.54052686691

fortran ifort expr1: 2.55721211433411
fortran ifort expr2: 5.47224402427673
fortran ifort expr3: Segmentation fault

fortran gfortran expr1: 5.84615400014445
fortran gfortran expr2: 14.4724319996312
fortran gfortran expr3: 21.1943530002609

As you can see, ifort is pretty good, but we’re still a good deal faster, and there is still room for improvement. After generating explicitly vectorized C code, we intend to generate vectorized LLVM code at runtime when the data size is large enough to make this a beneficial endeavor.
At runtime the code can detect exactly which operands are inner contiguous, generate a vectorized load/store for those and a strided load/store for the strided operands, and then perform vectorized computations on the loaded vectors. It may even be able to generate optimal code depending on the operands’ alignments and their offsets from alignment boundaries.
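
The contiguity check itself is cheap; in NumPy terms it boils down to looking at the innermost stride (a sketch of the idea, not the generated code):

import numpy as np

def inner_contiguous(operand):
    # An operand is contiguous in its innermost dimension when that
    # dimension's stride equals the item size; such operands can get a
    # vectorized load/store, the others a strided one.
    return operand.strides[-1] == operand.itemsize

a = np.arange(16.0).reshape(4, 4)
assert inner_contiguous(a)              # contiguous inner dimension
assert not inner_contiguous(a[:, ::2])  # strided inner dimension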

No mapping is provided yet for elemental function calls, and reductions are not supported either. For now the focus is on performance, and these issues will be addressed later. Another thing to implement is an nditer-style approach, where operands are sorted into an agreeable elementwise C-style order, which means all static Fortran specializations can be dropped, reducing the amount of generated code.
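
NumPy’s own nditer shows the idea: with order='K' the operands are visited in an agreeable memory order, so a Fortran-ordered operand no longer needs its own statically generated specialization (plain NumPy below, just to illustrate the iteration order):

import numpy as np

a = np.asfortranarray(np.arange(6.0).reshape(2, 3))
out = np.empty_like(a)

# order='K' iterates the operands in their natural memory order.
with np.nditer([a, out], op_flags=[['readonly'], ['writeonly']],
               order='K') as it:
    for x, y in it:
        y[...] = x * 2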

More benchmarks will follow next week or the week after.


break

I’m going to take a break for a week or so, to give my shoulders some rest, since they’ve been troubling me.


Vector expression compiler

I started a new project for compiling vector expressions with any code generation backend. The project allows easy mapping of foreign ASTs and types onto its own AST and type system, and then specializes the AST for several data layouts and memory alignments according to the client’s needs. It takes care to use only elementary AST nodes, and the specializers lower complex operations to trees of simple, elementary operations, which makes it easy to create new code generation backends. The project can be found here: https://github.com/markflorisson88/minivect
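
Very roughly, the mapping step looks like this (all names below are hypothetical, not minivect’s actual classes): each foreign node and its type are translated into a simple, elementary node that the specializers understand.

class BinopNode:
    # Hypothetical elementary node: a single binary operation with a type.
    def __init__(self, operator, lhs, rhs, result_type):
        self.operator, self.lhs, self.rhs = operator, lhs, rhs
        self.result_type = result_type

def map_binop(foreign_node, mapper):
    # The mapper recursively translates foreign (e.g. Cython) operands and
    # types into the elementary AST and type system.
    lhs = mapper.map(foreign_node.operand1)
    rhs = mapper.map(foreign_node.operand2)
    result_type = mapper.map_type(foreign_node.type)
    return BinopNode(foreign_node.operator, lhs, rhs, result_type)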

Some simple functionality already works in Cython using this project: http://tinyurl.com/c7l7tqg

To see how the mapping process works, see the two transforms here: http://tinyurl.com/bljj7av


Refactoring

The refactoring has begun; buffer and memoryview indexing are now separated (http://tinyurl.com/cb86mn4), but at some point we need to rewrite all type analysis as a (set of) transforms. I’m going to split the type analysis up some more and refactor some other things as well.

Soon social obligations, the arch enemy of every worthy programmer, will call me away. Let’s see if I get some more refactoring time tonight :)


gsoc 2012

So the GSoC has started. I’m starting off with some (large) refactoring of several parts of Cython that have grown into an unmanageable state of needless complexity.


Memoryviews

I realized that I haven’t blogged in quite a while. So here is an update.

I’ve done quite a bit of work on memoryviews. Firstly, I finished Cython utility codes (already merged) and added a utility code loader for normal and Cython utility codes. This means you can now write them in a file and load them automatically, and you can also use Tempita to turn them into templates.

Then I fixed memoryviews; there were still quite a lot of problems with acquisition counting and other small bugs that would make a program segfault. I then added indexing, slicing, transposing and casting of pointers to cython.array.
You can index and slice a memoryview in GIL or nogil mode, and you can coerce from slices to a Python memoryview object (which can also be indexed, sliced, transposed, etc.). Transposing a memoryview slice currently needs the GIL, although it wouldn’t be hard to drop that requirement. This all works in the same way as NumPy (a.T gives a transposed view, you can index with Ellipsis, etc.).
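
A short Cython sketch of what now works (compile with Cython; the function itself is just an example):

cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def demo(double[:, :] m):
    cdef double[:, :] t = m.T     # transposed view, like NumPy's a.T (needs the GIL)
    cdef double[:] row = m[0, :]  # slicing drops a dimension, like NumPy
    cdef double s
    with nogil:
        s = m[1, 2]               # indexing works without the GIL
    return s, row[0], t.shape[0], t.shape[1]

print(demo(np.arange(12.0).reshape(3, 4)))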

If you have a pointer, you can cast it by providing some shape information, like


cdef cython.array my_array = <int[:10, :10]> my_data_pointer

which will give you a cython.array with shape (10, 10) (so the data block must be at least 10 * 10 * sizeof(int) bytes in size). You can then also register a callback to free the data, as sketched below.
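
A sketch of the whole pattern, following the linked documentation (assuming the data block was malloc’ed):

from libc.stdlib cimport malloc, free
cimport cython

def make_array():
    cdef int *my_data_pointer = <int *> malloc(10 * 10 * sizeof(int))
    cdef cython.array my_array = <int[:10, :10]> my_data_pointer
    # Free the data block automatically when the array is deallocated.
    my_array.callback_free_data = free
    return my_array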

Documentation is here: https://github.com/markflorisson88/cython/commit/72edb6ab995cf8a28b4f9a8946b5104fb121515d

In the future these slices and memoryview objects may be made more NumPy like by giving them a ‘.flat’ attribute, and a ‘.reshape’ method. We might also give them a ‘to_numpy()’ method to turn them into a numpy.ndarray.

You can now also convert a Python dict to a C or Cython struct automatically (which means you can also have structs as arguments in ‘def’ functions, etc.).
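
A minimal Cython sketch of the dict-to-struct conversion:

cdef struct Point:
    double x
    double y

def length_squared(Point p):
    # The Python dict passed in is converted to a Point struct automatically.
    return p.x * p.x + p.y * p.y

print(length_squared({'x': 3.0, 'y': 4.0}))   # 25.0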
