LLVM backend

A lot has happened in the meantime. Some things I’ve been working on:

  • An LLVM code generation backend
  • Optimizing broadcasting (Fortran’s SPREAD) through loop-invariant code motion where possible
  • Finding optimal tiling parameters
  • Optimal strength reduction for index calculation
  • Ways to further eliminate array temporaries
  • Explicit vectorization for the C backend, for SSE2 and AVX (xmmintrin.h and smmintrin.h)
  • A lazy numpy evaluation demo that uses the LLVM backend (https://github.com/markflorisson88/minivect/blob/master/demo/lazy_numpy.py#L141)
  • Some other stuff like unit tests using XPath, generating Graphviz files, etc
Posted in Uncategorized | Leave a comment

Progress Update

I keep forgetting to blog, so here goes. The vector expression compiler has gained a good number of specializations, such as

– strided (with a preference to C or Fortran order)
– inner dimension contiguous (C/Fortran)
– contiguous (C/Fortran)
– tiled (in the first two or last two dimensions)

Cython also takes care of correctly broadcasting operands, avoiding recomputation where possible. Since you won’t know until runtime which dimensions have extent 1, broadcasting is resolved at runtime rather than by generating a specialization for every combination, which would lead to combinatorial code explosion.
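To see why this must be a runtime decision, here is a small NumPy illustration: the same compiled code path has to handle operands whose broadcast dimensions only become known from the runtime shapes.

```python
import numpy as np

# Which dimensions broadcast is only known at runtime: any dimension
# of extent 1 is stretched to match the corresponding extent of the
# other operand.
a = np.arange(6.0).reshape(2, 3)   # shape (2, 3)
b = np.arange(3.0).reshape(1, 3)   # shape (1, 3): dim 0 has extent 1

result = a + b                     # b's single row is reused for each row of a
assert result.shape == (2, 3)
```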

It also takes care of overlapping memory and detects read-after-write in constant time (by checking base pointers only, so it may conservatively assume a read-after-write even when that’s not actually the case). This avoids temporaries in expressions like ‘a[…] = a ** 2’.
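The pointer-only idea can be sketched in a few lines of Python. Note that `may_read_after_write` is a hypothetical helper illustrating the constant-time check, not minivect’s actual code:

```python
import numpy as np

def may_read_after_write(lhs, rhs):
    # Constant-time, pointer-only check (a sketch of the idea, not
    # minivect's actual code): compare the base data pointers. This
    # is conservative, since it ignores the extents of the slices.
    return lhs.__array_interface__['data'][0] == rhs.__array_interface__['data'][0]

a = np.arange(4.0)
assert may_read_after_write(a, a)             # same buffer
assert not may_read_after_write(a, a.copy())  # distinct buffers

# A pure elementwise expression reading and writing the same exact
# view is safe in place, so no temporary is needed:
a[...] = a ** 2
```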

The expression compiler currently supports only native C datatypes, so a partial mapping from the Cython AST to the minivect AST is used to support operations on operands with types like complex numbers or Python objects. This poses a challenge for code generation: it needs to capture the generated code for Cython expressions and inject it in the right place in the code generated by minivect, which then returns the code for the entire function back to Cython.

The C compilers are able to auto-vectorize the contiguous specializations, but most C compilers will not vectorize loops that mix strided loads with contiguous loads. So the next thing on the todo list is to create explicitly vectorized specializations.

There is also auto-tuning code for the tiling block size and the OpenMP parallel section size. Tile sizes are often between 128 and 1024 bytes, whereas the minimum data size for OpenMP to be beneficial is reasonably large (32786 with GCC 4.7 on my machine). This threshold is compared against the data size of the operands and the number of computations in the expression in the OpenMP ‘if’ clause.
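The auto-tuning idea is simply empirical: time a few candidate block sizes and keep the fastest. A minimal sketch (the kernel, sizes and function name are illustrative, not the actual tuner):

```python
import time
import numpy as np

def pick_tile_size(n=512, candidates=(16, 32, 64, 128)):
    # Hypothetical auto-tuning sketch: time a tiled out-of-place
    # transpose for each candidate block size and keep the fastest.
    a = np.random.rand(n, n)
    out = np.empty_like(a)
    timings = {}
    for b in candidates:
        start = time.perf_counter()
        for i in range(0, n, b):
            for j in range(0, n, b):
                out[j:j+b, i:i+b] = a[i:i+b, j:j+b].T
        timings[b] = time.perf_counter() - start
    return min(timings, key=timings.get)
```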

So anyway, below are some preliminary benchmarks. Three expressions are tested; the C code is compiled with icc (gcc gives similar results) and the Fortran code with gfortran (4.2, so not really fair) and ifort (v12). The expressions are increasingly pathological. The Cython expressions run single-threaded (multi-threaded, expression 1 runs in less than 0.9s). The specializations for these expressions are not auto-vectorized yet, so they could still be improved.

cython expr1: 1.89504504204
cython expr2: 4.28015899658
cython expr3: 5.62392616272
numexpr expr1: 5.28812909126
numexpr threaded expr1: 1.47357201576
normal loop expr1: 4.85790300369
numpy expr1: 5.54052686691

fortran ifort expr1: 2.55721211433411
fortran ifort expr2: 5.47224402427673
fortran ifort expr3: Segmentation fault

fortran gfortran expr1: 5.84615400014445
fortran gfortran expr2: 14.4724319996312
fortran gfortran expr3: 21.1943530002609

As you can see, ifort is pretty good, but we’re still a good deal faster, and there is still room for improvement. After generating explicitly vectorized C code, we intend to generate vectorized LLVM code at runtime if the runtime data size is sufficiently large to make this a beneficial endeavor.
At runtime the code can detect exactly which operands are inner-contiguous, and generate vectorized loads/stores for those and strided loads/stores for the strided operands. It can then perform vectorized computations on the loaded vectors. It may even be able to generate optimal code depending on the operands’ alignments and their offsets from alignment boundaries.
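The contiguity check itself is cheap: it only needs the operand’s strides. A sketch in Python using NumPy’s stride information (`inner_contiguous` is a hypothetical helper, not minivect’s API):

```python
import numpy as np

def inner_contiguous(op):
    # Sketch: an operand qualifies for vectorized loads/stores in the
    # innermost dimension when its innermost stride equals the item size.
    return op.strides[-1] == op.itemsize

a = np.arange(16.0).reshape(4, 4)
assert inner_contiguous(a)              # C-contiguous
assert inner_contiguous(a[::2])         # strided rows, contiguous inner dim
assert not inner_contiguous(a[:, ::2])  # strided inner dimension
```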

No mapping is provided yet for elemental function calls, and reductions are not supported yet. For now the focus is on performance; these issues will be addressed later. Another thing to implement is an nditer-style approach, where operands are sorted into an agreeable elementwise C-style order, which means we can drop all static Fortran specializations, reducing the amount of generated code.
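The nditer-style reordering boils down to permuting axes by stride. A hedged sketch of the idea (the helper name is mine, and the real scheme would apply one agreed-upon permutation to all operands):

```python
import numpy as np

def sort_to_elementwise_c_order(op):
    # Sketch of the nditer idea: permute the axes so that strides
    # decrease from left to right. Applied consistently to all
    # operands, this yields a C-style elementwise traversal order,
    # so no static Fortran specializations are needed.
    order = np.argsort([-abs(s) for s in op.strides], kind='stable')
    return op.transpose(order)

f = np.asfortranarray(np.arange(6.0).reshape(2, 3))
assert not f.flags['C_CONTIGUOUS']
assert sort_to_elementwise_c_order(f).flags['C_CONTIGUOUS']
```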

More benchmarks will follow next week or the week after.


break

I’m going to take a break for a week or so, to give my shoulders some rest, since they’ve been troubling me.


Vector expression compiler

I started a new project to compile vector expressions using any code generation backend. The project makes it easy to map foreign ASTs and types onto its own AST and type system, and then specializes the AST for several data layouts and memory alignments. It takes care to use only elementary AST nodes, and the specializers break complex operations down into trees of simple, elementary operations, which makes it easy to create new code generation backends. The project can be found here: https://github.com/markflorisson88/minivect

Some simple functionality already works in Cython using this project: http://tinyurl.com/c7l7tqg

To see how the mapping process works, see the two transforms here: http://tinyurl.com/bljj7av


Refactoring

The refactoring has begun, buffer and memoryview indexing is now separated (http://tinyurl.com/cb86mn4), but at some point we need to rewrite all type analysis as a (set of) transforms. I’m going to split the type analysis some more, and refactor some other stuff as well.

Soon social obligations, the arch enemy of every worthy programmer, will call me away. Let’s see if I get some more refactoring time tonight 🙂


gsoc 2012

So GSoC has started. I’m starting off with some (large) refactoring of several parts of Cython which have grown into an unmanageable state of needless complexity.


Memoryviews

I realized that I haven’t blogged in quite a while. So here is an update.

I’ve done quite a bit on memoryviews. Firstly, I finished Cython utility codes (this is already merged) and added a utility code loader for normal and Cython utility codes. This means you can now write them in a file and have them loaded automatically. You can also use Tempita to turn a utility code into a template.

Then I fixed memoryviews; there were still quite a lot of problems with acquisition counting and other small bugs that would make a program segfault. Then I added indexing, slicing, transposing and casting of pointers to cython.array.
You can index and slice a memoryview in GIL or nogil mode, and you can coerce from slices to a Python memoryview object (which can also be indexed, sliced, transposed, etc). Transposing a memoryview slice currently needs the GIL, although it wouldn’t be hard to drop that requirement. This all works in the same way as NumPy (a.T gives a transposed view, you can index with Ellipsis, etc).

If you have a pointer, you can cast it by providing some shape information, like


cdef cython.array my_array = <int[:10, :10]> my_data_pointer

which will obtain a cython.array with shape (10, 10) (so the data block must be at least 10 * 10 * sizeof(int) bytes). You can then also register a callback to free the data.

Documentation is here: https://github.com/markflorisson88/cython/commit/72edb6ab995cf8a28b4f9a8946b5104fb121515d

In the future these slices and memoryview objects may be made more NumPy like by giving them a ‘.flat’ attribute, and a ‘.reshape’ method. We might also give them a ‘to_numpy()’ method to turn them into a numpy.ndarray.

You can now also convert a Python dict to a C or Cython struct automatically (which means you can also have them as arguments in ‘def’ functions, etc).


OpenMP exception propagation

While implementing OpenMP exception propagation I overlooked something: any OpenMP thread that is not a Python thread and that executes a with gil block actually allocates and deallocates a new thread state every time the block is entered and exited. This is because the first call to PyGILState_Ensure() allocates the thread state, and the last call to PyGILState_Release() deallocates it. Normally Python has already done this for you, but every thread except the master thread in the outermost parallel section is not a Python thread. So if we have a parallel section containing a with gil block, we need to acquire the GIL and allocate the thread state once for the whole section, and then immediately release the GIL again with PyEval_SaveThread():

{
    declare exc_info
    declare parallel_error_indicator

    #pragma omp parallel
    {
        /* allocate a thread state once for
           the entire parallel section */
        PyGILState_STATE _gilstate = PyGILState_Ensure();

        /* release the GIL immediately */
        Py_BEGIN_ALLOW_THREADS

        #pragma omp for
        for (...) {
            if (!parallel_error_indicator) {

                loop body

                goto no_error;
            error_label:
                {
                    PyGILState_STATE _gilstate = PyGILState_Ensure();
                    save exception info in exc_info
                    PyGILState_Release(_gilstate);
                    set parallel error indicator
                }
            no_error:;
                #pragma omp flush(parallel_error_indicator)
            } // end skip-loop-body if
        } // end omp for
        
        Py_END_ALLOW_THREADS
        PyGILState_Release(_gilstate);

    } // end omp parallel

    // after parallel section
    if (parallel_error_indicator) {
        PyGILState_STATE _gilstate = PyGILState_Ensure();
        restore exception info
        PyGILState_Release(_gilstate);
        goto parent error label
    }
}

Here ‘loop body’ is the user’s code, containing a with gil block that may raise an exception. In case of an exception, it contains a goto to our error_label. We acquire and release the GIL in the surrounding parallel section to avoid doing so on every iteration; hence we can no longer use the concise #pragma omp parallel for.

We also need to save and restore the filename, lineno and C lineno, and make them private variables in each parallel section: between raising the exception and setting the shared variables and releasing the GIL, another OpenMP thread may acquire the GIL, raise another exception and overwrite the line information. We simply remember only the first exception that was raised and propagate that outwards.
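The first-exception-wins scheme can be illustrated in plain Python threads (this is an analogy, not the generated C: the lock stands in for the GIL, and the class name is mine):

```python
import threading

class FirstException:
    # Sketch of the propagation scheme: workers record exception info
    # under a lock (playing the role of the GIL here), and only the
    # first report wins, just as each parallel section keeps private
    # filename/lineno variables and propagates only the first exception.
    def __init__(self):
        self._lock = threading.Lock()
        self.exc = None

    def report(self, exc):
        with self._lock:
            if self.exc is None:   # first exception wins
                self.exc = exc

keeper = FirstException()

def worker(i):
    try:
        raise ValueError(i)        # every worker raises...
    except ValueError as e:
        keeper.report(e)           # ...but only one exception survives

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert isinstance(keeper.exc, ValueError)
```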

Alternatively, we could remember the last exception, initialize our exc_info variables to NULL, do a Py_XDECREF, and leave the line information shared. However, the OpenMP compiler on my OS X machine (gcc 4.2) does not support OpenMP 3.0 (it was released before the 3.0 specification) and seems to be a little buggy: feeding it a trivial 10-line OpenMP program with certain combinations of privatization made gcc itself segfault. With this alternative approach it generates code that often resets the line information (and the filename) to 0 and NULL after the parallel section, segfaulting the program when it tries to build a traceback with a NULL filename (a print shows that the shared variable is assigned multiple times, and since we hold the GIL there is no race). Seeing that a later gcc generates code that can execute such a function 10 million times without any problem, I can only assume that 4.2 is indeed a little buggy and that my generated code was correct. In any case, making the variables private and propagating them gives code that works with all compilers, so that seems the better choice.


Fused Types Syntax

I’m back to continue work on the gsoc! Today I changed the syntax for fused types according to the discussion on the mailing list. You can now write

    cdef fused my_fused_type:
        int
        float
        ...

and


    my_fused_type = cython.fused_type(int, float, ...)

in pure mode. The plan is then to merge the current fused function on top of Vitja’s work on the new CyFunction. Until that is merged, I’ll try to support fused types as part of def functions and not just as part of cdef or cpdef functions.


Fused Types w/ Runtime Dispatch

If you have a fused function that is exposed to Python, it also needs to be specialized from Python (in Cython space all this functionality already works). There are two ways to do this:

  • indexing
  • directly calling

For indexing from Python space, you can use (Cython) types or strings. So if you have a function in Cython like


    def f(cython.integral x, cython.floating y):
        print x, y

then it can be indexed as follows:


    module.f[cython.long, cython.double]
    module.f[int, float]
    module.f["int, long double"]

Please note that this example is for demonstration only; one could simply use the largest types needed here. All these index operations return a new specialized version of the function.

You can also call the function immediately, like


    module.f(10, 2.0)

and the specialized version with the largest types will be called, if it can be inferred from the types of the arguments; otherwise a TypeError is raised.
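Conceptually, the runtime dispatch looks something like this pure-Python sketch (the table, return values and function names are illustrative, not Cython’s actual internals):

```python
# Hypothetical dispatch table mapping argument-type signatures to
# specializations; Cython builds the real table from the fused types.
specializations = {
    (int, float): lambda x, y: ('long, double', x, y),
    (int, int):   lambda x, y: ('long, long', x, y),
}

def dispatch(*args):
    # Pick the specialization matching the runtime argument types,
    # or raise TypeError when no specialization fits.
    key = tuple(type(a) for a in args)
    if key not in specializations:
        raise TypeError('no specialization for %r' % (key,))
    return specializations[key](*args)

assert dispatch(10, 2.0)[0] == 'long, double'
```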

In an attempt to support all this, “binding Cython functions” were also made to be a bit more like actual Python functions. Vitja has added quite a bit more than just a __dict__, and functions should then also be pickle-able.

Because these binding functions can currently only be used for Python (def and cpdef) functions and for methods of normal (non-extension) classes, the fused version (which will be a subclass) will bind differently depending on whether it’s in a normal class or in an extension class. In a normal class, the methods expect self to be in the args tuple or, depending on the signature, as the second argument to the C function. In an extension class, however, self is expected to be passed in as part of the PyCFunctionObject, through the m_self attribute (i.e., PyMethod_New vs PyDescr_NewMethod). Unfortunately, we cannot simply decide at binding time which to use, because the function needs to remain subscriptable after binding. So we have to implement tp_call: for extension methods we bind self as m_self (so for unbound calls we need to take the tail of the args tuple), and for normal methods we need to type-check args[0] (‘self’) for unbound calls.
