Friday, April 27, 2012

Rpython to LLVM - Part2

In the last post we saw that Rpython-to-LLVM can be 200X faster than Python in a tight loop. What happens when the loop gets more complicated? This next test introduces a Vector class, and using a new decorator, the LLVM backend can translate instances of this class into the SSE optimized LLVM vector type.
@rpy.vector( type='float32', length=4 )
class Vector(object):
 def __init__(self, x=.0, y=.0, z=.0):
  self.x = x
  self.y = y
  self.z = z

 def __getitem__(self, index):
  r = .0
  if index == 0: r = self.x
  elif index == 1: r = self.y
  elif index == 2: r = self.z
  return r

 def __setitem__(self, index, value):
  if index == 0: self.x = value
  if index == 1: self.y = value
  if index == 2: self.z = value

 def __add__( self, other ):
  x = self.x + other.x
  y = self.y + other.y
  z = self.z + other.z
  return Vector( x,y,z )

The new decorator is "rpy.vector( type, length )" and for best SSE performance it should be of type float32 with length 4 (even if you only use 3).

Test Function:

def test(x1, y1, z1, x2, y2, z2):
 a = Vector(x1, y1, z1)
 b = Vector(x2, y2, z2)
 i = 0
 c = 0.0
 while i < 16000000:
  v = a + b
  c += v[0] + v[1] + v[2]
  i += 1
 return c

Test Results:

  • Python2 = 51 seconds
  • Rpython-to-LLVM = 0.019 seconds
How could LLVM be 2,680X faster than standard Python? It turns out in this case LLVM is able to optimize the while-loop by moving many operations into the "function entry" and reducing the work the while-loop needs to do (see the optimized LLVM ASM below).

LLVM ASM

define float @test(float %x1_0, float %y1_0, float %z1_0, float %x2_0, float %y2_0, float %z2_0) {
entry:
  %0 = insertelement <4 x float> , float %x1_0, i32 0
  %1 = insertelement <4 x float> %0, float %y1_0, i32 1
  %2 = insertelement <4 x float> %1, float %z1_0, i32 2
  %3 = insertelement <4 x float> , float %x2_0, i32 0
  %4 = insertelement <4 x float> %3, float %y2_0, i32 1
  %5 = insertelement <4 x float> %4, float %z2_0, i32 2
  %vecadd = fadd <4 x float> %2, %5              
  %element = extractelement <4 x float> %vecadd, i32 0 
  %element3 = extractelement <4 x float> %vecadd, i32 1
  %v5 = fadd float %element, %element3           
  %element4 = extractelement <4 x float> %vecadd, i32 2
  %v7 = fadd float %v5, %element4                
  br label %while_loop

while_loop:                                    
  %st_c_0.0 = phi float [ 0.000000e+00, %entry ], [ %v8, %while_loop.while_loop_crit_edge ]
  %st_i_0.0 = phi i32 [ 0, %entry ], [ %v9, %while_loop.while_loop_crit_edge ]
  %v8 = fadd float %st_c_0.0, %v7                
  %v9 = add i32 %st_i_0.0, 1                     
  %v10 = icmp ult i32 %v9, 16000000               
  br i1 %v10, label %while_loop.while_loop_crit_edge, label %else

while_loop.while_loop_crit_edge:                  
  br label %while_loop

else:                                             
  %v8.lcssa = phi float [ %v8, %while_loop ]      
  ret float %v8.lcssa
}

Part2: Escaping the GIL

llvm-py contains an example "call-jit-ctypes.py" that shows you how to bypass the LLVM Execution Engine and instead call your compiled function via ctypes. The advantage of using ctypes over the Execution Engine is that ctypes will release the GIL and allows your Python threads to run in parallel. The next test simply calls the same function four times from four threads at the same time.

Test 4 Threads:

  • LLVM Execution Engine = 0.086 seconds
  • Ctypes = 0.025 seconds
As we can see in this test with 4 threads, ctypes is 3.4X faster on a quad core CPU. Note that another way to escape the GIL is the multiprocessing module, there are pros and cons for both processes and threads. Rpythonic now uses ctypes by default to call the compiled LLVM functions, so its up to you to decide if you want to take advantage of threads or not.

Sunday, April 22, 2012

Rpython to LLVM

Psyco and Unladen Swallow were the first to try to make a just-in-time compiler (JIT) for Python, but these projects have stopped, leaving standard Python with no good JIT solution. So I started investigating how hard would it be to make a JIT for Python using Rpython and LLVM. The results of my first highly experimental implementation of Rpython-to-LLVM show very fast JIT performance: 4x faster than PyPy, 200x faster than Python2, and 260x faster than Python3.

Test Function

def simple_test(a, b):
 c = 0
 while c < 100000*100000:
  c += a + b
 return c
The test function is simply a huge loop that adds-to and returns a 64bit integer. The test was performed on a AMD 2.4ghz Quad with 4GB of RAM, average test result times are:
  • Rpython-to-LLVM = 2 seconds
  • PyPy1.8 (with warm JIT) = 8 seconds
  • Python2.7.2 = 400 seconds
  • Python3.2.2 = 530 seconds

Building The JIT

The first challenge in this project was building the code that traverses the Rpython flow-graph ("flow object space") and converts it into LLVM format. For each Rpython flow-graph block a new LLVM basic-block is created, and for each operation in the block a new LLVM instruction is created. Blocks that loop and modify a variable require some extra work, these mutable variables are treated as stack allocations, and then the LLVM optimization pass PROMOTE_MEMORY_TO_REGISTER replaces the costly stack allocations with fast register memory. It is interesting to see what LLVM IR looks like for the simple function used in this test, before and after the PROMOTE_MEMORY_TO_REGISTER optimization.
Raw LLVM IR
define i64 @simple_test(i64 %a_1, i64 %b_1) {
entry:
  %st_a_1 = alloca i64                            ;  [#uses=2]
  store i64 %a_1, i64* %st_a_1
  %st_b_1 = alloca i64                            ;  [#uses=2]
  store i64 %b_1, i64* %st_b_1
  %st = alloca i64                                ;  [#uses=1]
  store i64 0, i64* %st
  %st_v2 = alloca i64                             ;  [#uses=4]
  store i64 %a_1, i64* %st_v2
  br label %while_loop

while_loop:                                       ; preds = %while_loop, %entry
  %a_0 = load i64* %st_a_1                        ;  [#uses=1]
  %b_0 = load i64* %st_b_1                        ;  [#uses=1]
  %v0 = add i64 %a_0, %b_0                        ;  [#uses=1]
  %v1 = load i64* %st_v2                          ;  [#uses=1]
  %v2 = add i64 %v1, %v0                          ;  [#uses=2]
  store i64 %v2, i64* %st_v2
  %v3 = icmp ult i64 %v2, 10000000000             ;  [#uses=1]
  br i1 %v3, label %while_loop, label %else_return

else_return:                                      ; preds = %while_loop
  %0 = load i64* %st_v2                           ;  [#uses=1]
  ret i64 %0
}
LLVM IR (after PROMOTE_MEMORY_TO_REGISTER)
define i64 @simple_test(i64 %a_1, i64 %b_1) {
entry:
  br label %while_loop

while_loop:                                       ; preds = %while_loop, %entry
  %st_v2.0 = phi i64 [ %a_1, %entry ], [ %v2, %while_loop ] ;  [#uses=1]
  %v0 = add i64 %a_1, %b_1                        ;  [#uses=1]
  %v2 = add i64 %st_v2.0, %v0                     ;  [#uses=3]
  %v3 = icmp ult i64 %v2, 10000000000             ;  [#uses=1]
  br i1 %v3, label %while_loop, label %else_return

else_return:                                      ; preds = %while_loop
  ret i64 %v2
}

LLVM Advantages

LLVM is more than just a JIT, because LLVM IR is platform independent, it becomes the best solution for making Python extension modules that need to support all platforms and all Python versions. A classic Python extension module is written in C, and must be compiled for each Python version, each OS, and each OS type (32bit and 64bits)! (Python2+Python3+PyPy)*(Linux+OSX+Windows)*(32bits+64bits) = 18 targets. How is anybody supposed to compile their Python extension for all 18 targets? LLVM IR can be generated on any platform any bit-depth, saved to a file, and later loaded and run on any target that PyLLVM supports. PyLLVM works with Python2 and Python3; and is easily portable to PyPy using cpyext. In other words, LLVM IR can easily hit all 18 targets - no problem.
Extra Advantages:

  • LLVM easily calls into C libraries
  • LLVM has a SIMD accelerated vector type
  • LLVM has powerful optimizations like: PROMOTE_MEMORY_TO_REGISTER
  • Rpython and LLVM are a natural fit
Still not convinced? Read what Intel has to say about LLVM.

source code

requires Mahadevan's PyLLVM

Monday, April 16, 2012

Neo-Rpython

Restricted Python (Rpython) has all the limitations you would expect from a static language,
and not very hard to adapt to. However, some key language features remain missing, like: managed attributes and operator overloading.
As you will see in this post, meta-programming can easily lift these limitations, and redefine what Rpython is.



The most direct type of meta-programming is to simply write Python code that generates other Python code and eval or exec the new code, I use this technique in this post. Another technique is to parse and modify byte-code directly, using a library like BytePlay.

The most powerful meta-programming technique is to use PyPy's flow-graph and directly modify it. When using the interactive translator (pypy.translator.interactive.Translation) it parses the byte-code of a function into a flow-graph. The flow-graph has a very simple design that was laid out almost decade ago in October 2003 at a PyPy sprint in Berlin.

The flow-graph is composed of: blocks, operations, and links. The primary feature of the flow-graph is that it is easy to modified in-place, and traverse it to find all the types of each instance. This provides a way to find the class of an instance, and introspect it to check for managed attributes and operator overloading methods like: __call__, __getitem__, __setitem__.

Example - Managed Attributes



class A(object):
def set_myattr(self,v): self.myattr = v
def get_myattr(self): return self.myattr
myattr = property( get_myattr, set_myattr )

def func(arg):
a = A()
a.myattr = 'foo'
s = a.myattr
a.myattr = s + 'bar'
return 1

T = pypy.translator.interactive.Translation( func )


Initial FlowGraph


When a Translation instance is created it parses the byte-code into the initial flow-graph, at this stage the flow-graph needs to be static, but it is not yet required to be strict-Rpython.

v0 = simple_call((type A))
v1 = setattr(v0, ('myattr'), ('foo'))
v2 = getattr(v0, ('myattr'))
v3 = add(v2, ('bar'))
v4 = setattr(v0, ('myattr'), v3)


Trying to translate the above flow-graph will fail at the annotation stage, throwing an error that "myattr" degenerated to SomeObject.
The degeneration error is caused by the class using "property( get_myattr, set_myattr )" to manage the "myattr" attribute.
To make this work we must do the following:

  1. traverse the initial flow-graph and check all setattr/getattr operations

  2. modify the flow-graph in place to make it strict-Rpython compatible

  3. delete the "property" from the class.



Modified FlowGraph


The flow-graph below has been modified to make it strict-Rpython.
The "setattr(v, myattr, foo)" operation is replaced by two operations:

  1. get a pointer to the setter method

  2. call the method passing "foo"



v0 = simple_call((type A))
v5 = getattr(v0, ('set_myattr'))
v1 = simple_call(v5, ('foo'))
v6 = getattr(v0, ('get_myattr'))
v2 = simple_call(v6)
v3 = add(v2, ('bar'))
v4 = simple_call(v5, v3)


The flow-graph is now ready to pass the next steps in the translation process: annotation and rtyping.
Using this same technique of operation swapping, it becomes easy to add support for operator overloading.

Example - Operator Overloading



class A(object):
def __getitem__(self, index): return self.array[ index ]
def __setitem__(self, index, value): self.array[ index ] = value
def __init__(self): self.array = [ 100.0 ]

def func(arg):
a = A()
a[0] = a[0] + a[0]
return 1


Initial FlowGraph



v0 = simple_call((type A))
v1 = getitem(v0, (0))
v2 = getitem(v0, (0))
v3 = add(v1, v2)
v4 = setitem(v0, (0), v3)


Modified FlowGraph



v0 = simple_call((type A))
v5 = getattr(v0, ('__getitem__'))
v1 = simple_call(v5, (0))
v2 = simple_call(v5, (0))
v3 = add(v1, v2)
v6 = getattr(v0, ('__setitem__'))
v4 = simple_call(v6, (0), v3)


source code