@rpy.vector( type='float32', length=4 )
class Vector(object):
    def __init__(self, x=.0, y=.0, z=.0):
        self.x = x
        self.y = y
        self.z = z

    def __getitem__(self, index):
        r = .0
        if index == 0:
            r = self.x
        elif index == 1:
            r = self.y
        elif index == 2:
            r = self.z
        return r

    def __setitem__(self, index, value):
        if index == 0:
            self.x = value
        if index == 1:
            self.y = value
        if index == 2:
            self.z = value

    def __add__( self, other ):
        x = self.x + other.x
        y = self.y + other.y
        z = self.z + other.z
        return Vector( x,y,z )
The new decorator is "rpy.vector( type, length )". For best SSE performance, use type float32 with length 4, even if you only use 3 of the components.
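To see why length 4 is the natural fit, note that an SSE register is 128 bits wide, which is exactly four float32 lanes. The sketch below is plain Python and has nothing to do with how rpy.vector is implemented; the helper name pack_vec3_as_vec4 is made up for illustration. It only shows that a 3-component vector padded with one unused lane fills a register exactly.

```python
import struct

SSE_REGISTER_BITS = 128  # width of one SSE (XMM) register


def pack_vec3_as_vec4(x, y, z):
    # Hypothetical helper: store x, y, z as float32 and pad the
    # unused fourth lane with 0.0. "<4f" means four 4-byte floats.
    return struct.pack("<4f", x, y, z, 0.0)


packed = pack_vec3_as_vec4(1.0, 2.0, 3.0)
bits = len(packed) * 8                  # total size in bits
lanes = struct.unpack("<4f", packed)    # the four float32 lanes
```

Here bits equals SSE_REGISTER_BITS, so the whole vector can travel through one register and one vector add handles all components at once.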
Test Function:
def test(x1, y1, z1, x2, y2, z2):
    a = Vector(x1, y1, z1)
    b = Vector(x2, y2, z2)
    i = 0
    c = 0.0
    while i < 16000000:
        v = a + b
        c += v[0] + v[1] + v[2]
        i += 1
    return c
Test Results:
- Python2 = 51 seconds
- Rpython-to-LLVM = 0.019 seconds
LLVM ASM
define float @test(float %x1_0, float %y1_0, float %z1_0, float %x2_0, float %y2_0, float %z2_0) {
entry:
  %0 = insertelement <4 x float>, float %x1_0, i32 0
  %1 = insertelement <4 x float> %0, float %y1_0, i32 1
  %2 = insertelement <4 x float> %1, float %z1_0, i32 2
  %3 = insertelement <4 x float> , float %x2_0, i32 0
  %4 = insertelement <4 x float> %3, float %y2_0, i32 1
  %5 = insertelement <4 x float> %4, float %z2_0, i32 2
  %vecadd = fadd <4 x float> %2, %5
  %element = extractelement <4 x float> %vecadd, i32 0
  %element3 = extractelement <4 x float> %vecadd, i32 1
  %v5 = fadd float %element, %element3
  %element4 = extractelement <4 x float> %vecadd, i32 2
  %v7 = fadd float %v5, %element4
  br label %while_loop

while_loop:
  %st_c_0.0 = phi float [ 0.000000e+00, %entry ], [ %v8, %while_loop.while_loop_crit_edge ]
  %st_i_0.0 = phi i32 [ 0, %entry ], [ %v9, %while_loop.while_loop_crit_edge ]
  %v8 = fadd float %st_c_0.0, %v7
  %v9 = add i32 %st_i_0.0, 1
  %v10 = icmp ult i32 %v9, 16000000
  br i1 %v10, label %while_loop.while_loop_crit_edge, label %else

while_loop.while_loop_crit_edge:
  br label %while_loop

else:
  %v8.lcssa = phi float [ %v8, %while_loop ]
  ret float %v8.lcssa
}
Part2: Escaping the GIL
llvm-py contains an example, "call-jit-ctypes.py", that shows you how to bypass the LLVM Execution Engine and instead call your compiled function via ctypes. The advantage of using ctypes over the Execution Engine is that ctypes releases the GIL for the duration of the call, which allows your Python threads to run in parallel. The next test simply calls the same function four times from four threads at the same time.
Test 4 Threads:
- LLVM Execution Engine = 0.086 seconds
- Ctypes = 0.025 seconds
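The GIL effect can be reproduced without llvm-py. The sketch below is not the "call-jit-ctypes.py" example; it assumes a POSIX system and uses the address of libc's usleep as a stand-in for the address of a JIT-compiled function. The names native_sleep and run_in_threads are made up for illustration. The point is only that a function wrapped with ctypes.CFUNCTYPE from a raw address releases the GIL while it runs.

```python
import ctypes
import ctypes.util
import threading
import time

# Stand-in for JIT output: take the raw address of libc's usleep.
# With a JIT, this address would come from the compiler instead.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
address = ctypes.cast(libc.usleep, ctypes.c_void_p).value

# Build a Python callable from the raw address.
# Calls made through CFUNCTYPE release the GIL while the C code runs.
prototype = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_uint)
native_sleep = prototype(address)


def run_in_threads(n_threads, microseconds):
    # Start n_threads threads that each make one native call,
    # and return the wall-clock time until all have finished.
    threads = [threading.Thread(target=native_sleep, args=(microseconds,))
               for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


# Four calls of 0.2 s each: about 0.8 s if serialized,
# about 0.2 s if the calls really overlap.
elapsed = run_in_threads(4, 200000)
```

Sleeping is not real work, so this says nothing about the speedup figures above. It only shows that the four native calls overlap instead of queuing up behind the GIL.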