A Raytracer in Python - Part 4 - Profiling
After having finally obtained a raytracer that supports antialiasing, it is now time to take a look at performance. We already saw some numbers in the last post: rendering a 200x200 image with 16 samples per pixel (a grand total of 640,000 rays) definitely takes too long. I want to profile the Python code, find the hotspots, and eventually devise a strategy to optimize them.
General profiling with cProfile
To perform basic profiling, I used the cProfile module provided in the standard library. It appears that most of the processing time is spent in the hit() function:
$ python -m cProfile -s time test3.py
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2560000   85.329    0.000  108.486    0.000 Sphere.py:12(hit)
  1960020   30.157    0.000   30.157    0.000 {numpy.core.multiarray.array}
  7680000   23.115    0.000   23.115    0.000 {numpy.core._dotblas.dot}
        1   19.589   19.589  195.476  195.476 World.py:25(render)
  2560000    7.968    0.000  116.454    0.000 World.py:62(f)
   640000    6.710    0.000  133.563    0.000 World.py:61(hit_bare_bones_object)
   640025    4.438    0.000  120.902    0.000 {map}
   640000    3.347    0.000  136.910    0.000 Tracer.py:5(trace_ray)
   640000    3.009    0.000    3.009    0.000 {numpy.core.multiarray.zeros}
   640000    2.596    0.000    2.613    0.000 {sorted}
   640000    2.502    0.000    3.347    0.000 {filter}
   640000    1.835    0.000   16.784    0.000 Ray.py:4(__init__)
This does not surprise me, as the main computation a raytracer performs is testing each ray for intersection against the objects in the scene, in this case multiple Sphere objects.
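For completeness, the same statistics can be collected and inspected from within Python using cProfile together with the pstats module; here is a minimal sketch of roughly the equivalent workflow (the dump file name test3.prof is my own choice, not something produced by the command above):

import cProfile
import pstats

# Run the script under the profiler and dump the raw statistics to a file
# (Python 2 style, matching the rest of this series).
cProfile.run("execfile('test3.py')", "test3.prof")

# Load the dump and print the most expensive functions by internal time,
# the same ordering as "python -m cProfile -s time test3.py".
stats = pstats.Stats("test3.prof")
stats.sort_stats("time").print_stats(15)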
Profiling line by line for hot spots
Having established that most of the time is spent in hit(), I wanted to perform line-by-line profiling. This is not possible with the standard Python cProfile module, so I searched and found an alternative, line_profiler:
$ easy_install-2.7 --prefix=$HOME line_profiler
$ kernprof.py -l test3.py
Wrote profile results to test3.py.lprof
$ python -m line_profiler test3.py.lprof
Before running the commands above, I added the @profile decorator to the method I am interested in. This decorator is added by line_profiler to the __builtin__ module, so no explicit import statement is needed.
class Sphere(object):
    <..>
    @profile
    def hit(self, ray):
        <..>
The results of this profiling are:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           @profile
    13                                           def hit(self, ray):
    14   2560000     27956358     10.9     19.2      temp = ray.origin - self.center
    15   2560000     17944912      7.0     12.3      a = numpy.dot(ray.direction, ray.direction)
    16   2560000     24132737      9.4     16.5      b = 2.0 * numpy.dot(temp, ray.direction)
    17   2560000     37113811     14.5     25.4      c = numpy.dot(temp, temp) - self.radius * self.radius
    18   2560000     20808930      8.1     14.3      disc = b * b - 4.0 * a * c
    19
    20   2560000     10963318      4.3      7.5      if (disc < 0.0):
    21   2539908      5403624      2.1      3.7          return None
    22                                               else:
    23     20092        75076      3.7      0.1          e = math.sqrt(disc)
    24     20092       104950      5.2      0.1          denom = 2.0 * a
    25     20092       115956      5.8      0.1          t = (-b - e) / denom
    26     20092        83382      4.2      0.1          if (t > 1.0e-7):
    27     20092       525272     26.1      0.4              normal = (temp + t * ray.direction) / self.radius
    28     20092       333879     16.6      0.2              hit_point = ray.origin + t * ray.direction
    29     20092       299494     14.9      0.2              return ShadeRecord.ShadeRecord(normal=normal,
                                                                 hit_point=hit_point, parameter=t, color=self.color)
Therefore, it appears that most of the time is spent in this chunk of code:
temp = ray.origin - self.center
a = numpy.dot(ray.direction, ray.direction)
b = 2.0 * numpy.dot(temp, ray.direction)
c = numpy.dot(temp, temp) - self.radius * self.radius
disc = b * b - 4.0 * a * c
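There is not much room for optimization in this chunk. One small thing we could do is precompute the squared radius once per sphere instead of recomputing it for every ray; a minimal sketch, assuming a constructor roughly along these lines (the actual __init__ of Sphere is not shown in this post):

class Sphere(object):
    def __init__(self, center, radius, color):
        self.center = center
        self.radius = radius
        # Hypothetical cached value: hit() could then use self.radius_sq
        # instead of computing self.radius * self.radius on every call.
        self.radius_sq = radius * radius
        self.color = color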
In practice, though, this precomputation does not have a measurable impact. Something we can observe, instead, is the huge number of function calls. Is the call overhead relevant? Maybe: Python does have a non-negligible function call overhead, but a very simple program like this
def main():
    def f():
        return 0
    a = 0
    for i in xrange(2560000):
        if f():
            a = a + 1
    print a

main()
is going to take 0.6 seconds: not negligible, but nowhere near the numbers we are seeing. Why is that? And why is the raytracer so much slower for a comparable number of calls? I think the bottleneck is somewhere else.
Finding the problem
I decided to profile World.render() to understand what is going on: this is the routine in charge of looping over the pixels, shooting the rays, and delegating the search for intersections to Tracer.trace_ray, which in turn re-delegates it to World.hit_bare_bones_object. I don't really like this design, but I am sticking to the book as much as possible, mostly because I don't yet know how things will evolve later on.
The profiling showed two hot spots in World.render(), in the inner loop:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41    640000     18786192     29.4     29.2          ray = Ray.Ray(origin = origin, direction = (0.0,0.0,-1.0))
    42
    43    640000     22414265     35.0     34.9          color += numpy.array(tracer.trace_ray(ray))
Why are these two operations so slow? It turns out that numpy is remarkably slow at creating small arrays. This is likely the reason why it is so expensive to instantiate a Ray object (two numpy arrays), to accumulate the color (another array instantiation), and to execute the slow lines in Sphere.hit(). At this point I am not sure I can trust numpy.array for this workload, and I decide to remove it completely, replacing the arrays with plain tuples. The result is pleasing:
$ time python test3.py
real 0m31.215s
user 0m29.923s
sys 0m2.355s
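For concreteness, after this change the Ray class boils down to something like the following (a sketch; the actual class in the repository may carry more than this):

class Ray(object):
    def __init__(self, origin, direction):
        # origin and direction are now plain tuples; no numpy.array calls.
        self.origin = origin
        self.direction = direction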
This is an important point: tuples are much faster than small numpy arrays. numpy appears to be optimized for large datasets and performs poorly when handling small ones. This applies not only to the explicit creation of arrays, but also to any numpy operation that creates arrays behind the scenes, such as calling numpy.dot on two tuples instead of using a trivial implementation such as
def dot(a,b):
    return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
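More generally, every small vector operation in the raytracer can be written directly on tuples; a minimal sketch of the kind of helpers I mean (the names are mine, not necessarily the ones that end up in the repository):

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def scale(a, k):
    return (a[0] * k, a[1] * k, a[2] * k)

# For example, the first hot line in Sphere.hit() would then read:
#   temp = sub(ray.origin, self.center)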
In fact, if I keep using numpy.dot on tuples in Sphere.hit():
a = numpy.dot(ray.direction, ray.direction)
b = 2.0 * numpy.dot(temp, ray.direction)
c = numpy.dot(temp, temp) - self.radius * self.radius
the total running time goes from 31 seconds to a staggering 316 seconds (more than 5 minutes). My guess is that the tuples are first converted into numpy arrays internally, and only then is the actual vector-vector operation performed.
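These overheads are easy to check in isolation with timeit; a small standalone sketch (not part of the raytracer code) comparing the creation of small numpy arrays and numpy.dot on tuples against plain tuple arithmetic:

import timeit

setup = "import numpy; t = (1.0, 2.0, 3.0)"
n = 1000000

# Cost of building a small numpy array vs. rebuilding a tuple.
print timeit.timeit("numpy.array(t)", setup=setup, number=n)
print timeit.timeit("(t[0], t[1], t[2])", setup=setup, number=n)

# numpy.dot on tuples vs. an explicit three-component dot product.
print timeit.timeit("numpy.dot(t, t)", setup=setup, number=n)
print timeit.timeit("t[0]*t[0] + t[1]*t[1] + t[2]*t[2]", setup=setup, number=n)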
I am happy with a runtime of 30 seconds for now, and plan to optimize further when more complex operations come into play. You can find the code for this post on GitHub.