# Micro-optimizing Python code for 3D

I enjoy optimizing code. These days I don't get the opportunity to do it very often, since I prefer working in high level languages now. Which is why I have been particularly enthusiastic in optimizing the 3D math classes in Game Objects.

Whenever the subject of optimization comes up, someone inevitably regurgitates the quote "premature optimization is the root of all evil", which they take to mean that you should not do any optimization until you have a working application and are able to profile it. In general this is good advice - but if you know about the domain you are working in, occasionally early optimization is a good thing. 3D math for games is one of these situations. The code is a black box which doesn't need to be built upon. Speeding it up tends to have a direct impact on the user experience (i.e. a better frame-rate). There is also limited scope for algorithmic improvements (which should be considered first), so micro-optimization is justifiable in this case (IMHO).

When I started looking in to optimizing my Matrix class, I compared it against another matrix class in PyEuclid, which is a quite comprehensive set of classes for 2D and 3D euclidean geometry. When I looked at the Matrix3 implementation, I was surprised that it stored the values of the matrix as 16 attributes (self.a, self.b etc.), because attribute access can be a bottle-neck for serious number crunching code. So I suspected that the matrix multiplication in PyEuclid would be slower than the equivalent function in Game Objects, as I used a single list of 16 values, which should be faster. But when I timed it, I found that PyEuclid was fractionally faster. Naturally I had a look at the Matrix3.__imul__ method in PyEuclid, I noticed that It copied one of the matrices in to 16 local references, and that made it a little faster because the attribute access was done only once. So I tried doing the same thing for the self matrix, which turned out to speed it up my almost 20%, which made it faster than my version. I was happy to submit the optimization back to Alex Halkner, the author of PyEuclid, but naturally I was determined to speed up my version by at least the same amount! I used the updated Matrix__imul__ from PyEuclid (below) as my benchmark.

def __imul__(self, other): #assert isinstance(other, Matrix4) a = self.a b = self.b c = self.c d = self.d e = self.e f = self.f g = self.g h = self.h i = self.i j = self.j k = self.k l = self.l m = self.m n = self.n o = self.o p = self.p oa = other.a ob = other.b oc = other.c od = other.d oe = other.e of = other.f og = other.g oh = other.h oi = other.i oj = other.j ok = other.k ol = other.l om = other.m on = other.n oo = other.o op = other.p self.a = a * oa + b * oe + c * oi + d * om self.b = a * ob + b * of + c * oj + d * on self.c = a * oc + b * og + c * ok + d * oo self.d = a * od + b * oh + c * ol + d * op self.e = e * oa + f * oe + g * oi + h * om self.f = e * ob + f * of + g * oj + h * on self.g = e * oc + f * og + g * ok + h * oo self.h = e * od + f * oh + g * ol + h * op self.i = i * oa + j * oe + k * oi + l * om self.j = i * ob + j * of + k * oj + l * on self.k = i * oc + j * og + k * ok + l * oo self.l = i * od + j * oh + k * ol + l * op self.m = m * oa + n * oe + o * oi + p * om self.n = m * ob + n * of + o * oj + p * on self.o = m * oc + n * og + o * ok + p * oo self.p = m * od + n * oh + o * ol + p * op return self

It may appear that the 32 assignments would slow it down, but it was faster over all because they reduce the amount of attribute access in the function. Knowing that attribute access was the enemy, I tried the same trick with Matrix44.__imul__ in Game Objects. My 16 values of the matrix were stored in a list called self._m, and the individual values were accessed as self._m[0] thru self self.m[15]. Copying these in to local references turned out to be very easy, because I could use Python's list unpacking. Doing this for both matrices yielded the following code.

def __imul__(self, rhs): """Multiplies this Matrix44 by another, called by the *= operator.""" m1_0, m1_1, m1_2, m1_3, m1_4, m1_5, m1_6, m1_7, m1_8, m1_9, m1_10, m1_11, m1_12, m1_13, m1_14, m1_15 = self._m m2_0, m2_1, m2_2, m2_3, m2_4, m2_5, m2_6, m2_7, m2_8, m2_9, m2_10, m2_11, m2_12, m2_13, m2_14, m2_15 = rhs._m self._m = [ m2_0 * m1_0 + m2_1 * m1_4 + m2_2 * m1_8 + m2_3 * m1_12, m2_0 * m1_1 + m2_1 * m1_5 + m2_2 * m1_9 + m2_3 * m1_13, m2_0 * m1_2 + m2_1 * m1_6 + m2_2 * m1_10 + m2_3 * m1_14, m2_0 * m1_3 + m2_1 * m1_7 + m2_2 * m1_11 + m2_3 * m1_15, m2_4 * m1_0 + m2_5 * m1_4 + m2_6 * m1_8 + m2_7 * m1_12, m2_4 * m1_1 + m2_5 * m1_5 + m2_6 * m1_9 + m2_7 * m1_13, m2_4 * m1_2 + m2_5 * m1_6 + m2_6 * m1_10 + m2_7 * m1_14, m2_4 * m1_3 + m2_5 * m1_7 + m2_6 * m1_11 + m2_7 * m1_15, m2_8 * m1_0 + m2_9 * m1_4 + m2_10 * m1_8 + m2_11 * m1_12, m2_8 * m1_1 + m2_9 * m1_5 + m2_10 * m1_9 + m2_11 * m1_13, m2_8 * m1_2 + m2_9 * m1_6 + m2_10 * m1_10 + m2_11 * m1_14, m2_8 * m1_3 + m2_9 * m1_7 + m2_10 * m1_11 + m2_11 * m1_15, m2_12 * m1_0 + m2_13 * m1_4 + m2_14 * m1_8 + m2_15 * m1_12, m2_12 * m1_1 + m2_13 * m1_5 + m2_14 * m1_9 + m2_15 * m1_13, m2_12 * m1_2 + m2_13 * m1_6 + m2_14 * m1_10 + m2_15 * m1_14, m2_12 * m1_3 + m2_13 * m1_7 + m2_14 * m1_11 + m2_15 * m1_15 ] return self

This optimized version removed all the list index operations from the function and reduced attribute access dramatically, making it considerably quicker; it could do 60,502 matrix multiplies per second, compared to PyEuclid's 47,594 per second. I think this must be the quickest way to do a full matrix multiply in straight Python. Other functions benefited from this to, transforming a Vector3 improved by about 20%, as did other operations. I also applied the same technique to the Vector3 class, which was worthwhile, but not quite as good an improvement as the matrix code.

These experiments have lead me to the following rules of thumb for micro-optimizing Python code, with the disclaimer that you should only micro-optimize if you know it will produce tangible benefits in the user experience!

- Prefer lists of values, rather than attributes.

- If you use an attribute more than once, copy it in to a local reference.

I'd like to point out that I'm not claiming that this is a good reason to use Game Objects over PyEuclid. They are intended for different things, Game Objects will eventually be a collection of general algorithms for games and other realtime apps. The speed advantage is not stellar, so you should use which ever interface you prefer - or even both!

I don't have your source code, but given what you're doing, are you using __slots__? That might help, for cheap.

The other thing I would look at in this instance is always storing your 16 values in an array (or tuple if appropriate), and writing properties that correctly retrieve the given index, avoiding the need to construct a new tuple of all 16 values in the assignment. Whether this is a win depends on the rest of your code. (In this case, a custom descriptor would be more convenient to write but may be slower... well, actually, there's a lot to benchmark, but if this is the time sink in your code it may make sense to make single-value access expensive for the benefit of speeding this operation up, I don't know.)

Oh, fudge, I misread and you're doing that. Never mind. (embarrassed)

Hi Will,

Is there a reason you aren't using numpy? Is it just to keep GameObjects from having any extra requirements and to keep it pure python? You'll find that with numpy you gain both speed and expressiveness.

Naveen

Precisely, and to give it a beginner friendly interface. I may eventualy write an alternative version that uses numpy under the hood though.