Hellope. I did some profiling on OpenXcom and I've been working on speeding up some critical sections.
I used callgrind (a valgrind tool) to do the profiling. It seemed like the easiest approach, even though the actual game runs excruciatingly slowly under valgrind's emulation; expect single-digit FPS. Look at kcachegrind's pretty output, though:
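For anyone who wants to poke at this themselves, the invocation is nothing exotic; something along these lines (the binary name and path will of course depend on your build):

[code]
# Run the game under callgrind; this writes callgrind.out.<pid> to the
# current directory and makes the game painfully slow while it runs.
valgrind --tool=callgrind ./openxcom

# Open the dump in kcachegrind to get graphs like the ones below.
kcachegrind callgrind.out.*
[/code]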
This is a human turn plus an AI turn of a base assault:
https://bumba.net/~hmaon/OXC_callgrind_kcachegrind_one_turn_base_assault.png

The single method using the most CPU in the battlescape is, obviously, the shader. Then, curiously, there's SavedBattleGame::getTile(), and then _zoomSurfaceY().
I bet there's something that could be done to speed up the shader code, but I don't actually understand it yet, so I moved on to the other functions.
getTile() seemed to cry out to be inlined, so that's what I did. I then inlined getTileIndex() along with it so the whole lookup avoids a function call entirely. As you can see, getTile() gets called a lot: over 68 million times in this run. _zoomSurfaceY() gets called once per frame, I think; that makes 68513755 / 4031.0 = 16996.7 getTile() calls per frame on average. It's hard to say whether that's actually a lot; it is about a quarter of the pixels in a 320x200 window, though. It's not quite 5% of the CPU load. Then again, 5% of the CPU time in a single getter function, really?
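For the curious, here's roughly what the inlined version looks like after the change: moved into the header so the compiler can fold the bounds check and the index math straight into the callers. The member names below are approximate rather than copied verbatim from the repository, and I'm assuming _tiles is a flat array of Tile pointers:

[code]
// Sketch of the inlined getter, living in the header. Field names are approximate.
inline Tile *SavedBattleGame::getTile(const Position& pos) const
{
	// Out-of-bounds positions still return no tile, as before.
	if (pos.x < 0 || pos.y < 0 || pos.z < 0
	    || pos.x >= _mapsize_x || pos.y >= _mapsize_y || pos.z >= _mapsize_z)
		return 0;

	// getTileIndex() inlined as well: plain row-major indexing, so the whole
	// lookup boils down to a few compares and a multiply-add chain.
	return _tiles[pos.z * _mapsize_y * _mapsize_x + pos.y * _mapsize_x + pos.x];
}
[/code]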
Anyway, next I looked at _zoomSurfaceY(). It's responsible for stretching the 320x200 native-resolution window to the display resolution (e.g., 640x400, or my preference of 1280x800). It's written as a very general function that scales the image correctly to any arbitrary resolution in any pixel format. That leaves a lot of room for optimization in the special cases of 2x or 4x scale at 8bpp, which seem like the most common use cases. I wrote two rescaling functions that read the source data as 64-bit ints and write it back out as 64-bit ints (and then 32-bit versions of the same). The result seems to have been an FPS increase anywhere from +10% to +100%. At 1280x800 on my particular laptop, the game went from ~70 FPS to ~140 FPS. Curiously, the 32-bit versions of the zoom function are only slower by a couple of FPS. I'm not sure why -- write combining, maybe? Could it be the register spill I'm noticing in the assembly output of the 64-bit version? If anyone has experience with this sort of analysis, please take a look.
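To make the idea concrete, here is a minimal sketch of what a specialised 2x path can look like -- not the actual code from my branch, just the general shape: read the 8bpp source, duplicate each pixel horizontally by packing four source pixels into one 64-bit store, and write each output row twice. It assumes the destination surface is exactly twice the source size, the source width is a multiple of 4, and a little-endian machine (which is part of why I'm worried about point 6 below):

[code]
#include <SDL.h>
#include <stdint.h>

// Minimal 2x nearest-neighbour scaler for 8bpp surfaces (illustrative only).
// Assumes: dst is exactly 2x src, src->w is a multiple of 4, little-endian CPU.
static void zoom2xSurface8(SDL_Surface *src, SDL_Surface *dst)
{
	for (int y = 0; y < src->h; ++y)
	{
		const Uint8 *s = (const Uint8 *)src->pixels + y * src->pitch;
		uint64_t *d0 = (uint64_t *)((Uint8 *)dst->pixels + (2 * y) * dst->pitch);
		uint64_t *d1 = (uint64_t *)((Uint8 *)dst->pixels + (2 * y + 1) * dst->pitch);

		for (int x = 0; x < src->w; x += 4)
		{
			// Four source pixels abcd become eight output bytes aabbccdd,
			// packed into a single 64-bit store (little-endian byte order).
			uint64_t p0 = s[x], p1 = s[x + 1], p2 = s[x + 2], p3 = s[x + 3];
			uint64_t out = p0 | (p0 << 8)
			             | (p1 << 16) | (p1 << 24)
			             | (p2 << 32) | (p2 << 40)
			             | (p3 << 48) | (p3 << 56);
			*d0++ = out;
			*d1++ = out; // the second write duplicates the row vertically
		}
	}
}
[/code]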
Incidentally, there's probably some opportunity to insert other filter functions here, perhaps copied from any of the many console emulators out there.
Finally, here's the profiler's output after my changes:
https://bumba.net/~hmaon/optimized_zoom_function_profile.png

As you can see, getTile() is gone from the results and its most frequent callers in the TileEngine are next in line. Also, _zoomSurfaceY() has fallen below the TileEngine code in CPU use! Going from 4.68% CPU to 3.07% CPU seems like a nice change.
Of course, those figures are hardly scientific. I made little effort to keep the two runs identical, and there's no demo mode I could run the game through to make similar runs repeatable; I have to actually play the game at ~0 FPS on valgrind's virtual CPU.
Oh yeah, _michal asked on IRC for a write-up of my profiling and optimization attempts.
The branch with my optimizations is here:
https://github.com/hmaon/OpenXcom/tree/optimization_attempts

I've submitted a pull request for whenever SupSuper is done working on actual important stuff.
Suggested points for discussion:
1) What is up with the Shader code? How does it work? Anyone? How can it be sped up?
2) What's the deal with my coding style? Why is it such a mess?
3) How about some optimizations that I missed?
4) Can those TileEngine methods be improved somehow?
5) Shouldn't we just use OpenGL to scale and filter the output? (Perhaps?)
6) Does ANYONE have a working PowerPC Mac? I bet my code is broken on big-endian systems right now but I have no computer to test on!
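On that last point: the 64-bit packing trick above is exactly the kind of thing that breaks, because the byte order within each store flips on big-endian hardware. A guard along these lines (untested, since I have no PPC machine; packDoubled() is a hypothetical helper, not something that exists in the branch) is roughly what I'd expect the fix to look like:

[code]
#include <SDL_endian.h>
#include <stdint.h>

// Hypothetical endian-aware packing for the doubled pixels: mirror the shifts
// on big-endian machines so the bytes land in memory in the same order.
static inline uint64_t packDoubled(uint64_t p0, uint64_t p1, uint64_t p2, uint64_t p3)
{
#if SDL_BYTEORDER == SDL_LIL_ENDIAN
	return p0 | (p0 << 8)  | (p1 << 16) | (p1 << 24)
	          | (p2 << 32) | (p2 << 40) | (p3 << 48) | (p3 << 56);
#else
	return p3 | (p3 << 8)  | (p2 << 16) | (p2 << 24)
	          | (p1 << 32) | (p1 << 40) | (p0 << 48) | (p0 << 56);
#endif
}
[/code]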
tl;dr: I made the FPS number go up a little; maybe someone porting to really underpowered hardware (or running debug builds) will care.