Author Topic: Profiling OXC (and optimizing some code a little)  (Read 6601 times)

Offline hmaon

  • Sergeant
  • **
  • Posts: 40
  • C jockey
Profiling OXC (and optimizing some code a little)
« on: February 05, 2013, 07:21:03 am »
Hellope. I did some profiling on OpenXcom and I've been working on trying to speed up some critical sections.

I used callgrind (a valgrind tool) to do the profiling. It seemed the easiest approach, even though the actual game runs excruciatingly slowly under valgrind's emulation. Expect single-digit FPS. Look at kcachegrind's pretty output, though:

This is a human turn plus an AI turn of a base assault: https://bumba.net/~hmaon/OXC_callgrind_kcachegrind_one_turn_base_assault.png
The single method using the most CPU in the battlescape is, obviously, the shader. Then, curiously, there's SavedBattleGame::getTile(), followed by _zoomSurfaceY().
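(For the record, there's nothing OXC-specific about the setup: it's the standard `valgrind --tool=callgrind ./openxcom` run, then opening the resulting `callgrind.out.<pid>` file in kcachegrind.)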

I bet there's something that could be done to speed up the shader code but I actually don't understand it yet. I moved on to the other functions.

getTile() seemed to cry out to be inlined, so that's what I did. I then inlined getTileIndex() along with it, so the whole lookup avoids a function call entirely. As you can see, getTile() gets called a lot; in this run it was called over 68 million times. _zoomSurfaceY() gets called once per frame, I think; that makes 68513755 / 4031.0 = 16996.7 getTile() calls per frame on average. It's hard to say whether that's actually a lot; it's about a quarter of the pixels in a 320x200 window, though. It's not quite 5% of the CPU load. Then again, 5% of CPU time in a single getter function, really?
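For reference, the change amounts to something like the sketch below: move the getters into the header and mark them inline, so the bounds check and the index arithmetic get folded straight into the callers. The member names (`_width`, `_length`, `_height`) and the flat `Tile` array are illustrative stand-ins, not the exact fields of SavedBattleGame.

```cpp
// Illustrative sketch of an inlined tile lookup -- field names are stand-ins.
struct Position { int x, y, z; };
struct Tile { /* tile data */ };

class BattleMapSketch
{
    Tile *_tiles;                  // flat array of _width * _length * _height tiles
    int _width, _length, _height;  // map dimensions
public:
    // Header-visible index math so the compiler can inline it into every caller.
    inline int getTileIndex(const Position &pos) const
    {
        return pos.z * _length * _width + pos.y * _width + pos.x;
    }

    // Bounds check + lookup; with both methods inline, the ~17000 lookups per
    // frame no longer pay any call overhead.
    inline Tile *getTile(const Position &pos)
    {
        if (pos.x < 0 || pos.y < 0 || pos.z < 0 ||
            pos.x >= _width || pos.y >= _length || pos.z >= _height)
        {
            return 0;
        }
        return &_tiles[getTileIndex(pos)];
    }
};
```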

Anyway, next I looked at _zoomSurfaceY(). It's responsible for stretching the 320x200 native resolution window to the display resolution (e.g., 640x400 or my preference of 1280x800). It's written as a very general function that scales the image correctly to any arbitrary resolution, given any pixel format. That leaves a lot of room for optimization in the special cases of 2x or 4x scale at 8bpp, which seem like the most common use cases. I wrote two rescaling functions that read data as 64-bit ints and write it back as 64-bit ints (and then 32-bit versions of the same). The result seems to have been an FPS increase anywhere from +10% to +100%. At 1280x800 on my particular laptop, the game went from ~70 fps to ~140 fps. Curiously, the 32-bit versions of the zoom function are only a couple of FPS slower. I'm not sure why the gap is so small -- write combining, maybe? Could it be the register spill I'm noticing in the assembly output of the 64-bit version? If anyone has experience with this sort of analysis, please take a look.
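To show the trick without digging through the branch, here's a stripped-down sketch of the 2x special case. It is not the actual code from the pull request; it assumes 8bpp surfaces, a width divisible by 4, a pitch equal to the width, suitably aligned buffers, and a little-endian host (see point 6 below about endianness).

```cpp
#include <stdint.h>

// Stripped-down 2x nearest-neighbour scaler for 8bpp surfaces.
// Assumptions: width % 4 == 0, pitch == width, src/dst 8-byte aligned,
// little-endian byte order. The general _zoomSurfaceY() remains the
// fallback for every other scale factor and pixel format.
static void zoom2x_8bpp_sketch(const uint8_t *src, uint8_t *dst,
                               int srcWidth, int srcHeight)
{
    const int dstPitch = srcWidth * 2;
    for (int y = 0; y < srcHeight; ++y)
    {
        const uint32_t *s  = (const uint32_t *)(src + y * srcWidth);
        uint64_t       *d0 = (uint64_t *)(dst + (2 * y)     * dstPitch);
        uint64_t       *d1 = (uint64_t *)(dst + (2 * y + 1) * dstPitch);

        for (int x = 0; x < srcWidth / 4; ++x)
        {
            uint32_t p = s[x];  // four source pixels a,b,c,d (a = lowest byte on LE)
            // Duplicate each byte: memory order becomes a a b b c c d d.
            uint64_t q =  ((uint64_t)(p & 0x000000FFu))
                       | ((uint64_t)(p & 0x000000FFu) << 8)
                       | ((uint64_t)(p & 0x0000FF00u) << 8)
                       | ((uint64_t)(p & 0x0000FF00u) << 16)
                       | ((uint64_t)(p & 0x00FF0000u) << 16)
                       | ((uint64_t)(p & 0x00FF0000u) << 24)
                       | ((uint64_t)(p & 0xFF000000u) << 24)
                       | ((uint64_t)(p & 0xFF000000u) << 32);
            d0[x] = q;  // eight output pixels in one store...
            d1[x] = q;  // ...and the whole row written twice for the 2x height
        }
    }
}
```

The point is just that one 32-bit load and two 64-bit stores replace eight per-pixel reads and writes through the fully general scaling loop.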

Incidentally, there's probably an opportunity to insert other filter functions here, perhaps borrowed from one of the many console emulators out there.

Finally, here's the profiler's output after my changes: https://bumba.net/~hmaon/optimized_zoom_function_profile.png
As you can see, getTile() is gone from the results, and its most frequent callers in the TileEngine are next in line. Also, _zoomSurfaceY() has fallen below the TileEngine code in CPU use! Going from 4.68% to 3.07% of CPU time seems like a nice change.

Of course, those figures are hardly scientific. I made little effort to keep the two runs identical, and there's no demo mode I could run the game through to repeat similar runs; I have to actually play the game at ~0 fps on valgrind's virtual CPU.

Oh yeah, _michal asked on IRC for a write-up of my profiling and optimization attempts.

The branch with my optimizations is here: https://github.com/hmaon/OpenXcom/tree/optimization_attempts
I've submitted a pull request for whenever SupSuper is done working on actual important stuff.

Suggested points for discussion:
1) What is up with the Shader code? How does it work? Anyone? How can it be sped up?
2) What's the deal with my coding style? Why is it such a mess?
3) How about some optimizations that I missed?
4) Can those TileEngine methods be improved somehow?
5) Shouldn't we just use OpenGL to scale and filter the output? (Perhaps?)
6) Does ANYONE have a working PowerPC Mac? I bet my code is broken on big-endian systems right now but I have no computer to test on!

tl;dr: I made the FPS number go up a little; maybe someone porting to really underpowered hardware (or running debug builds) will care.
« Last Edit: February 05, 2013, 07:38:45 am by hmaon »

Offline Yankes

  • Global Moderator
  • Commander
  • *****
  • Posts: 3349
Re: Profiling OXC (and optimizing some code a little)
« Reply #1 on: February 05, 2013, 08:38:53 pm »
The shader function was written by me :)
The main goal was to make it as fast as possible without losing reusability. The only drawback is that it depends heavily on compiler optimization to work properly (inlining calls and removing some unneeded data and function calls). Most of the lines in that function (`ShaderDraw`) just prepare data before drawing. That removes work that would otherwise have to be done in the inner loop; thanks to that, I only need one comparison to break out of the loop.
In your case most of the work is done in `StandartShade::func`, which is called for every pixel. It changes the brightness of the graphics.
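Roughly, the per-pixel step boils down to something like the sketch below. This is a simplified illustration, not the exact code; it assumes the 8bpp palette is laid out in 16-entry ramps where a higher index within a ramp means a darker shade (as in the original X-COM palettes).

```cpp
#include <stdint.h>

// Simplified sketch of the per-pixel shading step (not the exact OXC code).
// Assumes 16-entry palette ramps: high nibble = ramp, low nibble = brightness,
// where a larger low nibble is darker.
struct StandartShadeSketch
{
    static inline void func(uint8_t &dest, uint8_t src, int shade)
    {
        if (src == 0) { dest = 0; return; }   // index 0 stays untouched (transparent/black)
        int ramp  = src & 0xF0;               // keep the colour ramp
        int level = (src & 0x0F) + shade;     // darken within the ramp
        if (level > 15) level = 15;           // clamp at the darkest entry
        dest = (uint8_t)(ramp | level);
    }
};

// ShaderDraw then reduces to a tight loop over the destination pixels:
//   for (int i = 0; i < count; ++i) StandartShadeSketch::func(dst[i], src[i], shade);
// with all clipping and bounds work hoisted out before the loop starts.
```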


I doubt it's possible to speed this up without using assembler or hardware.

Offline michal

  • Commander
  • *****
  • Posts: 629
Re: Profiling OXC (and optimizing some code a little)
« Reply #2 on: February 16, 2013, 08:43:59 am »
Kyzrati from x@com wrote on his blog about a profiling tool for Windows:

https://www.codersnotes.com/sleepy

Quote
It supports any native Windows app, if it has standard PDB or DWARF2 debugging information. No recompilation is necessary – it can just attach to any app as it’s running.

I thought it might be useful for OpenXcom too.

Offline hmaon

  • Sergeant
  • **
  • Posts: 40
  • C jockey
Re: Profiling OXC (and optimizing some code a little)
« Reply #3 on: February 17, 2013, 12:15:50 am »
Thanks, _michal; that sounds handy. Also, I'm going to need more changes to the gitbuilder makefile at some point, please! Namely, -msse2 when compiling and -lopengl32 when linking.

Yankes, I tried to vectorize StandartShade to at least read 64 bits at a time but I didn't see any performance gains. You're probably right! Thanks for your insight.

What I've done instead is copy the OpenGL output code from bsnes and jam it into OXC. The result is 400 fps at 1280x800 while filtering the image with hardware shaders. The code is here: https://github.com/hmaon/OpenXcom/tree/opengl Screenshots of the shaders are in the other thread.
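For anyone curious, the core of the OpenGL path is just: upload the finished frame as a texture once per frame, draw a single textured quad over the window, and let the GPU do the filtering. Below is a minimal sketch of that idea only; it is not the bsnes code, has none of its GLSL filter chain, the function names are mine, and it assumes the 320x200 8bpp frame has already been converted to 32-bit RGBA.

```cpp
#include <stdint.h>
#include <SDL_opengl.h>  // SDL's portable pull-in of the GL 1.x headers

// One texture reused every frame.
static GLuint g_tex = 0;

// Create the texture and pick the filter mode (GL_LINEAR smooths,
// GL_NEAREST keeps the chunky pixels).
void initGLOutput(bool smooth)
{
    glGenTextures(1, &g_tex);
    glBindTexture(GL_TEXTURE_2D, g_tex);
    GLint filter = smooth ? GL_LINEAR : GL_NEAREST;
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, filter);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, filter);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 320, 200, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, 0);  // allocate storage, no data yet
}

// Upload the current frame and stretch it across the whole viewport.
void drawGLFrame(const uint32_t *rgbaPixels)
{
    glBindTexture(GL_TEXTURE_2D, g_tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 320, 200,
                    GL_RGBA, GL_UNSIGNED_BYTE, rgbaPixels);
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);  // one quad covering clip space; the GPU scales/filters
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f,  1.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f,  1.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f, -1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, -1.0f);
    glEnd();
}
```

With the stretch off the CPU entirely, fancier filters become fragment shaders, which is exactly what the bsnes code provides.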