I did a small test. It's 100ms just to visit all the nodes of all the trees (no vector building). So, good luck~
>flat_map
flat_map still doesn't beat unordered_map in insertion or random-lookup performance. The reason people want flat_map is that it's faster than map while, like map, giving you the one thing unordered_map can't: ordered traversal.
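Just to illustrate the ordered-traversal point, a quick throwaway sketch (I'm assuming boost::container::flat_map here, since that's the usual flat_map implementation):

#include <boost/container/flat_map.hpp>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    boost::container::flat_map<std::string, int> fm{{"b", 2}, {"a", 1}, {"c", 3}};
    std::unordered_map<std::string, int> um{{"b", 2}, {"a", 1}, {"c", 3}};

    for (const auto& [k, v] : fm) std::cout << k << ' ';  // always prints: a b c
    std::cout << '\n';
    for (const auto& [k, v] : um) std::cout << k << ' ';  // order is unspecified
    std::cout << '\n';
}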
Even an Adaptive Radix Tree, which is the ultimate tree-like structure for string keys, is still 2-3 times slower than the hash map.
Anyway, if we absolutely needed to make some hot path fast, there is a way to do that. Let's say we're currently doing:
reader.tryRead("foo", _foo)
reader.tryRead("bar", _bar)
reader.tryRead("baz", _baz)
Building an index has overhead. But without an index, looking for a key that doesn't exist is problematic: if a mapping node has no child with that key, we spend a lot of time looping through the children and doing string comparisons only to come back empty-handed. To avoid that, a more optimal way of deserializing would be:
for (const auto& child : reader.children())
{
switch(child.key())
{
case("foo"): child.tryReadVal(_foo); break;
case("bar"): child.tryReadVal(_bar); break;
case("boo"): child.tryReadVal(_boo); break;
}
}
So instead of searching for keys that the node's children may not have, we loop through the children. The obvious problem is that we can't switch on a string in C++, so we would have to build a switch tree that compares one character at a time:
for (const auto& child : reader.children())
{
std::string_view key = child.key();
switch(key[0])
{
case('f'): if (key == "foo") child.tryReadVal(_foo); break;
case('b'):
switch(key[1])
{
case('a'): if (key == "bar") child.tryReadVal(_bar); break;
case('o'): if (key == "boo") child.tryReadVal(_boo); break;
}
}
}
Heh, it's like we're simulating a trie data structure. This would work, but not only does it look ugly, it's a nightmare to construct and a nightmare to maintain, unless we used some sort of fancy library or whatever for it.
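For what it's worth, one "no fancy library" way to keep this maintainable is a sorted static table of the known keys plus a binary search, then switching on the index. Just a sketch; kKeys and keyIndex are names I made up, not part of the reader API:

#include <algorithm>
#include <array>
#include <string_view>

// Known keys, kept sorted so we can binary-search them.
constexpr std::array<std::string_view, 3> kKeys = {"bar", "boo", "foo"};

int keyIndex(std::string_view key)
{
    auto it = std::lower_bound(kKeys.begin(), kKeys.end(), key);
    return (it != kKeys.end() && *it == key) ? static_cast<int>(it - kKeys.begin()) : -1;
}

// Then inside the loop over children:
//   switch (keyIndex(child.key()))
//   {
//       case 0: child.tryReadVal(_bar); break;
//       case 1: child.tryReadVal(_boo); break;
//       case 2: child.tryReadVal(_foo); break;
//   }

It still does string comparisons, but only a logarithmic number of them, and the list of keys lives in one place.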
Lastly, I'd like to mention hashing. Hashing to the rescue!
for (const auto& child : reader.children())
{
switch(somePortableHash<uint8_t>(child.key())) //1 byte makes it likely the switch will be compiled into a jump table
{
case(17): if (key == "foo") child.tryReadVal(_foo); break;
case(183): if (key == "bar") child.tryReadVal(_bar); break;
case(211): if (key == "boo") child.tryReadVal(_boo); break;
default: break;
}
}
This would probably be the fastest option performance-wise. It does suffer from all those magic numbers (precomputed hash results), but it wouldn't be too difficult to maintain: if a new key is added, just compute its hash and add a case to the switch.
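Actually, the magic numbers can go away entirely if the hash is constexpr, because then the case labels can be spelled as hash calls. Here's a minimal sketch of a portable 1-byte hash (FNV-1a folded down to 8 bits; hash8 is a name I made up, not an existing API), with static_asserts to catch collisions between the known keys:

#include <cstdint>
#include <string_view>

constexpr std::uint8_t hash8(std::string_view s)
{
    std::uint32_t h = 2166136261u;      // FNV-1a 32-bit offset basis
    for (char c : s)
    {
        h ^= static_cast<std::uint8_t>(c);
        h *= 16777619u;                 // FNV-1a 32-bit prime
    }
    return static_cast<std::uint8_t>(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24)); // fold to one byte
}

// Catch collisions between the keys we actually dispatch on.
static_assert(hash8("foo") != hash8("bar"));
static_assert(hash8("foo") != hash8("boo"));
static_assert(hash8("bar") != hash8("boo"));

// Then the switch becomes:
//   switch (hash8(key))
//   {
//       case hash8("foo"): if (key == "foo") child.tryReadVal(_foo); break;
//       case hash8("bar"): if (key == "bar") child.tryReadVal(_bar); break;
//       case hash8("boo"): if (key == "boo") child.tryReadVal(_boo); break;
//   }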
Right, there's also such a thing as a minimal perfect hash, but that wouldn't be maintainable.