Adventures in XML Parsing

I think pretty much everyone has realized at this point that XML is not very easy to parse. With a little documentation and a helpful parsing library it should be, at the very least, manageable, right?

That’s what I thought when I attempted to write a TMX parser for the first time. I quickly found out how much of a pain it is to parse XML even with the format documentation right in front of me and a robust library to work with.

Seemingly random blank tags

I think the main issue with the XML standard is just how many quirks a file can possibly have. Things like this:

<doc>
        <element>data</element>
        <- There's a blank tag here! ->
        <element>more data</element>
</doc>

That blank tag accounts for the whitespace that sits between the two elements. When LibXML2 detects it, a blank text node (which shows up with the name "text") is placed in between two otherwise valid XML tags. Now when parsed by Python or JavaScript or other languages where raw pointers are essentially non-existent, this should never come up as a crash. When parsed with a language like C, however… well:

zsh: segmentation fault (core dumped)   ./test

So, like any good developer, I ran it under Valgrind, and soon discovered that this is no ordinary memory fault.

==2373696== Invalid read of size 8
==2373696==     by _parse_layer(void*)
==2373696==     at xmlStrEqual(nodePtr*, xmlChar*)
==2373696==  Address 0x0 not stack'd, malloc'd or (recently) free'd

Now at this point, I’m thinking “Wait, the bug is in LibXML? That can’t be right.” GDB, with liberal use of bt and print, pointed at the same result: that the bug resided in LibXML. None of this made sense in the slightest. Why on earth would a professionally written software library that is essentially a standard fixture on many Unix-like systems have a major memory bug in it? The answer did not reveal itself until I observed the program with hardware watchpoints.

(gdb) watch *node
Hardware watchpoint 2: *node
(gdb)
...

Hardware watchpoint 2: node

Old value = (nodePtr *) 0x5555...
New value = (nodePtr *) 0x0

There you are! This was that pesky <text> tag. Apparently any, and I do mean any, whitespace detected by LibXML, including the space inserted by my level editor, produces one of these text nodes, and my code was treating it like a regular element. So the null pointer was coming from my own traversal code, not from LibXML. Three new helper functions and judicious use of nodePtr = nodePtr->next later, that problem was solved.
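For anyone hitting the same wall, here is a minimal sketch of the kind of helper that skips those text nodes. It deliberately does not link against LibXML: struct node is a hypothetical stand-in that carries only type, name, and next, loosely mirroring the xmlNode fields of the same names, and skip_to_element and count_elements are names made up for this illustration rather than the helpers from the actual parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the few xmlNode fields used here.
 * The numeric values match libxml2's XML_ELEMENT_NODE and XML_TEXT_NODE. */
enum node_type { ELEMENT_NODE = 1, TEXT_NODE = 3 };

struct node {
    enum node_type type;
    const char *name;   /* "text" for text nodes, the tag name for elements */
    struct node *next;  /* next sibling, NULL at the end of the list */
};

/* Return the first element node at or after n, or NULL if none is left. */
struct node *skip_to_element(struct node *n)
{
    while (n != NULL && n->type != ELEMENT_NODE)
        n = n->next;
    return n;
}

/* Count sibling elements named tag, stepping over whitespace text nodes. */
size_t count_elements(struct node *first, const char *tag)
{
    size_t count = 0;

    assert(tag != NULL);
    for (struct node *n = skip_to_element(first); n != NULL;
         n = skip_to_element(n->next)) {
        if (strcmp(n->name, tag) == 0)
            count++;
    }
    return count;
}
```

The important part is that every step checks for NULL before touching the node, so running off the end of the sibling list ends the loop instead of the process. LibXML2 also ships xmlFirstElementChild and xmlNextElementSibling, which do this kind of skipping on real xmlNode lists, so hand-rolled helpers may not be needed at all.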

Comma separated values inside XML tags

XML can be a beast to parse, but CSV? I can parse that very easily using standard library functions like strtok. The problem came when the values in this list did not match the values shown inside the level editor.
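Before getting to the mismatch, here is roughly what the tokenizing step can look like. This is a sketch under a few assumptions: the text content of the data element has already been copied into a writable buffer, the function name parse_csv_gids is invented for this example, and the values fit in an unsigned long.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse a comma-separated list of unsigned integers into out.
 * csv is modified in place because strtok writes NUL bytes into it.
 * Returns the number of values stored (at most max). */
size_t parse_csv_gids(char *csv, unsigned long *out, size_t max)
{
    /* Commas separate values; newlines and spaces only lay out the rows. */
    static const char delims[] = ", \t\r\n";
    size_t count = 0;

    assert(csv != NULL && out != NULL);
    for (char *tok = strtok(csv, delims); tok != NULL && count < max;
         tok = strtok(NULL, delims)) {
        out[count++] = strtoul(tok, NULL, 10);
    }
    return count;
}
```

Treating the newline as just another delimiter means the row layout in the file does not matter; the map width and height have to come from elsewhere. Keep in mind that strtok keeps hidden state between calls, so it is not safe to nest or to use from multiple threads.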

Tiled Map Editor

<data encoding="csv">
49,50,50...50,51
97,
.
.
.
145,146...146,147
</data>

That first tile in the upper left corner? It has a GID of 48 inside the editor. In the file, it has a GID of 49. This difference is not readily apparent from the editor or the file itself unless you know it’s there. This created the interesting case where my level looked pretty good in the editor but looked like someone had placed tiles at random when my engine loaded the file into memory.

The documentation did not make this obvious either, making it all the more difficult to track down. Eventually, I figured out that I just needed to decrement each tile GID as it is extracted from the file.

for (int i = 0; i < count; i++)
        ret->tile_gids[i] = vals[i]-1;

Thankfully, it didn’t require an hour’s worth of debugging inside GDB to find this.
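As far as I can tell, the off-by-one comes from the firstgid attribute on the tileset element: the first tileset in a map starts at 1, and 0 in the tile data is reserved for an empty cell. If that is right, a plain decrement works only for a map with a single tileset and would turn empty cells into a wrapped-around or negative index. A slightly more defensive sketch, where gid_to_local_id is a made-up name and firstgid is assumed to have been read from the tileset element:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a global tile ID from the map data into a zero-based index
 * within one tileset. Returns -1 for an empty cell (gid 0) or for a
 * gid that falls below this tileset's range. */
int64_t gid_to_local_id(uint32_t gid, uint32_t firstgid)
{
    assert(firstgid >= 1);
    if (gid == 0)
        return -1;                     /* 0 marks a cell with no tile */
    if (gid < firstgid)
        return -1;                     /* belongs to an earlier tileset */
    return (int64_t)gid - (int64_t)firstgid;
}
```

With firstgid equal to 1 this reduces to the decrement above: GID 49 in the file becomes tile 48. Tiled also stores flip flags in the top bits of a GID, so flipped tiles would need those bits masked off first; this sketch ignores that.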

Conclusion

I think that if you can get away with it, you should parse a binary file of your own creation rather than an existing standard. Why binary? Because you can control exactly how the data is formatted, and parsing becomes a non-issue thanks to the ease of use of a standard C FILE pointer. I think the next thing I’ll do is write a script that converts these XML files into something much more terse and less quirky.
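To give a feel for how little code that takes, here is one possible shape for such a format. Everything in it is hypothetical: the level_header layout, the LEVEL_MAGIC value, and the level_write and level_read names are invented for this sketch, and it assumes the file is written and read on machines with the same byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk layout: this header, then width*height tile ids. */
struct level_header {
    uint32_t magic;   /* arbitrary tag used to reject unrelated files */
    uint32_t width;   /* tiles per row */
    uint32_t height;  /* number of rows */
};

#define LEVEL_MAGIC 0x4C56454Cu

/* Write the header and tile ids to fp. Returns 0 on success, -1 on error. */
int level_write(FILE *fp, uint32_t width, uint32_t height,
                const uint32_t *tiles)
{
    struct level_header hdr = { LEVEL_MAGIC, width, height };
    size_t count = (size_t)width * height;

    assert(fp != NULL && tiles != NULL);
    if (fwrite(&hdr, sizeof hdr, 1, fp) != 1)
        return -1;
    if (fwrite(tiles, sizeof *tiles, count, fp) != count)
        return -1;
    return 0;
}

/* Read a level from fp. Returns a malloc'd tile array that the caller
 * must free, or NULL if the file is short, empty, or has the wrong magic. */
uint32_t *level_read(FILE *fp, uint32_t *width, uint32_t *height)
{
    struct level_header hdr;
    uint32_t *tiles;
    size_t count;

    assert(fp != NULL && width != NULL && height != NULL);
    if (fread(&hdr, sizeof hdr, 1, fp) != 1 || hdr.magic != LEVEL_MAGIC)
        return NULL;
    count = (size_t)hdr.width * hdr.height;
    if (count == 0)
        return NULL;
    tiles = malloc(count * sizeof *tiles);
    if (tiles == NULL)
        return NULL;
    if (fread(tiles, sizeof *tiles, count, fp) != count) {
        free(tiles);
        return NULL;
    }
    *width = hdr.width;
    *height = hdr.height;
    return tiles;
}
```

The whole loader is two fread calls, which is the appeal. The cost is that the format is only as portable as the struct layout and byte order it was written with, so a real converter would probably want to pin those down explicitly.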