I think pretty much everyone has realized at this point that XML is not very easy to parse. With a little documentation and a helpful parsing library, it should be, at the very least, manageable, right?
That’s what I thought when I attempted to write a TMX parser for the first time. I quickly found out how much of a pain it is to parse XML even with the format documentation right in front of me and a robust library to work with.
Seemingly random blank tags
I think the main issue with the XML standard is just how many quirks a file can possibly have. Things like this:
<doc>
<element>data</element>
<- There's a blank tag here! ->
<element>more data</element>
</doc>
That blank tag accounts for the whitespace that sits between the two elements.
When LibXML2 detects it, a blank <text> tag is placed in between two otherwise
valid XML tags. Now, when parsed by Python or JavaScript or other languages
where pointers are essentially nonexistent, this should never come up. When
parsed with a language like C, however… well:
zsh: segmentation fault (core dumped) ./test
So, like any good developer, I ran it under Valgrind, and soon discovered that this is no ordinary memory fault.
==2373696== Invalid read of size 8
==2373696== by _parse_layer(void*)
==2373696== at xmlStrEqual(nodePtr*, xmlChar*)
==2373696== Address 0x0 not stack'd, malloc'd or (recently) free'd
Now at this point, I’m thinking “Wait, the bug is in LibXML? That can’t be
right.” GDB, with liberal use of bt and print, pointed at the same result: that
the bug resided with LibXML. None of this made sense in the slightest. Why on
earth would a professionally written software library that was essentially a
standard fixture on many Unix-like systems have a major memory bug in it? The
answer would not reveal itself until I observed the program with hardware
watchpoints.
(gdb) watch *node
Hardware watchpoint 2: *node
(gdb)
...
Hardware watchpoint 2: node
Old value = (nodePtr *) 0x5555...
New value = (nodePtr *) 0x0
There you are! This was that pesky <text> tag. Apparently any, and I do mean
any, whitespace detected by LibXML, including the space inserted by my level
editor, produces this strange <text> tag that seems to be there for no clear
reason. Three new helper functions and judicious use of
nodePtr = nodePtr->next later, and that problem is solved.
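As a rough illustration of that kind of helper, here is a minimal sketch of skipping non-element siblings. It is not the code from my parser: `struct node`, `skip_to_element` and `next_element` are hypothetical stand-ins so the example compiles without LibXML2, and only the `type`, `name` and `next` fields are modelled on the library's node structure (the numeric values mirror `XML_ELEMENT_NODE` and `XML_TEXT_NODE`).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the few node fields used here; NOT the real libxml2 type. */
enum node_type { NODE_ELEMENT = 1, NODE_TEXT = 3 };

struct node {
    enum node_type type;  /* element, text, ... */
    const char *name;     /* tag name, or "text" for a text node */
    struct node *next;    /* next sibling, NULL at the end of the list */
};

/* Return n itself if it is an element, otherwise the first element
 * sibling after it. Returns NULL when the sibling list runs out. */
static struct node *skip_to_element(struct node *n)
{
    while (n != NULL && n->type != NODE_ELEMENT)
        n = n->next;
    return n;
}

/* Advance from one element to the next element sibling, stepping over
 * any whitespace-only text nodes in between. */
static struct node *next_element(struct node *n)
{
    assert(n != NULL);
    return skip_to_element(n->next);
}
```

The important part is that the caller checks for NULL after every step instead of assuming the next sibling is the tag it expects. With the real library, `xmlNextElementSibling()` from `libxml/tree.h` appears to cover the same ground, so hand-written helpers like these may not be needed at all.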
Comma separated values inside XML tags
XML can be a beast to parse, but CSV? I can parse that very easily using
standard library functions like strtok. The problem came when the values in
this list did not match the values inside the level editor.
<data encoding="csv">
49,50,50...50,51
97,
.
.
.
145,146...146,147
</data>
That first tile in the upper left corner? It has a GID of 48 inside the editor. In the file, it has a GID of 49. This difference is not readily apparent from the editor or the file itself unless you know it's there. This created the interesting case where my level looked pretty good in the editor but looked like someone had placed tiles at random when my engine loaded the file into memory.
The documentation failed to mention this too, making it all the more difficult to track down. Eventually, I figured out that the fix was simply to decrement each tile GID as it is extracted from the file.
/* GIDs in the file are one higher than in the editor, so shift them down. */
for (int i = 0; i < count; i++)
    ret->tile_gids[i] = vals[i] - 1;
Thankfully, it didn’t require an hour's worth of debugging inside GDB to find this.
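Putting the strtok parsing and the off-by-one fix together, a minimal sketch might look like the following. The function name `parse_csv_gids` and its signature are made up for this example. It assumes the contents of the data tag have already been copied into a writable buffer, since strtok modifies its input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse comma-separated GIDs from csv into out, storing each value minus
 * one. csv is modified in place by strtok. Returns the number of values
 * stored, which is never more than max. */
static size_t parse_csv_gids(char *csv, int *out, size_t max)
{
    assert(csv != NULL && out != NULL);

    /* The list spans several lines, so whitespace counts as a separator too. */
    const char *delims = ", \t\r\n";
    size_t count = 0;

    for (char *tok = strtok(csv, delims);
         tok != NULL && count < max;
         tok = strtok(NULL, delims)) {
        out[count++] = (int)strtol(tok, NULL, 10) - 1;
    }
    return count;
}
```

Note that a GID of 0 in the file comes out as -1 here. This sketch leaves it to the caller to decide what that should mean.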
Conclusion
I think that if you can get away with it, try to parse a binary file of your own
creation rather than try to parse an existing standard. Why binary? Because you
can more finely control how the data is formatted and parsing becomes a
non-issue thanks to the ease of use of a standard C FILE pointer. I think the
next thing I’ll do is write a script that converts these XML files into
something much more terse and less quirky.
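To give an idea of what I mean, here is a minimal sketch of such a format. Everything in it is hypothetical: the layout (a width and a height, followed by width × height tile IDs), the `struct layer` type and the `layer_write`/`layer_read` names are invented for illustration, and the data is written in host byte order, so a file is only portable between machines with the same endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk layout: width, height (uint32_t each), then
 * width * height int32_t tile IDs in row-major order, host byte order. */
struct layer {
    uint32_t width, height;
    int32_t *tiles;
};

/* Write one layer to fp. Returns 0 on success, -1 on a short write. */
static int layer_write(FILE *fp, const struct layer *l)
{
    assert(fp != NULL && l != NULL);
    size_t n = (size_t)l->width * l->height;
    if (fwrite(&l->width, sizeof l->width, 1, fp) != 1) return -1;
    if (fwrite(&l->height, sizeof l->height, 1, fp) != 1) return -1;
    if (fwrite(l->tiles, sizeof *l->tiles, n, fp) != n) return -1;
    return 0;
}

/* Read one layer from fp into l, allocating l->tiles with malloc.
 * Returns 0 on success, -1 on error. The caller frees l->tiles. */
static int layer_read(FILE *fp, struct layer *l)
{
    assert(fp != NULL && l != NULL);
    l->tiles = NULL;
    if (fread(&l->width, sizeof l->width, 1, fp) != 1) return -1;
    if (fread(&l->height, sizeof l->height, 1, fp) != 1) return -1;

    size_t n = (size_t)l->width * l->height;
    if (n == 0) return -1;  /* treat an empty layer as invalid */

    l->tiles = malloc(n * sizeof *l->tiles);
    if (l->tiles == NULL) return -1;
    if (fread(l->tiles, sizeof *l->tiles, n, fp) != n) {
        free(l->tiles);
        l->tiles = NULL;
        return -1;
    }
    return 0;
}
```

There is no tokenizing and there are no surprise nodes: the reader is three fread calls. A real format would probably also want a magic number, a version field and a sanity limit on the dimensions before trusting them.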