The uGON Parser

The uGON parser is simple implementation of a file parser for the GON (Glaiel Object Notation) file format. This format is similar to JSON, but it is much more terse. The original implementation of this format was created by Tyler Glaiel, and if you want to learn more about the format itself, you should refer to this link to the original repository.


An Introduction to the Format

Every field in a gon file consists of a name and value pair (unless the object is in an array, in which case it has no name). Each token (a name or value) is separated by whitespace, unless encased in quotation marks, in which case the content of the quotation marks is treated as a single token. There are three types of gon field:

Below is a breif example of what the format looks like:

sample_object {
    file_name test.gon
    number    35.35
    string    "this is a string"
    array     [ 1 [ 2.1 2.2 ] 3 ]
    nested_object {
        number 53.53
        string "this is a string"
        array  [ 1 2 " three " 4 five 6 7 8 ]
    }
}

In this example, we have a root object called "sample_object" which contains several fields, including an array and another nest object. Objects and arrays can be nested as deeply as you wish without issue. Because all of the data in a gon file is plain text, there is no inherent issue with mixing data/fields/objects of different types. You can also placed comments in a file with '#'. Everything after this token will be ignored until the end of the line.

I find that the extremely minimal syntax of the format makes it ideal for situations requiring manual data entry. I currently use the format for entering variable values in my upcoming game's level editor in leiu of a gui interface, since it is extremely easy to edit the text and reload the file whenever I need to update the data. I also use the format for all of my configuration files as it is extremely easy for users to edit and requires very little explanation.

Motivation

I became familiar with the GON format because it is used heavily for storing game data in The End is Nigh, a game by Tyler Glaiel and Edmund McMillen. Through modding this game, I have used the format extensively and have had an affinity for its extreme simplicity compared to many other text-based data file formats (such as JSON or XML). During development of my The End is Nigh Randomizer, I developed a very straightforward C# port of the original parser, which helped familiarize me with how it functioned. After butting heads with the limitations of the TEiN game engine for long enough, I set out to create a new engine from scratch, and as part of that I wanted to write a GON parser from scratch to address some of the inadequacies I recognized in the original version.

My goal with this implementation of the format was to create a much faster, more efficient parser which will greatly cut down on load times and simplify the data access model. After several revisions, I believe this goal has largely been acheived. My uGON parser is roughly 150x faster (yes, 150 times, not percent, faster) than the original GON implementation, and significantly faster than rapidxml for a file containing roughly equivalent infromation. Currently, on my machine, the runtime of the parser is on average equivalent to around 3x the runtime of the strlen function on the same input file.

High-Level Design

This parser gets its speed primarily from its simplicity. The entire uGON parser is available as a single C header file, and it is only ~330 lines total. Nearly the entire parsing process is done in a single 120-line loop, with no function calls or recursion. The main data structure of the parser is the GonField. This represents each field in the file, which may be of type field, object, or array. This struct is treated as a tagged union, reducing the memory footprint as much as possible without compromising speed. The main loop reads each name/value pair which defines a GonField, with minimal control branching to handle object and array field types. In place of using recursion for parsing nested objects, a simple stack is used to push and pop parent object indices.

uGON avoids many of the pitfalls of other parsers simply by doing things the old fashioned way, keeping in mind the basics of how computers really function. Many parsers for similar formats construct a tree structure of nodes or objects which represent the data in the file (similar to a DOM). These individual nodes are often allocated one by one on the heap, and then linked to one another through pointers. This results in slow data access because of indirection and the need to constantly allocate small chunks of memory. By contrast, the uGON parser stores the data for each field sequentially in a flat array, were the structure can be derived from the field size and count attributes. This enables traversal of the parsed file in a tree-like manner, but without the performance penalties of a true, pointer-based tree structure.

This has still been quite a cursory explaination of the program, but hopefully it has communicated some basic elements of the parser's structure. More detailed documentation is yet to come.

To-Do

This parser is not quite completely ready to see use in your next big project, but it should be seeing some useful updates in the weeks and months to come. I'm activeley developing the parser for use in my other project, OpenEnded, which uses the GON format for all of its data files and as a means for editting levels through markup text. I plan to finalize a proper API for interfacing with the parser soon, as well as adding better methods for accessing data, including a simple hash table to enable fast data retreivals. I may look into options for further optimization using mutli-threading or SIMD instructions, though I anticipate that such solutions may have very diminishing returns.