Friday, January 25, 2013

JSON: It's Time to Move On

I love JSON. I love it because it's not XML. I used to think XML was a pretty good idea compared to unpacking structs, but the more it was used for generic tasks like RPC and config file formats, the more it became clear that it was really only suitable for documents. This makes sense, as that's what it was designed to do. XML was being used to represent data structures, and the problem with that is that there is a mismatch between what XML is good at expressing and the sort of data structures you generally want to encode for computational tasks.

JSON is obviously a better choice for a number of common data types and structures such as floats, strings, maps, and lists. The syntax is easier to read and more concise for encoding these types. More important, however, is that there is a clear mapping between the data structures and their encoding. This was something you had to invent in XML or pick from a number of incompatible standards, such that XML became a proliferation of different languages speaking about the same things.

JSON has served us well, but much like XML, as it's been used for more and more things, its shortcomings are becoming apparent. JSON suffers from essentially the same problem as XML: a lack of universal mappings for common items that need to be encoded and decoded consistently.

The missing type which most commonly causes me trouble with JSON is byte strings. Javascript only has one string type, while other languages often have two: one for byte strings and one for unicode strings. To be honest, I'm not totally sure whether Javascript strings are supposed to be unicode strings. String literals can include unicode escape sequences, but I'm not clear on whether you can have pure byte strings (i.e. with invalid unicode sequences), and I don't know, for instance, whether String.charAt(x) counts bytes or unicode characters. Regardless, most JSON encoders assume all strings are unicode. Therefore, JSON in practice has only a unicode string type and does not support byte strings.
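Python makes the two string types explicit, so it's a convenient way to see the gap. A quick sketch (the specific byte values are arbitrary):

```python
import json

# str is unicode text; bytes is a raw byte string.
text = "héllo"           # unicode string: serializes fine
raw = b"\xff\xfe\x00"    # byte string, not even valid UTF-8

print(json.dumps(text))  # unicode strings encode without trouble

# The standard encoder has no representation for byte strings at all.
try:
    json.dumps(raw)
except TypeError as err:
    print("cannot encode bytes:", err)
```

The encoder doesn't transliterate or guess; it simply has no byte-string type to map to, which is exactly the semantic hole.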

Many applications, however, have byte strings. The most common solution is to base64 encode your byte strings into ASCII characters and encode them into JSON as unicode strings. In addition to being slower, it requires increased semantic complexity. Both the sender and receiver of the JSON now need to know where the base64 encoded strings are in the nested JSON data structure so that they can be encoded and decoded between byte strings and base64.
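Here is what that workaround looks like in practice, sketched in Python (the "thumbnail" field name is invented for illustration — the point is that both ends must agree on it out of band):

```python
import base64
import json

payload = {"name": "photo.png", "thumbnail": b"\x89PNG\r\n\x1a\n"}

# Sender: base64-encode the byte string into ASCII before JSON encoding.
wire = json.dumps({
    "name": payload["name"],
    "thumbnail": base64.b64encode(payload["thumbnail"]).decode("ascii"),
})

# Receiver: must already know which fields are base64 to restore them.
decoded = json.loads(wire)
thumbnail = base64.b64decode(decoded["thumbnail"])
assert thumbnail == payload["thumbnail"]
```

Nothing in the wire format marks "thumbnail" as binary; that knowledge lives in both codebases, which is the semantic complexity being paid for.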

This has caused people to invent their own protocols on top of or around JSON. For instance, you can tag every string as to whether it needs to be base64 decoded or not. Another solution is to remove all byte strings from the JSON and instead include tagged offsets. The binary data can then be appended to the end of the JSON data as a packed binary blob and the offsets used to extract individual byte strings. A very simple solution I've seen is to encode the whole data structure using a binary-friendly format such as BSON or MessagePack, base64 encode the entire result, and send it as a single JSON string.
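The string-tagging approach, for instance, can be built on a JSON encoder's hooks. A minimal sketch in Python — the "__bytes__" tag name is an arbitrary convention invented here, and a real map that happens to have that single key would collide with it:

```python
import base64
import json

def default(obj):
    # Wrap every byte string in a single-key object so the decoder
    # can find and restore it without knowing the schema.
    if isinstance(obj, bytes):
        return {"__bytes__": base64.b64encode(obj).decode("ascii")}
    raise TypeError(f"cannot encode {type(obj).__name__}")

def object_hook(obj):
    if set(obj) == {"__bytes__"}:
        return base64.b64decode(obj["__bytes__"])
    return obj

doc = {"id": 7, "digest": b"\x00\xffhash"}
wire = json.dumps(doc, default=default)
assert json.loads(wire, object_hook=object_hook) == doc
```

The encoder and decoder now handle the translation generically, but only for peers that implement this same ad-hoc convention.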

The advantage to building something on top of or around JSON is that the encoder and decoder do all of the work of analyzing the data structure and patching incompatibilities with standard JSON. The disadvantage is that now you're using a nonstandard protocol which is going to need to be implemented for both the sender and receiver, for every language you want to use.

The best solution overall is to recognize the limitations of JSON and decide on a new protocol which fixes them. There are several alternatives to JSON already, but they focus more on efficiency of encoding and decoding than on the more fundamental semantic mismatch issues. Of the binary formats I've looked at (BSON, BJSON, MessagePack), only BSON has separate data types for unicode strings and byte strings. I'm not specifically advocating BSON, but at least they have the right idea on that front.

This new protocol doesn't even necessarily need to be a binary protocol. It just needs to support byte strings as a semantic type. In the end, everything needs to be JSON-compatible in order to be browser-compatible, so building something on top of JSON would probably be a fine solution. People are already doing this, as I mentioned above. The next step is to give it a name and release it on github so that everyone can use it and start adding support for more languages.

Here is my minimum feature list for a new encoding:
  • It should be JSON-style where you just give it a data structure and it serializes it, and you give it a string and it deserializes it into a data structure. (As opposed to Protobuf/Thrift style with schemas)
  • Support for all the JSON datatypes - string, float, map, list, boolean, null
  • Add support for byte strings in addition to unicode strings
  • Add support for integers in addition to floats
  • Add support for dates
  • Browser-compatible, which probably means encoded as JSON between the client and server
Nice-to-have optional features:
  • Sets as well as lists
  • Ordered maps as well as unordered maps
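To make the core of that list concrete, here is one possible sketch of such an encoding, layered on JSON so it stays browser-compatible. The tag names ("$type", "b64", "date") are invented for illustration; integers and floats already round-trip through JSON in most non-Javascript languages, so only byte strings and dates need tagging here:

```python
import base64
import datetime
import json

def encode(obj):
    def default(o):
        if isinstance(o, bytes):
            return {"$type": "b64",
                    "$value": base64.b64encode(o).decode("ascii")}
        if isinstance(o, datetime.datetime):
            return {"$type": "date", "$value": o.isoformat()}
        raise TypeError(type(o).__name__)
    return json.dumps(obj, default=default)

def decode(s):
    def hook(o):
        if o.get("$type") == "b64":
            return base64.b64decode(o["$value"])
        if o.get("$type") == "date":
            return datetime.datetime.fromisoformat(o["$value"])
        return o
    return json.loads(s, object_hook=hook)

doc = {"when": datetime.datetime(2013, 1, 25, 12, 0), "blob": b"\x00\x01"}
assert decode(encode(doc)) == doc
```

Give something like this a name and a spec, and the sender and receiver no longer each need to invent it.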
In the meantime, I've switched to using BSON when not in browsers and I'm still using JSON in the browser. This is not a good solution, but it's the best available at the moment that doesn't require inventing a custom protocol.