This is why popular data serialization formats like JSON and Protocol Buffers are wasteful.

Guillaume Fortin-Debigaré
May 10, 2023
4 min read

Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays, thanks to the historical exponential growth of processing power, storage availability and bandwidth, along with a myriad of existing solutions to tackle them. So much so that we simply assume these technologies are properly adapted to today's needs.

Specifically, we're going to look at the cloud computing costs of data serialization, and question whether current data serialization technologies are well adapted to them. (Spoiler: They're not.)

The money problem

Let's consider a scenario where we would like to offer a service that sends and receives data over the Internet. We would have to deal with the following expenses:

Implementation and maintenance costs
Processing power for data serialization and deserialization
Bandwidth and storage consumption

As such, we would like to minimize the total sum of these costs over the lifetime of the service. In addition, we would also like to minimize these same costs for our consumers to give ourselves a competitive advantage.

Picking optimal data serialization formats is therefore critical to achieving this objective, because it will have an impact on all of these costs.

For implementation and maintenance, we also have to consider that once a data serialization format becomes popular, many people will already have done the groundwork. Those base costs are therefore not considered here.

Current technologies

Human-readable formats

CSV, XML, JSON, YAML... those are all great data serialization formats because anyone can read and modify them using a simple text editor. In terms of compactness, however, they are pretty terrible because they are verbose by design.

Let's say, for example, that you would like to represent an object with 5 boolean properties. Writing the values alone takes multiple bytes for each "True" or "False", plus the delimiters between them, even though the object only carries 5 bits of information. Similarly, if the names of the properties must be included in the format, writing them consumes even more bytes.
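To put rough numbers on this, here is a minimal sketch comparing a JSON rendering of such an object with the same 5 values packed one bit each. The property names and the `pack_flags` helper are made up for the illustration, and the packed form assumes both sides already agree on the order of the properties.

```python
import json

# Hypothetical object with 5 boolean properties (names are illustrative only).
flags = {
    "is_active": True,
    "is_admin": False,
    "is_verified": True,
    "has_newsletter": False,
    "is_locked": False,
}

# Human-readable form: property names, delimiters and literals are all text.
as_json = json.dumps(flags).encode("utf-8")


def pack_flags(values):
    """Pack an ordered sequence of booleans into an integer, one bit per value."""
    packed = 0
    for position, value in enumerate(values):
        if value:
            packed |= 1 << position
    return packed


# Binary form: 5 bits of information fit in a single byte, assuming the
# order of the properties is known to both the writer and the reader.
as_bits = pack_flags(flags.values()).to_bytes(1, "big")

print(len(as_json), "bytes as JSON")
print(len(as_bits), "byte when bit-packed")
```

The gap shrinks if the property names are shortened or dropped, but each boolean literal still costs several bytes as text.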

As such, not only does the serialized text take up a lot of space, but deserializing it also requires parsing text, which is not very efficient. Removing optional whitespace may help, but only up to a point.
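As a small illustration of how far trimming can go, the sketch below serializes the same record three ways with Python's standard `json` module. The `record` contents are invented for the example.

```python
import json

# Hypothetical record, used only to compare output sizes.
record = {"id": 7, "tags": ["a", "b"], "active": True}

pretty = json.dumps(record, indent=2)                 # indented for humans
default = json.dumps(record)                          # ", " and ": " separators
compact = json.dumps(record, separators=(",", ":"))   # no optional whitespace

for label, text in (("pretty", pretty), ("default", default), ("compact", compact)):
    print(label, len(text.encode("utf-8")), "bytes")

# All three forms decode to the same data: only the whitespace differs.
assert json.loads(pretty) == json.loads(default) == json.loads(compact) == record
```

Even the compact form still spells out every property name, quote and literal, which is the limit mentioned above.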

Data compression

One quick fix in terms of bandwidth and storage consumption is to apply data compression to the text data. However, general-purpose compression knows nothing about the structure of the data, so the results are generally not optimal. Also, while compression saves bandwidth and storage, it requires additional processing power, although the net result is usually worth it in terms of raw expenses.
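Here is a minimal sketch of that quick fix using Python's standard `zlib` module. The generated `records` are invented and deliberately repetitive, so the ratio printed here should not be read as typical of real payloads.

```python
import json
import zlib

# Hypothetical, highly repetitive payload, built only for this illustration.
records = [{"id": i, "active": i % 2 == 0, "name": f"user{i}"} for i in range(200)]

text = json.dumps(records, separators=(",", ":")).encode("utf-8")

# Generic compression: smaller output, paid for with extra CPU time on both ends.
compressed = zlib.compress(text, 9)

print(len(text), "bytes as compact JSON")
print(len(compressed), "bytes after zlib")

# The receiver has to decompress and then still parse the text.
assert json.loads(zlib.decompress(compressed)) == records
```

Note that the compressor works on the text as an opaque byte stream: it has no notion that `active` is a boolean that could have been stored as a single bit in the first place.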

As for the existing data compression algorithms themselves, some common issues include:

Byte as the smallest component
Upper size limit
Equivalent values written differently
Limited predefined dictionary

Protocol Buffers

As a need for pure binary data serialization arose from the above issues, Protocol Buffers emerged to fill it. While not the only binary serialization solution, it became popular thanks to its open-source nature, its versatile data encoding, its powerful object definitions, and the possibility of pairing it with gRPC to define full web services. However, the encoding of Protocol Buffers is a bit strange, which may lead to some unexpected issues. For example:

Defining the data requires transforming the definition into an API using an external tool, then embedding that API in the main code, which may be problematic for compatibility and maintenance. This is especially a problem when dealing with consumers stuck on legacy systems.
Data types do not match between definition (scalar value types) and serialization (wire types), probably to simplify the conversion to common variable types in popular programming languages.
Integers may be serialized longer than necessary, due to a base 128 encoding in which each digit takes a full byte (7 bits of value plus 1 continuation bit). This issue also affects the encoding of the data type and field ID.
Strings are encoded as UTF-8, even when a better encoding may exist. This is especially true if strings do not require the full range of Unicode characters, or even ASCII characters.
Repeated values or simple patterns are not compressed. While this may be partially mitigated by implementing data compression over the serialized data, this will likely not be done optimally.
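The integer overhead mentioned in the list above can be sketched with a hand-written base 128 encoder. This is an illustration of the documented wire format for non-negative integers, not the official library code, and the helper names `encode_varint` and `encode_tag` are my own.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer in base 128, 7 value bits per byte."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Encode a field tag, (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)


for n in (1, 127, 128, 300, 2**32 - 1):
    needed_bits = max(n.bit_length(), 1)
    used_bits = len(encode_varint(n)) * 8
    print(f"{n}: needs {needed_bits} bits, varint uses {used_bits} bits")

# Field numbers above 15 no longer fit in a one-byte tag.
print(len(encode_tag(15, 0)), "byte tag for field 15")
print(len(encode_tag(16, 0)), "byte tag for field 16")
```

Every byte spends one bit on continuation, and the total is always rounded up to a whole number of bytes, which is where the extra length comes from.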

As such, it's not a surprise that Protocol Buffers became popular, as each potential issue also has a related advantage. Still, there is room for improvement.

The future has arrived

Based on the above, here are some ideas I identified as potential optimizations toward the original objective of minimizing costs:

Concatenate data at the bit level instead of the byte level
Use a data compression algorithm that is specifically designed for the serialized data format
Define a data serialization negotiation algorithm for simpler implementation and maintenance
Allow dynamic data serialization within the same stream
Use artificial intelligence to improve optimization of data compression
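To make the first idea in the list concrete, here is a minimal sketch of concatenating fields at the bit level. It is a generic illustration and not a description of how Pipeline-D works; the `BitWriter` class, the field widths and the zero padding of the last byte are all assumptions chosen for the example.

```python
class BitWriter:
    """Accumulates fixed-width unsigned fields back to back, ignoring byte boundaries."""

    def __init__(self) -> None:
        self._value = 0
        self._bit_count = 0

    def write(self, value: int, width: int) -> None:
        """Append `value` using exactly `width` bits, most significant bit first."""
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        self._value = (self._value << width) | value
        self._bit_count += width

    def to_bytes(self) -> bytes:
        """Return the accumulated bits, padding the last byte with zero bits."""
        padding = (-self._bit_count) % 8
        total_bytes = (self._bit_count + padding) // 8
        return (self._value << padding).to_bytes(total_bytes, "big")


writer = BitWriter()
writer.write(5, 3)    # a value known to be in 0..7
writer.write(1, 1)    # a boolean
writer.write(300, 9)  # a value known to be in 0..511
writer.write(0, 1)    # another boolean
packed = writer.to_bytes()

# 3 + 1 + 9 + 1 = 14 bits, so 2 bytes instead of at least 1 byte per field.
print(len(packed), "bytes:", packed.hex())
```

The catch is that the reader must know each field's width in advance, which is one reason the serialization negotiation idea appears in the same list.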

This is far from an exhaustive list, but we at TS-Alpha considered such ideas for new designs and prototypes. We are not considering artificial intelligence until we can find a way to reduce its resource requirements to an acceptable level, but we were able to build promising prototypes from the other ideas listed above, which gave birth to Pipeline-D, our flagship cloud savings solution.

If you are producing or consuming serialized data for your APIs, we hope that you will consider Pipeline-D as a cost-effective way to manage your cloud data.

Disclaimer: This blog post is a revised version of a previously unpublished independent article I wrote on 2020-10-12, which has been adapted to include the latest developments at TS-Alpha since I became affiliated with them.