Consider using CSV

At my current job, I write a lot of microservices that communicate via HTTPS. Microservice architectures are super hyped at the moment, and very often the data format of choice is JSON. That makes sense: JSON is ubiquitous, more lightweight than XML, and still flexible enough to represent any data structure you typically need.

However, JSON has some disadvantages too: it still carries some overhead, and it's not well suited to streaming. Consider this JSON list:

[
  {"productId": "5710031efdfe", "quantity": 1, "customerId": "8fe96b88"},
  {"productId": "479cd9744e5c", "quantity": 2, "customerId": "526ba6f5"}
]

I've added whitespace and newlines to make it easier to read, but without them, this JSON would be 133 characters long (65 per entry, plus commas and brackets). If the list had 100,000 entries, that would make about 6.6 MB.
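
If you want to check that math yourself, here's a quick sketch: serialize the two entries above without whitespace and count the characters.

```javascript
// The two entries from above, serialized without whitespace.
const orders = [
  { productId: "5710031efdfe", quantity: 1, customerId: "8fe96b88" },
  { productId: "479cd9744e5c", quantity: 2, customerId: "526ba6f5" },
];

const json = JSON.stringify(orders);
console.log(json.length); // 133
```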

If you were to request this list from an API, you might use fetch with the response.json() method (or the equivalent in your language of choice). These methods wait until the entire response has been downloaded, and only then parse it. That's because JSON is not a streaming format: a prefix of a JSON list is not itself valid JSON, so you can't just feed partial data to JSON.parse. There are ways to consume JSON in a streaming fashion (like the jq tool, or this library by yours truly), but they can be complicated.
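
To see the problem concretely: parsing the complete array works, but any truncated prefix of it throws.

```javascript
// The list from above, minified (as it would arrive over the wire).
const body =
  '[{"productId":"5710031efdfe","quantity":1,"customerId":"8fe96b88"},' +
  '{"productId":"479cd9744e5c","quantity":2,"customerId":"526ba6f5"}]';

JSON.parse(body); // fine: the complete array parses

let failed = false;
try {
  JSON.parse(body.slice(0, 80)); // a prefix is not valid JSON
} catch (e) {
  failed = true; // SyntaxError
}
console.log(failed); // true
```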

CSV is a format that's more lightweight than JSON and super well suited to streaming. Here's what the above list would look like in CSV:

productId,quantity,customerId
5710031efdfe,1,8fe96b88
479cd9744e5c,2,526ba6f5

It's only 77 bytes: 29 for the header and 24 for each line, newlines included. At 100,000 entries, this list would be about 2.4 MB (roughly 64% less than the JSON). And you can stream it without weird tricks, because the first n lines of a CSV file are themselves valid CSV.
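
That streaming property is easy to exploit. Here's a minimal sketch of an incremental parser: feed it chunks as they arrive off the network, and it emits complete rows immediately. (It's deliberately naive: it assumes no quoted fields containing commas or newlines. Use a real CSV parser for those.)

```javascript
// Naive incremental CSV parser: buffers partial lines between chunks and
// calls onRow with an object for every complete data line.
function makeCsvStreamParser(onRow) {
  let buffer = "";
  let header = null;
  return function feed(chunk) {
    buffer += chunk;
    let idx;
    while ((idx = buffer.indexOf("\n")) !== -1) {
      const fields = buffer.slice(0, idx).split(",");
      buffer = buffer.slice(idx + 1);
      if (header === null) header = fields; // first line is the header
      else onRow(Object.fromEntries(header.map((h, i) => [h, fields[i]])));
    }
  };
}

const rows = [];
const feed = makeCsvStreamParser((row) => rows.push(row));
// Chunk boundaries can fall anywhere, even mid-line:
feed("productId,quantity,cust");
feed("omerId\n5710031efdfe,1,8fe96b88\n479cd");
feed("9744e5c,2,526ba6f5\n");
console.log(rows.length); // 2
console.log(rows[0].productId); // "5710031efdfe"
```

Every row is available as soon as its line is complete, instead of after the whole response has arrived.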

Of course, CSV is less flexible than JSON. It's suitable when you have a list of items that mostly share the same properties, with no nested structures. If you do need nested data, it's possible to use JSON objects inside CSV:

productId,quantity,customerId,metadata
5710031efdfe,1,8fe96b88,"{""campaign"":""christmas2022""}"
479cd9744e5c,2,526ba6f5,"{""newCustomer"":true,""ref"":""search""}"
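
Producing those cells follows standard CSV quoting (as in RFC 4180): wrap the field in double quotes and double any quote characters inside it. Here's a sketch (csvField is a made-up helper for illustration, not from any library):

```javascript
// Quote a value for use as a CSV field: stringify non-strings as JSON,
// then quote if the field contains a comma, quote, or newline.
function csvField(value) {
  const s = typeof value === "string" ? value : JSON.stringify(value);
  return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

console.log(csvField({ campaign: "christmas2022" }));
// → "{""campaign"":""christmas2022""}"
console.log(csvField("5710031efdfe")); // plain fields stay unquoted
```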

Of course, this makes parsing a bit more complicated: you now need a CSV parser first, and then a JSON parser for the individual values. But it can still save bytes compared to using only JSON.

If you use Node.js, I can't recommend the csv for Node.js library highly enough: it supports generating and parsing with different modes, like callbacks, streams, and async generators.

Of course, CSV isn't perfect either. If you only deal with small lists (on the order of thousands of items or fewer), the savings will be negligible. And if your data is deeply nested or varies a lot from item to item, CSV might not make much sense.

That's why this post is called Consider using CSV: The next time you design an API, don't use JSON just because it's "the default". Think about what your data looks like and whether it might benefit from CSV!