**Benchmarking JSON libraries for large payloads**

Imagine that the marketing campaign you set up for Black Friday was a massive success, and customers start pouring into your website. Your Mixpanel setup, which would usually see around 1,000 customer events an hour, suddenly receives millions of events within an hour. Your data pipeline is now tasked with parsing vast amounts of JSON data and storing it in your database. Your standard JSON parsing library cannot scale to the sudden data growth, and your near real-time analytics reports fall behind. This is when you realize the importance of an efficient JSON parsing library. In addition to handling large payloads, JSON parsing libraries should also be able to serialize and deserialize highly nested JSON payloads.

In this article, we explore Python parsing libraries for large payloads. We specifically look at the capabilities of ujson, orjson, and ijson, and then benchmark the standard JSON library (stdlib json), ujson, and orjson for serialization and deserialization performance. Since we use the terms serialization and deserialization throughout the article, here's a refresher: serialization converts your Python objects into a JSON string, whereas deserialization rebuilds Python data structures from a JSON string.

As we progress through the article, you will find a decision flow diagram to help you choose a parser based on your workflow and unique parsing needs. We also explore NDJSON and the libraries used to parse NDJSON payloads. Let's get started.

Stdlib json supports serialization for all basic Python data types, including dicts, lists, and tuples. When json.loads() is called, it loads the entire JSON into memory at once. This is fine for smaller payloads, but for larger payloads it can cause critical performance issues, such as out-of-memory errors and choking of downstream workflows.

```python
import json

with open("large_payload.json", "r") as f:
    json_data = json.loads(f.read())  # loads the entire file into memory, all tokens at once
```

For payloads in the order of hundreds of MBs, it is advisable to use ijson. ijson, short for "iterative JSON", reads files one token at a time without the memory overhead. The code below shows the ijson approach.

```python
import ijson

with open("json_data.json", "r") as f:
    for record in ijson.items(f, "items.item"):  # fetch one dict at a time from a structure like {"items": [ ... ]}
        process(record)  # process() is your own handler; ijson reads records one token at a time
```

As you can see, ijson fetches one element at a time from the JSON and loads it into a Python dict, which is then fed to the calling function, in this case process(record). The overall working of ijson is shown in the illustration below.

ujson has been widely used in applications involving large JSON payloads, as it was designed to be a faster alternative to the stdlib json. Parsing is fast because ujson is written in C, with Python bindings exposing it to the Python interface. The areas that needed improvement in the standard library were optimized in ujson for speed and performance. However, ujson is generally not chosen for newer projects, as the maintainers themselves note on PyPI that the library is in maintenance-only mode. Below is a high-level illustration of how ujson works.
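Because ujson mirrors the stdlib API for the common dumps()/loads() calls, it can often be swapped in with a one-line import change. The snippet below is a minimal sketch of that drop-in usage (the payload here is just an illustrative dict); keep in mind that ujson does not support every keyword argument that stdlib json accepts.

```python
import ujson as json  # drop-in for the common dumps/loads calls

payload = {"event": "page_view", "user_id": 42, "properties": {"path": "/checkout"}}

encoded = json.dumps(payload)   # same call shape as stdlib json.dumps
decoded = json.loads(encoded)   # same call shape as stdlib json.loads
assert decoded == payload
```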
Basic serialization and deserialization with ujson looks like this:

```python
import ujson

taxonomy_data = '{"id":1, "genus":"Thylacinus", "species":"cynocephalus", "extinct": true}'
data_dict = ujson.loads(taxonomy_data)  # Deserialize

with open("taxonomy_data.json", "w") as fh:  # Serialize
    ujson.dump(data_dict, fh)

with open("taxonomy_data.json", "r") as fh:  # Deserialize
    data = ujson.load(fh)
```

We now move to the next library, orjson. Since orjson is written in Rust, it is not only optimized for speed but also benefits from memory-safe mechanisms that prevent the buffer overflows developers can run into with C-based JSON libraries like ujson. Moreover, orjson supports serialization of several additional datatypes beyond the standard Python datatypes, including dataclass and datetime objects. Another key difference is that orjson's dumps() function returns a bytes object, whereas the other libraries return a string; returning bytes is one of the reasons for orjson's high throughput.

```python
import orjson

book_payload = '{"id":1,"name":"The Great Gatsby","author":"F. Scott Fitzgerald","Publishing House":"Charles Scribner\'s Sons"}'
data_dict = orjson.loads(book_payload)  # Deserialize

with open("book_data.json", "wb") as f:  # Serialize
    f.write(orjson.dumps(data_dict))  # orjson.dumps() returns a bytes object

with open("book_data.json", "rb") as f:  # Deserialize
    book_data = orjson.loads(f.read())
```

Now that we've explored some JSON parsing libraries, let's test their serialization capabilities.

**Testing Serialization Capabilities of json, ujson and orjson**

We create a sample dataclass object with an integer, a string, and a datetime field.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    name: str
    created: datetime

u = User(id=1, name="Thomas", created=datetime.now())
```

We then pass it to each of the libraries to see what happens. We begin with the stdlib json.

```python
import json

try:
    print("json:", json.dumps(u))
except TypeError as e:
    print("json error:", e)
```

As expected, this raises a TypeError: the standard JSON library doesn't support serialization of dataclass or datetime objects.

Next, we test the same with the ujson library.

```python
import ujson

try:
    print("ujson:", ujson.dumps(u))
except TypeError as e:
    print("ujson error:", e)
```

As we see above, ujson is also unable to serialize the dataclass object and the datetime datatype. Lastly, we use the orjson library for serialization.

```python
import orjson

try:
    print("orjson:", orjson.dumps(u))
except TypeError as e:
    print("orjson error:", e)
```

We see that orjson is able to serialize both the dataclass and the datetime datatypes.

**Working with NDJSON (A Special Mention)**

We've seen the libraries for JSON parsing, but what about NDJSON? NDJSON (Newline Delimited JSON), as you might know, is a format in which each line is a JSON object. In other words, the delimiter is not a comma but a newline character. As an example, this is what NDJSON looks like:

```
{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}
```

NDJSON is mostly used for logs and streaming data, and hence NDJSON payloads are excellent candidates for parsing with the ijson library. For small to moderate NDJSON payloads, the stdlib json is sufficient. Beyond ijson and stdlib json, there is also a dedicated ndjson library. Below are code snippets showing each approach.

**NDJSON using stdlib json & ijson**

As NDJSON is not delimited by commas, it doesn't qualify for a bulk load: stdlib json expects a single valid JSON element (such as a list of dicts), but is instead given several JSON root elements in the payload file.
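A quick way to see this failure mode is to hand json.loads() the raw NDJSON text directly. The sketch below (using the sample records above, shortened to two lines) raises an "Extra data" error, because the parser finds a second root element after the first object.

```python
import json

ndjson_text = '{"id": "A13434", "name": "Ella"}\n{"id": "A13455", "name": "Charmont"}\n'

try:
    json.loads(ndjson_text)
except json.JSONDecodeError as e:
    print("stdlib json rejects raw NDJSON:", e)  # "Extra data: line 2 column 1 ..."
```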
Therefore, the file has to be parsed iteratively, line by line, with each record sent to the caller function for further processing.

```python
import json

ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""

with open("json_lib.ndjson", "w", encoding="utf-8") as fh:
    for line in ndjson_payload.splitlines():  # Split the string into individual JSON objects
        fh.write(line.strip() + "\n")  # Write each JSON object on its own line

with open("json_lib.ndjson", "r", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():  # Skip empty lines
            item = json.loads(line)  # Deserialize one record
            print(item)  # or send it to the caller function
```

With ijson, the parsing is done as shown below. With standard JSON, there is just one root element, which is either a dictionary (a single JSON object) or an array (a list of dicts). With NDJSON, each line is its own root element. The prefix "" in ijson.items() tells the parser to look at the root element, and multiple_values=True tells it that there are multiple JSON root elements in the file, so it yields one object (each line) at a time.

```python
import ijson

ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""

with open("ijson_lib.ndjson", "w", encoding="utf-8") as fh:
    fh.write(ndjson_payload)  # Write the payload to be processed

with open("ijson_lib.ndjson", "r", encoding="utf-8") as fh:
    for item in ijson.items(fh, "", multiple_values=True):  # one root element per line
        print(item)  # or send it to the caller function
```

Lastly, we have the dedicated ndjson library, which essentially converts the NDJSON format to standard JSON structures.

```python
import ndjson

ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""

with open("ndjson_lib.ndjson", "w", encoding="utf-8") as fh:
    fh.write(ndjson_payload)

with open("ndjson_lib.ndjson", "r", encoding="utf-8") as fh:
    ndjson_data = ndjson.load(fh)  # Returns a list of dicts
```

As you have seen, NDJSON files can usually be parsed with stdlib json or ijson. For very large payloads, ijson is the best choice because it is memory-efficient. But if you want to generate NDJSON payloads from Python objects, the ndjson library is the ideal choice: ndjson.dumps() converts Python objects to the NDJSON format without you having to iterate over the data structures yourself.

Now that we've explored NDJSON, let's pivot back to benchmarking stdlib json, ujson, and orjson.

**The reason ijson is not considered for benchmarking**

ijson, being a streaming parser, is fundamentally different from the bulk parsers we looked at. Benchmarking it alongside them would be comparing apples to oranges: it would give the false impression that ijson is the slowest, when in fact it serves a different purpose altogether. ijson is optimized for memory efficiency and therefore has lower throughput than bulk parsers.

**Generating a Synthetic JSON Payload for Benchmarking Purposes**

We generate a large synthetic JSON payload containing 1 million records using the mimesis library; this data is used to benchmark the libraries. The code below can be used to create the payload if you wish to replicate the benchmark. The generated file is between 100 MB and 150 MB in size, which is large enough for these tests.
```python
import json
from mimesis import Person, Address

person_name = Person("en")
complete_address = Address("en")

with open("large_payload.json", "w") as fh:  # Stream records to a file
    fh.write("[")  # Open the JSON array
    for i in range(1_000_000):
        payload = {
            "id": person_name.identifier(),
            "name": person_name.full_name(),
            "email": person_name.email(),
            "address": {
                "street": complete_address.street_name(),
                "city": complete_address.city(),
                "postal_code": complete_address.postal_code()
            }
        }
        json.dump(payload, fh)
        if i < 999_999:  # No trailing comma after the last entry
            fh.write(",")
    fh.write("]")  # Close the JSON array
```

Below is a sample (abridged) of what the generated data looks like. As you can see, the address fields are nested, so the JSON is not just large but also represents real-world hierarchical payloads.

```json
[
  {
    "id": "8177",
    "email": "showers1819@yandex.com",
    "address": {
      "street": "Emerald Cove",
      "city": "Crown Point"
    }
  },
  {
    "name": "Quinn Greer",
    "email": "professional2038@outlook.com",
    "address": {
      "city": "Bridgeport"
    }
  }
]
```

Let's start with benchmarking.

**Benchmarking Pre-requisites**

We use the read() function to store the JSON file as a string, and then use the loads() function from each library (json, ujson, and orjson) to deserialize that string into Python objects. First, we create the payload_str object from the raw JSON text.

```python
with open("large_payload.json", "r") as fh:
    payload_str = fh.read()  # raw JSON text
```

We then create a benchmarking function with two arguments: the function being tested (loads() here, and later dumps()) and the payload to pass to it. It runs the call three times and returns the elapsed time.

```python
import time

def benchmark(func, payload):
    start = time.perf_counter()
    for _ in range(3):  # run three iterations and report the total time
        func(payload)
    end = time.perf_counter()
    return end - start
```

We use this function to test both serialization and deserialization speeds.

**Benchmarking Deserialization Speed**

We load the three libraries being tested and run benchmark() against the loads() function of each.

```python
import json, ujson, orjson

results = {
    "json.loads": benchmark(json.loads, payload_str),
    "ujson.loads": benchmark(ujson.loads, payload_str),
    "orjson.loads": benchmark(orjson.loads, payload_str),
}

for lib, t in results.items():
    print(f"{lib}: {t:.4f} seconds")
```

As we can see, orjson takes the least amount of time for deserialization.

**Benchmarking Serialization Speed**

Next, we test the serialization speed of these libraries. Since dumps() serializes Python objects rather than raw JSON text, we first deserialize the payload once into payload_obj and benchmark dumps() on that object.

```python
payload_obj = json.loads(payload_str)  # dumps() expects Python objects, not raw JSON text

results = {
    "json.dumps": benchmark(json.dumps, payload_obj),
    "ujson.dumps": benchmark(ujson.dumps, payload_obj),
    "orjson.dumps": benchmark(orjson.dumps, payload_obj),
}

for lib, t in results.items():
    print(f"{lib}: {t:.4f} seconds")
```

Comparing run times, orjson again takes the least amount of time to serialize Python objects to JSON.

**Choosing the Best JSON library for your Workflow**

To sum up the decision flow: stick with stdlib json for small payloads and maximum compatibility, reach for orjson when serialization and deserialization speed matter most, use ijson when memory is the constraint and you need to stream very large files, and use the ndjson library when you produce or consume newline-delimited JSON.

**Clipboard & Workflow Hacks for JSON**

Suppose you'd like to view your JSON in a text editor such as Notepad++ or share a snippet from a large payload on Slack with a teammate. You'll quickly run into clipboard or text editor/IDE crashes. In such situations, you could use Pyperclip or Tkinter: Pyperclip works well for payloads within 50 MB, whereas Tkinter handles medium-sized payloads. For large payloads, write the JSON to a file to view the data.
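If you only need to share a small slice of a large payload, a minimal sketch with pyperclip might look like the following (pyperclip is assumed to be installed, and the slice size of five records is arbitrary):

```python
import json
import pyperclip  # assumed installed: pip install pyperclip

with open("large_payload.json", "r") as fh:
    records = json.load(fh)  # for truly huge files, pull the slice with ijson instead

snippet = json.dumps(records[:5], indent=2)  # pretty-print only a small, readable slice
pyperclip.copy(snippet)  # now paste it into Slack or a text editor
```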
JSON can seem effortless, but the larger the payload and the deeper the nesting, the more quickly parsing it can turn into a performance bottleneck. This article aimed to highlight how each Python parsing library addresses this challenge. When selecting a JSON parsing library, speed and throughput are not always the main criteria: your workflow determines whether throughput, memory efficiency, or long-term scalability matters most. In short, JSON parsing shouldn't be a one-size-fits-all approach.