How to convert a JSON into a HDF5 file

You scraped a bunch of data from a cryptocurrency exchange API into JSON but you figured that it’s taking too much disk space ? Switching to HDF5 will save you some space and make the access very fast, as it’s optimized for I/O operations. The HDF5 format is supported by major tools like Pandas, Numpy and Keras, data integration will be smooth, if you want to do some analysis.

Flattening the JSON

Most of the time JSON data is a giant dictionary with a lot of nested levels, the issue is that HDF5 doesn’t understand that. If we take the below JSON:

json_dict = {'Name':'John', 'Location':{'City':'Los Angeles','State':'CA'}, 'hobbies':['Music', 'Running']}

The result will look like this in a DataFrame:

Nested DataFrame
Nested DataFrame

We need to flatten the JSON to make it look like a classic table:

Flatten DataFrame
Flatten DataFrame

We’re going to use the flatten_json() function (more info here):

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

Loading into a HDF5 file

Now the idea is to load the flattened JSON dictionary into a DataFrame that we’re going to save in a HDF5 file.

I’m assuming that during scraping we appended each record to the JSON, so we have one dictionary per line:

def json_to_hdf(input_file, output_file):
    
    with pd.HDFStore(output_file) as store:
        with open(input_file, "r") as json_file:
            for i, line in enumerate(json_file):
                try:
                    flat_data = flatten_json(ujson.loads(line))
                    df = pd.DataFrame.from_dict([flat_data])
                    store.append('observations', df)
                except:
                    pass

Let’s break this down.

Line 3: we initialize the HDFStore, this is the HDF5 file, it’s handling the file writing and everything.

Lines 4 & 5: we open the file and read it line per line

Line 7: we transform the line into a JSON dictionary and then we flatten it

Line 8: we transform the flatten dictionary into a Pandas DataFrame

Line 9: we append this DataFrame into the HDFStore

Et voilà, you now have your data in a single HDF5 file, ready to be loaded for your statistical analysis or maybe to generate trading signals, remember, it’s optimized for Pandas and Numpy so it’ll be faster than reading from the original JSON file.