You scraped a bunch of data from a cryptocurrency exchange API into JSON, but it's taking up too much disk space? Switching to HDF5 can save you some space and make access much faster, since the format is designed for I/O-heavy workloads. HDF5 is supported by major tools like pandas, NumPy and Keras, so integrating the data will be smooth if you want to do some analysis.
Flattening the JSON
Most of the time, JSON data is a giant dictionary with many nested levels. The issue is that the tabular layout pandas writes to HDF5 expects flat columns, not nested structures. Take the JSON below:
json_dict = {'Name':'John', 'Location':{'City':'Los Angeles','State':'CA'}, 'hobbies':['Music', 'Running']}
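Loaded as a single record into a DataFrame, the nested Location dictionary and the hobbies list are not split into separate columns. We need to flatten the JSON so that every value gets its own column, like a classic table, with names such as Location_City and hobbies_0.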
We’re going to use the flatten_json() function (more info here):
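def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Recurse into each key, extending the column name prefix.
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # List items are addressed by their position.
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated name.
            out[name[:-1]] = x

    flatten(y)
    return out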
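As a quick check of what the flattening produces, here is a small sketch that runs the function on the sample dictionary from earlier. The function body is a copy of the same logic so the snippet can run on its own.

```python
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out


json_dict = {'Name': 'John',
             'Location': {'City': 'Los Angeles', 'State': 'CA'},
             'hobbies': ['Music', 'Running']}

flat = flatten_json(json_dict)
print(flat)
# {'Name': 'John', 'Location_City': 'Los Angeles', 'Location_State': 'CA',
#  'hobbies_0': 'Music', 'hobbies_1': 'Running'}
```

Nested keys are joined with an underscore and list items get their position as a suffix, so every value ends up under its own column name.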
Loading into an HDF5 file
Now the idea is to load each flattened JSON dictionary into a DataFrame that we're going to save in an HDF5 file.
I'm assuming that during scraping we appended each record to the JSON file as it arrived, so the file contains one dictionary per line:
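import pandas as pd
import ujson

def json_to_hdf(input_file, output_file):
    # The HDFStore is the HDF5 file; the context manager closes it on exit.
    with pd.HDFStore(output_file) as store:
        with open(input_file, "r") as json_file:
            for i, line in enumerate(json_file, start=1):
                try:
                    # Parse one record, flatten it, and wrap it in a one-row DataFrame.
                    flat_data = flatten_json(ujson.loads(line))
                    df = pd.DataFrame([flat_data])
                    store.append('observations', df)
                except Exception as err:
                    # Skip records that fail to parse or to append, but say so.
                    print(f"Skipping line {i}: {err}")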
Let’s break this down.
The pd.HDFStore(output_file) line: we initialize the HDFStore. This is the HDF5 file, and it handles the file writing and everything else.
The open(input_file, "r") line and the for loop: we open the JSON file and read it line by line.
The flat_data line: we parse the line into a dictionary and then flatten it.
The df line: we turn the flattened dictionary into a one-row pandas DataFrame.
The store.append() line: we append this DataFrame to the 'observations' table in the HDFStore.
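To make the "one dictionary per line" assumption concrete, here is a minimal sketch of what such a file looks like and how the line-by-line loop treats it. The field names (symbol, ticker, bid, ask) and the file name are made up for illustration, and it uses the standard library json module rather than ujson so it runs without extra dependencies.

```python
import json
import os
import tempfile

# Hypothetical sample records; real exchange payloads will have other fields.
records = [
    {"symbol": "BTC-USD", "ticker": {"bid": 100.0, "ask": 100.5}},
    {"symbol": "ETH-USD", "ticker": {"bid": 10.0, "ask": 10.2}},
]

path = os.path.join(tempfile.mkdtemp(), "scraped.json")

# Scraper side: append one JSON document per line.
with open(path, "a") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")
    out.write("{not valid json\n")  # simulate a record cut off mid-write

# Reader side: parse each line on its own and count the ones that fail.
parsed, skipped = [], 0
with open(path) as json_file:
    for line in json_file:
        try:
            parsed.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1

print(len(parsed), "records parsed,", skipped, "skipped")
```

Because every line is parsed independently, one broken record costs you that record only, not the whole file.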
Et voilà, you now have your data in a single HDF5 file, ready to be loaded for your statistical analysis or maybe to generate trading signals. Remember, HDF5 works well with pandas and NumPy, so loading it should be faster than re-parsing the original JSON file.
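Reading the data back could look like the sketch below. It assumes pandas with PyTables (the tables package) installed, writes two made-up rows under the same 'observations' key used above, and loads them again with pd.read_hdf. The column names and values are placeholders, not real exchange data.

```python
import os
import tempfile

import pandas as pd

# Made-up flattened records standing in for the scraped data.
rows = [
    {"Name": "John", "Location_City": "Los Angeles", "price": 101.5},
    {"Name": "Jane", "Location_City": "Boston", "price": 99.0},
]

path = os.path.join(tempfile.mkdtemp(), "observations.h5")

# Write one row at a time, as in json_to_hdf().
with pd.HDFStore(path) as store:
    for row in rows:
        store.append("observations", pd.DataFrame([row]))

# Load the whole table back into a DataFrame.
df = pd.read_hdf(path, key="observations")

# Every appended one-row frame had index 0, so rebuild a clean index.
df = df.reset_index(drop=True)
print(df)
```

One thing to keep in mind with this approach: the first append fixes the table's columns and the width of its string columns, so later records with new keys or longer strings will fail to append unless you plan for them, for example with the min_itemsize argument of store.append().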