I occasionally see references to the HDF5 file format, but I have never encountered it in the wild.
Traditionally, I have mostly stuck to flat files as my data format. My data sets are typically not excessively large (less than 10GB, usually much less). Flat files work well with Unix tools, are easy to parse, and easy to share. (They are also what gnuplot expects, which is why I started using them in the first place.)
But a recent project generated multiple data sets simultaneously, in addition to metadata. Was there a better way than maintaining a collection of flat files (or, worse, having multiple data sets in a single text file that is no longer so “flat”)?
This prompted me to take a look at HDF5. This post is mostly a collection and summary of various third-party resources that I found helpful when researching and evaluating HDF5.
What is HDF5?
HDF5 is a file format that allows multiple data sets to be collected in a single file. Within that file, data sets can be arranged into a hierarchy of “groups”, which are comparable to directories in a filesystem. In many ways, HDF5 is “a filesystem in a file”.
The data format is binary and quite complex, and therefore requires special tools to read and write, or to even view, a data set. HDF5 supports compression of the contained data sets. It does not appear to promote any particular semantic data model or workflow.
Development of the HDF format began in the late 1980s, as an effort to support scientific computation. The spec and implementation continue to be supported and developed, but HDF5 does not seem to have garnered broad adoption (some, possibly large and important, niche applications notwithstanding).
HDF5: Yes or No?
Trying to read up on it, I fairly quickly decided not to pursue HDF5 any further, mostly because I don’t see a compelling advantage of using it, compared to relying on the filesystem itself. I don’t see much (if anything) that I can’t do already using only the filesystem, possibly packed into a tar archive, if I want to bundle a directory hierarchy into a single file. (Although, to my surprise, mounting a packed tar archive of a filesystem is apparently less common and seamless than I would have expected.)
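To make the filesystem-plus-tar idea concrete, here is a minimal sketch using only the Python standard library. The directory layout and the file names (`params.json`, `data.tsv`) are illustrative assumptions, not a recommended convention. Each run becomes a directory, the whole hierarchy is packed into one compressed archive, and a single member can be read back without unpacking the rest.

```python
import json
import tarfile
import tempfile
from pathlib import Path


def write_run(root, name, params, rows):
    """Write one run as a directory with a flat data file and a JSON metadata file.

    The file names are hypothetical; pick whatever suits the project.
    """
    run_dir = Path(root) / name
    run_dir.mkdir(parents=True)
    (run_dir / "params.json").write_text(json.dumps(params), encoding="utf-8")
    lines = ["\t".join(str(v) for v in row) for row in rows]
    (run_dir / "data.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")


def bundle(root, archive):
    """Pack the whole directory hierarchy under `root` into one gzipped tar file."""
    root = Path(root)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(root, arcname=root.name)


def read_member(archive, member):
    """Return the text of a single archive member without extracting the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        handle = tar.extractfile(member)
        if handle is None:
            raise KeyError(f"{member} is not a regular file in {archive}")
        return handle.read().decode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        write_run(project, "run1", {"temperature": 1.5}, [(0.0, 1.0), (1.0, 2.5)])
        archive = Path(tmp) / "project.tar.gz"
        bundle(project, archive)
        print(read_member(archive, "project/run1/data.tsv"))
```

The members remain ordinary flat files, so after unpacking, the usual Unix tools work on them unchanged.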
The ease of having all data sets in a single file is counterbalanced by a number of objections:
The most serious one is the inability to use generic tools, even for basic operations. This is a major inconvenience. One can adapt to such workflows, but the advantages have to be sufficiently massive to make it worthwhile. Relational databases and image file formats are cases in point: both require special tools, but give enough in return. With HDF5, I don’t see comparable upsides.
I am very concerned about a complicated, opaque data format that at the same time seems to exist only in a relatively small niche or subculture. Experience shows that, no matter what marginal benefit a specialized tool has originally, over time the cumulative effort applied to mainstream solutions will win out. In the present case, filesystems benefit from decades of effort, across hundreds of developers and millions of users, to improve their reliability and efficiency and, last but not least, to reduce their defect rate. HDF5 simply can’t compete.
There are reports that HDF5 is susceptible to catastrophic corruption, rendering the entire file unreadable. For a storage technology, that is almost the worst thing that can be said about it. When working with data it is often (not always!) acceptable to lose some of the records to corruption, but losing an entire data set (or, as with HDF5, a collection of sets) is intolerable.
The community response is also telling. Reading experience reports (see the list of links at the bottom), I have found no outright fans, one very detailed and thoughtful critic, and a handful of rather guarded and circumspect supporters. The set of users also seems to be comparatively small, overall.
My overall impression is that HDF5 is a solution whose time (for the general user) has never come. I guess it addressed problems that were real or anticipated at the time it was developed, but that have simply dissolved by now. At the same time, it fails to take advantage of techniques that have come into existence in the meantime: metadata, for example, today is likely to be JSON or YAML, which HDF5 of course knows nothing about. Not to mention cloud storage.
Unfortunately, this brings me back to square one when it comes to ways of organizing data and result sets. What I am talking about here is the small, individual work situation, which, I am sure, still accounts for the vast majority of data projects, with small data sets (less than, say, 10GB, often much less), and at most a handful of machines. Not “Big Data”, no data engineering, no data pipelines. Are there interesting, non-trivial “best practices” for this, beyond plain common sense?
My primary concern is one of organization: keeping related data sets together. In fact, I have gotten to the point where I like to have (say) a simulation write the simulation parameters to the same file as the results, in order to ensure that data and metadata in fact belong together.
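One way to keep parameters and results in the same flat file is to write the parameters as comment lines at the top. The following is only a sketch under assumptions of my own choosing: the `# key = value` header convention, the JSON encoding of values, and the tab-separated numeric rows are all hypothetical, not an established format. Lines starting with `#` are treated as comments by gnuplot's default settings, so the file should remain plottable.

```python
import json
import os
import tempfile


def write_results(path, params, rows):
    """Write parameters as '#'-prefixed header lines, followed by the data rows."""
    with open(path, "w", encoding="utf-8") as out:
        for key, value in params.items():
            # JSON-encode the value so that numbers and strings round-trip.
            out.write(f"# {key} = {json.dumps(value)}\n")
        for row in rows:
            out.write("\t".join(str(v) for v in row) + "\n")


def read_results(path):
    """Return (params, rows), parsing header lines back into a dictionary.

    Assumes every '#' line has the 'key = value' form written above.
    """
    params, rows = {}, []
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                key, _, value = line[1:].partition("=")
                params[key.strip()] = json.loads(value)
            else:
                rows.append([float(v) for v in line.split("\t")])
    return params, rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "run.dat")
        write_results(path, {"temperature": 1.5}, [(0.0, 1.0), (1.0, 2.5)])
        print(read_results(path))
```

Because the header travels with the data, a results file copied elsewhere cannot be separated from the parameters that produced it.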
Another is denormalization: sometimes a parameter value is the same for all data points in a set (or simulation run), but will of course vary across runs. Does one repeatedly store the constant value for each record, or treat it as “metadata” (which then needs to be married to the actual payload at a later time)?
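The “marry later” option can be sketched in a few lines. The data layout here (a list of runs, each with a `params` dictionary and a list of `points`) is a hypothetical structure chosen for illustration. Each constant is stored once per run, and flat, denormalized records are generated only when they are needed.

```python
def denormalize(runs):
    """Yield one flat record per data point, repeating each run's parameters.

    If a point uses the same key as a parameter, the point's value wins.
    """
    for run in runs:
        for point in run["points"]:
            yield {**run["params"], **point}


if __name__ == "__main__":
    runs = [
        {"params": {"temperature": 1.5}, "points": [{"x": 0, "y": 1.0}, {"x": 1, "y": 2.5}]},
        {"params": {"temperature": 2.0}, "points": [{"x": 0, "y": 0.7}]},
    ]
    for record in denormalize(runs):
        print(record)
```

Storing the constant once avoids redundancy on disk, at the price of this extra step whenever a tool expects one self-contained record per line.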
A last thought is to employ transactional integrity, which in practice means SQLite. I have never done this for predominantly numerical data sets, but have begun to use SQLite as a storage format even for command-line and workflow tools: knowing that a crashed or interrupted write will not leave the data corrupted is often worth the hassle of having the data in an opaque format. It also makes it possible to write to the same data store from multiple processes (SQLite serializes the actual writes), which removes the need to merge separate results files at a later point.
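Here is a minimal sketch of what that looks like with Python's built-in `sqlite3` module. The schema (`runs` and `points` tables) and the `fail_midway` flag that simulates a crash are assumptions made up for this illustration. The point is that a run is either stored completely or not at all.

```python
import json
import os
import sqlite3
import tempfile


def init_schema(db_path):
    """Create the (hypothetical) tables for runs and their data points."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, params TEXT)")
            conn.execute("CREATE TABLE IF NOT EXISTS points (run_id TEXT, x REAL, y REAL)")
    finally:
        conn.close()


def store_run(db_path, run_id, params, rows, fail_midway=False):
    """Store parameters and data points of one run in a single transaction."""
    conn = sqlite3.connect(db_path)
    try:
        # As a context manager, the connection commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute("INSERT INTO runs VALUES (?, ?)", (run_id, json.dumps(params)))
            for i, (x, y) in enumerate(rows):
                if fail_midway and i == 1:
                    raise RuntimeError("simulated crash while writing")
                conn.execute("INSERT INTO points VALUES (?, ?, ?)", (run_id, x, y))
    finally:
        conn.close()


def count_points(db_path, run_id):
    """Return the number of stored data points for the given run."""
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT COUNT(*) FROM points WHERE run_id = ?"
        (count,) = conn.execute(query, (run_id,)).fetchone()
        return count
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "results.db")
        init_schema(db)
        store_run(db, "good", {"beta": 0.5}, [(0, 1), (1, 2), (2, 3)])
        try:
            store_run(db, "bad", {"beta": 0.7}, [(0, 1), (1, 2)], fail_midway=True)
        except RuntimeError:
            pass
        print(count_points(db, "good"), count_points(db, "bad"))
```

The interrupted run leaves no partial rows behind, in either table, so there is nothing to clean up before rerunning it.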
A Final Thought
One of the posts below, which advocates HDF5 because it is comparatively “the smallest evil”, goes on to say:
“The only alternative would be to roll my own system, which isn’t a pleasant idea either.”
True. But this thinking can also be too limiting. I have written two books in DocBook XML (!), because we wanted to use an “accepted standard”. If we had been less narrow-minded, we could have been the ones to invent Markdown. (It was about that time.)
Tools, libraries, programming environments, and experience with the design of computer systems have advanced to such a point that the barriers to creating significant, and significantly better, systems have become very low indeed.
Resources
The HDF Group is the non-profit organization that maintains the HDF5 standard and reference implementation.
Cyrille Rossant has written an uncommonly penetrating and thoughtful essay and a follow-up on his experiences using HDF5. After investing heavily in it, he and his group decided to abandon HDF5, because the benefits did not outweigh the disadvantages. This is extremely valuable reading, although some of the issues he raises have been remedied by now (for example, there is now a public GitHub repo for HDF5).
This post is quite typical of the circumspect and conditional support that HDF5 receives, even from people who claim to like it.
O’Reilly has published a book on using HDF5 with Python.
Finally, here are some considerations on using HDF5 in the cloud. I am somewhat concerned that it is several years old. What has happened since?