Design of file formats (incomplete)
I worked with file formats for a while during my career, and I have also seen a lot of them produced by others. It is shocking how many bad decisions people make when designing file formats, in particular when it comes to one fundamental point:
Whatever you write on your customers’ disks is going to stay there, and you will have to support it forever.
With this consideration in mind, it’s incredible how little thought goes into file formats, both for input and output. Here are what I believe to be typical design mistakes in a file format.
- Putting the creation time and date in the file. This metadata may be
somewhat useful, but the filesystem already records it. Unless you need to
keep history information, it makes no sense. In addition, it makes the task of
testing the output for correctness much harder: in some cases a quick and
dirty md5 of the output does the job, but with a "creation date" inside the
file the md5 changes at every run, unless you can override the date.
- Putting "creator" information in the file. In some circumstances this is
rather irrelevant. Say you have a program A creating the file. The file now
contains "A" as creator. Now you open the file with program B, modify it
considerably, and save it. Who is the creator now? Well, technically A, but
this information is largely irrelevant: A didn't provide all the information
stored in the file.
- No version. This is so obvious that when I see no version information
in a file format I start to wonder if whoever designed the format
actually knew how to code.
- Version as a floating point number. OK, you decide your file version is
"1.1", but that does not mean it must be a floating point number. You are
very likely to extract that value from the file and then check it for equality
in order to decide what to do. Comparing floats for equality is a no-no, and in
any case it does not make sense. A floating point number is used to perform math
operations. A version number is an identification token. Do you really plan to
take the square root of your file version number anytime soon?
- Reinventing the wheel, squared. This is incredible. One of the big
problems with data is how to lay it out on disk. A file format will
most likely contain different kinds of data, and may need to be portable across
different platforms. You may also need some kind of check to see if
the file is corrupted at the basic storage layer, before opening it.
Some output files today are zip files under another name. ODP is a zip
file, and so is JAR. If you store computational data, HDF and NetCDF are good
choices. Reusing an existing container is very convenient, because it solves
the problem of disk representation for you, leaving you only the problem of the
semantic representation of your data.
- Lack of uniformity, lack of official specification.
- Lack of an IO layer.
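
To make a few of the points above concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a recommended format: the names `write_container`, `read_container`, `manifest.json` and `data.json` are hypothetical. It reuses zip as the container, stores the version as a pair of integers rather than a float, and pins the zip member timestamps to a constant so that two runs with the same input produce the same bytes (and therefore the same md5).

```python
import hashlib
import io
import json
import zipfile

# Hypothetical format version: (major, minor) as integers, never a float.
FORMAT_VERSION = (1, 1)
# Constant timestamp for every zip member, so the output does not depend
# on when the file was written.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def write_container(payload: dict) -> bytes:
    """Serialize payload into a zip-based container and return its bytes."""
    buffer = io.BytesIO()
    manifest = {"version": list(FORMAT_VERSION)}
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in (("manifest.json", manifest), ("data.json", payload)):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            # sort_keys keeps the JSON text stable between runs.
            archive.writestr(info, json.dumps(content, sort_keys=True))
    return buffer.getvalue()


def read_container(raw: bytes) -> dict:
    """Check integrity and version, then return the stored payload."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        # testzip() re-reads every member and verifies its CRC.
        if archive.testzip() is not None:
            raise ValueError("container is corrupted")
        version = tuple(json.loads(archive.read("manifest.json"))["version"])
        # Integer comparison: no floating point equality involved.
        if version[0] != FORMAT_VERSION[0]:
            raise ValueError(f"unsupported major version {version[0]}")
        return json.loads(archive.read("data.json"))


if __name__ == "__main__":
    first = write_container({"energies": [1.0, 2.5]})
    second = write_container({"energies": [1.0, 2.5]})
    # Same input, same bytes: a quick md5 is enough to compare outputs.
    print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())
    print(read_container(first))
```

A side benefit of integer pairs is ordering: `(1, 10) > (1, 9)` holds for tuples, whereas the strings "1.10" and "1.1" parse to the same float.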