Design of file formats

I worked with file formats for a while during my career. I also saw a lot of them, produced by others. It is shocking how many bad decisions people do when designing file formats, in particular when it comes to a fundamental point

Whatever you write on the customers’ disk, is going to stay, and you will have to support, forever.

And with this consideration in mind, it’s incredible how poor thinking is done on file formats, both for input and output. Here are what I believe to be typical design mistakes of a file format.

- Putting the creation time and date. This metainformation may be somehow
  useful, but it's already supported by the filesystem. Unless you need to
  keep history information, it makes no sense. In addition, it makes the task of
  testing the output for correctness much harder, because in some cases, a quick
  and dirty md5 does the job. With a "creation date" inside the file, the md5
  changes at every run, unless you can override it.

- Putting a "creator" information. in some circumstances this is rather
  irrelevant. Say you have a program A creating the file. The file now
  contains "A" as creator. Now you open the file with program B, modify the file
  considerably, and save it. Who is now the creator ? well, technically A, but
  this information is largely irrelevant. A didn't provide all the information
  stored into the file.
- No version. This is so obvious that when I see no version information
  into a file format I start to wonder if the one who deployed the file
  actually knew how to code.
- Version as a floating point number. Ok, you decide your file version is
  "1.1", but that does not mean it must be a floating point number. You are
  very likely to extract that value from the file and then check it for equality
  in order to do things. Comparing floats for equality is a no-no, and in any
  case it does not make sense. A floating point number is used to perform math
  operations. A version number is an identification token. Do you really plan to
  do the square root of your file version number anytime soon ?
- Reinventing the wheel, squared. This is incredible. One of the big
  problems of data is how to lay them out on the disk. A file format will
  most likely contain different kind of data, and may need to be portable across
  different platforms. You may also need some kind of check in order to see if
  the file is somehow corrupted at the basic layer of storage, before opening it.
  Some output files today are zip files, masked as something else. ODP is a zip
  file, and so is JAR. If you store computational data, HDF and NetCDF are good
  choices. This is very convenient, because it solves you the problem of disk
  representation, leaving you only the problem of semantic representation of your
  data.
- Lack of uniformity, lack of official specification.
- Lack of an IO layer.