There are generally two kinds of people—those who’ve suffered a severe data loss and those who are about to suffer a severe data loss. I repeatedly jump back and forth between the two kinds.
Recently, a combination of hardware defects and a series of power outages rendered the raidz pool of the NAS of my previous research group unreadable. The OS, an old Solaris 10 x86, would not import the pool with a dreaded I/O error message. We tried importing in various modern OpenSolaris-based live distributions, even forcing the kernel to try and fix errors when possible, to no success. Perhaps disabling the ZIL (because of performance problems with NFS clients) wasn’t that good idea after all. The lack of resources for proper preventive maintenance meant that there were no real backups to restore from. Gone were a lot of research data, source codes, PhD theses, mails, and web content. In the face of the growing despair, as it all happened in the middle of several ongoing project calls, and the rapidly approaching need to accept that the data is most likely gone for good and one has to start anew, I got curious—what could really break in the “unbreakable” ZFS? Previous to that moment, ZFS was to me just a magical filesystem that can do all those things such as cheaply creating multiple filesets and instantaneous snapshots, and I never had real interest in learning how is this all implemented. This time my curiosity won and I asked the sysadmin to wait a while before wiping the disks and let me first poke around the filesystem and see if I could make it readable again. What started as a set of Python scripts to read and display on-disk data structures quickly grew into a very functional minimalistic ZFS implementation capable of reading and exporting entire datasets.