Recovering datasets from broken ZFS raidz pools


There are generally two kinds of people—those who’ve suffered a severe data loss and those who are about to suffer a severe data loss. I repeatedly jump back and forth between the two kinds.

Recently, a combination of hardware defects and a series of power outages rendered the raidz pool of my previous research group’s NAS unreadable. The OS, an old Solaris 10 x86, refused to import the pool and reported the dreaded I/O error. We tried importing it in various modern OpenSolaris-based live distributions, even forcing the kernel to try to fix errors where possible, without success. Perhaps disabling the ZIL (because of performance problems with NFS clients) wasn’t such a good idea after all. The lack of resources for proper preventive maintenance meant that there were no real backups to restore from. Gone were a lot of research data, source code, PhD theses, mail, and web content.

It all happened in the middle of several ongoing project calls. In the face of the growing despair and the rapidly approaching need to accept that the data was most likely gone for good and that we would have to start anew, I got curious: what could really break in the “unbreakable” ZFS? Until that moment, ZFS was to me just a magical filesystem that could do things such as cheaply creating multiple filesets and taking instantaneous snapshots, and I never had any real interest in learning how it is all implemented. This time my curiosity won, and I asked the sysadmin to wait a while before wiping the disks and to let me first poke around the filesystem to see if I could make it readable again. What started as a set of Python scripts to read and display on-disk data structures quickly grew into a very functional, minimalistic ZFS implementation capable of reading and exporting entire datasets.


How to really screw up your mailing list migration (in ten easy steps)

(based on a true story)

Step 0. Find an existing low-volume announcements-only mailing list, e.g. that of a certain scientific software.

Step 1. Migrate the users to a new mailing list meant for both announcements and general discussions.

Step 2. Make the new mailing list unmoderated and set the Reply-To address to be the list submission address.

Step 3. Post the announcement about the change on the list itself.

Step 4. Realise that many users have completely forgotten that they were on the former list (because it was a really low-volume one).

Step 5. Watch as angry unsubscribe messages (mostly from people who have already forgotten about the software) start hitting the list and result in a quickly growing avalanche of angry unsubscribe requests.

Step 6. Realise that step 2 was a really really really dumb one.

Step 7. Realise that step 1 was probably dumb too.

Step 8. Unsubscribe everyone from the new mailing list.

Step 9. Spend some time cleaning up your inbox.


Optional (recommended): Let an experienced list administrator show you how to do it properly with two separate lists. It’s not like your server can’t handle two instead of one, is it?

Never let scientists do the work of the system administrators!


Reforms, especially bureaucratic ones, are not something that happens overnight. After 500-odd days of inactivity, the site has been reformed into a completely static version generated with Nikola. Most of the old URLs are still valid and redirect to the new locations of the writings, but the RSS feed is moving to a new address. The PixelPost gallery has been completely transformed into something not as pretty (yet), but conceptually slightly different.

Being written in Python, Nikola prefers reStructuredText, the markup language of Docutils, although much of its functionality can also be used from Markdown. “Translating” from Textile (the markup language of Textpattern) to reStructuredText was not particularly trivial, although Pandoc helped a lot.

The main advantage of Nikola, as of most similar generators (or rather compilers) of static content, is that all the text content comes from plain text files, which can be created with any editor and kept (together with all the other associated files) in a version control system, e.g. Git. There is no web management interface, no need for databases, and not a single server-side script, which is good for the security of the site. The source of each page (except for the automatically generated indexes) is accessible through the Source link in the upper right corner of the title bar.

Now all that remains is for me to start writing again.

Recipe: Obtaining peak VM size in pure Fortran

Often in High Performance Computing one needs to know the various memory metrics of a given program, with the peak memory usage probably being the most important one. While the getrusage(2) syscall provides some of that information, its use in Fortran programs is far from optimal and there are lots of metrics that are not exposed by it.

On Linux one could simply parse the /proc/PID/status file. Being a simple text file, it can easily be processed entirely with the built-in Fortran I/O machinery, as shown in the following recipe:

vmpeak.f90

program test
  integer :: vmpeak

  call get_vmpeak(vmpeak)
  print *, 'Peak VM size: ', vmpeak, ' kB'
end program test

! Returns current process' peak virtual memory size             !
! Requires Linux procfs mounted at /proc                        !
! Output: peak - peak VM size in kB, or -1 on failure           !
subroutine get_vmpeak(peak)
  implicit none
  integer, intent(out) :: peak
  character(len=80) :: stat_key, stat_value
  peak = 0
  ! file= is the standard specifier (name= is a vendor extension)
  open(unit=1000, file='/proc/self/status', status='old', err=99)
  do while (.true.)
    ! err= catches read errors, end= catches the end of the file
    read(unit=1000, fmt=*, err=88, end=88) stat_key, stat_value
    if (stat_key == 'VmPeak:') then
      ! list-directed internal read converts the text to an integer
      read(unit=stat_value, fmt=*) peak
      exit
    end if
  end do
88 close(unit=1000)
  ! On success return here instead of falling through to the error handler
  if (peak /= 0) return
99 print *, 'ERROR: procfs not mounted or not compatible'
  peak = -1
end subroutine get_vmpeak

The code reads the status file of the calling process, /proc/self/status. The unit number is hard-coded, which could clash with a unit that is already open elsewhere in the program. Compilers that implement Fortran 2008 support the NEWUNIT specifier, so the following code could be used instead:

integer :: unitno

open(newunit=unitno, file='/proc/self/status', status='old', err=99)
! ...

With older compilers the same functionality could be simulated using the following code.
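
One common way to do this, given here only as a minimal sketch, is to probe unit numbers with INQUIRE until one turns up that is not connected to any file. The function name get_free_unit and the searched range 100 to 999 are assumptions made for illustration.

```fortran
! Hypothetical helper: returns a unit number that is not currently
! connected to any file, or -1 if none is found in the searched range.
integer function get_free_unit()
  implicit none
  ! Assumed search range; low numbers are avoided because some
  ! compilers preconnect them to the standard streams.
  integer, parameter :: first_unit = 100, last_unit = 999
  integer :: u, ios
  logical :: is_open

  get_free_unit = -1
  do u = first_unit, last_unit
    inquire(unit=u, opened=is_open, iostat=ios)
    if (ios == 0 .and. .not. is_open) then
      get_free_unit = u
      return
    end if
  end do
end function get_free_unit

program test_free_unit
  implicit none
  integer, external :: get_free_unit
  integer :: unitno

  unitno = get_free_unit()
  if (unitno < 0) stop 'no free unit available'
  open(unit=unitno, file='/proc/self/status', status='old', action='read', err=99)
  print *, 'Opened /proc/self/status on unit ', unitno
  close(unit=unitno)
  stop
99 print *, 'ERROR: procfs not mounted or not compatible'
end program test_free_unit
```

Unlike NEWUNIT, this search only looks at units that are open at the moment of the call, so the returned number should be passed to OPEN right away, before other code has a chance to claim it.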

MPI programming basics

Embracing the current development in educational technologies, the IT Center of the RWTH Aachen University (former Center for Computing and Communication) makes available online the audio recordings of most tutorials delivered during this year’s PPCES seminar. Participation in PPCES is free of charge and course materials are available online, but this is the first time that proper audio recordings were made.

All videos (presentation slides + audio) are available on the PPCES YouTube channel under a Creative Commons Attribution license. Course materials are available in the PPCES 2014 archive under an unclear (read: do not steal blatantly) license.

My own contribution to PPCES - as usual - consists of:

  • Message passing with MPI, part 1: Basic concepts and point-to-point communication

  • Message passing with MPI, part 2: Collective operations and often-used patterns

  • Tracing and profiling MPI applications with VampirTrace and Vampir

Big thanks to all the people who made recording and publishing the sessions possible.