My pager went off Saturday morning with the following alert: "/dev/hda6 is read-only!"
I can't remember the last time I received this alert. The filesystem in question is the home of this web site, as well as a few others. Being read-only isn't a major problem, since most of the data on the filesystem only needs to be read (images, static HTML, etc.). Anything of interest runs out of a database on another filesystem.
Since Saturday is generally a slow day for the site, it seemed like a great time to figure out what the problem was and fix it without waiting for a late-night maintenance window. The first thing I did was check /var/log/messages to see if any error messages were logged. Here's what I found:
kernel: EXT3-fs error (device hda6): ext3_free_blocks: Freeing blocks in system zones - Block = 65536, count = 1
kernel: Aborting journal on device hda6.
kernel: ext3_abort called.
kernel: EXT3-fs error (device hda6): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only
kernel: EXT3-fs error (device hda6) in ext3_free_blocks_sb: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_truncate: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_orphan_del: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: __journal_remove_journal_head: freeing b_committed_data
kernel: __journal_remove_journal_head: freeing b_committed_data
In a nutshell, the kernel hit a filesystem error while freeing blocks and aborted the journal as a result. It then remounted the filesystem read-only.
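For anyone who wants to confirm the read-only state by hand, here is a minimal sketch of one way to do it. It assumes a Linux mount table in the /proc/mounts format (device, mount point, type, options, ...). The function name ro_mounts_for_device is made up for illustration and is not the check that paged me.

```shell
#!/bin/sh
# Print the mount point of every read-only mount of the given device.
# Reads mount-table text in /proc/mounts format on stdin:
#   device mountpoint fstype options dump pass
# ro_mounts_for_device is a hypothetical helper name.
ro_mounts_for_device() {
    awk -v dev="$1" '
        $1 == dev {
            # Options are comma-separated; match "ro" exactly so that
            # something like "errors=remount-ro" is not mistaken for it.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] == "ro") { print $2; break }
            }
        }'
}

# Usage, with the device name from the alert:
#   ro_mounts_for_device /dev/hda6 < /proc/mounts
```

If the function prints /home, the device is currently mounted read-only there. No output means no read-only mount of that device was found.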
First, I shut down anything that was using the read-only filesystem. Since it was mounted on /home, I used the lsof command to find anything using the filesystem.
# lsof | grep home
This showed me that Apache and one user were using the affected filesystem. I logged out the user first, and then shut down Apache. That resulted in the site downtime mentioned in the title. I then unmounted the filesystem and ran fsck on it.
# umount /home
# fsck /dev/hda6
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/home: recovering journal
/home contains a file system with errors, check forced.
<snip for brevity>
/home: ***** FILE SYSTEM WAS MODIFIED *****
/home: 158016/4788672 files (12.6% non-contiguous), 3631550/4785354 blocks
#
# mount /home
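The session above doesn't show the exit status fsck returned, but when the check is scripted that status is worth inspecting. As a rough sketch, the helper below decodes the bit flags documented in the fsck(8) man page. The name describe_fsck_status is made up for illustration, and the descriptions are paraphrased from the man page.

```shell
#!/bin/sh
# Decode an fsck exit status. Per fsck(8) the status is the sum of
# these flags: 1, 2, 4, 8, 16, 32, 128 (0 means no errors).
# describe_fsck_status is a hypothetical helper name.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for pair in \
        "1:filesystem errors corrected" \
        "2:system should be rebooted" \
        "4:filesystem errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:checking cancelled by user request" \
        "128:shared-library error"
    do
        bit=${pair%%:*}    # number before the colon
        text=${pair#*:}    # description after the colon
        if [ $(( $status & $bit )) -ne 0 ]; then
            echo "$text"
        fi
    done
}

# Usage (as root, with the filesystem unmounted):
#   fsck /dev/hda6; describe_fsck_status $?
```

A run that repairs the filesystem, as this one did, would normally be expected to report "filesystem errors corrected". If "filesystem errors left uncorrected" shows up, the filesystem should not be remounted read-write until it has been checked again.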
After remounting the filesystem, I checked /var/log/messages again to make sure everything was fine. The kernel had mounted the filesystem properly and reported no errors. I then restarted Apache.
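Reading the log by eye works fine, but the same check can be narrowed to one device. This is a small sketch that assumes the kernel messages use the wording shown earlier in this post. The function name ext3_errors_for_device is made up for illustration.

```shell
#!/bin/sh
# Print kernel log lines that report an ext3 error or a journal abort
# for one device. Reads syslog-style text on stdin; the argument is the
# short device name as it appears in the log, for example hda6.
# ext3_errors_for_device is a hypothetical helper name.
ext3_errors_for_device() {
    # -F treats the patterns as fixed strings, so the parentheses and
    # the trailing period are matched literally.
    grep -F -e "EXT3-fs error (device $1)" \
            -e "Aborting journal on device $1."
}

# Usage:
#   ext3_errors_for_device hda6 < /var/log/messages
```

Because /var/log/messages still contains the entries from before the repair, the useful signal is that no new matching lines appear after the remount. Comparing the output of `ext3_errors_for_device hda6 < /var/log/messages | wc -l` before and after is one way to tell.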
Total downtime for the few sites running on this server was around 10 minutes.