My pager went off Saturday morning with the following alert: "/dev/hda6 is read-only!"
I can't remember the last time I received this alert. The filesystem in question is home to this web site and a few others. Being read-only isn't a major problem, since most of the data on the filesystem only needs to be read (images, static HTML, etc.); anything more interesting runs out of a database on another filesystem.
Since Saturday is generally a slow day for the site, it seemed like a great time to figure out what the problem was and get it fixed without having to wait for a late night maintenance. The first thing I did was check /var/log/messages to see if any error messages were logged. Here's what I found:
kernel: EXT3-fs error (device hda6): ext3_free_blocks: Freeing blocks in system zones - Block = 65536, count = 1
kernel: Aborting journal on device hda6.
kernel: ext3_abort called.
kernel: EXT3-fs error (device hda6): ext3_journal_start_sb: Detected aborted journal
kernel: Remounting filesystem read-only
kernel: EXT3-fs error (device hda6) in ext3_free_blocks_sb: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_truncate: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_orphan_del: Journal has aborted
kernel: EXT3-fs error (device hda6) in ext3_reserve_inode_write: Journal has aborted
kernel: __journal_remove_journal_head: freeing b_committed_data
kernel: __journal_remove_journal_head: freeing b_committed_data
In a nutshell, the kernel hit a filesystem error (an attempt to free a block in a system zone), aborted the journal, and then remounted the filesystem read-only.
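For reference, here is a minimal sketch of how a read-only check like the one behind that alert might be written. It is an assumption on my part, not necessarily how the alert above was implemented: the function name is_readonly is hypothetical, and it relies on the Linux /proc/mounts format (device, mount point, type, options).

```shell
#!/bin/sh
# is_readonly MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: exits 0 if MOUNTPOINT is mounted with the "ro"
# option, 1 otherwise. MOUNTS_FILE defaults to /proc/mounts (Linux); the
# second parameter only exists so the function can be tried on a sample file.
is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            # If a mount point is listed more than once, the last entry wins.
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit (ro ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Example: print an alert-style message if /home is currently read-only.
if is_readonly /home; then
    echo "/home is read-only!"
fi
```

Splitting the options field on commas matters here: an option such as errors=remount-ro contains the string "ro" but does not mean the filesystem is read-only.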
First, I shut down anything that was using the read-only filesystem. Since it was mounted on /home, I used the lsof command to find anything using the filesystem.
# lsof | grep home
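A note on that command: grepping for "home" matches any line containing that string, whether or not the file lives on the affected filesystem. A slightly more targeted variant, sketched here as an alternative, is to hand the mount point to lsof directly, or to use fuser -m. The wrapper name mountpoint_users is hypothetical, and which of the two tools is installed is an assumption about the system.

```shell
#!/bin/sh
# mountpoint_users MOUNTPOINT
# Hypothetical wrapper: list processes with files open on the filesystem
# mounted at MOUNTPOINT, using whichever tool is available.
mountpoint_users() {
    if command -v lsof >/dev/null 2>&1; then
        # Given a mount point, lsof lists open files on that filesystem.
        lsof "$1"
    elif command -v fuser >/dev/null 2>&1; then
        # -m: all processes using the mounted filesystem; -v: verbose table.
        fuser -vm "$1"
    else
        echo "neither lsof nor fuser found" >&2
        return 127
    fi
}
```

Running mountpoint_users /home as root would then list only processes that actually hold files open on that filesystem.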
The lsof output showed me that Apache and one user were using the affected filesystem. I logged the user out first and then shut down Apache, which caused the site downtime mentioned in the title. I then unmounted the filesystem and ran fsck on it.
# umount /home
# fsck /dev/hda6
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/home: recovering journal
/home contains a file system with errors, check forced.
<snip for brevity>
/home: ***** FILE SYSTEM WAS MODIFIED *****
/home: 158016/4788672 files (12.6% non-contiguous), 3631550/4785354 blocks
#
# mount /home
After remounting the filesystem, I checked /var/log/messages again to make sure everything was fine. The kernel had mounted the filesystem properly and reported no errors. I then restarted Apache.
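That log check can be narrowed to the relevant lines. The following is only a sketch: the function name recent_fs_errors and the 200-line default window are arbitrary choices of mine, and the match string is taken from the kernel messages quoted earlier.

```shell
#!/bin/sh
# recent_fs_errors DEVICE [LOGFILE] [LINES]
# Hypothetical helper: print EXT3 error lines for DEVICE (e.g. hda6) found
# in the last LINES lines of LOGFILE. Defaults: /var/log/messages, 200 lines.
# Exit status is grep's: 0 if any error line was found, 1 if none.
recent_fs_errors() {
    tail -n "${3:-200}" "${2:-/var/log/messages}" |
        grep -F "EXT3-fs error (device $1)"
}
```

Calling recent_fs_errors hda6 after the remount prints nothing if no matching errors fall inside the window. Errors logged before the fsck will still show up if they are within the last 200 lines, so the window may need adjusting.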
Total downtime for the few sites running on this server was around 10 minutes.
Photo by channah.