Ring Ring... Hello. Is our mail server not working?
I normally only get phone calls on my work mobile when something is badly wrong. It's never good news. And when it's the mail server I just know it's gonna be a long day.
Despite what some people think about email and its importance, once companies take things for granted all hell can break loose when those same things are taken away from them. We all rely on email. Period.
OK, take it easy. Reboot the mail server. It will be back up and running whilst I drive into the office.
3 phone calls along the M3 later and I realised today was not going to be a very productive day coding. I was gonna spend the day with Exchange.
The last sentence I wrote may sound very calm yet when this happens on one of the Exchange servers you are responsible for you will be anything but calm. Once the stores in the storage group are dismounted, users are disconnected from their precious information (mail, calendars, contacts) and they will come waving pitchforks.
I find Exchange server to be a very complicated system and, as in most complicated systems, the most "trivial things" may bring it to its knees. And today, when I discovered the eventual reason that our Exchange database was corrupt and wouldn't mount, I nearly cried out with pain & laughter.
A very oversimplified analysis of an Exchange server may state that an Exchange server is nothing more than a database server that has some exotic extensions through which users can manipulate their data. This analysis (even though oversimplified) is not far from the truth, and it emphasizes the importance of the database that stores the user's information on an Exchange server.
Exchange server uses a database technology called ESE (Extensible Storage Engine), which is based on the JET (Joint Engine Technology) database engine.
The ESE engine employs several files upon which the database is built (I have only specified the ones that are relevant to our topic):
- Store files - The store files hold the information that is already committed to disk. Each Exchange store (Private and Public) consists of two files:
  - EDB - The rich-text database stores information submitted by MAPI clients in a proprietary format called Microsoft Database Encapsulated Format (MDBEF).
  - STM - The native content (streaming) database holds all data that is submitted by non-MAPI clients.
- Transaction Log files - A log file stores altered data before it is committed to the database. A set of log files is unique to a storage group. Each log file's name begins with a prefix that identifies the storage group it belongs to, e.g. E00XXXXX.log (belongs to the first storage group). The suffix of each log file's name is a sequentially assigned hexadecimal number.
The active log file for a specific storage group is always called E0X.log (X represents the storage group, so the first storage group uses E00.log). Once it is filled (5MB) it is renamed using the next sequential hexadecimal number and a new E0X.log is created. Since log files are not erased by default (storage groups do not use circular logging by default), the space on a log disk will eventually be depleted. The standard procedure for removing unused log files is backing the system up (full or incremental).
As mentioned earlier, the size of a log file is 5120KB. If you find that the size of a file is different, you may be looking at a corrupt log file.
Each set of log files has two placeholder files called res1.log and res2.log. When the log disk runs out of space, the storage group uses these files to store altered information before it dismounts.
- Checkpoint File - The checkpoint file is used to track which transactions have been committed to the database and which transactions still have to be committed. The name of the file is E0X.chk (X stands for the storage group, so E00.chk for the first storage group) and its size is 8KB.
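The naming and size rules above are easy to check with a script. What follows is only a minimal sketch, not an official tool: it assumes the first storage group's E00 prefix with five hexadecimal digits, as described above, and the function names (`check_logs`, `missing_generations`) and the `expected_size` parameter are my own inventions for illustration. A file with an unexpected size, or a gap in the sequence, is a hint that something deserves a closer look, nothing more.

```python
import re
from pathlib import Path

EXPECTED_LOG_SIZE = 5120 * 1024  # 5120KB, the log file size quoted in the text

# Numbered logs: storage group prefix (E00, E01, ...) plus five hex digits.
LOG_NAME = re.compile(r"^(E0\d)([0-9A-F]{5})\.log$", re.IGNORECASE)


def check_logs(log_dir, prefix="E00", expected_size=EXPECTED_LOG_SIZE):
    """Return (sequence_numbers, suspect_files) for one storage group's logs."""
    sequences = []
    suspects = []
    for path in sorted(Path(log_dir).glob("*.log")):
        name = path.name
        is_active = name.lower() == f"{prefix}.log".lower()
        match = LOG_NAME.match(name)
        is_numbered = bool(match) and match.group(1).upper() == prefix.upper()
        if not (is_active or is_numbered):
            continue  # skips res1.log, res2.log and other storage groups
        if is_numbered:
            sequences.append(int(match.group(2), 16))  # suffix is hexadecimal
        if path.stat().st_size != expected_size:
            suspects.append(name)
    return sorted(sequences), suspects


def missing_generations(sequences):
    """List sequence numbers absent between the lowest and highest log found."""
    if not sequences:
        return []
    present = set(sequences)
    return [n for n in range(min(sequences), max(sequences) + 1) if n not in present]
```

Run it against a copy of the log folder rather than the live one if you can, and treat the result as a starting point for the Microsoft tools, not a replacement for them.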
The symptoms our mail server was displaying were that the stores would not mount, and all the event log messages seemed to indicate that we had run out of disk space. Now, as I know this can spell certain death to an Exchange box, I regularly check disk space - we had plenty of space - another dead end. Google?
http://support.microsoft.com/?id=819553
This gave me the exact error messages being reported. But the E00.log it mentions was not missing. Another dead end. The article also warns about anti-virus. Now I knew this, and when I installed our server I set up server rules to exclude the Exchange folders from scanning. Another blank.
OK, on to the next Google search.
How to test Exchange transaction log files for corruption - http://support.microsoft.com/kb/248122
A good few hours later - no corruption detected. People were starting to wonder if they'd ever see their mail again. I wondered that myself, and let slip that the database might be lost.
At this point, I needed a cool head. Despite pressure from all sides, I knew I shouldn't set an unrealistic time for completing the restoration of service. Based on my experience, I recommend that you follow these steps.
- Find the last backup.
- Take a copy of the mailbox and public folder stores (Exchange and streaming databases).
- Make a copy of the transaction logs
- Disable inbound mail connections
- Keep a log of the restore process
OK, backups weren't an option here as our backup scheme had been failing since mid-December. Don't let this happen to you. Don't get me started talking about backups.
Although a database might be corrupt, you must take a copy of the existing databases. Don't forget to take a copy of the streaming database.
A restore can overwrite the corrupt database, so you need a way back to the state of your database when the corruption occurred. If the restore is unsuccessful, you might be able to repair this database. In this scenario, you want the most recent version of the database, even if it's damaged. If the files are large, you can save time by renaming them to something meaningful (e.g., priv1.oldedb, priv1.oldstm). Remember to leave enough disk space on the database drive for the restore.
Making a copy of the transaction logs is crucial because the transaction logs enable recovery up to the moment of the outage. Check the dates of the transaction logs, and verify that you created them since the last backup.
A server recovery might require several restore attempts, and you don't want queued inbound messages delivered until you're satisfied that the restore process was successful and everything is running normally. Therefore, you need to disable the default SMTP virtual server.
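The "take a copy" and "keep a log" steps can be scripted so they are done the same way on every attempt. This is a rough sketch under my own assumptions: the function name `snapshot_store`, the `pre-restore-` folder naming and the `manifest.txt` record are all hypothetical, and it expects the stores to be dismounted already, since the files are held open while a store is mounted.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Database, streaming, transaction log and checkpoint files.
EXTENSIONS = {".edb", ".stm", ".log", ".chk"}


def snapshot_store(source_dir, dest_root, extensions=EXTENSIONS):
    """Copy the store files into a new timestamped folder and record what was copied."""
    source = Path(source_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"pre-restore-{stamp}"
    dest.mkdir(parents=True, exist_ok=False)  # never reuse an earlier snapshot folder

    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            shutil.copy2(path, dest / path.name)  # copy2 also preserves timestamps
            copied.append(path.name)

    # A plain-text record of the copy, as part of the restore log.
    (dest / "manifest.txt").write_text("\n".join(copied) + "\n")
    return dest, copied
```

Copying rather than renaming costs time and disk space on large stores, so check the free space on the destination drive before starting.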
I now tried to remove the transaction logs - giving these up as lost. Now can I mount the db? No.
I used the Eseutil utility to examine the database header and learned that the database was in an inconsistent state. ESEUTIL /mh <path to database file> == Dirty Shutdown
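If you have several databases to check, the header dump can be read by a script. The sketch below assumes that eseutil is on the PATH and that its header dump contains a line of the form "State: Dirty Shutdown" or "State: Clean Shutdown"; the function names are hypothetical, and you should confirm the exact wording against the output on your own server.

```python
import re
import subprocess

# Matches a header line such as "            State: Dirty Shutdown".
STATE_LINE = re.compile(r"^\s*State:\s*(.+?)\s*$", re.MULTILINE)


def database_state(header_dump):
    """Return the text after 'State:' in a header dump, or None if it is absent."""
    match = STATE_LINE.search(header_dump)
    return match.group(1) if match else None


def is_clean(header_dump):
    """True only when the header reports a clean shutdown."""
    return database_state(header_dump) == "Clean Shutdown"


def read_header(db_path):
    """Run 'eseutil /mh' on a database file and return its output as text.

    Assumes eseutil is on the PATH; run it only against a dismounted store.
    """
    result = subprocess.run(
        ["eseutil", "/mh", str(db_path)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, `is_clean(read_header(path))` would come back False for a database in the state I was looking at.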
OK, now I had to recover the database using eseutil to get the db back to a clean shutdown but this just gave errors.
Last chance saloon. Repair the db using the same tool, even though it is strongly recommended not to do this in a dirty state. Seems to work?
Can I mount the stores?
YES! Hooray. Even before I got up to get a fresh cup of tea, people started noticing their mail arriving.
OK, I sit down and open my mail. And sitting there in my Alerts folder is a virus alert on the mail server. Reading, everything started to make sense. The anti-virus software had located some data in the Exchange log files that matched a known virus signature. And deleted it! No wonder Exchange collapsed. All the times tied together - I had found the cause.
Then I remembered that I had ruled out anti-virus software being involved because I had set up exclusions on the Exchange data folders - why hadn't that worked?
Checked the settings - missing! Damn it! My exclusions had been lost somehow. OK - I put them back.
On the way home I suddenly realised that the settings must have been lost when the anti-virus upgraded itself recently after a major bug was found.
I chuckled as I wondered how many people worldwide had just had a day just like mine!
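Since the root cause was a set of exclusions that quietly vanished, a check worth running after every anti-virus upgrade is whether the Exchange folders are still covered. This is a minimal sketch only: how you read the exclusion list out of your anti-virus product is specific to that product and not shown, and the function names and the folder paths in the usage note are hypothetical.

```python
from pathlib import PureWindowsPath


def is_excluded(folder, exclusions):
    """True if folder is one of the excluded paths or sits beneath one of them."""
    target = PureWindowsPath(folder)
    for entry in exclusions:
        excluded = PureWindowsPath(entry)
        # PureWindowsPath compares case-insensitively, as Windows paths do.
        if target == excluded or excluded in target.parents:
            return True
    return False


def unprotected(required_folders, exclusions):
    """Return the required folders that no exclusion covers."""
    return [f for f in required_folders if not is_excluded(f, exclusions)]
```

Calling `unprotected([r"D:\Exchsrvr\mdbdata", r"E:\Exchsrvr\logs"], current_exclusions)` with made-up paths like these would list any folder the scanner is free to touch. An empty result is what you want to see.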