TechPublishing Now MS Certified

Professor Robert McMillen, MBA Microsoft Certified Trainer and Solutions Expert

Saturday, May 24, 2014

Mail server disaster

We had an interesting issue with a mail server yesterday. It took about 16 hours to get it going again, and I have never seen a problem like this one before.
At about 2PM, Outlook showed up as disconnected. I logged into the domain controller and noticed that the DC holding all the master (FSMO) roles was no longer acting as a global catalog. You can check this by looking at the event logs and by confirming that the SYSVOL and NETLOGON shares are missing on the DC. This was a 2008 server, but there was also a 2012 R2 server.
Neither server realized it was a domain controller anymore, since the GC was missing from both. The 2012 server said it was having trouble replicating with the GC on the 2008 server.
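For anyone who wants to script the share check described above, here is a minimal Python sketch. It assumes you run it on the DC itself and that the output of the built-in "net share" command lists each share name as the first word on its line. The function names are my own for illustration and are not part of any tool.

```python
import subprocess


def missing_dc_shares(net_share_output):
    """Return which of SYSVOL and NETLOGON do not appear as share names."""
    required = {"SYSVOL", "NETLOGON"}
    present = set()
    for line in net_share_output.splitlines():
        parts = line.split()
        # Only the first column (the share name) is compared, so a path
        # that merely contains "SYSVOL" does not count as the share.
        if parts and parts[0].upper() in required:
            present.add(parts[0].upper())
    return sorted(required - present)


def read_net_share():
    """Run `net share` on the local Windows machine and return its text output."""
    result = subprocess.run(["net", "share"], capture_output=True, text=True)
    return result.stdout
```

On the DC you would call missing_dc_shares(read_net_share()). An empty list means both shares are there. Anything else matches the symptom I was seeing.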
The 2008 server said it was having a journal wrap problem (JRNL_WRAP_ERROR). This can typically be fixed by changing the journal wrap registry value and restarting FRS.
In this case it didn't work.
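As a rough sketch of what that usual fix looks like, the Python below builds the commands involved. I am assuming the registry value in question is "Enable Journal Wrap Automatic Restore" under the NtFrs Parameters key, which is the one the FRS event text normally points to. The post above does not name it, so treat that as my assumption. The function names are made up for illustration.

```python
import subprocess

# Assumed location of the FRS journal wrap setting (not named in the post).
NTFRS_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters"
JOURNAL_WRAP_VALUE = "Enable Journal Wrap Automatic Restore"


def journal_wrap_fix_commands():
    """Return the commands, in order: set the value to 1, then restart FRS."""
    return [
        ["reg", "add", NTFRS_PARAMS, "/v", JOURNAL_WRAP_VALUE,
         "/t", "REG_DWORD", "/d", "1", "/f"],
        ["net", "stop", "ntfrs"],
        ["net", "start", "ntfrs"],
    ]


def apply_journal_wrap_fix():
    """Run each command on the local server, stopping at the first failure."""
    for command in journal_wrap_fix_commands():
        subprocess.run(command, check=True)
```

Calling apply_journal_wrap_fix() from an elevated prompt on the affected DC would run the three steps. As noted, this did not help on our server.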
When I went to restore from backup I discovered more troubling news. The 2012 server showed good backups but the option in Backup Exec 2012 to restore was grayed out.
There could have been many reasons for this, so I turned to the 2008 restore instead. At first I couldn't find the system state to restore, but in BU Exec 2012 they had just moved it to a new location under the disaster recovery area.
I found the backed up state from the night before and attempted to restore from the BU Exec server after booting the 2008 server into Directory Services Restore Mode (DSRM).
When you log into DSRM you have to use the password used when setting up the server as a domain controller. If you didn't write it down then you have to go through more agony to reset it. Fortunately, we had this recorded so I logged into the server using this password created years before. This is a different password than the domain administrator password.
Here comes the catch-22. When I tried to run the restore, it said I could not use the domain administrator user and password while logged in with the DSRM account. I couldn't restore the system state in normal mode because it wouldn't properly overwrite the files, and I couldn't restore it in DSRM because the server couldn't reach the domain in that mode.
While all this was going on, the mail server still needed a working domain controller with a global catalog. Without one, its services just shut down. So now: no email and no domain.
The next thing to do was to get that system state restored so all of this would start working again.
I tried restoring the system state to the local backup server and then copying it over to the 2008 server, where I would use WBAdmin.exe to restore the state. The redirected restore worked and I copied the data over. The problem was that when I ran the restore command on the 2008 server, it said there was no backup to restore. I also tried Windows Server Backup and it said the same thing, even though there were 28 GB of system state data right there.
I looked at the event errors in Windows and BU Exec and found that the problem was that the restore mode user wasn't a member of the Backup Operators group. I had to restart the 2008 server into normal mode to add a new user I just created to that group. I decided to call the user backupadmin.
I added the new local user to the domain group and rebooted back into DSRM to attempt the restore again.
It failed again, this time with a new error saying the new user needed the "Log on as a batch job" right. That should have been no problem. I rebooted the 2008 server back into normal mode and opened the local security policy. I found the right under the computer configuration, policies, user rights area, but the option to change it was grayed out because the server was a domain controller. That was awesome, because this server thought it wasn't a DC anymore.
I then went into Group Policy Management and found the right again after running a group policy results wizard to locate it. It is in the security area under computer configuration in the Default Domain Controllers Policy. It was grayed out there as well. It was about 1AM and this had started about 2PM the previous day.
I decided that Microsoft might have the answer to this problem so I opened a ticket. They said they would get back to me in 2 hours. My brain was fried so after doing a little more research I ate a bunch of M&Ms, popped open a diet Coke, and started to play some Call Of Duty and watch some Netflix. At 3AM Microsoft calls and I spill my guts out like a tornado survivor. The engineer thinks about it for a second and says he's the wrong guy for this and says another engineer will call me back in another 2 hours. 
It was 3:30AM and I decided I would give it one more crack after doing more research. I found a command line tool in the Windows Server 2003 Resource Kit Tools called NTRights. Yes, that's right, 2003. I did some research on whether or not it would work on 2008, and most people said it would, so I installed the tools on the 2008 server. These are all command line tools, so I opened up a command prompt and tried to follow the directions of another engineer who fixed a similar issue years ago.
He said to type:
ntrights +r SeBatchLogon -u machinename\username
I did an ntrights /? and found no switch that matched up to this. I tried the command and it failed.
After doing more research I found I needed to add the word "Right" to the end of the Se name, and I typed this:
ntrights +r SeBatchLogonRight -u machinename\username
I wasn't sure if I was seeing it correctly but it said it took it. I stood there dumbfounded for a second. The mind plays tricks at 4AM.
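Since the missing "Right" suffix cost me time, here is a small Python sketch that builds the same ntrights command and rejects a name that is not one of the logon rights I know of. The list is only a handful of the standard logon right names, not everything NTRights accepts, and the function name is mine for illustration.

```python
# A few standard Windows logon right names. Each one ends in "Right".
KNOWN_LOGON_RIGHTS = {
    "SeBatchLogonRight",
    "SeInteractiveLogonRight",
    "SeNetworkLogonRight",
    "SeServiceLogonRight",
    "SeRemoteInteractiveLogonRight",
}


def build_ntrights_grant(right, account):
    """Return the argument list for `ntrights +r <right> -u <account>`."""
    if right not in KNOWN_LOGON_RIGHTS:
        raise ValueError(
            "Unknown logon right %r (did you leave off 'Right'?)" % right
        )
    return ["ntrights", "+r", right, "-u", account]
```

Passing "SeBatchLogon" raises a ValueError, while "SeBatchLogonRight" with machinename\username gives back the same command I typed above, ready to hand to subprocess.run on the server.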
I then went to the backup server and ran the restore once again and it started to restore. I waited around a while and noticed that it would take around 4 hours at this speed. I decided to go home. 
I asked my business partner Scott to come in and reboot the 2008 DC around 9AM. He came in and rebooted and everything, including email, all came back up. Yay!
When I woke up after a few hours of sleep, I saw on my cell phone that Microsoft engineer #2 had called around 6AM and left a message offering to help me with the problem.
I didn't call him back but I sent him an email saying it was fixed.
Now time for more sleep.