Wednesday 14 September 2011

Problem: Clustered SQL 2008R2 setup crashed and SQL AGENT will not start

Description of the problem

A few days ago I was adding the third instance to a two-node SQL 2008R2 EE on a Windows 2008R2 EE failover cluster, while near to the end of the setup process of the first node, the setup.exe crashed, leaving the installation in an incomplete state.
I tried to run the repair tool but it would not work as "the installation has never been configured".
A similar crash of the setup.exe happened to me once before, but as it was the first instance setup, I preferred restoring the image of the server that had been made just after having finalized the Windows cluster configuration.
This time it was different: I already had two working instances running and I did not want to start from scratch nor cause downtime, etc.
Uninstalling a specific SQL Server Instance is not very easy when everything went the right way, removing a node from a Cluster could be even more challenging. I thought I had to try to manually repair the installation of the first node and then try to add the second node to the cluster.

Evaluating the situation

I started looking at how dramatic the situation could be: what had been installed, what would run and what would not. The SQL Server and SQL Server Agent services where present in the SQL Server Configuration manager but they were not started .
While I could find the SQL Server resource  in the Cluster Manager, The SQL Agent was nowhere. I tried to bring the Cluster resource online : Network, Disks and SQLServer started gracefully. I felt some relief, the fact that the cluster resource was online gave me some hope!
I went back to the SQL Server Configuration manager and I tried to start the SQL Server Agent service: it fell back so I went to the Application and System Event Logs and I found the errors logged by the SQL Server Agent service while it was trying to start. It cried a lot in fact, saying that it could not find the SQL instance it belonged.
Some of the error messages sounded like:

SQLServer Error: 65535, SQL Network
Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF].
And
Logon to server 'ClusterResource\InstanceName'
failed (DisableAgentXPs)
Strangely the cluster resource and the instance names were correct and after starting SQL Server I had successfully connected to it through SSMS.

Making the SQL Server agent service start

I Googled around and found and found this illuminating Post that explained me where the lied the Registry keys I was sure that I would have to tweak:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.InstanceName\SQLServerAgent]
There I was told to complete the ServerHost key with its value:  
"ServerHost"="ClusterResourceName\\InstanceName"
Again I tried to start the Service from SQL Configuration Manager and this time it started gracefully. So first step was done, but now I had to add the service to the Cluster resource.

Adding the SQL Server agent service to the cluster

How to add a specific service to a Cluster resource? Not by using basic Cluster.exe syntax like
cluster restype "SQL Server Agent" /create /DLL:sqagtres.dll
as I have found around in a very useful MSDN forum post. This was wrong because I already had a generic "SQL Server Agent" resource in the cluster (remember I already had two SQL Server instances installed?). Plus I needed to add the specific SQL Server Agent that was part of the instance I was installing (i.e. "SQL Server Agent (InstanceName)".

These are the steps I took:

-In the Failover Cluster Management I used Add Resource to add the SQL Server Agent resource.
-I named the resource as “SQL Server Agent (InstanceName)” after the real Instance name.
-Right-clicking on the SQL Agent resource I went to Properties and filled in values for the two parameters:
  1. The Virtual Server name of the SQL instance
  2. The Instance name
Now added the SQL Server Resource as a dependency for the newly created SQL Server Agent resource.
But one more magic step was necessary.

Make the system believe that the features were correctly installed and configured

Without this operation all above will fail. I also thought that if I had found this tweak before, I might have tried to repair the installation with the repair tool from the very start!
I opened the Registry Editor and browsed to the following location
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft Microsoft SQL Server\MSSQL10.INSTNAME\ConfigurationState
(this is a named instance) Here you have to make sure that the values for all the keys present are 1. My values for all the four features listed were "2" so I changed them all. This will set the configuration state for all SQL features "installed and configured". A value greater than 1 indicates either an install failure or configuration failure.  

Bringing online the cluster resource

It was time to try to bring the resource online, and It worked perfectly. I checked for eventlog errors but could find none. It was time to install the second node.

Adding a node to a SQL Cluster

Well, believe it or not, I really thought it would have been a straightforward operation, now that I had patched my setup on the first node. Of course, I was wrong.
During the simplified "add node to existing cluster" SQL setup process, Setup.exe asks less questions because once you have agreed that you are adding a node to the proposed instance, you're just asked with the passwords of the SQL Server and of the SQL Server Agent service accounts (those service accounts being read-only as they were of course already defined during the first node setup).
So what a surprise when setup.exe asked me for the password of an SQL server Agent's EMPTY service account! I sighed, cancelled the setup and went back to the first node. What to do next?
Well, I found out that it was my fault! I had not executed the fourth step that was advised in the MSDN forum: i.e. running a repair of the failed SQL Server installation!
This time the repair went straightforward. I launched the setup of the second node again and everything went perfectly. I then could do the usual post-setup tests and configuration operations and I am now the happy Administrator of my 16th clustered SQL Server instance!
   
Resources:
Let SQL Server Agent service start
http://www.eggheadcafe.com/software/aspnet/32049165/i-surrender-sql-server-agent-fails-to-start-2-node-cluster.aspx

SQL Server Agent not listed as cluster resource (in MSDN SQL Server forums)
http://social.msdn.microsoft.com/Forums/en-US/sqlsetupandupgrade/thread/5049fc44-0f44-4caf-9c5a-282edf30d878

How to: Repair a Failed SQL Server 2008 Installation (in MSDN library)
http://msdn.microsoft.com/en-us/library/cc646006.aspx
 

2 comments:

Anonymous said...

I would like to say thank you to the author of this blog for spending more time and effort to make this site more interesting to all blogger especially for me.

Business Information

Anonymous said...

Thanks for such a great article to share with us. These are things that buyers should definitely consider when buying a commercial bldg. and asking those question would help them make a smart decision.


small business information