Upgrade cycle nightmare

I just completed an upgrade cycle of our VMware vCenter/SRM environment. This upgrade was a nightmare of unknowns, hidden icebergs and a tar baby (http://en.wikipedia.org/wiki/Tar-Baby) or two.

First, during my first week, I was asked to install a vCenter in one of our data centers. This vCenter was to support some new ESXi vSphere 5.1 hosts, so I went ahead and installed it on a VM that had already been provisioned. I then found out that this system name had previously been used as part of a vCenter 5.0 Linked Mode environment.

Ruh-Oh!

Just so you know: don't mix versions of vCenter in Linked Mode.

Second, as I was unfamiliar with the site/environment, I installed this as a Simple vCenter install, which uses VMware SSO in basic mode.

Ruh-Oh!

I found out that someone had already installed SSO on another server running vCenter 5.0, and that this "new" vCenter should have been installed in multisite mode. Actually, I found out later that the SSO installation had never completed successfully.

As this installation "broke" Linked Mode, forcing people to upgrade their vSphere Clients and then not letting them log in properly, I went ahead and removed/uninstalled Linked Mode from all of my vCenter installations.

Ruh-Oh!

This did not work properly. I ended up using VMware KB 2005930 to edit the Linked Mode configuration at each location with ADSI Edit. This restored access to all locations, but it isolated each vCenter and caused additional administrative headaches.
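
For background, the Linked Mode data that ADSI Edit is poking at in KB 2005930 lives in the VMwareVCMSDS (ADAM/AD LDS) instance on each vCenter server. Below is a minimal, read-only Python sketch of browsing that instance with the ldap3 library just to see what is in there; the host name, credentials, and naming context are assumptions for illustration, and you should verify everything in ADSI Edit before deleting anything.

from ldap3 import ALL, Connection, Server

# Assumed vCenter host and Windows credentials -- replace with your own
server = Server("vcenter01.example.local", port=389, get_info=ALL)
conn = Connection(server, user="EXAMPLE\\vcadmin", password="********", auto_bind=True)

# Assumed naming context for the VMwareVCMSDS instance; confirm it in ADSI Edit
base_dn = "dc=virtualcenter,dc=vmware,dc=int"

# Read-only search: just list the entries that the KB procedure would touch
conn.search(base_dn, "(objectClass=*)", attributes=["cn"])
for entry in conn.entries:
    print(entry.entry_dn)

conn.unbind()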

During the next week I worked on documenting the Primary and Secondary vCenters that were running with SRM. During this documentation and planning period I was kindly asked/told to go through with the upgrades at both sites. This is where the real nightmare began.

At the primary site, after downloading the vCenter installer ISO and the SRM installers, I ran into issues getting SSO installed and running, and then getting the Inventory Service and Web Server service to recognize the Primary SSO installation. I finally resolved this by completely uninstalling all VMware components while keeping the vCenter 5.0 database.

Once vCenter was installed and some basic AD groups were added for permissions, the secondary site was up next to be upgraded. That installation was fairly seamless. I was able to connect these two sites in Linked Mode, though with no SRM functionality.
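
If you would rather script those permission assignments than click through the vSphere Client, something like the rough pyVmomi sketch below works; the vCenter address, credentials, AD group name, and role here are placeholders, not the ones from this environment.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut: skip certificate checks
si = SmartConnect(host="vcenter01.example.local", user="EXAMPLE\\vcadmin",
                  pwd="********", sslContext=ctx)
try:
    auth = si.content.authorizationManager
    # Look up the built-in Administrator role by name
    role_id = next(r.roleId for r in auth.roleList if r.name == "Admin")
    perm = vim.AuthorizationManager.Permission(
        principal="EXAMPLE\\vCenter-Admins",  # assumed AD group
        group=True,
        roleId=role_id,
        propagate=True)
    # Grant the role at the root folder so it propagates to every object below
    auth.SetEntityPermissions(entity=si.content.rootFolder, permission=[perm])
finally:
    Disconnect(si)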

Management had done their own research and thought that SRM 5.0.x would work with vCenter 5.1.0b. I informed them before starting that having SRM working would also require upgrading SRM. And so began the upgrades to SRM.

There was no documentation within the IT organization as to what was being protected by SRM. Of course... there never is documentation. I attempted an in-place upgrade of SRM 5.0.x to 5.1.0, but the SRM DR service would never stay running on the primary server. This necessitated opening an SR with VMware support. After working through the issue with multiple support engineers, uploading logs and escalating the case, we found that when using our existing SRM database the service would terminate a few seconds after being started, but when we used a new, empty database the service ran properly.

Wow. But I didn't know what Protection Groups had been created, what VMs were in each Protection Group, or what Recovery Plans were in place with possibly customized IP addresses. Did you know there is no tool to migrate the data from an existing SRM DB to a new SRM DB except during an installation? Well, there isn't... But VMware SRM Support did have an excellent support engineer who was able to help pull the data from the older SRM 5.0.x database using Microsoft SQL Server Management Studio. He was able to pull together the data showing our Protection Groups -> VMs and confirm that we had no customized recovery plans.
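
To give a flavor of what that looks like, here is a rough pyodbc sketch of dumping the Protection Group to VM mapping out of the old SRM database before retiring it. The server, database, table, and column names are my guesses for illustration only, not the actual queries the VMware engineer ran against the SRM 5.0 schema.

import pyodbc

# Assumed connection details for the old SRM 5.0 database
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=sql01.example.local;"
                      "DATABASE=SRM_DB;Trusted_Connection=yes")
cur = conn.cursor()

# Hypothetical tables/columns joining Protection Groups to their member VMs
cur.execute("""
    SELECT pg.name AS protection_group, vm.name AS vm_name
    FROM   pd_protectiongroup pg
    JOIN   pd_protectedvm vm ON vm.group_id = pg.id
    ORDER  BY pg.name, vm.name
""")

for group, vm in cur.fetchall():
    print("%s\t%s" % (group, vm))

conn.close()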

This was great! We could install SRM with an empty DB and then "fairly easily" recreate the Protection Groups with their VMs and resource groups and storage and networks and... to get SRM up and running again. But! I also needed to be able to see the protected storage arrays. This required an upgrade to the SRA for EMC RecoverPoint: to work with VMware SRM 5.1.0 you need the SRA for EMC RecoverPoint 2.1. OK, sounds doable. But for that SRA to work you need RecoverPoint at version 3.5 or later. Sounds simple. It's not. RecoverPoint is not a user-upgradable solution at this moment, and it normally takes at least 5 business days to work through the EMC Change Control process for a RecoverPoint upgrade, during which all sorts of things are documented and verified. Since we use SRM to move production workloads, this process was shortened to less than 2 days.

The upgrade itself is fairly painless: you upload an ISO from EMC, and a special installer pushes it to the appliances, moves the protection from RPA 1 to RPA 2, and commits the upgrade of RPA 1. When the upgrade of RPA 1 completes at each site, the installer moves the protection back from RPA 2 to RPA 1 and commences the upgrade of RPA 2 at each location.

At this point I was able to add the RecoverPoint appliances as a protected array pair in SRM. Then I could create the Protection Groups, and finally I was able to test a Protection Group failover.

What I did not know at the time of the RecoverPoint upgrade was that we were using Replication Manager to protect non-VMware Oracle DB workloads running on Solaris M5000 hardware, and that Replication Manager needed an upgrade to version 5.4.2 or 5.4.3 to work with the latest version of RecoverPoint. We did the upgrade on the Replication Manager server and the agents on the various hosts, and it refreshed with current version info, but the replication jobs failed when trying to mount a RecoverPoint snapshot image to our recovery target M5000 server.

After uploading numerous log files and troubleshooting with EMC Replication Manager support engineers, it was identified that we needed an EMC Engineering hotfix for our source host to allow this to complete. This was a known issue that will likely be resolved in the next release of Replication Manager. To get a hotfix you need to cool your heels a bit and let EMC Engineering spin up a build specific to your environment. Once we had it installed on our Solaris M5000 host, the Replication Manager job completed and the data was available at our secondary site.

The point of all this is that while PLAN is a four-letter word, if you FAIL to PLAN you are really PLANNING to FAIL.
