Virtualization Panic
After a few days running smoothly on ESXi, I got my first scare.
Here is what happened:
First, I upgraded the Windows XP instance running within the VM to SP3.
Then, I created a new Virtual Machine from scratch on my running ESXi server to learn how to do it.
The idea was to create a new virtual file server and move all my multimedia data into it so that I could shrink the size of my website VM.
Everything worked great. This is the process:
- Create new VM using the Virtual Infrastructure Client
- Attach the ISO image of Windows XP to the VM and set it up so that it would boot from it
- Configure the network to use an available IP address (192.168.1.4); I probably could have used DHCP.
Everything worked just fine. I did everything through the management UI and without reading the documentation. I just had to google how to make the VM boot from an ISO CD image to install Windows XP into it. I also had to mount a floppy disk image to install the VMware SCSI drivers for Windows XP. Very straightforward.
I started the copy of the “photos” directory to the new VM partition and I let it run overnight (60GB).
In the morning I realized that the main VM had restarted during the night. I blamed Windows’ automatic update service, which I now realize I should have disabled (ah! these simple bakers turned naive system administrators…).
I blamed Microsoft and re-started the copy operation.
But after a short while……. NOOOOOOOOOO!!!
The blue screen of death
Panic
I guess at this point I experienced what I am sure some customers go through sometimes. After being on the virtualization high from getting everything virtualized smoothly, I had my first blue screen of death and panicked a little.
Note that I had very reasonable excuses to:
- blame Microsoft because they ‘made’ me upgrade to SP3
- blame myself, because I jammed another virtual machine on the same server
But both excuses were scary and lame because:
- Virtualization is not supposed to impact your ability to patch and update the hosted operating system
- Virtualization is DESIGNED to let you run multiple VMs on the same server. That’s the whole point!!! (plus I was still running at 15% CPU utilization at most…)
So, I armed myself with more patience and confidence in the product and debugged a little deeper. I noticed that the error on the blue screen of death was “page_fault_in_nonpaged_area”, so I did some research and found out that this error is sometimes due to faulty memory chips… and I had my aha!!!! moment.
If you remember, when I changed the hard drive in the server, I also added a memory bank that I found in my hardware drawer…
So, I removed it and went back to 1.5 GB RAM and the system is back up and running.
And so is my confidence in VMware products!!!!
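Since part of the “photos” copy ran while the suspect memory bank was still installed, comparing the copied files against the originals seems like a sensible precaution. Here is a minimal sketch in Python of how that could be done, assuming both directory trees are reachable from the same machine. The function names and paths are hypothetical, not something I actually ran at the time.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(source_dir, copy_dir):
    """List relative paths that are missing from copy_dir or differ from source_dir."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        # A file counts as a mismatch if it is absent or its content hash differs.
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            mismatches.append(rel.as_posix())
    return mismatches


if __name__ == "__main__":
    # Hypothetical locations: the original folder and the share on the new file server VM.
    for name in find_mismatches("D:/photos", "//fileserver/photos"):
        print("needs re-copy:", name)
```

An empty result only says the two copies agree with each other. It does not prove the hardware is healthy, so it is best run after the bad memory has been removed.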
The Lesson
When deploying a new infrastructure, one needs to be careful about all the moving parts, document everything and most importantly be committed to the change.
I wanted it to work, and I know it works. So I did not blame the virtualization layer and went looking for the real cause, which turned out to be a faulty piece of hardware.
But what if I was not a champion of the technology and I did not know that it does reliably work in tens of thousands of production installations?
This is a little glimpse into a dynamic that plays out all the time in the relationship between IT shops and their technology and solution providers.
Customers need:
- Guidance
- References
- Commitment and Support
They want their vendors to have their back when they run into trouble.
Vendors need:
- A technology champion and commitment from the top within the customer
- Customer references that provide confidence but even more importantly prescriptive guidance on how to deploy the technology successfully
- To be there for the customer
More on guidance and commitment later, as they are core to my first project at VMware.