Data center woes; Always have a backup [plan | person | backup]

Matt and I are still involved in the data center business even though we've been hit time and time again with large customers walking out on significant bills. The hardest part about the business is getting paid after the fact. Companies have been great about signing up for our services, and the marriage is happy for at least a few months. They love us and the service; we love them and their quick payment routine. The problem with providing not only space and access to bandwidth but also application support is that even if an application has proven to be buggy (and we are *not* responsible for the complete design of the application), we somehow carry the blame for its problems - especially if it turns out that as one problem gets fixed, another crops up.

As a small example, a web server typically has some limit to the number of requests it can handle in a certain amount of time. Measuring that limit gets complicated quickly, because it depends on how requests are handed off to the web server and on how long a given user of a website sticks around. So, the stickiness of the user - i.e. length of time on the site - and the timeout you set for the server to decide that it no longer needs to keep a user's connection open both matter, and their combined effect is difficult to measure. Well, if you've designed an application that we assist in managing and this becomes an issue, there are no straightforward answers to why a server locks up or why connections are timing out too quickly or too slowly. Finding a happy medium is relatively easy if traffic remains static, but not if it continually fluctuates and grows. As time goes on, instead of working with us to resolve these sorts of problems and realizing that a lot of work is required to find a solution, people start wondering whether or not we know what we're doing. The answer is usually, yes, we do. Execution isn't always perfect, but we are good at solving problems. It is difficult, however, to tell someone that their program/application design is the root of the problem (or that it is simply *the way things work* when traffic grows) when they are your customer and the solution is going to require time (and, of course, money) and experimentation.
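To put rough numbers on the timeout problem, here is a back-of-the-envelope sketch using Little's law (average connections held open = request arrival rate x average time each connection stays open). The function name, the traffic figures and the 256-connection limit are all made up for illustration; they are not measurements from any customer's server.

```python
def connections_held(requests_per_second, seconds_held):
    """Little's law: average open connections = arrival rate * time held."""
    return requests_per_second * seconds_held

# Hypothetical limit: a server that can keep 256 connections open at once.
MAX_CONNECTIONS = 256

for rate in (10, 20, 40):        # requests per second (made-up traffic levels)
    for timeout in (5, 15):      # seconds a connection is kept open
        held = connections_held(rate, timeout)
        status = "ok" if held <= MAX_CONNECTIONS else "over the limit"
        print(f"{rate:>3} req/s, {timeout:>2}s timeout -> {held:>4} open ({status})")
```

With these invented numbers, a 15-second timeout is fine at 10 requests per second (150 open connections) and over the limit at 20 (300 open connections). That is roughly why a setting tuned for last month's traffic can lock a server up this month.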

Friday, we ran into a completely different problem. One mostly of our own making. A good customer decided to do a major upgrade to his server (he is a bit of a unique customer in that he is neither a co-location customer, nor a managed server customer; he manages his own server, but we provide the hardware, electricity, heating/cooling and bandwidth to him) and something went awry. Both Matt and I have big families and Thanksgiving is the one time of year that everyone gets together to celebrate. Not only do we see our relatives on Thursday, but throughout the weekend. To top it off, most of the people with whom we work are in the same situation.

The first problem we found was that the lights out management (LOM) on his server was not properly configured: either the server had never been correctly configured to begin with, or the LOM somehow lost its configuration during the reboot (highly unlikely, but still possible). This meant that without physical access to the machine (which we did not have because both of us were 300 miles away from the data center), we would need to call in someone to reconfigure the LOM.

This is where the second, and most critical, problem cropped up. The data center where we lease space didn't have anyone in house. Let that sink in a moment. We lease floor space from a data center where there are a few other customers like ourselves (the data center provides electricity, heating/cooling, bandwidth and security - and, we thought, a person to help with basic issues when needed). They have racks and racks of their own equipment that they monitor and manage. No one was in house. Matt learned that they had someone on call who would be in later that evening to do normal maintenance. We had the option of calling it an emergency and having someone come in specially for us and our customer, or we could wait until their employee came in for the normal checkups. We talked to the customer and found out that waiting was ok.

By the time the LOM on the machine was fixed, the customer had been down for the better part of seven hours. I'm not ok with this, but things happen and this was one of those rare times where our reaction time was much worse than expected. We would make changes to deal with this type of problem so it didn't happen again. (I had also called another gentleman who had helped us in the past, but he wasn't willing to come in and help. That is another story in and of itself, but I won't go into the details.) Matt was contacted an hour or so later by our customer and told that the LOM settings had been lost again. The guy who had been at the data center was gone again, and the next visit was going to be at 11 am the next day.

What to do...

I doubt the customer was happy, but it was left at that. I expected that by 11:30 am we would be back on the server, and this time I would look around and try to gauge what was going on. (We weren't sure, but thought that the boot record and partition table of the server had been wiped, and were wondering what could possibly have caused that condition. Usually, in this situation, you are happy you make backups and commence rebuilding.) I also wanted to see what state the machine was in so that in case there were any questions, I could do my best to answer them. 11:45 am rolled around and I hadn't heard anything from Matt about the LOM being reconfigured. I contacted him and we started to get frantic, wondering why the data center personnel were not responding to the request. It was only around 12:30 that we realized no one had yet made it in to do the work. No one was responding to pages, emails or phone calls. Shortly thereafter, we got a message that there was a guy working on it and he couldn't figure out what to do. At that point, I started to get irritated. This isn't the service I expect for our customers, nor is it the type of situation I would ever have envisioned us being in considering where our data center is and who they are. This LOM setup is simple stuff.

I finally wrote Matt and said that I felt we had two options: a) we (one or both of us) forgo family festivities and drive up to Minneapolis and do the work ourselves, or b) we tell the customer that he has two options: we give him another server or, if he cannot accept that, he will have to wait for Matt to arrive in Minneapolis Sunday evening (today) to work on things himself. I wasn't going to make the customer go through any more bullshit. Obviously, the idea that our data center is staffed 24/7 is fiction. Just to see which way we might have to go, I IMed the customer and asked about using a different server for the time being - one from a customer who had recently shut down operations. He said that he wanted access to the data on his server. I took that to mean he wasn't going to accept the "backup" server. Although I thought it was unlikely that he was going to get anything back (because, from what little Matt saw the night before, it looked like the hard drive on the server had been repartitioned), it was a fair request, and I was facing the idea of driving 10 hours to figure out a problem that I knew had to be simple. What a waste of time when we should have a way to deal with a problem like this other than having to be there ourselves...

Matt took a leap of faith, however, and called a co-location customer who is very, very familiar with our equipment and asked if he could help. He was in Minneapolis for the holiday anyway and just might understand what type of bind we were in. In the meantime, I was looking at flight possibilities so I wouldn't necessarily have to drive. In the middle of researching flights, our other customer said yes, he could go in to help. Yes, we called another customer for help. He seemed happy to go in, and it was at that point that I started to breathe a little easier. I had never met this customer directly, but from talking to Matt, he seemed like the kind of person you could trust to stick around until the problem was solved. He understood what type of bind we had found ourselves in. I could plan on seeing family again and Matt could go about his plans also. Neither of us would be stuck in my car or on a plane bitching about who should have had better plans or why we want to get out of this business.

I won't get into the details that follow. Matt and I did a number of tests with rebooting the server before handing it off to the customer again. We wanted to be absolutely sure the LOM didn't lose its configuration again. Within an hour or so, it appeared that the customer was back up and running (his blog details how he did it anyway). After over 24 -- more than twenty-four -- hours.

While I think I can objectively say that the customer could have been more cautious about the upgrade and made sure backups would be ready to go on another server if needed, we were solely to blame for the ridiculous amount of time it took to solve the problem. Where in the hell were the people needed to get their hands on the server and why weren't people at the data center 24/7? At this point I don't care why. The fact is that they aren't. We got caught with our pants down. It's cold in Minneapolis this time of year and having no pants on has frightening effects.

We have a few options from here on in. One is that we no longer claim we can service equipment 24/7. It is a valid option, but not very appetizing to people who are looking for co-location and/or application hosting. Another option is to move our equipment to a data center where we are sure they offer 24/7 staffed support. The last option is to have two people on call whenever we are not physically available ourselves. The reality is that we're not moving. It would take too much time and cost too much money, and we would be better off closing down our service if we had to do so. I don't think we can run a viable business on a 9-to-5 schedule either. So, we need to put people on call; not one person, but two (a backup to a backup). We thought we had people on call by way of the data center and found we didn't. This is the worst way to find out.
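For what it's worth, the "backup to a backup" idea can be written down as a tiny escalation chain. This is only a sketch of the concept: the roster names are invented, and `fake_page` is a stand-in for whatever real paging, phone or IM system would actually be used.

```python
def first_responder(contacts, page):
    """Page each contact in order; return the first who acknowledges, else None."""
    for name in contacts:
        if page(name):
            return name
    return None

# Hypothetical roster: the data center's on-call tech first, then our own two people.
roster = ["data-center-on-call", "backup-1", "backup-2"]

# Stand-in for a real paging system: pretend only backup-2 picks up.
def fake_page(name):
    return name == "backup-2"

print(first_responder(roster, fake_page))            # backup-2
print(first_responder(roster, lambda name: False))   # None
```

The second call is the case we hit over Thanksgiving: everyone on the list gets paged and nobody answers. Running that kind of drill on a quiet weekday is a much cheaper way to discover the chain is empty.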

It isn't just a matter of having a plan. It is also important to test that plan. We won't make the same mistake again.

About

This page contains a single entry from the blog posted on November 25, 2007 12:26 PM.

The previous post in this blog was Everyone is right: The Kindle is ass backwards, yet I'm going to try it.

The next post in this blog is OpenBSD 4.2-current on ThinkPad X40; experiences and kern.emul.linux=1.

Creative Commons License
This weblog is licensed under a Creative Commons License.