Monday
May 21, 2012

Everyone Wants to be the Cool Kid

In this guest post, a good friend toiling in a very private company provides a thought-provoking commentary on current data center design and in-vogue design approaches.  My friend has asked for anonymity, although I vouch for their professionalism.

When I was a kid, we had very little money.  Designer jeans, a shiny new bike, or the latest fad toy: you name it. I did not have it.  So, I know all too well that feeling of wanting desperately to be cool.

I think that is what is going on in the data center space these days.  The cool kids are having all the fun.  Yahoo’s chicken coop is cool as is Google’s barge concept to harness energy from the ocean waves.  (Data centers are usually in particularly boring places so what data center operator wouldn’t want the opportunity to hang out at sea?  Sign me up.)

You know what else the cool kids are doing?  They are jamming as much compute into a rack as the laws of physics will allow.  So, we should too, right?  Wrong.

Before you get all excited and tell me all about that super-cool-rack of super-cool-cloud-ready-kit that you just installed, let me clarify.  I am absolutely not saying any given rack shouldn’t be filled to the brim.  In fact, you should put a couple of these together and make sure to walk your prospective clients on by so that they think you are part of the “in crowd”.  What I am saying however is most of us are not Amazon, Google, or Yahoo. 

Most of us don’t have rack-after-rack, row-after-row, and hall-after-hall of exactly the same stuff.  And our stuff probably isn’t all running the same app so we actually do care when some of it goes down.  For most companies, filling a large data center from wall to wall with 20+kW footprints is what I am questioning.

Let’s play pretend, shall we? 

Let’s pretend we are a big-enough company that building our own data center (rather than using a co-lo) makes sense.   This is a pretty big investment and as the project manager for such an undertaking, I have a lot of things to think about.  This is because data centers are a long term investment and therefore we are making decisions that are going to be with us for a very long time.  For argument’s sake, let’s say 20 years.  This means at the 15 year mark, we are still going to try and eke out one more technology lifecycle.

15 years is a really long time in technology.

A bit less than 15 years ago, I was walking in the woods with my dog sporting a utility belt even Batman would envy.  Clipped on my belt were my BlackBerry, my cell phone, and my two way pager.  And, in hopes a great shot was to be had, I was also carrying my Nikon FM2 35MM camera.  Today, I carry none of these devices but my iPhone is never more than three feet away from my body and usually in a pocket.

As I walked through the woods that fine afternoon, all three items on my belt started to ‘go off’ at the same time just as I was trying to take a picture of my dog playing in a field.  It was not lost on me this was quite ridiculous and “someday” greater minds than mine would certainly put an end to this madness.

I am deeply grateful for all of the people up-to-and-including Steve Jobs who brought us the iPhone.  Now my operations associates don’t need to seek chiropractic help each time they are on-call.

Let’s get back to the problem of data centers.

There are three fundamental decisions we need to make:

  1. how resilient will our data center need to be,
  2. how many MW does it need to be, and
  3. how dense we are going to make the whitespace (the space where all of the servers and the other technology gets installed).

We are pretending we are an enterprise-class operation, so for our discussion we are going to assume we need a “concurrently maintainable” data center.  This might put us somewhere between a Tier 3 and Tier 4 based on the ANSI/TIA 942 or Uptime Institute’s established checklist.

We will further assume we are going to need at least a MW of power to start off and  we expect that we will grow into somewhere near 4MW over time.

Only one more question to go.  Power Density.

Now, if I were a consultant, maybe building something ultra-dense is part of my objective because I want to be able to talk about it in my portfolio.  However, we will work under the premise I really do have the best interests of the firm at heart.

There is a metric in the Data Center world which is both infinitely valuable and amazingly useless at the same time.  It is called PUE and it stands for Power Usage Effectiveness.

Quite simply, it is an indication of how much power is being “wasted” to operate the data center beyond what the IT already consumes.
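
As a minimal sketch of the arithmetic, PUE is commonly defined as total facility power divided by the power drawn by the IT equipment. The function name and the sample readings below are illustrative assumptions, not measurements from any real site.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


# Hypothetical readings: 1,000 kW of IT load, 1,700 kW at the utility meter.
example = pue(total_facility_kw=1700.0, it_load_kw=1000.0)
print(f"PUE = {example:.2f}")
print(f"Overhead = {(example - 1) * 100:.0f}% of the IT load")
```

Everything above 1.0 is overhead: cooling, power conversion losses, lighting and so on.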

In a perfect world, the PUE would be equal to 1.  There would be no waste.  Actually, some would suggest the goal is to have a PUE of less than one.  Huh?

To accomplish this, you would have to capture some of the IT energy and use it to avoid spending energy somewhere else.  For example, you would take the hot air the servers produce and use it to heat nearby homes and businesses thereby saving energy on heating fuel.

On the flip side, PUE can be ‘wicked high’.  Imagine cooling a mansion to 65 degrees but never leaving the room above the garage.  This is what happens when a data center is built too big too fast and there isn’t enough technology to fill the space.  In fact, there are stories of companies who had to install heaters in the data center just so that the air conditioning equipment wouldn’t take back in “too cold” air. 

Google and Yahoo tout PUEs in the 1.1-1.3 range.  This is the current ‘gold standard’.  Most reports put the average corporate PUE above 2.  But this is where we start making fruit salad – comparing apples, oranges, and star fruit.

How companies calculate their PUE is always a little unclear and it is very unclear if that average is a weighted average.  If not, the average of above 2 is quite suspect since the “average” data center is little more than an electrified closet (Ed. Note: or refrigerator.)

Let’s ignore PUE for now and assume two things about our company: 

  1. our data centers are pretty full (remember PUE gets really bad if there is a lot of wasted capacity) and
  2. we have been doing an OK but not stellar job of managing our existing data centers. 

If we were even measuring it… (I bet most companies are not)….this might place our current PUE in the range of 1.6-1.8.   Against the 1.1-1.3 gold standard, this is a difference of about .5.

By now, you are probably asking yourself what PUE and our question of density have to do with each other.  As you meet with various design firms, they will try to sell you on their design based on the assumed resulting PUE.  And, proponents of high-density data center space will tell you this will improve your PUE.

My argument is simply there are a lot of other ways to go about improving our PUE without resorting to turning our data center into a sardine can.  Not overbuilding capacity, hot/cold aisle containment, simply controlling which floor tiles are open, or running the equipment just a little warmer are some immediately coming to mind.  Remember we think we have a .5 PUE opportunity.

All of these things can be done with ‘low risk’ and I contend banking on high density demand in a design needing to last 20 years is folly.   Yes, there is some stuff on the market which runs hot.  But since we already decided we are a complex enterprise that doesn’t only have low-resiliency widgets, we will also have a bunch of stuff that runs much cooler.

For our discussion, let’s assume our existing data centers average 4kW per footprint.  This puts us at about 100 watts a SF: at the high end of most data center space but not bleeding edge by any means.   I have seen some footprints pushing over into double-digits but these are offset by lower density areas of the data center (patch fields, network gear, open floor space, etc.).

Another component that isn’t breaking the power bank is storage.  Storage arrays are somewhere south of 10kW per rack and seem pretty steady.  A smart person once told me we get about twice as much disk space in a rack every two years for about the same amount of power.  And, that assumes we keep buying spinning disk for the next 20 years.  Some of the latest pc/laptop systems are loaded with flash drives and these have even started making their way into some data centers.   In a recent Computerworld article, a discussion around a particular flash storage array put the entire rack at 1.4kW.   Granted, these are pricy right now but such is the joy of Moore’s Law.  Again, 20 years is a long time.

We should also consider all of the talk about ‘green IT’ in the compute space.  Vendors “get it”.  There is much discussion today about how much compute power per watt a system provides.  This is great.  If we measure it, we can understand it.  If we understand it, we will manage it.

Intel has talked about chips that will deliver a 20X power savings.  Remember that super cool (hot) rack with 22kW of stuff jammed in?  What if it only took 1.1kW?  And what if that happens 5 years into our 20 year data center life?

The sales pitch may also try to sell you on cheaper construction due to the smaller whitespace footprint but this is just silly unless you are in downtown Manhattan.  I read something once that put the price of ‘white space’ at $80/sf.  This would mean that building a MW at 100 watts/sf would cost $800K versus $288K for the sardine can. 

This is a bit more than $0.5MM per MW.  To put this in perspective, I would put the price tag at $35MM/MW including the building, the MEP, racks, cabling, core network, etc.  So we are talking about 1.4% of the cost.  And, since we are going to keep this investment for 20 years, we are only talking about ~$25K per year in “excess” depreciated build cost per MW.
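
The whitespace arithmetic above can be sketched as follows. The $80/sf, $288K, $35MM/MW and 20 year figures come from the post; the density implied by the $288K figure is derived here and is not stated in the post, so treat it as an assumption.

```python
WHITESPACE_COST_PER_SF = 80.0      # $/sf, the figure quoted in the post
ALL_IN_COST_PER_MW = 35_000_000.0  # $/MW, the author's all-in estimate
LIFETIME_YEARS = 20


def whitespace_cost(load_watts: float, watts_per_sf: float) -> float:
    """Cost of the floor area needed to host load_watts at a given density."""
    return load_watts / watts_per_sf * WHITESPACE_COST_PER_SF


low_density = whitespace_cost(1_000_000, 100)   # 10,000 sf of whitespace
high_density = 288_000.0                        # figure quoted in the post
# Density implied by the $288K figure (derived, not stated in the post).
implied_watts_per_sf = 1_000_000 / (high_density / WHITESPACE_COST_PER_SF)

premium = low_density - high_density
print(f"Premium per MW: ${premium:,.0f}")
print(f"Share of build cost: {premium / ALL_IN_COST_PER_MW:.2%}")
print(f"Per year over {LIFETIME_YEARS} years: ${premium / LIFETIME_YEARS:,.0f}")
print(f"Implied sardine-can density: {implied_watts_per_sf:.0f} W/sf")
```

Changing the cost per square foot (downtown Manhattan, say) is the one input that would move the answer materially.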

And, let’s pretend we buy into the assumption this is going to cost us some opportunity on our PUE.  This is important because paying the electric company is something that happens every year.   If we put our data center someplace reasonable, a .1 PUE difference might be worth around another $40K.  This assumes $0.06 per kWh power which honestly we should be able to beat.

(1000kW DC) * (80% Life time Average Utilization) * (.1 PUE Impact) * (365*24 Hours) * ($0.06 per kWh)

= $42,048
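
The same calculation as a small function, so the inputs can be varied. The parameter names are illustrative; the values are the ones used in the formula above.

```python
HOURS_PER_YEAR = 365 * 24


def annual_pue_cost(capacity_kw, utilization, pue_delta, price_per_kwh):
    """Yearly electricity cost attributable to a PUE difference."""
    avg_it_load_kw = capacity_kw * utilization
    extra_kw = avg_it_load_kw * pue_delta          # additional overhead power
    return extra_kw * HOURS_PER_YEAR * price_per_kwh


cost = annual_pue_cost(capacity_kw=1000, utilization=0.80,
                       pue_delta=0.1, price_per_kwh=0.06)
print(f"${cost:,.0f} per year per MW")
```

The result scales linearly with each input, so doubling the power price or the PUE gap doubles the annual cost.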

So, we are out of pocket as much as $65K per year per MW.  While I understand this is real money, I still say, even if that PUE difference is real, it seems like a small amount to pay for future-proofing our 20 year investment.

Even though they supposedly didn’t paint the fuselage to save weight rather than to save money, maybe we could take a lesson from the space shuttle and cut back somewhere else.  Somewhere that doesn’t lead us to treating our technology like salty fish.

Just saying.

From Google’s Data Barge Patent

Monday
May 14, 2012

Data Center Patching

We’ve all heard the expression “the network is the computer.”  Many people make a handsome living making sure data center switches and routers from vendors like Cisco, Juniper or others hum along nicely.

And while we’re hearing about the impacts of wireless, wireless data centers seem to be a long way off…leaving data centers filled with miles of wire…cables…and patch cables.

Non-technical managers may simply say “cable is cable” and not fully appreciate the value of a cable plant thoughtfully designed and implemented.

Every cable type (wired or fiber) has designed speed and maximum length characteristics.  It is amazing to me the number of times in a large data center a “flaky cable” ends up being a cable longer than the designed maximum length.  In my mind, it should be called a “flaky installation.”
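
A minimal sketch of checking installed runs against a length limit. The 100 m figure for a Cat 6A channel is the usual structured-cabling limit, and the 300 m figure for OM3 at 10Gb Ethernet is the one quoted further down in this post; the inventory and run identifiers are hypothetical.

```python
# Maximum lengths in metres. Treat both values as assumptions to verify
# against the relevant standard and your vendor's data sheet.
MAX_LENGTH_M = {"cat6a": 100.0, "om3_10gbe": 300.0}


def over_length(runs):
    """Return the runs whose length exceeds the limit for their cable type."""
    return [r for r in runs if r["length_m"] > MAX_LENGTH_M[r["type"]]]


# Hypothetical inventory of installed runs.
runs = [
    {"id": "R12-03", "type": "cat6a", "length_m": 87.0},
    {"id": "R12-04", "type": "cat6a", "length_m": 112.5},
    {"id": "R40-01", "type": "om3_10gbe", "length_m": 240.0},
]
for run in over_length(runs):
    print(f"{run['id']}: {run['length_m']} m exceeds limit for {run['type']}")
```

Running a check like this against the cable records before troubleshooting can save a trip to the floor.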

A well-executed cabling job can qualify as a work of art.

We use some basic guidelines around patching we think make sense.  These are often applicable in high end data centers, with suitable modifications for smaller shops.

A word on “making cables.”  Small shops seem to love to “make cables.”  Small shops often don’t really have the expertise to field make and test a cable.  When the costs of making the cable are included (it is not “free”), the savings from buying factory-made cords, through reduced troubleshooting, become clear.

There are numerous documents available qualifying as “prior art” on cable plants.  Here are some high-level guidelines we used on a recent implementation:

General  

  • Plastic zip ties are not permitted for cable management for either copper or fiber patch cords. Hook and loop fasteners (aka, Velcro®) shall be used to dress bundles of cables and to provide strain relief.
  • All patch cords must be sized appropriately for the application, with only a small service loop at each end to facilitate tracing. Large loops of excess cable are not permitted.
  • All cables must be neatly routed and dressed.
  • Only patch cords from reputable, nationally known, industry recognized firms such as Belden, Siemon, Ortronics, TYCO/AMP, Corning, etc. are permitted. “No brand” generic cords are not permitted.

Copper Network Patch Cords

  • All copper patch cords must be “factory” manufactured, terminated, and tested in an appropriate facility. Field-terminated cords are not permitted.
  • All copper patch cords shall be ANSI/EIA Category 6A (Cat 6A) tested and certified.
  • All Cat 6A patch cords shall be F/UTP or STP construction.

Fiber Patch Cords

  • All fiber patch cords must be “factory” manufactured, terminated, and tested in an appropriate facility. Field-terminated cords are not permitted.
  • Multimode fiber patch cords shall be OM3, laser optimized multi-mode fiber (LOMMF), supporting 10Gb Ethernet to 300m.
  • Single mode fiber patch cords shall be OS1.
  • Connector types shall be coordinated with the Client appropriate to the application.

Other Patch Cords

  • Patch cords for DS1 (T-1) circuits shall be Cat 6A.
  • Patch cords for DS0 circuits may be made on site to provide for specific pinnings or connector type. Cords should be tested with a pair tester.
  • Patch cords for DS3 circuits may be made on-site with the appropriate co-axial cable and connectors. Cords should be tested for continuity with an ohm meter or co-axial cable tester if available.

Data Center Cable Naming Standards

  • Two labels on each wire on each end (4 labels per wire)
  • First label is RX-X where X-X is the rack number and the run number
  • The second label is P.X.X where X.X is the port of the switch it is plugged into
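
The naming standard above can be sketched as a small label generator. Reading the port label's “X.X” as switch module and port number is an assumption, as are the function and parameter names.

```python
def cable_labels(rack: int, run: int, switch_module: int, switch_port: int):
    """Return the two labels applied at each end of one cable."""
    location = f"R{rack}-{run}"                   # rack number and run number
    port = f"P.{switch_module}.{switch_port}"     # switch port it plugs into
    return [location, port]


def labels_to_print(rack, run, switch_module, switch_port):
    """Two labels per end, two ends per cable: four labels in total."""
    return cable_labels(rack, run, switch_module, switch_port) * 2


print(labels_to_print(rack=12, run=3, switch_module=1, switch_port=24))
# ['R12-3', 'P.1.24', 'R12-3', 'P.1.24']
```

Generating labels from the cable records, rather than typing them by hand, keeps both ends of a cable consistent.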

Data Center Cable Color Standards

  • End User Connection - Blue
  • LAN Server Connections - Orange
  • Management (iLO and KVM) - Green
  • DMZ - Pink
  • Phones - White
  • Internal Uplinks - Yellow
  • External Uplinks - Grey or Black

What standards/practices do you find valuable?

 

Monday
May 7, 2012

An (Electronic) Alternative to Data Center Construction

In these posts, we’ve covered a variety of ways to extend the life of a data center, or to consider co-location.  All these topics assume the load or demand from the data center equipment is fixed, and that the supply (space, cooling or power) must accommodate the demand.

One client looked at this opportunity a different way and came out with an intriguing alternative.

This client has a data center in a downtown office building.  The data center was hemmed in on all sides, cooling improvements were proving costly, and the lease expiration on the building was on the horizon.

This client had already migrated to blade chassis and virtualized 75% of their environment.

They went the next step, addressing their storage, too. They implemented flash drives over traditional spinning disk.

This client dropped storage heat load by 50%!  Flash drives offer 50% lower heat output and consume 55% less energy as compared to spindle-based drives, and the client realized a 9 times I/O bandwidth increase. (This helped them consolidate workloads onto fewer disks and drop rack space for storage by 52% as well.)

They were able to leverage this additional bandwidth to meet their business goals with many fewer disks, many of which individually also produced less heat.

They found the capital expenditures for this project were less than expanding the datacenter cooling capacity, given the horizon for retiring the datacenter. Furthermore, this project supported the consolidation of the storage environment, easing the eventual transition out to a new datacenter.

This client found flexibility and adaptability as the most critical design element for a datacenter. When you’re faced with constraints and want to look outside the box, take a different look at what’s in your (datacenter) box first!

What would you do if faced with a similar set of circumstances?

Sunday
Apr 29, 2012

Why I still like Service Indicator Lights

My business partner loves to tell people I can always find something with an error light illuminated in a data center.

You know the light….the one basically saying “I need attention.”  They vary in color…some may be red, some amber.  They are meant to draw attention to them.

Well, please don’t tell him….but I’m actually not omniscient.  The law of averages begins to come into play.  In a large data center holding thousands of devices, with a mean time between failures measured in years, something is always failing.
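
A rough sketch of that law of averages, assuming independent devices with a constant failure rate (a Poisson approximation). The fleet size and MTBF below are hypothetical values chosen only for illustration.

```python
import math


def expected_failures_per_day(devices: int, mtbf_years: float) -> float:
    """Average failures per day across a fleet with a constant failure rate."""
    return devices / (mtbf_years * 365)


def prob_any_failure_today(devices: int, mtbf_years: float) -> float:
    """Chance of at least one failure in a day (Poisson approximation)."""
    return 1 - math.exp(-expected_failures_per_day(devices, mtbf_years))


# Hypothetical fleet: 5,000 devices, each with a 5-year MTBF.
rate = expected_failures_per_day(5000, 5)
print(f"{rate:.2f} expected failures per day")
print(f"{prob_any_failure_today(5000, 5):.0%} chance of at least one today")
```

Even with multi-year MTBFs, a fleet of thousands makes a lit service indicator somewhere on the floor the normal state of affairs.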

Sometimes the “I need attention” light comes on because a dual power corded device only has one power cord plugged in.  All the computer knows is one side of power isn’t working…so someone should know about it.  Sometimes the light comes on for more serious reasons….such as a fan failing or a processor check.

Some might argue the external service indicators are a thing of the past.  In an era of systems being able to report on a large variety of health measures, most servers are already communicating to a variety of monitoring systems so “looking at a light” may be considered old fashioned.  And while I wouldn’t argue that, I don’t use the service indicator lights that way.

I use them as a simple way to consider how an organization is servicing their environment.

For example, a year ago I walked into the data center of a financial services firm and noticed a service indicator was “on” with a key processor.  I made a mental note.

A week later, the same light was still on.  I got the manager and asked him about it.  “Our people are aware and taking care of it.”

Another week goes by…and the same light is still on.  Admittedly, someone could have repaired the machine and another failure ensued.  Such was not the case.  This company simply wasn’t taking their systems servicing seriously.  This time, the manager did get the machine repaired before there was a serious issue.

This same data center also had a large number of dual power corded devices with indicator lights illuminated.  The company quickly discovered one side of the power distribution unit had a tripped breaker and they didn’t realize it.

So clearly systems monitoring and power monitoring were issues.

Do I run my shop based on the lights?  No.  Automated tools properly configured will provide better instrumentation.

That said, these “dummy lights” painstakingly added to systems by thoughtful engineers provide a litmus test for me…a simple litmus test I can administer solely by walking around a data center without touching anything. 

So, the truth is I always do find lights on, as can you.  It’s how your organization deals with the lights that differentiates.

How do you react to service indicator lights?

Sunday
Apr 22, 2012

The Importance of a Quality RFP

We spend a great deal of time creating and facilitating RFPs. Clients request RFPs because it helps them to make decisions. We like doing RFPs because it brings precision and process to something emotional, the sales process.

I would like to tell you we don’t want to interfere with the vendor/customer relationship, but it would not be true. We teach our clients to strive for fact-based and unemotional procurement decisions. This is the opposite of how vendors (including Harvard Partners) sell customers. Vendors try to convince customers the vendor’s products are unique, special, and something the customer MUST have. Deep down inside, we all want to own the “shiny object” and be better than our peers. Unfortunately, that is not always the best answer.

The RFP process brings parity to procurement decisions and a good RFP process allows the solution with the best value to rise to the top. A good RFP should also be quick and simple for the vendor to complete.

What makes a quality RFP?

  • Statement of Purpose – It is important to clearly state why the RFP is being issued. The reader of the RFP (vendor) is looking to uncover the problem being solved by the RFP. This is what good salespeople do. Hide the true purpose of the RFP and your response will probably not meet the mark.
  • Detail – Details matter. The more detail you put in your RFP, the more accurate (I did not say detailed) your response will be. Details will also save the vendor time in creating their response. By removing ambiguity you remove wasted time.
  • Setting Expectations – Let the vendor understand the process and when they will receive a response. Appreciate that they have put time and effort into responding and deserve to know what is going on. Allow them to contact you for status.
  • Response Template – Whenever possible, give the vendor a template for the response. It will save you time when compiling responses and it will save the vendor time in creating the response.
  • Non-Technical Response – We are typically asked to create RFPs for technology procurement by people who are not that technical. They are decision makers who rely on their vendors and others to be technical. Pack an RFP response with too many technical acronyms and too many speeds and feeds and you lose them. By creating a detailed RFP you cause the response to be mostly yes, no, and pricing. This is something a non-technical person can understand.
  • Allow for Errors – When we evaluate RFP responses we work to uncover a bad response and give the bidder the opportunity to correct their mistake. Most of the time this was due to a bad assumption on the part of the bidder. In the spirit of getting the most accurate information for our client, we feel bidders should be given the opportunity to correct mistakes.

While we strive for high-quality RFPs we also recognize the need for the vendor and customer to have a strong, positive relationship. RFPs do nothing to make that happen. We recommend prospective customers visit with each vendor, prior to the RFP process, and allow the vendor to ask questions and sell the customer. This gives the prospective customer an opportunity to get to know the sales people and delivery team. After all, in the end, “people buy from people.”

The RFP is a great tool to aid in the quality and timeliness of the decision-making and procurement processes. Like any project, your procurement process needs a plan. Think of the RFP as both a requirements document and project plan rolled into one. With a clear understanding of the problem and expectations, vendors can focus on proposing a better and more cost effective solution.

Sunday
Apr 8, 2012

The Case for Bring Your Own Device (BYOD)

When you hire a carpenter, do you have to buy the saw? The answer is obviously no, the carpenter brings the tools of their trade. Why isn't it the same in IT?

As "migrant white collar workers" (aka consultants), we get to see what many organizations do for their desktop support. Many give us a laptop to do work, on the presumption we won't take their data and our machine will, by default, adhere to their security policies.

(This is very misguided. Inevitably we are asked by the client to help get a large (legitimate) file on to one of their own machines (like a set of data center plans). Bottom line is we can get the file there…and we're not breaking any rules. And, I digress.)

The time for Bring Your Own Device is here.

Yes, there are organizations that will do virtual desktops, or a Desktone solution. For some organizations, these will be fine transitional approaches to what we believe is an environment truly open to virtually any device.

To make this happen, security organizations will have to be enlightened on what is really being secured. The truth is, it's the data. If customer data is on a laptop, that laptop is a risk. Why is customer data on a laptop? Organizations must protect data.

Applications will need to be written to accept different devices. Years ago, I was a Mac man. I gave up and went to the Microsoft world as it was (in the late 80s) hard to work in a mixed environment. Today, that's hardly a consideration. Macs and PCs interact every day.

Devices will present a challenge. Frankly, I don't care what device you use. If you like an underpowered machine (hand saw) and I like more power (power saw), so be it. That said, I happen to like my BlackBerry for email/calendar/contact integration…yet it is a dying machine in favor of glitzier, arguably less secure tools. Far be it from me to try and resurrect RIM, but their email integration is fabulous (I'm not a fan of sending everything to them first, but again I digress.)

What's the key in your company? Are you considering BYOD, or is it a distant future?


 


 

Saturday
Mar 31, 2012

Another Open Letter to Co-Lo Data Center Operators

On February 5th, Gary Kelley posted an article titled “An Open Letter to Co-Lo Data Center Operators.” In the article he offered some do’s and don’ts during the RFP process. I would like to share some observations around what prospects are looking for when evaluating a data center co-location provider.

As Gary mentioned, Harvard Partners performs a lot of co-lo RFPs, co-lo contract negotiations on behalf of clients, and migrations to co-location sites. We coach and mentor clients from the point they think they might want to migrate their data center to watching them become fully operational in a co-location site. During this process we see the reaction clients have to the sales pitch, value proposition, and site visit of many co-lo operators.

Here are some observations:

  • You are not special – we know every co-lo operator is special and you are all better than your competition. After two site visits (and two data center tours) our clients turn to us and say “it feels like we have done this before.” As this is what you are selling, you must do the data center tour, but please recognize it can be mind-numbing to see one data center after another. Keep the tour brief and focus on what will interest the prospect.
  • Listen to the customer – the customer has selected to visit because they are trying to solve a specific problem. Maybe it’s a construction project impacting their data center or frequent outages. Maybe they want to instill more discipline in their data center operations. In any case, it is important for you to understand who you are talking to and why they are seeking co-location. Don’t compare the prospective customer to organizations they can’t relate to. Using an example of a very large company when talking to a medium-sized prospect can become insulting as it makes the prospect feel much less important.
  • Don’t oversell - most customers are coming from a server room or a Tier 1 data center. Maybe they have a generator and maybe they have some extra cooling. The likelihood is their “data center” is a converted room in a commercial office building. You are about to show them something so far beyond their imagination they tend to get overwhelmed and ask “do I really need all this?”
  • Details matter – customers come to you because they know you will pay attention to the details required to operate a state-of-the-art data center. Don’t create doubt with such incidents as dirty floors, too many construction workers around production equipment, ceiling tiles with water stains, visible rodent traps, and propane grills anywhere near your generators. When the prospective customer comments “I wouldn’t let that happen in my data center” you know you’ve lost the deal.
  • Make it real – whatever you sell make sure the prospective customer can translate it into dollars. Every co-lo operator talks about managed services and cloud in one slide and then goes back to talking about generators, cross-connect rooms, and cooling towers. If you are going to talk about extra services then make them real through a demonstration and consider bundling services into the co-lo contract. Every one of our clients has told us co-location is the first step to the cloud. We watched two CIOs “wake up” as a co-lo operator created a fully operational Windows Server instance in their cloud within 5 minutes. The same co-lo operator offered free “cloud migration services” which got a response of “so what” from the CIOs. When the offer changed to a free 4 CPU, 12 GB RAM, 100GB storage instance for a year, the CIOs immediately decided to go with that co-lo operator. The CIOs were able to dollarize the offer and place a value on it.
  • Become a partner – customers are looking for you to demonstrate how you will make them better. Providing a resilient and robust data center is an important selling point, but so is your knowledge of process, managed services, technical architectures, cloud, and DR. We watched as technical teams between a co-lo operator and a customer started to have a “mind meld” over a network architecture. The CIO observed the interaction between the two teams and was convinced he had found a partner and made a decision to go with that co-lo operator.
  • Be personable – Gary always reminds me “people buy from people.” In the co-lo space it is always true. You might think they are buying the generators, chillers, and multiple points of access into your facility, but what they are buying is you. The CIO is looking for the person they will trust with their career. Every CIO I meet when working on co-lo engagements tells me the co-lo decision is a make-or-break decision for their career. Very simply, if they can’t provide services to their users, then they had better start updating their resume.
  • Price is important – once you accept you are not special then you realize co-location services are a commodity (sorry). Responding to an RFP with a low price gets the attention of CIOs. Most of the time, before we get called in by a CIO, they have already visited one or two co-location sites. They like what they see and are very excited, but realize they should do an RFP. When one of those co-lo operators comes back with something resembling “list price,” the CIOs go visit other co-lo operators and quickly realize the services they require are commodity.
  • Bundle – when we do an RFP we always ask for a certain number of remote hands hours to cover things like tape rotation. You can differentiate yourself by offering these types of services as part of the base cost when the prospective customers visit you. It is something they remember.

We believe the co-location market is changing. Just two years ago we were mostly concerned about customer expansion over the course of a 5-year contract. Today, we must think about contraction of space as customers think of going to managed services and the cloud. Your relationship with the customer and helping them grow are the key selling points as our industry moves forward.

Sunday
Mar 25, 2012

When to Update Production and DR

Some companies run a "production only" environment. Think a restaurant, where they buy packaged software and can use paper as a backup system. The chances are excellent they buy packaged software, and are looking to the software provider to have proper systems management.

Other companies can take environments to another extreme…having multiple development, certification, integration, performance, staging, production and disaster recovery environments, steadfastly promoting code through each environment. Many organizations take "release management" to levels comparable to large software firms.

Of course, the age old debate of few formal releases vs quick regular releases is always in play.

This post tackles a simple question. Assuming you have both a production and a disaster recovery environment, what is the upgrade order?


Obviously the circumstances in your environment will drive what you do. One might argue in a truly active:active environment, there is no such concept as disaster recovery. For purposes of this post, production and DR are separate, failover is possible, and failover is tested regularly.

Some people argue the natural upgrade approach is to upgrade disaster recovery first, then complete the upgrade process by touching production. In this approach, every stage of the promotion process is viewed as a test, preserving production for the "final" change.

We posit Disaster Recovery should be upgraded after ensuring Production is stable.

It's not that we don't think Production is important. To the contrary, we revere the Production environment.

By upgrading DR after Production, you ensure the business can fail over if the upgrade proves untenable in production. There is always a known working copy available.

IT professionals following this approach have to determine when Production is stable. Is it an hour? A day? A cycle?

We suggest it is after a day's stable processing. A day is arbitrary; as a practical matter, once changes are in place there comes a point of no return, after which any fixes will be made in the new environment rather than by failing back.

IT professionals must remember a promotion cycle is not complete until DR is upgraded. When organizations neglect to upgrade disaster recovery, they lose their failover ability.
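The ordering above can be sketched as a small decision helper. This is only an illustration, not a tool we use: the `Promotion` class, its field names, and the step labels are hypothetical, and treating the one-day soak period as the point of no return is an assumption made to keep the sketch simple.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: one day of stable processing, which the post admits is arbitrary.
SOAK_PERIOD = timedelta(days=1)


@dataclass
class Promotion:
    """Tracks one release as it moves through production and then DR."""
    release: str
    prod_upgraded_at: Optional[datetime] = None
    dr_upgraded_at: Optional[datetime] = None
    prod_stable: bool = True  # hypothetical flag, set by monitoring or operators

    def next_step(self, now: datetime) -> str:
        if self.prod_upgraded_at is None:
            return "upgrade production"
        if self.dr_upgraded_at is not None:
            # The promotion cycle is only complete once DR matches production.
            return "complete"
        past_soak = now - self.prod_upgraded_at >= SOAK_PERIOD
        if not self.prod_stable:
            # Before the soak period ends, DR is still the known working copy.
            # Afterwards, fixes are made in the new environment instead.
            return "fix forward in production" if past_soak else "fail over to DR"
        return "upgrade DR" if past_soak else "wait"
```

For example, a release upgraded in production an hour ago gets "wait" while stable, or "fail over to DR" if it proves untenable. After a stable day the answer becomes "upgrade DR", and only once `dr_upgraded_at` is set does the promotion report "complete".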

Is this a one-size-fits-all recommendation? No, you need to look at your environment and the changes underway. Database schema changes and core functionality changes may preclude a phased approach. Since we fundamentally don't subscribe to big bang, we suggest always trying to maintain a failover ability.

How does your organization deal with upgrades/migrations to minimize risk?

Sunday
Mar182012

It’s Better in the Dark

In the mid-80s I had the opportunity to work with Arnold Farber and Rosemary LaChance of a company known as Farber/LaChance. They preached the gospel of lights out operations, without any staffing. Fast forward nearly 30 years, and www.farberlachance.com doesn't have an owner (as of March 28, 2012).


The truth is I don't know what happened to Arnold and Rosemary, and hope they are blissfully retired somewhere. As I look at data centers, who is really doing lights out operations?


It's my observation that small and medium businesses are often running lights out…with someone getting a text message if something fails. These are not sophisticated operations; if power is lost the text system often fails. These are operations where automation supports the business, and isn't mission critical.


In larger organizations, staffed operations centers remain the norm. Why?


Back in the mid-80s, we were worried about running batch jobs at the right time, or making sure the printouts got produced. Scheduling, Tape Management and Report Distribution were pretty hot back in the day. Heck, some people were even getting press coverage.




Thinking about today, large data centers have made great progress with reliability, monitoring, correlation, scheduling, backup management, replication, etc. Many keep all but a handful of people out of the data center (system admins get upset when they can't be in the machine room…and the truth is they don't need to be there.) And still, people are often in place. Why?


Ultimately people make decisions. People are pretty good correlation engines.


It's people who open the data center doors for repair people. People who run the incident bridges.


And in the end, for a company with mission critical systems, having a small cadre of people is very inexpensive insurance.


So it's about the criticality of the systems to the business.


I thank Arnold and Rosemary for their contributions in pushing forth with the vision. Their vision is still valid, as we couldn't run the highly complex environments of today without automation. As a discipline, we still have progress we can make.


What automation efforts are you pursuing that lead toward lights out? How close is your business to true lights out operation?



Saturday
Mar102012

When to Deliver Bad News

Timing around delivering "bad news" can be tricky. Deliver too quickly, and you'll be accused of panicking. Wait too long and it will be suggested you hid a problem.




It's our experience that giving people bad news sooner rather than later is the key. If you are working on a major program, you might list "bad news" on a Risk/Issues register…where a risk is something that might happen, and an issue is something that has happened.


We always try to work an issue before raising a red flag. After all, it's our job to work items and make them non-issues. Sometimes there just is no way out.


When you identify an issue, always try to suggest mitigations…so the issue doesn't just lie on the table without hope.
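For readers who keep their register in a script or spreadsheet export, here is a minimal sketch of the risk/issue distinction. The `Entry` class, its fields, and the method names are hypothetical and chosen for illustration only; a real register would carry owners, dates, and severity as well.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Kind(Enum):
    RISK = "risk"    # something that might happen
    ISSUE = "issue"  # something that has happened


@dataclass
class Entry:
    title: str
    kind: Kind
    mitigations: List[str] = field(default_factory=list)

    def materialize(self) -> None:
        """A risk that has actually happened becomes an issue."""
        self.kind = Kind.ISSUE

    def has_mitigations(self) -> bool:
        """True when at least one mitigation is attached to the entry."""
        return len(self.mitigations) > 0
```

The check on `has_mitigations` is just a reminder before the conversation with the executive: an issue raised with no suggested way forward is the one that lies on the table without hope.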


A few other factors always come into play:




  • Nobody likes surprises – the last thing an executive wants is to be surprised by an issue. Most executives are accomplished at hearing bad news when they know bad news is coming. It can be as simple as saying, "We've got some bad news to cover," followed by an explanation. Often the executive has ways to mitigate an issue…and can when given the opportunity.


    One time a subordinate sent out an email with a CFO's name on it. The CFO had not seen or approved the email. When this was uncovered, we went to the CFO, explained the situation, let him get a little red-faced, and then knew we would live for another day when he said, "All we can do now is backfill, so how do we do that?"




  • Share bad news privately – inexperienced project managers will clearly identify an issue on a report or status and leave it at that. This presumes the busy executive has time to read the report! Make a point of sharing bad news privately.


    Execs are people, too. Sometimes they just need the privacy to drop an f-bomb and then get their game face on.




  • Always get to the executive first – bad news travels like wildfire. Executives want to help, and feel betrayed if they hear about issues from a peer or in a status meeting.


    When an executive feels they are the last to hear bad news, trust erodes quickly. Executives may not like bad news, but they appreciate knowing about it first.




Simple? In a blog, it is simple. In the real world it is much more complicated and gets into the personalities involved:



  • Is the executive available (or traveling)?


  • Bad news on Fridays can be tough.


  • When is the "big meeting" (whether related or not)?



Our rule of thumb is the bigger the risk, the quicker we must act.