The AnarchAngel : IT


Friday, April 20, 2012

Stress Test for IT Ops Shops

Cringely wrote something interesting today, ostensibly about how to fix IBM (a cause I believe is currently lost; their management is too far down the hole to see daylight anymore). What caught my attention in particular, though, is what I think is actually a pretty great stress test for any IT ops shop.

For those of you unfamiliar with the IT outsourcing business: many companies, both large and small, contract out some or all of the operations, design, implementation, maintenance, service, and support of their IT to professional service providers.

These services can be on a long term or short term basis; and in scope can extend anywhere from staff supplementation (a few extra bodies), to staff replacement, all the way to complete "blackbox" outsourcing; where your company literally has no IT staff OR infrastructure whatsoever (sometimes even no desktop PCs), and everything is handled by an outside company.

This has of course been going on since the 1950s and 1960s with mainframes, and 70s and 80s with minicomputers; but over the past 15 years or so, has dramatically accelerated, to the point that many organizations even outsource all of their IT desktop and server operations, support, application management, administration... Some even outsource their IT management, policy setting... really everything related to information technology; except perhaps the senior management (CIO, CTO etc...).

Very frequently, this means that no single person actually employed by an organization, has any IT knowledge, expertise, or access to the IT resources of that organization; or, if they do, they don't have the time necessary to properly oversee these functions (because, after all, it's about cost savings; which means man hour savings more than anything else).

What results from this, is in theory a cost savings (and often in practice); but when no-one employed by your company has the skill or understanding to properly evaluate the service they are getting from these contractors... well, things can get out of control very quickly.

The U.S. Navy, U.S. Air Force, DOD, the State of California, and New York City (among thousands of other organizations around the world); have all found this out the hard way, to the tune of billions of dollars in cost overruns, and project failures.

As it happens, I have worked in this business much of my career; mostly in the design, architecture, and implementation functions, but some in the operations and support functions. In many of the roles I've held along the way, I've dealt with a lot of these contracts, in all their many variants; and the many companies that provide them (including IBM, and my new employer; both quite extensively).

In fact, the new job I start Monday is an operations role (managing information security operations), on a "staff" contract (which is one of the variants of these types of contracts, where instead of completely outsourcing the function, the company brings in outsourced contractors, who act within the structure of the parent company as if they were employees... thus "staff"... even though most of the actual IT roles are in fact filled by contractors).

I don't disagree with anything Cringely has written in his post; or for that matter, the rest of the post series on IBM (which, if you have any interest in IBM or in IT, you should read, if you haven't already. It's big news, and I think it's correct).

The test he proposes, SHOULD be one that any competent and well run IT organization SHOULD be able to pass; whether outsourced, or organic to a company.

Frankly, I know a lot of shops that wouldn't pass this test... In fact I don't know many shops that could pass every part of this test in a reasonable amount of time... and I'm willing to bet a lot of folks are going to be seeing this come from their clients or their management in the next few days or weeks.

So, the test that Cringely proposes:

Ask your IT outsourcing provider to produce the following:

1) A list of all your servers under their support. That list should include:

Make

Model

Serial Number

Purchase Date

Original and current asset value

Processor type and speed

Memory

Disk Storage

Hostname

IP address(s)

Operating system(s)

Software product(s)

Business Application(s)

Is this list complete? How long did it take your provider to produce the list? Did they have all this information readily accessible and in one place?
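As a rough illustration of the "is this list complete?" question, here is a minimal Python sketch that checks an exported server list for missing fields. The field names and the `audit_inventory` helper are hypothetical (they are mine, not from Cringely's post or any particular asset management tool); a real export will have its own column names.

```python
# Hypothetical field names mirroring the checklist above; a real CMDB or
# asset database export will use its own column names.
REQUIRED_FIELDS = [
    "make", "model", "serial_number", "purchase_date",
    "original_value", "current_value", "processor", "memory",
    "disk_storage", "hostname", "ip_addresses", "operating_systems",
    "software_products", "business_applications",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in one server record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def audit_inventory(records):
    """Map each incomplete server (by hostname, else serial number) to its gaps."""
    gaps = {}
    for index, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            key = (record.get("hostname")
                   or record.get("serial_number")
                   or f"record #{index}")
            gaps[key] = missing
    return gaps
```

An empty result from `audit_inventory` only says the fields are filled in; whether the values are accurate, and whether any servers are missing from the list entirely, still has to be checked against reality.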

2) A report on the backup for your servers for the last 2 weeks.

Are all servers being backed up?

Are all the backups running in the planned time window? Is there ample time left over, or is the operation using every minute of the backup window?

When backup runs on a server there are always files that are open or locked and the backup cannot copy them. Every day the backup team needs to look at their reports and make sure that files that were missed are backed up. In your examination of the backup reports you should see evidence of this being done.

If you spot any potential problems with a server ask for a list of all the files on the server. The list should show the filenames, dates, and if the archive (backup) bit has been flipped.

Is this list complete? How long did it take your provider to produce the report? How often does your provider conduct a data recovery test? If a file is accidentally deleted, how long does it take your provider to recover it? Can your provider perform a “bare metal” restoration? (Bare metal is the recovery of everything, the operating system included, onto a blank system.)
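The first two backup questions (is every server covered, and do the jobs fit the window with time to spare) are easy to check mechanically once you have the job log. A minimal Python sketch follows; the record layout, the `review_backups` name, and the 20% headroom threshold are all my own assumptions for illustration, not anything a specific backup product emits.

```python
from datetime import datetime, timedelta


def review_backups(servers, jobs, window_hours, headroom=0.2):
    """Flag servers with no backup job and jobs that crowd or overrun the window.

    servers: iterable of hostnames that are supposed to be backed up.
    jobs: list of dicts with 'hostname', 'start', and 'end' (datetime objects).
    window_hours: length of the planned backup window, in hours.
    headroom: fraction of the window that should stay unused (assumed 20%).
    """
    window = timedelta(hours=window_hours)
    backed_up = {job["hostname"] for job in jobs}
    findings = {
        "never_backed_up": sorted(set(servers) - backed_up),
        "overran_window": [],
        "little_headroom": [],
    }
    for job in jobs:
        duration = job["end"] - job["start"]
        if duration > window:
            findings["overran_window"].append(job["hostname"])
        elif duration > window * (1 - headroom):
            findings["little_headroom"].append(job["hostname"])
    return findings
```

This says nothing about skipped or locked files, or about whether a restore actually works; those still need the daily review and the recovery tests the checklist asks about.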

3) A report on the antivirus software on your Windows servers.

Is antivirus software running on all your Windows servers?

Is it the same (standard) version?

Are the virus signature file(s) current?

Ask for case information on any recent virus infections

Is this list complete? How long did it take your provider to produce the report? When a virus is detected on a server, how is the alert communicated to your IT provider? How fast do they log the event and act on it?

4) A report on your network. It should include:

Illustrations of the major network equipment including routers, switches, firewalls, etc.

IP address allocations.

Internal DNS entries.

Current routing and firewall rules.

Is this information complete and current? How long did it take your provider to produce this information? Is this information stored in a readily accessible place so that anyone from your IT provider can use it to diagnose problems?

5) Information on your Disaster Recovery plans. Here is what you want to know:

Documentation on a recent DR test, the plan and results. It should show the actual times tasks were started and completed. Problems should be logged. (it is okay for there to be some problems, that is the purpose of the test)

Ask for a list of names from the IT provider of the people who worked on the test.

How many people who worked on the test live full time in the same country as your DR facility?

Did your IT provider fly in an army of offshore support folks for the test?

If there was a real disaster how long would it take your IT provider to assemble a team to support your emergency?

Ask for a list of your critical applications to be provided and supported in a disaster.

Is the list complete and correct? Is there sufficiently detailed information on each critical application? How much data is involved? Is the data actively sync’d over a network? How often is the sync’ing process checked? What hostnames and filesystems need to be restored? What application skills are needed to start up the applications?

6) Help desk information. Here is what you want to know:

Ask for a report of all the help desk tickets for the last 2 weeks.

Independently ask your company (not your IT provider) for information on known IT problems over the last two weeks.

Compare the information from the helpdesk and your company sources.

Pick a few random incidents from the help desk ticket report. How long did it take to discover the problem? How long did it take your IT provider to begin to work on the problem? How long did it take your IT provider to fix the problem? Was the problem really fixed?

Is there an active problem prevention program? Is your IT provider examining the reported IT problems and finding ways to reduce the number and frequency of problems?

How long did it take your provider to produce this report? Did they have all the help desk ticket information readily accessible to everyone and in one place?
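If the ticket report comes as an export with timestamps, the "how long did it take" questions can be summarized rather than eyeballed. Here is a minimal Python sketch; the `opened`, `work_started`, and `resolved` field names and the `ticket_timings` helper are hypothetical, so map them onto whatever your help desk system actually exports.

```python
from datetime import datetime
from statistics import median


def _hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def ticket_timings(tickets):
    """Summarize response and resolution times, in hours, for resolved tickets.

    Each ticket is a dict with 'opened', 'work_started', and 'resolved'
    datetime values; a ticket whose 'resolved' is None counts as still open.
    """
    closed = [t for t in tickets if t.get("resolved") is not None]
    response = [_hours(t["work_started"] - t["opened"]) for t in closed]
    resolution = [_hours(t["resolved"] - t["opened"]) for t in closed]
    return {
        "closed": len(closed),
        "still_open": len(tickets) - len(closed),
        "median_response_hours": median(response) if response else None,
        "median_resolution_hours": median(resolution) if resolution else None,
    }
```

Numbers like these only cover what made it into the ticketing system, which is exactly why the checklist has you compare against what your own people say went wrong.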

7) Look for evidence of continuous improvement.

Repeat this process once a month.

Look for changes and improvements month to month and over several months.

Are the total number of problems being reduced?

Is the response time to fix problems being improved?

Is there clear evidence your IT provider has an active and effective continuous improvement program?

A good IT provider will have the tools to automatically collect this data and will have reports like these readily available. It should be very easy and quick for a good IT provider to produce this information.

A key thing to observe is how much time and effort does it take your IT provider to produce this information. If they can’t produce it quickly, then they don’t have it. If they don’t have it they can’t be using it to support you. This then will lead you to the most important question: are they doing the work you are paying them for?

If you're in IT, I'd be willing to bet your own shop can't pass this test in all aspects; at least not within 30 minutes or an hour, or even the same business day. In my rather broad experience across hundreds of clients, if you can even get most of it within a business day, you're doing pretty well.

That's not as it should be; but it's often how it is. Many shops, if not most, just don't have all the elements they need to maintain this level of operational fitness.

Something that Cringely didn't explicitly write here, but which should be addressed (and is implied in the test); is that passing this sort of test is really dependent on four elements:

Proper tools: Your team needs to have the right tools, access to them, and have them properly configured; so that they can do all of these tasks efficiently, effectively, and consistently.
Proper process: Your IT processes need to account for all aspects of your operations. They need to be easy to understand, well documented, readily accessible and readable by anyone who needs them, consistent but flexible, goal oriented and mission focused, PROPERLY TESTED; and your staff must be properly trained on them.
Adequate staff levels: You need to have enough people to cover all the work required for your IT needs. To an extent, good tools and good process can reduce your staffing requirements (and in some ways, the skill and training requirements for that staff), but you can only cut so far. You MUST have adequate coverage, and that coverage must have sufficient skill, knowledge, and training; to meet your needs. Further, you must understand that your staff are human beings, with lives and pursuits outside of work. They have vacations, and family emergencies, and they get sick. They have different skillsets, and different skill levels. Some are more or less efficient or effective than others at some tasks. Treating your staff as fungible man hours is a sure and certain recipe for failure.
Good IT management: Without good management, none of these things will happen. If handed them on a silver platter, without good management, they will stop working. Management must keep all these factors in mind at all times, and maintain MISSION FOCUS above all else. You are not here to meet a metric, you are here to get a job done; a job that enables other people's jobs. Meeting your metric isn't doing your job; making sure others can do their jobs is.

Of course, that's just my opinion, I could be wrong.

Wednesday, April 07, 2010

Happy Birthday to Information Technology

Today is one of the most important single dates in information technology. It could reasonably be called IT's birthday.

On April 7th 1927, the first long distance television broadcast was made... Information technology isn't just about computers after all...

On April 7th 1964, the modern mainframe, the IBM System/360, was officially announced. There were of course other large computers before the System/360, but the 360 represented the first truly modern mainframe... and more importantly, the modern generation of mainframe system software, and applications.

So much so, that many core mainframe applications running today, are at least in part, binary compatible back to the programs of the S/360 in 1964.

The System/360 pioneered so many things we consider standard today, it can be fairly said that all modern computing is in some way descended from it.

Five years later, on April 7th 1969, RFC 1 was published, setting the first basic standards of what would become the internet.

So, happy birthday to the industry that pays my bills.

Monday, September 29, 2008

Virtual Boy - Part 1

Definition of Virtualization in Enterprise Computing

At this time there exists no coherent, industry wide definition of virtualization; or accepted method of tracking what environments, applications, and infrastructure are virtualized.

As there are many technologies, solutions, and even fundamental approaches available for virtualization (some of which will be reviewed in later documents), one cannot simply identify a product, and define an application using it as being “virtualized”. In this instance a higher level definition is necessary.

To this end and for purposes of this document (and future documents in this series), enterprise virtualization shall be defined as follows:

Virtualization is the provisioning of computing resources for applications or services, in a manner abstracted from the hardware that will provide those computing resources.

Typically, multiple virtualized applications may be provisioned on a single hardware instance (server, appliance, mainframe etc.); or cluster or plex of multiple hardware instances, behaving as a single instance.

Conversely, the computing resources for virtualized applications may be spread across multiple aggregated hardware instances or a cluster or plex of hardware instances.

Either class of provisioning would be considered virtualized.

Computing resources for an application may be mapped to dedicated hardware; however through the abstraction of virtualization, ideally, this hardware should be able to be changed without requiring the rebuild or reconfiguration of the virtualized instance (though some downtime may occur).

THE CHALLENGES OF THE “TRADITIONAL” APPLICATION PROVISIONING MODEL

The traditional model of application provisioning involves defining a maximum resource requirement and projected growth requirements (for three to four years) for an application; and provisioning a dedicated server or servers which will provide up to 100% more computing resources than that projected requirement (depending on the application tier); each server requiring its own power, cooling, floor space, storage, and supporting infrastructure.

Importantly, in this model, the end user is paying full price for their projected needs, and must pay for a substantial portion (if not all) of the infrastructure costs up front. If the user's business needs change, they will either need to go back and re-architect and re-implement the solution; or they will be left with excess capacity that they are paying for, but not utilizing.

Additionally, in the traditional provisioning model, the enterprise must provision the full computing capacity, floorspace, power and cooling, and storage requirements for all projected growth of the application, without regard to actual utilization.

As a result, across the entire enterprise IT world, we have average CPU utilizations on the order of 6% to 8%, average memory utilization of 18% to 40%, and allocated storage utilization of 14% to 40% (these numbers represent broad ranges, because there are multiple conflicting data sources, depending on which organization, infrastructure, and measuring method are involved).

Compare these estimates, to “best practice” utilization goals of 40% to 60% average CPU and memory utilization, and 60% to 80% allocated storage utilization.
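To put the gap between those two sets of numbers in perspective, here is a small Python sketch that divides the target ranges by the observed ranges. The `headroom_factor` helper is my own illustration; the percentages are simply the ranges quoted above, and the result is a back-of-the-envelope ratio, not a capacity plan.

```python
# Ranges quoted in the text, as (low, high) percentages.
OBSERVED = {"cpu": (6, 8), "memory": (18, 40), "storage": (14, 40)}
TARGET = {"cpu": (40, 60), "memory": (40, 60), "storage": (60, 80)}


def headroom_factor(observed, target):
    """Return (min, max) multiple of current load the same hardware could carry.

    The minimum pairs the highest observed figure with the lowest target;
    the maximum pairs the lowest observed figure with the highest target.
    """
    obs_low, obs_high = observed
    tgt_low, tgt_high = target
    return (tgt_low / obs_high, tgt_high / obs_low)


for resource in OBSERVED:
    low, high = headroom_factor(OBSERVED[resource], TARGET[resource])
    print(f"{resource}: {low:.1f}x to {high:.1f}x")
```

On the CPU figures alone, that works out to the same hardware being able to carry somewhere between five and ten times its current load before reaching the best practice range.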

This is harmful both to the end user, and to the enterprise as a whole, because this underutilized infrastructure presents a large fixed cost, as well as a significant allocation of limited facilities resources.

Taken over an entire large enterprise, this inefficiency in resource allocation, implementation, and utilization can add up to hundreds of millions of dollars in wasted time, and wasted capacity. Across the entire world of enterprise IT, the total could be hundreds of billions.

Additionally, the significant upfront costs, and three to four year commitment of resources required to implement any solution; create an environment hostile to the development of new and innovative solutions. It is extremely difficult to create, develop, and test new technologies (that may, or may not, present viable business solutions when fully developed); if there are significant resource dedication requirements to even basic experimentation on a small scale.

Finally, the provisioning of physical infrastructure requires a significant expenditure of time and effort across many groups within an enterprise. Workflow analysis across large enterprise IT, shows that up to 160 individuals may be directly involved with the provisioning of a single hardware instance; and that provisioning may take up to 12 weeks from the inception of a project.

THE VIRTUAL COMPUTING MODEL

Virtual computing aims to address the issues raised above, by provisioning computing resources for applications in a manner independent of the hardware the application will be provisioned on.

Virtualizing an application allows an end user to specify only the resources they need, and allows the enterprise to allocate only what is required, from pools of available computing resources. This can be managed in a centralized way, to ensure adequate capacity, performance, availability, and quality of service are maintained; while improving average utilization of individual hardware instances from the single digit percentages typical today, up to as much as 60% (although average utilization can be increased over 60%, this is against best practices of capacity planning).
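A quick way to see what that utilization improvement means in hardware terms is a naive consolidation estimate. The Python sketch below is my own simplification: it assumes identical hosts, assumes load adds up linearly, and ignores hypervisor overhead, peak demand, and failover capacity, all of which a real capacity plan would have to account for.

```python
import math


def hosts_needed(server_count, avg_utilization_pct, target_utilization_pct):
    """Estimate hosts required to carry the same total load at a target utilization.

    Simplifying assumptions (mine, for illustration only): every machine is
    identical, load is additive, and virtualization overhead is zero.
    """
    total_load = server_count * avg_utilization_pct  # in "percent of one host"
    return math.ceil(total_load / target_utilization_pct)


# 100 dedicated servers averaging 7% CPU, consolidated to a 60% target.
print(hosts_needed(100, 7, 60))
```

Under those assumptions, 100 dedicated servers averaging 7% utilization would fit on 12 hosts running at 60%.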

Additionally, if an application's resource needs shrink or grow, virtualization allows the end user the flexibility to request additional computing resources, or reduce their computing resources (and thus reduce their cost); without rebuilding and re-provisioning the application.

Critically, this capacity can be provisioned at little incremental cost, and with minimal effort and involvement of personnel; far more rapidly (a matter of hours or days) than physical infrastructure can be provisioned.

Presuming efficiency and effective management are maintained in the virtual environment, these efficiencies of process and materiel can reverse the potentially hundreds of billions of dollars of waste in the traditional application provisioning model.

In future posts in this series, I will discuss virtualization technologies and methodologies.