Building Service Monitoring as a Service with an Eye on the Cloud: Service Monitoring as a Strategic Opportunity

Author: Carroll Moon, Senior Architect for Cloud Service Management, Microsoft Services

September 30, 2014

One of the most important aspects of delivering a world-class service, regardless of whether the word "cloud" is associated with the service or not, is Service Monitoring.

Monitoring should not be considered a commodity, but rather a strategic opportunity for Service Delivery. Why? Because the worst thing that can happen in Service Delivery is not an incident (incidents happen); the worst thing is having a customer or user be the one who notifies you about a service incident.

Think back to the last time the electricity failed at your home. You probably phoned the utility company and were delighted to learn that it was fully aware of the incident and already working on service restoration. A recording probably gave you an estimated resolution time too. Despite the incident itself, you were not dissatisfied, because you recognize that electrical service is imperfect and that incidents will happen.

But if, when reporting the incident and its impact on you and your street, you encountered a surprised support technician who knew nothing about the outage, you might feel differently. You might even feel like the hospital patient who needs to notify the nurse that the bedside heart rate monitor is no longer working!

From a user and customer perspective, a Service Incident has two possible outcomes:

A. The service provider is aware of the incident (and ideally has proactively informed the affected users).
B. The service provider is not aware of the incident, and the user (or customer) has to report the impact.

"B" is the norm and "A" is so rare that it's actually a delighter and differentiator for the customer. The first step to proactively communicating to users and customers is to actually have knowledge of the incident; you need to know there is an incident before you can communicate an incident. Therefore, if monitoring fails, the service provider does not even have a chance to delight the customers.

Taking only a "Monitoring-Platform" Approach will not Provide the Desired Results

Most enterprise IT organizations take a breadth or horizontal monitoring approach. There is a centralized monitoring team responsible for the console(s) in the Operations Center. That team is usually most concerned with integrating network, datacenter, operations, and facilities monitoring. It has usually also made provisions for application/service teams to post critical events to the console. If the monitoring team is mature, it may be monitoring hardware and operating systems for critical alerts (like "bad power supply" or "disk full") across all servers in the datacenter. But that is about as far as a centrally delivered, horizontal monitoring solution goes, and that approach does not solve the strategic need to be great at monitoring the actual service.

Most application/service teams recognize the gap in the centralized-only approach and drive investments in depth or vertical monitoring. Those vertical investments usually come in two flavors:

First, the app/service team gets into the monitoring business. These teams tend to buy and deploy tools and hope that the monitoring packs for the application or service will cover the need, but the results never seem to meet expectations. The downside of this approach is that the app team duplicates the central monitoring team's effort and spends most of its time running a monitoring platform rather than building the app-specific monitoring rules that were the objective in the first place.

Second, the centralized monitoring team tries to solve the application/service need, but it does not understand the app/service well enough to build the rules that are needed, and therefore it struggles to add any incremental value.

Neither option capitalizes on monitoring as a strategic opportunity for world-class service delivery. With either approach, the service delivery team will fail to discover incidents, and users and customers will have to call to report the impact. They will be very dissatisfied as a result.

The Key is to Take a "Monitoring Platform" Approach and a "Service-Centric" Approach to Service Monitoring

What is the answer? The best option is to turn the "or" into an "and" - make the breadth and depth teams accountable for different aspects of end-to-end monitoring. The central monitoring platform team should deliver a standardized monitoring platform that provides the enterprise console, paging/texting, email, automated callouts, and so on. The app/service team should be made accountable for the following objectives:

I. Count: incidents not caught by monitoring (i.e., calls from users or customers to report incidents)
II. Count: alert volume
III. Count: incident volume
IV. Boolean: integrating with the standardized monitoring platform?

Holding the app/service team accountable for the balanced objectives above, and measuring the team against them, will ensure that the application team builds the right monitoring rules and integrates with the existing investments in the monitoring platform owned by the centralized monitoring team. Integrating with the standardized platform has the added benefit of allowing all monitoring data to be sourced from one place, which enables more insightful reporting for Proactive Problem Management.
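
As a rough illustration, the four objectives above could be captured in a small per-service scorecard. This is only a sketch under stated assumptions: the names `Incident`, `DetectedBy`, `MonitoringScorecard`, and `build_scorecard` are hypothetical, the sample data is made up, and the article does not prescribe any particular tool or data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class DetectedBy(Enum):
    """How the provider first learned of an incident (hypothetical labels)."""
    MONITORING = "monitoring"
    USER_REPORT = "user_report"


@dataclass(frozen=True)
class Incident:
    incident_id: str
    detected_by: DetectedBy


@dataclass(frozen=True)
class MonitoringScorecard:
    missed_by_monitoring: int  # I.   incidents first reported by users or customers
    alert_volume: int          # II.  alerts raised during the period
    incident_volume: int       # III. all incidents during the period
    platform_integrated: bool  # IV.  integrated with the standardized platform?


def build_scorecard(
    incidents: Iterable[Incident],
    alert_volume: int,
    platform_integrated: bool,
) -> MonitoringScorecard:
    """Summarize one reporting period for a single app/service team."""
    incident_list = list(incidents)
    # Objective I counts incidents that a user reported before monitoring did.
    missed = sum(
        1 for inc in incident_list if inc.detected_by is DetectedBy.USER_REPORT
    )
    return MonitoringScorecard(
        missed_by_monitoring=missed,
        alert_volume=alert_volume,
        incident_volume=len(incident_list),
        platform_integrated=platform_integrated,
    )


# Made-up sample data for one reporting period.
sample = [
    Incident("INC-1", DetectedBy.MONITORING),
    Incident("INC-2", DetectedBy.USER_REPORT),
    Incident("INC-3", DetectedBy.MONITORING),
    Incident("INC-4", DetectedBy.MONITORING),
]
card = build_scorecard(sample, alert_volume=120, platform_integrated=True)
print(card)
```

Keeping all four values in one record reflects the point that the objectives are meant to be read together rather than optimized one at a time.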

Taking the two-pronged approach (the monitoring platform team delivers a world-class monitoring platform, and the app/service team is accountable for the outcomes of monitoring for the service) maximizes the return on investment in the centralized monitoring platform by raising its importance for service delivery. Most importantly, the service ends up well monitored because the app/service team is accountable for the results rather than for ticking a box for commodity "monitoring".