A Beginner’s Guide to Incident Management

Today’s technology-driven businesses require a methodology to bounce back from IT system issues. Discover how incident management plays a key role.

Updated June 26, 2020

It’s inevitable. Every business eventually encounters technology issues affecting the organization, or worse, its customers. Who do you call in this situation?

That’s when your company’s Information Technology (IT) team springs into action. Even then, they require the right processes to effectively address the issue and get systems back to normal. That’s the role of incident management.

Overview: What is incident management?

An "incident" is the IT industry’s term for an unplanned disruption or a degradation in IT systems performance. Incident management addresses these events to restore the affected systems to a normal state.

The incidents vary in severity. Sometimes, IT systems experience slowness. In other cases, systems suffer a complete outage. Incident management handles these events differently.

Various IT frameworks, such as the ITIL processes (Information Technology Infrastructure Library), outline the steps for incident management.

Whether you implement an established methodology, such as ITIL v3, or create your own, you need to outline the process for executing incident management. All team members involved in that process must understand and support it.

For example, software developers may not be the ones to field problems, but if the issue resides in the code they wrote, they must stop what they’re doing to address it.

Benefits of incident management

Every organization using technology requires incident management protocols. The benefits are significant while the impact of not having them is costly.

1. Preserves revenue

If your business relies on technology, proper incident management preserves a tremendous amount of revenue. Gartner Research estimates the average cost to a business from an hour of system downtime at over $300,000.

Imagine a company that relies on a website for sales, such as Amazon.com. If their website goes down for several hours, the lost revenue could be astronomical.
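
To make the arithmetic concrete, here is a minimal Python sketch that turns outage length into a rough cost estimate. The hourly figure is the Gartner estimate quoted above, and the function name is hypothetical. Your own hourly cost could be far higher or lower.

```python
# Assumption: the hourly cost is the Gartner average quoted in this article.
# Replace it with a figure that reflects your own business.
HOURLY_DOWNTIME_COST_USD = 300_000


def estimated_downtime_cost(minutes_down, hourly_cost=HOURLY_DOWNTIME_COST_USD):
    """Return a rough cost estimate, in dollars, for an outage of the given length."""
    if minutes_down < 0:
        raise ValueError("minutes_down must be non-negative")
    return hourly_cost * minutes_down / 60


# A 45-minute outage at the average rate.
print(f"${estimated_downtime_cost(45):,.0f}")
```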

2. Increases customer satisfaction

Ever been on the phone with a business and the representative on the other side asked you to wait because their system was slow? Have you downloaded an app that caused your computer or smartphone to slow down or drain its battery quickly?

These are examples of how technology problems can turn away customers. If your internal business systems experience frequent incidents, you can’t efficiently service your clients. If your product is technology-based but runs into technical issues, customers will stop using the product.

Incident management not only addresses these situations as they arise; its processes also help ensure the problem doesn’t come back, improving the customer experience.

3. Improves business efficiency

If your company’s staff rely on IT systems for their jobs and those systems suffer issues, their ability to work declines. Having a robust incident management process keeps employees working and productive.

And it’s not just employee productivity that improves. Incident management strives to learn how to prevent the problem from recurring. When system issues are minimized or prevented, the entire business improves its efficiency.

Workers aren’t impacted and IT teams can focus on tasks that add value to the organization instead of fighting fires.

5 steps in the incident management process

Addressing a technical problem involves steps that comprise the incident response life cycle.

Step 1: Detect

The first step of the incident management process involves detecting the issue. If customers or other system users report a problem, that’s one means of detection, but it’s the least desirable because users have already been affected.

The ideal approach is for the IT team to set up automated monitoring systems that constantly analyze critical IT infrastructure and software, proactively looking for problems. One means of doing so is to establish benchmarks for system performance.

The monitoring solution then regularly checks to ensure those benchmarks are met, and if not, an alert sets off notifications to the IT team so further investigation can occur. The goal is for customers or users to never know an issue cropped up.
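
A benchmark check of this kind can be sketched in a few lines of Python. This is only an illustration: the metric names, thresholds, and the Benchmark and check_benchmarks names are assumptions, not part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    """A performance threshold for one monitored metric (hypothetical shape)."""
    metric: str
    max_allowed: float


def check_benchmarks(readings, benchmarks):
    """Return one alert message per benchmark that is missed or has no reading."""
    alerts = []
    for benchmark in benchmarks:
        value = readings.get(benchmark.metric)
        if value is None:
            # A missing reading can itself signal trouble, so alert on it too.
            alerts.append(f"{benchmark.metric}: no reading received")
        elif value > benchmark.max_allowed:
            alerts.append(
                f"{benchmark.metric}: {value} exceeds benchmark {benchmark.max_allowed}"
            )
    return alerts


# Example values are made up for illustration.
benchmarks = [
    Benchmark("page_load_ms", 800),
    Benchmark("error_rate_pct", 1.0),
]
readings = {"page_load_ms": 1250, "error_rate_pct": 0.4}

for alert in check_benchmarks(readings, benchmarks):
    print(alert)
```

In practice, a scheduler or the monitoring tool itself would run a check like this at regular intervals and hand any alerts to a notification system.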

These alerts must be routed to the appropriate IT staff member. If the issue resides in the software, the people who wrote the code need to know about the event. If it appears to be a hardware issue, alerts go to the team members responsible for that part of the IT system.
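
One simple way to picture alert routing is a lookup table from alert category to responsible team. The sketch below assumes hypothetical category and team names, and falls back to the service desk when a category isn’t recognized.

```python
# Hypothetical categories and team names, for illustration only.
ROUTING_TABLE = {
    "software": "development on-call",
    "hardware": "infrastructure on-call",
}
DEFAULT_TEAM = "service desk"


def route_alert(category):
    """Return the team that should receive an alert in the given category."""
    return ROUTING_TABLE.get(category.strip().lower(), DEFAULT_TEAM)


print(route_alert("Software"))
print(route_alert("unknown"))
```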

As incidents occur, log the details. Include the date and time, a description of the affected systems and nature of the problem, and a category assignment that allows tracking of similar issues to identify trends.
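
As a rough sketch, an incident log entry with those fields might look like the following in Python. The IncidentRecord and log_incident names are hypothetical, and an in-memory list stands in for the help desk software or database a real team would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """The details worth capturing for each incident."""
    description: str        # nature of the problem
    affected_systems: list  # which systems were hit
    category: str           # used later to spot trends across incidents
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Assumption: a plain list stands in for real ticketing software.
incident_log = []


def log_incident(description, affected_systems, category):
    """Create a record stamped with the current date and time, then store it."""
    record = IncidentRecord(description, affected_systems, category)
    incident_log.append(record)
    return record


entry = log_incident("Checkout page is slow", ["web store"], "performance")
print(entry.category, entry.detected_at.isoformat())
```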

Many incidents go first to help desk staff, also called service desk, particularly if a user is reporting the problem. The help desk represents the frontline IT team members who communicate with users about IT requests and issues. Typically, this team uses specialized IT help desk software to manage incidents and user requests through IT tickets.

Step 2: Prioritize

Incidents require prioritization. A business won’t have enough personnel to respond to every incident equally, and some are so minor that a response isn’t warranted.

Define various prioritization levels based on impact to your business and customers. For example, if your software creates problems for a single user, maybe the user’s computer is outdated or there is another root cause specific to that individual. That issue would be lower in priority than a system outage that affected multiple users.

Coupled with prioritization levels, determine which team members need to get involved at each level. Some situations require all hands on deck while others can be resolved by service desk personnel provided with the appropriate technical training.
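
Prioritization levels and the people tied to each one can be written down as simple rules. The Python sketch below is one possible scheme, not a standard: the four levels, the rules in prioritize, and the responder lists are all assumptions you would adapt to your own business.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def prioritize(users_affected, full_outage, customer_facing):
    """Assign a priority level based on business and customer impact."""
    if full_outage:
        return Priority.CRITICAL if customer_facing else Priority.HIGH
    if users_affected > 1:
        return Priority.HIGH if customer_facing else Priority.MEDIUM
    # A single affected user often points to a cause specific to that person.
    return Priority.LOW


# Hypothetical mapping of each level to the people who get involved.
RESPONDERS = {
    Priority.LOW: ["service desk"],
    Priority.MEDIUM: ["service desk", "system owner"],
    Priority.HIGH: ["system owner", "on-call engineer"],
    Priority.CRITICAL: ["on-call engineer", "engineering lead", "IT management"],
}

level = prioritize(users_affected=200, full_outage=True, customer_facing=True)
print(level.name, RESPONDERS[level])
```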

Step 3: Respond

After completing an initial assessment, respond appropriately. A response means you’re looking into the problem, and the appropriate incident communication occurs.

Responses must be immediate, even if it’s just to inform users of a problem and that it’s being worked on. Responses range from looping in team members who can address the situation, such as software developers, to investigating the issue to determine the root cause.

If the incident prioritization level is high, responses may involve escalation to other teams or supervisors. This can include waking up team members in the middle of the night if critical systems are down.
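
Escalation rules can also be made explicit. In this sketch, a high-priority incident starts with two levels of responders and adds another level each time an acknowledgment window passes with no response. The chain of roles and the 15-minute window are assumptions chosen for illustration.

```python
# Hypothetical escalation chain, ordered from first responder upward.
ESCALATION_CHAIN = ["service desk", "on-call engineer", "team supervisor"]

# Assumption: escalate one more level for every 15 minutes without acknowledgment.
ACK_TIMEOUT_MINUTES = 15


def escalation_targets(is_high_priority, minutes_unacknowledged):
    """Return who should be notified, given priority and time without a response."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    if not is_high_priority:
        return ESCALATION_CHAIN[:1]
    levels = 2 + minutes_unacknowledged // ACK_TIMEOUT_MINUTES
    return ESCALATION_CHAIN[:min(levels, len(ESCALATION_CHAIN))]


print(escalation_targets(True, 0))
print(escalation_targets(True, 20))
```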

Step 4: Recover

This step is all about resolving issues. To recover from a system issue, you must know what’s causing the problem and who possesses the knowledge to fix it. The goal is to get the system back to a normal state of function quickly.

Continue communicating status to all external or internal stakeholders throughout the recovery process to keep people informed. This raises stakeholder confidence in the IT team. Once the affected systems are restored, immediately inform all affected users.

Sometimes, the recovery process involves multiple steps. A quick fix may be required in the short term to return affected systems to a usable state while more holistic, longer-term fixes are worked on to ensure the issue doesn’t recur.

Step 5: Learn and improve

Every incident creates a learning opportunity. By reflecting on the incident, the IT team can identify how to prevent it from recurring and how to further streamline the incident management life cycle.

Look at incident data for trends that point to a deeper underlying problem, which calls for problem management, rather than an isolated incident.
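
If each logged incident carries a category, spotting a trend can be as simple as counting. This Python sketch flags any category that shows up at least a set number of times. The threshold of three and the function name are assumptions.

```python
from collections import Counter


def recurring_categories(categories, threshold=3):
    """Return the categories that appear at least `threshold` times, with counts."""
    counts = Counter(categories)
    return {name: n for name, n in counts.items() if n >= threshold}


# Made-up categories from a month of logged incidents.
logged = ["database", "network", "database", "login", "database"]
print(recurring_categories(logged))
```

A category that keeps reappearing is a candidate for a deeper root-cause investigation rather than another one-off fix.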

Unless the incident was minor, perform an incident postmortem. The postmortem, like an autopsy performed to determine a cause of death, is a formal process in which the IT team digs into why the incident occurred, what can be learned from it, and what action plan will address outstanding concerns.

The postmortem is a blameless process focused on how the team can better serve your customers. The objective is to create a continuous process of improvement so that the same incident never occurs twice.

A last word on incident management

A big piece of incident management success is data. With incidents, data of all types comes in handy. You need data to track trends and report on the number and types of incidents you’re experiencing. Data also helps you set the appropriate benchmarks for incident alerting by your monitoring system.

Data helps your IT team gain insights for improvement, such as how to shorten recovery time.
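
For instance, if you record when each incident was detected and when service was restored, you can compute an average recovery time and watch whether it shrinks. The sketch below assumes each incident is stored as a pair of timestamps. The function name and sample times are made up.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average the gap between detection and restoration across incidents.

    `incidents` is a list of (detected_at, restored_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Made-up sample data: one 30-minute and one 90-minute recovery.
start = datetime(2020, 6, 26, 9, 0)
sample = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=90)),
]
print(mean_time_to_recover(sample))
```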

With a combination of data, an incident management process, and the people and tools to support it, your organization can deliver incident management that resolves problems before your customers are aware. That’s the goal of incident management.

The Motley Fool has a Disclosure Policy. The Author and/or The Motley Fool may have an interest in companies mentioned.