The Goal of AI in ITOps Is Operational Improvement

I’ve been speaking with folks again recently about AI initiatives in their IT organizations. Most recently, it was a systems team that ran the on-prem server infrastructure and the Azure environment, and had just been tasked with end-user computing as well.

What struck me in the conversation was how the discussion kept going back to AI. That might seem like a weird thing to say, considering the goal was to discuss AI initiatives, but let me explain.

AI is a technology; it’s a tool. Yes, I understand that in reality AI is a complex and sophisticated family of technologies, but ultimately it’s still a tool. So unless you’re literally building models and live in academia or the world of R&D, AI is not a goal unto itself. Starting with AI was the wrong approach in the discussion with this systems team, and it’s usually the wrong approach for any team.

For most of us in our respective niche of IT operations, the goal of AI is operational improvement.

Instead of starting with AI, that means starting with identifying the operational problem we want to solve, and then understanding things like organizational roadblocks, telemetry pipelines, skills gaps, and so on. Applying AI for the sake of AI is a costly and futile endeavor. If the ROI isn’t there, measured as operational improvement against whatever KPIs matter to the organization, the initiative is almost certainly doomed to fail.

Much of the time, what we mean by AI is classical AI (rules-based systems) or ML-based predictive analytics. An LLM can certainly play a helpful role in ITOps as well, but the thing to remember is that these are all tools that help us achieve some sort of measurable outcome.

For the systems team I was speaking with, the outcomes were about improving infrastructure stability, reducing the MTTR of incidents, and finding ways to reduce cost (especially cloud cost).

With AI, as with pretty much any technology, architecture is strategic and tools are tactical. If AI is a tool, then it comes only after outcomes are defined: outcomes first, then the architecture, and only then the specific models or algorithms.

So if the goal is improving IT operations, we’ll only know where there’s potential for an AI initiative after identifying the relevant outcomes in terms of KPIs. The goal should never be implementing an AI solution for AI’s sake.

Could some of the outcomes the systems team identified be achieved through means other than AI? Possibly, and that’s the whole point. Now that we had clear operational outcomes, we could survey the landscape of tools and methodologies that would get us there in a way that made sense.

Here’s a hypothetical scenario.

Perhaps through discovery we learn that the majority of data center incidents stem from bad optics installed during a hardware refresh five years ago. One option is to build a workflow to generate, ingest, clean, and transform telemetry from all the data center switches and optics, pull in vendor information such as operating temperature thresholds, and then apply a model like logistic regression to assign a probability of failure to each optic in near real time.
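
To make this concrete, here’s a minimal sketch of just the modeling step, assuming the telemetry has already been collected and cleaned into a table. The file name, column names, and failure labels are hypothetical, purely for illustration:

```python
# Hypothetical sketch: scoring optic failure probability with logistic
# regression. Assumes telemetry has already been collected and cleaned;
# the file, feature columns, and "failed" label are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One row per optic: recent telemetry readings plus a historical failure label.
df = pd.read_csv("optic_telemetry.csv")  # placeholder data source
features = ["temperature_c", "rx_power_dbm", "tx_power_dbm", "age_months"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["failed"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")

# predict_proba returns [P(healthy), P(failure)] per optic; keep the latter,
# then surface the optics most likely to fail.
df["failure_probability"] = model.predict_proba(df[features])[:, 1]
print(df.sort_values("failure_probability", ascending=False).head(10))
```

The model itself is the easy part; everything feeding it is where the real cost lives.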

That sounds amazing, right? But think about the cost to develop the telemetry pipeline, the infrastructure to process and store the data, and the skills gap that needs to be addressed; in other words, the total cost of development and ownership.

Another option might be to simply replace all the optics from that hardware refresh.

You see, an AI initiative for the sake of doing AI can be very complex and may achieve the desired outcome only at unreasonable cost. And, unfortunately, the track record is that many AI initiatives fail for one reason or another. The goal must be to identify operational outcomes and then determine whether AI (using whatever definition you prefer) is the best way to achieve them.

Sometimes AI is the best solution, as we’ve seen with many streaming media, e-commerce, financial, and package delivery organizations. So I’m not decrying the use of AI and declaring it modern-day snake oil. Instead, I encourage you to start by identifying the operational problem you want to solve and defining the operational outcomes your solution should produce.

For the systems team I’ve been referring to, that meant digging deeper into their automation workflows, only to discover some self-induced problems in how they recovered from server failures. The fix was a matter of reviewing and adjusting the flow of their runbooks.

On the other hand, now that they were tasked with end-user computing, they struggled to understand trends in the incidents end users were reporting. Were there any patterns based on OS? Geographic location? The type of work the end user was doing? They felt like they were flying by the seat of their pants on every ticket.

In this case, we discussed the possibility of pointing an LLM at their ticketing platform in a retrieval-augmented generation (RAG) system to identify patterns and trends. Yes, I know it’s much more complex than simply pointing an LLM at a ticketing system, but here, AI (generative AI in the form of an LLM plus a vector database) might just be the best way to achieve the outcome they wanted.
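
To show the shape of what we discussed (not their actual implementation), here’s a minimal RAG sketch. The `embed_text` and `ask_llm` functions are stand-ins for whatever embedding model and LLM API you’d actually use, and the tickets are invented:

```python
# Minimal RAG sketch over incident tickets. embed_text() and ask_llm() are
# placeholders for a real embedding model and LLM API; tickets are made up.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder: call a real embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)  # fake 384-dim embedding

def ask_llm(prompt: str) -> str:
    """Placeholder: call a real LLM API here."""
    return "(model answer would go here)"

# A toy "vector database": each ticket stored alongside its embedding.
tickets = [
    "Laptop won't connect to VPN after Windows update, Chicago office",
    "Outlook crashes on macOS when opening shared calendars",
    "VPN drops every 30 minutes for remote engineering users",
]
index = [(t, embed_text(t)) for t in tickets]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k tickets most similar to the query (cosine similarity)."""
    q = embed_text(query)
    scored = [(t, np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for t, v in index]
    return [t for t, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:k]]

# Retrieve the most relevant tickets, then ask the LLM with that context.
question = "Are there any patterns in our VPN incidents?"
context = "\n".join(retrieve(question))
print(ask_llm(f"Tickets:\n{context}\n\nQuestion: {question}"))
```

In a real system the index would live in a proper vector database and retrieval would pull in far more context, but the retrieve-then-ask flow is the essence of RAG.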

This isn’t specific to AI. Technology for the sake of technology is never the right approach, though in my experience, that can be tough to live out. We live under the tyranny of the present: the shiny new tech that captures the hearts and minds of our industry. And many of us (or maybe just me?) have made the mistake of approaching a project as a resume-building endeavor. That’s never the right approach either.

AI is no different. As we see more use cases for AI in ITOps, and as we deepen our understanding of the role it plays in our industry, we must start with the outcomes we want to achieve and always keep in mind that the goal of AI is operational improvement.

Thanks,

Phil
