Can AI speed up root cause analysis in networks?

Mobile networks are designed and built with security in mind, and it takes a lot for attackers to do any harm. Still, security incidents may happen, and when they do, fast countermeasures are critical. Of equal importance is incident investigation: finding the root cause and taking the appropriate actions so that the real problem can be addressed quickly and prevented from occurring again. Such measures become even more critical as new applications, use cases and industries connect to the network, raising the requirements for high resilience, minimal downtime, and fast recovery.

One such measure is root cause analysis. Recent high-profile network and cloud outages have led R&D communities to intensify their efforts to develop solutions that can restore services much faster and minimize damages and losses. Today, there are also compliance regulations in place that require service providers to provide timely information about the root cause of incidents, reinforcing the importance of root cause analysis.

In the era of 5G, networks rely on virtualization for enhanced flexibility and performance. To realize these benefits, network function virtualization introduces several levels. When an incident occurs, symptoms detected at one level may very well have their real root cause at some other level. It is important to link the symptoms to the real root cause efficiently and effectively, even if they belong to different levels.

What is network function virtualization?

Network function virtualization is the migration of network functionality from physical custom-built network nodes to software that runs on a generic hardware compute platform. It makes it possible for communication service providers (CSPs) to manage, move and expand their network capabilities on demand using virtual, software-based applications across distributed hardware resources.

In collaboration with researchers at Concordia University, we have explored the possibilities for improved incident investigation in virtualized environments, specifically addressing 5G. This research resulted in a novel solution that combines the well-established provenance graph analysis with AI-based techniques for efficient root cause analysis.

Let’s expand a bit on the thinking behind this research work, and its applicability in mobile networks.

What is incident investigation using provenance? And what are the challenges?

A provenance graph is a well-known tool for capturing causal relationships between events happening in a system. Provenance graph analysis can help identify the root cause of security incidents by tracking back through all events in sequence, from the last logged event related to the incident (that is, the symptom) all the way to the source event that caused the incident: the root cause.
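To make the backtracking idea concrete, here is a minimal sketch in Python. The event names and the `caused_by` mapping are invented for illustration and do not come from any real system. The sketch walks backward from a symptom and reports the events that have no recorded cause as root-cause candidates.

```python
from collections import deque

# Hypothetical provenance data: each event maps to the events that caused it.
caused_by = {
    "alert:vm_unreachable": ["op:detach_port"],
    "op:detach_port": ["op:update_network"],
    "op:update_network": ["op:login_admin"],
    "op:login_admin": [],
    "op:create_volume": [],  # unrelated event, never reached from the symptom
}

def trace_back(symptom, caused_by):
    """Breadth-first backward walk from the symptom.

    Returns the visited events in visiting order and the visited events
    that have no recorded cause (the root-cause candidates).
    """
    visited, order, queue = {symptom}, [], deque([symptom])
    while queue:
        event = queue.popleft()
        order.append(event)
        for cause in caused_by.get(event, []):
            if cause not in visited:
                visited.add(cause)
                queue.append(cause)
    roots = [event for event in order if not caused_by.get(event)]
    return order, roots

path, roots = trace_back("alert:vm_unreachable", caused_by)
print(roots)
```

Even in this toy example the unrelated event is never visited, which hints at why the cost of the walk depends on how much of the graph is connected to the symptom.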

However, in a virtualized environment, this process may become challenging and costly. This is because:

  • With an increased number of logged events, the effectiveness and scalability of existing provenance-based solutions may significantly decrease if applied as-is.
  • The multi-level nature of the network function virtualization (NFV) environment makes capturing and analyzing provenance very challenging and error-prone without appropriate models and processes.

For instance, identifying the causal dependency and semantic relationship between events that have occurred at different levels requires extensive domain knowledge and most probably human expertise. However, the task of the human analyst could still be made quicker and easier with support of the right tools.

How to solve incident investigation in multi-level NFV

In this context, we took on a research journey resulting in what we call ProvTalk – a provenance analysis system designed to handle the unique multi-level nature of NFV. It is based on our earlier root cause analysis research prototype, DominoCatcher.

Domino Catcher root cause analysis research prototype

Our solution is developed together with experts at Concordia University and addresses the following:

  • Links the provenance graphs at different levels of the NFV stack by capturing the cross-level dependencies.
  • Assists the human analyst in identifying the root cause of security incidents. To this end, it employs graph pruning techniques and data mining approaches for (system or user-related) frequent patterns to encapsulate the complexity of the graph analysis via aggregations, while preserving the valuable details for efficient root cause analysis.
  • Translates details of a provenance graph (or a subset of it) into an incident report that human analysts can interpret, using a rule-based approach.

Figure 1: Overview of the provenance analysis solution

Provenance analysis and data mining: How it works

Let’s take a closer look at the technical features behind the new provenance analysis solution.

To enable provenance analysis, we first defined a platform independent provenance model based on the World Wide Web Consortium (W3C) standard specification PROV-DM, making it possible for us to organize different levels of the NFV stack into different layers in the provenance graph. This model captures virtual resources (at different levels of abstractions) as nodes and operations on those resources as edges connecting the nodes. To define the cross-level dependencies, we used specifically labeled edges to connect virtual resources from different levels.
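As a rough illustration of such a layered model, the following Python sketch represents resources as nodes tagged with a level and operations as labeled edges. The class names, level names, and edge labels (`Resource`, `"vnf"`, `"deployed_on"`, and so on) are assumptions made for this example; they are not taken from PROV-DM or from ProvTalk.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Resource:
    name: str
    level: str  # hypothetical level names, e.g. "vnf" or "virtual-infrastructure"

@dataclass(frozen=True)
class Edge:
    source: Resource
    target: Resource
    label: str         # operation name, or a dedicated cross-level label
    cross_level: bool  # True when the edge connects two different levels

@dataclass
class ProvenanceGraph:
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, label):
        # An edge is flagged as cross-level whenever its endpoints sit at different levels.
        edge = Edge(source, target, label, source.level != target.level)
        self.edges.append(edge)
        return edge

    def cross_level_edges(self):
        return [edge for edge in self.edges if edge.cross_level]

vnf = Resource("firewall-vnf", "vnf")
vm = Resource("vm-42", "virtual-infrastructure")
port = Resource("port-7", "virtual-infrastructure")

graph = ProvenanceGraph()
graph.add_edge(vm, port, "attach_port")  # same-level operation
graph.add_edge(vnf, vm, "deployed_on")   # cross-level dependency
print([edge.label for edge in graph.cross_level_edges()])
```

Keeping the level on each node means the cross-level edges can be found by a simple filter, which is what later analysis steps would follow to move between layers.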

Once the model has been defined and verified, we then use it to automatically capture all virtual resources at different levels and management operations modifying them, using event interception mechanisms deployed as middleware, to trace back what is happening at runtime at different levels.
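One simple way to picture event interception is a wrapper that records each management operation before letting it run. The sketch below uses a Python decorator for this; the function names, level names, and the `provenance_log` list are hypothetical, and real middleware in an NFV platform would hook into the platform's own APIs instead.

```python
import functools

provenance_log = []  # captured events, in call order

def intercept(level):
    """Decorator that records a management operation, then executes it."""
    def wrap(func):
        @functools.wraps(func)
        def inner(resource, *args, **kwargs):
            provenance_log.append(
                {"level": level, "op": func.__name__, "resource": resource}
            )
            return func(resource, *args, **kwargs)
        return inner
    return wrap

# Hypothetical management operations at two different levels.
@intercept(level="virtual-infrastructure")
def reboot_vm(resource):
    return f"{resource} rebooted"

@intercept(level="vnf")
def scale_vnf(resource, replicas):
    return f"{resource} scaled to {replicas}"

reboot_vm("vm-42")
scale_vnf("firewall-vnf", replicas=3)
print(provenance_log)
```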

But what needs to be done when an incident happens, for example a security breach of the virtualized resources at any given level? First and foremost, the multi-level provenance graph needs to be examined for the root cause, covering everything recorded up to the point at which the alert was first received. As human involvement is fundamental to this process, and since such a provenance model generally includes too much low-level information to be processed manually, we have developed a set of tools that simplify the provenance-related information and make it easier for human analysts to interpret and understand what has happened. All of these tools are executed automatically, and the resulting information is then provided to the analyst, who can adjust and analyze it by applying their own expertise.

These tools are executed in a three-step process:

Step 1: Multi-level pruning

The first tool is multi-level pruning, which uses the meta-information from the incident alert, together with the cross-level dependencies, to filter irrelevant information out of the provenance graph. This means that human analysts can identify potentially irrelevant parts of the graph at different levels much more efficiently, a capability that is otherwise not available today. This tool narrows down the search space for root causes quite substantially.
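The pruning idea can be sketched as follows, under simplifying assumptions: the alert carries the affected resource and a timestamp, dependencies between resources (including cross-level ones) are given as a plain dictionary, and an event is kept only if it touches a resource the alerted resource depends on and happened no later than the alert. All names and data are made up for illustration, and the actual pruning criteria in the research may differ.

```python
# Hypothetical dependencies: resource -> resources it depends on.
depends_on = {
    "vnf:firewall": ["vm:42"],       # cross-level: VNF -> virtual infrastructure
    "vm:42": ["port:7", "host:a"],
    "vnf:dns": ["vm:99"],
    "vm:99": ["host:b"],
}

events = [
    {"id": 1, "resource": "port:7", "op": "update_port", "time": 10},
    {"id": 2, "resource": "vm:99", "op": "resize", "time": 12},
    {"id": 3, "resource": "vm:42", "op": "reboot", "time": 15},
    {"id": 4, "resource": "host:a", "op": "patch", "time": 30},
]

def relevant_resources(resource, depends_on):
    """Collect the resource and everything it transitively depends on."""
    seen, stack = set(), [resource]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen

def prune(events, alert, depends_on):
    """Keep events on relevant resources that occurred no later than the alert."""
    keep = relevant_resources(alert["resource"], depends_on)
    return [e for e in events if e["resource"] in keep and e["time"] <= alert["time"]]

alert = {"resource": "vnf:firewall", "time": 20}
pruned = prune(events, alert, depends_on)
print([e["id"] for e in pruned])
```

Here the event on `vm:99` is dropped because it belongs to an unrelated service, and the `host:a` patch is dropped because it happened after the alert.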

Step 2: Mining-based aggregation

The second tool is mining-based aggregation, which groups parts of the graph in a reversible manner to reduce redundancy and add high-level semantics to low-level operations. This can make the provenance graph much easier to understand. More specifically, this aggregation targets the most frequent sequences of lower-level operations that are automatically triggered after an upper-level operation in the NFV stack. It also targets routine administrative operations (e.g., maintenance tasks) that appear regularly in the provenance graph. Mining-based aggregation gives human analysts need-to-know information about what is happening at the lower levels, enabling them to focus on the main task: finding the root cause.
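A deliberately simplified sketch of reversible aggregation is shown below. Instead of a full sequential pattern mining algorithm, it just counts exact repetitions of a lower-level sequence following an upper-level operation and collapses the ones that meet a support threshold. The operation names, the threshold, and the "(routine)" label are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical traces: (upper-level operation, lower-level operations it triggered).
traces = [
    ("create_vnf", ("create_vm", "create_port", "attach_port")),
    ("create_vnf", ("create_vm", "create_port", "attach_port")),
    ("create_vnf", ("create_vm", "create_port", "attach_port")),
    ("create_vnf", ("create_vm", "delete_port")),
]

def frequent_patterns(traces, min_support):
    """Return the (upper, lower-sequence) pairs seen at least min_support times."""
    counts = Counter(traces)
    return {pattern for pattern, count in counts.items() if count >= min_support}

def aggregate(traces, patterns):
    """Collapse frequent sequences into one node each; keep the rest as-is."""
    nodes = []
    for upper, lower in traces:
        if (upper, lower) in patterns:
            # One summary node; the members are retained so the step is reversible.
            nodes.append({"label": f"{upper} (routine)", "members": list(lower)})
        else:
            nodes.extend({"label": op, "members": [op]} for op in lower)
    return nodes

def expand(nodes):
    """Undo the aggregation by restoring the original low-level operations."""
    return [op for node in nodes for op in node["members"]]

patterns = frequent_patterns(traces, min_support=3)
nodes = aggregate(traces, patterns)
print(len(expand(nodes)), "operations shown as", len(nodes), "nodes")
```

Because each summary node keeps its members, an analyst could expand any aggregated node again if the routine sequence turns out to matter.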

Step 3: Translate graph into human readable text

Lastly, when some paths have been identified in the provenance graph, the third tool can be leveraged to translate these parts into text that can be easily read by human analysts and provide additional useful guidance in the investigation process. This feature can also be used to generate a report describing the result of the investigations by the analyst. The generated report explains in natural language (in our case English) what has happened and how the incident symptoms are linked to the root cause. To do that, it describes which virtual resources and suspicious operations have been involved, the timing of what happened, and which parties performed those operations.
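A rule-based translation of this kind can be pictured as a lookup from operation type to sentence template. The Python sketch below is only an illustration: the templates, field names, and sample path are invented, and a real rule set would be considerably richer.

```python
# Hypothetical rules: operation type -> sentence template.
TEMPLATES = {
    "update_port": "{actor} updated {target} at {time}.",
    "reboot": "{actor} rebooted {target} at {time}.",
}
# Fallback rule for operations without a dedicated template.
DEFAULT_TEMPLATE = "{actor} performed {op} on {target} at {time}."

def describe(step):
    """Turn one step of a provenance path into an English sentence."""
    template = TEMPLATES.get(step["op"], DEFAULT_TEMPLATE)
    return template.format(**step)

def report(path):
    """Join the sentences for all steps, in path order."""
    return " ".join(describe(step) for step in path)

# Hypothetical path from root cause to symptom.
path = [
    {"op": "update_port", "actor": "user:alice", "target": "port-7", "time": "10:02"},
    {"op": "detach_volume", "actor": "user:alice", "target": "vm-42", "time": "10:05"},
]
print(report(path))
```

Each sentence names the party, the operation, the resource, and the time, mirroring the elements the generated report is described as covering.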

Towards efficient incident handling in 5G and future 6G

The main benefit of this research work is a concise and interpretable provenance graph built from data comprising a large number of events that have taken place across several levels of NFV. This eases the task of the human analyst in finding the root cause of a security incident. It leads to a significant reduction in provenance graph size without losing information vital to the investigation, and with lower latency and computational overhead. For CSPs, this delivers substantial benefits: reduced costs related to incident investigation and significantly improved incident response time.

We expect that our solution can be smoothly adapted as a foundation for efficient incident analysis in 6G as well. Many aspects of future 6G will evolve from 5G, with virtualization and cloud-native technologies as key factors, and we specifically designed ProvTalk to tackle the specificity and complexity of virtualized environments.

Further reading

Read the research paper behind the work, published in the Network and Distributed System Security Symposium (NDSS).

Learn more about Ericsson’s other cyber security initiatives developed together with Concordia University.

Learn more about Ericsson’s vision for future network security.

Learn more about network function virtualization (NFV), and its role in improving 5G trustworthiness.

This work has been carried out as part of the industrial research chair between Ericsson and Concordia University with funding from the Natural Sciences and Engineering Research Council of Canada (NSERC). Read more about it here.
