IBM AIOps Training

Understand the Incident

📥 STEP: Understand the Incident

🚀 Action Click twice on the Last occurrence header.

Result: The “Commit in repository robot-shop by Niklaus Hirt on file robot-shop.yaml” alert should be at the bottom.

image

📣 Narration

To understand what happened during the incident, I sort the Alerts by last occurrence. This lets me follow the chain of events.

  • I can see that the first event was a code change that was committed to GitHub. When I hover over the description, I get the full text. It seems that the Development Team has reduced the available memory for the mysql database.

The other events confirm this hypothesis.

  • I can then see the CI/CD process kick in and deploy the code change to the system; the deployment is detected by the Security tool and
  • Instana has detected the memory size change.
  • Then the Functional Selenium Tests start failing and
  • Turbonomic tries to scale up the mysql database.
  • Instana tells me that the mysql Pod is no longer running and that the replicas do not match the desired state.
  • CloudPak for AIOps has learned the normal, good patterns for logs coming from the applications. The Story contains a Log Anomaly that was detected in the ratings service, which cannot access the mysql database.
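
The sorting step above can be illustrated outside the UI. The following is a minimal Python sketch, assuming a hypothetical `Alert` record with a `last_occurrence` timestamp; the field names and the sample timestamps are invented for illustration and are not the product's schema or real incident data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    # Hypothetical record; field names are illustrative only.
    summary: str
    last_occurrence: datetime


def chain_of_events(alerts):
    """Return the alerts ordered from earliest to latest last occurrence."""
    return sorted(alerts, key=lambda a: a.last_occurrence)


# Invented sample data, deliberately out of order.
alerts = [
    Alert("Selenium functional tests failing", datetime(2022, 11, 3, 10, 4)),
    Alert("Commit in repository robot-shop", datetime(2022, 11, 3, 10, 0)),
    Alert("mysql Pod not running", datetime(2022, 11, 3, 10, 6)),
    Alert("Memory size change detected", datetime(2022, 11, 3, 10, 2)),
]

timeline = chain_of_events(alerts)
for alert in timeline:
    print(alert.last_occurrence.strftime("%H:%M"), alert.summary)
```

Reading the sorted list from the earliest entry onward gives the same chain of events that the narration walks through: the commit first, then the downstream symptoms.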

📥 STEP: Metric Anomaly

image

🚀 Action Click on an Alert line that has ANOMALY: in the Type column. Then open the Metric Anomaly Details accordion.

📣 Narration

  • CloudPak for AIOps can also collect metrics from multiple sources and detect Metric Anomalies. It was trained on hundreds or thousands of metrics from the environment and constructs a dynamic baseline (shown in green). The graph suddenly turns red, which marks the anomaly detected when the database consumes more memory than usual.
image
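
To make the idea of a dynamic baseline concrete, here is a minimal Python sketch. It uses a simple rolling mean with a standard-deviation band, which is a swapped-in illustrative technique and not the algorithm CloudPak for AIOps uses. The function name `find_anomalies`, the parameters `window` and `k`, and the memory readings are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, k=3.0):
    """Flag indices whose value leaves the band mean +/- k * stdev
    computed over the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)   # baseline: recent average
        spread = stdev(history)  # recent variability
        # Guard against a perfectly flat window (spread == 0).
        band = k * max(spread, 1e-9)
        if abs(values[i] - center) > band:
            anomalies.append(i)
    return anomalies


# Invented memory readings (MiB): steady, then a sudden jump.
memory_mib = [510, 512, 508, 511, 509, 510, 512, 790, 805, 798]
print(find_anomalies(memory_mib))
```

Because the baseline is recomputed from recent history, only the first point of the jump is flagged here; the band then widens as the new values enter the window. A production detector would typically handle such level shifts and seasonality more carefully.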

🚀 Action (1) In Related Alerts, select some additional alerts.

📣 Narration

You can display several alerts at the same time to better understand the temporal dependencies.

🚀 Action (2) Select a portion of the graph with your mouse to zoom in.

📣 Narration

Now let’s zoom in to see the anomalies more clearly.

image

🚀 Action Hover over a datapoint to show the before/after values.

📣 Narration

I can clearly see that the incident caused the latencies to skyrocket and the transactions per second to drop to almost zero. This is yet another confirmation of the source of the problem.

🚀 Action Close the Metric Anomaly Details view.

Page last updated: 03 November 2022