DECICE

Smarter Cloud: AI Detects Anomalies in Kubernetes Before It’s Too Late

Cloud computing has transformed modern infrastructure by enabling scalable, flexible, and cost-effective resource management. At the core of this transformation lies Kubernetes, the de facto standard for orchestrating containerized applications in the cloud. Its ability to automate deployment, scaling, and operations has made it essential for modern DevOps and MLOps workflows.

However, ensuring reliability and stability in these dynamic, large-scale environments remains a significant challenge. Complex behaviors and unexpected changes, such as resource usage spikes, service degradation, or failures, can severely impact performance and availability. These incidents, commonly referred to as anomalies, are difficult to predict due to the scale, heterogeneity, and temporal variability of workloads.

Detecting such anomalies is both critical and complex. Traditional monitoring tools often miss subtle or evolving patterns, which is why intelligent, automated solutions are increasingly necessary.

To address this challenge, we propose a fully automated anomaly detection pipeline tailored for Kubernetes clusters. Our system combines state-of-the-art AI techniques with a robust Machine Learning Operations (MLOps) framework, enabling scalable, continuous, and adaptive health monitoring.

The pipeline integrates key cloud-native tools to establish a comprehensive MLOps ecosystem:

  • Kubeflow for managing end-to-end ML workflows,
  • MLflow for experiment tracking and model lifecycle management,
  • MinIO for scalable object storage.

To detect anomalies, we employ a combination of techniques capable of identifying both statistical outliers and complex temporal patterns in multivariate time-series data. Deep learning models—such as Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Autoencoders—are used in an unsupervised setting, detecting anomalies based on reconstruction error or prediction deviation thresholds.
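The prediction-deviation idea can be illustrated with a minimal sketch. It is not the pipeline's actual model: a rolling median stands in for a trained forecaster such as an LSTM or TCN, the metric values are synthetic, and the names `rolling_median_forecast`, `fit_threshold`, `detect`, and the `k = 3.0` multiplier are hypothetical choices made for illustration only.

```python
from statistics import mean, median, stdev


def rolling_median_forecast(series, window):
    """Stand-in predictor (assumption): forecast each point as the
    median of the previous `window` points."""
    return [median(series[i - window:i]) for i in range(window, len(series))]


def fit_threshold(errors, k=3.0):
    """Derive a deviation threshold from errors seen on normal data:
    mean error plus k sample standard deviations (k is an assumption)."""
    return mean(errors) + k * stdev(errors)


def detect(series, window, threshold):
    """Return indices whose absolute prediction error exceeds the threshold."""
    predictions = rolling_median_forecast(series, window)
    flagged = []
    for offset, predicted in enumerate(predictions):
        index = offset + window  # predictions start at position `window`
        if abs(series[index] - predicted) > threshold:
            flagged.append(index)
    return flagged


# Synthetic "normal" metric: a repeating 50..54 pattern (e.g. CPU percent).
normal = [50.0 + (i % 5) for i in range(60)]
window = 5

# Calibrate the threshold on prediction errors from normal behavior only.
train_predictions = rolling_median_forecast(normal, window)
train_errors = [abs(a - p) for a, p in zip(normal[window:], train_predictions)]
threshold = fit_threshold(train_errors)

# Live data with one injected spike at index 20.
live = normal[:20] + [95.0] + normal[21:40]
anomalies = detect(live, window, threshold)
print(anomalies)  # [20]
```

The same thresholding step applies to reconstruction error: replace the forecast with an autoencoder's reconstruction of the input and compare the two in the same way.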

These are complemented by classical machine learning algorithms like Isolation Forest and One-Class SVM, as well as statistical methods including the Z-Score and ARIMA, chosen for their simplicity and interpretability.
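As an illustration of the simplest of these methods, the following sketch applies a Z-Score test to a single metric. The article does not say how the Z-Score is applied in the pipeline, so the global mean and standard deviation, the cutoff of 3.0, the synthetic CPU values, and the function name `zscore_anomalies` are all assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, threshold=3.0):
    """Return indices of points lying more than `threshold` sample
    standard deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers by this measure
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Synthetic CPU utilization samples (fractions of one core), one spike.
cpu = [0.40 + 0.01 * (i % 5) for i in range(20)]
cpu[12] = 0.97

outliers = zscore_anomalies(cpu)
print(outliers)  # [12]
```

A single large outlier inflates the standard deviation it is measured against, which is one reason a method like this is typically paired with the model-based detectors rather than used alone.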

Crucially, the pipeline supports continuous retraining and redeployment, ensuring it remains effective as system behavior evolves. By integrating AI-driven analytics with cloud-native observability and automation, our solution contributes to the development of AIOps (Artificial Intelligence for IT Operations), paving the way for self-healing and resilient cloud-native infrastructures.

Author(s):

Fatemeh Bozorgi, Mohsen Seyedkazemi Ardebili, Alma Mater Studiorum – Università di Bologna

Links:

Alma Mater Studiorum – Università di Bologna: https://www.unibo.it/en/

See also the news article about “Digital Twin”: https://www.decice.eu/project-news/digital-twin/

Keywords:

DECICE, Digital Twin, Kubernetes, Anomaly Detection, MLOps, AIOps
