$\color{SeaGreen}{AI-Powered\ Alerting\ System:}$ $\color{OrangeRed}{Critical\ Alerts\ Only}$

This repository implements an AI-powered alerting system that uses a Hugging Face BERT model to classify and prioritize log alerts based on severity, specifically notifying only for critical alerts. The system integrates with Prometheus for metrics collection and Grafana for visualization and alerting, and is built with Python for log processing.

📢Introduction

This project demonstrates how to classify log events using Hugging Face's BERT model to filter critical log messages and trigger alerts only when critical issues arise. Prometheus is used to scrape the log metrics, and Grafana is used for visualization and alert notifications. This approach reduces noise by ensuring that only critical logs are flagged and alerted.

🚀Features:

AI-Based Log Classification: Uses machine learning to classify log messages based on severity.
Critical Alerts: Alerts are triggered only for critical logs, reducing noise and improving response time.
Prometheus & Grafana Integration: Real-time metrics collection and visualization.
Production-Ready Deployment: Uses Gunicorn to run the Flask app in a production environment.
Kubernetes Support: Kubernetes manifests for deploying the system in a scalable environment.
Lazy Loading: The system optimizes resource usage with lazy loading of machine learning models.

📜Prerequisites

Before starting, make sure you have the following tools installed:

Python 3.8+: The application is built using Python.
Prometheus: For metrics collection. Prometheus will scrape metrics from the Python app.
Grafana: For data visualization and alerting. Grafana is used to monitor log metrics from Prometheus.
Gunicorn: For running the Python app in a production environment. It replaces the Flask development server.
Docker (Optional but recommended): Simplifies the setup for Prometheus, Grafana, and the Python app, and is useful for running the services in containers.
Kubernetes (Optional): If you plan to deploy the app in a Kubernetes cluster, ensure you have a working Kubernetes environment.
Pip: For managing Python packages and installing dependencies.
Pipenv (Optional): For virtual environment and dependency management, if you prefer using Pipenv over pip.

🐍Python Dependencies

You'll need to install the following Python libraries:

transformers: For Hugging Face's BERT model, which is used to classify log events based on their content.
prometheus-client: For exposing log metrics to Prometheus.
torch: The PyTorch library is used to run the Hugging Face BERT model. It provides an efficient and flexible way to run and classify log events.
flask: The Flask web framework is used to create a simple web API for the AI-powered alerting system. The API allows you to send log messages for classification and trigger alerts if needed.
gunicorn: Gunicorn is a WSGI HTTP server for running Python web applications like Flask in a production environment. It allows handling multiple requests efficiently, providing better performance and scalability compared to Flask's built-in development server.
requests: For sending Slack notifications (optional).
smtplib: For sending email notifications (optional). It is part of the Python standard library, so it does not need to be installed separately.

🦄Why Flask and Gunicorn?

Flask: Flask is a lightweight web framework, perfect for building and exposing APIs, especially in a development or small-scale environment. It provides easy setup and flexibility for defining routes and handling requests.
Gunicorn: Flask's built-in development server is intended for local development and is not hardened or tuned for production workloads. Gunicorn, a robust WSGI server, is typically used in production environments. It runs the Flask app in several worker processes (optionally with threads), so concurrent requests are handled more effectively.

🗝️ CONCLUSION
In short, *Flask* handles the logic of the web application, while *Gunicorn* ensures that the application can serve requests at scale in a production environment.
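
To make this division of labor concrete, here is a minimal sketch of the WSGI contract that sits between the two. It uses only the standard library rather than Flask or Gunicorn, and the names `app` and `call_app` are illustrative assumptions, not code from this repository. A Flask application object is a WSGI callable of this shape, and a WSGI server such as Gunicorn is the component that invokes it for each request.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A bare WSGI application: the interface a Flask app exposes to its server."""
    body = f"path={environ['PATH_INFO']}".encode("utf-8")
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Play the server's role for one request: build an environ and call the app."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)
```

Calling `call_app("/health")` returns the status line and body that a real server would send back to the client.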

🏗️Project Structure

Here’s the structure of the project:

.
├── docker-compose.yml
├── k8s
│   ├── grafana-configmap.yaml
│   ├── grafana-deployment.yaml
│   ├── grafana-pvc.yaml
│   ├── grafana-service.yaml
│   ├── prometheus-configmap.yaml
│   ├── prometheus-deployment.yaml
│   ├── prometheus-pvc.yaml
│   ├── prometheus-service.yaml
│   ├── python-app-deployment.yaml
│   └── python-app-service.yaml
├── LICENSE
├── my_app
│   ├── app.py
│   ├── Dockerfile.app
│   ├── __init__.py
│   ├── start_app.py
│   └── static
│       └── favicon.ico
├── prometheus-grafana
│   ├── alert_rules.yml
│   ├── Dockerfile.grafana
│   ├── Dockerfile.prometheus
│   ├── grafana.ini
│   └── prometheus.yml
├── Prometheus_Grafana_Python_Hugging_Face.png
├── README.md
├── requirements.txt
├── sonar-project.properties
└── tests
    ├── conftest.py
    ├── __init__.py
    ├── test_app.py
    ├── test_parametrized.py
    └── test_start_app.py

6 directories, 31 files

- Python: Core application code.

- Docker Compose: Multi-container setup in `docker-compose.yml`.

- Kubernetes: Deployment manifests in `k8s/`.

- GitHub Actions: CI/CD workflows in `.github/workflows/`.

🧑‍🔧Installation

Step 1: Clone the repository

git clone https://github.com/meleksabit/ai-powered-alerting-system.git
cd ai-powered-alerting-system

Step 2: Install Python dependencies (only needed for Manual Installation, without Docker or Docker Compose)

Install the required Python libraries using pip:

pip install -r requirements.txt

Step 3: Install and set up Prometheus and Grafana

You can run the services with Docker Compose or install them manually.

Option 1: Using Docker Compose (Recommended)

This will set up the Python app, Prometheus, and Grafana services in containers.

Run the following command to start the services:

docker-compose up --build

docker-compose up: Starts the services based on the docker-compose.yml file.
--build: Builds the images before starting the containers, so changes to the Dockerfiles or dependencies are picked up. You can skip --build on subsequent runs if nothing has changed.

The services will be available at:

Prometheus: Accessible at http://localhost:9090
Grafana: Accessible at http://localhost:3000
Python app:
- Flask app running on http://localhost:5000
- Prometheus metrics exposed at http://localhost:8000/metrics

Option 2: Manual Installation

You can also install Prometheus and Grafana manually on your local machine by following their official installation guides.

🐋Docker-Related Files

Dockerfile for the Python App

# Use a slim version of Python to reduce image size
FROM python:3.11-slim-buster

# App version
LABEL version="2.0.3"

# Install necessary system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory in the container
WORKDIR /app

# Create a non-root user and group
RUN groupadd -g 1000 appgroup && \
    useradd -u 1000 -g appgroup -m appuser

# Copy requirements.txt from root
COPY requirements.txt ./

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt --upgrade pip

# Preload Hugging Face models to avoid downloading on startup
RUN python -c "from transformers import AutoModelForSequenceClassification, AutoTokenizer; \
    AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'); \
    AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')"

# Copy application code from the root directory
COPY my_app/ ./my_app/

# Change ownership of the /app directory to the non-root user
RUN chown -R appuser:appgroup /app

# Switch to the non-root user
USER appuser

# Expose necessary ports for Flask (5000) and Prometheus metrics (8000)
EXPOSE 5000
EXPOSE 8000

# Run the application (starting both Prometheus and Gunicorn from Python)
CMD ["python", "my_app/start_app.py"]

Dockerfile for Prometheus

# Use the official Prometheus image as the base
FROM prom/prometheus:main

# Copy custom Prometheus configuration and alert rules into the container
COPY prometheus.yml /etc/prometheus/prometheus.yml
COPY alert_rules.yml /etc/prometheus/alert_rules.yml

# Expose Prometheus on the default port
EXPOSE 9090

# Command to run Prometheus
CMD ["--config.file=/etc/prometheus/prometheus.yml", "--storage.tsdb.path=/etc/prometheus/data"]

Dockerfile for Grafana

# Use the official Grafana image as a base
FROM grafana/grafana:main-ubuntu

# Copy only the Grafana configuration file
COPY ./grafana.ini /etc/grafana/grafana.ini

# Expose Grafana web port
EXPOSE 3000

# Use the default entry point for the Grafana image
CMD ["/run.sh"]

Docker Compose File

Here’s the docker-compose.yml that sets up Prometheus, Grafana, and the Python app:

services:
  # Prometheus service
  prometheus:
    build:
      context: ./prometheus-grafana
      dockerfile: Dockerfile.prometheus
    image: ${DOCKER_USERNAME}/prometheus:${TAG}
    ports:
      - "9090:9090"
    user: "65534"
    restart: unless-stopped
    networks:
      - monitor-net

  # Grafana service
  grafana:
    build:
      context: ./prometheus-grafana
      dockerfile: Dockerfile.grafana
      args:
        TAG: ${TAG}
    image: ${DOCKER_USERNAME}/grafana:${TAG}
    ports:
      - "3000:3000"
    volumes:
      - ./prometheus-grafana/grafana.ini:/etc/grafana/grafana.ini
      - grafana_data:/var/lib/grafana
    secrets:
      - grafana_admin_user
      - grafana_admin_password
    environment:
      # Grafana reads a setting from a file when the variable name ends in __FILE (double underscore)
      - GF_SECURITY_ADMIN_USER__FILE=/run/secrets/grafana_admin_user
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
    restart: unless-stopped
    networks:
      - monitor-net

  # Python Flask app service
  python-app:
    build:
      context: .
      dockerfile: ./my_app/Dockerfile.app
      args:
          TAG: ${TAG}
    image: ${DOCKER_USERNAME}/ai-powered-alerting-system:${TAG}
    ports:
      - "5000:5000"  # Expose Flask app on port 5000
      - "8000:8000"  # Expose Prometheus metrics on port 8000
    volumes:
      - ./my_app:/app  # Mount app source code
    restart: unless-stopped
    depends_on:
      - prometheus
      - grafana
    networks:
      - monitor-net

# Define secrets for Grafana
secrets:
  grafana_admin_user:
    file: ${HOME}/secrets/grafana_admin_user.txt
  grafana_admin_password:
    file: ${HOME}/secrets/grafana_admin_password.txt

# Define a shared network
networks:
  monitor-net:
    driver: bridge

# Define named volumes for Prometheus and Grafana data
volumes:
  prometheus_data:
  grafana_data:
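
The `secrets:` section above reads two files from `${HOME}/secrets/`, so they have to exist before the stack starts. The following is a minimal sketch of creating them; the helper name `write_grafana_secrets` is hypothetical and not part of this repository, and the credentials you pass in are your own choice.

```python
from pathlib import Path


def write_grafana_secrets(secrets_dir, user, password):
    """Create the two secret files referenced under `secrets:` in docker-compose.yml."""
    directory = Path(secrets_dir)
    directory.mkdir(parents=True, exist_ok=True)
    files = {
        "grafana_admin_user.txt": user,
        "grafana_admin_password.txt": password,
    }
    for name, value in files.items():
        # Written without a trailing newline so the value is read back exactly
        (directory / name).write_text(value, encoding="utf-8")
    return sorted(p.name for p in directory.iterdir())
```

For the Compose file shown here you would call it with `Path.home() / "secrets"` as the directory. Docker Compose also expects `DOCKER_USERNAME` and `TAG` to be set in the environment, since the image names reference them.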

🛠️Configuration

🔥Step 4: Prometheus Configuration

Edit the prometheus-grafana/prometheus.yml file to add a scrape config for your Python app, which exposes metrics on port 8000 (reachable as python-app:8000 on the Docker Compose network):

# Global settings
global:
  scrape_interval: 15s  # Scrape every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

# Alertmanager configuration (if using Alertmanager)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # Define Alertmanager target

# Reference to rule files
rule_files:
  - "/etc/prometheus/alert_rules.yml"  # Points to your alert rules file

# Scrape configurations
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape metrics from the Python AI-powered alerting app (now on port 8000)
  - job_name: "ai-powered-alerting-app"
    static_configs:
      - targets: ["python-app:8000"]  # Python app exposing metrics on port 8000

Scrapes metrics from your Python app (job ai-powered-alerting-app) at python-app:8000.
Loads the alert_rules.yml file so that Prometheus can evaluate the alert rules.

📛Step 5: Alert Rules Configuration

Create the alert_rules.yml file in prometheus-grafana/. The Prometheus Dockerfile copies it into the container's configuration directory (/etc/prometheus/).

alert_rules.yml:

groups:
  - name: critical_alert_rules
    rules:
      - alert: CriticalLogAlert
        expr: log_severity{severity="critical"} > 0
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "Critical log detected"
          description: "A critical log event was detected in the AI-powered alerting system."

🤷‍♂️❔How This Works:

prometheus.yml: This file tells Prometheus to scrape metrics from both Prometheus itself and the AI-powered alerting app (your Python app).
alert_rules.yml: This file defines alerting rules that notify you when a critical log event is detected (based on the log_severity metric exposed by the Python app).
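
The `for: 1m` clause in the rule is easy to misread, so here is a small simulation of how it behaves. It is a sketch under stated assumptions, not Prometheus code: the function name `firing_states` is hypothetical, and the 15-second interval is taken from `evaluation_interval` in prometheus.yml. The rule becomes pending when the expression first holds and fires only once it has held continuously for the full duration.

```python
def firing_states(samples, interval_s=15, for_s=60, threshold=0):
    """Return, for each rule evaluation, whether the alert would be firing.

    `samples` holds the metric value seen at each evaluation. The alert turns
    pending once value > threshold and fires only after the condition has
    held without interruption for `for_s` seconds.
    """
    states = []
    pending_since = None  # evaluation time at which the condition first held
    for i, value in enumerate(samples):
        now = i * interval_s
        if value > threshold:
            if pending_since is None:
                pending_since = now
            states.append(now - pending_since >= for_s)
        else:
            pending_since = None  # condition broke, so the pending timer resets
            states.append(False)
    return states


# One quiet evaluation, then the condition holds from t=15s onward
print(firing_states([0, 1, 1, 1, 1, 1]))
# [False, False, False, False, False, True]
```

If `log_severity` only ever increases, the expression stays true after the first critical log and the alert keeps firing. Whether that applies depends on how the metric is defined in app.py.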

🤗Step 6: Hugging Face BERT Model Setup

In the my_app/app.py file, the DistilBERT model from Hugging Face is loaded lazily and used to classify log messages.

import logging

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Populated on first use by lazy_load_model()
model = None
tokenizer = None
classifier = None
app_ready = False

# Note: log_severity is the Prometheus metric (labelled by `severity`) defined elsewhere in app.py

def lazy_load_model():
    """Lazy load the model and tokenizer for text classification."""
    global model, tokenizer, classifier, app_ready
    if model is None or tokenizer is None or classifier is None:
        logging.info("Loading model and tokenizer lazily...")
        tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
        model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
        classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
        logging.info("Model and tokenizer loaded.")
    app_ready = True

def classify_log_event(log_message):
    """Classify log messages using Hugging Face DistilBERT model."""
    lazy_load_model()  # Ensure the model and tokenizer are loaded
    result = classifier(log_message)  # Use the classifier pipeline

    # Determine severity based on sentiment analysis result
    severity = 'not_critical' if result[0]['label'] == 'POSITIVE' else 'critical'
    log_severity.labels(severity=severity).inc()  # Update Prometheus metric
    logging.info(f"Classified log '{log_message}' as {severity}")
    return severity
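
The label-to-severity rule can be exercised without downloading the model. The sketch below isolates that rule in a function and feeds it results shaped like the output of a text-classification pipeline (a list of dicts with `label` and `score`). Both `severity_from_result` and the keyword-based `fake_classifier` are hypothetical names introduced for illustration; the real app calls the Hugging Face pipeline instead.

```python
def severity_from_result(result):
    """Map a text-classification result to a severity label.

    Mirrors the rule used in classify_log_event: POSITIVE sentiment becomes
    'not_critical', anything else becomes 'critical'.
    """
    label = result[0]["label"]
    return "not_critical" if label == "POSITIVE" else "critical"


def fake_classifier(message):
    """Hypothetical stand-in for the pipeline, driven by a few keywords."""
    bad_words = ("injection", "vulnerability", "failed", "error")
    negative = any(word in message.lower() for word in bad_words)
    return [{"label": "NEGATIVE" if negative else "POSITIVE", "score": 1.0}]


print(severity_from_result(fake_classifier("User logged in successfully")))
# not_critical
print(severity_from_result(fake_classifier("SQL injection attempt detected in API")))
# critical
```

Keeping the mapping separate from the model call makes it straightforward to unit test and to swap in a different classifier later.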

⚡Usage

Step 7: Run the Python Application

Now you can run the AI-powered alerting system:

docker-compose up --build

📝Testing and Alerts

🔥Step 8: Expose Metrics to Prometheus

The Python app will expose Prometheus metrics at http://localhost:8000/metrics. Prometheus will scrape these metrics to monitor the log severity levels (e.g., critical, not_critical).

Metrics URL: http://localhost:8000/metrics

Prometheus will automatically scrape this endpoint based on the scrape configuration.
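
To see what Prometheus reads from this endpoint, it can help to parse the text yourself. The following is a simplified sketch of a parser for the Prometheus text exposition format; it ignores optional timestamps and escaped quotes, and the sample text in `EXAMPLE_METRICS` is illustrative rather than captured from the running app. The exact metric name the app exposes depends on how `log_severity` is declared.

```python
import re

# name{label="value",...} number
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')

# Illustrative payload, not real output from the app
EXAMPLE_METRICS = """\
# HELP log_severity Illustrative help text, not copied from the app
log_severity{severity="critical"} 2.0
log_severity{severity="not_critical"} 5.0
"""


def parse_metrics(text):
    """Return (name, labels, value) tuples for each sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(EXAMPLE_METRICS):
    print(name, labels, value)
```

Against a running stack, the same function can be applied to the body returned by http://localhost:8000/metrics.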

🗂️Step 9: Test Log Classification

You can test the log classification functionality by generating various log messages through the app's HTTP API.

Use the /log/<message> endpoint to send log messages to be classified by the Hugging Face BERT model.
The model will classify each log as either critical or not critical, based on the message's sentiment (this uses a sentiment analysis model as a placeholder).

Example log classifications:

Test Log 1: Classifying a user log-in message as "not critical":

curl http://localhost:5000/log/User%20logged%20in%20successfully

Test Log 2: Classifying an SQL injection attempt as "critical":

curl http://localhost:5000/log/SQL%20injection%20attempt%20detected%20in%20API

Test Log 3: Classifying a critical vulnerability detection as "critical":

curl http://localhost:5000/log/Critical%20vulnerability%20found%20in%20package%20xyz

Each of these log messages will be classified by the AI-powered system, and the classification will be reflected in the Prometheus metrics.

The Python app automatically updates the Prometheus metric log_severity with the corresponding severity label (critical or not_critical), which Prometheus will scrape.
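
Because the log message travels in the URL path, spaces and other special characters must be percent-encoded, as in the curl commands above. A minimal sketch of building such URLs with the standard library follows; `log_url` is a hypothetical helper name, and the base URL assumes the Docker Compose port mapping.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # Flask app address from the Docker Compose setup


def log_url(message, base_url=BASE_URL):
    """Build the /log/<message> URL with the message percent-encoded."""
    return f"{base_url}/log/{quote(message, safe='')}"


print(log_url("User logged in successfully"))
# http://localhost:5000/log/User%20logged%20in%20successfully
```

The resulting URL can be passed to curl or to `urllib.request.urlopen`. Messages that contain a slash may still not match the route, depending on how the server decodes the path.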

🔥🔅Prometheus and Grafana Setup

Step 10: Set Up Grafana for Alerts

You can now set up Grafana to visualize and alert based on the log_severity metrics.

Open Grafana: Access Grafana by navigating to http://localhost:3000 in your browser.
Add Data Source: Add Prometheus as the data source in Grafana:

Name: Prometheus
Type: Prometheus
URL: http://prometheus:9090 (the Prometheus container name, which works when Grafana and Prometheus run on the same Docker network)

Create a Dashboard:

Build a dashboard in Grafana to visualize the log severity metrics being scraped from Prometheus.
For example, create a time series graph to display the metric log_severity with labels for critical and not_critical logs.

Set Up Alerts:

Create an alert rule in Grafana to send notifications when the log_severity metric for critical logs exceeds 0.

Example Grafana alert rule:

# Condition: Trigger an alert if any critical logs are detected
expr: log_severity{severity="critical"} > 0
for: 1m
labels:
  severity: "critical"
annotations:
  summary: "Critical log detected"
  description: "A critical log was detected in the application"

💡Demo

After setting up Prometheus and Grafana with the Python AI-powered alerting system, you’ll be able to:

Monitor Logs:

View the log severity metrics in Grafana to monitor the number of critical and non-critical logs processed by the system.

Trigger Alerts:

Grafana will trigger alerts based on the log_severity metric.
Only logs classified as critical by the BERT model will trigger alerts, reducing noise and focusing on important events.

➕📶🔝🆙Additional Improvements:

☸️Kubernetes Deployment:

You can also deploy the system using Kubernetes. This section includes the Kubernetes manifests for deploying Prometheus, Grafana, and the Python app.

Deployment Steps:

Apply the Kubernetes manifests:

kubectl apply -f k8s/

Scale the Python app: If you want to scale the Python app deployment, run:

kubectl scale deployment python-app --replicas=3

Kubernetes Deployment Files

Below are the Kubernetes manifest files located in the k8s/ directory:

Deployment for Python App (python-app-deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: python-app
  template:
    metadata:
      labels:
        app: python-app
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        runAsNonRoot: true
        fsGroup: 2000
      containers:
        - name: python-app
          image: angel3/ai-powered-alerting-system:${IMAGE_TAG}
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          env:
            - name: SENDER_EMAIL
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: sender-email
            - name: NOTIFICATION_RECEIVER
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: notification-receiver
            - name: SLACK_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: SLACK_BOT_TOKEN
            - name: SLACK_SIGNING_SECRET
              valueFrom:
                secretKeyRef:
                  name: email-secrets
                  key: SLACK_SIGNING_SECRET
          ports:
            - containerPort: 5000
          startupProbe:
            httpGet:
              path: /startup
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /readiness
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
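
The startup probe values above determine how long the container may take to become responsive before Kubernetes restarts it. As a rough worked estimate (a rule of thumb that ignores probe timeouts; the function name is hypothetical), the budget is the initial delay plus the failure threshold multiplied by the probe period:

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate time a container has to pass its startup probe.

    Rule of thumb: initialDelaySeconds + failureThreshold * periodSeconds.
    """
    return initial_delay + failure_threshold * period


# Values from the startupProbe in python-app-deployment.yaml
budget = startup_budget_seconds(initial_delay=30, period=10, failure_threshold=5)
print(budget)  # 80
```

If loading the model takes longer than this on your cluster, raising `failureThreshold` or `periodSeconds` gives the app more time to start.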

Service for Python App (python-app-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: python-app-service
spec:
  type: NodePort
  selector:
    app: python-app
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000

Deployment for Prometheus (prometheus-deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        runAsNonRoot: true
        fsGroup: 2000
      containers:
        - name: prometheus
          image: prom/prometheus:main  # Consider pinning a stable release tag instead of `main`
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config

Service for Prometheus (prometheus-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
  type: NodePort

ConfigMap for Prometheus (prometheus-configmap.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: default
data:
  prometheus.yml: |
    # Global settings
    global:
      scrape_interval: 15s  # Scrape every 15 seconds
      evaluation_interval: 15s  # Evaluate rules every 15 seconds

    # Alertmanager configuration (if using Alertmanager)
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']  # Define Alertmanager target if in use

    # Reference to rule files
    rule_files:
      - "/etc/prometheus/alert_rules.yml"  # Points to the alert rules file

    # Scrape configurations
    scrape_configs:
      # Scrape Prometheus itself
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      # Scrape metrics from the Python AI-powered alerting app via localhost (requires port-forwarding)
      - job_name: "ai-powered-alerting-app"
        static_configs:
          - targets: ["localhost:8000"]  # Python app exposing metrics, accessible on localhost via port-forwarding

Persistent Volume Claim for Prometheus (prometheus-pvc.yaml):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi  # Adjust storage size as needed

Deployment for Grafana (grafana-deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        runAsNonRoot: true
        fsGroup: 2000
      containers:
        - name: grafana
          image: grafana/grafana:main-ubuntu
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
            - name: grafana-config
              mountPath: /etc/grafana/grafana.ini
              subPath: grafana.ini
      volumes:
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-config
          configMap:
            name: grafana-config

Service for Grafana (grafana-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: grafana-service
spec:
  selector:
    app: grafana
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  type: NodePort

ConfigMap for Grafana (grafana-configmap.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
data:
  grafana.ini: |
    [security]
    admin_user = ${GF_SECURITY_ADMIN_USER}
    admin_password = ${GF_SECURITY_ADMIN_PASSWORD}

Persistent Volume Claim for Grafana (grafana-pvc.yaml):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Note

These manifest files help you set up the Python app, Prometheus, and Grafana in a Kubernetes cluster.

Tip

Handling Gunicorn Worker Timeouts

If you encounter worker timeouts in Gunicorn (e.g., WORKER TIMEOUT errors in the logs), you can adjust the worker timeout directly in the start_app.py script. The current configuration in start_app.py sets a timeout of 30 seconds, which can be increased to prevent premature worker timeouts during long-running requests or slow startup. The configuration looks like this:

options = {
    'bind': '0.0.0.0:5000',
    'workers': 4,
    'timeout': 30,  # Default timeout set to 30 seconds
}

If workers need more time, for example during long-running requests or slow startup, increase the timeout value in this script.
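
One way to avoid editing the script for every environment is to read the timeout from an environment variable. This is a sketch of that idea, not what start_app.py currently does: `build_gunicorn_options` and the variable name `GUNICORN_TIMEOUT` are assumptions introduced here.

```python
import os


def build_gunicorn_options(environ=os.environ):
    """Return Gunicorn options, letting an env var override the 30-second default."""
    return {
        "bind": "0.0.0.0:5000",
        "workers": 4,
        # GUNICORN_TIMEOUT is a hypothetical variable name, not one the repo defines
        "timeout": int(environ.get("GUNICORN_TIMEOUT", "30")),
    }


print(build_gunicorn_options({"GUNICORN_TIMEOUT": "120"})["timeout"])  # 120
```

With no variable set, the function returns the same values as the options shown above.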

📌 Roadmap: Next Steps for Improvements

This section outlines potential improvements and enhancements for the AI-Powered Alerting System to make it more robust, scalable, and feature-rich:

🔔 Notification System

✅ Implement Email Notifications --> implemented via `yagmail` library

Integrate email notifications (e.g., using SMTP libraries like `smtplib` or third-party APIs like SendGrid) to send alerts for critical logs detected by the system.

Why? Provides real-time updates to stakeholders.

✅ Integrate Slack Notifications --> implemented via `slack_bolt` library

Use Slack webhooks to send log classifications and critical alerts directly to dedicated Slack channels.

Why? Improves communication within teams and ensures swift responses to critical events.

🧠 Enhanced AI/ML Capabilities

⬜ Experiment with Alternative Language Models (LLMs)

Test with other transformer-based models like `GPT`, `T5`, or fine-tuned versions of `BERT` specific to log analysis or sentiment classification (e.g., Hugging Face's `bert-for-log-analysis` models).

⬜ Implement Model Monitoring and Retraining Pipelines

Automate periodic retraining of the ML model using up-to-date logs to improve accuracy. Tools like MLflow or TensorFlow Serving can be helpful.

Why? Maintains the model's effectiveness as log patterns evolve over time.

📈 Scalability Enhancements

⬜ NGINX Integration

Add NGINX as a reverse proxy to improve load balancing and handle multiple simultaneous requests efficiently.

Why? Enhances performance and security, especially under heavy traffic.

⬜ Service Mesh with Istio

Use Istio to manage service-to-service communication, observability, and security within your Kubernetes cluster.

Why? Simplifies networking, provides traffic encryption, and facilitates microservice observability.

⬜ Adopt Horizontal Pod Autoscaling

Enable Kubernetes Horizontal Pod Autoscaling (HPA) for the Python app to dynamically scale based on CPU or memory utilization.

Why? Ensures that the system can handle varying workloads efficiently.

🚀 Deployment & CI/CD

⬜ ArgoCD for GitOps Deployment

Implement ArgoCD to manage Kubernetes deployments via GitOps principles.

Why? Automates and synchronizes deployment workflows, reducing manual intervention and ensuring consistency.

✅ Add Unit Testing to CI/CD Pipelines

Include unit tests in the GitHub Actions pipeline for verifying individual components in isolation.

Why? Ensures the correctness of each function or module, catching bugs early in development.

⬜ Add Integration Testing to CI/CD Pipelines

Include integration tests for end-to-end system verification in the GitHub Actions pipeline.

Why? Ensures that new code changes don’t break interdependent components.

🔒 Security Improvements

⬜ Enforce HTTPS with Cert-Manager

Use Cert-Manager in Kubernetes to automatically issue and renew TLS certificates for secure communication.

Why? Protects sensitive data and avoids exposing the application over HTTP.

⬜ Implement Role-Based Access Control (RBAC)

Define and enforce fine-grained access permissions within the Kubernetes cluster.

Why? Enhances security by limiting access to resources based on user roles.

🛠 Additional Improvements

⬜ Centralized Logging with ELK Stack

Integrate Elasticsearch, Logstash, and Kibana to provide powerful log aggregation and analysis capabilities.

Why? Enables deeper insights into logs and simplifies debugging.

⬜ Performance Benchmarking

Conduct stress testing and performance benchmarking (e.g., with k6, Apache JMeter) to identify bottlenecks.

Why? Helps optimize the system for high availability.

⬜ Support Multiple Alert Channels

Extend the alerting framework to integrate with additional tools like PagerDuty, Microsoft Teams, or Opsgenie.

Why? Provides flexibility for different organizations.

⬆️
