Athena Platform - End-to-End Monitoring Solution

Digital-On Tech Healthcare Technology Monitoring

Executive Summary

The Athena Platform is a comprehensive healthcare technology monitoring and analytics solution built on Grafana Enterprise. It provides real-time visibility into system health, automated compliance reporting, secure user management, and professional Digital-On branding throughout the user experience.

Platform URL: https://athena.digitalon.co.za
Organization: Digital-On Tech
Purpose: Healthcare Technology Infrastructure Monitoring


Architecture Overview

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         External Access                          │
│                    (Cloudflare Tunnel - TLS)                    │
│                  https://athena.digitalon.co.za                 │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────┴────────────────────────────────────┐
│                      Athena Platform Host                        │
│                                                                  │
│  ┌────────────────────────────────────────────────────────┐   │
│  │         Grafana Enterprise 12.3.0                      │   │
│  │  - Monitoring Dashboards                               │   │
│  │  - User Management                                      │   │
│  │  - Alert Management                                     │   │
│  │  - Digital-On Branding                                  │   │
│  └──────────┬──────────────────────┬──────────────────────┘   │
│             │                       │                           │
│  ┌──────────┴──────────┐  ┌────────┴─────────┐                │
│  │  PostgreSQL 16      │  │   Redis          │                │
│  │  - Grafana DB       │  │   - Data Source  │                │
│  │  - User Data        │  │   - Caching      │                │
│  │  - Dashboards       │  │   - Time Series  │                │
│  └─────────────────────┘  └──────────────────┘                │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │            Automation & Monitoring Services               │  │
│  │                                                            │  │
│  │  • grafana-compliance.timer (hourly)                      │  │
│  │  • grafana-auto-disable.timer (every minute)              │  │
│  │  • grafana-active-users.timer (every 15 minutes)          │  │
│  │  • msmtp (Mailgun SMTP)                                   │  │
│  │  • Telegram notifications                                  │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. Grafana Enterprise

Version: 12.3.0
Purpose: Primary monitoring and visualization platform

Capabilities:
  • Real-time metrics visualization
  • Custom dashboard creation
  • Alert management and routing
  • User access control
  • Query performance analysis
  • Plugin ecosystem (Redis datasource, image renderer)

Customizations:
  • Complete Digital-On branding (logo, colors, login page)
  • Custom email templates (welcome, password reset)
  • Self-registration with approval workflow
  • Enhanced monitoring and compliance features

2. PostgreSQL Database

Version: 16
Purpose: Grafana backend storage

Contains:
  • User accounts and authentication
  • Dashboard definitions
  • Alert rules and notification channels
  • Plugin configurations
  • Session data
  • Audit logs

Benefits:
  • Better performance than SQLite
  • Transaction safety
  • Easier backup and replication
  • Query optimization capabilities

3. Redis

Purpose: Data source and caching layer

Use Cases:
  • Time-series data storage
  • Application caching
  • Real-time metrics
  • Session storage (optional)

4. Cloudflare Tunnel

Purpose: Secure external access without port forwarding

Features:
  • TLS encryption end-to-end
  • DDoS protection
  • No inbound firewall rules needed (the tunnel connects outbound only)
  • Automatic certificate management
  • Traffic analytics

5. Email System (msmtp + Mailgun)

Purpose: Transactional email delivery

Configuration:
  • SMTP Host: smtp.eu.mailgun.org:587
  • From Address: [email protected]
  • Sender Name: Athena Grafana

Email Types:
  • Welcome emails (branded)
  • Password reset (branded)
  • Compliance alerts
  • User approval notifications
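
As a rough illustration of how a branded HTML email could be handed to msmtp, the sketch below builds a multipart message and pipes it to the msmtp command. It is written in Python purely for illustration; the platform's own automation is not shown in this document. The function names and the example.com addresses are hypothetical placeholders.

```python
import subprocess
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, html_body: str) -> EmailMessage:
    """Build a message with a plain-text fallback and an HTML alternative."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text part for clients that cannot render HTML.
    msg.set_content("This message is best viewed in an HTML-capable mail client.")
    # HTML part carrying the branded template (template content is an assumption).
    msg.add_alternative(html_body, subtype="html")
    return msg


def send_with_msmtp(msg: EmailMessage) -> None:
    """Pipe the message to msmtp; -t makes msmtp read recipients from the headers."""
    subprocess.run(["msmtp", "-t"], input=msg.as_bytes(), check=True)


if __name__ == "__main__":
    message = build_message(
        "Athena Grafana <noreply@example.com>",  # hypothetical sender address
        "admin@example.com",                      # hypothetical recipient
        "New user pending approval",
        "<p>A new account is waiting for review.</p>",
    )
    print(message["Subject"])
```

Keeping message construction separate from delivery makes the template logic testable without an SMTP connection.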

6. Telegram Integration

Purpose: Real-time operational notifications

Notifications:
  • Active user monitoring (15-minute pulses)
  • System alerts
  • Operational status updates
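
A notification of this kind can be posted through the Telegram Bot API's sendMessage method. The following Python sketch only shows the shape of such a request; the token, chat ID and function names are hypothetical, and the platform's actual script is not reproduced here.

```python
import json
import urllib.request


def build_send_message_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare a POST to the Bot API sendMessage method with a JSON body."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Send the request and report whether Telegram answered with ok=true."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))


if __name__ == "__main__":
    # Placeholder credentials; real values would come from the secrets file.
    req = build_send_message_request("123456:EXAMPLE", "-1000000000", "Athena pulse: 3 active users")
    print(req.get_method(), req.full_url)
```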


Automated Workflows

User Registration & Approval Flow

User Registers                Welcome Email Sent
     ↓                               ↓
┌────────────────┐           ┌──────────────┐
│ Signup Form    │──────────→│ Email System │
│ /signup        │           └──────────────┘
└────────┬───────┘                   │
         │                           ↓
         │                    ┌──────────────────┐
         │                    │ User receives    │
         │                    │ branded welcome  │
         │                    └──────────────────┘
         ↓
┌─────────────────────────────────────────┐
│ Account Created (Viewer role)           │
└────────┬────────────────────────────────┘
         │
         ↓ (within 1 minute)
┌─────────────────────────────────────────┐
│ Auto-Disable Script Runs                │
│ • Detects new user                      │
│ • Disables account                      │
│ • Sends notification to admin           │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Admin Reviews User                      │
│ • Checks email notification             │
│ • Reviews user details                  │
│ • Decision: Approve or Reject           │
└────────┬────────────────────────────────┘
         │
         ↓ (if approved)
┌─────────────────────────────────────────┐
│ Admin Enables User                      │
│ • Via Web UI or API                     │
│ • User can now log in                   │
└─────────────────────────────────────────┘
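
The detection step in the flow above could be sketched as follows. This is a minimal Python illustration under stated assumptions: the user records resemble the list returned by Grafana's /api/users endpoint (fields id, login, isAdmin, isDisabled), and the script keeps a set of already-reviewed user IDs, which is a hypothetical mechanism not described in this document. The disable and enable paths are the Grafana admin API routes, called with POST.

```python
from typing import Iterable


def users_to_disable(users: Iterable[dict], reviewed_ids: set[int]) -> list[dict]:
    """Return enabled, non-admin accounts that have not been reviewed yet."""
    return [
        user
        for user in users
        if not user.get("isAdmin", False)
        and not user.get("isDisabled", False)
        and user["id"] not in reviewed_ids
    ]


def disable_path(user_id: int) -> str:
    """Admin API route that disables an account (POST)."""
    return f"/api/admin/users/{user_id}/disable"


def enable_path(user_id: int) -> str:
    """Admin API route an approver would call to enable an account (POST)."""
    return f"/api/admin/users/{user_id}/enable"


if __name__ == "__main__":
    sample = [
        {"id": 1, "login": "admin", "isAdmin": True, "isDisabled": False},
        {"id": 7, "login": "new.user", "isAdmin": False, "isDisabled": False},
    ]
    for user in users_to_disable(sample, reviewed_ids=set()):
        print("would POST", disable_path(user["id"]))
```

Tracking reviewed IDs matters because an approved user is enabled again; without some record of approval, the next run would disable the account a second time.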

Compliance Monitoring Flow

Every Hour (06:00-18:00 SAST)
         ↓
┌─────────────────────────────────────────┐
│ Compliance Script Runs                  │
│ • Parse Grafana logs (last 60 min)     │
│ • Extract user activity                 │
│ • Identify accessed dashboards          │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Check: Only Admin Users Active?         │
│ (userId 1 or 4)                         │
└────────┬────────────────────────────────┘
         │
    ┌────┴────┐
    │ YES     │ NO (regular users active)
    ↓         ↓
┌───────┐  ┌──────────────────┐
│ Send  │  │ Exit silently    │
│ Alert │  │ (compliance OK)  │
└───┬───┘  └──────────────────┘
    │
    ↓
┌─────────────────────────────────────────┐
│ HTML Email to First2Lead                │
│ • [email protected]           │
│ • Shows admin activity                  │
│ • Lists dashboards accessed             │
│ • Branded with Digital-On styling      │
└─────────────────────────────────────────┘
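
The decision in the flow above reduces to two checks: is the current time inside the 06:00-18:00 SAST window, and were admin users the only ones active. A minimal Python sketch, assuming that an hour with no activity at all sends no alert and that the window includes 18:00 (neither detail is stated in the flow):

```python
from datetime import datetime, time, timedelta, timezone
from typing import Iterable

# SAST is UTC+2 year-round (no daylight saving time).
SAST = timezone(timedelta(hours=2), "SAST")
ADMIN_USER_IDS = frozenset({1, 4})  # admin userIds named in the flow


def in_monitoring_window(now: datetime) -> bool:
    """True when the timezone-aware 'now' falls between 06:00 and 18:00 SAST."""
    local = now.astimezone(SAST).time()
    return time(6, 0) <= local <= time(18, 0)


def only_admins_active(active_user_ids: Iterable[int]) -> bool:
    """True when there was activity and every active user is an admin."""
    ids = set(active_user_ids)
    return bool(ids) and ids <= ADMIN_USER_IDS


def should_send_alert(now: datetime, active_user_ids: Iterable[int]) -> bool:
    """Alert only inside the window and only for admin-only activity."""
    return in_monitoring_window(now) and only_admins_active(active_user_ids)


if __name__ == "__main__":
    noon_sast = datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc)
    print(should_send_alert(noon_sast, [1, 4]))   # True
    print(should_send_alert(noon_sast, [1, 12]))  # False
```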

Active User Monitoring Flow

Every 15 Minutes
         ↓
┌─────────────────────────────────────────┐
│ Active Users Script Runs                │
│ • Query Grafana logs (last 15 min)     │
│ • Extract authenticated requests        │
│ • Count requests per user               │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Aggregate User Activity                 │
│ • Username, user ID, org ID             │
│ • Request count per user                │
│ • Sort by activity level                │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Send Telegram Pulse                     │
│ • Total users active                    │
│ • Total requests                        │
│ • Top 10 users by activity              │
│ • Time window details                   │
└─────────────────────────────────────────┘
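
The counting and aggregation steps above can be sketched in Python as shown below. The log line layout (key=value pairs carrying userId, orgId and uname) is an assumption based on the context fields listed in this document, and treating userId=0 as unauthenticated is likewise an assumption; real log lines should be checked before relying on this.

```python
import re
from collections import Counter
from typing import Iterable

# Matches key=value pairs where the value is either quoted or a run of non-spaces.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line: str) -> dict[str, str]:
    """Split one key=value log line into a dictionary."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def count_requests(lines: Iterable[str]) -> Counter:
    """Count requests per (uname, userId, orgId), skipping unauthenticated lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = parse_fields(line)
        if fields.get("userId", "0") != "0" and "uname" in fields:
            counts[(fields["uname"], fields["userId"], fields.get("orgId", ""))] += 1
    return counts


def format_pulse(counts: Counter, window_minutes: int = 15, top_n: int = 10) -> str:
    """Render the summary text: totals first, then the most active users."""
    lines = [
        f"Active users (last {window_minutes} min): {len(counts)}",
        f"Total requests: {sum(counts.values())}",
    ]
    for (uname, user_id, org_id), n in counts.most_common(top_n):
        lines.append(f"{uname} (id {user_id}, org {org_id}): {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = ['logger=context userId=4 orgId=1 uname=alice msg="Request Completed" status=200']
    print(format_pulse(count_requests(demo)))
```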

User Management System

User Roles & Permissions

Role      Capabilities
Admin     Full platform access, user management, configuration
Editor    Create and edit dashboards; cannot manage users
Viewer    Read-only access to dashboards (default for new users)

Authentication Flow

User Access Request
         ↓
┌─────────────────────────────────────────┐
│ Check: Account Enabled?                 │
└────────┬────────────────────────────────┘
         │
    ┌────┴────┐
    │ NO      │ YES
    ↓         ↓
┌─────────┐  ┌──────────────────┐
│ Access  │  │ Check credentials│
│ Denied  │  │ (username/pass)  │
└─────────┘  └────────┬─────────┘
                      │
                 ┌────┴────┐
                 │ Valid?  │
                 ↓         ↓
             ┌────┐    ┌───────────┐
             │ NO │    │ YES       │
             ↓    │    ↓           │
        ┌─────────┴────┐           │
        │ Login Failed │           │
        └──────────────┘           │
                              ┌────┴──────┐
                              │ Create    │
                              │ Session   │
                              └────┬──────┘
                                   ↓
                              ┌──────────────┐
                              │ Grant Access │
                              │ by Role      │
                              └──────────────┘

Password Reset Flow

User Requests Reset
         ↓
┌─────────────────────────────────────────┐
│ "Forgot Password?" Link                 │
│ User enters email                       │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Generate Reset Token                    │
│ • Unique code                           │
│ • 4-hour expiration                     │
│ • Stored in database                    │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Send Branded Reset Email                │
│ • Digital-On styling                    │
│ • Reset link with code                  │
│ • Security warnings                     │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ User Clicks Link                        │
│ • Validates token                       │
│ • Checks expiration                     │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ User Sets New Password                  │
│ • Password strength validation          │
│ • Token invalidated                     │
│ • Confirmation shown                    │
└─────────────────────────────────────────┘
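
The token lifecycle in the flow above (issue, expire after 4 hours, invalidate on use) can be illustrated with a short Python sketch. Grafana handles password resets internally; this is not its implementation, only a model of the rules shown in the diagram, and all names are hypothetical.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RESET_TTL = timedelta(hours=4)  # expiration stated in the flow


@dataclass
class ResetToken:
    code: str
    expires_at: datetime
    used: bool = False


def issue_token(now: datetime) -> ResetToken:
    """Create a unique, unguessable code that expires after RESET_TTL."""
    return ResetToken(code=secrets.token_urlsafe(32), expires_at=now + RESET_TTL)


def redeem(token: ResetToken, presented_code: str, now: datetime) -> bool:
    """Accept the code once, before expiry; a successful use invalidates the token."""
    if token.used or now >= token.expires_at:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    if not secrets.compare_digest(token.code.encode(), presented_code.encode()):
        return False
    token.used = True
    return True


if __name__ == "__main__":
    issued_at = datetime.now(timezone.utc)
    token = issue_token(issued_at)
    print(redeem(token, token.code, issued_at + timedelta(minutes=5)))  # True
    print(redeem(token, token.code, issued_at + timedelta(minutes=6)))  # False, already used
```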

Data Flow Architecture

Monitoring Data Flow

┌──────────────┐
│ Data Sources │
│ • Servers    │
│ • Apps       │
│ • Databases  │
│ • APIs       │
└──────┬───────┘
       │ (metrics, logs, traces)
       ↓
┌──────────────────┐
│ Data Collection  │
│ • Telegraf       │
│ • Prometheus     │
│ • Custom agents  │
└──────┬───────────┘
       │
       ↓
┌──────────────────┐
│ Time-Series DB   │
│ • Redis          │
│ • InfluxDB       │
│ • Prometheus     │
└──────┬───────────┘
       │
       ↓
┌──────────────────────────────────┐
│ Grafana Enterprise               │
│ • Query data sources             │
│ • Transform & aggregate          │
│ • Apply alert rules              │
│ • Render visualizations          │
└──────┬───────────────────────────┘
       │
       ├─────────────────┬─────────────────┬─────────────────┐
       ↓                 ↓                 ↓                 ↓
┌─────────────┐  ┌──────────────┐  ┌─────────────┐  ┌──────────────┐
│ Dashboards  │  │ Alerts       │  │ Reports     │  │ APIs         │
│ (Web UI)    │  │ (Email/TG)   │  │ (PDF/Email) │  │ (External)   │
└─────────────┘  └──────────────┘  └─────────────┘  └──────────────┘

User Activity Logging Flow

User Action in Grafana
         ↓
┌─────────────────────────────────────────┐
│ Grafana Request Handler                 │
│ • HTTP request received                 │
│ • Session validation                    │
│ • Authorization check                   │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Logger (router_logging)                 │
│ • Context: userId, orgId, uname         │
│ • Request: method, path, status         │
│ • Performance: duration, size           │
└────────┬────────────────────────────────┘
         │
         ├──────────────┬──────────────────┐
         ↓              ↓                  ↓
┌──────────────┐  ┌─────────────┐  ┌──────────────┐
│ journalctl   │  │ /var/log/   │  │ PostgreSQL   │
│ (systemd)    │  │ grafana/    │  │ (audit log)  │
└──────┬───────┘  └──────┬──────┘  └──────┬───────┘
       │                 │                │
       └────────┬────────┴────────────────┘
                ↓
┌─────────────────────────────────────────┐
│ Monitoring Scripts                      │
│ • Compliance monitoring                 │
│ • Active user tracking                  │
│ • Audit reporting                       │
└─────────────────────────────────────────┘

Security Architecture

Security Layers

┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Network Security                                   │
│ • Cloudflare Tunnel (TLS 1.3)                              │
│ • No exposed ports                                          │
│ • DDoS protection                                           │
│ • Rate limiting                                             │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Authentication                                     │
│ • Username/password (bcrypt hashed)                        │
│ • Session management                                        │
│ • Password complexity requirements                         │
│ • Account lockout after failed attempts                    │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Authorization                                      │
│ • Role-based access control (RBAC)                         │
│ • Organization-level isolation                             │
│ • Dashboard-level permissions                              │
│ • Data source access control                               │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Application Security                              │
│ • Auto-disable new accounts                                │
│ • Admin approval required                                  │
│ • Input validation                                         │
│ • XSS/CSRF protection                                      │
│ • SQL injection prevention                                 │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Data Security                                     │
│ • PostgreSQL authentication                                │
│ • Encrypted database connections                           │
│ • Secure credential storage                                │
│ • Audit logging                                            │
└─────────────────────────────────────────────────────────────┘

Secret Management

Secrets Storage:
  • /etc/default/grafana-auto-disable: Auto-disable credentials
  • /etc/default/grafana-compliance: Compliance monitoring credentials
  • /etc/default/grafana-active-users: Telegram bot credentials
  • /etc/grafana/grafana.ini: Database and SMTP credentials
  • /etc/msmtprc: Email SMTP credentials

Permissions: All secret files are 600 (root-only read/write)
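
The permission rule can be verified mechanically. The Python sketch below reports any listed file that is missing or whose mode is not 600; it checks the mode bits only, not ownership, and the function names are hypothetical.

```python
import os
import stat
from typing import Iterable

SECRET_FILES = (
    "/etc/default/grafana-auto-disable",
    "/etc/default/grafana-compliance",
    "/etc/default/grafana-active-users",
    "/etc/grafana/grafana.ini",
    "/etc/msmtprc",
)


def mode_is_600(path: str) -> bool:
    """True when the permission bits of the file are exactly rw-------."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600


def audit(paths: Iterable[str] = SECRET_FILES) -> list[str]:
    """Return the paths that are missing or not mode 600, in input order."""
    problems = []
    for path in paths:
        try:
            if not mode_is_600(path):
                problems.append(path)
        except FileNotFoundError:
            problems.append(path)
    return problems


if __name__ == "__main__":
    for path in audit():
        print("needs attention:", path)
```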


Monitoring & Observability

Platform Health Monitoring

Metrics Tracked:
  • Grafana uptime and performance
  • PostgreSQL connection pool status
  • Redis hit/miss ratios
  • Email delivery success rates
  • User session counts
  • Dashboard load times
  • API response times

Alert Channels:
  1. Email: High-priority alerts
  2. Telegram: Real-time operational updates
  3. Grafana UI: Dashboard-based alerts

Compliance & Audit

Compliance Monitoring:
  • Hourly checks for admin-only activity
  • Automated reporting to First2Lead
  • Dashboard access tracking
  • User activity correlation

Audit Logging:
  • All user logins
  • Dashboard access
  • Configuration changes
  • User management actions
  • API calls

Log Retention:
  • System logs: 30 days (systemd journal, read with journalctl)
  • Grafana logs: 30 days (/var/log/grafana/)
  • PostgreSQL logs: 7 days
  • Audit logs: 90 days (database)


Operational Workflows

Daily Operations

Automated Tasks:
  • ✅ User activity monitoring (every 15 minutes)
  • ✅ Auto-disable new users (every minute)
  • ✅ Compliance monitoring (hourly, 06:00-18:00 SAST)
  • ✅ Database connection health checks
  • ✅ Service status monitoring

Manual Tasks:
  • Review pending user approvals (as needed)
  • Review compliance alerts (daily)
  • Dashboard maintenance (weekly)
  • System updates (monthly)

Incident Response

Alert Triggered
         ↓
┌─────────────────────────────────────────┐
│ Notification Sent                       │
│ • Telegram pulse                        │
│ • Email alert                           │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Triage                                  │
│ • Severity assessment                   │
│ • Impact analysis                       │
│ • Initial investigation                 │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Diagnosis                               │
│ • Check system logs                     │
│ • Review metrics                        │
│ • Identify root cause                   │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Remediation                             │
│ • Apply fix                             │
│ • Verify resolution                     │
│ • Document incident                     │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Post-Mortem                             │
│ • Root cause analysis                   │
│ • Prevention measures                   │
│ • Documentation update                  │
└─────────────────────────────────────────┘

Scalability & Performance

Current Capacity

  • Concurrent Users: 100+
  • Dashboards: Unlimited
  • Data Sources: 20+
  • Query Performance: < 2s average
  • Uptime Target: 99.5%
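
To make the uptime target concrete, it can be converted into a downtime budget. The short Python calculation below is plain arithmetic; the function name is illustrative.

```python
def downtime_budget_minutes(target_percent: float, days: int) -> float:
    """Minutes of downtime allowed over 'days' while still meeting the target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.5, 30)
    print(f"{budget:.0f} minutes ({budget / 60:.1f} hours) per 30 days")
    # prints "216 minutes (3.6 hours) per 30 days"
```

A 99.5% target over a 30-day window therefore allows about 3.6 hours of downtime.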

Performance Optimization

Database:
  • Connection pooling (max 100 connections)
  • Query optimization with indexes
  • Regular VACUUM operations
  • Monitoring slow queries

Caching:
  • Redis caching for frequent queries
  • Browser caching for static assets
  • Dashboard cache TTL: 5 minutes

Resource Limits:
  • Grafana memory: 2 GB max
  • PostgreSQL memory: 4 GB shared buffers
  • Redis memory: 1 GB max


Disaster Recovery

Backup Strategy

Database Backups:
  • Frequency: Daily automated backups
  • Retention: 30 days
  • Location: Local + offsite
  • Type: Full PostgreSQL dumps
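
The 30-day retention rule can be expressed as a small pruning check. This Python sketch assumes each backup is known by a name and the date it was taken; the file names are hypothetical and the actual backup tooling is not described in this document.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention period stated for database backups


def expired_backups(backups: dict[str, date], today: date) -> list[str]:
    """Return names of backups taken more than RETENTION_DAYS before 'today'."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return sorted(name for name, taken in backups.items() if taken < cutoff)


if __name__ == "__main__":
    inventory = {
        "athena-2025-02-28.sql.gz": date(2025, 2, 28),  # hypothetical file names
        "athena-2025-03-30.sql.gz": date(2025, 3, 30),
    }
    print(expired_backups(inventory, today=date(2025, 3, 31)))
```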

Configuration Backups:
  • Grafana configuration files
  • Systemd unit files
  • Email templates
  • Scripts and automation

Recovery Objectives:
  • RTO (Recovery Time Objective): 4 hours
  • RPO (Recovery Point Objective): 24 hours

Failover Procedures

  1. Database Failure: Restore from latest backup
  2. Grafana Failure: Restart service, verify logs
  3. Network Failure: Cloudflare automatic failover
  4. Complete System Failure: Rebuild from documentation + backups

Integration Points

External Systems

  1. Mailgun (SMTP)
    • Transactional emails
    • Delivery tracking
    • Bounce handling
  2. Telegram
    • Bot API for notifications
    • Group messaging
    • Real-time alerts
  3. Cloudflare
    • DNS management
    • Tunnel service
    • Analytics
  4. Digital-On Support Portal
    • Support ticketing integration
    • User documentation links
    • Contact management

API Endpoints

Grafana HTTP API:
  • /api/users: User management
  • /api/dashboards: Dashboard operations
  • /api/datasources: Data source config
  • /api/admin/users: Admin operations
  • /api/health: Health checks
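
As an example of using the health endpoint, the Python sketch below treats the platform as healthy when /api/health answers with HTTP 200 and a JSON body whose database field is "ok". The function names are hypothetical, and the payload shape should be confirmed against the running Grafana version.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Interpret a health response: HTTP 200 plus database == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def check(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch /api/health and evaluate it; network errors count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as response:
            return is_healthy(response.status, response.read().decode("utf-8"))
    except (urllib.error.URLError, TimeoutError):
        return False


if __name__ == "__main__":
    print(is_healthy(200, '{"database": "ok", "version": "12.3.0"}'))  # True
```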


Technology Stack Summary

Component      Technology          Version     Purpose
Platform       Grafana Enterprise  12.3.0      Monitoring & Visualization
Database       PostgreSQL          16          Data persistence
Cache          Redis               Latest      Time-series & caching
Tunnel         Cloudflare          Latest      Secure access
Email          Mailgun + msmtp     Latest      Email delivery
Notifications  Telegram Bot API    Latest      Real-time alerts
OS             Ubuntu Server       Latest LTS  Host operating system
Init           systemd             Latest      Service management
Scripting      Bash                5.2+        Automation scripts

Success Metrics

Key Performance Indicators (KPIs)

Platform Availability:
  • Target: 99.5% uptime
  • Measured: Last 30 days average

User Satisfaction:
  • Login success rate > 99%
  • Dashboard load time < 2s
  • Zero data loss incidents

Security Compliance:
  • 100% of new users require approval
  • Hourly compliance monitoring active
  • Zero unauthorized access incidents

Operational Efficiency:
  • Automated user management
  • Automated compliance reporting
  • 15-minute active user visibility
  • Minimal manual intervention required


Future Roadmap

Planned Enhancements

  1. Enhanced Monitoring
    • Additional data source integrations
    • Custom plugin development
    • Advanced alerting rules
  2. User Experience
    • Mobile app support
    • Customizable user dashboards
    • Advanced visualization options
  3. Automation
    • Automated dashboard provisioning
    • Self-service user management
    • Intelligent alert correlation
  4. Integration
    • API-first architecture expansion
    • Webhook support for external systems
    • SSO/LDAP integration
  5. Analytics
    • Usage analytics dashboard
    • Performance trending
    • Capacity planning insights

Conclusion

The Athena Platform represents a complete, production-ready monitoring solution for healthcare technology infrastructure. With automated compliance monitoring, secure user management, professional branding, and comprehensive observability, it provides Digital-On Tech with the tools needed to maintain visibility and control over critical systems.

Key Strengths:
  • ✅ Professional, branded user experience
  • ✅ Automated compliance and security workflows
  • ✅ Real-time monitoring and alerting
  • ✅ Comprehensive documentation
  • ✅ Production-ready and scalable

Platform Status: