Infrastructure Monitoring

🟢 Operational Service | 192.168.20.151:9000

Comprehensive infrastructure monitoring and management stack providing real-time visibility into Docker Swarm clusters, application performance, and system health across the RESTIV Technology infrastructure.

Overview

The Infrastructure Monitoring service combines Portainer for visual container management with Prometheus/Grafana for metrics collection and visualization, creating a complete observability platform for our distributed infrastructure.

Key Capabilities

  • Container Orchestration: Docker Swarm management via Portainer dashboard
  • Performance Monitoring: Prometheus metrics collection with Grafana visualization
  • Real-time Alerting: Automated notifications for system anomalies and failures
  • Resource Management: CPU, memory, storage, and network utilization tracking
  • Service Discovery: Automatic detection and monitoring of new services

Architecture

Core Components

  • Portainer: Visual Docker Swarm management and container orchestration
  • Prometheus: Time-series metrics collection with service discovery
  • Grafana: Data visualization, dashboards, and alerting
  • Node Exporter: System-level metrics collection across all nodes
  • Traefik: Reverse proxy and load balancer with metrics exposure

Infrastructure Topology

Primary Proxmox Host (192.168.20.151)
├── Docker Swarm Manager
│ ├── Portainer (port 9000)
│ ├── Prometheus (port 9090)
│ ├── Grafana (port 3000)
│ └── Traefik (ports 80, 443)
├── Worker Nodes
│ ├── Node Exporter (port 9100)
│ └── Application Services
└── Storage Systems
  ├── VM-Fast (NVMe storage)
  ├── VM-Storage (standard storage)
  └── Shared-CephFS (distributed storage)

Features

📊 Portainer Dashboard

Visual container management with:

  • Service Management: Start, stop, scale, and update Docker services
  • Stack Deployment: Deploy multi-service applications as stacks from Docker Compose files
  • Volume Management: Create and manage persistent storage volumes
  • Network Configuration: Configure custom networks and service discovery
  • User Access Control: Role-based access to different environments

📈 Prometheus Monitoring

Comprehensive metrics collection including:

  • System Metrics: CPU, memory, disk, and network utilization
  • Container Metrics: Per-container resource usage and performance
  • Application Metrics: Custom application metrics via /metrics endpoints
  • Infrastructure Health: Service availability and response times
  • Alert Rules: Configurable alerting based on threshold violations

🎛️ Grafana Dashboards

Rich visualization and analytics:

  • Infrastructure Overview: High-level system health and capacity
  • Service Performance: Application-specific metrics and SLA tracking
  • Resource Utilization: Historical trends and capacity planning
  • Alert Management: Visual alert status and escalation workflows
  • Custom Dashboards: Tailored views for different stakeholder needs

🔍 Service Discovery

Automated monitoring setup:

  • Docker Service Discovery: Automatic detection of new containers
  • Consul Integration: Service registry and health checking
  • Dynamic Configuration: Auto-updating monitoring configs
  • Multi-environment Support: Development, staging, and production isolation
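
One way to picture "auto-updating monitoring configs" is Prometheus file-based service discovery, where a JSON file of target groups is regenerated whenever services change. The sketch below builds such a file's contents. Whether this deployment uses file-based discovery at all is an assumption, and the service names and the `env` label are hypothetical.

```python
import json


def build_file_sd(services, env):
    """Build Prometheus file_sd JSON from a service -> targets mapping.

    services: dict mapping a job name to a list of "host:port" strings.
    env: environment label value (for example "production").
    """
    groups = []
    for name in sorted(services):
        groups.append(
            {
                "targets": sorted(services[name]),
                "labels": {"job": name, "env": env},
            }
        )
    return json.dumps(groups, indent=2)


# Hypothetical input: ports follow the topology shown earlier.
EXAMPLE = build_file_sd(
    {
        "grafana": ["192.168.20.151:3000"],
        "traefik": ["192.168.20.151:80"],
    },
    env="production",
)
```

Prometheus re-reads file_sd files when they change, so a small job that rewrites the file is enough to pick up new targets without a restart.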

Getting Started

Access Requirements

  • VPN or internal network access to 192.168.20.151
  • Portainer user account (request via IT)
  • SSH access for advanced troubleshooting (authorized personnel only)

Portainer Access

  1. Access Dashboard

    http://192.168.20.151:9000
  2. Login

    • Use your assigned Portainer credentials
    • Select the appropriate Docker Swarm environment
  3. Navigate Services

    • View running services in the Services tab
    • Monitor resource usage in the Containers tab
    • Manage volumes and networks as needed

Grafana Dashboards

  1. Access Grafana

    http://192.168.20.151:3000
  2. Browse Dashboards

    • Infrastructure Overview: System-wide health and performance
    • Docker Swarm: Container and service metrics
    • Application Performance: Service-specific monitoring
    • Alert Dashboard: Current alerts and escalation status

Key Dashboards

Infrastructure Overview Dashboard

  • System uptime and availability metrics
  • Resource utilization across all nodes
  • Network throughput and latency
  • Storage capacity and performance

Docker Swarm Dashboard

  • Service health and scaling status
  • Container resource consumption
  • Image pull and deployment metrics
  • Swarm node health and availability

Application Performance Dashboard

  • Response times and error rates
  • Database performance metrics
  • Cache hit ratios and efficiency
  • Background job processing status

Alerting & Notifications

Alert Categories

  • Critical: Service outages, resource exhaustion
  • Warning: Performance degradation, capacity thresholds
  • Info: Deployment events, configuration changes

Notification Channels

  • Slack integration for real-time alerts
  • Email notifications for critical issues
  • Webhook integration for automation workflows
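
The relationship between alert categories and notification channels can be sketched as a small routing table. The exact mapping below is an assumption for illustration: the source only states that Slack carries real-time alerts, email covers critical issues, and webhooks feed automation workflows.

```python
# Hypothetical routing table; adjust to match the real alerting setup.
ROUTES = {
    "critical": ["slack", "email"],
    "warning": ["slack"],
    "info": ["webhook"],
}


def channels_for(severity):
    """Return the notification channels for an alert severity."""
    key = severity.strip().lower()
    if key not in ROUTES:
        raise ValueError(f"unknown alert severity: {severity!r}")
    # Return a copy so callers cannot mutate the routing table.
    return list(ROUTES[key])
```

Rejecting unknown severities, rather than silently dropping them, keeps a mislabeled alert from going unnoticed.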

Alert Response Procedures

  1. Immediate Assessment: Check Grafana dashboards for context
  2. Initial Response: Basic troubleshooting and service restart if needed
  3. Escalation: Contact infrastructure team for complex issues
  4. Documentation: Update runbooks with resolution details

Infrastructure Services Status

Current Deployment (192.168.20.151)

  • Portainer: 1/1 replicas (Healthy)
  • Prometheus: 1/1 replicas (Active)
  • Grafana: 1/1 replicas (Operational)
  • Traefik: 1/1 replicas (Load Balancing)
  • Node Exporter: 3/3 replicas (Global deployment)
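
The replica counts above use the running/desired form that `docker service ls` shows in its REPLICAS column. A minimal sketch for turning such a string into a health check follows; the function names are illustrative and not part of any existing tooling.

```python
import re

_REPLICAS = re.compile(r"(\d+)/(\d+)")


def parse_replicas(text):
    """Extract (running, desired) from text such as '3/3 replicas'."""
    match = _REPLICAS.search(text)
    if match is None:
        raise ValueError(f"no replica count found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_fully_replicated(text):
    """True when at least the desired number of replicas is running."""
    running, desired = parse_replicas(text)
    return desired > 0 and running >= desired
```

For example, `is_fully_replicated("Node Exporter: 3/3 replicas")` is true, while a service reporting `2/3` would be flagged as degraded.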

Storage Utilization

  • VM-Fast: NVMe storage for high-performance workloads
  • VM-Storage: Standard storage for general applications
  • Shared-CephFS: Distributed storage for shared data

Network Configuration

  • Internal Network: 192.168.20.0/24 (Primary subnet)
  • Service Discovery: Traefik with automatic SSL/TLS
  • Load Balancing: Round-robin with health checks
  • Ingress Control: HTTP/HTTPS routing with middleware
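
As a quick sanity check on the addressing, the sketch below uses Python's standard `ipaddress` module to test whether a host falls inside the 192.168.20.0/24 primary subnet. The helper name is illustrative.

```python
import ipaddress

# Primary subnet as documented above.
PRIMARY_SUBNET = ipaddress.ip_network("192.168.20.0/24")


def in_primary_subnet(address):
    """Return True if the IPv4 address belongs to the primary subnet."""
    return ipaddress.ip_address(address) in PRIMARY_SUBNET
```

Both documented Proxmox hosts, 192.168.20.151 and 192.168.20.60, pass this check; an address such as 192.168.21.5 does not.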

Maintenance & Operations

Routine Maintenance

  • Daily: Monitor dashboard alerts and system health
  • Weekly: Review capacity utilization and performance trends
  • Monthly: Update container images and security patches
  • Quarterly: Capacity planning and infrastructure optimization

Backup Procedures

  • Configuration Backup: Daily backup of Portainer and Grafana configs
  • Metrics Retention: 30-day detailed metrics, 1-year aggregated data
  • Volume Snapshots: Automated snapshots of persistent storage
  • Disaster Recovery: Documented procedures for service restoration
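
The retention policy can be read as a two-tier rule, sketched below. Treating "1-year" as 365 days and the tier names themselves are assumptions made for illustration; the actual retention is enforced by the metrics storage configuration, not by code like this.

```python
from datetime import datetime, timedelta, timezone

DETAILED_WINDOW = timedelta(days=30)     # full-resolution metrics
AGGREGATED_WINDOW = timedelta(days=365)  # assumption: 1 year = 365 days


def retention_tier(sample_time, now):
    """Classify a sample by age: 'detailed', 'aggregated', or 'expired'."""
    age = now - sample_time
    if age < timedelta(0):
        raise ValueError("sample_time is in the future")
    if age <= DETAILED_WINDOW:
        return "detailed"
    if age <= AGGREGATED_WINDOW:
        return "aggregated"
    return "expired"


# Fixed reference time so the example is reproducible.
NOW = datetime(2025, 1, 31, tzinfo=timezone.utc)
```

With `NOW` as the reference, a sample from ten days earlier is still "detailed", one from 90 days earlier is "aggregated", and one from 400 days earlier is "expired".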

Performance Optimization

  • Resource Allocation: Monitor and adjust container resource limits
  • Cache Configuration: Optimize caching strategies for frequently accessed data
  • Network Optimization: Tune network settings for improved throughput
  • Storage Performance: Monitor disk I/O and optimize storage allocation

Troubleshooting

Common Issues

High Resource Utilization

  • Check Grafana dashboards for resource-intensive services
  • Scale services horizontally if needed via Portainer
  • Review container resource limits and reservations
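
For a scripted version of the first check, the sketch below builds a Prometheus instant-query URL and picks out busy nodes from the JSON response. It assumes Prometheus is reachable on port 9090 as in the topology and that Node Exporter's `node_cpu_seconds_total` metric is being scraped. The 80% threshold is a hypothetical value.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://192.168.20.151:9090"

# Per-instance CPU usage in percent, derived from idle time.
CPU_QUERY = (
    '100 - (avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def query_url(expr, base=PROMETHEUS_URL):
    """Build a URL for the Prometheus instant-query HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def busy_instances(response_body, threshold=80.0):
    """Return {instance: cpu_percent} for series at or above threshold.

    response_body is the JSON text returned by /api/v1/query for a
    query that yields an instant vector.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    busy = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value >= threshold:
            busy[instance] = value
    return busy
```

Fetching `query_url(CPU_QUERY)` with any HTTP client and passing the body to `busy_instances` yields the nodes worth inspecting first in Grafana or Portainer.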

Service Connectivity Issues

  • Verify Traefik routing configuration
  • Check Docker Swarm overlay network health
  • Validate service discovery and health checks

Dashboard Access Problems

  • Confirm VPN/network connectivity to 192.168.20.151
  • Verify user credentials and access permissions
  • Check service status in Portainer
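
The connectivity check can be scripted with a plain TCP probe, as in the minimal sketch below. The ports come from the topology section; the helper names and the two-second timeout are illustrative choices. An open port only shows that something is listening, not that the dashboard itself is healthy.

```python
import socket

HOST = "192.168.20.151"
# Dashboard ports as listed in the infrastructure topology.
SERVICES = {"Portainer": 9000, "Prometheus": 9090, "Grafana": 3000}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_dashboards(host=HOST, services=SERVICES):
    """Probe each dashboard port and return {service_name: reachable}."""
    return {name: port_open(host, port) for name, port in services.items()}
```

If `check_dashboards()` reports every port as closed, the problem is more likely VPN or network access than an individual service.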

Emergency Procedures

  1. Service Outage: Access Portainer to restart affected services
  2. Resource Exhaustion: Scale down non-critical services temporarily
  3. Network Issues: Check Traefik status and routing configuration
  4. Data Loss: Initiate backup recovery procedures

Expansion Planning

Secondary Infrastructure (192.168.20.60)

  • Proxmox Host: Available for scaling operations
  • K3s Template: Prepared for Kubernetes workload expansion
  • Available Resources: Significant CPU, memory, and storage capacity
  • Integration Planning: Future integration with primary infrastructure

Scaling Considerations

  • Horizontal Scaling: Add worker nodes to Docker Swarm cluster
  • Vertical Scaling: Increase resources for existing services
  • Multi-site Deployment: Expand monitoring to additional locations
  • Hybrid Architecture: Integration with cloud-based monitoring solutions

Status: ✅ Operational | Uptime: 99.9% | Monitored Services: 15+ | Last Updated: January 2025