Infrastructure Monitoring

🟢 Operational Service | 192.168.20.151:9000

Comprehensive infrastructure monitoring and management stack providing real-time visibility into Docker Swarm clusters, application performance, and system health across the RESTIV Technology infrastructure.

Overview

The Infrastructure Monitoring service combines Portainer for visual container management with Prometheus/Grafana for metrics collection and visualization, creating a complete observability platform for our distributed infrastructure.

Key Capabilities

Container Orchestration: Docker Swarm management via Portainer dashboard
Performance Monitoring: Prometheus metrics collection with Grafana visualization
Real-time Alerting: Automated notifications for system anomalies and failures
Resource Management: CPU, memory, storage, and network utilization tracking
Service Discovery: Automatic detection and monitoring of new services

Architecture

Core Components

Portainer: Visual Docker Swarm management and container orchestration
Prometheus: Time-series metrics collection with service discovery
Grafana: Data visualization, dashboards, and alerting
Node Exporter: System-level metrics collection across all nodes
Traefik: Reverse proxy and load balancer with metrics exposure

Infrastructure Topology

Primary Proxmox Host (192.168.20.151)
├── Docker Swarm Manager
│   ├── Portainer (port 9000)
│   ├── Prometheus (port 9090)
│   ├── Grafana (port 3000)
│   └── Traefik (ports 80, 443)
├── Worker Nodes
│   ├── Node Exporter (port 9100)
│   └── Application Services
└── Storage Systems
    ├── VM-Fast (NVMe storage)
    ├── VM-Storage (standard storage)
    └── Shared-CephFS (distributed storage)

Features

📊 Portainer Dashboard

Visual container management with:

Service Management: Start, stop, scale, and update Docker services
Stack Deployment: Deploy multi-service applications as stacks from Docker Compose files
Volume Management: Create and manage persistent storage volumes
Network Configuration: Configure custom networks and service discovery
User Access Control: Role-based access to different environments

📈 Prometheus Monitoring

Comprehensive metrics collection including:

System Metrics: CPU, memory, disk, and network utilization
Container Metrics: Per-container resource usage and performance
Application Metrics: Custom application metrics via /metrics endpoints
Infrastructure Health: Service availability and response times
Alert Rules: Configurable alerting based on threshold violations
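
As a rough illustration of what a /metrics endpoint returns, the sketch below parses a few lines of the Prometheus text exposition format. The `parse_metrics` helper and the sample metric names are hypothetical and not part of this deployment; a real integration would more likely rely on an official Prometheus client library.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?P<labels>\{.*\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?$'
)


def parse_metrics(text):
    """Return (name, labels, value) tuples from Prometheus exposition text.

    Comment lines (# HELP, # TYPE), blank lines, and lines that do not
    look like samples are skipped. Labels are kept as the raw "{...}" string.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # not a numeric sample value
        samples.append((match.group("name"), match.group("labels") or "", value))
    return samples


# Hypothetical payload, shaped like what an application might expose.
EXAMPLE = """\
# HELP app_requests_total Total HTTP requests.
# TYPE app_requests_total counter
app_requests_total{method="get"} 42
up 1
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```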

🎛️ Grafana Dashboards

Rich visualization and analytics:

Infrastructure Overview: High-level system health and capacity
Service Performance: Application-specific metrics and SLA tracking
Resource Utilization: Historical trends and capacity planning
Alert Management: Visual alert status and escalation workflows
Custom Dashboards: Tailored views for different stakeholder needs

🔍 Service Discovery

Automated monitoring setup:

Docker Service Discovery: Automatic detection of new containers
Consul Integration: Service registry and health checking
Dynamic Configuration: Auto-updating monitoring configs
Multi-environment Support: Development, staging, and production isolation

Getting Started

Access Requirements

VPN or internal network access to 192.168.20.151
Portainer user account (request via IT)
SSH access for advanced troubleshooting (authorized personnel only)

Portainer Access

Access Dashboard
```
http://192.168.20.151:9000
```
Login
- Use your assigned Portainer credentials
- Select the appropriate Docker Swarm environment
Navigate Services
- View running services in the Services tab
- Monitor resource usage in the Containers tab
- Manage volumes and networks as needed

Grafana Dashboards

Access Grafana
```
http://192.168.20.151:3000
```
Browse Dashboards
- Infrastructure Overview: System-wide health and performance
- Docker Swarm: Container and service metrics
- Application Performance: Service-specific monitoring
- Alert Dashboard: Current alerts and escalation status

Key Dashboards

Infrastructure Overview Dashboard

System uptime and availability metrics
Resource utilization across all nodes
Network throughput and latency
Storage capacity and performance
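
Panels like these are typically backed by PromQL queries against Prometheus. The sketch below builds the URL for an instant query through the Prometheus HTTP API (`/api/v1/query`). The base URL assumes Prometheus is reachable on the primary host at port 9090, as listed in the topology; the CPU query is a common Node Exporter expression and is an assumption, not necessarily what the actual dashboards use.

```python
from urllib.parse import urlencode

# Assumption: Prometheus on the primary host, port 9090.
PROMETHEUS_BASE = "http://192.168.20.151:9090"

# Per-node CPU busy percentage, derived from Node Exporter's idle counter.
CPU_BUSY_QUERY = (
    '100 - avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
)


def instant_query_url(promql, base=PROMETHEUS_BASE):
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


print(instant_query_url("up"))
```

Fetching the resulting URL from inside the internal network should return a JSON document whose `data.result` field holds one entry per matching series.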

Docker Swarm Dashboard

Service health and scaling status
Container resource consumption
Image pull and deployment metrics
Swarm node health and availability

Application Performance Dashboard

Response times and error rates
Database performance metrics
Cache hit ratios and efficiency
Background job processing status

Alerting & Notifications

Alert Categories

Critical: Service outages, resource exhaustion
Warning: Performance degradation, capacity thresholds
Info: Deployment events, configuration changes
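
To make the Critical and Warning categories concrete for capacity alerts, here is a minimal sketch that maps a utilization percentage to a severity. The 80% and 95% thresholds and the `classify_utilization` name are illustrative assumptions; the actual alert rules may use different values.

```python
def classify_utilization(percent, warning_at=80.0, critical_at=95.0):
    """Map a resource utilization percentage to an alert severity.

    Returns "Critical", "Warning", or None when no alert should fire.
    The default thresholds are hypothetical examples.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilization out of range: {percent}")
    if percent >= critical_at:
        return "Critical"
    if percent >= warning_at:
        return "Warning"
    return None


for reading in (42.0, 85.5, 97.0):
    print(reading, classify_utilization(reading))
```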

Notification Channels

Slack integration for real-time alerts
Email notifications for critical issues
Webhook integration for automation workflows

Alert Response Procedures

Immediate Assessment: Check Grafana dashboards for context
Initial Response: Basic troubleshooting and service restart if needed
Escalation: Contact infrastructure team for complex issues
Documentation: Update runbooks with resolution details

Infrastructure Services Status

Current Deployment (192.168.20.151)

Portainer: 1/1 replicas (Healthy)
Prometheus: 1/1 replicas (Active)
Grafana: 1/1 replicas (Operational)
Traefik: 1/1 replicas (Load Balancing)
Node Exporter: 3/3 replicas (Global deployment)
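
The running/desired replica counts above lend themselves to a simple automated check. The sketch below flags any service running fewer replicas than desired. The helper names are hypothetical, and the input dictionary simply mirrors the figures listed in this section rather than live data.

```python
def parse_replicas(text):
    """Split a "running/desired" string such as "3/3" into two integers."""
    running, desired = text.strip().split("/", 1)
    return int(running), int(desired)


def degraded_services(status):
    """Return the sorted names of services with fewer running than desired replicas."""
    degraded = []
    for name, replicas in status.items():
        running, desired = parse_replicas(replicas)
        if running < desired:
            degraded.append(name)
    return sorted(degraded)


# Values copied from the status list in this section.
CURRENT_DEPLOYMENT = {
    "Portainer": "1/1",
    "Prometheus": "1/1",
    "Grafana": "1/1",
    "Traefik": "1/1",
    "Node Exporter": "3/3",
}

print(degraded_services(CURRENT_DEPLOYMENT))
```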

Storage Utilization

VM-Fast: NVMe storage for high-performance workloads
VM-Storage: Standard storage for general applications
Shared-CephFS: Distributed storage for shared data

Network Configuration

Internal Network: 192.168.20.0/24 (Primary subnet)
Service Discovery: Traefik with automatic SSL/TLS
Load Balancing: Round-robin with health checks
Ingress Control: HTTP/HTTPS routing with middleware
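
As a conceptual illustration of round-robin load balancing with health checks, the sketch below rotates through backends and skips any that are marked unhealthy. It is not Traefik's implementation; the class name and the backend addresses are hypothetical.

```python
class RoundRobinBalancer:
    """Rotate through backends in order, skipping unhealthy ones."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cursor = 0

    def mark(self, backend, healthy):
        """Record the result of a health check for one backend."""
        if healthy:
            self._healthy.add(backend)
        else:
            self._healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, or raise if none is available."""
        for _ in range(len(self._backends)):
            backend = self._backends[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")


# Hypothetical task addresses on an overlay network.
balancer = RoundRobinBalancer(["10.0.1.2:8080", "10.0.1.3:8080", "10.0.1.4:8080"])
balancer.mark("10.0.1.3:8080", healthy=False)
print([balancer.next_backend() for _ in range(4)])
```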

Maintenance & Operations

Routine Maintenance

Daily: Monitor dashboard alerts and system health
Weekly: Review capacity utilization and performance trends
Monthly: Update container images and apply security patches
Quarterly: Capacity planning and infrastructure optimization

Backup Procedures

Configuration Backup: Daily backup of Portainer and Grafana configs
Metrics Retention: 30 days of detailed metrics, 1 year of aggregated data
Volume Snapshots: Automated snapshots of persistent storage
Disaster Recovery: Documented procedures for service restoration
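
The metrics retention policy can be expressed as a small age check. This sketch classifies a sample by age into the detailed tier, the aggregated tier, or expired. It assumes "1 year" means 365 days, and the `retention_tier` helper is hypothetical rather than part of the Prometheus configuration.

```python
from datetime import datetime, timedelta, timezone

DETAILED_RETENTION = timedelta(days=30)
AGGREGATED_RETENTION = timedelta(days=365)  # assumption: one year read as 365 days


def retention_tier(sample_time, now):
    """Return "detailed", "aggregated", or "expired" for a sample timestamp."""
    age = now - sample_time
    if age < timedelta(0):
        raise ValueError("sample_time is in the future")
    if age <= DETAILED_RETENTION:
        return "detailed"
    if age <= AGGREGATED_RETENTION:
        return "aggregated"
    return "expired"


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
print(retention_tier(now - timedelta(days=7), now))
print(retention_tier(now - timedelta(days=90), now))
print(retention_tier(now - timedelta(days=400), now))
```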

Performance Optimization

Resource Allocation: Monitor and adjust container resource limits
Cache Configuration: Optimize caching strategies for frequently accessed data
Network Optimization: Tune network settings for improved throughput
Storage Performance: Monitor disk I/O and optimize storage allocation

Troubleshooting

Common Issues

High Resource Utilization

Check Grafana dashboards for resource-intensive services
Scale services horizontally if needed via Portainer
Review container resource limits and reservations

Service Connectivity Issues

Verify Traefik routing configuration
Check Docker Swarm overlay network health
Validate service discovery and health checks

Dashboard Access Problems

Confirm VPN/network connectivity to 192.168.20.151
Verify user credentials and access permissions
Check service status in Portainer
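
The connectivity check can be scripted before digging further. The sketch below tests whether a TCP port accepts connections, using only the standard library. The `port_open` helper is hypothetical; the host and ports in the usage note are the ones documented on this page.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_open("192.168.20.151", 9000)` checks Portainer and `port_open("192.168.20.151", 3000)` checks Grafana. A `False` result on both usually points to a VPN or network problem rather than a single failed service.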

Emergency Procedures

Service Outage: Access Portainer to restart affected services
Resource Exhaustion: Scale down non-critical services temporarily
Network Issues: Check Traefik status and routing configuration
Data Loss: Initiate backup recovery procedures

Expansion Planning

Secondary Infrastructure (192.168.20.60)

Proxmox Host: Available for scaling operations
K3s Template: Prepared for Kubernetes workload expansion
Available Resources: Significant CPU, memory, and storage capacity
Integration Planning: Future integration with primary infrastructure

Scaling Considerations

Horizontal Scaling: Add worker nodes to Docker Swarm cluster
Vertical Scaling: Increase resources for existing services
Multi-site Deployment: Expand monitoring to additional locations
Hybrid Architecture: Integration with cloud-based monitoring solutions

Status: ✅ Operational | Uptime: 99.9% | Monitored Services: 15+ | Last Updated: January 2025
