Infrastructure Monitoring
🟢 Operational Service | 192.168.20.151:9000
A monitoring and management stack that provides real-time visibility into Docker Swarm clusters, application performance, and system health across the RESTIV Technology infrastructure.
Overview
The Infrastructure Monitoring service combines Portainer for visual container management with Prometheus/Grafana for metrics collection and visualization, creating a complete observability platform for our distributed infrastructure.
Key Capabilities
- Container Orchestration: Docker Swarm management via Portainer dashboard
- Performance Monitoring: Prometheus metrics collection with Grafana visualization
- Real-time Alerting: Automated notifications for system anomalies and failures
- Resource Management: CPU, memory, storage, and network utilization tracking
- Service Discovery: Automatic detection and monitoring of new services
Architecture
Core Components
- Portainer: Visual Docker Swarm management and container orchestration
- Prometheus: Time-series metrics collection with service discovery
- Grafana: Data visualization, dashboards, and alerting
- Node Exporter: System-level metrics collection across all nodes
- Traefik: Reverse proxy and load balancer with metrics exposure
Infrastructure Topology
Primary Proxmox Host (192.168.20.151)
├── Docker Swarm Manager
│   ├── Portainer (port 9000)
│   ├── Prometheus (port 9090)
│   ├── Grafana (port 3000)
│   └── Traefik (ports 80, 443)
├── Worker Nodes
│   ├── Node Exporter (port 9100)
│   └── Application Services
└── Storage Systems
    ├── VM-Fast (NVMe storage)
    ├── VM-Storage (standard storage)
    └── Shared-CephFS (distributed storage)
Features
📊 Portainer Dashboard
Visual container management with:
- Service Management: Start, stop, scale, and update Docker services
- Stack Deployment: Deploy complex applications using Docker Compose
- Volume Management: Create and manage persistent storage volumes
- Network Configuration: Configure custom networks and service discovery
- User Access Control: Role-based access to different environments
📈 Prometheus Monitoring
Comprehensive metrics collection including:
- System Metrics: CPU, memory, disk, and network utilization
- Container Metrics: Per-container resource usage and performance
- Application Metrics: Custom application metrics via /metrics endpoints
- Infrastructure Health: Service availability and response times
- Alert Rules: Configurable alerting based on threshold violations
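As an illustration of the "/metrics endpoints" item above, here is a minimal sketch of a service exposing one counter in the Prometheus text exposition format, using only the Python standard library. The metric name `app_requests_total`, the `METRICS` table, and port 8000 are assumptions made for the example. A real service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metrics: name -> (type, help text, current value).
METRICS = {
    "app_requests_total": ("counter", "Total requests handled.", 0),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics on the given port (8000 is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Prometheus would then need a scrape target pointing at that port, which is not shown here.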
🎛️ Grafana Dashboards
Rich visualization and analytics:
- Infrastructure Overview: High-level system health and capacity
- Service Performance: Application-specific metrics and SLA tracking
- Resource Utilization: Historical trends and capacity planning
- Alert Management: Visual alert status and escalation workflows
- Custom Dashboards: Tailored views for different stakeholder needs
🔍 Service Discovery
Automated monitoring setup:
- Docker Service Discovery: Automatic detection of new containers
- Consul Integration: Service registry and health checking
- Dynamic Configuration: Auto-updating monitoring configs
- Multi-environment Support: Development, staging, and production isolation
Getting Started
Access Requirements
- VPN or internal network access to 192.168.20.151
- Portainer user account (request via IT)
- SSH access for advanced troubleshooting (authorized personnel only)
Portainer Access
- Access Dashboard: http://192.168.20.151:9000
- Login
  - Use your assigned Portainer credentials
  - Select the appropriate Docker Swarm environment
- Navigate Services
  - View running services in the Services tab
  - Monitor resource usage in the Container tab
  - Manage volumes and networks as needed
Grafana Dashboards
- Access Grafana: http://192.168.20.151:3000
- Browse Dashboards
  - Infrastructure Overview: System-wide health and performance
  - Docker Swarm: Container and service metrics
  - Application Performance: Service-specific monitoring
  - Alert Dashboard: Current alerts and escalation status
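Before logging in, it can help to confirm that the services answer from your network. The sketch below probes the ports listed in this document. It uses Prometheus's `/-/healthy` endpoint and Grafana's `/api/health` endpoint. For Portainer it only requests the UI root, since the API status path varies by version. The function names are illustrative, not part of any existing tooling.

```python
import urllib.error
import urllib.request

HOST = "192.168.20.151"  # primary host from this document


def health_urls(host=HOST):
    """Map each service to a URL that should answer when it is up."""
    return {
        "portainer": f"http://{host}:9000/",            # UI root (assumption: no auth needed to load it)
        "prometheus": f"http://{host}:9090/-/healthy",  # Prometheus health endpoint
        "grafana": f"http://{host}:3000/api/health",    # Grafana health endpoint
    }


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with a 2xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def check_all(host=HOST):
    """Probe every service and return a name -> bool mapping."""
    return {name: is_reachable(url) for name, url in health_urls(host).items()}
```

Running `check_all()` from a machine on the VPN or internal network returns one boolean per service. A `False` for all three usually points at connectivity rather than at the services themselves.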
Key Dashboards
Infrastructure Overview Dashboard
- System uptime and availability metrics
- Resource utilization across all nodes
- Network throughput and latency
- Storage capacity and performance
Docker Swarm Dashboard
- Service health and scaling status
- Container resource consumption
- Image pull and deployment metrics
- Swarm node health and availability
Application Performance Dashboard
- Response times and error rates
- Database performance metrics
- Cache hit ratios and efficiency
- Background job processing status
Alerting & Notifications
Alert Categories
- Critical: Service outages, resource exhaustion
- Warning: Performance degradation, capacity thresholds
- Info: Deployment events, configuration changes
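To make the Critical and Warning categories concrete, the following sketch classifies a utilization percentage against two thresholds. The 80% and 95% defaults are assumptions for illustration only. The actual alert rules for this stack are not given in this document.

```python
def classify_utilization(percent, warning=80.0, critical=95.0):
    """Return "critical", "warning", or None for a utilization percentage.

    The default thresholds are illustrative assumptions, not the deployed rules.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilization must be within 0-100, got {percent}")
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return None  # below both thresholds: no alert
```

Info-level events (deployments, configuration changes) are not threshold-based, so they are deliberately outside the scope of this function.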
Notification Channels
- Slack integration for real-time alerts
- Email notifications for critical issues
- Webhook integration for automation workflows
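As a rough sketch of the Slack and webhook channels above, the code below builds a message body in the `{"text": ...}` shape that Slack incoming webhooks accept, and posts JSON to a URL. The message layout and function names are assumptions. The webhook URL must be supplied by the caller, and none is shown here.

```python
import json
import urllib.request


def build_slack_payload(severity, service, summary):
    """Build a Slack incoming-webhook body; the message layout is an assumption."""
    return {"text": f"[{severity.upper()}] {service}: {summary}"}


def post_json(url, payload, timeout=5.0):
    """POST a JSON payload to a webhook URL and return the HTTP status code."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

In practice Grafana or Alertmanager would normally send these notifications. A script like this is mainly useful for testing that a webhook is wired up correctly.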
Alert Response Procedures
- Immediate Assessment: Check Grafana dashboards for context
- Initial Response: Basic troubleshooting and service restart if needed
- Escalation: Contact infrastructure team for complex issues
- Documentation: Update runbooks with resolution details
Infrastructure Services Status
Current Deployment (192.168.20.151)
- Portainer: 1/1 replicas (Healthy)
- Prometheus: 1/1 replicas (Active)
- Grafana: 1/1 replicas (Operational)
- Traefik: 1/1 replicas (Load Balancing)
- Node Exporter: 3/3 replicas (Global deployment)
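The replica counts above can also be checked from a script. This sketch parses lines of the form `name running/desired`, such as those produced by `docker service ls --format "{{.Name}} {{.Replicas}}"` on a manager node, and lists services running fewer replicas than desired. The service names in the usage note are made up.

```python
import re

# Matches the leading "running/desired" pair, ignoring any trailing annotation.
_REPLICAS = re.compile(r"^(\d+)/(\d+)")


def parse_replicas(text):
    """Parse a replica string like "3/3" into (running, desired)."""
    match = _REPLICAS.match(text.strip())
    if not match:
        raise ValueError(f"unrecognised replica count: {text!r}")
    return int(match.group(1)), int(match.group(2))


def degraded_services(lines):
    """Return names of services with fewer running than desired replicas."""
    degraded = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, replicas = line.split(maxsplit=1)
        running, desired = parse_replicas(replicas)
        if running < desired:
            degraded.append(name)
    return degraded
```

For example, `degraded_services(["portainer 1/1", "grafana 0/1"])` reports only `grafana`.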
Storage Utilization
- VM-Fast: NVMe storage for high-performance workloads
- VM-Storage: Standard storage for general applications
- Shared-CephFS: Distributed storage for shared data
Network Configuration
- Internal Network: 192.168.20.0/24 (Primary subnet)
- Service Discovery: Traefik with automatic SSL/TLS
- Load Balancing: Round-robin with health checks
- Ingress Control: HTTP/HTTPS routing with middleware
Maintenance & Operations
Routine Maintenance
- Daily: Monitor dashboard alerts and system health
- Weekly: Review capacity utilization and performance trends
- Monthly: Update container images and security patches
- Quarterly: Capacity planning and infrastructure optimization
Backup Procedures
- Configuration Backup: Daily backup of Portainer and Grafana configs
- Metrics Retention: 30 days of detailed metrics, 1 year of aggregated data
- Volume Snapshots: Automated snapshots of persistent storage
- Disaster Recovery: Documented procedures for service restoration
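The two-tier retention policy above can be expressed as a small function. This is only a sketch of the policy as stated. The treatment of the boundary days (day 30 counts as detailed, day 365 as aggregated) is an assumption. How the tiers are implemented in practice is not described in this document. Prometheus itself has a single retention window set with `--storage.tsdb.retention.time`, so the aggregated tier presumably relies on additional tooling.

```python
from datetime import date

DETAILED_DAYS = 30     # detailed metrics window, from the policy above
AGGREGATED_DAYS = 365  # aggregated data window, from the policy above


def retention_tier(sample_date, today):
    """Return "detailed", "aggregated", or "expired" for a sample's age."""
    age_days = (today - sample_date).days
    if age_days < 0:
        raise ValueError("sample_date is in the future")
    if age_days <= DETAILED_DAYS:
        return "detailed"
    if age_days <= AGGREGATED_DAYS:
        return "aggregated"
    return "expired"
```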
Performance Optimization
- Resource Allocation: Monitor and adjust container resource limits
- Cache Configuration: Optimize caching strategies for frequently accessed data
- Network Optimization: Tune network settings for improved throughput
- Storage Performance: Monitor disk I/O and optimize storage allocation
Troubleshooting
Common Issues
High Resource Utilization
- Check Grafana dashboards for resource-intensive services
- Scale services horizontally if needed via Portainer
- Review container resource limits and reservations
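When the dashboards are unavailable, the same information can be pulled from the Prometheus HTTP API directly. The sketch below runs an instant query against `/api/v1/query` and lists the busiest instances. The PromQL expression is a common Node Exporter CPU query, and the 80% threshold is an assumption. Neither is taken from the deployed dashboards.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://192.168.20.151:9090"  # Prometheus port from this document

# CPU busy percentage per instance, derived from Node Exporter idle time.
CPU_QUERY = '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'


def query_url(expr, base=PROMETHEUS):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_vector(payload):
    """Turn an instant-vector API response into {instance: value}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample's value is a [timestamp, "string value"] pair.
    return {s["metric"].get("instance", ""): float(s["value"][1]) for s in data["result"]}


def busy_instances(usage, threshold=80.0):
    """Return instances at or above the threshold (an illustrative default)."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


def fetch_cpu_usage(base=PROMETHEUS, timeout=5.0):
    """Query Prometheus and return CPU busy percentage per instance."""
    with urllib.request.urlopen(query_url(CPU_QUERY, base), timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

`busy_instances(fetch_cpu_usage())` gives a quick shortlist of nodes to inspect in Portainer before deciding whether to scale a service.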
Service Connectivity Issues
- Verify Traefik routing configuration
- Check Docker Swarm overlay network health
- Validate service discovery and health checks
Dashboard Access Problems
- Confirm VPN/network connectivity to 192.168.20.151
- Verify user credentials and access permissions
- Check service status in Portainer
Emergency Procedures
- Service Outage: Access Portainer to restart affected services
- Resource Exhaustion: Scale down non-critical services temporarily
- Network Issues: Check Traefik status and routing configuration
- Data Loss: Initiate backup recovery procedures
Expansion Planning
Secondary Infrastructure (192.168.20.60)
- Proxmox Host: Available for scaling operations
- K3s Template: Prepared for Kubernetes workload expansion
- Available Resources: Significant CPU, memory, and storage capacity
- Integration Planning: Future integration with primary infrastructure
Scaling Considerations
- Horizontal Scaling: Add worker nodes to Docker Swarm cluster
- Vertical Scaling: Increase resources for existing services
- Multi-site Deployment: Expand monitoring to additional locations
- Hybrid Architecture: Integration with cloud-based monitoring solutions
Status: ✅ Operational | Uptime: 99.9% | Monitored Services: 15+ | Last Updated: January 2025