System Monitoring

System Monitoring: Infrastructure Health and Performance Management for rConfig V8

System monitoring for rConfig V8 addresses the challenge of maintaining reliable network configuration management operations by providing visibility into infrastructure health, application performance, and resource utilization. Organizations can leverage external monitoring tools to detect issues proactively, prevent service degradation, and maintain operational continuity for mission-critical configuration management workflows.

rConfig V8 is built on a technology stack that includes Laravel (PHP framework), a web server (Apache or Nginx), PHP-FPM, a database (PostgreSQL or MariaDB), and queue management (Laravel Horizon). Monitoring each of these components helps maintain performance, stability, and security across the entire application stack.

Disk capacity and I/O performance directly impact rConfig’s ability to store configurations, write logs, and maintain database operations.

Metrics to monitor:

  • Disk space utilization: Percentage of used space on partitions containing rConfig data
  • Available disk space: Absolute free space in gigabytes
  • Disk I/O wait: Percentage of CPU time waiting for disk operations
  • Read/write throughput: MB/s for sequential and random operations
  • IOPS: Input/output operations per second

Critical directories:

  • /var/www/html/rconfig8/current/storage: Configuration files, logs, backups, uploaded files
  • Database partition: PostgreSQL or MariaDB data directory
  • /tmp: Temporary files during operations

Why monitoring matters: Insufficient disk space prevents configuration downloads, breaks backup operations, and causes database failures. A full log partition can crash the application. High I/O wait indicates a storage bottleneck that degrades performance.

Recommended thresholds:

  • Warning: 80% disk utilization
  • Critical: 90% disk utilization
  • Alert: I/O wait sustained above 25%
  • Alert: Available space < 10GB on rConfig partition

Monitoring approaches:

df -h /var/www/html/rconfig8/current/storage
iostat -x 1 10

Use Nagios, Zabbix, or custom scripts integrating with df, iostat, and iotop for continuous monitoring.
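As an illustration of the "custom scripts" option, the following is a minimal Python sketch that applies the disk thresholds listed above. The function names are hypothetical, and mapping the "available space < 10GB" alert to a warning severity is an assumption made for this example, not rConfig behavior.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def disk_status(total_bytes, used_bytes, free_bytes,
                warn_pct=80.0, crit_pct=90.0, min_free_gib=10.0):
    """Classify a partition as 'ok', 'warning' or 'critical'.

    Thresholds mirror the recommended values in this section.
    Treating low free space as a warning is an assumption of this sketch.
    """
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct or free_bytes < min_free_gib * GIB:
        return "warning"
    return "ok"


def check_path(path):
    """Read live usage for the partition holding `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_status(usage.total, usage.used, usage.free)
```

Calling `check_path("/var/www/html/rconfig8/current/storage")` from a cron job or a monitoring agent plugin would return the severity for the rConfig storage partition. Note that `df` may report a slightly different percentage because it accounts for blocks reserved for root.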

The web server (Apache or Nginx) handles all user interface requests, API calls, and configuration downloads.

Metrics to monitor:

  • Request rate: Requests per second
  • Response time: Average and 95th percentile latency
  • Error rate: 4xx and 5xx HTTP status codes
  • Active connections: Current open connections
  • Queue depth: Pending requests waiting for processing

Why monitoring matters: Slow response times frustrate users and indicate capacity issues. High error rates suggest application problems requiring investigation. Connection exhaustion prevents new users from accessing rConfig.

Recommended thresholds:

  • Warning: Average response time > 2 seconds
  • Critical: Average response time > 5 seconds
  • Alert: 5xx error rate > 1% of requests
  • Alert: Active connections > 80% of configured maximum
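To make the latency and error-rate thresholds above concrete, here is a minimal Python sketch that evaluates them over a batch of requests. It assumes latencies (in seconds) and HTTP status codes have already been extracted from access logs or an exporter; the function names and the nearest-rank percentile method are choices made for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def web_status(latencies_s, status_codes):
    """Summarize a batch of requests against the thresholds in this section."""
    avg = sum(latencies_s) / len(latencies_s)
    errors_5xx = sum(1 for code in status_codes if 500 <= code <= 599)
    error_rate_pct = 100.0 * errors_5xx / len(status_codes)
    if avg > 5.0:
        severity = "critical"   # average response time > 5 seconds
    elif avg > 2.0:
        severity = "warning"    # average response time > 2 seconds
    else:
        severity = "ok"
    return {
        "avg_s": avg,
        "p95_s": percentile(latencies_s, 95),
        "error_rate_pct": error_rate_pct,
        "severity": severity,
        "error_alert": error_rate_pct > 1.0,  # 5xx rate above 1% of requests
    }
```

Reporting the 95th percentile alongside the average is useful because a small number of very slow requests can hide behind an acceptable mean.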

Apache monitoring:

Enable mod_status in Apache configuration:

<Location "/server-status">
    SetHandler server-status
    Require local
</Location>

Access status at http://localhost/server-status or integrate with monitoring tools.

Nginx monitoring:

Enable stub_status module:

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Monitoring tools: Apache mod_status, Nginx stub_status, Prometheus exporters, ELK Stack for log analysis, or Grafana dashboards.
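If no exporter is available, the plain-text output of the Nginx stub_status page can be parsed directly. The following Python sketch shows one way to do that; the sample text follows the documented stub_status layout with illustrative numbers, and the function names are hypothetical.

```python
import re

# Illustrative sample in the stub_status layout; the numbers are made up.
SAMPLE_STUB_STATUS = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Extract the counters from stub_status output into a dict of ints."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    accepts, handled, requests = (int(v) for v in totals.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts, "handled": handled, "requests": requests,
        "reading": reading, "writing": writing, "waiting": waiting,
    }


def connections_alert(active, max_connections, limit_pct=80.0):
    """True when active connections exceed limit_pct of the configured maximum."""
    return 100.0 * active / max_connections > limit_pct
```

In practice the text would be fetched from `http://127.0.0.1/nginx_status` on the rConfig server itself, since the location block above only allows local access. The value passed as `max_connections` depends on your own Nginx worker settings.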

PHP-FPM manages PHP worker processes that execute rConfig application code.

Metrics to monitor:

  • Active processes: Currently executing requests
  • Idle processes: Available workers ready for new requests
  • Total processes: Current worker pool size
  • Request duration: Time spent processing requests
  • Memory per process: Memory consumption of individual workers
  • Slow requests: Requests exceeding configured slow threshold

Why monitoring matters: Worker pool exhaustion causes request queuing and timeouts. Memory bloat indicates memory leaks requiring process recycling. Slow requests identify performance bottlenecks.

Recommended thresholds:

  • Warning: < 5 idle processes available
  • Critical: 0 idle processes (pool exhausted)
  • Alert: Average process memory > 256MB
  • Alert: Slow request count increasing

PHP-FPM status configuration:

Enable status page in PHP-FPM pool configuration:

pm.status_path = /php-fpm-status

Access via web server:

location ~ ^/php-fpm-status$ {
    access_log off;
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
}

Monitoring tools: PHP-FPM status page, New Relic APM, Datadog, or Prometheus php-fpm-exporter.
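The PHP-FPM status page can return JSON when requested with the `?json` query string, which makes the idle-process thresholds above easy to check from a script. The sketch below is a minimal Python example; the sample values are illustrative and the function name is hypothetical.

```python
import json

# Illustrative JSON in the shape returned by /php-fpm-status?json (values made up).
SAMPLE_FPM_JSON = """{
  "pool": "www",
  "process manager": "dynamic",
  "listen queue": 0,
  "idle processes": 3,
  "active processes": 7,
  "total processes": 10,
  "max children reached": 0,
  "slow requests": 2
}"""


def fpm_pool_severity(status_json, warn_idle=5):
    """Classify the pool using the idle-worker thresholds in this section."""
    data = json.loads(status_json)
    idle = data["idle processes"]
    if idle == 0:
        return "critical"   # pool exhausted
    if idle < warn_idle:
        return "warning"    # fewer than 5 idle workers
    return "ok"
```

The `slow requests` and `max children reached` counters in the same payload are cumulative, so a script tracking them should compare successive samples rather than alert on the absolute value.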

System resource constraints directly limit rConfig’s capacity to process operations concurrently.

Metrics to monitor:

  • Total memory utilization: Percentage of RAM in use
  • Available memory: Free RAM for new processes
  • Swap usage: Memory paged to disk (indicates RAM exhaustion)
  • CPU utilization: Percentage across all cores
  • Load average: System load over 1, 5, and 15-minute intervals
  • Process-specific metrics: Memory and CPU per rConfig component

Why monitoring matters: Memory exhaustion triggers swap usage, drastically degrading performance. High CPU utilization indicates capacity limits or inefficient code. Sustained high load suggests undersized infrastructure.

Recommended thresholds:

  • Warning: Memory utilization > 85%
  • Critical: Memory utilization > 95%
  • Alert: Swap usage > 100MB (swap should rarely be used)
  • Warning: CPU utilization > 80% sustained for 5+ minutes
  • Critical: Load average > number of CPU cores

Monitoring commands:

top
htop
free -h
vmstat 1

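PHP memory configuration: Increase PHP memory_limit from the default 128MB to approximately 50% of server RAM for dedicated rConfig servers. Because memory_limit applies to each PHP process individually, weigh this value against the number of PHP-FPM workers. See PHP Settings for detailed guidance.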

Monitoring tools: top, htop, Nagios, Zabbix, Prometheus node_exporter, or cloud provider metrics (CloudWatch, Azure Monitor).
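For a scripted check of the memory and load thresholds above, the following Python sketch computes memory utilization from `/proc/meminfo` content and compares the 1-minute load average with the core count. It assumes a Linux host; the function names are hypothetical, and using `MemAvailable` as the basis for "in use" is a choice made for this example.

```python
import os


def memory_used_pct(meminfo_text):
    """Percent of RAM in use, derived from MemTotal and MemAvailable (kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total


def resource_status(mem_pct, load_1min, cores):
    """Apply the memory and load-average thresholds from this section."""
    if mem_pct > 95.0 or load_1min > cores:
        return "critical"
    if mem_pct > 85.0:
        return "warning"
    return "ok"


def read_local_metrics():
    """Collect live values on a Linux host (not called automatically)."""
    with open("/proc/meminfo") as handle:
        mem_pct = memory_used_pct(handle.read())
    return mem_pct, os.getloadavg()[0], os.cpu_count()
```

A wrapper could call `resource_status(*read_local_metrics())` once a minute and forward anything other than "ok" to the alerting channel of your choice.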

Database health is critical as all rConfig operational data resides in PostgreSQL or MariaDB.

Metrics to monitor:

  • Query performance: Average query execution time
  • Slow queries: Queries exceeding threshold (typically 1-5 seconds)
  • Connection count: Active and total database connections
  • Connection pool utilization: Percentage of max_connections in use
  • Database size: Total storage consumed
  • Replication lag: For replicated databases, delay between primary and replica
  • Lock waits: Queries waiting for table locks
  • Buffer cache hit ratio: Percentage of data page reads served from memory rather than disk

Why monitoring matters: Slow queries degrade application responsiveness. Connection exhaustion prevents new operations. Database growth without monitoring leads to capacity exhaustion. Replication lag risks data loss during failover.

Recommended thresholds:

  • Warning: Average query time > 100ms
  • Critical: Slow queries (> 1 second) increasing
  • Warning: Connections > 70% of max_connections
  • Critical: Connections > 90% of max_connections
  • Alert: Replication lag > 60 seconds
  • Alert: Buffer cache hit ratio < 90%

PostgreSQL monitoring:

SELECT * FROM pg_stat_activity;
SELECT * FROM pg_stat_database;
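-- Requires the pg_stat_statements extension. PostgreSQL 13+ column names; on 12 and earlier use total_time and mean_time.
SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;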

MariaDB monitoring:

SHOW PROCESSLIST;
SHOW STATUS;
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SELECT * FROM information_schema.innodb_trx;

Monitoring tools: Percona Monitoring and Management (PMM), pgBadger, MySQLTuner, pg_stat_statements, Grafana with PostgreSQL/MySQL datasources, or Datadog database monitoring.
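The buffer cache hit ratio and connection utilization are derived values rather than raw counters. The Python sketch below shows the arithmetic, assuming the inputs come from `blks_hit` and `blks_read` in `pg_stat_database` for PostgreSQL, or from the `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` status variables for MariaDB. The function names are hypothetical.

```python
def pg_cache_hit_pct(blks_hit, blks_read):
    """PostgreSQL: share of block reads served from shared buffers."""
    total = blks_hit + blks_read
    return 100.0 * blks_hit / total if total else 100.0


def innodb_cache_hit_pct(read_requests, disk_reads):
    """MariaDB/InnoDB: share of logical reads that did not go to disk."""
    if not read_requests:
        return 100.0
    return 100.0 * (read_requests - disk_reads) / read_requests


def connection_status(current, max_connections):
    """Apply the 70% / 90% connection thresholds from this section."""
    pct = 100.0 * current / max_connections
    if pct > 90.0:
        return "critical"
    if pct > 70.0:
        return "warning"
    return "ok"
```

For example, 950 buffer hits against 50 disk reads gives a 95% hit ratio, which is above the 90% alert level listed above. These counters are cumulative since the last statistics reset, so comparing deltas between two samples gives a more current picture than the lifetime totals.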

Laravel Horizon manages background job processing for configuration downloads, compliance checks, and scheduled tasks.

Metrics to monitor:

  • Queue depth: Number of pending jobs in each queue
  • Processing rate: Jobs processed per minute
  • Failed jobs: Jobs that encountered errors
  • Worker status: Health of queue worker processes
  • Job wait time: Time jobs spend in queue before processing
  • Job processing time: Duration to complete jobs

Why monitoring matters: Growing queue depth indicates insufficient workers or stuck jobs. Failed jobs represent unsuccessful operations requiring investigation. Worker failures stop all background processing.

Recommended thresholds:

  • Warning: Queue depth > 100 jobs
  • Critical: Queue depth growing continuously
  • Alert: Failed jobs increasing
  • Critical: Horizon process not running
  • Alert: Jobs waiting > 5 minutes before processing
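"Queue depth growing continuously" needs a definition before it can be alerted on. One possible interpretation, sketched below in Python, is that every sample in a recent window is larger than the one before it. The function name, the minimum sample count, and the way samples are collected are all assumptions of this example.

```python
def queue_status(depth_samples, warn_depth=100, min_samples=3):
    """Classify a queue from pending-job counts ordered oldest to newest.

    'critical' when depth rose on every consecutive sample,
    'warning' when the latest depth exceeds warn_depth, otherwise 'ok'.
    """
    pairs = zip(depth_samples, depth_samples[1:])
    growing = len(depth_samples) >= min_samples and all(b > a for a, b in pairs)
    if growing:
        return "critical"
    if depth_samples[-1] > warn_depth:
        return "warning"
    return "ok"
```

With samples taken once a minute, `queue_status([10, 20, 40, 80])` reports a critical trend even though the depth is still below 100, while `queue_status([150, 140, 130])` reports only a warning because the backlog is draining.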

Monitoring Horizon:

Check Horizon status:

php /var/www/html/rconfig8/current/artisan horizon:status

Access Horizon dashboard: https://your-rconfig-server/horizon

Monitor Supervisord:

sudo supervisorctl status

Monitoring tools: Horizon built-in dashboard, Supervisord web interface, custom scripts checking process status, or integration with Prometheus via Laravel Horizon Prometheus exporter.

For comprehensive Horizon documentation, see Horizon Queue Manager.

Backup monitoring ensures configuration data protection and disaster recovery capabilities.

Metrics to monitor:

  • Backup success/failure: Whether backups complete without errors
  • Backup duration: Time required to complete backups
  • Backup size: Size of generated backup archives
  • Backup frequency: Ensuring scheduled backups execute as configured
  • Backup integrity: Verification that backup archives are valid

Why monitoring matters: A failed backup that is first noticed during disaster recovery is noticed too late. Increasing backup duration may indicate capacity issues. Tracking backup size growth supports capacity planning.

Recommended thresholds:

  • Critical: Any backup failure
  • Warning: Backup duration > 2x typical duration
  • Alert: Backup size growth > 50% month-over-month
  • Alert: Last successful backup > 48 hours ago

Monitoring approaches:

Review backup logs:

tail -f /var/www/html/rconfig8/current/storage/logs/laravel.log | grep -i backup

Check backup file timestamps:

ls -lth /var/www/html/rconfig8/current/storage/app/backups/ | head -10

Monitoring tools: Custom scripts parsing backup logs, integration with backup solutions (Bacula, Veeam), monitoring scheduled task execution, or alerting on backup file age.
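Alerting on backup file age can be done with a few lines of Python. The sketch below checks the newest file in a directory against the 48-hour threshold above. It assumes backups are written as regular files directly inside the directory being checked; the function names are hypothetical.

```python
import os
import time


def newest_file_age_hours(directory, now=None):
    """Age in hours of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [entry.stat().st_mtime
              for entry in os.scandir(directory) if entry.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(directory, max_age_hours=48, now=None):
    """True when no backup exists or the newest one is older than max_age_hours."""
    age = newest_file_age_hours(directory, now)
    return age is None or age > max_age_hours
```

Pointing `backup_is_stale` at `/var/www/html/rconfig8/current/storage/app/backups/` would flag both a missed schedule and an empty backup directory. File age alone does not prove the archive is valid, so this complements rather than replaces integrity verification.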

For backup configuration details, see Backups.

Security monitoring of critical configuration files detects unauthorized modifications.

Files to monitor:

  • .env file: Contains encryption keys, database credentials, and sensitive configuration
  • Application code: Unauthorized changes indicate compromise
  • Configuration files: Web server, PHP, and database configurations
  • SSL certificates: Expiration and replacement detection

Why monitoring matters: An unauthorized .env modification suggests that encryption keys and credentials may be compromised. Code changes may introduce vulnerabilities or backdoors. Configuration tampering causes application failures or security weaknesses.

Recommended monitoring:

  • Alert: Any .env file modification
  • Alert: Application code changes outside maintenance windows
  • Alert: SSL certificate expiration within 30 days
  • Alert: Configuration file modifications by unauthorized users

Monitoring approaches:

File integrity monitoring:

sudo tripwire --check

Monitor .env with inotify:

inotifywait -m /var/www/html/rconfig8/current/.env

Monitoring tools: Tripwire, AIDE (Advanced Intrusion Detection Environment), OSSEC, or custom scripts using file checksums.
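For the "custom scripts using file checksums" option, a minimal Python sketch could record SHA-256 digests as a baseline and report any file whose digest no longer matches. The function names and the in-memory baseline dictionary are assumptions of this example; a real deployment would store the baseline somewhere an attacker cannot rewrite.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(baseline):
    """Compare files against a {path: expected_digest} baseline.

    Returns the paths whose content changed or that can no longer be read.
    """
    changed = []
    for path, expected in baseline.items():
        try:
            actual = sha256_of(path)
        except OSError:
            actual = None  # missing or unreadable counts as a change
        if actual != expected:
            changed.append(path)
    return changed
```

A baseline for the `.env` file could be built with `{path: sha256_of(path)}` after each authorized change, then checked on a schedule. Unlike inotify, this approach detects changes made while the watcher was not running.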

Log monitoring provides early warning of application errors, exceptions, and security events.

Log locations:

  • Laravel application logs: /var/www/html/rconfig8/current/storage/logs/laravel.log
  • Web server logs: /var/log/apache2/ or /var/log/nginx/
  • PHP-FPM logs: /var/log/php-fpm/
  • Database logs: PostgreSQL or MariaDB log directories
  • System logs: /var/log/syslog or /var/log/messages

Why monitoring matters: Repeated errors indicate application bugs or configuration issues. Exception patterns reveal performance problems. Security events in logs enable threat detection.

Recommended monitoring:

  • Alert: PHP fatal errors or exceptions
  • Alert: Database connection failures
  • Alert: Authentication failures (potential brute force)
  • Warning: Deprecation warnings in logs
  • Alert: Disk space errors writing logs

Monitoring approaches:

Real-time log watching:

tail -f /var/www/html/rconfig8/current/storage/logs/laravel.log

Search for errors:

grep -i "error\|exception\|fatal" /var/www/html/rconfig8/current/storage/logs/laravel.log

Monitoring tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Graylog, Grafana Loki, or custom scripts with alerting.
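A custom script can go a step beyond grep by counting entries per severity level. The Python sketch below assumes the default Laravel log line layout, `[timestamp] channel.LEVEL: message`; if the logging format has been customized, the pattern would need adjusting. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches the start of a default-format Laravel log entry,
# e.g. "[2024-05-01 10:00:00] production.ERROR: ..."
LEVEL_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<channel>[\w-]+)\.(?P<level>[A-Z]+):")


def count_levels(lines):
    """Count log entries per level; stack-trace lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts
```

Opening `laravel.log` and passing the file object to `count_levels` yields totals such as the number of ERROR and CRITICAL entries, which can then be compared between runs to detect a rising error rate.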

For log management details, see Application Log and System Logs.

Uptime monitoring: Ensure the rConfig web interface and API endpoints remain accessible. Use external monitoring services (Pingdom, UptimeRobot, StatusCake) to detect outages from the user's perspective.

SSL certificate expiration: Monitor certificate validity and expiration dates. Alert 30 days before expiration to allow renewal time. Use openssl commands or monitoring tools with SSL certificate checks.
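As an alternative to openssl commands, certificate expiry can be checked with the Python standard library. The sketch below is a minimal example with hypothetical function names. It assumes the server presents a certificate that validates against the system trust store; a self-signed or already expired certificate would raise an error during the handshake instead of returning a date.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining given a certificate 'notAfter' string.

    Expects the format used by getpeercert(), e.g. 'Jan  1 00:00:00 2030 GMT'.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host and return the 'notAfter' field of its certificate."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiry_alert(not_after, threshold_days=30, now=None):
    """True when the certificate expires within threshold_days."""
    return days_until_expiry(not_after, now) <= threshold_days
```

A scheduled job could call `expiry_alert(fetch_not_after("your-rconfig-server"))` daily, with the hostname replaced by your own, and raise an alert once the 30-day window is reached.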

Network connectivity: Monitor connectivity to managed devices. Device unreachability may indicate network issues rather than rConfig problems. Track success rates for device connections.

API performance: If using rConfig API, monitor response times, error rates, and rate limit consumption. Degraded API performance impacts integrations and automation.

| Tool | Purpose | Suitable For |
| --- | --- | --- |
| Nagios | General infrastructure monitoring with alerting | Organizations with existing Nagios infrastructure |
| Zabbix | Comprehensive system and application monitoring | Enterprises requiring detailed metrics and trending |
| Prometheus + Grafana | Time-series metrics collection with visualization | Modern containerized or cloud-native deployments |
| ELK Stack | Log aggregation, analysis, and searching | Environments with high log volume requiring centralized logging |
| Datadog | Cloud-based monitoring and APM | Organizations preferring SaaS monitoring solutions |
| New Relic | Application performance monitoring with deep visibility | Environments needing code-level performance insights |
| Tripwire / AIDE | File integrity monitoring for security | Security-conscious environments with compliance requirements |
| Percona PMM | MySQL/MariaDB and PostgreSQL monitoring | Database-heavy deployments requiring query optimization |

Centralize monitoring: Use a unified monitoring platform aggregating metrics from all rConfig components. Distributed monitoring across multiple tools complicates correlation and alerting. Select one primary platform (e.g., Zabbix, Prometheus) and integrate all metrics.

Implement gradual alerting: Configure warning thresholds before critical thresholds to enable proactive response. For example, alert at 80% disk usage (warning) and 90% (critical), allowing intervention before exhaustion.

Set up alert escalation: Configure escalation policies that route alerts to the appropriate personnel based on severity and duration. Critical alerts may page on-call engineers, while warnings can be sent by email during business hours.

Create monitoring dashboards: Build dashboards providing at-a-glance status of critical metrics. Executives benefit from high-level availability dashboards, while operations teams need detailed component metrics.

Monitor external dependencies: Track external services rConfig integrates with (LDAP, SSO, Zabbix, external APIs). External dependency failures impact rConfig operations but may not manifest in rConfig metrics.

Test monitoring alerting: Periodically trigger test alerts to verify that alerting mechanisms function correctly. Monitoring that sends alerts no one receives fails its purpose.

Document response procedures: Create runbooks mapping each alert type to specific investigation and remediation steps. This accelerates incident response and enables junior staff to handle common issues.

Review monitoring regularly: Quarterly reviews should evaluate whether monitoring thresholds remain appropriate as infrastructure scales, usage patterns change, or new features are deployed.

Tune alert thresholds: Adjust thresholds based on operational experience. Thresholds triggering frequent false positives cause alert fatigue, while thresholds set too high miss real issues.

Correlate metrics across layers: When investigating issues, examine metrics across all layers (infrastructure, database, application, queue). Problems in one layer often manifest as symptoms in others.

Secure monitoring credentials: Monitoring systems access sensitive infrastructure metrics and logs. Protect monitoring platform credentials with the same rigor as production system access.

Encrypt monitoring data: Ensure monitoring data is encrypted in transit (TLS) between agents and collectors. Monitoring data may contain sensitive information about infrastructure topology and vulnerabilities.

Restrict monitoring access: Limit monitoring dashboard access to authorized personnel. Full infrastructure visibility assists attackers in reconnaissance.

Monitor the monitors: Implement meta-monitoring so that the monitoring platform itself remains operational. Monitoring failures during incidents compound problems.

Audit monitoring changes: Track who modifies monitoring configurations, thresholds, or alert recipients. Unauthorized monitoring changes may indicate compromise or insider threats.

Retain monitoring data: Configure retention periods aligned with regulatory requirements. Some frameworks require 90-day to multi-year retention of monitoring data and alerts.

Generate compliance reports: Produce reports demonstrating monitoring coverage, alert response times, and system availability for auditors.

Document monitoring strategy: Create formal documentation describing monitoring architecture, covered components, alert definitions, and response procedures. Auditors often require documented monitoring practices.

Implement change tracking: Monitor and log all system changes (configuration, software updates, user modifications) for audit trails and forensic investigation.

Integrate monitoring with Horizon queue metrics:

  • Track queue depth trends to identify capacity issues
  • Alert on failed job patterns indicating systemic problems
  • Monitor job processing times to detect performance degradation
  • Correlate queue depth with scheduled task execution

Track compliance-related metrics:

  • Policy evaluation completion rates
  • Compliance check duration trends
  • Failed compliance evaluation patterns
  • Compliance result storage growth

Monitor device connection operations:

  • Success/failure rates per vendor or device type
  • Configuration download duration trends
  • Authentication failure patterns
  • Network timeout patterns by geographic region
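Success and failure rates per vendor are simple to compute once connection results are available as records. The Python sketch below assumes each result has been reduced to a `(vendor, succeeded)` pair; how those pairs are obtained from rConfig is outside the scope of this example, and the function name is hypothetical.

```python
from collections import defaultdict


def success_rates(results):
    """Return {vendor: success percentage} from (vendor, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # vendor -> [successes, attempts]
    for vendor, succeeded in results:
        totals[vendor][1] += 1
        if succeeded:
            totals[vendor][0] += 1
    return {vendor: 100.0 * ok / attempts
            for vendor, (ok, attempts) in totals.items()}
```

Grouping by vendor or device type in this way helps separate a systemic problem, such as a credential change affecting one vendor's devices, from isolated device outages.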

Symptom: Monitoring Agent Cannot Connect to rConfig

Diagnosis: Verify firewall rules and authentication.

Possible Causes:

  • Firewall blocking monitoring agent connections
  • Incorrect monitoring credentials
  • Monitoring endpoint disabled in rConfig
  • Network connectivity issues

Resolution Steps:

  1. Verify monitoring endpoint accessibility from monitoring server
  2. Check firewall rules permit monitoring agent connections
  3. Validate monitoring credentials
  4. Review rConfig logs for authentication failures
  5. Test network connectivity between monitoring server and rConfig

Symptom: Monitoring Shows High Memory Usage

Diagnosis: Identify which processes consume memory.

Possible Causes:

  • PHP memory_limit set too high allowing individual processes to consume excessive RAM
  • Memory leak in application code or PHP extensions
  • Database query results loading large datasets into memory
  • Insufficient RAM for workload

Resolution Steps:

  1. Identify top memory-consuming processes: ps aux --sort=-%mem | head -20
  2. Review PHP memory_limit setting in php.ini
  3. Check for memory leaks by monitoring individual process memory over time
  4. Optimize database queries returning large result sets
  5. Consider increasing server RAM if workload legitimately requires more memory

Symptom: Alert Fatigue from Too Many Warnings

Diagnosis: Review alert thresholds and frequencies.

Possible Causes:

  • Thresholds set too aggressively
  • Normal operational patterns triggering warnings
  • Lack of alert de-duplication
  • Missing alert priority classification

Resolution Steps:

  1. Analyze alert history to identify most frequent alerts
  2. Adjust thresholds for alerts triggering frequently without requiring action
  3. Implement alert de-duplication to group related alerts
  4. Re-classify alerts based on actual urgency observed operationally
  5. Consider implementing alert suppression during maintenance windows
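Alert de-duplication, mentioned in the resolution steps above, can be illustrated with a short Python sketch. It keeps the first alert for each host and check combination and suppresses repeats inside a time window. The tuple layout, the 5-minute window, and the function name are assumptions of this example; most monitoring platforms provide an equivalent built-in feature.

```python
def deduplicate(alerts, window_s=300):
    """Suppress repeated alerts for the same (host, check) within window_s seconds.

    alerts: list of (timestamp_s, host, check) tuples sorted by timestamp.
    Returns the alerts that would actually be sent.
    """
    last_sent = {}
    kept = []
    for timestamp, host, check in alerts:
        key = (host, check)
        if key not in last_sent or timestamp - last_sent[key] >= window_s:
            kept.append((timestamp, host, check))
            last_sent[key] = timestamp
    return kept
```

Because the window is measured from the last alert that was sent rather than the last one received, a continuously firing check still produces a reminder every `window_s` seconds instead of being silenced indefinitely.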

Comprehensive system monitoring is essential for maintaining reliable rConfig V8 operations. Organizations must monitor infrastructure components (disk, memory, CPU), application services (web server, PHP-FPM, database), and rConfig-specific functions (queue management, backups, file integrity) to detect issues proactively and maintain service quality.

Key takeaways for effective monitoring:

  • Monitor all infrastructure layers from hardware resources through application components to detect issues at their source
  • Centralize monitoring in a unified platform to simplify correlation, alerting, and troubleshooting
  • Set appropriate thresholds balancing early warning with alert fatigue, adjusting based on operational experience
  • Document response procedures to accelerate incident response and enable effective delegation
  • Test monitoring regularly to ensure alerting functions correctly when incidents occur

Effective monitoring transforms reactive firefighting into proactive issue prevention, maintaining rConfig availability and performance for mission-critical network configuration management operations. Regular review and refinement of monitoring strategies ensures coverage evolves with infrastructure growth and changing operational requirements.