HA Options Overview: High Availability Architecture for rConfig V8
High Availability (HA) for rConfig V8 addresses the challenge of maintaining continuous network configuration management operations in mission-critical environments where downtime impacts business operations, compliance requirements, or security posture. Organizations can leverage various HA architectures to achieve resilience objectives aligned with their specific Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
There is no universal HA architecture for rConfig deployments. The optimal approach depends on internal SLA requirements, existing infrastructure capabilities, operational team expertise, and budget constraints. This document outlines the most common HA strategies and key considerations for designing resilient rConfig deployments.
Understanding High Availability Requirements
Defining Your HA Objectives
Before implementing HA architecture, establish clear objectives that drive design decisions:
Recovery Time Objective (RTO): Maximum acceptable downtime duration before operations must resume. This determines whether manual failover (minutes to hours) or automated failover (seconds to minutes) is required.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. This determines replication frequency, backup strategies, and storage synchronization requirements.
Availability Target: Desired uptime percentage (e.g., 99.9% = 8.76 hours downtime/year). This influences redundancy levels and fault tolerance mechanisms.
Business Impact: Cost of downtime measured in operational disruption, compliance violations, security risks, or revenue impact. This justifies HA investment and complexity.
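The availability-target arithmetic above is simple to reproduce: annual downtime is the unavailable fraction multiplied by the hours in a year. A minimal sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def annual_downtime_hours(availability_pct: float) -> float:
    """Maximum downtime per year implied by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

# 99.9% allows 8.76 hours/year; each extra "nine" divides the budget by ten
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {annual_downtime_hours(target):.2f} h/year")
```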
Common HA Drivers
Organizations implement HA for various reasons:
Mission-critical operations: Network configuration management serves as infrastructure foundation. Downtime prevents device provisioning, configuration changes, and compliance monitoring.
Compliance requirements: Regulatory frameworks may mandate continuous monitoring and audit trail availability. HA ensures compliance systems remain operational.
Geographic distribution: Multi-site operations require resilient configuration management across locations. Regional failures shouldn’t impact other sites.
Service level agreements: Internal or customer-facing SLAs may require specific availability guarantees that standard single-server deployments cannot meet.
Change management velocity: High-frequency configuration changes require reliable systems to prevent change backlogs during outages.
High Availability Architecture Options
Active/Standby Configuration (Recommended)
The Active/Standby model provides the optimal balance of reliability, operational simplicity, and cost-effectiveness for most rConfig deployments. This architecture maintains two complete rConfig nodes: one actively serving requests while the other remains on standby, ready to assume operations during failures.
Architecture components:
Primary (Active) Node: Serves all web interface traffic, processes device connections, executes scheduled tasks, and handles queue workers.
Secondary (Standby) Node: Maintains identical configuration and application code, monitors primary node health, and activates automatically or manually upon primary failure.
Failover Mechanism: Implements Virtual IP (VIP) using keepalived, corosync/pacemaker, or cloud-native load balancer health checks to redirect traffic upon failure detection.
Shared Storage: Both nodes access common persistent storage containing configuration files, compliance results, backups, and application data.
Database Replication: Database operates in primary/replica mode with automatic or manual promotion of replica to primary during failover.
How Active/Standby failover works:
- Normal operations: Primary node holds VIP and serves all requests. Standby node monitors primary health via heartbeat mechanism.
- Failure detection: Standby detects primary failure through missed heartbeats (typically 3-5 second intervals).
- VIP takeover: Standby node claims VIP, causing all network traffic to route to standby.
- Database promotion: Standby promotes its database replica to primary (automatic with Patroni/repmgr, manual otherwise).
- Service restoration: Standby node fully assumes active role, typically within 30-120 seconds depending on automation level.
- Primary recovery: When original primary recovers, it assumes standby role (preventing “split-brain” scenarios).
Benefits of Active/Standby:
- Simpler to deploy and maintain than Active/Active
- Clear operational state (one node active at any time)
- No session synchronization complexity
- Reduced licensing costs (only one active instance)
- Suitable for manual or automated failover models
- Compatible with existing infrastructure without complex load balancing
Limitations of Active/Standby:
- Standby node resources remain unused during normal operations
- Brief downtime during failover (30-120 seconds typical)
- Manual intervention may be required for some failure scenarios
- Does not provide load distribution across nodes
Recommended for: Production environments requiring high availability without the complexity of Active/Active, organizations with RTO objectives of 1-5 minutes, and deployments where occasional brief downtime is acceptable.
Active/Active Configuration (Advanced)
Active/Active deployments distribute traffic across multiple simultaneously operating rConfig nodes using load balancers. While this architecture provides load distribution and zero-downtime failover, it introduces significant complexity and operational overhead.
Architecture components:
Multiple Active Nodes: Two or more rConfig servers simultaneously serving web traffic and processing operations.
Load Balancer: Distributes incoming requests across active nodes using round-robin, least-connections, or other algorithms. Performs health checks to remove failed nodes from rotation.
Session Management: Requires sticky sessions (session affinity) or external session storage (Redis, Memcached) to maintain user session consistency across nodes.
Shared Storage: All nodes access common persistent storage with locking mechanisms to prevent concurrent modification conflicts.
Database Cluster: Multi-primary database configuration or load-balanced read replicas with write to primary.
Queue Coordination: Queue workers must coordinate across nodes to prevent duplicate job execution.
Challenges with Active/Active:
Session synchronization: User sessions must remain consistent when requests route to different nodes. This requires sticky sessions (complicating load balancing) or external session storage (adding infrastructure complexity).
Queue worker coordination: Multiple nodes running queue workers can execute the same job multiple times without sophisticated coordination mechanisms. This leads to duplicate device connections, redundant compliance checks, and wasted resources.
File locking contention: Shared storage requires distributed file locking to prevent concurrent modifications. This adds latency and potential deadlock scenarios.
Configuration consistency: Application configuration must remain synchronized across nodes. Changes to one node must propagate to others, or inconsistent behavior emerges.
Operational complexity: Troubleshooting becomes more difficult when requests may process on any node. Log correlation and error reproduction require additional tooling.
Cost implications: Requires additional infrastructure (load balancers, session storage), higher licensing costs (multiple active instances), and ongoing Professional Services support.
When Active/Active makes sense:
- Compliance or regulatory requirements mandate zero-downtime availability
- RTO objectives require sub-second failover
- Extremely high request volumes require load distribution (thousands of concurrent users)
- Budget accommodates increased infrastructure and operational costs
- Technical team has expertise in distributed systems and load-balanced architectures
Implementation requirements:
Active/Active deployments require Professional Services engagement for:
- Architecture design and capacity planning
- Load balancer configuration and health check tuning
- Session management implementation
- Queue worker coordination strategy
- Shared storage optimization and locking mechanisms
- Monitoring and alerting specific to distributed architecture
- Operational runbooks and troubleshooting procedures
Horizontal Scaling Without Central Manager
Organizations managing thousands of devices across diverse environments can implement horizontal scaling by distributing device inventory across multiple independent rConfig instances. This approach provides scalability without the complexity of centralized multi-tenant management.
How horizontal scaling works:
Instance segmentation: Deploy multiple standalone rConfig servers, each managing a subset of total device inventory.
Inventory distribution strategies:
- Geographic: Separate instances per region, data center, or site
- Functional: Different instances for routers, switches, firewalls, etc.
- Customer-based: MSPs assign separate instances per customer
- Technology-based: Separate instances for different vendors or device families
Independence: Each instance operates autonomously with its own database, storage, and user management.
Unified visibility: External tools aggregate data across instances for enterprise-wide reporting, compliance dashboards, or configuration searches.
Benefits of horizontal scaling:
- Avoids single point of failure affecting all managed devices
- Simplifies capacity planning (scale by adding instances)
- Isolates customer or regional data for security or compliance
- Reduces blast radius of configuration errors or security incidents
- Enables gradual migration or technology refresh per instance
Integration approaches:
API aggregation: Custom scripts or applications query multiple rConfig instances via API, aggregating results for unified dashboards or reports.
Git repository synchronization: Configuration files from multiple instances sync to central Git repositories for version control and change tracking.
External monitoring integration: Tools like Zabbix, Prometheus, or Splunk collect metrics from all instances for consolidated monitoring.
Compliance reporting consolidation: Export compliance results from each instance, aggregate in external data warehouse or business intelligence platform.
Recommended for: MSP environments with isolated customer segments, large enterprises with regional operational autonomy, organizations preferring decentralized management, and deployments exceeding capacity of single-server architecture.
Database High Availability
Database resilience is critical for rConfig HA because all operational state, device inventory, compliance history, and audit trails reside in the database. Both PostgreSQL and MariaDB support various HA configurations.
PostgreSQL High Availability
PostgreSQL offers mature replication and failover capabilities suitable for rConfig deployments:
Streaming replication with automatic failover:
- Primary database accepts writes, replica(s) receive continuous replication stream
- Tools like Patroni or repmgr monitor primary health and orchestrate automatic failover
- Replica promotion to primary typically completes within 30-60 seconds
- Supports multiple replicas for read scaling or additional redundancy
Configuration requirements:
- Enable WAL archiving and streaming replication in postgresql.conf
- Configure replication slots to prevent WAL segment deletion before replica consumption
- Set synchronous_commit to on or remote_apply for zero data loss (impacts performance)
- Implement connection pooling (PgBouncer) for efficient connection management
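As an illustrative sketch only (values and the archive path are assumptions to be tuned per environment), the primary-side settings might look like:

```ini
# postgresql.conf on the primary -- illustrative values
wal_level = replica            # required for streaming replication
max_wal_senders = 5            # one per replica plus headroom
max_replication_slots = 5      # slots keep WAL until replicas consume it
synchronous_commit = on        # or remote_apply for zero data loss (higher latency)
archive_mode = on
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
```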
Managed PostgreSQL services:
- AWS RDS for PostgreSQL: Automated failover, point-in-time recovery, read replicas
- Azure Database for PostgreSQL: Built-in HA with zone redundancy
- Google Cloud SQL for PostgreSQL: Regional HA with automatic failover
rConfig configuration: Update the .env database connection string to point to the primary database endpoint. During failover, Patroni/repmgr updates DNS or the VIP to redirect connections to the new primary. No rConfig application changes are required.
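For illustration, the relevant .env entries might look like the following (the hostname is a placeholder for whatever DNS name or VIP your failover tooling manages):

```ini
DB_CONNECTION=pgsql
# DNS name or VIP that Patroni/repmgr repoints to the current primary
DB_HOST=db-primary.example.internal
DB_PORT=5432
DB_DATABASE=rconfig
DB_USERNAME=rconfig
```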
MariaDB High Availability
MariaDB provides multiple HA topologies suited to different requirements:
Primary/Replica replication:
- Asynchronous replication from primary to one or more replicas
- Manual or automated failover promotes replica to primary
- Simple to configure but brief data loss possible during failover
- Suitable for environments accepting eventual consistency
Galera Cluster:
- Multi-primary synchronous replication across cluster nodes
- Writes can occur on any node with automatic conflict resolution
- No data loss during node failures
- Requires odd number of nodes (3 minimum) for quorum
- More complex to operate but provides active/active database capability
ProxySQL or HAProxy integration:
- Database proxy layer routes connections to healthy primary
- Monitors backend database health via health checks
- Automatically redirects connections during failover
- Provides connection pooling and query routing
rConfig configuration: Point the .env database settings to the HAProxy/ProxySQL VIP or load balancer; the proxy handles routing to a healthy database backend. With Galera, any node can serve writes, but routing all connections through the proxy to a single writer is recommended to prevent write conflicts and connection storms.
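A minimal HAProxy sketch for this pattern (addresses are placeholders; a production health check should verify Galera sync state, e.g. wsrep_local_state = 4, rather than rely on a bare TCP probe):

```
listen mariadb_galera
    bind *:3306
    mode tcp
    option tcpka
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup
    server db3 10.0.0.13:3306 check backup
```

Marking db2 and db3 as backup keeps all writes on one node at a time, avoiding multi-node write conflicts while preserving automatic failover.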
Database High Availability Best Practices
Monitor replication lag: Streaming replication can fall behind during high write volume. Alert when lag exceeds acceptable thresholds (typically 5-10 seconds) to prevent data loss during failover.
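On a PostgreSQL replica, the lag in seconds can be read with SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()). A sketch of the threshold logic, using the 5/10-second figures above (treating a missing reading as critical is an assumption):

```python
from typing import Optional

def classify_lag(lag_seconds: Optional[float],
                 warn: float = 5.0, crit: float = 10.0) -> str:
    """Map a replication-lag reading to an alert level.

    None means the lag query returned no row (e.g. replication stopped),
    which is treated as critical rather than healthy.
    """
    if lag_seconds is None or lag_seconds >= crit:
        return "critical"
    if lag_seconds >= warn:
        return "warning"
    return "ok"
```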
Test failover procedures regularly: Quarterly failover testing validates automation, identifies issues, and maintains team familiarity with procedures. Document actual failover times and data loss (if any) for RTO/RPO verification.
Implement connection pooling: PgBouncer (PostgreSQL) or ProxySQL (MariaDB) reduce connection overhead, enable seamless database backend changes, and prevent connection exhaustion during high load.
Configure appropriate backup retention: Even with HA replication, maintain independent backups. Replication propagates corruption or accidental deletion to replicas. Backups enable recovery from logical errors.
Size database servers appropriately: Replica servers must handle full production load to serve as primary during failover. Under-provisioned replicas cause performance degradation post-failover.
Persistent Storage High Availability
rConfig requires persistent storage for configuration files, compliance results, backups, and system logs. This storage must remain accessible during node failures to maintain operations.
Shared Storage Requirements
The /var/www/html/rconfig8/current/storage directory (and particularly the storage/app subdirectory) contains critical persistent data:
- Configuration repository: Downloaded device configurations organized by device and date
- Compliance results: Policy evaluation outcomes and historical trends
- Backup archives: System backup ZIP files
- Application logs: Laravel framework and application logs
- Uploaded files: User-uploaded files like device import CSVs
In HA deployments, this storage must be accessible to all nodes with consistent view of data.
Storage Architecture Options
Network File System (NFS):
- Mount NFS share to identical path on all rConfig nodes
- NFS server provides centralized storage with file-level access
- Simple to configure and widely supported
- Performance adequate for most deployments
- Single point of failure unless NFS server is also HA
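An illustrative /etc/fstab entry mounting the shared path over NFSv4 (server name and export path are assumptions):

```
# hard: block rather than silently fail during server outages
# _netdev: wait for networking before mounting at boot
nfs-server.example.internal:/export/rconfig  /var/www/html/rconfig8/current/storage  nfs4  rw,hard,noatime,_netdev  0  0
```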
GlusterFS:
- Distributed filesystem replicating data across multiple storage nodes
- No single point of failure with replica count ≥ 2
- Scales horizontally by adding storage nodes
- More complex to deploy and maintain than NFS
- Excellent performance with proper tuning
Cloud-native storage (EFS, Azure Files):
- Managed storage services with built-in redundancy
- Simple to configure with no storage server management
- Scales automatically based on usage
- Higher cost than self-managed options
- Performance varies based on service tier
Considerations for storage selection:
Latency: rConfig frequently reads/writes configuration files. High-latency storage impacts download performance and web interface responsiveness. Target < 10ms latency for file operations.
Throughput: Large deployments downloading thousands of configurations simultaneously require substantial throughput. Ensure storage can handle peak load without bottlenecking.
IOPS: Database operations, compliance checks, and queue processing generate significant IOPS. Under-provisioned storage causes performance degradation.
Locking: Distributed filesystems require proper locking mechanisms to prevent concurrent modification corruption. Verify NFS is configured with appropriate locking (lockd, statd).
Backup integration: Ensure shared storage is included in backup strategies. Storage failure without backups results in complete configuration history loss.
Failover Mechanisms
Virtual IP (VIP) with Keepalived
Keepalived implements VRRP (Virtual Router Redundancy Protocol) to provide VIP failover between nodes:
How it works:
- Active node runs keepalived process maintaining VIP on network interface
- Standby node monitors active via VRRP heartbeat (multicast or unicast)
- Upon active failure, standby detects missed heartbeats and claims VIP
- Network traffic immediately routes to standby node
Configuration considerations:
- Both nodes must be on same network segment (Layer 2 adjacency)
- Firewall must permit VRRP protocol (IP protocol 112)
- Configure appropriate heartbeat intervals (default 1 second)
- Implement health check scripts to detect application failures, not just node failures
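A minimal keepalived sketch for the active node (interface name, VIP, and health-check script path are placeholders; the standby uses state BACKUP and a lower priority):

```
vrrp_script chk_rconfig {
    script "/usr/local/bin/check_rconfig.sh"   # exit 0 = application healthy
    interval 2
    fall 3     # three consecutive failures trigger failover
    rise 2
}

vrrp_instance RCONFIG_VIP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150              # standby node uses a lower value, e.g. 100
    advert_int 1              # heartbeat every second
    virtual_ipaddress {
        10.0.0.100/24
    }
    track_script {
        chk_rconfig
    }
}
```

The track_script block is what makes this an application-level check rather than a node-level one, per the consideration above.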
Corosync/Pacemaker
Corosync/Pacemaker provides comprehensive cluster resource management:
Capabilities:
- VIP management with automatic failover
- Service start/stop orchestration
- Fencing (STONITH) to prevent split-brain scenarios
- Resource dependencies and ordering
- Complex failure detection and recovery policies
Complexity trade-off: Significantly more complex to configure than keepalived but provides enterprise-grade cluster management. Recommended for environments with existing Pacemaker expertise or requirements for sophisticated resource management.
Cloud-Native Load Balancers
AWS, Azure, and GCP provide native load balancing with health checks:
AWS Application Load Balancer (ALB):
- HTTP/HTTPS health checks to rConfig web interface
- Automatically removes failed instances from target group
- Integrates with Auto Scaling for automatic node replacement
Azure Load Balancer:
- Layer 4 load balancing with TCP health probes
- Supports HA ports for all traffic types
- Zone-redundant for cross-availability-zone resilience
Configuration approach: Configure a health check endpoint (e.g., /api/health) that validates:
- Web server responding
- Database connectivity
- Queue worker operational
- Critical services running
Failed health checks trigger automatic traffic redirection to healthy nodes.
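Whatever the endpoint implementation, the aggregation logic is the same: report healthy only when every sub-check passes, and return a status code the load balancer acts on. A hedged sketch (check names are illustrative, not rConfig's actual API):

```python
from typing import Callable, Dict, Tuple

def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each sub-check; 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing probe counts as a failure
    return (200 if all(results.values()) else 503), results

# Illustrative wiring -- real probes would test the web server, database,
# queue workers, and critical services listed above
status, detail = health_status({
    "web": lambda: True,
    "database": lambda: True,
    "queue_worker": lambda: True,
})
```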
Monitoring and Automation
HA deployments require comprehensive monitoring to detect failures before they impact operations:
Node Health Monitoring
System metrics:
- CPU utilization: Alert when sustained > 80%
- Memory usage: Alert when available < 20%
- Disk space: Alert when free space < 20%
- Disk I/O wait: Alert when iowait > 25%
- Network connectivity: Monitor interface status and throughput
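With node_exporter metrics in Prometheus, the CPU and disk thresholds above might be expressed as alerting rules like these (group and alert names, and the hold durations, are assumptions):

```yaml
groups:
  - name: rconfig-node-health
    rules:
      - alert: NodeHighCPU
        # average non-idle CPU over 5m, sustained for 10m
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 10m
        labels:
          severity: warning
      - alert: NodeLowDiskSpace
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes < 0.20
        for: 15m
        labels:
          severity: warning
```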
Application metrics:
- Web server response time: Alert when > 3 seconds
- Queue depth: Alert when pending jobs > threshold
- Database connection pool: Alert when available connections < 10%
- PHP-FPM worker saturation: Alert when all workers busy
Service Health Monitoring
rConfig services:
- Nginx/Apache web server status
- PHP-FPM process count and responsiveness
- Horizon queue worker processes
- Supervisor status (managing queue workers)
- Database connectivity and replication lag
External dependencies:
- Database server availability and performance
- Shared storage mount status and accessibility
- DNS resolution for database and storage endpoints
Failover Automation
Automated failover: Implement health checks triggering automatic failover without human intervention. Suitable for environments with RTO < 5 minutes requiring 24/7 availability.
Manual failover: Operations team receives alerts about failures and initiates failover procedures. Suitable for environments accepting longer RTO where verification before failover is preferred.
Semi-automated failover: System detects failures and prepares for failover but waits for human approval before executing. Balances automation benefits with human oversight.
Implementation Planning
Assessment Phase
Before implementing HA, conduct a thorough assessment:
- Define requirements: Document RTO, RPO, availability targets, and business impact of downtime
- Infrastructure audit: Evaluate existing infrastructure capabilities, network topology, and operational team expertise
- Budget determination: Calculate infrastructure costs, licensing, and Professional Services engagement
- Risk analysis: Identify potential failure modes and evaluate HA architecture effectiveness against each
Design Phase
Work with rConfig Professional Services to design the HA architecture:
- Architecture selection: Choose Active/Standby, Active/Active, or horizontal scaling based on requirements
- Component design: Specify database HA approach, storage solution, and failover mechanism
- Network design: Plan IP addressing, VIP allocation, and network connectivity requirements
- Capacity planning: Size servers, storage, and network bandwidth for production load plus headroom
- Documentation: Create architecture diagrams, configuration specifications, and operational runbooks
Implementation Phase
Professional Services guides implementation:
- Infrastructure provisioning: Deploy servers, storage, and network components
- Software installation: Install rConfig, database, and supporting components on all nodes
- HA configuration: Implement replication, failover mechanisms, and monitoring
- Testing: Execute failover testing, performance validation, and disaster recovery drills
- Cutover planning: Develop migration plan from existing deployment to HA architecture
Operational Phase
Maintain the HA deployment through ongoing activities:
- Regular failover testing: Quarterly validation of failover procedures
- Capacity monitoring: Track growth and scale infrastructure proactively
- Performance tuning: Optimize based on actual usage patterns
- Documentation updates: Maintain runbooks reflecting configuration changes
- Team training: Ensure operations staff remain proficient in HA procedures
Related Documentation
- System Requirements - Infrastructure prerequisites for HA deployments
- Backups - Backup strategies for HA environments
- Horizon Queue Manager - Queue worker configuration in HA deployments
- Database Configuration - Database setup and optimization
Summary
High Availability for rConfig V8 requires careful planning aligned with business requirements, infrastructure capabilities, and operational expertise. The Active/Standby configuration provides the optimal balance of reliability and operational simplicity for most environments, while Active/Active architectures serve specialized requirements where complexity and cost are justified.
Key takeaways for HA planning:
- Define clear RTO and RPO objectives before selecting architecture to ensure alignment with business needs
- Active/Standby is recommended for most deployments requiring high availability without excessive complexity
- Database and storage HA are critical as application-level HA is ineffective without backend resilience
- Test failover regularly to validate automation, maintain team proficiency, and identify issues before they impact production
- Engage Professional Services for architecture design, implementation, and ongoing support of HA deployments
High Availability configurations are available exclusively for rConfig Enterprise and rConfig Vector licenses. Contact rConfig Professional Services to discuss your specific requirements and design an HA architecture aligned with your organization’s objectives.