HA Options Overview: High Availability Architecture for rConfig V8

High Availability (HA) for rConfig V8 addresses the challenge of maintaining continuous network configuration management operations in mission-critical environments where downtime impacts business operations, compliance requirements, or security posture. Organizations can leverage various HA architectures to achieve resilience objectives aligned with their specific Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).

There is no universal HA architecture for rConfig deployments. The optimal approach depends on internal SLA requirements, existing infrastructure capabilities, operational team expertise, and budget constraints. This document outlines the most common HA strategies and key considerations for designing resilient rConfig deployments.

Understanding High Availability Requirements

Before implementing HA architecture, establish clear objectives that drive design decisions:

Recovery Time Objective (RTO): Maximum acceptable downtime duration before operations must resume. This determines whether manual failover (minutes to hours) or automated failover (seconds to minutes) is required.

Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. This determines replication frequency, backup strategies, and storage synchronization requirements.

Availability Target: Desired uptime percentage (e.g., 99.9% = 8.76 hours downtime/year). This influences redundancy levels and fault tolerance mechanisms.

Business Impact: Cost of downtime measured in operational disruption, compliance violations, security risks, or revenue impact. This justifies HA investment and complexity.

Organizations implement HA for various reasons:

Mission-critical operations: Network configuration management serves as infrastructure foundation. Downtime prevents device provisioning, configuration changes, and compliance monitoring.

Compliance requirements: Regulatory frameworks may mandate continuous monitoring and audit trail availability. HA ensures compliance systems remain operational.

Geographic distribution: Multi-site operations require resilient configuration management across locations. Regional failures shouldn’t impact other sites.

Service level agreements: Internal or customer-facing SLAs may require specific availability guarantees that standard single-server deployments cannot meet.

Change management velocity: High-frequency configuration changes require reliable systems to prevent change backlogs during outages.

Active/Standby Configuration (Recommended)

The Active/Standby model provides the optimal balance of reliability, operational simplicity, and cost-effectiveness for most rConfig deployments. This architecture maintains two complete rConfig nodes: one actively serving requests while the other remains on standby, ready to assume operations during failures.

Architecture components:

Primary (Active) Node: Serves all web interface traffic, processes device connections, executes scheduled tasks, and handles queue workers.

Secondary (Standby) Node: Maintains identical configuration and application code, monitors primary node health, and activates automatically or manually upon primary failure.

Failover Mechanism: Implements Virtual IP (VIP) using keepalived, corosync/pacemaker, or cloud-native load balancer health checks to redirect traffic upon failure detection.

Shared Storage: Both nodes access common persistent storage containing configuration files, compliance results, backups, and application data.

Database Replication: Database operates in primary/replica mode with automatic or manual promotion of replica to primary during failover.

How Active/Standby failover works:

  1. Normal operations: Primary node holds VIP and serves all requests. Standby node monitors primary health via heartbeat mechanism.
  2. Failure detection: Standby detects primary failure through missed heartbeats (typically 3-5 second intervals).
  3. VIP takeover: Standby node claims VIP, causing all network traffic to route to standby.
  4. Database promotion: Standby promotes its database replica to primary (automatic with Patroni/repmgr, manual otherwise; a manual sketch follows this list).
  5. Service restoration: Standby node fully assumes active role, typically within 30-120 seconds depending on automation level.
  6. Primary recovery: When the original primary recovers, it assumes the standby role (preventing “split-brain” scenarios).
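
Steps 3 and 4 are where the automation level matters most. As a rough illustration, the manual equivalent on the standby node might look like the following, assuming a PostgreSQL replica managed with repmgr; the VIP, interface name, and paths are placeholders to adapt to your environment:

```bash
# Run on the standby node only after confirming the primary is truly down.

# 1. Promote the local PostgreSQL replica to primary (repmgr shown; plain
#    PostgreSQL can use: pg_ctl promote -D /path/to/data)
sudo -u postgres repmgr -f /etc/repmgr.conf standby promote

# 2. Claim the VIP if keepalived is not handling it automatically
#    (192.0.2.10 and eth0 are placeholders)
sudo ip addr add 192.0.2.10/24 dev eth0
sudo arping -c 3 -U -I eth0 192.0.2.10   # gratuitous ARP so peers update their caches
```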

Benefits of Active/Standby:

  • Simpler to deploy and maintain than Active/Active
  • Clear operational state (one node active at any time)
  • No session synchronization complexity
  • Reduced licensing costs (only one active instance)
  • Suitable for manual or automated failover models
  • Compatible with existing infrastructure without complex load balancing

Limitations of Active/Standby:

  • Standby node resources remain unused during normal operations
  • Brief downtime during failover (30-120 seconds typical)
  • Manual intervention may be required for some failure scenarios
  • Does not provide load distribution across nodes

Recommended for: Production environments requiring high availability without the complexity of Active/Active, organizations with RTO objectives of 1-5 minutes, and deployments where occasional brief downtime is acceptable.

Active/Active Configuration

Active/Active deployments distribute traffic across multiple simultaneously operating rConfig nodes using load balancers. While this architecture provides load distribution and zero-downtime failover, it introduces significant complexity and operational overhead.

Architecture components:

Multiple Active Nodes: Two or more rConfig servers simultaneously serving web traffic and processing operations.

Load Balancer: Distributes incoming requests across active nodes using round-robin, least-connections, or other algorithms. Performs health checks to remove failed nodes from rotation.

Session Management: Requires sticky sessions (session affinity) or external session storage (Redis, Memcached) to maintain user session consistency across nodes.

Shared Storage: All nodes access common persistent storage with locking mechanisms to prevent concurrent modification conflicts.

Database Cluster: Multi-primary database configuration or load-balanced read replicas with write to primary.

Queue Coordination: Queue workers must coordinate across nodes to prevent duplicate job execution.

Challenges with Active/Active:

Session synchronization: User sessions must remain consistent when requests route to different nodes. This requires sticky sessions (complicating load balancing) or external session storage (adding infrastructure complexity).
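
A common approach is externalizing sessions to a shared Redis instance. As a minimal sketch, assuming rConfig follows standard Laravel conventions for its .env file (verify the exact keys against your installation), the settings on every node would resemble:

```ini
# Illustrative only -- confirm these keys against your rConfig .env
SESSION_DRIVER=redis
REDIS_HOST=redis.example.internal
REDIS_PORT=6379
REDIS_PASSWORD=change-me
```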

Queue worker coordination: Multiple nodes running queue workers can execute the same job multiple times without sophisticated coordination mechanisms. This leads to duplicate device connections, redundant compliance checks, and wasted resources.

File locking contention: Shared storage requires distributed file locking to prevent concurrent modifications. This adds latency and potential deadlock scenarios.

Configuration consistency: Application configuration must remain synchronized across nodes. Changes to one node must propagate to others, or inconsistent behavior emerges.

Operational complexity: Troubleshooting becomes more difficult when requests may process on any node. Log correlation and error reproduction require additional tooling.

Cost implications: Requires additional infrastructure (load balancers, session storage), higher licensing costs (multiple active instances), and ongoing Professional Services support.

When Active/Active makes sense:

  • Compliance or regulatory requirements mandate zero-downtime availability
  • RTO objectives require sub-second failover
  • Extremely high request volumes require load distribution (thousands of concurrent users)
  • Budget accommodates increased infrastructure and operational costs
  • Technical team has expertise in distributed systems and load-balanced architectures

Implementation requirements:

Active/Active deployments require Professional Services engagement for:

  • Architecture design and capacity planning
  • Load balancer configuration and health check tuning
  • Session management implementation
  • Queue worker coordination strategy
  • Shared storage optimization and locking mechanisms
  • Monitoring and alerting specific to distributed architecture
  • Operational runbooks and troubleshooting procedures

Horizontal Scaling Without Central Manager

Organizations managing thousands of devices across diverse environments can implement horizontal scaling by distributing device inventory across multiple independent rConfig instances. This approach provides scalability without the complexity of centralized multi-tenant management.

How horizontal scaling works:

Instance segmentation: Deploy multiple standalone rConfig servers, each managing a subset of total device inventory.

Inventory distribution strategies:

  • Geographic: Separate instances per region, data center, or site
  • Functional: Different instances for routers, switches, firewalls, etc.
  • Customer-based: MSPs assign separate instances per customer
  • Technology-based: Separate instances for different vendors or device families

Independence: Each instance operates autonomously with its own database, storage, and user management.

Unified visibility: External tools aggregate data across instances for enterprise-wide reporting, compliance dashboards, or configuration searches.

Benefits of horizontal scaling:

  • Avoids single point of failure affecting all managed devices
  • Simplifies capacity planning (scale by adding instances)
  • Isolates customer or regional data for security or compliance
  • Reduces blast radius of configuration errors or security incidents
  • Enables gradual migration or technology refresh per instance

Integration approaches:

API aggregation: Custom scripts or applications query multiple rConfig instances via API, aggregating results for unified dashboards or reports.
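
As a rough sketch, a small script can pull inventory from each instance and merge the results; the /api/devices path, Bearer-token authentication, and response shape below are assumptions to verify against the rConfig API documentation for your version.

```python
import os
import requests

# Hypothetical aggregation sketch: endpoint path and auth scheme are assumptions.
INSTANCES = {
    "emea": "https://rconfig-emea.example.com",
    "apac": "https://rconfig-apac.example.com",
}

def fetch_devices(name: str, base_url: str) -> list[dict]:
    """Return the device inventory from one rConfig instance."""
    token = os.environ[f"RCONFIG_TOKEN_{name.upper()}"]   # per-instance API token
    resp = requests.get(
        f"{base_url}/api/devices",                        # endpoint path is an assumption
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])                    # response shape is an assumption

# Aggregate per-instance inventories into a single enterprise-wide view.
inventory = {name: fetch_devices(name, url) for name, url in INSTANCES.items()}
for name, devices in inventory.items():
    print(f"{name}: {len(devices)} devices")
```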

Git repository synchronization: Configuration files from multiple instances sync to central Git repositories for version control and change tracking.

External monitoring integration: Tools like Zabbix, Prometheus, or Splunk collect metrics from all instances for consolidated monitoring.

Compliance reporting consolidation: Export compliance results from each instance and aggregate them in an external data warehouse or business intelligence platform.

Recommended for: MSP environments with isolated customer segments, large enterprises with regional operational autonomy, organizations preferring decentralized management, and deployments exceeding capacity of single-server architecture.

Database resilience is critical for rConfig HA as all operational state, device inventory, compliance history, and audit trails reside in the database. Both PostgreSQL and MariaDB support various HA configurations.

PostgreSQL offers mature replication and failover capabilities suitable for rConfig deployments:

Streaming replication with automatic failover:

  • Primary database accepts writes, replica(s) receive continuous replication stream
  • Tools like Patroni or repmgr monitor primary health and orchestrate automatic failover
  • Replica promotion to primary typically completes within 30-60 seconds
  • Supports multiple replicas for read scaling or additional redundancy

Configuration requirements:

  • Enable WAL archiving and streaming replication in postgresql.conf (see the excerpt after this list)
  • Configure replication slots to prevent WAL segment deletion before replica consumption
  • Set synchronous_commit to on or remote_apply for zero data loss (impacts performance)
  • Implement connection pooling (PgBouncer) for efficient connection management
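
A minimal postgresql.conf excerpt covering these settings might look like the following; values are illustrative starting points, not tuned recommendations:

```ini
# postgresql.conf on the primary (illustrative values)
wal_level = replica                # required for streaming replication
max_wal_senders = 10               # allow replication connections
max_replication_slots = 10         # slots prevent premature WAL removal
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/wal_archive/%f'   # archive path is a placeholder
synchronous_commit = on            # use remote_apply for strict zero-data-loss
# synchronous_standby_names = 'rconfig_replica'           # enable for synchronous replication
```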

Managed PostgreSQL services:

  • AWS RDS for PostgreSQL: Automated failover, point-in-time recovery, read replicas
  • Azure Database for PostgreSQL: Built-in HA with zone redundancy
  • Google Cloud SQL for PostgreSQL: Regional HA with automatic failover

rConfig configuration: Update the .env database connection settings to point to the primary database endpoint. During failover, Patroni/repmgr updates the DNS record or VIP to redirect connections to the new primary; no rConfig application changes are required.
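
Assuming rConfig follows standard Laravel conventions for its .env file (check your installation), the connection points at the stable endpoint rather than an individual server, something like:

```ini
# .env on every rConfig node -- host is the DNS name or VIP managed by Patroni/repmgr
DB_CONNECTION=pgsql
DB_HOST=db-primary.example.internal
DB_PORT=5432
DB_DATABASE=rconfig
DB_USERNAME=rconfig
DB_PASSWORD=change-me
```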

MariaDB provides multiple HA topologies suited to different requirements:

Primary/Replica replication:

  • Asynchronous replication from primary to one or more replicas
  • Manual or automated failover promotes replica to primary
  • Simple to configure but brief data loss possible during failover
  • Suitable for environments accepting eventual consistency

Galera Cluster:

  • Multi-primary synchronous replication across cluster nodes
  • Writes can occur on any node with automatic conflict resolution
  • No data loss during node failures
  • Requires odd number of nodes (3 minimum) for quorum
  • More complex to operate but provides active/active database capability

ProxySQL or HAProxy integration:

  • Database proxy layer routes connections to healthy primary
  • Monitors backend database health via health checks
  • Automatically redirects connections during failover
  • Provides connection pooling and query routing

rConfig configuration: Point the .env database settings to the HAProxy/ProxySQL VIP or load balancer; the proxy handles routing to a healthy database backend. With Galera, any node can accept writes, but routing connections through the proxy is recommended to prevent connection storms.
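
A minimal HAProxy stanza for this pattern might look like the following; addresses, server names, and the health-check user are placeholders:

```
# haproxy.cfg excerpt -- TCP-mode proxy in front of MariaDB (illustrative)
listen mariadb_write
    bind *:3306
    mode tcp
    option tcpka
    option mysql-check user haproxy_check   # requires this user to exist on the backends
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup  # only used if db1 fails
```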

Monitor replication lag: Streaming replication can fall behind during high write volume. Alert when lag exceeds acceptable thresholds (typically 5-10 seconds) to prevent data loss during failover.
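
On a PostgreSQL replica, for example, lag can be approximated by comparing the current time with the last replayed transaction; MariaDB replicas expose the equivalent as Seconds_Behind_Master in SHOW REPLICA STATUS.

```sql
-- Run on the PostgreSQL replica; returns NULL if no WAL has been replayed yet
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```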

Test failover procedures regularly: Quarterly failover testing validates automation, identifies issues, and maintains team familiarity with procedures. Document actual failover times and data loss (if any) for RTO/RPO verification.

Implement connection pooling: PgBouncer (PostgreSQL) or ProxySQL (MariaDB) reduce connection overhead, enable seamless database backend changes, and prevent connection exhaustion during high load.

Configure appropriate backup retention: Even with HA replication, maintain independent backups. Replication propagates corruption or accidental deletion to replicas. Backups enable recovery from logical errors.

Size database servers appropriately: Replica servers must handle full production load to serve as primary during failover. Under-provisioned replicas cause performance degradation post-failover.

rConfig requires persistent storage for configuration files, compliance results, backups, and system logs. This storage must remain accessible during node failures to maintain operations.

The /var/www/html/rconfig8/current/storage directory (and particularly the storage/app subdirectory) contains critical persistent data:

  • Configuration repository: Downloaded device configurations organized by device and date
  • Compliance results: Policy evaluation outcomes and historical trends
  • Backup archives: System backup ZIP files
  • Application logs: Laravel framework and application logs
  • Uploaded files: User-uploaded files like device import CSVs

In HA deployments, this storage must be accessible to all nodes with consistent view of data.

Network File System (NFS):

  • Mount NFS share to identical path on all rConfig nodes (example fstab entry below)
  • NFS server provides centralized storage with file-level access
  • Simple to configure and widely supported
  • Performance adequate for most deployments
  • Single point of failure unless NFS server is also HA
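
An illustrative /etc/fstab entry, with the server name, export path, and mount options as placeholders to adapt:

```
# Identical on every rConfig node; hard mounts avoid silent data loss on NFS hiccups
nfs.example.internal:/exports/rconfig_storage  /var/www/html/rconfig8/current/storage/app  nfs  defaults,hard,noatime,_netdev  0 0
```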

GlusterFS:

  • Distributed filesystem replicating data across multiple storage nodes
  • No single point of failure with replica count ≥ 2
  • Scales horizontally by adding storage nodes
  • More complex to deploy and maintain than NFS
  • Excellent performance with proper tuning

Cloud-native storage (EFS, Azure Files):

  • Managed storage services with built-in redundancy
  • Simple to configure with no storage server management
  • Scales automatically based on usage
  • Higher cost than self-managed options
  • Performance varies based on service tier

Considerations for storage selection:

Latency: rConfig frequently reads/writes configuration files. High-latency storage impacts download performance and web interface responsiveness. Target < 10ms latency for file operations.

Throughput: Large deployments downloading thousands of configurations simultaneously require substantial throughput. Ensure storage can handle peak load without bottlenecking.

IOPS: Database operations, compliance checks, and queue processing generate significant IOPS. Under-provisioned storage causes performance degradation.

Locking: Distributed filesystems require proper locking mechanisms to prevent concurrent modification corruption. Verify NFS is configured with appropriate locking (lockd, statd).

Backup integration: Ensure shared storage is included in backup strategies. Storage failure without backups results in complete configuration history loss.

Keepalived implements VRRP (Virtual Router Redundancy Protocol) to provide VIP failover between nodes:

How it works:

  1. Active node runs keepalived process maintaining VIP on network interface
  2. Standby node monitors active via VRRP heartbeat (multicast or unicast)
  3. Upon active failure, standby detects missed heartbeats and claims VIP
  4. Network traffic immediately routes to standby node

Configuration considerations:

  • Both nodes must be on same network segment (Layer 2 adjacency)
  • Firewall must permit VRRP protocol (IP protocol 112)
  • Configure appropriate heartbeat intervals (default 1 second)
  • Implement health check scripts to detect application failures, not just node failures (see the example configuration below)
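
A minimal keepalived.conf for the active node might look like the following; the interface, VIP, password, and health-check URL are illustrative placeholders, and the standby uses state BACKUP with a lower priority:

```
# /etc/keepalived/keepalived.conf (active node, illustrative values)
vrrp_script chk_rconfig {
    script "/usr/bin/curl -sf http://127.0.0.1/ -o /dev/null"   # check the app, not just the host
    interval 2
    fall 3
    rise 2
}

vrrp_instance VI_RCONFIG {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass change-me
    }
    virtual_ipaddress {
        192.0.2.10/24
    }
    track_script {
        chk_rconfig
    }
}
```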

Corosync/Pacemaker provides comprehensive cluster resource management:

Capabilities:

  • VIP management with automatic failover
  • Service start/stop orchestration
  • Fencing (STONITH) to prevent split-brain scenarios
  • Resource dependencies and ordering
  • Complex failure detection and recovery policies

Complexity trade-off: Significantly more complex to configure than keepalived but provides enterprise-grade cluster management. Recommended for environments with existing Pacemaker expertise or requirements for sophisticated resource management.
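
As a point of comparison, a Pacemaker-managed VIP is a single resource definition via the pcs CLI; the address and netmask below are placeholders:

```bash
# Create a cluster-managed VIP resource (illustrative)
pcs resource create rconfig_vip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s
```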

AWS, Azure, and GCP provide native load balancing with health checks:

AWS Application Load Balancer (ALB):

  • HTTP/HTTPS health checks to rConfig web interface
  • Automatically removes failed instances from target group
  • Integrates with Auto Scaling for automatic node replacement

Azure Load Balancer:

  • Layer 4 load balancing with TCP health probes
  • Supports HA ports for all traffic types
  • Zone-redundant for cross-availability-zone resilience

Configuration approach: Configure a health check endpoint (e.g., /api/health) that validates:

  • Web server responding
  • Database connectivity
  • Queue worker operational
  • Critical services running

Failed health checks trigger automatic traffic redirection to healthy nodes.
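
As one example, AWS target group health-check thresholds can be tuned from the CLI; the ARN, path, and values below are illustrative:

```bash
# Tune ALB health checks for faster failure detection (example values)
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/rconfig/abc123 \
  --health-check-path /api/health \
  --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```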

HA deployments require comprehensive monitoring to detect failures before they impact operations:

System metrics:

  • CPU utilization: Alert when sustained > 80% (example alert rule after this list)
  • Memory usage: Alert when available < 20%
  • Disk space: Alert when free space < 20%
  • Disk I/O wait: Alert when iowait > 25%
  • Network connectivity: Monitor interface status and throughput
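
As an illustration, the CPU threshold above could be expressed as a Prometheus alerting rule (assuming node_exporter metrics are already being scraped):

```yaml
# Illustrative alerting rule; adjust thresholds and durations to your SLAs
groups:
  - name: rconfig-ha-system
    rules:
      - alert: RconfigHighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained CPU above 80% on {{ $labels.instance }}"
```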

Application metrics:

  • Web server response time: Alert when > 3 seconds
  • Queue depth: Alert when pending jobs > threshold
  • Database connection pool: Alert when available connections < 10%
  • PHP-FPM worker saturation: Alert when all workers busy

rConfig services:

  • Nginx/Apache web server status
  • PHP-FPM process count and responsiveness
  • Horizon queue worker processes
  • Supervisor status (managing queue workers)
  • Database connectivity and replication lag

External dependencies:

  • Database server availability and performance
  • Shared storage mount status and accessibility
  • DNS resolution for database and storage endpoints

Automated failover: Implement health checks triggering automatic failover without human intervention. Suitable for environments with RTO < 5 minutes requiring 24/7 availability.

Manual failover: Operations team receives alerts about failures and initiates failover procedures. Suitable for environments accepting longer RTO where verification before failover is preferred.

Semi-automated failover: System detects failures and prepares for failover but waits for human approval before executing. Balances automation benefits with human oversight.

Before implementing HA, conduct thorough assessment:

  1. Define requirements: Document RTO, RPO, availability targets, and business impact of downtime
  2. Infrastructure audit: Evaluate existing infrastructure capabilities, network topology, and operational team expertise
  3. Budget determination: Calculate infrastructure costs, licensing, and Professional Services engagement
  4. Risk analysis: Identify potential failure modes and evaluate HA architecture effectiveness against each

Work with rConfig Professional Services to design HA architecture:

  1. Architecture selection: Choose Active/Standby, Active/Active, or horizontal scaling based on requirements
  2. Component design: Specify database HA approach, storage solution, and failover mechanism
  3. Network design: Plan IP addressing, VIP allocation, and network connectivity requirements
  4. Capacity planning: Size servers, storage, and network bandwidth for production load plus headroom
  5. Documentation: Create architecture diagrams, configuration specifications, and operational runbooks

Professional Services guides implementation:

  1. Infrastructure provisioning: Deploy servers, storage, and network components
  2. Software installation: Install rConfig, database, and supporting components on all nodes
  3. HA configuration: Implement replication, failover mechanisms, and monitoring
  4. Testing: Execute failover testing, performance validation, and disaster recovery drills
  5. Cutover planning: Develop migration plan from existing deployment to HA architecture

Maintain HA deployment through ongoing activities:

  1. Regular failover testing: Quarterly validation of failover procedures
  2. Capacity monitoring: Track growth and scale infrastructure proactively
  3. Performance tuning: Optimize based on actual usage patterns
  4. Documentation updates: Maintain runbooks reflecting configuration changes
  5. Team training: Ensure operations staff remain proficient in HA procedures

High Availability for rConfig V8 requires careful planning aligned with business requirements, infrastructure capabilities, and operational expertise. The Active/Standby configuration provides the optimal balance of reliability and operational simplicity for most environments, while Active/Active architectures serve specialized requirements where complexity and cost are justified.

Key takeaways for HA planning:

  • Define clear RTO and RPO objectives before selecting architecture to ensure alignment with business needs
  • Active/Standby is recommended for most deployments requiring high availability without excessive complexity
  • Database and storage HA are critical as application-level HA is ineffective without backend resilience
  • Test failover regularly to validate automation, maintain team proficiency, and identify issues before they impact production
  • Engage Professional Services for architecture design, implementation, and ongoing support of HA deployments

High Availability configurations are available exclusively for rConfig Enterprise and rConfig Vector licenses. Contact rConfig Professional Services to discuss your specific requirements and design an HA architecture aligned with your organization’s objectives.