HA Options Overview: High Availability Architecture for rConfig V8
High Availability (HA) for rConfig V8 addresses the challenge of maintaining continuous network configuration management operations in mission-critical environments where downtime impacts business operations, compliance requirements, or security posture. Organizations can leverage various HA architectures to achieve resilience objectives aligned with their specific Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
There is no universal HA architecture for rConfig deployments. The optimal approach depends on internal SLA requirements, existing infrastructure capabilities, operational team expertise, and budget constraints. This document outlines the most common HA strategies and key considerations for designing resilient rConfig deployments.
Understanding High Availability Requirements
Defining Your HA Objectives
Before implementing HA architecture, establish clear objectives that drive design decisions:
Recovery Time Objective (RTO): Maximum acceptable downtime duration before operations must resume. This determines whether manual failover (minutes to hours) or automated failover (seconds to minutes) is required.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. This determines replication frequency, backup strategies, and storage synchronization requirements.
Availability Target: Desired uptime percentage (e.g., 99.9% = 8.76 hours downtime/year). This influences redundancy levels and fault tolerance mechanisms.
Business Impact: Cost of downtime measured in operational disruption, compliance violations, security risks, or revenue impact. This justifies HA investment and complexity.
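The availability-target arithmetic above is simple to reproduce: annual downtime is the unavailable fraction multiplied by the hours in a year. A minimal sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def annual_downtime_hours(availability_pct: float) -> float:
    """Maximum downtime per year implied by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

# 99.9% allows 8.76 hours/year; each extra "nine" divides the budget by ten
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {annual_downtime_hours(target):.2f} h/year")
```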
Common HA Drivers
Organizations implement HA for various reasons:
Mission-critical operations: Network configuration management serves as infrastructure foundation. Downtime prevents device provisioning, configuration changes, and compliance monitoring.
Compliance requirements: Regulatory frameworks may mandate continuous monitoring and audit trail availability. HA ensures compliance systems remain operational.
Geographic distribution: Multi-site operations require resilient configuration management across locations. Regional failures shouldn’t impact other sites.
Service level agreements: Internal or customer-facing SLAs may require specific availability guarantees that standard single-server deployments cannot meet.
Change management velocity: High-frequency configuration changes require reliable systems to prevent change backlogs during outages.
High Availability Architecture Options
Active/Standby Configuration (Recommended)
The Active/Standby model provides the optimal balance of reliability, operational simplicity, and cost-effectiveness for most rConfig deployments. This architecture maintains two complete rConfig nodes: one actively serving requests while the other remains on standby, ready to assume operations during failures.
Architecture components:
Primary (Active) Node: Serves all web interface traffic, processes device connections, executes scheduled tasks, and handles queue workers.
Secondary (Standby) Node: Maintains identical configuration and application code, monitors primary node health, and activates automatically or manually upon primary failure.
Failover Mechanism: Implements Virtual IP (VIP) using keepalived, corosync/pacemaker, or cloud-native load balancer health checks to redirect traffic upon failure detection.
Shared Storage: Both nodes access common persistent storage containing configuration files, compliance results, backups, and application data.
Database Replication: Database operates in primary/replica mode with automatic or manual promotion of replica to primary during failover.
How Active/Standby failover works:
- Normal operations: Primary node holds VIP and serves all requests. Standby node monitors primary health via heartbeat mechanism.
- Failure detection: Standby detects primary failure through missed heartbeats (typically 3-5 second intervals).
- VIP takeover: Standby node claims VIP, causing all network traffic to route to standby.
- Database promotion: Standby promotes its database replica to primary (automatic with Patroni/repmgr, manual otherwise).
- Service restoration: Standby node fully assumes active role, typically within 30-120 seconds depending on automation level.
- Primary recovery: When original primary recovers, it assumes standby role (preventing “split-brain” scenarios).
Benefits of Active/Standby:
- Simpler to deploy and maintain than Active/Active
- Clear operational state (one node active at any time)
- No session synchronization complexity
- Reduced licensing costs (only one active instance)
- Suitable for manual or automated failover models
- Compatible with existing infrastructure without complex load balancing
Limitations of Active/Standby:
- Standby node resources remain unused during normal operations
- Brief downtime during failover (30-120 seconds typical)
- Manual intervention may be required for some failure scenarios
- Does not provide load distribution across nodes
Recommended for: Production environments requiring high availability without the complexity of Active/Active, organizations with RTO objectives of 1-5 minutes, and deployments where occasional brief downtime is acceptable.
Active/Active Configuration (Advanced)
Active/Active deployments distribute traffic across multiple simultaneously operating rConfig nodes using load balancers. While this architecture provides load distribution and zero-downtime failover, it introduces significant complexity and operational overhead.
Architecture components:
Multiple Active Nodes: Two or more rConfig servers simultaneously serving web traffic and processing operations.
Load Balancer: Distributes incoming requests across active nodes using round-robin, least-connections, or other algorithms. Performs health checks to remove failed nodes from rotation.
Session Management: Requires sticky sessions (session affinity) or external session storage (Redis, Memcached) to maintain user session consistency across nodes.
Shared Storage: All nodes access common persistent storage with locking mechanisms to prevent concurrent modification conflicts.
Database Cluster: Multi-primary database configuration or load-balanced read replicas with write to primary.
Queue Coordination: Queue workers must coordinate across nodes to prevent duplicate job execution.
Challenges with Active/Active:
Session synchronization: User sessions must remain consistent when requests route to different nodes. This requires sticky sessions (complicating load balancing) or external session storage (adding infrastructure complexity).
Queue worker coordination: Multiple nodes running queue workers can execute the same job multiple times without sophisticated coordination mechanisms. This leads to duplicate device connections, redundant compliance checks, and wasted resources.
File locking contention: Shared storage requires distributed file locking to prevent concurrent modifications. This adds latency and potential deadlock scenarios.
Configuration consistency: Application configuration must remain synchronized across nodes. Changes to one node must propagate to others, or inconsistent behavior emerges.
Operational complexity: Troubleshooting becomes more difficult when requests may process on any node. Log correlation and error reproduction require additional tooling.
Cost implications: Requires additional infrastructure (load balancers, session storage), higher licensing costs (multiple active instances), and ongoing Professional Services support.
When Active/Active makes sense:
- Compliance or regulatory requirements mandate zero-downtime availability
- RTO objectives require sub-second failover
- Extremely high request volumes require load distribution (thousands of concurrent users)
- Budget accommodates increased infrastructure and operational costs
- Technical team has expertise in distributed systems and load-balanced architectures
Implementation requirements:
Active/Active deployments require Professional Services engagement for:
- Architecture design and capacity planning
- Load balancer configuration and health check tuning
- Session management implementation
- Queue worker coordination strategy
- Shared storage optimization and locking mechanisms
- Monitoring and alerting specific to distributed architecture
- Operational runbooks and troubleshooting procedures
Horizontal Scaling Without Central Manager
Organizations managing thousands of devices across diverse environments can implement horizontal scaling by distributing device inventory across multiple independent rConfig instances. This approach provides scalability without the complexity of centralized multi-tenant management.
How horizontal scaling works:
Instance segmentation: Deploy multiple standalone rConfig servers, each managing a subset of total device inventory.
Inventory distribution strategies:
- Geographic: Separate instances per region, data center, or site
- Functional: Different instances for routers, switches, firewalls, etc.
- Customer-based: MSPs assign separate instances per customer
- Technology-based: Separate instances for different vendors or device families
Independence: Each instance operates autonomously with its own database, storage, and user management.
Unified visibility: External tools aggregate data across instances for enterprise-wide reporting, compliance dashboards, or configuration searches.
Benefits of horizontal scaling:
- Avoids single point of failure affecting all managed devices
- Simplifies capacity planning (scale by adding instances)
- Isolates customer or regional data for security or compliance
- Reduces blast radius of configuration errors or security incidents
- Enables gradual migration or technology refresh per instance
Integration approaches:
API aggregation: Custom scripts or applications query multiple rConfig instances via API, aggregating results for unified dashboards or reports.
Git repository synchronization: Configuration files from multiple instances sync to central Git repositories for version control and change tracking.
External monitoring integration: Tools like Zabbix, Prometheus, or Splunk collect metrics from all instances for consolidated monitoring.
Compliance reporting consolidation: Export compliance results from each instance, aggregate in external data warehouse or business intelligence platform.
Recommended for: MSP environments with isolated customer segments, large enterprises with regional operational autonomy, organizations preferring decentralized management, and deployments exceeding capacity of single-server architecture.
Database High Availability
Database resilience is critical for rConfig HA because all operational state, device inventory, compliance history, and audit trails reside in the database. Both PostgreSQL and MariaDB support various HA configurations.
PostgreSQL High Availability
PostgreSQL offers mature replication and failover capabilities suitable for rConfig deployments:
Streaming replication with automatic failover:
- Primary database accepts writes, replica(s) receive continuous replication stream
- Tools like Patroni or repmgr monitor primary health and orchestrate automatic failover
- Replica promotion to primary typically completes within 30-60 seconds
- Supports multiple replicas for read scaling or additional redundancy
Configuration requirements:
- Enable WAL archiving and streaming replication in postgresql.conf
- Configure replication slots to prevent WAL segment deletion before replica consumption
- Set synchronous_commit to on or remote_apply for zero data loss (impacts performance)
- Implement connection pooling (PgBouncer) for efficient connection management
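As an illustrative sketch only (values and the archive path are assumptions to be tuned per environment), the primary-side settings might look like:

```ini
# postgresql.conf on the primary -- illustrative values
wal_level = replica            # required for streaming replication
max_wal_senders = 5            # one per replica plus headroom
max_replication_slots = 5      # slots keep WAL until replicas consume it
synchronous_commit = on        # or remote_apply for zero data loss (higher latency)
archive_mode = on
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
```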
Managed PostgreSQL services:
- AWS RDS for PostgreSQL: Automated failover, point-in-time recovery, read replicas
- Azure Database for PostgreSQL: Built-in HA with zone redundancy
- Google Cloud SQL for PostgreSQL: Regional HA with automatic failover
rConfig configuration: Update the .env database connection string to point to the primary database endpoint. During failover, Patroni/repmgr updates DNS or the VIP to redirect connections to the new primary. No rConfig application changes are required.
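For illustration, the relevant .env entries might look like the following (the hostname is a placeholder for whatever DNS name or VIP your failover tooling manages):

```ini
DB_CONNECTION=pgsql
# DNS name or VIP that Patroni/repmgr repoints to the current primary
DB_HOST=db-primary.example.internal
DB_PORT=5432
DB_DATABASE=rconfig
DB_USERNAME=rconfig
```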
MariaDB High Availability
MariaDB provides multiple HA topologies suited to different requirements:
Primary/Replica replication:
- Asynchronous replication from primary to one or more replicas
- Manual or automated failover promotes replica to primary
- Simple to configure but brief data loss possible during failover
- Suitable for environments accepting eventual consistency
Galera Cluster:
- Multi-primary synchronous replication across cluster nodes
- Writes can occur on any node with automatic conflict resolution
- No data loss during node failures
- Requires odd number of nodes (3 minimum) for quorum
- More complex to operate but provides active/active database capability
ProxySQL or HAProxy integration:
- Database proxy layer routes connections to healthy primary
- Monitors backend database health via health checks
- Automatically redirects connections during failover
- Provides connection pooling and query routing
rConfig configuration: Point the .env database settings to the HAProxy/ProxySQL VIP or load balancer; the proxy handles routing to a healthy database backend. With Galera, any node can serve writes, but routing all connections through the proxy to a single writer is recommended to prevent write conflicts and connection storms.
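A minimal HAProxy sketch for this pattern (addresses are placeholders; a production health check should verify Galera sync state, e.g. wsrep_local_state = 4, rather than rely on a bare TCP probe):

```
listen mariadb_galera
    bind *:3306
    mode tcp
    option tcpka
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup
    server db3 10.0.0.13:3306 check backup
```

Marking db2 and db3 as backup keeps all writes on one node at a time, avoiding multi-node write conflicts while preserving automatic failover.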
Database High Availability Best Practices
Monitor replication lag: Streaming replication can fall behind during high write volume. Alert when lag exceeds acceptable thresholds (typically 5-10 seconds) to prevent data loss during failover.
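On a PostgreSQL replica, the lag in seconds can be read with SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()). A sketch of the threshold logic, using the 5/10-second figures above (treating a missing reading as critical is an assumption):

```python
from typing import Optional

def classify_lag(lag_seconds: Optional[float],
                 warn: float = 5.0, crit: float = 10.0) -> str:
    """Map a replication-lag reading to an alert level.

    None means the lag query returned no row (e.g. replication stopped),
    which is treated as critical rather than healthy.
    """
    if lag_seconds is None or lag_seconds >= crit:
        return "critical"
    if lag_seconds >= warn:
        return "warning"
    return "ok"
```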
Test failover procedures regularly: Quarterly failover testing validates automation, identifies issues, and maintains team familiarity with procedures. Document actual failover times and data loss (if any) for RTO/RPO verification.
Implement connection pooling: PgBouncer (PostgreSQL) or ProxySQL (MariaDB) reduce connection overhead, enable seamless database backend changes, and prevent connection exhaustion during high load.
Configure appropriate backup retention: Even with HA replication, maintain independent backups. Replication propagates corruption or accidental deletion to replicas. Backups enable recovery from logical errors.
Size database servers appropriately: Replica servers must handle full production load to serve as primary during failover. Under-provisioned replicas cause performance degradation post-failover.
Persistent Storage High Availability
rConfig requires persistent storage for configuration files, compliance results, backups, and system logs. This storage must remain accessible during node failures to maintain operations.
Shared Storage Requirements
The /var/www/html/rconfig8/current/storage directory (and particularly the storage/app subdirectory) contains critical persistent data:
- Configuration repository: Downloaded device configurations organized by device and date
- Compliance results: Policy evaluation outcomes and historical trends
- Backup archives: System backup ZIP files
- Application logs: Laravel framework and application logs
- Uploaded files: User-uploaded files like device import CSVs
In HA deployments, this storage must be accessible to all nodes with consistent view of data.
Storage Architecture Options
Network File System (NFS):
- Mount NFS share to identical path on all rConfig nodes
- NFS server provides centralized storage with file-level access
- Simple to configure and widely supported
- Performance adequate for most deployments
- Single point of failure unless NFS server is also HA
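An illustrative /etc/fstab entry mounting the shared path over NFSv4 (server name and export path are assumptions):

```
# hard: block rather than silently fail during server outages
# _netdev: wait for networking before mounting at boot
nfs-server.example.internal:/export/rconfig  /var/www/html/rconfig8/current/storage  nfs4  rw,hard,noatime,_netdev  0  0
```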
GlusterFS:
- Distributed filesystem replicating data across multiple storage nodes
- No single point of failure with replica count ≥ 2
- Scales horizontally by adding storage nodes
- More complex to deploy and maintain than NFS
- Excellent performance with proper tuning
Cloud-native storage (EFS, Azure Files):
- Managed storage services with built-in redundancy
- Simple to configure with no storage server management
- Scales automatically based on usage
- Higher cost than self-managed options
- Performance varies based on service tier
Considerations for storage selection:
Latency: rConfig frequently reads/writes configuration files. High-latency storage impacts download performance and web interface responsiveness. Target < 10ms latency for file operations.
Throughput: Large deployments downloading thousands of configurations simultaneously require substantial throughput. Ensure storage can handle peak load without bottlenecking.
IOPS: Database operations, compliance checks, and queue processing generate significant IOPS. Under-provisioned storage causes performance degradation.
Locking: Distributed filesystems require proper locking mechanisms to prevent concurrent modification corruption. Verify NFS is configured with appropriate locking (lockd, statd).
Backup integration: Ensure shared storage is included in backup strategies. Storage failure without backups results in complete configuration history loss.
Failover Mechanisms
Virtual IP (VIP) with Keepalived
Keepalived implements VRRP (Virtual Router Redundancy Protocol) to provide VIP failover between nodes:
How it works:
- Active node runs keepalived process maintaining VIP on network interface
- Standby node monitors active via VRRP heartbeat (multicast or unicast)
- Upon active failure, standby detects missed heartbeats and claims VIP
- Network traffic immediately routes to standby node
Configuration considerations:
- Both nodes must be on same network segment (Layer 2 adjacency)
- Firewall must permit VRRP protocol (IP protocol 112)
- Configure appropriate heartbeat intervals (default 1 second)
- Implement health check scripts to detect application failures, not just node failures
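A minimal keepalived sketch for the active node (interface name, VIP, and health-check script path are placeholders; the standby uses state BACKUP and a lower priority):

```
vrrp_script chk_rconfig {
    script "/usr/local/bin/check_rconfig.sh"   # exit 0 = application healthy
    interval 2
    fall 3     # three consecutive failures trigger failover
    rise 2
}

vrrp_instance RCONFIG_VIP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150              # standby node uses a lower value, e.g. 100
    advert_int 1              # heartbeat every second
    virtual_ipaddress {
        10.0.0.100/24
    }
    track_script {
        chk_rconfig
    }
}
```

The track_script block is what makes this an application-level check rather than a node-level one, per the consideration above.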
Corosync/Pacemaker
Corosync/Pacemaker provides comprehensive cluster resource management:
Capabilities:
- VIP management with automatic failover
- Service start/stop orchestration
- Fencing (STONITH) to prevent split-brain scenarios
- Resource dependencies and ordering
- Complex failure detection and recovery policies
Complexity trade-off: Significantly more complex to configure than keepalived but provides enterprise-grade cluster management. Recommended for environments with existing Pacemaker expertise or requirements for sophisticated resource management.
Cloud-Native Load Balancers
AWS, Azure, and GCP provide native load balancing with health checks:
AWS Application Load Balancer (ALB):
- HTTP/HTTPS health checks to rConfig web interface
- Automatically removes failed instances from target group
- Integrates with Auto Scaling for automatic node replacement
Azure Load Balancer:
- Layer 4 load balancing with TCP health probes
- Supports HA ports for all traffic types
- Zone-redundant for cross-availability-zone resilience
Configuration approach: Configure a health check endpoint (e.g., /api/health) that validates:
- Web server responding
- Database connectivity
- Queue worker operational
- Critical services running
Failed health checks trigger automatic traffic redirection to healthy nodes.
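Whatever the endpoint implementation, the aggregation logic is the same: report healthy only when every sub-check passes, and return a status code the load balancer acts on. A hedged sketch (check names are illustrative, not rConfig's actual API):

```python
from typing import Callable, Dict, Tuple

def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each sub-check; 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing probe counts as a failure
    return (200 if all(results.values()) else 503), results

# Illustrative wiring -- real probes would test the web server, database,
# queue workers, and critical services listed above
status, detail = health_status({
    "web": lambda: True,
    "database": lambda: True,
    "queue_worker": lambda: True,
})
```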
Monitoring and Automation
HA deployments require comprehensive monitoring to detect failures before they impact operations:
Node Health Monitoring
System metrics:
- CPU utilization: Alert when sustained > 80%
- Memory usage: Alert when available < 20%
- Disk space: Alert when free space < 20%
- Disk I/O wait: Alert when iowait > 25%
- Network connectivity: Monitor interface status and throughput
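With node_exporter metrics in Prometheus, the CPU and disk thresholds above might be expressed as alerting rules like these (group and alert names, and the hold durations, are assumptions):

```yaml
groups:
  - name: rconfig-node-health
    rules:
      - alert: NodeHighCPU
        # average non-idle CPU over 5m, sustained for 10m
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 10m
        labels:
          severity: warning
      - alert: NodeLowDiskSpace
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes < 0.20
        for: 15m
        labels:
          severity: warning
```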
Application metrics:
- Web server response time: Alert when > 3 seconds
- Queue depth: Alert when pending jobs > threshold
- Database connection pool: Alert when available connections < 10%
- PHP-FPM worker saturation: Alert when all workers busy
Service Health Monitoring
rConfig services:
- Nginx/Apache web server status
- PHP-FPM process count and responsiveness
- Horizon queue worker processes
- Supervisor status (managing queue workers)
- Database connectivity and replication lag
External dependencies:
- Database server availability and performance
- Shared storage mount status and accessibility
- DNS resolution for database and storage endpoints
Failover Automation
Automated failover: Implement health checks triggering automatic failover without human intervention. Suitable for environments with RTO < 5 minutes requiring 24/7 availability.
Manual failover: Operations team receives alerts about failures and initiates failover procedures. Suitable for environments accepting longer RTO where verification before failover is preferred.
Semi-automated failover: System detects failures and prepares for failover but waits for human approval before executing. Balances automation benefits with human oversight.
Implementation Planning
Assessment Phase
Before implementing HA, conduct a thorough assessment:
- Define requirements: Document RTO, RPO, availability targets, and business impact of downtime
- Infrastructure audit: Evaluate existing infrastructure capabilities, network topology, and operational team expertise
- Budget determination: Calculate infrastructure costs, licensing, and Professional Services engagement
- Risk analysis: Identify potential failure modes and evaluate HA architecture effectiveness against each
Design Phase
Work with rConfig Professional Services to design the HA architecture:
- Architecture selection: Choose Active/Standby, Active/Active, or horizontal scaling based on requirements
- Component design: Specify database HA approach, storage solution, and failover mechanism
- Network design: Plan IP addressing, VIP allocation, and network connectivity requirements
- Capacity planning: Size servers, storage, and network bandwidth for production load plus headroom
- Documentation: Create architecture diagrams, configuration specifications, and operational runbooks
Implementation Phase
Professional Services guides implementation:
- Infrastructure provisioning: Deploy servers, storage, and network components
- Software installation: Install rConfig, database, and supporting components on all nodes
- HA configuration: Implement replication, failover mechanisms, and monitoring
- Testing: Execute failover testing, performance validation, and disaster recovery drills
- Cutover planning: Develop migration plan from existing deployment to HA architecture
Operational Phase
Maintain the HA deployment through ongoing activities:
- Regular failover testing: Quarterly validation of failover procedures
- Capacity monitoring: Track growth and scale infrastructure proactively
- Performance tuning: Optimize based on actual usage patterns
- Documentation updates: Maintain runbooks reflecting configuration changes
- Team training: Ensure operations staff remain proficient in HA procedures
Related Documentation
- System Requirements - Infrastructure prerequisites for HA deployments
- Backups - Backup strategies for HA environments
- Horizon Queue Manager - Queue worker configuration in HA deployments
- Database Configuration - Database setup and optimization
Summary
High Availability for rConfig V8 requires careful planning aligned with business requirements, infrastructure capabilities, and operational expertise. The Active/Standby configuration provides the optimal balance of reliability and operational simplicity for most environments, while Active/Active architectures serve specialized requirements where complexity and cost are justified.
Key takeaways for HA planning:
- Define clear RTO and RPO objectives before selecting architecture to ensure alignment with business needs
- Active/Standby is recommended for most deployments requiring high availability without excessive complexity
- Database and storage HA are critical as application-level HA is ineffective without backend resilience
- Test failover regularly to validate automation, maintain team proficiency, and identify issues before they impact production
- Engage Professional Services for architecture design, implementation, and ongoing support of HA deployments
High Availability configurations are available exclusively for rConfig Enterprise and rConfig Vector licenses. Contact rConfig Professional Services to discuss your specific requirements and design an HA architecture aligned with your organization’s objectives.