Operations | 2026-04-08 | 7 min read

APIC clustering: failover without flapping

Beyrak A.

Founder & Security/System Engineer

The problem with single-controller monitoring

Cisco ACI fabrics are managed by APIC controllers, typically three per site, forming a cluster. Most monitoring tools connect to one controller and assume it will always be available. When that controller goes down for maintenance, a firmware upgrade, or a failure, the monitoring tool goes blind. Worse, it may start throwing errors that trigger false alerts, or it may silently stop collecting data without anyone noticing for hours.

SAMURAI treats APIC controllers as clusters from the start. You register the controllers, and the system automatically groups them by site, elects a primary, and handles failover without operator intervention.

Cluster structure

Clusters are stored in the apic_clusters collection, keyed by _id = ndoID|siteName. This composite key ensures that multi-site deployments managed by NDO (Nexus Dashboard Orchestrator) group correctly — controllers belonging to the same site under the same NDO instance form one cluster, even if they were registered at different times.

Each cluster tracks its members, the current primary, and a failure counter per member. The primary is the controller that SAMURAI actively syncs from. All other members are standby — they exist in the system, but sync operations skip them via ClusterResolveDevice(), which redirects any request for a standby member to the current primary.
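The cluster record and the standby redirect described above can be sketched roughly as follows. This is a minimal illustration, not SAMURAI's implementation: the field names (`members`, `primary`, `failures`) and the `resolve_device` helper are assumptions standing in for the real schema and for ClusterResolveDevice(), whose actual signature is not shown in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory view of one apic_clusters document."""
    id: str                  # composite key in the form ndoID|siteName
    members: list[str]       # registered controllers in this cluster
    primary: str             # the member that is actively synced from
    failures: dict[str, int] = field(default_factory=dict)  # per-member counter


def resolve_device(cluster: Cluster, requested: str) -> str:
    """Redirect a request for any cluster member to the current primary."""
    if requested not in cluster.members:
        raise ValueError(f"{requested} is not a member of {cluster.id}")
    return cluster.primary


cluster = Cluster(
    id="ndo-1|site-a",
    members=["apic1", "apic2", "apic3"],
    primary="apic1",
    failures={"apic1": 0, "apic2": 0, "apic3": 0},
)
target = resolve_device(cluster, "apic3")  # a standby request lands on the primary
```

With this shape, callers never need to know which member is primary; they ask for any registered controller and get back the one that should actually be contacted.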

Failover logic

When a sync to the primary controller fails, the failure counter increments. A single failure does not trigger failover — network blips, brief maintenance windows, and transient API errors are too common to react to immediately. Failover only triggers after failover_threshold consecutive failures (default: 3).

When the threshold is reached, SAMURAI promotes the next healthy member to primary and resets the failure counter. It also starts a failover_cooldown_seconds timer (default: 600 seconds), and no further failover can occur until that timer expires. This prevents flapping: the scenario where two controllers alternate between primary and standby because both are only intermittently reachable.
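The threshold and cooldown rules can be expressed as a small state update. The sketch below is one plausible reading of the behaviour described above, under stated assumptions: `ClusterState`, `record_sync_result`, and `last_failover_at` are hypothetical names, "next healthy member" is simplified to "next member in order", and all counters are reset on failover.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

FAILOVER_THRESHOLD = 3            # default failover_threshold from the text
FAILOVER_COOLDOWN_SECONDS = 600   # default failover_cooldown_seconds from the text


@dataclass
class ClusterState:
    members: list[str]
    primary: str
    failures: dict[str, int] = field(default_factory=dict)
    last_failover_at: Optional[float] = None  # hypothetical bookkeeping field


def record_sync_result(state: ClusterState, ok: bool,
                       now: Optional[float] = None) -> bool:
    """Update counters after a sync to the primary; return True on failover."""
    now = time.monotonic() if now is None else now
    if ok:
        state.failures[state.primary] = 0  # a success breaks the consecutive run
        return False

    state.failures[state.primary] = state.failures.get(state.primary, 0) + 1
    if state.failures[state.primary] < FAILOVER_THRESHOLD:
        return False  # tolerate blips below the threshold

    in_cooldown = (
        state.last_failover_at is not None
        and now - state.last_failover_at < FAILOVER_COOLDOWN_SECONDS
    )
    if in_cooldown:
        return False  # no further failover while the timer is running

    # Assumption: promote the next member in order; a real system would
    # first confirm that the candidate is healthy.
    idx = state.members.index(state.primary)
    state.primary = state.members[(idx + 1) % len(state.members)]
    state.failures = {m: 0 for m in state.members}
    state.last_failover_at = now
    return True


state = ClusterState(members=["apic1", "apic2", "apic3"], primary="apic1")
results = [record_sync_result(state, ok=False, now=t) for t in (0.0, 60.0, 120.0)]
```

Passing `now` explicitly keeps the function easy to test; in normal use it would fall back to the monotonic clock. In the usage lines, only the third consecutive failure causes a promotion, and any failures in the following 600 seconds leave the new primary in place.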

Why cooldown matters

Without cooldown, a degraded controller that responds to every other health check would cause rapid primary oscillation. Each failover resets counters, re-authenticates API sessions, and potentially produces duplicate or incomplete sync data. The 600-second cooldown gives the original primary time to either fully recover or fully fail, producing a clean signal for the next failover decision.

NDO-aware multi-site grouping

In multi-site ACI deployments, Nexus Dashboard Orchestrator manages policies across sites. SAMURAI’s cluster grouping is NDO-aware: controllers are grouped by their NDO association and site name, not just by IP range or manual assignment. This means a three-site, nine-controller deployment automatically produces three clusters, each with independent failover — no manual configuration required.
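To make the grouping concrete, here is a small sketch of how nine registered controllers could collapse into three clusters keyed by ndoID|siteName. The record fields (`host`, `ndo_id`, `site_name`) and the hostnames are invented for illustration; only the key format comes from the description above.

```python
from collections import defaultdict

# Hypothetical registration records: three sites under one NDO instance,
# three controllers per site.
controllers = [
    {"host": f"apic{n}-{site}", "ndo_id": "ndo-1", "site_name": site}
    for site in ("site-a", "site-b", "site-c")
    for n in (1, 2, 3)
]


def group_into_clusters(controllers):
    """Group controllers by the composite key ndoID|siteName."""
    clusters = defaultdict(list)
    for c in controllers:
        key = f"{c['ndo_id']}|{c['site_name']}"
        clusters[key].append(c["host"])
    return dict(clusters)


clusters = group_into_clusters(controllers)
for key, hosts in clusters.items():
    print(key, hosts)
```

Because the key is derived from each controller's own attributes, registration order does not matter: a controller added later lands in the same cluster as its site peers.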

The sync engine skips non-primary members entirely. This is not just an optimization — it prevents duplicate data. All three controllers in a cluster serve the same fabric state. Syncing all three would produce three identical snapshots, waste API capacity, and complicate change detection. One primary, authoritative sync per cluster is the correct model.
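The "one sync per cluster" rule might look like the loop below. It is a sketch under assumptions: the cluster dictionaries and the `sync_targets` helper are hypothetical, and the real sync engine may select targets differently.

```python
# Hypothetical cluster records; only the current primary should be synced.
clusters = {
    "ndo-1|site-a": {"members": ["a1", "a2", "a3"], "primary": "a1"},
    "ndo-1|site-b": {"members": ["b1", "b2", "b3"], "primary": "b2"},
}


def sync_targets(clusters):
    """Yield (cluster_id, member) for primaries only; standbys are skipped."""
    for cluster_id, cluster in clusters.items():
        for member in cluster["members"]:
            if member != cluster["primary"]:
                continue  # standby serves the same fabric state, so skip it
            yield cluster_id, member


targets = list(sync_targets(clusters))
```

Six registered controllers produce two sync targets here, one per cluster, which keeps API load and snapshot volume proportional to the number of fabrics rather than the number of controllers.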
