
eBPF and Azure CNI Powered by Cilium: A Deep Dive into AKS Network Performance


Your node is under load. CPU is climbing. You check your workloads; nothing unusual there. Then you look at kube-proxy and conntrack, and there it is: the kernel is busy processing network rules, not your application. At around 5,000 pods this becomes noticeable. At 20,000, it becomes a serious problem. This post is about why that happens and what Cilium actually does to fix it.


The iptables Problem Is Not a “Future Concern”

kube-proxy has been the default Kubernetes network component for years. It works by translating Service abstractions into iptables rules, long chains of them. Every packet traversing the node goes through those chains sequentially. The time complexity is O(n), where n is the number of rules.

The math catches up fast. A cluster with 5,000 services and a reasonable number of endpoints can accumulate tens of thousands of iptables rules. At that scale, the per-packet cost stops being a rounding error. Measured on a Standard D4v3 instance, P99 latency for pod-to-service traffic sits around 1.8ms under that load. That number alone may not alarm anyone, but the behavior is what matters: the latency grows non-linearly as the cluster scales. It does not plateau.
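The difference in asymptotics is easy to demonstrate with a toy model (plain Python, not real iptables or eBPF code): an iptables chain is a sequential scan over rules, while an eBPF service map behaves like a hash table.

```python
# Toy model of the two lookup strategies (illustrative, not kernel code):
# an iptables chain is walked rule by rule; an eBPF service map is a dict.

def iptables_lookup(rules, packet_dst):
    """Walk the chain until a rule matches: O(n) in the rule count."""
    comparisons = 0
    for rule_dst, action in rules:
        comparisons += 1
        if rule_dst == packet_dst:
            return action, comparisons
    return "DROP", comparisons

def ebpf_lookup(service_map, packet_dst):
    """Single hash-map lookup, independent of rule count: O(1)."""
    return service_map.get(packet_dst, "DROP")

# 50,000 rules, worst-case match at the end of the chain.
rules = [(f"10.0.{i // 256}.{i % 256}:80", f"DNAT->pod-{i}")
         for i in range(50_000)]
service_map = dict(rules)

target = rules[-1][0]
action, comparisons = iptables_lookup(rules, target)
print(comparisons)                                 # 50000 comparisons
print(ebpf_lookup(service_map, target) == action)  # True, one lookup
```

The toy model is the whole story in miniature: the chain's cost grows with every Service you add, the map's cost does not.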

The second problem is conntrack. Linux connection tracking maintains state for every active TCP connection passing through the host network namespace. It is one of the most CPU-intensive operations in the kernel networking stack, and in a dense Kubernetes node with hundreds of concurrent pods it runs constantly. This is the CPU you are “paying” without getting anything useful in return.

Together, iptables rule evaluation and conntrack account for a significant and measurable share of node CPU in large AKS clusters. CPU that could be running your actual workloads.


Where Cilium Enters the Stack

Cilium replaces the entire kube-proxy model. It does not layer on top of iptables; it bypasses it. The mechanism is eBPF: programs that run inside the Linux kernel in a sandboxed environment, attached to specific hook points in the network stack.

The key hook points in the Azure CNI Powered by Cilium implementation are:

TC Ingress on veth interfaces. Every pod connects to the host via a virtual Ethernet pair (veth). Cilium attaches eBPF programs directly to the TC Ingress hook on those interfaces, which means it intercepts packets immediately after they leave the container’s network namespace. At this point, the packet has not yet touched the host network namespace at all. This is where conntrack elimination happens. Cilium makes forwarding decisions without maintaining per-connection state in the traditional sense, using eBPF hash maps with O(1) lookup instead.

XDP (eXpress Data Path). For nodes exposed to external traffic, XDP hooks fire even earlier, directly in the NIC driver, before the kernel allocates an sk_buff structure for the packet. This is where DDoS mitigation and early ingress filtering run in Azure CNI Powered by Cilium. The cost is near zero because packets that should be dropped never enter the kernel stack.

Socket operations via cgroups. For pod-to-pod traffic on the same node, Cilium hooks into socket-level syscalls (connect, sendmsg). This allows intra-node communication to be short-circuited before it even becomes a network packet. The kernel redirects data directly between sockets. No veth traversal, no routing table lookup, no conntrack entry.

The practical result: service routing uses eBPF hash map lookups (O(1)) rather than sequential iptables chains. On the same D4v3 instance where iptables produced 1.8ms P99 at scale, Cilium with eBPF datapath delivers around 0.9ms. Unlike iptables, that number stays stable as the cluster grows. Pod-to-service throughput increases from roughly 22 Gbps to 28 Gbps. These are not synthetic micro-benchmark differences; they show up in real cluster behavior.
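What O(1) service routing means in practice can be sketched in a few lines (conceptual only: Cilium's real datapath uses per-service eBPF maps and supports Maglev consistent hashing, while the flow-hash scheme below is purely illustrative):

```python
import hashlib

# Illustrative sketch of hash-map service routing: a service VIP maps to
# its backends, and a hash of the flow tuple picks one deterministically,
# so every packet of a connection reaches the same pod without the kernel
# keeping per-connection conntrack state.

SERVICE_MAP = {
    ("10.0.0.10", 80): ["192.168.1.5", "192.168.2.7", "192.168.3.9"],
}

def select_backend(src_ip, src_port, dst_ip, dst_port):
    backends = SERVICE_MAP.get((dst_ip, dst_port))
    if backends is None:
        return None  # not a service VIP; forward unchanged
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    idx = int.from_bytes(hashlib.sha256(flow).digest()[:4], "big") % len(backends)
    return backends[idx]

# The same flow always hashes to the same backend:
a = select_backend("192.168.1.2", 51000, "10.0.0.10", 80)
b = select_backend("192.168.1.2", 51000, "10.0.0.10", 80)
print(a == b)  # True
```

Adding a thousand more services grows the map, not the per-packet work, which is why the latency curve stays flat as the cluster scales.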


eBPF Host Routing: The Next Layer

The default Cilium setup already eliminates kube-proxy. eBPF Host Routing, available through Advanced Container Networking Services (ACNS), goes further.

Without it, packets traveling between pods on different nodes still pass through the host’s IP routing layer, which means kernel routing table lookups and potential iptables traversal for rules that live in the host namespace. eBPF Host Routing moves that decision entirely into the eBPF program. The forwarding path never touches those kernel layers.

On Azure Linux 3.0 (kernel 6.6+) and Ubuntu 24.04, this is paired with BpfVeth mode, which replaces the standard veth driver with an optimized implementation that reduces the CPU context-switch cost between the container namespace and the host interface. For AI/ML workloads generating large volumes of small packets (distributed training, parameter servers, inference serving) this is where 10-15% of node CPU can be recovered.


Choosing Your CNI Model in AKS

There are three meaningful options, and the choice has long-term consequences for IP space management and routing architecture.

Azure CNI (VNet IP) allocates pod IPs directly from the VNet subnet. Every pod is a first-class citizen on the network, directly reachable without any overlay. The cost is IP exhaustion: a node pool of 100 nodes with 30 pods per node consumes 3,000 IPs from your subnet. For enterprise environments with constrained RFC 1918 space, this becomes a planning problem fast.
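The arithmetic is worth writing down, because it surfaces a detail that is easy to forget: Azure reserves 5 addresses in every subnet. A sketch using the example numbers above (the subnet prefix itself is hypothetical):

```python
import ipaddress

# Back-of-envelope IP budget for Azure CNI (VNet IP) mode: every pod
# draws an address from the node subnet, alongside the node NICs.

subnet = ipaddress.ip_network("10.240.0.0/20")  # hypothetical node subnet
usable = subnet.num_addresses - 5               # Azure reserves 5 IPs per subnet

nodes, pods_per_node = 100, 30
needed = nodes + nodes * pods_per_node          # node NICs + pod IPs
print(needed)                                   # 3100
print(needed <= usable)                         # True: a /20 still fits

# Scale the pool to 150 nodes and the same /20 is already exhausted:
print(150 + 150 * 30)                           # 4650 > 4091 usable
```

Overlay mode sidesteps this entirely: the pod CIDR lives outside the VNet, so only the node NICs count against the subnet.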

Azure CNI Overlay separates pod addressing from VNet addressing. Pods get IPs from a private CIDR that does not consume VNet space; Azure’s SDN handles routing between the overlay and VNet. There is no VXLAN encapsulation at the node level (unlike most other cloud overlay implementations), which keeps the performance impact minimal.

Azure CNI Powered by Cilium supports both VNet IP and Overlay addressing, but replaces the entire datapath with eBPF. This is the model that enables eBPF Network Policies, Hubble observability, and L7 filtering.

Our default recommendation for new enterprise clusters: Overlay with Cilium. It solves the IP exhaustion problem without sacrificing performance, supports up to 5,000 nodes and 200,000 pods per cluster, and gives you the full eBPF feature set. The only reason to choose VNet IP with Cilium is when pods need to be directly routable from on-premises networks without additional route configuration, for example when legacy systems connect directly to pod IPs.

On the Node Subnet vs Pod Subnet question: if you are running in an environment where Network Security Groups need to apply differently to pod traffic vs node traffic, or where compliance requires pod-level IP visibility in network logs, use Pod Subnet (dynamic IP allocation). For everything else, Node Subnet is simpler to operate and eliminates one layer of subnet management.


Managed Cilium: What You Get and What You Give Up

The “Powered by Cilium” designation means Microsoft manages the Cilium agents, versioning, and compatibility with the AKS control plane. This is not a cosmetic distinction.

In a self-managed (BYOCNI) setup, you own the Cilium upgrade path, kernel compatibility testing, and, critically, BPF map sizing. BPF maps are kernel data structures with fixed capacities that must be set at agent startup. On large nodes running many pods, under-sized maps produce errors (map pressure, map insertion failure) that are silent until they cause dropped packets or broken service discovery. Correctly sizing these maps requires knowing your node’s CPU count, memory, maximum pod density, and expected connection concurrency. Get it wrong on a 1,000-node cluster and you are debugging kernel-level data structure exhaustion under production load.
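To make the sizing problem concrete, here is a toy heuristic (illustrative only: the formula and constants are not Cilium's actual sizing logic, which is driven by options such as bpf-map-dynamic-size-ratio and per-map flags, and in managed AKS is derived from the VM SKU):

```python
# Toy heuristic for conntrack-style BPF map sizing (illustrative only).
# The point is the failure mode: map capacity is fixed at agent startup,
# so demand exceeding it means failed insertions, not graceful growth.

def ct_map_entries(max_pods, conns_per_pod, headroom=2.0):
    """Entries needed if every pod holds conns_per_pod concurrent flows."""
    return int(max_pods * conns_per_pod * headroom)

# A dense node: 250 pods, each averaging 200 concurrent connections.
needed = ct_map_entries(max_pods=250, conns_per_pod=200)
print(needed)                              # 100000 entries

# An under-sized map (e.g. a 64k default) sits at ~153% of capacity:
default_size = 65_536
print(round(100 * needed / default_size))  # 153 -> insertions start failing
```

The failure is silent right up to the moment it is not: nothing complains at 95% pressure, and at 100% new connections simply break.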

The managed version handles this automatically, sizing BPF maps based on the VM SKU. That alone justifies the managed approach for most enterprise deployments.

What you cannot do with managed Cilium: configure BGP (Azure SDN handles routing), enable IPsec tunnel encryption (use Azure-native VNet encryption or WireGuard if available), or modify cilium-config parameters directly. CiliumClusterwideNetworkPolicy CRDs can be applied but are not officially supported. If they cause issues, Microsoft support will not cover them. Use CiliumNetworkPolicy scoped to namespaces instead.

Windows node pools are not supported. Cilium in AKS runs exclusively on Linux node pools; Windows pools continue to use Azure NPM.


Security Model: Identity Over IP

The fundamental shift in Cilium’s security model is worth understanding clearly, because it changes how you write and reason about policies.

Traditional network policies (Azure NPM, Calico in iptables mode, native Kubernetes NetworkPolicy) operate on IP addresses. In Kubernetes, pod IPs are ephemeral. Every restart potentially changes the IP, which means your policy enforcement mechanism is chasing a moving target. The workaround is CIDR-based rules or keeping policies broad enough to survive pod restarts, both of which erode security precision.

Cilium assigns a numeric Security Identity to each pod based on its labels. All pods sharing the same label set share the same identity. When a packet arrives at a destination pod, the eBPF program on the destination node evaluates the packet’s identity against the policy table in a single hash map lookup. The result: policy enforcement does not break on pod restart as long as labels stay the same, there are no race conditions when IP assignments change, and adding new pod replicas never requires a policy update.
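The model can be sketched in a few lines (conceptual, not Cilium's implementation): identity is a function of the label set, and the policy table is keyed by identities rather than IPs.

```python
# Sketch of identity-based policy (conceptual, not Cilium's wire format):
# pods sharing a label set share one numeric identity, and a policy check
# is a single (src_identity, dst_identity) lookup.

_identities: dict = {}

def identity_for(labels: dict) -> int:
    """Allocate (or reuse) a numeric identity for a label set."""
    key = frozenset(labels.items())
    return _identities.setdefault(key, len(_identities) + 1)

# Two replicas of the same deployment -> same identity, any pod IP:
web1 = identity_for({"app": "web", "env": "prod"})
web2 = identity_for({"app": "web", "env": "prod"})
db   = identity_for({"app": "db",  "env": "prod"})
print(web1 == web2)  # True: restarts and rescheduling change nothing
print(web1 == db)    # False

# The policy table is keyed by identities, not IPs:
allowed = {(web1, db)}  # web may talk to db
print((identity_for({"app": "web", "env": "prod"}), db) in allowed)  # True
```

Notice what is absent: no IP address appears anywhere in the policy path, which is exactly why pod churn cannot invalidate enforcement.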

For L3/L4 policies, this runs entirely in eBPF with negligible overhead. For L7 policies (HTTP path filtering, gRPC method filtering, Kafka topic filtering) Cilium redirects matching traffic to a local Envoy instance running as a DaemonSet. This is the decoupled Envoy model: no sidecar injection, no changes to application manifests, no per-pod memory overhead from a proxy process. The cost is measurable latency for L7-inspected traffic (typically 0.2-0.5ms depending on rule complexity), but it is contained to traffic that requires inspection, not all traffic.


Observability with Hubble: What You Can Actually See

Hubble is Cilium’s observability layer, and the key thing about it is that it requires no changes to applications, no sidecar proxies, and no instrumentation. Everything it sees comes from eBPF programs already running in the kernel.

In the ACNS integration, Hubble provides three things that matter operationally:

Flow logs to Azure Log Analytics. Every connection (source pod, destination pod, namespace, policy verdict, drop reason) is logged. The “drop reason” field is what makes this genuinely useful: instead of seeing a connection refused and guessing whether it is a misconfigured NetworkPolicy, the service being down, or a DNS failure, you get a specific reason attached to every dropped packet. Post-incident analysis goes from “let’s reproduce this” to “let’s read the logs from when it happened.”
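To see why the drop reason matters, consider a handful of flow records (the field names below are simplified placeholders for illustration, not the actual Log Analytics schema):

```python
from collections import Counter

# Hypothetical, simplified flow records illustrating why a per-drop
# reason field shortens incident analysis.

flows = [
    {"src": "shop/cart-7d9f", "dst": "shop/payments-5c2a",
     "verdict": "DROPPED", "drop_reason": "POLICY_DENIED"},
    {"src": "shop/cart-7d9f", "dst": "kube-system/coredns-xyz",
     "verdict": "FORWARDED", "drop_reason": None},
    {"src": "shop/web-1a2b", "dst": "shop/payments-5c2a",
     "verdict": "DROPPED", "drop_reason": "POLICY_DENIED"},
]

# Group drops by (destination, reason): "why is payments unreachable?"
# becomes one query instead of a reproduction attempt.
drops = Counter((f["dst"], f["drop_reason"])
                for f in flows if f["verdict"] == "DROPPED")
for (dst, reason), count in drops.items():
    print(f"{dst}: {count} drops ({reason})")
# shop/payments-5c2a: 2 drops (POLICY_DENIED)
```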

Metrics to Azure Managed Prometheus. TCP retransmissions between nodes, DNS query latency and error rates, HTTP response code distributions, all visible without deploying a service mesh. For SRE teams building SLO dashboards, this closes a gap that previously required either Istio or application-level instrumentation.

Service Map. A real-time topology of which services are communicating with which, with traffic volume and error rates. Useful for auditing actual versus intended communication patterns, particularly after migrating to stricter NetworkPolicy configurations.

The “Stored Logs Mode” in ACNS retains flow data after pods are deleted. For regulated environments where security incidents require network-level audit trails, this matters. A pod that was compromised and deleted is gone, but its network activity is preserved.


The Problems We Have Actually Run Into

Identity exhaustion with Spark. Cilium supports a maximum of 65,535 unique Security Identities per cluster. Apache Spark jobs running on Kubernetes often use pod labels that include unique job or executor IDs. Each unique label combination is a distinct identity. A busy Spark cluster running many short-lived jobs can exhaust this limit, at which point the Cilium agent cannot assign identities to new pods and network connectivity breaks. The fix is to configure Cilium’s identity exclusion rules to ignore the labels that carry unique IDs, but you need to catch this before hitting the limit in production. If your workloads use Spark, Flink, or anything generating pods with unique label values, audit your label schema before enabling Cilium.
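The failure mode is easy to reproduce on paper (toy allocator, but the 65,535 ceiling matches Cilium's 16-bit identity space):

```python
# Why unique-per-pod labels exhaust the identity space: every distinct
# label set is a new identity. The exclusion parameter mimics the fix
# of ignoring labels that carry unique job/executor IDs.

IDENTITY_LIMIT = 65_535

def count_identities(pods, ignore_labels=frozenset()):
    """Count distinct identities, optionally excluding noisy labels."""
    seen = set()
    for labels in pods:
        seen.add(frozenset((k, v) for k, v in labels.items()
                           if k not in ignore_labels))
    return len(seen)

# 100k short-lived Spark executors, each with a unique executor-id label:
pods = [{"app": "spark", "role": "executor", "spark-exec-id": str(i)}
        for i in range(100_000)]

print(count_identities(pods))                                   # 100000
print(count_identities(pods) > IDENTITY_LIMIT)                  # True
print(count_identities(pods, ignore_labels={"spark-exec-id"}))  # 1
```

One excluded label collapses a hundred thousand identities into one, which is the entire remediation.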

The migration reimage problem. Upgrading an existing AKS cluster to Azure CNI Powered by Cilium triggers a simultaneous reimage of all node pools; there is no rolling migration, so pods are evicted and rescheduled cluster-wide. Plan a maintenance window. If you have PodDisruptionBudgets that prevent eviction, the migration can get stuck. Review your PDBs before starting.

Istio compatibility with eBPF Host Routing. Some versions of Istio use iptables rules in the pod’s network namespace to intercept traffic and redirect it to the Envoy sidecar. eBPF Host Routing can interfere with this interception mechanism. The specific failure mode is traffic bypassing the Istio sidecar entirely, which breaks mTLS and policy enforcement without obvious error messages. Before enabling eBPF Host Routing on clusters running Istio, check the Cilium-Istio compatibility matrix for your specific versions. As of Cilium 1.14/1.15, there are known-good configurations, but they require explicit tuning.


When to Move to Cilium (and When to Wait)

Below 200 nodes and without strict security policy requirements, the performance gains from eBPF are real but not dramatic. kube-proxy works. If your cluster is stable and your team’s bandwidth is limited, this is not an emergency migration.

Above 500 nodes, the CPU overhead from iptables and conntrack becomes measurable in cost terms. At this scale, BPF map auto-sizing alone in the managed version is worth the migration effort, before considering the performance improvements. The identity-based security model also becomes increasingly valuable as the number of services grows and IP-based policies become harder to reason about.

For AI/ML workloads specifically (distributed training, large-scale inference) the combination of eBPF Host Routing and the intra-node socket-level optimization can recover meaningful compute from networking overhead. On GPU-heavy node types where the VM cost is high, that recovery has direct financial impact.

The strongest argument for Cilium in any cluster, regardless of size, is Hubble. The operational visibility it provides (without service mesh complexity, without sidecar injection, without application changes) addresses a real gap in how Kubernetes networking is observable today. That benefit is not tied to cluster scale.


Key Takeaways

eBPF datapath is not a performance tweak; it is a different execution model for packet processing. The performance numbers are real, but the more durable benefit is that they do not degrade as the cluster grows. iptables-based systems are bounded; eBPF-based systems scale to what AKS currently supports (200,000 pods) without architectural changes.

The managed version of Cilium in AKS is the right default for enterprise deployments. The BPF map management alone justifies it. Accept the constraints (no BGP, no custom config) and plan your NetworkPolicy architecture around CiliumNetworkPolicy namespaced resources.

If you are designing a new AKS cluster and not planning for Cilium, you are planning to migrate later under more pressure than you would like.



© 2026 Professnet. All rights reserved.