June 27, 2024
Prologue: The IP Famine of '94
In the primordial era of the internet, as engineers watched IPv4 addresses deplete like water in desert sands, a solution emerged from the kernel's depths: Network Address Translation (NAT). Picture your private network as a medieval castle. The public internet is the wilderness beyond your walls. NAT became your gatekeeper—translating internal whispers into external proclamations, shielding your servants (private IPs) while presenting a unified banner (public IP) to the world. For backend engineers, this isn't magic—it's a symphony of kernel heaps, conntrack tables, and register dances. Let's lift the portcullis.
On Linux, NAT lives in the kernel's netfilter subsystem—a gatekeeper sitting between eth0 (public) and eth1 (private).

// IPv4 address: 32-bit integer (heap-allocated in kernel structs)
struct in_addr {
    uint32_t s_addr; // Stored in network byte order
};
// Without NAT: 1 IP = 1 machine
// With NAT: 1 IP = 2^16 ports (up to 65,536 concurrent flows per destination)
Analogy: Like apartment mailboxes (ports) sharing a building address (public IP).
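To make the mailbox analogy concrete, here is a toy Go model of the bookkeeping (the names toyNAT and privEndpoint are mine, not kernel structures): one public address leases its ports out to many private endpoints.

package main

import "fmt"

// privEndpoint identifies a host inside the "castle" (private network).
type privEndpoint struct {
	ip   string
	port uint16
}

// toyNAT leases public ports to private endpoints: many machines, one address.
type toyNAT struct {
	publicIP string
	nextPort uint16
	leases   map[privEndpoint]uint16
}

// translate returns the public address the outside world will see for src.
func (n *toyNAT) translate(src privEndpoint) (string, uint16) {
	if p, ok := n.leases[src]; ok {
		return n.publicIP, p // Existing flow: reuse its leased port
	}
	n.nextPort++ // Toy allocator; the kernel picks from an ephemeral range
	n.leases[src] = n.nextPort
	return n.publicIP, n.nextPort
}

func main() {
	nat := &toyNAT{publicIP: "203.0.113.1", nextPort: 60000, leases: map[privEndpoint]uint16{}}
	ip, port := nat.translate(privEndpoint{"192.168.1.5", 54321})
	fmt.Printf("%s:%d\n", ip, port) // 203.0.113.1:60001, one of 65,536 mailboxes
}

The kernel keys the real table on the full five-tuple (protocol, source, destination), so the same public port can be reused toward different destinations—one public IP stretches much further than 65,536 total flows.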
When a packet leaves your private network:
1. Your app calls send() → data copies to the kernel heap via copy_from_user().
2. On the gateway, the packet enters netfilter's PREROUTING chain, where conntrack records the flow (the source rewrite itself is applied later, in POSTROUTING):
struct nf_conn {
    tuplehash[IP_CT_DIR_ORIGINAL].tuple = {src_ip: 192.168.1.5,   src_port: 54321, dst_ip: 93.184.216.34}
    tuplehash[IP_CT_DIR_REPLY].tuple    = {src_ip: 93.184.216.34, dst_ip: 203.0.113.1, dst_port: 60001} // NAT!
};
3. The kernel scrubs the private IP from the packet headers: a few register-level loads and stores (think EAX) plus checksum fix-ups.

Packet Before NAT:
| SRC_IP: 192.168.1.5 | SRC_PORT:54321 | DEST_IP: 93.184.216.34 |
Packet After NAT:
| SRC_IP: 203.0.113.1 | SRC_PORT:60001 | DEST_IP: 93.184.216.34 |
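That header surgery is easy to sketch in user space. Below is a minimal Go example of the rewrite step over a bare 20-byte IPv4 header; the real kernel path also fixes the TCP/UDP checksum and uses incremental updates rather than a full recompute.

package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// ipv4Checksum is the standard one's-complement sum over the 20-byte header.
func ipv4Checksum(hdr []byte) uint16 {
	var sum uint32
	for i := 0; i < len(hdr); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(hdr[i : i+2]))
	}
	for sum > 0xffff {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	hdr := make([]byte, 20)
	hdr[0] = 0x45                                        // Version 4, IHL 5 (20-byte header)
	copy(hdr[12:16], net.ParseIP("192.168.1.5").To4())   // Private source
	copy(hdr[16:20], net.ParseIP("93.184.216.34").To4()) // Destination

	// SNAT: overwrite the source with the gateway's public IP...
	copy(hdr[12:16], net.ParseIP("203.0.113.1").To4())
	// ...then recompute the header checksum (field must be zero while summing).
	hdr[10], hdr[11] = 0, 0
	binary.BigEndian.PutUint16(hdr[10:12], ipv4Checksum(hdr))
	fmt.Printf("src=%v checksum=0x%04x\n", net.IP(hdr[12:16]), binary.BigEndian.Uint16(hdr[10:12]))
}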
When the reply arrives from the internet, conntrack matches it to the nf_conn entry, rewrites DEST_IP:PORT back to the private address, and forwards it via the route table (kernel memory).

# External view (what attackers see)
$ nmap 203.0.113.1
PORT STATE SERVICE
80/tcp open http # NAT gateway only
443/tcp open https # (not your internal app servers!)
conntrack -L # View active NAT translations (kernel heap)
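If you'd rather read those translations programmatically than shell out, the kernel exposes the same data through procfs on builds with CONFIG_NF_CONNTRACK_PROCFS; a sketch (format and availability vary by kernel, and it needs root):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Same data conntrack -L reads, one flow per line (needs root and a kernel
	// built with CONFIG_NF_CONNTRACK_PROCFS; not every distro enables it).
	f, err := os.Open("/proc/net/nf_conntrack")
	if err != nil {
		fmt.Fprintln(os.Stderr, "conntrack procfs view unavailable:", err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.Contains(sc.Text(), "ESTABLISHED") {
			fmt.Println(sc.Text()) // Original and reply tuples, timeouts, flags
		}
	}
}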
When exposing an internal service (e.g., a web server), you add a DNAT rule:
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:80
(A similar rule can forward 443, perhaps to a box with a crypto offload engine.)
Analogy: Like a castle gatekeeper redirecting the "Tax Collector" (port 80) directly to the treasury (internal server).
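A quick sanity check of the forward rule from outside the gateway; a small Go sketch using the example address above (substitute your own public IP):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 203.0.113.1 is the example public IP from above; substitute your gateway's.
	conn, err := net.DialTimeout("tcp", "203.0.113.1:80", 3*time.Second)
	if err != nil {
		fmt.Println("DNAT not reachable:", err)
		return
	}
	defer conn.Close()
	fmt.Println("port 80 answers through the gateway")
}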
Each connection consumes ~300 bytes in kernel heap for nf_conn
tracking. At 10M connections:
10,000,000 * 300 bytes = 3 GB kernel memory
(Why cloud NAT gateways scale vertically!)
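You can watch how close a gateway is to that budget at runtime; a sketch, assuming the standard sysctl paths and the ~300-byte estimate above:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readInt reads a single integer from a procfs/sysctl file, -1 on failure.
func readInt(path string) int {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1
	}
	n, _ := strconv.Atoi(strings.TrimSpace(string(b)))
	return n
}

func main() {
	count := readInt("/proc/sys/net/netfilter/nf_conntrack_count")
	limit := readInt("/proc/sys/net/netfilter/nf_conntrack_max")
	// ~300 bytes per nf_conn is the back-of-the-envelope figure above;
	// /proc/slabinfo shows the real object size on your kernel.
	fmt.Printf("flows: %d / %d (~%.1f MB in nf_conn objects)\n",
		count, limit, float64(count)*300/1e6)
}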
nf_conntrack is a hash table in the kernel heap tracking all active flows:
// Simplified kernel struct (see include/net/netfilter/nf_conntrack.h)
struct nf_conn {
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX]; // Original/reply direction tuples
    unsigned long status; // Bitmask in a register-sized word
    u32 timeout;          // Jiffies-based expiry used for GC (older kernels kept a timer_list here)
};
In Kubernetes, kube-proxy uses iptables/IPVS for NAT. When a flow misbehaves, start with:
conntrack -L | grep <IP>     # Find the translation for a suspect IP
dmesg | grep nf_conntrack    # Look for "table full, dropping packet"
// Go: Detect public IP (behind NAT). Requires "io" and "net/http".
func GetPublicIP() (string, error) {
	resp, err := http.Get("https://api.ipify.org")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err // Returns the NAT gateway's IP, not your private one!
}
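To see the illusion from both sides, compare that with the source address the kernel actually stamps on your packets before NAT; a small sketch (the connected UDP socket sends nothing, it just forces a route lookup):

package main

import (
	"fmt"
	"net"
)

func main() {
	// A connected UDP socket sends nothing; it just forces the kernel to pick
	// the outbound route, and with it the private source address.
	conn, err := net.Dial("udp", "8.8.8.8:80")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("kernel-chosen source:", conn.LocalAddr()) // e.g. 192.168.1.5:54321
	// GetPublicIP() above would instead return the gateway's address (e.g. 203.0.113.1).
}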
When ISPs share one IP across thousands of homes (carrier-grade NAT), each subscriber gets only a slice of the port range, and accepting inbound connections becomes nearly impossible without IPv6 or NAT-traversal relays.
eBPF programs can bypass netfilter's overhead entirely:
// eBPF NAT sketch (kernel 5.18+): rewrite headers in XDP, before netfilter ever runs
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
SEC("xdp")
int xdp_nat_handler(struct xdp_md *ctx) {
    struct bpf_fib_lookup params = {};
    // Rewrite src IP/port in the packet here, then resolve the next hop:
    bpf_fib_lookup(ctx, &params, sizeof(params), 0);
    return XDP_TX; // Send the rewritten packet back out the same NIC
}
Size nf_conntrack_max based on available RAM:
sysctl -w net.netfilter.nf_conntrack_max=1000000
cat /proc/slabinfo | grep nf_conntrack   # Verify the real per-entry slab cost
NAT is not a firewall; add explicit iptables -A FORWARD rules for real security.
"In the kernel's heart, where packets dance and registers shimmer,
NAT stands as a relic of scarcity—a necessary illusion.
We, the backend lords, must wield its power without succumbing to its deceptions."
When your microservices traverse NAT gateways, remember: Every packet is a lie told for the greater good. The kernel’s conntrack table is the ledger of these lies—and your key to taming them.