June 25, 2024

The Network Alchemist: Decoding TCP Congestion Control for Backend Sorcerers

Prologue: The Packet Deluge
Picture your microservice as a medieval city. Data packets are merchant carts carrying goods (bytes) across a shared stone bridge (network link). When too many carts flood the bridge simultaneously, gridlock ensues—packets collide, latency spikes, and goods rot in transit. This is network congestion—and TCP’s congestion control is your city’s master bridgekeeper. Let’s dissect its machinery through kernel heaps, algorithm state machines, and hardware interrupts.


Chapter 1: The Congestion Window – Kernel Heap’s Thermostat

At the heart of congestion control lies the congestion window (cwnd)—a dynamic limit stored in the kernel’s TCP Control Block (TCB):

struct tcp_sock {  
    u32 snd_cwnd;      // cwnd: congestion window (counted in MSS-sized segments)  
    u32 snd_ssthresh;  // ssthresh: slow start threshold  
    // ... many other state variables  
};  

Unlike flow control’s receiver-managed window (rcv_wnd), cwnd lives purely in the sender’s kernel heap. It’s a self-imposed throttle—think of it as your city guard limiting bridge traffic based on observed congestion.

Key Analogy:

  • cwnd = Number of merchant carts allowed on the bridge simultaneously
  • ssthresh = The bridge’s known safe capacity (when to switch from rapid to cautious expansion)

Chapter 2: Slow Start – The Kernel’s Exponential Probe

Phase: Initial Data Transmission

When a connection opens, cwnd starts small: historically 2-4 MSS (Max Segment Size, per RFC 3390), and 10 MSS by default on modern Linux (RFC 6928). For every ACK received:

cwnd += 1;  // in MSS units: one segment per ACK, doubling cwnd every RTT: 2 → 4 → 8 → 16  

Kernel Mechanics:

  1. NIC receives ACK → triggers hardware interrupt
  2. Kernel soft-IRQ context updates TCB in kernel heap
  3. tcp_slow_start() increments cwnd atomically

Why exponential? Like sending scouts to map uncharted terrain before deploying armies.


Chapter 3: Congestion Avoidance – The Additive Dance

Phase: When cwnd >= ssthresh

Growth shifts to linear scaling:

cwnd += (MSS * MSS) / cwnd;  // byte-counted cwnd: summed over one RTT of ACKs, ≈1 extra MSS  

System Impact:

  • Kernel computes this in tcp_cong_avoid() during ACK processing
  • State stored in TCB’s heap memory, modified via atomic ops
  • Analogy: Adding one cart per hour after the bridge reaches 50% capacity

Chapter 4: Congestion Collapse – Packet Loss as Oracle

Scenario: Packet Loss Detection

When the retransmission timer expires (RTO) or three duplicate ACKs arrive (triggering Fast Retransmit):

  1. Kernel sets ssthresh = max(cwnd / 2, 2)
  2. Resets cwnd = 1 MSS (timeout) or cwnd = ssthresh + 3 (Fast Retransmit)

Kernel Stack Trace:

  • Timer interrupt fires → tcp_retransmit_timer()
  • Updates TCB → forces cwnd reduction
  • Analogy: Bridge collapse forces city to rebuild at half-width

Chapter 5: ECN – The Router’s Whisper

Explicit Congestion Notification (ECN) lets routers mark packets before loss occurs:

  1. Sender sets ECN bits in IP header during tcp_transmit_skb()
  2. Congested router flips CE (Congestion Experienced) bit
  3. Receiver echoes CE in ACK via ECN-Echo flag
  4. Kernel reacts without waiting for loss:
    ssthresh = max(cwnd * beta, 2);  // beta ≈ 0.7 for CUBIC in Linux  
    cwnd = ssthresh;                 // shrink without waiting for a drop  


Hardware/Kernel Handshake:

  • Router’s ASIC marks packets at line rate
  • NIC DMA copies packet to kernel heap → soft-IRQ decodes ECN

Chapter 6: Memory & Registers – The Silent Stagehands

Where the magic materializes:

  1. TCB State: Lives in kernel heap (dynamic, resizable)
  2. cwnd Updates: Applied in soft-IRQ context (preempts user-space)
  3. Timers: RTO tracked via kernel’s hrtimer (high-resolution timer)
  4. Packet Buffering:
    • Outgoing packets: Queued in sk_buff structures (kernel heap)
    • NIC Ring Buffer: DMA region (shared between kernel and hardware)
[User-Space] → [syscall] → [Kernel Heap (TCB, sk_buff)]  
                      ↓  
              [NIC Ring Buffer (DMA region)]  
                      ↓  
                  [Physical Network]  

Chapter 7: Modern Alchemy – BBR, Cubic, and Kernel Tunables

Advanced Algorithms:

  • CUBIC: Default in Linux (net.ipv4.tcp_congestion_control=cubic). Uses cubic function for cwnd growth.
  • BBR: Models the path’s bottleneck bandwidth and round-trip time (the bandwidth-delay product) instead of using packet loss as its primary congestion signal.

Tuning for Backend Engineers:

# Raise socket send-buffer limits (min/default/max, in bytes)  
sysctl -w net.core.wmem_max=16777216  
sysctl -w net.ipv4.tcp_wmem='4096 16384 16777216'  

# Enable ECN  
sysctl -w net.ipv4.tcp_ecn=1  

Epilogue: The Engineer’s Scroll of Wisdom

  1. cwnd ≠ rcv_wnd: Flow control prevents receiver heap overflow; congestion control guards network links.
  2. Loss is Feedback: Packet drops aren’t failures—they’re the network’s backpressure signal.
  3. ECN as Canary: Leverage ECN to avoid congestion collapse in cloud networks.

"In the realm of packets, the wise engineer knows:
The heap is finite, the bridge is shared,
And kernel timers watch while user-threads sleep."

When your service scales, remember TCP’s ancient pact: Probe aggressively, retreat gracefully, and let routers whisper their warnings. The network’s chaos bows to algorithms forged in kernel fires.