When Idle Time Breaks Congestion Control: The CUBIC Bug in QUIC

The Unexpected Failure in QUIC’s Congestion Control

Cloudflare’s open-source QUIC implementation, quiche, uses CUBIC as its default congestion controller. CUBIC, standardized in RFC 9438, is the most widely used congestion control algorithm on the internet, governing TCP and QUIC connections alike. So when a critical bug emerged, one that permanently pinned the congestion window (cwnd) at its minimum after a collapse, it threatened the performance of a significant share of Cloudflare’s traffic. This article tells the story of how a well-intentioned Linux kernel optimization, designed to align CUBIC with the “app-limited” exclusion rules from RFC 9438, inadvertently caused this issue when ported to QUIC. The resolution was a remarkably simple, near-one-line fix.

Source: blog.cloudflare.com

Understanding CUBIC’s Logic

Congestion control algorithms (CCAs) regulate how fast a sender pushes data into the network. Their primary tool is the congestion window (cwnd), which limits how many bytes a sender can have in flight (sent but not yet acknowledged) at any moment. A larger cwnd allows more data per round trip, while a smaller cwnd throttles the sender. Loss-based algorithms like CUBIC follow a simple premise:

  • If there is no packet loss, increase the sending rate to utilize more bandwidth.
  • If loss occurs, assume the network is overloaded and reduce the sending rate (cwnd).
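
The two rules above can be sketched in a few lines of Python. This is a toy model, not CUBIC itself and not quiche code: the packet size, the window floor, and the function names are assumptions made for illustration. The only value borrowed from the standard is the 0.7 multiplicative decrease factor that RFC 9438 specifies for CUBIC.

```python
# Toy loss-based window update. All names and constants are illustrative.
MSS = 1200            # assumed packet size in bytes
MIN_CWND = 2 * MSS    # assumed window floor in bytes


def on_ack(cwnd: int, acked_bytes: int) -> int:
    """No loss seen: allow more bytes in flight."""
    return cwnd + acked_bytes


def on_loss(cwnd: int) -> int:
    """Loss seen: scale the window by 0.7, but never below the floor."""
    return max(cwnd * 7 // 10, MIN_CWND)


cwnd = 10 * MSS
cwnd = on_ack(cwnd, MSS)   # grows by one packet
cwnd = on_loss(cwnd)       # shrinks after a loss
print(cwnd)
```

Note the floor in on_loss: repeated losses can only push the window down to MIN_CWND, which is the state the rest of this article is concerned with.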

This approach aims to maximize throughput by inferring available bandwidth. However, the classic loss-based logic has been revisited over the years, particularly regarding how to handle periods when the sender is application-limited—i.e., not sending data because the application has no data to send, not because the network is congested.

The App-Limited Exclusion

RFC 9438, which standardizes CUBIC, includes a crucial nuance in Section 4.2: time spent app-limited (when the sender has no data to send) should not count toward the elapsed time that CUBIC uses to grow its congestion window. This prevents the controller from overreacting when the application simply pauses. The Linux kernel implemented a change to enforce this exclusion correctly, and that change later became the source of the bug.
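
To see why the exclusion matters, it helps to look at the window function itself. RFC 9438 defines the target window as W_cubic(t) = C * (t - K)^3 + W_max with C = 0.4, where t is the time elapsed in the current congestion avoidance epoch. The sketch below evaluates that curve; the effective_t helper and its argument names are hypothetical, and it assumes W_max is at least as large as the window at the start of the epoch.

```python
C = 0.4  # CUBIC scaling constant from RFC 9438


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds for the curve to climb from cwnd_epoch back to w_max.
    Assumes w_max >= cwnd_epoch."""
    return ((w_max - cwnd_epoch) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window (in packets) t seconds into the current epoch."""
    return C * (t - cubic_k(w_max, cwnd_epoch)) ** 3 + w_max


def effective_t(now: float, epoch_start: float, app_limited: float) -> float:
    """Elapsed epoch time with app-limited seconds left out (hypothetical)."""
    return max(0.0, now - epoch_start - app_limited)


# A flow that was active for 2 s and then paused for 10 s should target the
# same window as a flow that is simply 2 s into its epoch.
paused = w_cubic(effective_t(12.0, 0.0, 10.0), w_max=100.0, cwnd_epoch=70.0)
steady = w_cubic(2.0, w_max=100.0, cwnd_epoch=70.0)
print(paused == steady)
```

Without the exclusion, the paused flow would be evaluated at t = 12 s and would jump to a much larger window than the network has ever confirmed it can carry.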

The Bug: When “Idle” Isn’t Idle

The kernel change aimed to prevent CUBIC from incorrectly treating app-limited intervals as network idle time. Under TCP, this worked as intended. But when the same logic was ported to quiche—which runs in user space and handles QUIC connections—subtle differences in how quiche manages timers and state led to unexpected behavior. Specifically, after a congestion collapse (e.g., a burst of packet loss early in the connection), CUBIC’s cwnd would drop to its minimum value and never recover. The window stayed pinned at the floor, causing permanent throughput degradation.

The Symptom: Test Failures 61% of the Time

The bug surfaced during integration tests for Cloudflare’s ingress proxy. In scenarios where heavy packet loss occurred early in a QUIC connection, the tests failed unpredictably—about 61% of the time. This was alarming because recovery after congestion collapse is exactly what a congestion controller is designed to handle. Most test suites focus on steady-state behavior, but this corner case—where the cwnd is at its minimum—was rarely exercised. The failures pointed to a fundamental flaw in how CUBIC handled the app-limited exclusion in the context of QUIC’s connection lifecycle.

Diagnosis and Root Cause

Cloudflare engineers traced the issue to a state machine interaction. After loss reduces cwnd to the minimum, the controller enters a phase where it expects the sender to generate traffic based on the current window. However, because of the way quiche implemented the new idle-handling logic, the algorithm incorrectly believed the sender was still app-limited even when data was available. This caused CUBIC to remain stuck in its minimal window, never attempting to probe for more bandwidth.

The bug was effectively a mismatch between TCP kernel assumptions—where the sender’s “idle” state is cleanly tracked—and QUIC’s user-space implementation, where event timing and buffer management differ. The kernel optimization assumed that if no data is transmitted, the sender must be idle; but in QUIC, the cwnd itself can prevent transmission without the application being idle.
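
The mismatch can be illustrated with a small simulation. This is a toy model under stated assumptions, not a reproduction of quiche's code: it assumes that at a tiny window the sender drains its cwnd and then waits one round trip for ACKs, and that the faulty logic counts each such wait as idle time by shifting the epoch start forward. The function name, the flag, and all constants other than C are hypothetical.

```python
C = 0.4          # CUBIC scaling constant from RFC 9438
MIN_CWND = 2.0   # assumed window floor, in packets


def simulate(rounds: int, count_cwnd_stalls_as_idle: bool,
             rtt: float = 0.1, w_max: float = 100.0) -> float:
    """Return cwnd after a number of round trips, starting from the floor."""
    cwnd = MIN_CWND
    k = ((w_max - cwnd) / C) ** (1.0 / 3.0)
    epoch_start = 0.0
    now = 0.0
    for _ in range(rounds):
        now += rtt  # window drained; sender waited one RTT for ACKs
        if count_cwnd_stalls_as_idle:
            # The wait is misread as application idle time and excluded,
            # so the epoch start moves forward as fast as the clock does.
            epoch_start += rtt
        t = now - epoch_start
        target = C * (t - k) ** 3 + w_max
        cwnd = max(cwnd, target)
    return cwnd


print(simulate(100, count_cwnd_stalls_as_idle=True))
print(simulate(100, count_cwnd_stalls_as_idle=False))
```

In the first run the elapsed time t never leaves zero, so the cubic curve is always evaluated at its starting point and the window stays at the floor. In the second run t advances normally and the window climbs back toward and past w_max.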

The Elegant (Near) One-Line Fix

The solution was deceptively simple: reset the internal “idle timestamp” when the cwnd is increased after loss recovery. This single change broke the cycle in which the controller believed it was app-limited when it was not. In practice:

  • Clear the idle timer on any cwnd adjustment that moves the window off the minimum.
  • As a result, after a congestion event CUBIC correctly recognizes that the sender is active and free to grow its window.
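
The idea behind the fix can be sketched as follows. This is a minimal illustration of clearing an idle timestamp when the window moves off its floor, not the actual quiche patch: the class, its fields, and the clear_idle_on_growth flag are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCubicState:
    cwnd: float
    min_cwnd: float
    epoch_start: float
    idle_since: Optional[float] = None  # hypothetical "idle timestamp"

    def on_send(self, now: float) -> None:
        """On sending after an idle period, exclude that period from the epoch."""
        if self.idle_since is not None:
            self.epoch_start += now - self.idle_since
            self.idle_since = None

    def set_cwnd(self, new_cwnd: float, clear_idle_on_growth: bool = True) -> None:
        """Apply a window change; optionally clear the idle timestamp when
        the window moves off the floor."""
        grew_off_floor = self.cwnd <= self.min_cwnd < new_cwnd
        self.cwnd = max(new_cwnd, self.min_cwnd)
        if clear_idle_on_growth and grew_off_floor:
            self.idle_since = None


# With the clearing step, a stale idle timestamp no longer shifts the epoch.
fixed = ToyCubicState(cwnd=2.0, min_cwnd=2.0, epoch_start=0.0, idle_since=1.0)
fixed.set_cwnd(3.0)
fixed.on_send(5.0)
print(fixed.epoch_start)
```

Passing clear_idle_on_growth=False to set_cwnd reproduces the old behavior in this toy: the stale timestamp survives the window change, and the next send pushes epoch_start forward by the whole stale interval.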

The change restored normal congestion window growth and reduced the test failure rate from 61% to nearly zero. The fix is now integrated into quiche and has been validated across Cloudflare’s production traffic.

Lessons Learned

This bug highlights the difficulty of porting kernel-level TCP optimizations to user-space QUIC implementations. Even subtle differences in timer granularity, event handling, and the definition of “idle” can produce surprising results. For network engineers and developers, it underscores the importance of testing congestion control algorithms in edge cases—especially the recovery phase after heavy loss. The fix itself is a reminder that sometimes the most impactful solutions are the simplest.
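
One way to exercise the recovery phase is a check that starts a controller at its window floor and asserts that it grows within a bounded number of round trips. The sketch below is a generic harness, not a test from quiche's suite: the step callback signature, the growth factor, and both example controllers are assumptions made for illustration.

```python
from typing import Callable


def recovers_from_floor(step: Callable[[float, float], float],
                        min_cwnd: float, rtt: float = 0.1,
                        rounds: int = 200, factor: float = 2.0) -> bool:
    """Return True if cwnd reaches factor * min_cwnd within the given rounds.

    step(cwnd, now) is a hypothetical hook that returns the next cwnd.
    """
    cwnd = min_cwnd
    for i in range(1, rounds + 1):
        cwnd = step(cwnd, i * rtt)
        if cwnd >= factor * min_cwnd:
            return True
    return False


def additive(cwnd: float, now: float) -> float:
    """Healthy stand-in controller: one extra packet per round trip."""
    return cwnd + 1.0


def stuck(cwnd: float, now: float) -> float:
    """Faulty stand-in controller: the window never moves."""
    return cwnd


print(recovers_from_floor(additive, 2.0), recovers_from_floor(stuck, 2.0))
```

A check like this targets exactly the corner that steady-state tests tend to miss: behavior when the window is already at its minimum.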

Cloudflare’s experience with this CUBIC bug serves as a case study in the challenges of building reliable, high-performance transport protocols. As QUIC adoption grows, similar careful adaptation of TCP algorithms will be critical to ensure a seamless and efficient internet.
