OpenAI Fixes 18-Year-Old libunwind Bug and Azure Host Fault in Rockset Crashes
Updated · OpenAI · Jun 30
Summary
OpenAI said a months-long Rockset crash investigation uncovered two separate causes: an 18-year-old race condition in GNU libunwind and silent hardware corruption on one Azure host.
A population-wide review of a year of production core dumps, run with an automated script that ChatGPT helped write, split the crashes into return-to-null and misaligned-stack groups, revealing software and hardware patterns that manual debugging had missed.
The libunwind flaw hit during C++ exception unwinding when a signal arrived in a one-instruction race window, corrupting a stack-allocated ucontext_t; OpenAI said that matched more than a dozen daily crashes fleetwide.
OpenAI mitigated the issue by switching Rockset from GNU libunwind to libgcc's unwinder, upstreaming a reproducer and fix, and denylisting the faulty Azure host after the misaligned-stack crashes disappeared.
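The population-wide triage described in the summary can be illustrated with a minimal sketch. It is not OpenAI's script: the `CrashRecord` fields, the host names, the sample values, and the 8-byte alignment rule are all assumptions made for illustration. The idea is to bucket each crash by its register state at the time of the fault and then count crashes per host within each bucket, so that a group concentrated on a single machine stands out.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    """Hypothetical per-core-dump summary; field names are assumptions."""
    host: str  # machine that produced the core dump
    pc: int    # program counter at the time of the crash
    sp: int    # stack pointer at the time of the crash


def classify(record: CrashRecord, alignment: int = 8) -> str:
    """Bucket one crash. The 8-byte alignment rule is an assumed heuristic."""
    if record.pc == 0:
        # Execution jumped to address 0, e.g. via a zeroed return address.
        return "return-to-null"
    if record.sp % alignment != 0:
        return "misaligned-stack"
    return "other"


def summarize(records):
    """Count crashes per group, and per host within each group."""
    by_group = Counter()
    hosts_by_group = defaultdict(Counter)
    for record in records:
        group = classify(record)
        by_group[group] += 1
        hosts_by_group[group][record.host] += 1
    return by_group, hosts_by_group


# Made-up sample data, only to exercise the functions above.
sample = [
    CrashRecord("host-a", 0x0, 0x7FFD1000),
    CrashRecord("host-b", 0x0, 0x7FFD2008),
    CrashRecord("host-c", 0x401234, 0x7FFD3004),
    CrashRecord("host-c", 0x405678, 0x7FFD4002),
    CrashRecord("host-d", 0x409ABC, 0x7FFD5010),
]

by_group, hosts_by_group = summarize(sample)
for group, count in by_group.items():
    print(group, count, dict(hosts_by_group[group]))
```

In this made-up sample the return-to-null crashes are spread across hosts while the misaligned-stack crashes all come from one host, which is the kind of contrast that would point to a software cause for the first group and a hardware cause for the second.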