Updated · OpenAI · Jun 30
OpenAI Fixes 18-Year-Old libunwind Bug and Azure Host Fault in Rockset Crashes

1 article · Updated · OpenAI · Jun 30

Summary

  • OpenAI said a months-long Rockset crash investigation uncovered two separate causes: an 18-year-old race condition in GNU libunwind and silent hardware corruption on one Azure host.
  • A population-wide review of a year of production core dumps—using an automated script ChatGPT helped write—split crashes into return-to-null and misaligned-stack groups, revealing software and hardware patterns that manual debugging missed.
  • The libunwind flaw hit during C++ exception unwinding when a signal arrived in a one-instruction race window, corrupting a stack-allocated ucontext_t; OpenAI said that matched more than a dozen daily crashes fleetwide.
  • OpenAI addressed both causes: it switched Rockset from GNU libunwind to libgcc's unwinder and upstreamed a reproducer and fix, and it denylisted the faulty Azure host after the misaligned-stack crashes disappeared.
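
To make the population-wide review described above more concrete, here is a minimal Python sketch of how crash records might be bucketed and then tallied per host. It is not OpenAI's script. The `CrashRecord` fields, the `classify` rules (a program counter of zero for return-to-null, a stack pointer that is not 8-byte aligned for misaligned-stack), and the host names in the usage note are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    """Hypothetical summary of one core dump (field names are assumptions)."""
    host: str  # machine the process crashed on
    pc: int    # program counter at the time of the crash
    sp: int    # stack pointer at the time of the crash


def classify(crash: CrashRecord) -> str:
    """Assign a crash to a coarse group using assumed, illustrative rules."""
    if crash.pc == 0:
        # Execution jumped to address 0, e.g. via a clobbered return address.
        return "return-to-null"
    if crash.sp % 8 != 0:
        # Assumed rule: a 64-bit stack pointer should be a multiple of 8.
        return "misaligned-stack"
    return "other"


def summarize(crashes):
    """Count crashes per group, and per host within each group."""
    by_group = Counter()
    hosts_by_group = {}
    for crash in crashes:
        group = classify(crash)
        by_group[group] += 1
        hosts_by_group.setdefault(group, Counter())[crash.host] += 1
    return by_group, hosts_by_group
```

Run over a large set of dumps, a tally like this shows whether a group is spread across the fleet, which points toward software, or concentrated on one machine, which points toward hardware. For example, `summarize([CrashRecord("host-a", 0, 0x7FFC0000), CrashRecord("host-b", 0x401000, 0x7FFC0003)])` places the first record in return-to-null and the second in misaligned-stack.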

Insights

OpenAI treated software crashes like a disease outbreak. Is this 'epidemiological' approach the future of debugging for complex AI systems?
An 18-year-old bug nearly broke ChatGPT. What other digital time bombs are ticking inside the world’s most critical software?