Account Access and Trading Outage Postmortem 04/26/2022

April 26, 2022
Newton Team

Root causes:

  • Unexpected networking issues knocked our market data services off the production load balancer. As a result, calls to the core application's "rates" endpoint timed out, which caused trades to fail.
  • These timeouts caused a backlog of requests to build up in the core application's containers. The backlog quickly overwhelmed the containers, crashing them and triggering replacements. Each replacement was overwhelmed in turn and crashed, continuing the cycle.
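
To illustrate why slow upstream calls can build a backlog so quickly, here is a rough back-of-the-envelope sketch. The worker count and latencies are hypothetical numbers chosen for illustration, not measurements from our systems. It assumes each worker thread handles one request at a time.

```python
def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second when each worker serves one request at a time."""
    return workers / seconds_per_request


# Hypothetical numbers, for illustration only.
WORKERS = 20

# Upstream answers in 50 ms: the container can serve hundreds of requests per second.
healthy = max_throughput(WORKERS, 0.05)

# Every call hangs for 30 s before timing out: capacity drops below one request per second.
degraded = max_throughput(WORKERS, 30.0)

print(f"healthy:  {healthy:.1f} req/s")
print(f"degraded: {degraded:.2f} req/s")
```

Under these assumptions, any incoming traffic above the degraded rate queues up faster than it can be served, which is the kind of backlog described above.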

Fixes:

  • Set a short timeout on calls from the core application to our market data services, so that slow or hung calls cannot tie up threads for extended periods of time.
  • Improve image caching and tune CPU resources to reduce the risk of a cascading failure.
  • In the long term, we will continue moving more functionality out of the core application and into microservices so that future failures won't impact unrelated parts of the application.
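
As a minimal sketch of the first fix, the example below shows one way to put a short timeout on an HTTP call to a market data service using only the Python standard library. This is not our production code: the function name `fetch_rates`, the `RatesUnavailable` exception, the 0.5 second default, and the assumption that the service returns JSON over HTTP are all hypothetical.

```python
import json
import urllib.request

# Hypothetical default; a real value would be tuned against observed latencies.
RATES_TIMEOUT_SECONDS = 0.5


class RatesUnavailable(Exception):
    """Raised when the market data service does not answer in time or cannot be reached."""


def fetch_rates(url: str, timeout: float = RATES_TIMEOUT_SECONDS) -> dict:
    """Fetch rates as JSON, giving up quickly instead of holding a worker thread."""
    try:
        # `timeout` bounds the connection attempt and each blocking socket read.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except OSError as exc:
        # URLError, socket timeouts, and connection errors are all OSError subclasses.
        raise RatesUnavailable(f"market data call failed: {exc}") from exc
```

A caller can catch `RatesUnavailable` and fail the single trade request right away, so the worker thread is freed for other requests rather than waiting on a hung connection.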

Thank you for your patience and understanding while we worked to fix this issue.

- The Newton team
