1. Overall, how would you assess the reliability of your current architecture under stress?
2. Can you describe a bug that slipped through code review and testing? Why do you think it wasn’t caught earlier?
3. Have you ever noticed that small inefficiencies add up over time and eventually impact system stability?
4. What typically causes a sudden spike in load in your system, and how do you handle it?
5. When explaining incidents to non-technical stakeholders, how do you summarise what happened without oversimplifying?
6. Can you recall a change that seemed harmless but caused unexpected behaviour in production?
7. Have you experienced a failure that triggered retries or fallback mechanisms which then overloaded upstream services?
8. During an outage, how do you explain that, in practice, users were unable to complete critical flows?
9. Have you dealt with requests that didn’t get through to downstream services because of timeouts or circuit breakers?
10. When that happened, did the system degrade gradually, or did it fail fast?
11. Has retry logic ever made a situation worse instead of improving resilience? (A retry-with-backoff sketch follows this list.)
12. What signals do you look at first when investigating a production incident?
13. How do you design rate limiting so that one overloaded dependency doesn’t impact the entire system? (A token-bucket sketch follows this list.)
14. What kind of monitoring do you usually put in place around third-party integrations?
15. How quickly can your team spot anomalies before they escalate into user-visible failures?
16. Once the issue was mitigated, how did the system behave when traffic returned to normal levels?
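
Questions 7 and 11 revolve around retry amplification: naive retries can turn one upstream failure into a retry storm. A minimal sketch of the usual countermeasure, capped exponential backoff with full jitter, assuming Python; `call_with_backoff` and its parameters are illustrative names, not from the questions themselves:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation with capped exponential backoff and full jitter.

    Jitter spreads retries out in time so that many clients recovering from
    the same failure do not hit the upstream service again in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code would catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: fail fast instead of piling on
            # full jitter: sleep a random amount up to the capped exponential delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The attempt budget is the part that answers question 11: without it, every client keeps retrying indefinitely and the retries themselves become the load that prevents recovery.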
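Question 13 asks about rate-limiter design. One common approach (the questions do not prescribe a specific one) is a token bucket, which enforces an average request rate while allowing short bursts; a minimal sketch in Python, with `TokenBucket` and its parameters being hypothetical names:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens accrue per second up to
    `capacity`; each allowed request consumes one token. Requests that find
    the bucket empty are rejected, which keeps one overloaded dependency
    from soaking up the whole system's capacity."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill in proportion to the time elapsed since the last check
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Used per dependency, e.g. `limiter = TokenBucket(rate=50, capacity=100)` in front of a third-party client, so a slow integration is shed early rather than queuing work across the whole system.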
