JT's Space

Wednesday March 4, 2026

JT Houk
Published March 4, 2026
incident-response team

Incidents

Reporting Consumer Lag (SEV-3): Post-Incident Review

Updated the Post-Incident Review document to address the comments and suggestions from Daniela.

Team

Sprint Retro (Feb 18 to Mar 4, 2026)

What went well. Shipped the sort-sales-by-invoice-number feature, merged and deployed to production across both the search indexing service and the GraphQL service. When the keyword_sort_normalizer task failed during deploy, I identified and resolved the root cause (the normalizer was missing from the existing index settings) and re-ran the task successfully the same day. Code review output was strong this sprint, with 6+ PRs reviewed across the internal team (Dania, Zhijie, Vikki) and cross-team contributions. Led triage as Incident Manager for the SEV-3 reporting consumer lag incident; redeploying the service resolved the lag, and the Post-Incident Review was opened and completed with feedback addressed. Also identified gaps in the Transifex onboarding docs and the search indexing service documentation, and raised concrete proposals to improve both.

What didn't go well. Two incidents hit within two days of each other: a SEV-1 (sales not appearing in Sales History) and a SEV-3 (reporting consumer lag). This was the third occurrence of the consumer lag pattern, which suggests that the root cause was never resolved after the prior incidents. Neither incident triggered a critical alert for Kafka consumer lag; the gap was noted but has not yet been addressed. Timeouts on docker pull during task daily cost a full morning on Feb 18. The invoice_number index mapping task failed in production because of a missing normalizer that pre-deploy testing did not catch, requiring a same-day hotfix.
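To make the alerting gap concrete, here is a minimal sketch of a lag-trend check. It assumes that per-partition end offsets and committed offsets are already being sampled somewhere; the LagSample type, the lag_is_trending_up function, and the threshold values are hypothetical names chosen for illustration, not part of any existing service.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LagSample:
    """One observation of a consumer group's position on a partition."""
    end_offset: int        # latest offset written to the partition
    committed_offset: int  # last offset committed by the consumer group

    @property
    def lag(self) -> int:
        # Lag is how far the consumer is behind the end of the log.
        return max(self.end_offset - self.committed_offset, 0)


def lag_is_trending_up(samples: Sequence[LagSample], min_lag: int, window: int) -> bool:
    """Return True when the last `window` samples all sit at or above
    `min_lag` and lag is higher at the end of the window than at the start."""
    if window < 2 or len(samples) < window:
        return False
    recent = [s.lag for s in samples[-window:]]
    sustained = all(lag >= min_lag for lag in recent)
    growing = recent[-1] > recent[0]
    return sustained and growing


if __name__ == "__main__":
    # Hypothetical samples taken at a fixed interval.
    history = [
        LagSample(end_offset=100, committed_offset=100),
        LagSample(end_offset=200, committed_offset=150),
        LagSample(end_offset=300, committed_offset=180),
        LagSample(end_offset=400, committed_offset=200),
    ]
    print(lag_is_trending_up(history, min_lag=40, window=3))
```

Checking a window of samples rather than a single reading is a design choice to avoid alerting on short bursts; the right window and threshold would depend on the consumer's normal throughput.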

Action items:

  • Follow up with the reporting team to identify the root cause of the recurring consumer lag (third occurrence of the SEV-3 pattern)
  • Review and improve alerting coverage for Kafka consumer lag trends
  • Investigate and resolve the docker pull timeout issue in the local dev environment
  • Add a pre-deploy validation step or documentation for Elasticsearch index mapping tasks to catch normalizer/settings mismatches before production
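
The pre-deploy validation in the last action item could look roughly like the sketch below. It assumes the index settings and the new mappings are available as plain dictionaries in the shape Elasticsearch accepts in request bodies. The function names and the example settings/mappings are hypothetical; only the invoice_number field and the keyword_sort_normalizer name come from the incident itself.

```python
from typing import Any, Iterator

# Elasticsearch ships a built-in "lowercase" normalizer, so it needs no settings entry.
BUILTIN_NORMALIZERS = {"lowercase"}


def iter_normalizers(properties: dict[str, Any], prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (field_path, normalizer_name) for every field that names a normalizer."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "normalizer" in spec:
            yield path, spec["normalizer"]
        # Object fields nest under "properties"; multi-fields nest under "fields".
        for key in ("properties", "fields"):
            if key in spec:
                yield from iter_normalizers(spec[key], prefix=f"{path}.")


def missing_normalizers(settings: dict[str, Any], mappings: dict[str, Any]) -> list[tuple[str, str]]:
    """Return the (field_path, normalizer_name) pairs that the mappings
    reference but the settings do not define."""
    # Settings may or may not be wrapped in an "index" key.
    index_settings = settings.get("index", settings)
    defined = set(index_settings.get("analysis", {}).get("normalizer", {}))
    known = defined | BUILTIN_NORMALIZERS
    return [
        (path, name)
        for path, name in iter_normalizers(mappings.get("properties", {}))
        if name not in known
    ]


if __name__ == "__main__":
    # Hypothetical inputs: settings with no custom normalizers defined,
    # and a mapping that references one.
    settings = {"analysis": {"normalizer": {}}}
    mappings = {
        "properties": {
            "invoice_number": {"type": "keyword", "normalizer": "keyword_sort_normalizer"}
        }
    }
    for path, name in missing_normalizers(settings, mappings):
        print(f"{path}: normalizer '{name}' is not defined in index settings")
```

Run against the target index's current settings before the mapping task, a non-empty result would be a reason to stop the deploy and add the normalizer first.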