Having caused massive data incidents by making 2-lines-of-SQL “hotfixes”, I’ve come to believe (Daniel Kahneman rolls his eyes) that we significantly overestimate the share of externally caused data issues and dramatically underestimate how often we (people) break things for ourselves and others.

When a number looks off, it could be us pausing ads, it could be a delay in the data ingestion pipeline, or it could be a real problem, but we won't know until someone spends a couple of hours digging into it.
Data breaks for reasons outside of our control:

- Vendor providing financial data shipped us a dataset omitting three markets.
- Event was duplicated in the streaming pipeline, causing a fanout in the warehouse.
- Airflow scheduler errors out: a task never ran but shows as completed.

Analytics engineer renames a field in a dbt model “for consistency.” An online machine learning model that powers search is no longer online.

The same scenario applies to changes to transactional tables replicated from OLTP databases into the warehouse for analytics. Fixing it takes two weeks since it requires new instrumentation.
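Two of the failures above, the duplicated event and the renamed field, can be caught by very cheap checks. Here is a minimal sketch in plain Python, not tied to any particular tool. The function names, the `event_id` key, and the sample column names are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return the key values that appear more than once, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


def column_changes(old_columns, new_columns):
    """Return (removed, added) column names between two versions of a model."""
    old, new = set(old_columns), set(new_columns)
    return sorted(old - new), sorted(new - old)


# Hypothetical sample: one event was emitted twice by the pipeline.
events = [
    {"event_id": "e1", "user_id": 1},
    {"event_id": "e2", "user_id": 2},
    {"event_id": "e2", "user_id": 2},
]
print(duplicated_keys(events, "event_id"))  # {'e2': 2}

# Hypothetical sample: a field renamed "for consistency".
print(column_changes(["user_id", "created_at"], ["user_id", "created_ts"]))
# (['created_at'], ['created_ts'])
```

A rename shows up as one removed and one added column, which is the signal a downstream consumer, such as the search model, would need before the change ships.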