For the 'The Numbers That Should Scare You Straight' section, I wonder if you have numbers, or at least a feeling, for the one I've been dealing with too often, and for what you'd mark down as the gain from not unknowingly publishing dubious results:
Monitoring and alerting for discontinuities or irregularities in the data streams (both input and output).
Way too many data projects I've stumbled into don't have any indicator when the output is garbage and, worse, don't find out that the input has been suffering until way after it happens. That's partially due to batching, but often it's because there literally is no monitoring for issues while the pipeline runs.
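To make it concrete, here is a rough sketch of the kind of check I mean, not anything taken from the chapter. The function names (`find_gaps`, `volume_is_irregular`) and the thresholds are made up for illustration; a real pipeline would tune them per stream and send the result to whatever alerting it already has.

```python
from datetime import datetime, timedelta
from statistics import median


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, current) pairs spaced further apart than
    tolerance * expected_interval. Hypothetical helper for illustration."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def volume_is_irregular(batch_size, recent_sizes, max_ratio=0.5):
    """Flag a batch whose row count deviates from the median of recent
    batches by more than max_ratio (0.5 means 50%). Assumed threshold."""
    if not recent_sizes:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(recent_sizes)
    if baseline == 0:
        return batch_size != 0
    return abs(batch_size - baseline) / baseline > max_ratio
```

The same two checks can run on the input as it arrives and on the output before it is published, so a discontinuity shows up during the run instead of way after it.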
Last week's example: the log has an ERROR line telling me that the pipeline failed to get an item of data, yet the pipeline continues on and reports success. In production... the logs are discarded. Because... it's a data pipeline, not a software pipeline?
Mmmm, especially given you start to go into it later in the chapter.
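For the "ERROR in the log, success at the end" case, a minimal sketch of what I'd rather see, assuming a per-item fetch step. Everything here (`run`, `RunSummary`, `PipelineRunFailed`, `max_failure_rate`) is a hypothetical name of mine, not from the chapter or any particular framework.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("pipeline")


class PipelineRunFailed(Exception):
    """Raised when too many items failed for the run to count as a success."""


@dataclass
class RunSummary:
    fetched: int = 0
    failed_ids: list = field(default_factory=list)


def run(item_ids, fetch_item, max_failure_rate=0.0):
    """Fetch every item, record failures, and refuse to report success
    if the failure rate exceeds max_failure_rate (an assumed knob)."""
    summary = RunSummary()
    for item_id in item_ids:
        try:
            fetch_item(item_id)
            summary.fetched += 1
        except Exception:
            # Still log it, but also keep the failure in the run's own state
            # so it survives even if the logs are discarded.
            logger.exception("failed to fetch item %s", item_id)
            summary.failed_ids.append(item_id)

    total = summary.fetched + len(summary.failed_ids)
    failure_rate = len(summary.failed_ids) / total if total else 0.0
    if failure_rate > max_failure_rate:
        raise PipelineRunFailed(
            f"{len(summary.failed_ids)} of {total} items failed: {summary.failed_ids}"
        )
    return summary
```

The point is only that the failure ends up in the run's result, not just in a log line nobody keeps. With the default `max_failure_rate=0.0`, a single missing item fails the run.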