
[Chapter 1] Feedback #1

@nph


Hi @aronchick

Really enjoyed reading this - you've managed to articulate many important aspects of making ML work in the real world that don't get enough discussion. And it's entertaining to boot! Great job. My comments / feedback on Chapter 1 are below.

Opening Blurb

Last year, a Fortune 500 company spent $50 million on ML infrastructure and got beaten to market by three engineers with a laptop and clean data. This isn't a David and Goliath story - it's Tuesday in the ML world.

Okay, let's start with another story.

But I was actually looking forward to reading more about the opening story! (even if it's apocryphal...). Instead of dropping it altogether right after you've hooked the reader, perhaps you could revisit it later in the chapter? That said - I did enjoy the Team A vs Team B story and it gets the central message across well.

Welcome to the Data-Centric AI revolution. Population: not enough people yet, but we have physics and the real world on our side.

I'm not sure "physics" is quite the right word here. Maybe "science" is more in line with what you're driving at. Team B certainly embodies a lot more of the scientific method than Team A.

As an aside, I should stress (!), there's nothing wrong with Team A. They're doing everything that we need to move the industry forward. But more often than not, there's a big gap between the latest research and actually using those breakthroughs.

Feel like this detracts from your message. I'd say there's very much something wrong with Team A. If they've been tasked with shipping a working ML product, then spending their time playing around with the latest ML architectures, when they haven't yet done a proper evaluation of logistic regression, let alone visualized the dataset, seems... bad. Of course, if they're working for DeepMind and tasked with pushing forward the state of the art and publishing papers, then that's different (although even those guys should be checking their data carefully...), but it doesn't seem like that's the setup here.

1.1 Andrew Ng's Paradigm Shift

Let me throw some numbers at you:

Model architecture improvements: 1-2% accuracy gain (if you're lucky and the moon is in the right phase)
Cleaning your data: 5-10% accuracy gain
Actually understanding your data: 20-30% accuracy gain (I've seen this with my own eyes, multiple times)

I think it would be helpful to explain how data understanding differs from data cleaning. My "2 pence" worth: a core aspect of data understanding is figuring out what we're not currently capturing about the problem that's required to achieve good performance, i.e. what's currently missing in our model inputs / features / architecture / batch class proportions / learning methodology etc. that is limiting our predictive accuracy. Data cleaning, by contrast, is narrower and fundamentally about fixing errors in the data / features / labels.

The "Amazon Hiring Fiasco" is a good example of why data understanding is important, and one where data cleaning alone would not have solved the core issue.
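To make the cleaning-vs-understanding distinction concrete, here's a minimal sketch of my own (hypothetical column names, pandas assumed - not code from the chapter): cleaning fixes the rows, understanding interrogates what the rows can and can't tell the model.

```python
import pandas as pd

# Hypothetical toy hiring-style dataset.
df = pd.DataFrame({
    "years_experience": [2, 5, 5, None, 30],
    "resume_text": ["a", "b", "b", "c", "d"],
    "hired": [0, 1, 1, 0, 1],
})

# Data cleaning: fix errors in the data itself.
df = df.drop_duplicates()                    # remove duplicated rows
df = df.dropna(subset=["years_experience"])  # handle missing values

# Data understanding: ask what the data does (and doesn't) capture.
print(df["hired"].mean())                    # base rate: is it skewed?
print(df.groupby("hired")["years_experience"].describe())  # slice stats
```

Cleaning alone would happily pass a dataset whose labels encode a historical bias; only the understanding step (base rates, per-slice statistics, asking what's missing) has a chance of catching that.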

based on an older natural sciences paper called "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"

Maybe make this detail a footnote (the paragraph is a bit wordy and hard to read at present).

A Real Story That Still Makes Me Laugh (And Cry)

Enjoyed this one!!

1.3 Data-Centric vs Model-Centric: The Middle Path

I think the code samples would be clearer if you explicitly showed where/how accuracy is computed:

accuracy = train_and_evaluate_model(model, data)
while accuracy < target:
    data = understand_and_fix_data(data)
    accuracy = train_and_evaluate_model(model, data)
    documentation += 1
    understanding += 1
    sanity += 1  # Yes, plus!

Also it might be worth briefly saying something about how we should choose the initial model to work with if we're following the data-centric approach (given that the focus is squarely on the data).
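For instance (a hedged sketch of my own, assuming scikit-learn and a generic tabular dataset - not the chapter's code), the data-centric loop might start from the simplest reasonable baseline, with the evaluation made explicit so every subsequent data fix is measured against it:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Start from a simple, well-understood baseline model...
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# ...and make the evaluation explicit, so each data fix moves a real number.
accuracy = cross_val_score(model, X, y, cv=5).mean()
print(f"baseline accuracy: {accuracy:.3f}")
```

The point being: if logistic regression hasn't been beaten yet, reaching for a fancier architecture is premature.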

Focus on Models When:
...

Maybe also add to this list - "You're 100% confident that your features fully encode all aspects of the problem to enable the model to achieve high performance" - which obviously gets back to the data understanding discussion.

The Balanced Approach: Having Your Cake and Eating It Too

This is good stuff 👍

1.4 The Ten Fundamentals of Data-Centric AI

Ditto - love these 10 commandments.

1.5 Learning from Failures: The Hall of Shame (And Fame)

Feel like these case studies need fleshing out a bit more - especially Case Study 1. Didn't grok why everyone got recommended bestsellers?

Case Study 3: Predictive Maintenance - Schrödinger's Failure
The Comedy of Errors:
Sensors sampled at different rates (1Hz, 1/minute, 1/hour)

But sensors are frequently sampled at different rates in real-world systems (and for good reasons)! I'm guessing you mean that the error was in how the multi-rate sampled data was handled during modeling.
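A sketch of one common fix, assuming pandas and hypothetical sensor names (my illustration, not the book's): resample each stream to a shared timebase before joining, rather than naively aligning rows across streams with different rates.

```python
import pandas as pd

# Hypothetical sensor streams sampled at different rates.
idx_fast = pd.date_range("2024-01-01", periods=3600, freq="s")  # 1 Hz
idx_slow = pd.date_range("2024-01-01", periods=60, freq="min")  # 1/minute
vibration = pd.Series(range(3600), index=idx_fast, name="vibration")
temperature = pd.Series(range(60), index=idx_slow, name="temperature")

# Resample both onto a common 1-minute timebase, then join.
features = pd.concat(
    [vibration.resample("min").mean(), temperature.resample("min").mean()],
    axis=1,
)
print(features.shape)  # one row per minute, both sensors aligned
```

Treating the raw interleaved rows as if they shared a clock is the kind of silent error that only shows up when you actually look at the data.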

1.6 The Economics of Data Quality (Or: How to Justify This to Your Boss)

This is great stuff. And you're 100% right that none of this gets talked about anywhere near enough.

Exercise 1: The Reality Check (Time: ~2 hours)

Excellent advice and reminds me of this Karpathy tweet.

Looking forward to reading more. Hope this helps.
