Homework

The goal of this homework is to create a simple training pipeline, use MLflow to track experiments and register the best model, and run the pipeline with a modern data workflow orchestration tool such as Mage or Prefect, or any other tool listed in the course material under '03-orchestration'.

We'll use the same NYC taxi dataset: the Yellow taxi trip data for March 2023.

Question 1. Select the Tool

You can use the same tool you used when completing the module, or choose a different one for your homework.

What's the name of the orchestrator you chose?

Question 2. Version

What's the version of the orchestrator?
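
One way to look up the version from Python is to query the installed package metadata. The sketch below is only an illustration: the helper name `orchestrator_version` is hypothetical, and the package names `prefect` and `mage-ai` are assumptions about which tool you picked, so swap in the distribution name of your own orchestrator.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def orchestrator_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumed distribution names; replace with the orchestrator you actually chose.
for name in ("prefect", "mage-ai"):
    print(name, orchestrator_version(name) or "not installed")
```

Most orchestrators also report their version through their own command-line tool, which should give the same answer.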

Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load?

3,003,766
3,203,766
3,403,766
3,603,766

(Include a print statement in your code)
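
A minimal sketch of the loading step, written as a plain function that you could wrap in whatever task or block abstraction your orchestrator provides. The function name `load_trips` is hypothetical, and `DATA_URL` is an assumption about where the March 2023 file is hosted (the homework text does not give a location), so point it at your own copy if needed.

```python
import pandas as pd

# Assumed download location for the March 2023 Yellow taxi file;
# the homework does not specify one, so adjust as needed.
DATA_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2023-03.parquet"
)


def load_trips(source: str = DATA_URL) -> pd.DataFrame:
    """Read a parquet file of trips and print how many records it contains."""
    df = pd.read_parquet(source)
    # The print statement the question asks for.
    print(f"Loaded {len(df):,} records from {source}")
    return df
```

Calling `load_trips()` with no arguments would download the file from the assumed URL; passing a local path such as `load_trips("yellow_tripdata_2023-03.parquet")` avoids repeated downloads.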

Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously.

This is what we used (adjusted for yellow dataset):

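import pandas as pd


def read_dataframe(filename):
    df = pd.read_parquet(filename)

    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes; copy to avoid SettingWithCopyWarning
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Cast location IDs to strings so they are treated as categorical features
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df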

Let's apply to the data we loaded in question 3.

What's the size of the result?

2,903,766
3,103,766
3,316,216
3,503,766
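
To see what the preparation step does before running it on the full dataset, it can help to try the same logic on a tiny hand-made frame. The sketch below is illustrative only: `filter_by_duration` is a hypothetical helper that mirrors the filtering logic above on an in-memory DataFrame, and the four toy trips are made up.

```python
import pandas as pd


def filter_by_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Mirror the homework's preparation logic on an in-memory frame."""
    df = df.copy()
    delta = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df["duration"] = delta.dt.total_seconds() / 60  # minutes

    # Both bounds are inclusive: 1-minute and 60-minute trips are kept.
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)
    return df


# Made-up trips lasting 30 seconds, 1 minute, 60 minutes and 61 minutes.
pickup = pd.Timestamp("2023-03-01 08:00:00")
toy = pd.DataFrame({
    "tpep_pickup_datetime": [pickup] * 4,
    "tpep_dropoff_datetime": [
        pickup + pd.Timedelta(seconds=30),
        pickup + pd.Timedelta(minutes=1),
        pickup + pd.Timedelta(minutes=60),
        pickup + pd.Timedelta(minutes=61),
    ],
    "PULocationID": [1, 2, 3, 4],
    "DOLocationID": [5, 6, 7, 8],
})

prepared = filter_by_duration(toy)
print(f"{len(toy)} rows before, {len(prepared)} rows after")
# prints "4 rows before, 2 rows after"
```

Only the 1-minute and 60-minute trips survive, which is why the prepared dataset is smaller than the number of records loaded in question 3.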

Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

Fit a dict vectorizer.
Train a linear regression with default parameters.
Use pick up and drop off locations separately, don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.
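
The training step described above could look roughly like the following. This is a sketch, not the reference solution: the function name `train_model` is hypothetical, and it assumes the input frame already has a `duration` column and string-typed `PULocationID` and `DOLocationID` columns, as produced by the preparation step.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression


def train_model(df: pd.DataFrame):
    """Fit a DictVectorizer and a default LinearRegression; return both."""
    # Pick-up and drop-off locations are used as separate features,
    # without a combined PU_DO feature.
    categorical = ["PULocationID", "DOLocationID"]
    train_dicts = df[categorical].to_dict(orient="records")

    dv = DictVectorizer()
    X_train = dv.fit_transform(train_dicts)  # one-hot encodes string values
    y_train = df["duration"].values

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    # The hint from the question: print the intercept.
    print(f"Intercept: {lr.intercept_:.2f}")
    return dv, lr
```

Returning both objects matters because the vectorizer is needed again at prediction time to encode new trips the same way.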

What's the intercept of the model?

Hint: print the intercept_ field in the code block

21.77
24.77
27.77
31.77

Question 6. Register the model

The model is trained, so let's save it with MLflow.

Find the logged model and open its MLmodel file. What's the size of the model (the model_size_bytes field)?
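
If you prefer to read the value programmatically rather than opening the file by hand, a small standard-library sketch follows. It assumes the MLmodel file is plain YAML with `model_size_bytes` as an unindented top-level key; the helper name `read_model_size` is hypothetical, and the path to the file depends on where your tracking setup stores artifacts.

```python
from pathlib import Path
from typing import Optional


def read_model_size(mlmodel_path) -> Optional[int]:
    """Return the top-level model_size_bytes value from an MLmodel file, if present."""
    text = Path(mlmodel_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Match only an unindented key, i.e. a top-level YAML entry.
        if sep and key == "model_size_bytes":
            return int(value.strip())
    return None
```

For example, `read_model_size("path/to/model/MLmodel")` would return the integer to compare against the options, or `None` if the field is missing.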

14,534
9,534
4,534
1,534

Submit the results

Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw3
If your answer doesn't match options exactly, select the closest one.
