170 changes: 170 additions & 0 deletions solution/SOLUTIONS.md
## Challenge 1 - Refactor DEV code

The refactoring was done with three main objectives in mind:

- Time optimization
- Increase code maintainability
- Make the code testable at different stages

### Time optimization

To increase the time efficiency of the code, I analyzed the different code fragments in the notebooks.
There were four fragments that could potentially be optimized, mainly pandas `apply` calls. For each function to be refactored,
I wrote a helper that runs the original and the refactored code on inputs of different sizes, measures the execution time, and produces a plot.
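
As an illustration, the timing helper follows roughly the pattern below. The function names are placeholders; the actual comparison helpers live in `code/src/plots.py` and return a frame with `n`, `method` and `time` columns, which is what `generate_plots.py` feeds to seaborn.

```python
import time
import pandas as pd

def time_methods(df: pd.DataFrame, methods: dict, sizes=(1_000, 5_000, 20_000)) -> pd.DataFrame:
    """Run each candidate implementation on samples of increasing size and record the elapsed time."""
    records = []
    for n in sizes:
        # sample with replacement so sizes larger than the raw data are also possible
        sample = df.sample(n=n, replace=True)
        for name, fn in methods.items():
            start = time.perf_counter()
            fn(sample.copy())
            records.append({'n': n, 'method': name, 'time': time.perf_counter() - start})
    return pd.DataFrame(records)

# usage sketch: time_methods(df_raw, {'original': original_parse_price, 'refactored': refactored_parse_price})
```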

Three of the refactored snippets achieved a reduction in execution time, and the difference grows roughly linearly with the data size. To reproduce
this test, I created a Python script called `generate_plots.py`; the plots are stored in the `plots` folder. Below are the plots for each
function. In all the plots, the blue line corresponds to the original code.

#### 1. Parse bathroom text to integer
![parse_bathroom](/solution/plots/bathroom.png)

#### 2. Extract amenities
This function extracts the different amenities. While its runtime is close to the original implementation, the original code
could easily lead to errors because the same block is copy/pasted for each column; my implementation only relies on
a list of the amenities to extract. In the tests, the new code sometimes outperformed the original and sometimes did not, but either way I
think the new implementation is better for code maintenance.
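
The core of the refactor can be sketched as follows. The list of amenities matches `CAT_COLS` in `generate_plots.py`, but the shipped implementation is the `ArrayOneHotEncoder` transformer shown later, so treat this as an illustration only.

```python
import pandas as pd

AMENITIES = ['TV', 'Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Wifi', 'Elevator', 'Breakfast']

def encode_amenities(df: pd.DataFrame, amenities=AMENITIES) -> pd.DataFrame:
    # one generic loop over the list instead of a copy/pasted block per column
    for amenity in amenities:
        df[amenity] = df['amenities'].str.contains(amenity, regex=False).fillna(False).astype(int)
    return df
```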

![Amenities](/solution/plots/cat_encoder.png)

#### 3. Pandas cut function

The NumPy implementation is similar in code complexity, but `pd.cut` is easier to understand, so I kept this part as it is.
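
For reference, the two variants are roughly equivalent to the snippet below (toy data; bin edges taken from the pipeline shown further down, and edge-value handling may differ slightly between the two).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [25, 120, 250, 800]})  # toy data for illustration
bins = [10, 90, 180, 400, np.inf]
labels = [0, 1, 2, 3]

# pandas version kept in the pipeline: explicit about edges and labels
df['category'] = pd.cut(df['price'], bins=bins, labels=labels)

# a rough NumPy equivalent of similar complexity, but less self-explanatory
df['category_np'] = np.digitize(df['price'], bins=bins[1:-1], right=True)
```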


![Pandas_cut](/solution/plots/pd_cut.png)

#### 4. Parse the string price to int
![Parse_price](/solution/plots/price.png)

Note: Some of these conclusions may vary slightly depending on whether the code is executed in Docker or locally.

In addition, there are two scripts dedicated to comparing the original implementation with mine, one for each notebook. These tests can
be found under `code/test/develope_test`: `test_eda.py` compares against the notebook `01-experatory-data-analysis.ipynb`, and
`test_explore_classifier.py` compares against the notebook `02-explore-classifier-model.ipynb`. The results can be seen in
the logs folder, in `test_eda.log` and `test_explorer.log` respectively.

The timing of the first one is always a bit worse than the original code. This is because the different cleaning and processing steps
are implemented via scikit-learn `ColumnTransformer` and `Pipeline` objects, and the fitting step adds some extra time to
compute the result. Nevertheless, I think this overhead is worth it because it allows the same preprocessing steps to be applied to
unseen data, which is usually what you want when using the generated model on new inputs.

Regarding the second script, the execution time is slightly better in the refactored code, but the difference between the implementations is very small.

### Maintainability

As mentioned before, to improve maintainability, I decided to implement the various steps as separate custom column transformers using
`sklearn`. This approach allows for easier modification of the process and the addition of new steps. The different transformers are
saved in `code/src/transformer.py`. Unit tests are also included to ensure that the code behaves as expected. By using pipelines, the
entire process can be summarized as follows:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# The custom transformers live in code/src/transformer.py; constants such as
# COLUMNS, CAT_COLS, MAP_NEIGHB, MAP_ROOM_TYPE, FEATURE_NAMES and
# TARGET_VARIABLE are defined alongside the pipeline code.
from code.src.transformer import (
    ArrayOneHotEncoder, ColumnRenamer, ColumnSelector, CustomOrdinalEncoder,
    DiscretizerTransformer, DropColumns, DropNan, QueryFilter,
    StringToFloatTransformer, StringToInt,
)

preprocessing_pipeline = Pipeline(steps=[
    ('col_selector', ColumnSelector(COLUMNS)),
    ('bathroom_processing', StringToFloatTransformer({'bathrooms_text': 'bathrooms'})),
    ('cast_price', StringToInt('price', r"(\d+)")),
    ('filter_rows', QueryFilter("price >= 10")),
    ('drop_na', DropNan(axis=0)),
    ('bin_price', DiscretizerTransformer('price', 'category', bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])),
    ('array_to_cat', ArrayOneHotEncoder('amenities', CAT_COLS)),
    ('col_renamer_conditioning', ColumnRenamer(columns={'Air conditioning': 'Air_conditioning', 'neighbourhood_group_cleansed': 'neighbourhood'})),
    ('drop_cols', DropColumns('amenities'))
])

ct = ColumnTransformer(
    [
        ('ordinal_encoder', CustomOrdinalEncoder(categories=[list(MAP_NEIGHB.keys()), list(MAP_ROOM_TYPE.keys())], start_category=1), ["neighbourhood", "room_type"])
    ],
    remainder="passthrough",
    verbose_feature_names_out=False
)

processing_pipeline = Pipeline(steps=[
    ('drop_na', DropNan(axis=0)),
    ('categorical', ct),
    ('col_selector', ColumnSelector(FEATURE_NAMES + [TARGET_VARIABLE]))
])

data_pipeline = Pipeline(steps=[
    ('data_preprocessing', preprocessing_pipeline),
    ('data_processing', processing_pipeline)
])
```

To apply all the transformations at once, it is only necessary to call `data_pipeline`. The process is split into two pipelines in order to
facilitate testing of the different transformations. This could also be implemented with regular Python functions, but, in my opinion, this
approach is easier to understand, easier to export to other environments, and allows the trained transformers to be applied to new data, avoiding data leakage.
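
A minimal usage sketch, continuing from the pipeline definition above: the file path mirrors the one used in `generate_plots.py`, and `df_new_rows` is a placeholder for any unseen data.

```python
import pandas as pd

# load the raw listings (path relative to the repository root) and run the
# whole cleaning/processing chain in one call
df_raw = pd.read_csv("data/raw/listings.csv", low_memory=False)
df_ready = data_pipeline.fit_transform(df_raw)

# once fitted, the same steps can be replayed on unseen rows without re-fitting,
# which is what avoids leaking information from the new data into the preprocessing
df_new_ready = data_pipeline.transform(df_new_rows)
```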

The different transformers could probably be improved or even merged for a cleaner implementation of the transformations. However, I tried to focus more
on the solution as a whole rather than on polishing the transformation code, as that part is easier to fix later.

### Testable code

To make the code testable, I separated the different stages of development into different scripts, as explained above. I also
added unit tests for the transformers to ensure that the results remain correct after changes, and the tests against the results of the
original code are useful to check for deviations in the global result.
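
As an illustration, a unit test for one of the transformers could look roughly like the sketch below; the exact assertions in `code/test/test_transformers.py` may differ, and the assumption here is that `StringToInt` extracts the first numeric group and casts it to an integer.

```python
import pandas as pd
from code.src.transformer import StringToInt

def test_string_to_int_parses_price():
    # a price string like "$150.00" should be reduced to the integer 150
    df = pd.DataFrame({'price': ['$150.00', '$90.00']})
    result = StringToInt('price', r"(\d+)").fit_transform(df)
    assert result['price'].tolist() == [150, 90]
```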

To facilitate the use of the code in different stages within CI, I divided the cleaning process into different pipelines according
to the notebooks. These pipelines are saved using joblib to make them reusable. Additionally, I deployed an `MLflow` instance and
registered the entire pipeline, the processing pipeline, and the trained model, wrapping the pipelines with the `mlflow.pyfunc` class.
This makes it easier to use the code in the API and avoids issues with the environment, code changes, or updates to the
models themselves.
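
A hedged sketch of that registration step is shown below. The registered model names match the ones loaded by `app/utils.py`, while `clf`, `processing_pipeline` and `PipelineWrapper` (an `mlflow.pyfunc.PythonModel` subclass) are illustrative names; the exact logging code may differ.

```python
import os
import joblib
import mlflow

mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))

# keep a local joblib copy of the fitted pipeline for reuse outside MLflow
joblib.dump(data_pipeline, "models/data_pipeline.joblib")

with mlflow.start_run():
    # the trained classifier is logged with the sklearn flavour under the
    # registry name expected by the API
    mlflow.sklearn.log_model(clf, artifact_path="model",
                             registered_model_name="price_category_clf")
    # the fitted processing pipeline is wrapped as a pyfunc model
    mlflow.pyfunc.log_model(artifact_path="processing_pipeline",
                            python_model=PipelineWrapper(processing_pipeline),
                            registered_model_name="processing_pipeline")
# the "@prod" alias referenced by the loaders is then pointed at these versions
# in the registry (e.g. via MlflowClient.set_registered_model_alias).
```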

## Challenge 2 - Build an API


To implement the API, I used the `FastAPI` framework along with Pydantic for validation of input/output data. The API is hosted
locally on `localhost:8000`. FastAPI includes an automatically generated documentation interface, `http://localhost:8000/`, where
example calls can be tested interactively.

The primary endpoint for this API can be accessed programmatically at `http://localhost:8000/model-inference`. The expected input
and output match the format in the README file. Additionally, the endpoint also supports arrays of values, provided all
fields have the same length and adhere to the defined input schema. Here is an example of calling the endpoint programmatically:

```python
import requests

payload = {
    "accommodates": [4, 4],
    "bathrooms": [2, 2],
    "bedrooms": [1, 1],
    "beds": [2, 2],
    "elevator": [1, 1],
    "id": [1001, 1001],
    "internet": [0, 0],
    "latitude": [40.71383, 456],
    "longitude": [-73.9658, 56],
    "neighbourhood": ["Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt"],
    "tv": [1, 1]
}
response = requests.post("http://localhost:8000/model-inference", json=payload)
response.json()
# expected output
{'id': [1001, 1001], 'price_category': ['High', 'High']}
```

## Challenge 3 - Dockerize your solution

To dockerize the solution I used Docker Compose with three Docker images:
- **App**: Creates the endpoint for the API that serves the predictions.
- **Mlflow**: Runs a server to save and load the model without copying the environment from one place to another.
- **Pipeline**: Contains all the code explained before; it saves the models, the logs and the plots. This image takes a bit of time because the timing tests for the plots use larger data samples.

There is also a `.env` file that stores the MLflow endpoint used by the other images to grant connectivity to the MLflow server. To deploy
the solution, it is only necessary to run `docker compose up --build` in the docker directory and wait around one minute for
everything to be ready.


Note: To run the different scripts locally, execute the code from the solution folder as follows:
```bash
PYTHONPATH="${PYTHONPATH}:../" python code/generate_plots.py
PYTHONPATH="${PYTHONPATH}:../" python code/pipeline.py
PYTHONPATH="${PYTHONPATH}:../" python code/test/test_transformers.py
PYTHONPATH="${PYTHONPATH}:../" python code/test/develope_tests/test_eda.py
PYTHONPATH="${PYTHONPATH}:../" python code/test/develope_tests/test_explore_classifier.py
```
46 changes: 46 additions & 0 deletions solution/app/main.py
from fastapi import FastAPI, HTTPException
import pandas as pd
import numpy as np
from app.models import ModelInput, ModelOutput
from app.utils import load_model, load_transformer

FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms']
OUT_CLASSES = np.array(['Low', 'Mid', 'High', 'Lux'])

app = FastAPI(
    title="Building Category prediction",
    description="API to infer the price category of a building from its characteristics",
    version="1.0.0",
    docs_url="/"
)


@app.post("/model-inference")
async def infer_price_category(input: ModelInput):

    model = load_model()
    transformer = load_transformer()

    model_input = dict(input)

    if model and transformer:
        try:
            # build a data frame with the input for the transformer;
            # if the fields do not all have the same length this raises an error
            if isinstance(model_input['id'], int):
                input_data = pd.DataFrame(model_input, index=[0])
            else:
                input_data = pd.DataFrame(model_input, index=list(range(len(model_input['id']))))

            # preprocess the data
            data = transformer.predict(input_data)
            data = data[FEATURE_NAMES].dropna(axis=0)
            category = model.predict(data)
            # map the numerical output to the corresponding class labels
            category_str = OUT_CLASSES[category]

            return ModelOutput(id=input.id,
                               price_category=category_str[0] if len(category) == 1 else category_str.tolist())
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error during the prediction: {str(e)}")
    else:
        raise HTTPException(status_code=500, detail="Model or pipeline not ready")
73 changes: 73 additions & 0 deletions solution/app/models.py
from pydantic import BaseModel, field_validator, conint, confloat, Field
from pydantic.functional_validators import AfterValidator
from enum import Enum
from typing import List, Union
from typing_extensions import Annotated


def validate_one_hot(value: Union[int, List[int]]) -> Union[int, List[int]]:
    # AfterValidator functions receive only the value being validated
    if isinstance(value, int):
        if value not in [0, 1]:
            raise ValueError("The input should be either 1 or 0")
    if isinstance(value, list):
        if not all(map(lambda x: x in [0, 1], value)):
            raise ValueError("All inputs in the list should be either 1 or 0")
    return value

OneZero = Annotated[Union[int, List[int]], AfterValidator(validate_one_hot)]


class RoomTypeEnum(str, Enum):
    shared_room = "Shared room"
    private_room = "Private room"
    entire_home_apt = "Entire home/apt"
    hotel_room = "Hotel room"


class NeighbourhoodEnum(str, Enum):
    bronx = "Bronx"
    queens = "Queens"
    staten_island = "Staten Island"
    brooklyn = "Brooklyn"
    manhattan = "Manhattan"


class ModelInput(BaseModel):
    id: Union[int, List[int]]
    accommodates: Union[conint(ge=0), List[conint(ge=0)]]
    room_type: Union[RoomTypeEnum, List[RoomTypeEnum]]
    beds: Union[conint(ge=0), List[conint(ge=0)]]
    bedrooms: Union[conint(ge=0), List[conint(ge=0)]]
    bathrooms: Union[conint(ge=0), List[conint(ge=0)], confloat(ge=0), List[confloat(ge=0)]]
    neighbourhood: Union[NeighbourhoodEnum, List[NeighbourhoodEnum]]
    tv: OneZero
    elevator: OneZero
    internet: OneZero
    latitude: Union[float, List[float]]
    longitude: Union[float, List[float]]

    class Config:
        json_schema_extra = {
            "examples": [
                {
                    "id": 1001,
                    "accommodates": 4,
                    "room_type": "Entire home/apt",
                    "beds": 2,
                    "bedrooms": 1,
                    "bathrooms": 2,
                    "neighbourhood": "Brooklyn",
                    "tv": 1,
                    "elevator": 1,
                    "internet": 0,
                    "latitude": 40.71383,
                    "longitude": -73.9658
                }
            ]
        }


class ModelOutput(BaseModel):
    id: Union[int, List[int]]
    price_category: Union[str, List[str]]
31 changes: 31 additions & 0 deletions solution/app/utils.py
import os
from pathlib import Path
import mlflow


mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))

def load_model():
    # classifier registered under the "prod" alias in the MLflow model registry
    try:
        return mlflow.sklearn.load_model("models:/price_category_clf@prod")
    except Exception as e:
        print(f"Error loading the model: {e}")
        return None


def load_pipeline():
    # full processing pipeline stored as a pyfunc model
    try:
        return mlflow.pyfunc.load_model("models:/processing_pipeline@prod")
    except Exception as e:
        print(f"Error loading the pipeline: {e}")
        return None


def load_transformer():
    # mapping transformer used to preprocess API inputs
    try:
        return mlflow.pyfunc.load_model("models:/mapping_transformer@prod")
    except Exception as e:
        print(f"Error loading the transformer: {e}")
        return None
Empty file added solution/code/__init__.py
39 changes: 39 additions & 0 deletions solution/code/generate_plots.py
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from code.src.plots import *

DIR_REPO = Path.cwd().parent
DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
FILEPATH_PLOTS = Path(DIR_REPO) / "solution" / "plots"
CAT_COLS = ['TV', 'Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Wifi', 'Elevator', 'Breakfast']



df_raw = pd.read_csv(FILEPATH_DATA, low_memory=False)

print("Generating bathroom func time test plot")
plot = plot_num_bathroom_from_text_time_test(df_raw)
fig1 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: num_bathroom_from_text_time')
plt.savefig(FILEPATH_PLOTS / "bathroom.png")
plt.close()

print("Generating price func time test plot")
plot = plot_price_to_test_time_test(df_raw)
fig2 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: price_text')
plt.savefig(FILEPATH_PLOTS / "price.png")
plt.close()

print("Generating pd.cut func time test plot")
plot = plot_pd_cut_time_test(df_raw)
fig3 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: pd_cut')
plt.savefig(FILEPATH_PLOTS / "pd_cut.png")
plt.close()

print("Generating category encoder func time test plot")
plot = plot_category_encoder_time_test(df_raw, CAT_COLS)
fig4 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: preprocess_amenities_column')
plt.savefig(FILEPATH_PLOTS / "cat_encoder.png")
plt.close()