diff --git a/solution/SOLUTIONS.md b/solution/SOLUTIONS.md
new file mode 100644
index 0000000..9c2d99c
--- /dev/null
+++ b/solution/SOLUTIONS.md
@@ -0,0 +1,170 @@
+## Challenge 1 - Refactor DEV code
+
+The refactoring was done with three main objectives in mind:
+
+- Time optimization
+- Increase code maintainability
+- Make the code testable at different stages
+
+### Time optimization
+
+To increase the time efficiency of the code, I analyzed the different fragments of code in the notebooks.
+There were four fragments that could potentially be optimized, mainly pandas `apply` operations. For each function to be refactored,
+I created a function that applies the original and the refactored code to data inputs of different sizes, measures the time, and creates a plot.
+
+Three of the refactored code snippets achieved a reduction in time, and the difference increases linearly with the data size. To reproduce
+this test, I created a Python script called `generate_plots.py`; the plots are stored in the folder `plots`. Here are the plots for each
+function. In all the plots, the blue line corresponds to the original code.
+
+#### 1. Parse bathroom text to integer
+![parse_bathroom](/solution/plots/bathroom.png)
+
+#### 2. Extract amenities
+This function extracts the different amenities. While its runtime is close to that of the original implementation, the original code
+could easily lead to errors due to the copy/paste of the same code for different columns; my implementation only relies on
+a list with the different amenities to extract. In the tests this code only sometimes outperformed the original; either way, I
+think that the new implementation is better for code maintenance.
+
+![Amenities](/solution/plots/cat_encoder.png)
+
+#### 3. Pandas cut function
+
+The numpy implementation is similar in code complexity, but `pandas.cut` is easier to understand, so I kept this part as it is.
+
+![Pandas_cut](/solution/plots/pd_cut.png)
+
+#### 4. Parse the string price to int
+![Parse_price](/solution/plots/price.png)
+
+Note: Some of these conclusions may vary slightly depending on whether the code is executed in Docker or locally.
+
+In addition, there are two scripts dedicated to comparing the original implementation with mine, one for each notebook. These tests can
+be found in the path `code/test/develope_tests`: `test_eda.py` compares against the notebook `01-experatory-data-analysis.ipynb` and
+`test_explore_classifier.py` compares against the notebook `02-explore-classifier-model.ipynb`. The results of this code can be seen in
+the logs folder, in `test_eda.log` and `test_explore.log` respectively.
+
+The result for the first one is always a bit worse than the original code. This is due to the implementation of the different steps
+of the cleaning and processing stages via sklearn `ColumnTransformer` and `Pipeline` objects: the fitting method adds some extra time to
+compute the result. Nevertheless, I think that this delay is worth it because it allows the same preprocessing steps to be applied to
+unseen data, which is usually desired when running the generated model on new data.
+
+In regards to the second script, the execution time is slightly better in the refactored code, but the difference between the implementations is very small.
+
+### Maintainability
+
+As mentioned before, to improve maintainability, I decided to implement the various steps as separate custom column transformers using
+`sklearn`. This approach allows for easier modification of the process and the addition of new steps. The different transformers are
+saved in `code/src/transformer.py`.
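+
+As an illustration of the pattern these transformers follow, here is a minimal sketch (the `ColumnDoubler` name and behaviour are hypothetical, not part of the actual codebase):
+
+```python
+from sklearn.base import BaseEstimator, TransformerMixin
+
+class ColumnDoubler(BaseEstimator, TransformerMixin):
+    """Toy transformer: doubles the values of a single numeric column."""
+    def __init__(self, column: str):
+        self.column = column
+
+    def fit(self, X, y=None):
+        # Stateless transformer, nothing to learn
+        return self
+
+    def transform(self, X):
+        X_copy = X.copy()
+        X_copy[self.column] = X_copy[self.column] * 2
+        return X_copy
+```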
+Unit tests are also included to ensure that the code behaves as expected. By using pipelines, the
+entire process can be summarized as follows:
+
+```python
+preprocessing_pipeline = Pipeline(steps=[
+    ('col_selector', ColumnSelector(COLUMNS)),
+    ('bathroom_processing', StringToFloatTransformer({'bathrooms_text': 'bathrooms'})),
+    ('cast_price', StringToInt('price', r"(\d+)")),
+    ('filter_rows', QueryFilter("price >= 10")),
+    ('drop_na', DropNan(axis=0)),
+    ('bin_price', DiscretizerTransformer('price', 'category', bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])),
+    ('array_to_cat', ArrayOneHotEncoder('amenities', CAT_COLS)),
+    ('col_renamer_conditioning', ColumnRenamer(columns={'Air conditioning': 'Air_conditioning', 'neighbourhood_group_cleansed': 'neighbourhood'})),
+    ('drop_cols', DropColumns('amenities'))
+])
+
+ct = ColumnTransformer(
+    [
+        ('ordinal_encoder', CustomOrdinalEncoder(categories=[list(MAP_NEIGHB.keys()), list(MAP_ROOM_TYPE.keys())], start_category=1), ["neighbourhood", "room_type"])
+    ],
+    remainder="passthrough",
+    verbose_feature_names_out=False
+)
+
+processing_pipeline = Pipeline(steps=[
+    ('drop_na', DropNan(axis=0)),
+    ('categorical', ct),
+    ('col_selector', ColumnSelector(FEATURE_NAMES + [TARGET_VARIABLE]))
+])
+
+data_pipeline = Pipeline(steps=[
+    ('data_preprocessing', preprocessing_pipeline),
+    ('data_processing', processing_pipeline)
+])
+```
+
+To apply all the transformations at once, it is only necessary to call `data_pipeline`. The process is divided in order to facilitate
+the testing of the different transformations. This could also be implemented by creating different regular Python functions, but, in my opinion, this
+approach is easier to understand, easier to export to other environments, and allows the trained transformers to be applied to new data, avoiding data leakage.
+
+The different transformers could probably be improved or even merged for a cleaner implementation of the transformations. However, I tried to focus more
+on the whole solution rather than aiming for perfect transformation code, as that part is easier to fix.
+
+### Testable code
+
+To make the code testable, I separated the different stages of development into different scripts, as already explained above. I also
+added unit tests for the transformers to ensure that the results remain correct after changes. The tests against the results from the
+original code are useful to check deviations in the global result.
+
+To facilitate the use of the code in different stages within CI, I divided the cleaning process into different pipelines according
+to the notebooks. These pipelines are saved using joblib to make them reusable. Additionally, I deployed an `MLflow` instance to
+save the model and the pipelines, using the `mlflow.pyfunc` model class for the entire pipeline, the processing pipeline, and the trained
+model. This makes it easier to use this code in the API, avoiding issues with the environment, code changes, or updates in the
+models themselves.
+
+## Challenge 2 - Build an API
+
+To implement the API, I used the `FastAPI` framework along with Pydantic for validation of input/output data. The API is hosted
+locally on `localhost:8000`. FastAPI includes an automatically generated documentation interface at `http://localhost:8000/`, where
+example calls can be tested interactively.
+
+The primary endpoint for this API can be accessed programmatically at `http://localhost:8000/model-inference`. The expected input
+and output match the format in the README file.
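+
+For a single record, a request looks like the following sketch (the payload mirrors the schema example in `app/models.py`; the exact predicted category depends on the trained model):
+
+```python
+import requests
+
+payload = {
+    "id": 1001,
+    "accommodates": 4,
+    "room_type": "Entire home/apt",
+    "beds": 2,
+    "bedrooms": 1,
+    "bathrooms": 2,
+    "neighbourhood": "Brooklyn",
+    "tv": 1,
+    "elevator": 1,
+    "internet": 0,
+    "latitude": 40.71383,
+    "longitude": -73.9658
+}
+response = requests.post("http://localhost:8000/model-inference", json=payload)
+response.json()
+# e.g. {'id': 1001, 'price_category': 'High'}
+```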
+Additionally, the endpoint also supports an array of elements, provided all
+fields have the same length and adhere to the defined input schema. Here is an example of calling the endpoint programmatically:
+
+```python
+import requests
+
+payload = {
+    "accommodates": [4, 4],
+    "bathrooms": [2, 2],
+    "bedrooms": [1, 1],
+    "beds": [2, 2],
+    "elevator": [1, 1],
+    "id": [1001, 1001],
+    "internet": [0, 0],
+    "latitude": [40.71383, 456],
+    "longitude": [-73.9658, 56],
+    "neighbourhood": ["Brooklyn", "Brooklyn"],
+    "room_type": ["Entire home/apt", "Entire home/apt"],
+    "tv": [1, 1]
+}
+response = requests.post("http://localhost:8000/model-inference", json=payload)
+response.json()
+# expected output
+{'id': [1001, 1001], 'price_category': ['High', 'High']}
+```
+
+## Challenge 3 - Dockerize your solution
+
+To dockerize the solution I used Docker Compose with three Docker images:
+- **App**: Creates the endpoint for the API to get the predictions.
+- **Mlflow**: Creates a server to save and load the model without copying the environment from one place to another.
+- **Pipeline**: This image contains all the code explained before; it saves the models, the logs and the plots. This image takes a while to build because the plots are generated from larger data samples.
+
+There is also a `.env` file that stores the MLflow endpoint for the other images, to grant connectivity to the MLflow server. To deploy
+the solution it is only necessary to run `docker compose up --build` in the docker directory and wait around one minute to have
+everything ready.
+
+
+Note: To run the different scripts locally, execute the code from the solution folder as follows:
+```bash
+PYTHONPATH="${PYTHONPATH}:../" python code/generate_plots.py
+PYTHONPATH="${PYTHONPATH}:../" python code/pipeline.py
+PYTHONPATH="${PYTHONPATH}:../" python code/test/test_transformers.py
+PYTHONPATH="${PYTHONPATH}:../" python code/test/develope_tests/test_eda.py
+PYTHONPATH="${PYTHONPATH}:../" python code/test/develope_tests/test_explore_classifier.py
+```
\ No newline at end of file
diff --git a/solution/app/main.py b/solution/app/main.py
new file mode 100644
index 0000000..16f8fa0
--- /dev/null
+++ b/solution/app/main.py
@@ -0,0 +1,46 @@
+from fastapi import FastAPI, HTTPException
+import pandas as pd
+import numpy as np
+from app.models import ModelInput, ModelOutput
+from app.utils import load_model, load_transformer
+
+FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms']
+OUT_CLASSES = np.array(['Low', 'Mid', 'High', 'Lux'])
+
+app = FastAPI(
+    title="Building Category prediction",
+    description="API to infer the price category of a building from its characteristics",
+    version="1.0.0",
+    docs_url="/"
+)
+
+
+@app.post("/model-inference")
+async def infer_price_category(input: ModelInput):
+
+    model = load_model()
+    transformer = load_transformer()
+
+    model_input = dict(input)
+
+    if model:
+        try:
+            # Build a data frame with the input for the transformer.
+            # If the fields do not all have the same length this will raise an error.
+            if isinstance(model_input['id'], int):
+                input_data = pd.DataFrame(model_input, index=[0])
+            else:
+                input_data = pd.DataFrame(model_input, index=list(range(len(model_input['id']))))
+
+            # preprocess the data
+            data = transformer.predict(input_data)
+            data = data[FEATURE_NAMES].dropna(axis=0)
+            category = model.predict(data)
+            # parse the numerical output to the corresponding classes
+            category_str = OUT_CLASSES[category]
+
+            return ModelOutput(id=input.id, price_category=category_str[0] if len(category) == 1
+                               else category_str)
+        except Exception as e:
+            raise HTTPException(status_code=500, detail=f"Error during the prediction: {str(e)}")
+    else:
+        raise HTTPException(status_code=500, detail="Model or pipeline not ready")
diff --git a/solution/app/models.py b/solution/app/models.py
new file mode 100644
index 0000000..962af50
--- /dev/null
+++ b/solution/app/models.py
@@ -0,0 +1,73 @@
+from pydantic import BaseModel, field_validator, conint, confloat, Field
+from pydantic.functional_validators import AfterValidator
+from enum import Enum
+from typing import List, Union
+from typing_extensions import Annotated
+
+
+def validate_one_hot(value: Union[int, List[int]]) -> Union[int, List[int]]:
+
+    if isinstance(value, int):
+        if value not in [0, 1]:
+            raise ValueError("The input should be either 1 or 0")
+    if isinstance(value, list):
+        if not all(map(lambda x: x in [0, 1], value)):
+            raise ValueError("All inputs in the list should be either 1 or 0")
+    return value
+
+OneZero = Annotated[Union[int, List[int]], AfterValidator(validate_one_hot)]
+
+
+class RoomTypeEnum(str, Enum):
+    shared_room = "Shared room"
+    private_room = "Private room"
+    entire_home_apt = "Entire home/apt"
+    hotel_room = "Hotel room"
+
+class NeighbourhoodEnum(str, Enum):
+    bronx = "Bronx"
+    queens = "Queens"
+    staten_island = "Staten Island"
+    brooklyn = "Brooklyn"
+    manhattan = "Manhattan"
+
+
+class ModelInput(BaseModel):
+    id: Union[int, List[int]]
+    accommodates: Union[conint(ge=0), List[conint(ge=0)]]
+    room_type: Union[RoomTypeEnum, list[RoomTypeEnum]]
+    beds: Union[conint(ge=0), List[conint(ge=0)]]
+    bedrooms: Union[conint(ge=0), List[conint(ge=0)]]
+    bathrooms: Union[conint(ge=0), List[conint(ge=0)], confloat(ge=0), List[confloat(ge=0)]]
+    neighbourhood: Union[NeighbourhoodEnum, list[NeighbourhoodEnum]]
+    tv: OneZero
+    elevator: OneZero
+    internet: OneZero
+    latitude: Union[float, List[float]]
+    longitude: Union[float, List[float]]
+
+
+    class Config:
+        json_schema_extra = {
+            "examples": [
+                {
+                    "id": 1001,
+                    "accommodates": 4,
+                    "room_type": "Entire home/apt",
+                    "beds": 2,
+                    "bedrooms": 1,
+                    "bathrooms": 2,
+                    "neighbourhood": "Brooklyn",
+                    "tv": 1,
+                    "elevator": 1,
+                    "internet": 0,
+                    "latitude": 40.71383,
+                    "longitude": -73.9658
+                }
+            ]
+        }
+
+
+class ModelOutput(BaseModel):
+    id: Union[int, List[int]]
+    price_category: Union[str, List[str]]
diff --git a/solution/app/utils.py b/solution/app/utils.py
new file mode 100644
index 0000000..3cba0b4
--- /dev/null
+++ b/solution/app/utils.py
@@ -0,0 +1,31 @@
+import os
+from pathlib import Path
+import mlflow
+
+
+mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
+
+def load_model():
+
+    try:
+        return mlflow.sklearn.load_model("models:/price_category_clf@prod")
+    except Exception as e:
+        print(f"Error loading the model: {e}")
+        return None
+
+def load_pipeline():
+
+    try:
+        return mlflow.pyfunc.load_model("models:/processing_pipeline@prod")
+    except Exception as e:
+        print(f"Error loading the pipeline: {e}")
+        return None
+
+
+def load_transformer():
+
+    try:
+        return mlflow.pyfunc.load_model("models:/mapping_transformer@prod")
+    except Exception as e:
+        print(f"Error loading the transformer: {e}")
+        return None
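+
+# Usage sketch (hypothetical caller, not part of the API code): all three
+# loaders return None on failure, so callers should check the result before
+# predicting, e.g.
+#
+#   model = load_model()
+#   transformer = load_transformer()
+#   if model is not None and transformer is not None:
+#       features = transformer.predict(raw_dataframe)
+#       predictions = model.predict(features)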
diff --git a/solution/code/__init__.py b/solution/code/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/solution/code/generate_plots.py b/solution/code/generate_plots.py
new file mode 100644
index 0000000..052178c
--- /dev/null
+++ b/solution/code/generate_plots.py
@@ -0,0 +1,39 @@
+from pathlib import Path
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn as sns
+from code.src.plots import *
+
+DIR_REPO = Path.cwd().parent
+DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
+FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
+FILEPATH_PLOTS = Path(DIR_REPO) / "solution" / "plots"
+CAT_COLS = ['TV', 'Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Wifi', 'Elevator', 'Breakfast']
+
+
+df_raw = pd.read_csv(FILEPATH_DATA, low_memory=False)
+
+print("Generating bathroom func time test plot")
+plot = plot_num_bathroom_from_text_time_test(df_raw)
+fig1 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: num_bathroom_from_text_time')
+plt.savefig(FILEPATH_PLOTS / "bathroom.png")
+plt.close()
+
+print("Generating price func time test plot")
+plot = plot_price_to_test_time_test(df_raw)
+fig2 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: price_text')
+plt.savefig(FILEPATH_PLOTS / "price.png")
+plt.close()
+
+print("Generating pd.cut func time test plot")
+plot = plot_pd_cut_time_test(df_raw)
+fig3 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: pd_cut')
+plt.savefig(FILEPATH_PLOTS / "pd_cut.png")
+plt.close()
+
+print("Generating category encoder func time test plot")
+plot = plot_category_encoder_time_test(df_raw, CAT_COLS)
+fig4 = sns.lineplot(plot, x='n', y='time', hue='method').set_title('Time optimizer function: preprocess_amenities_column')
+plt.savefig(FILEPATH_PLOTS / "cat_encoder.png")
+plt.close()
diff --git a/solution/code/pipeline.py b/solution/code/pipeline.py
new file mode 100644
index 0000000..56260e3
--- /dev/null
+++ b/solution/code/pipeline.py
@@ -0,0 +1,286 @@
+import os
+from pathlib import Path
+import joblib
+import numpy as np
+import pandas as pd
+from sklearn.pipeline import Pipeline
+from sklearn.compose import ColumnTransformer
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix
+import mlflow
+from mlflow.models import infer_signature
+
+# Import custom functions
+from code.src.transformer import (
+    ArrayOneHotEncoder,
+    ColumnRenamer,
+    ColumnSelector,
+    CustomOrdinalEncoder,
+    DropColumns,
+    DropNan,
+    DiscretizerTransformer,
+    QueryFilter,
+    StringToFloatTransformer,
+    StringToInt,
+)
+
+# Global variables
+DIR_REPO = Path.cwd().parent
+DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
+FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
+FILEPATH_PLOTS = Path(DIR_REPO) / "solution" / "plots"
+MODEL_PATH = DIR_REPO / "solution" / "models"
+
+COLUMNS = ['id', 'neighbourhood_group_cleansed', 'property_type', 'room_type', 'latitude', 'longitude', 'accommodates', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price']
+CAT_COLS = ['TV', 'Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Wifi', 'Elevator', 'Breakfast']
+
+MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
+MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}
+
+FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms']
+TARGET_VARIABLE = "category"
+
+
+mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
+
+if __name__ == "__main__":
+
+    print("Reading data")
+    df_raw = pd.read_csv(FILEPATH_DATA)
+
+    print("Building preprocessing pipeline")
+    preprocessing_pipeline = Pipeline(steps=[
+        ('col_selector', ColumnSelector(COLUMNS)),
+        ('bathroom_processing', StringToFloatTransformer({'bathrooms_text': 'bathrooms'})),
+        ('cast_price', StringToInt('price', r"(\d+)")),
+        ('filter_rows', QueryFilter("price >= 10")),
+        ('drop_na', DropNan(axis=0)),
+        ('bin_price', DiscretizerTransformer('price', 'category', bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])),
+        ('array_to_cat', ArrayOneHotEncoder('amenities', CAT_COLS)),
+        ('col_renamer_conditioning', ColumnRenamer(columns={'Air conditioning': 'Air_conditioning', 'neighbourhood_group_cleansed': 'neighbourhood'})),
+        ('drop_cols', DropColumns('amenities'))
+    ])
+    preprocessing_pipeline.set_output(transform='pandas')
+
+    print("Building processing pipeline")
+
+    ct = ColumnTransformer(
+        [
+            ('ordinal_encoder', CustomOrdinalEncoder(categories=[list(MAP_NEIGHB.keys()), list(MAP_ROOM_TYPE.keys())], start_category=1), ["neighbourhood", "room_type"])
+        ],
+        remainder="passthrough",
+        verbose_feature_names_out=False
+    )
+    ct.set_output(transform='pandas')
+
+    processing_pipeline = Pipeline(steps=[
+        ('drop_na', DropNan(axis=0)),
+        ('categorical', ct),
+        ('col_selector', ColumnSelector(FEATURE_NAMES + [TARGET_VARIABLE]))
+    ])
+    processing_pipeline.set_output(transform='pandas')
+
+    data_pipeline = Pipeline(steps=[
+        ('data_preprocessing', preprocessing_pipeline),
+        ('data_processing', processing_pipeline)
+    ])
+
+    # fit the pipeline only with the training data
+    print("Fitting pipeline")
+    data_pipeline.fit(df_raw)
+
+    # make sure the target directory exists before dumping the artifacts
+    os.makedirs(MODEL_PATH, exist_ok=True)
+
+    print("Saving col transformer")
+    joblib.dump(ct, MODEL_PATH / "col_transformer.joblib")
+
+    print("Saving preprocessing pipeline")
+    joblib.dump(preprocessing_pipeline, MODEL_PATH / "preprocessing_pipeline.joblib")
+
+    print("Saving processing pipeline")
+    joblib.dump(processing_pipeline, MODEL_PATH / "processing_pipeline.joblib")
+
+    print("Saving pipeline")
+    joblib.dump(data_pipeline, MODEL_PATH / "pipeline.joblib")
+
+    print("Saving pipeline artifacts to mlflow")
+
+    class ProcessingPipeline(mlflow.pyfunc.PythonModel):
+
+        """Class that applies the fitted processing pipeline to new data"""
+
+        def __init__(self):
+            self.whole_pipeline = None
+            self.preprocessing_pipeline = None
+            self.processing_pipeline = None
+            self.column_transformer = None
+
+        def load_context(self, context):
+
+            self.whole_pipeline = joblib.load(context.artifacts['whole_pipe'])
+            self.preprocessing_pipeline = joblib.load(context.artifacts['prepro_pipe'])
+            self.processing_pipeline = joblib.load(context.artifacts['proc_pipe'])
+            self.column_transformer = joblib.load(context.artifacts['col_trans'])
+
+        def predict(self, context, model_input):
+
+            if self.whole_pipeline:
+                return self.whole_pipeline.transform(model_input)
+            else:
+                raise ValueError("The model has not been loaded")
+
+    class MappingTransformer(mlflow.pyfunc.PythonModel):
+
+        """Class that applies the category mapping to new data"""
+
+        def __init__(self):
+            pass
+
+        def load_context(self, context):
+            self.column_transformer = joblib.load(context.artifacts['col_trans'])
+
+        def predict(self, context, model_input):
+
+            columns_to_apply = ["neighbourhood", "room_type"]
+            if self.column_transformer:
+                try:
+                    return self.column_transformer.transform(model_input)
+                except Exception:
+                    try:
+                        model_input[columns_to_apply] = self.column_transformer['ordinal_encoder'].transform(model_input[columns_to_apply])
+                        return model_input
+                    except Exception:
+                        raise ValueError(f"Necessary columns not present: {columns_to_apply}")
+            else:
+                raise ValueError("The model has not been loaded")
+
+
+    print("Applying pipeline")
+    df_processed = data_pipeline.transform(df_raw)
+    X = df_processed[FEATURE_NAMES]
+    y = df_processed[TARGET_VARIABLE]
+
+    print("Splitting data")
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
+
+    print("Processed train data")
+    print(f"Dataset shape: {X_train.shape}")
+
+    print("Processed test data")
+    print(f"Dataset shape: {X_test.shape}")
+
+    print("Training Random Forest Model")
+    clf = RandomForestClassifier(n_estimators=500, random_state=0, class_weight='balanced', n_jobs=4)
+    clf.fit(X_train, y_train)
+
+    print("Model evaluation")
+    y_pred = clf.predict(X_test)
+    print(f"Accuracy: {accuracy_score(y_test, y_pred):1.4f}")
+
+    y_proba = clf.predict_proba(X_test)
+    print(f"ROC score: {roc_auc_score(y_test, y_proba, multi_class='ovr'):1.4f}")
+
+    print("Saving model")
+    joblib.dump(clf, MODEL_PATH / "classifier.joblib")
+
+
+    mlflow.set_experiment(experiment_name="price_category_predictor")
+    client = mlflow.client.MlflowClient()
+    with mlflow.start_run() as run:
+
+        print("Logging pipeline to mlflow")
+        # log prod pipeline model
+        mlflow.pyfunc.log_model(
+            artifact_path="processing_pipeline",
+            python_model=ProcessingPipeline(),
+            registered_model_name="processing_pipeline",
+            artifacts={
+                'whole_pipe': str(MODEL_PATH / "pipeline.joblib"),
+                'prepro_pipe': str(MODEL_PATH / "preprocessing_pipeline.joblib"),
+                'proc_pipe': str(MODEL_PATH / "processing_pipeline.joblib"),
+                'col_trans': str(MODEL_PATH / "col_transformer.joblib")
+            },
+            pip_requirements=open(DIR_REPO / 'solution' / "requirements.txt", 'r').read().split('\n'),
+            code_paths=[str(DIR_REPO / 'solution' / "code")]
+        )
+        latest_version = client.search_registered_models(filter_string="name = 'processing_pipeline'")[0].latest_versions[0].version
+        client.set_registered_model_alias('processing_pipeline', 'prod', latest_version)
+
+
+        print("Logging transformer to mlflow")
+        # save column transformer
+        mlflow.pyfunc.log_model(
+            artifact_path="transformer",
+            python_model=MappingTransformer(),
registered_model_name="mapping_transformer", + artifacts = { + 'col_trans': str(MODEL_PATH / "col_transformer.joblib") + }, + pip_requirements = ['pandas', 'scikit-learn', 'numpy'], + code_paths = [ str(DIR_REPO / 'solution' / "code")] + ) + + latest_version = client.search_registered_models(filter_string="name = 'mapping_transformer'")[0].latest_versions[0].version + client.set_registered_model_alias('mapping_transformer', 'prod', latest_version) + + print("Logging model to mlflow") + # log prod training model + signature = infer_signature(X_test, y_train) + + mlflow.sklearn.log_model( + clf, + artifact_path = "artifacts", + signature = signature, + registered_model_name="price_category_clf", + input_example = X_test[:1] + ) + + latest_version = client.search_registered_models(filter_string="name = 'price_category_clf'")[0].latest_versions[0].version + client.set_registered_model_alias('price_category_clf', 'prod', latest_version) \ No newline at end of file diff --git a/solution/code/src/__init__.py b/solution/code/src/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/solution/code/src/plots.py b/solution/code/src/plots.py new file mode 100644 index 0000000..706363b --- /dev/null +++ b/solution/code/src/plots.py @@ -0,0 +1,206 @@ +from pathlib import Path +import datetime +import pandas as pd +import numpy as np +import seaborn as sns +import re +import src.transformer as tr + + + +def plot_num_bathroom_from_text_time_test(df_raw): + + plot_data = [] + for n in range(4,8): + sample_data = np.resize(df_raw.bathrooms_text, 10**n) + sample = pd.Series(sample_data) + t1 = datetime.datetime.now() + _ = sample.apply(tr.num_bathroom_from_text) + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'apply', + + } + ) + + t1 = datetime.datetime.now() + _ = list(map(tr.num_bathroom_from_text, sample_data)) + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'map', + + } + ) + + t1 = datetime.datetime.now() + _ = [tr.num_bathroom_from_text(text) for text in sample_data] + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'inline_for_loop', + + } + ) + + return pd.DataFrame(plot_data) + + +def plot_price_to_test_time_test(df_raw): + + plot_data = [] + for n in range(4,8): + sample_data = np.resize(df_raw.price, 10**n) + sample = pd.Series(sample_data) + t1 = datetime.datetime.now() + _ = sample.str.extract(r"(\d+).").astype(int) + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'apply', + + } + ) + + compiled_pattern = re.compile(r'\d+') + t1 = datetime.datetime.now() + _ = list(map(lambda x: int(tr.apply_regex(x, compiled_pattern)), sample_data)) + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'map', + + } + ) + + t1 = datetime.datetime.now() + _ = [int(tr.apply_regex(text, compiled_pattern)) for text in sample_data] + t2 = datetime.datetime.now() + + plot_data.append( + { + 'n': 10**n, + 'time': (t2-t1).total_seconds(), + 'method': 'inline_for_loop', + + } + ) + + return pd.DataFrame(plot_data) + + +def plot_pd_cut_time_test(df_raw): + + plot_data = [] + for n in range(4,8): + sample_data = np.resize(df_raw.price, 10**n) + sample = pd.Series(sample_data).str.extract(r"(\d+).").astype(int).to_numpy().flatten() + t1 = datetime.datetime.now() + _ = 
diff --git a/solution/code/src/__init__.py b/solution/code/src/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/solution/code/src/plots.py b/solution/code/src/plots.py
new file mode 100644
index 0000000..706363b
--- /dev/null
+++ b/solution/code/src/plots.py
@@ -0,0 +1,206 @@
+from pathlib import Path
+import datetime
+import pandas as pd
+import numpy as np
+import seaborn as sns
+import re
+import src.transformer as tr
+
+
+def plot_num_bathroom_from_text_time_test(df_raw):
+
+    plot_data = []
+    for n in range(4, 8):
+        sample_data = np.resize(df_raw.bathrooms_text, 10**n)
+        sample = pd.Series(sample_data)
+
+        t1 = datetime.datetime.now()
+        _ = sample.apply(tr.num_bathroom_from_text)
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'apply'})
+
+        t1 = datetime.datetime.now()
+        _ = list(map(tr.num_bathroom_from_text, sample_data))
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'map'})
+
+        t1 = datetime.datetime.now()
+        _ = [tr.num_bathroom_from_text(text) for text in sample_data]
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'inline_for_loop'})
+
+    return pd.DataFrame(plot_data)
+
+
+def plot_price_to_test_time_test(df_raw):
+
+    plot_data = []
+    for n in range(4, 8):
+        sample_data = np.resize(df_raw.price, 10**n)
+        sample = pd.Series(sample_data)
+
+        t1 = datetime.datetime.now()
+        _ = sample.str.extract(r"(\d+).").astype(int)
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'apply'})
+
+        compiled_pattern = re.compile(r'\d+')
+        t1 = datetime.datetime.now()
+        _ = list(map(lambda x: int(tr.apply_regex(x, compiled_pattern)), sample_data))
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'map'})
+
+        t1 = datetime.datetime.now()
+        _ = [int(tr.apply_regex(text, compiled_pattern)) for text in sample_data]
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'inline_for_loop'})
+
+    return pd.DataFrame(plot_data)
+
+
+def plot_pd_cut_time_test(df_raw):
+
+    plot_data = []
+    for n in range(4, 8):
+        sample_data = np.resize(df_raw.price, 10**n)
+        sample = pd.Series(sample_data).str.extract(r"(\d+).").astype(int).to_numpy().flatten()
+
+        t1 = datetime.datetime.now()
+        _ = pd.cut(sample, bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'pd.cut'})
+
+        t1 = datetime.datetime.now()
+        _ = tr.array_binding(sample, bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'numpy'})
+
+    return pd.DataFrame(plot_data)
+
+
+def plot_category_encoder_time_test(df_raw, cols):
+
+    plot_data = []
+    for n in range(2, 6):
+        sample_data = pd.Series(np.resize(df_raw.amenities, 10**n), name='amenities')
+        sample = sample_data.reset_index()
+
+        t1 = datetime.datetime.now()
+        _ = tr.preprocess_amenities_column(sample)
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'custom_function'})
+
+        t1 = datetime.datetime.now()
+        _ = sample_data.apply(lambda x: tr.find_categories(x, cols))
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'pd.apply'})
+
+        t1 = datetime.datetime.now()
+        _ = [tr.find_categories(x, cols) for x in sample_data]
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'loop'})
+
+        t1 = datetime.datetime.now()
+        _ = list(map(lambda x: tr.find_categories(x, cols), sample_data))
+        t2 = datetime.datetime.now()
+        plot_data.append({'n': 10**n, 'time': (t2 - t1).total_seconds(), 'method': 'map'})
+
+    return pd.DataFrame(plot_data)
diff --git a/solution/code/src/transformer.py b/solution/code/src/transformer.py
new file mode 100644
index 0000000..b99dd5c
--- /dev/null
+++ b/solution/code/src/transformer.py
@@ -0,0 +1,426 @@
+from sklearn.base import BaseEstimator, TransformerMixin
+from sklearn.preprocessing import OrdinalEncoder
+from typing import Any, Dict, List
+import re
+
+import numpy as np
+import numpy.typing as npt
+import pandas as pd
+from pandas import DataFrame
+
+
+# Get number of bathrooms from `bathrooms_text`
+def num_bathroom_from_text(text):
+    try:
+        if isinstance(text, str):
+            bath_num = text.split(" ")[0]
+            return float(bath_num)
+        else:
+            return np.nan
+    except ValueError:
+        return np.nan
+
+def array_binding(array: List[int | float], bins: List[int | float], labels: List[Any]):
+    """
+    Replicate the behaviour of the pandas.cut() function with numpy.
+
+    Parameters
+    ----------
+    array : array-like of length n_samples
+        The input data to be binned.
+
+    bins : array-like of length n_bins + 1
+        The bin edges, representing the intervals for binning.
+
+    labels : list or array-like
+        Labels corresponding to the bins.
+
+    Returns
+    -------
+    An array of the same shape as `array`, where each element corresponds to
+    the label of the bin in which that element falls.
+    """
+
+    bin_indices = np.digitize(array, bins)
+    # shift to 0-based label indices and clamp to the valid label range
+    bin_indices = np.clip(bin_indices - 1, 0, len(labels) - 1)
+
+    return np.array(labels)[bin_indices]
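+
+# Worked example (illustrative values): np.digitize([15, 95, 200, 500],
+# [10, 90, 180, 400, np.inf]) returns [1, 2, 3, 4]; after the -1 shift the
+# label indices are [0, 1, 2, 3], so
+#
+#   array_binding([15, 95, 200, 500], [10, 90, 180, 400, np.inf], [0, 1, 2, 3])
+#   # -> array([0, 1, 2, 3])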
+ + """ + + bin_indices = np.digitize(array, bins) + + bin_indices = np.clip(bin_indices - 1, 0, len(bins) - 1) + + + return np.array(labels)[bin_indices] + +def preprocess_amenities_column(df: DataFrame) -> DataFrame: + + df['TV'] = df['amenities'].str.contains('TV') + df['TV'] = df['TV'].astype(int) + df['Internet'] = df['amenities'].str.contains('Internet') + df['Internet'] = df['Internet'].astype(int) + df['Air_conditioning'] = df['amenities'].str.contains('Air conditioning') + df['Air_conditioning'] = df['Air_conditioning'].astype(int) + df['Kitchen'] = df['amenities'].str.contains('Kitchen') + df['Kitchen'] = df['Kitchen'].astype(int) + df['Heating'] = df['amenities'].str.contains('Heating') + df['Heating'] = df['Heating'].astype(int) + df['Wifi'] = df['amenities'].str.contains('Wifi') + df['Wifi'] = df['Wifi'].astype(int) + df['Elevator'] = df['amenities'].str.contains('Elevator') + df['Elevator'] = df['Elevator'].astype(int) + df['Breakfast'] = df['amenities'].str.contains('Breakfast') + df['Breakfast'] = df['Breakfast'].astype(int) + + df.drop('amenities', axis=1, inplace=True) + + return df + + +def find_categories(array, categories)-> Dict[str, int]: + string = str(array) + return {category: int(category in string) for category in categories} + +# Same regex function +def apply_regex(text, pattern): + match = pattern.search(text) + return match.group(0) if match else None + +# Custom transformer split a string by spaces and cast to float the first element +class StringToFloatTransformer(BaseEstimator, TransformerMixin): + def __init__(self, columns: Dict[str, str] | str): + # Specify which columns to apply the transformation to and the new name (optional) + self.columns = columns + + def fit(self, X, y=None): + # No fitting needed + return self + + def transform(self, X: DataFrame | np.ndarray): + X_copy = X.copy() + if self.columns: + if isinstance(X, DataFrame): + if isinstance(self.columns, dict): + for old_name, new_name in self.columns.items(): + X_copy[new_name] =list(map(num_bathroom_from_text, X_copy[old_name])) + else: + for col in self.columns: + X_copy[col] =list(map(num_bathroom_from_text, X_copy[col])) + self.out_cols = list(X_copy.columns) + if isinstance(X, np.ndarray): + if isinstance(self.columns, dict): + self.out_cols = list(self.columns.values()) + for i in range(len(self.columns)): + X_copy[i] =list(map(num_bathroom_from_text, X_copy[i])) + else: + self.out_cols = self.columns + for i in self.columns: + X_copy[i] =list(map(num_bathroom_from_text, X_copy[i])) + + return X_copy + + def get_feature_names_out(self, columns): + return self.out_cols + +# Custom transformer to parse to numeric the number of bathrooms +class ColumnSelector(BaseEstimator, TransformerMixin): + def __init__(self, columns: List[str] | str): + # Specify which columns to select + self.columns = columns + + def fit(self, X, y=None): + # No fitting needed + return self + + def transform(self, X: DataFrame): + X_copy = X.copy() + if self.columns: + if isinstance(X_copy, DataFrame): + X_copy = X_copy[self.columns if isinstance(self.columns, list) else [self.columns]] + self.out_cols = X_copy.columns + return X_copy + else: + raise ValueError("The data provided must be a pandas.Dataframe") + self.out_cols = X_copy.columns + return X_copy + + def get_feature_names_out(self, columns): + return self.columns + +# Custom transformer to rename columns +class ColumnRenamer(BaseEstimator, TransformerMixin): + def __init__(self, columns: Dict[str, str]): + if not isinstance(columns, dict): + raise 
ValueError("The columns must be passed as a dict with the format {'old_key':'new_key'}") + # Specify which columns to rename + self.columns = columns + + def fit(self, X, y=None): + # No fitting needed for this transformer + return self + + def transform(self, X: DataFrame): + X_copy = X.copy() + if self.columns: + if isinstance(X_copy, DataFrame): + X_copy.rename(columns=self.columns, inplace=True) + self.out_cols = X_copy.columns + return X_copy + else: + raise ValueError("The data provided must be a pandas.Dataframe") + self.out_cols = X_copy.columns + return X_copy + def get_feature_names_out(self, columns): + return self.out_cols + +# Custom transformer to drop NAs in columns or rows +class DropNan(BaseEstimator, TransformerMixin): + def __init__(self, axis: int=0): + # Specify which axis to evaluate + self.axis = axis + + def fit(self, X, y=None): + return self + + def transform(self, X: DataFrame | np.ndarray): + + if isinstance(X, DataFrame): + X_copy = X.copy() + X_copy = X_copy.dropna(axis=self.axis) + return X_copy + elif isinstance(X, np.ndarray): + X_copy = X.copy() + X_copy = X_copy[~np.isnan(X_copy).any(axis=self.axis)] + return X_copy + else: + raise ValueError("The data provided must be a pandas.Dataframe or np.array") + + def get_feature_names_out(self, columns): + return columns + +# Custom transformer to drop NAs in columns or rows +class DropColumns(BaseEstimator, TransformerMixin): + def __init__(self, columns: str | List[str]): + # Specify which axis to evaluate + self.columns = columns + + def fit(self, X, y=None): + # No fitting needed for this transformer + return self + + def transform(self, X: DataFrame): + + if isinstance(X, DataFrame) and isinstance(self.columns, (str, list)): + X_copy = X.copy() + X_copy.drop(columns=self.columns, inplace=True) + else: + raise ValueError("The data provided must be a pandas.Dataframe and columns must be a string or list of strings") + self.cols_out = X_copy.columns + return X_copy + + def get_feature_names_out(self, columns): + return self.cols_out + +# Custom transformer to cast string to int applying regexpatter +class StringToInt(BaseEstimator, TransformerMixin): + def __init__(self, columns: List[str] | str, patterns: List[str] | str): + + if type(columns) != type(patterns): + raise ValueError("The columns and patters must have the same data type") + elif isinstance(columns, list): + if len(columns) != len(patterns): + raise ValueError("columns and patterns list must have the same leght") + elif not isinstance(columns, str): + raise ValueError("columnas and patters must be or a single string or a list of strings") + + self.columns = columns + self.patterns = patterns + + def fit(self, X, y=None): + # No fitting needed for this transformer + return self + + # Function that applies a regex pattern to a string and returns the match + def _apply_regex(self, text, pattern): + match = pattern.search(text) + return match.group(0) if match else None + + def transform(self, X: DataFrame | npt.ArrayLike): + + X_copy = X.copy() + if isinstance(X_copy, DataFrame): + if isinstance(self.columns, str): + comp_patter = re.compile(self.patterns) + X_copy[self.columns] = list(map(lambda x: int(self._apply_regex(x, comp_patter)), X_copy[self.columns])) + else: + for col, pattern in zip(self.columns, self.patterns): + comp_patter = re.compile(pattern) + X_copy[col] = list(map(lambda x: int(self._apply_regex(x, comp_patter)), X_copy[col])) + self.out_cols = X_copy.columns + return X_copy + elif isinstance(X_copy, np.ndarray): + self.out_cols = 
+            if isinstance(self.columns, str):
+                comp_pattern = re.compile(self.patterns)
+                # np.ndarray has no .apply, so map over the flat array instead
+                return np.array([int(self._apply_regex(x, comp_pattern)) for x in X_copy])
+            else:
+                for i, pattern in enumerate(self.patterns):
+                    comp_pattern = re.compile(pattern)
+                    X_copy[:, i] = list(map(lambda x: int(self._apply_regex(x, comp_pattern)), X_copy[:, i]))
+                return X_copy
+        elif isinstance(X_copy, list):
+            self.out_cols = self.columns
+            if isinstance(self.columns, str):
+                comp_pattern = re.compile(self.patterns)
+                return list(map(lambda x: int(self._apply_regex(x, comp_pattern)), X_copy))
+            else:
+                for i, pattern in enumerate(self.patterns):
+                    comp_pattern = re.compile(pattern)
+                    X_copy[i] = list(map(lambda x: int(self._apply_regex(x, comp_pattern)), X_copy[i]))
+                return X_copy
+
+    def get_feature_names_out(self, columns):
+        return self.out_cols
+
+# Custom transformer to filter rows of a pandas.DataFrame
+class QueryFilter(BaseEstimator, TransformerMixin):
+    def __init__(self, query_string: str):
+        # Specify the filter expression to apply
+        self.query_string = query_string
+
+    def fit(self, X, y=None):
+        # No fitting needed for this transformer
+        return self
+
+    def transform(self, X: DataFrame):
+
+        if self.query_string:
+            X_copy = X.copy()
+            if isinstance(X_copy, DataFrame):
+                try:
+                    X_copy.query(self.query_string, inplace=True)
+                    self.out_cols = X_copy.columns
+                    return X_copy
+                except Exception as e:
+                    raise ValueError(f"Error applying the query string: {str(e)}")
+            else:
+                raise ValueError("The data provided must be a pandas.DataFrame")
+        self.out_cols = X.columns
+        return X
+
+    def get_feature_names_out(self, columns):
+        return self.out_cols
+
+# Custom transformer to apply pandas cut
+class DiscretizerTransformer(BaseEstimator, TransformerMixin):
+    def __init__(
+            self,
+            columns: str | List[str],
+            new_colnames: str | List[str],
+            bins: List[float | int] | List[List[float | int]],
+            labels: List[float | int] | List[List[float | int]]
+    ):
+
+        # Validate the input parameters
+        if isinstance(columns, list):
+            if not isinstance(bins, list) or not isinstance(labels, list):
+                raise ValueError("If 'columns' is a list, 'bins' and 'labels' must also be lists.")
+
+            for bin, label in zip(bins, labels):
+                if len(bin) != (len(label) + 1):
+                    raise ValueError("'bins' must have the same length as 'labels' + 1.")
+
+            if len(columns) != len(bins) or len(columns) != len(labels) or len(columns) != len(new_colnames):
+                raise ValueError("'columns', 'bins', 'labels' and 'new_colnames' must have the same length when 'columns' is a list.")
+
+        if isinstance(columns, str):
+            if not isinstance(bins, list) or not isinstance(labels, list):
+                raise ValueError("If 'columns' is a string, 'bins' and 'labels' must be lists.")
+
+            if len(bins) != (len(labels) + 1):
+                raise ValueError("'bins' must have the same length as 'labels' + 1.")
+
+        self.columns = columns
+        self.bins = bins
+        self.labels = labels
+        self.new_colnames = new_colnames
+
+
+    def fit(self, X, y=None):
+        # No fitting needed for this transformer
+        return self
+
+    def transform(self, X: DataFrame):
+        X_copy = X.copy()
+        if isinstance(X_copy, DataFrame):
+            if isinstance(self.columns, str):
+                X_copy[self.new_colnames] = pd.cut(X_copy[self.columns], bins=self.bins, labels=self.labels)
+
+            if isinstance(self.columns, list):
+                for col, new_name, bin, label in zip(self.columns, self.new_colnames, self.bins, self.labels):
+                    X_copy[new_name] = pd.cut(X_copy[col], bins=bin, labels=label)
+            self.out_cols = X_copy.columns
+            return X_copy
+        else:
+            self.out_cols = self.new_colnames
+            if isinstance(self.columns, str):
+                return pd.cut(X_copy, bins=self.bins, labels=self.labels).to_numpy()
+
+            if isinstance(self.columns, list):
+                for i, (bin, label) in enumerate(zip(self.bins, self.labels)):
+                    X_copy[i] = pd.cut(X_copy[i], bins=bin, labels=label).to_numpy()
+                return X_copy
+
+    def get_feature_names_out(self, columns):
+        return self.out_cols
+
+# Custom transformer to one-hot encode the categories contained in an array-like string column
+class ArrayOneHotEncoder(BaseEstimator, TransformerMixin):
+    def __init__(self, column: str, categories: List[str]):
+        # Specify which column to expand and which categories to extract
+        self.column = column
+        self.categories = categories
+
+    def fit(self, X, y=None):
+        # No fitting needed
+        return self
+
+    def transform(self, X: DataFrame | np.ndarray):
+        X_copy = X.copy()
+        if isinstance(X_copy, DataFrame):
+            if isinstance(self.column, str):
+                cat_df = pd.DataFrame(X_copy[self.column].apply(lambda x: find_categories(x, self.categories)).to_list(), index=X_copy.index)
+                X_copy = pd.concat([X_copy, cat_df], axis=1, ignore_index=False)
+                self.out_cols = X_copy.columns
+                return X_copy
+        if isinstance(X_copy, np.ndarray | pd.Series):
+            self.out_cols = self.categories
+            return pd.DataFrame(list(map(lambda x: find_categories(x, self.categories), X_copy))).to_numpy()
+
+    def get_feature_names_out(self, columns):
+        return self.out_cols
+
+
+class CustomOrdinalEncoder(BaseEstimator, TransformerMixin):
+    def __init__(self, categories: List[List[str]], start_category: int = 0):
+        if not isinstance(categories, list):
+            raise ValueError("The categories must be passed as a list of lists, one list for each column")
+        if not isinstance(categories[0], list):
+            raise ValueError("The categories must be passed as a list of lists, one list for each column")
+
+        if not isinstance(start_category, int):
+            raise ValueError("The starting category must be an integer")
+
+        self.categories = categories
+        self.start_category = start_category
+        self.encoder = OrdinalEncoder(categories=categories)
+
+    def fit(self, X, y=None):
+        self.encoder.fit(X)
+        return self
+
+    def transform(self, X):
+        return self.encoder.transform(X) + self.start_category
+
+    def get_feature_names_out(self, columns):
+        return self.encoder.get_feature_names_out(columns)
\ No newline at end of file
diff --git a/solution/code/test/__init__.py b/solution/code/test/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/solution/code/test/develope_tests/__init__.py b/solution/code/test/develope_tests/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/solution/code/test/develope_tests/test_eda.py b/solution/code/test/develope_tests/test_eda.py
new file mode 100644
index 0000000..43b209c
--- /dev/null
+++ b/solution/code/test/develope_tests/test_eda.py
@@ -0,0 +1,168 @@
+import os
+import sys
+import logging
+from pathlib import Path
+import datetime
+import numpy as np
+import pandas as pd
+from pandas import DataFrame
+from sklearn.pipeline import Pipeline
+from sklearn.compose import ColumnTransformer
+
+DIR_REPO = Path(__file__).parent.parent.parent.parent.parent
+os.chdir(DIR_REPO)
+
+
+# Import custom functions
+from code.src.transformer import (
+    ArrayOneHotEncoder,
+    ColumnRenamer,
+    ColumnSelector,
+    DropColumns,
+    DropNan,
+    DiscretizerTransformer,
+    QueryFilter,
+    StringToFloatTransformer,
+    StringToInt,
+)
+
+
+LOG_DIR = DIR_REPO / 'solution' / 'logs'
+os.makedirs(LOG_DIR, exist_ok=True)
+
+log_file = os.path.join(LOG_DIR, "test_eda.log")
+
+# Configure logging
+logging.basicConfig(
+    filename=log_file,
+    level=logging.DEBUG,
+    filemode='w+'
+)
+
+logger = logging.getLogger(__name__)
+
+
+pd.set_option('display.max_columns', 150)
+
+DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
+FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
+
+COLUMNS = ['id', 'neighbourhood_group_cleansed', 'property_type', 'room_type', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price']
+COLUMNS_PIPE = ['id', 'neighbourhood_group_cleansed', 'property_type', 'room_type', 'latitude', 'longitude', 'accommodates', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price']
+CAT_COLS = ['TV', 'Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Wifi', 'Elevator', 'Breakfast']
+
+MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
+MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}
+
+
+df_raw = pd.read_csv(FILEPATH_DATA, low_memory=False)
+
+logger.info("Applying old code")
+df_raw_old = df_raw.copy()
+t1_old_code = datetime.datetime.now()
+
+df_raw_old_code = df_raw_old.drop(columns=['bathrooms'])
+
+# Get number of bathrooms from `bathrooms_text`
+def num_bathroom_from_text(text):
+    try:
+        if isinstance(text, str):
+            bath_num = text.split(" ")[0]
+            return float(bath_num)
+        else:
+            return np.nan
+    except ValueError:
+        return np.nan
+
+df_raw_old_code['bathrooms'] = df_raw_old_code['bathrooms_text'].apply(num_bathroom_from_text)
+df = df_raw_old_code[COLUMNS].copy()
+df.rename(columns={'neighbourhood_group_cleansed': 'neighbourhood'}, inplace=True)
+df = df.dropna(axis=0)
+
+# Convert string to numeric
+df['price'] = df['price'].str.extract(r"(\d+).")
+df['price'] = df['price'].astype(int)
+
+df = df[df['price'] >= 10]
+
+df['category'] = pd.cut(df['price'], bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])
+
+
+def preprocess_amenities_column(df: DataFrame) -> DataFrame:
+
+    df['TV'] = df['amenities'].str.contains('TV')
+    df['TV'] = df['TV'].astype(int)
+    df['Internet'] = df['amenities'].str.contains('Internet')
+    df['Internet'] = df['Internet'].astype(int)
+    df['Air_conditioning'] = df['amenities'].str.contains('Air conditioning')
+    df['Air_conditioning'] = df['Air_conditioning'].astype(int)
+    df['Kitchen'] = df['amenities'].str.contains('Kitchen')
+    df['Kitchen'] = df['Kitchen'].astype(int)
+    df['Heating'] = df['amenities'].str.contains('Heating')
+    df['Heating'] = df['Heating'].astype(int)
+    df['Wifi'] = df['amenities'].str.contains('Wifi')
+    df['Wifi'] = df['Wifi'].astype(int)
+    df['Elevator'] = df['amenities'].str.contains('Elevator')
+    df['Elevator'] = df['Elevator'].astype(int)
+    df['Breakfast'] = df['amenities'].str.contains('Breakfast')
+    df['Breakfast'] = df['Breakfast'].astype(int)
+
+    df.drop('amenities', axis=1, inplace=True)
+
+    return df
+
+
+df = preprocess_amenities_column(df)
+
+t2_old_code = datetime.datetime.now()
+
+
+logger.info("Applying new code")
+t1_new_code = datetime.datetime.now()
+
+# Alternative pipeline layout, kept for reference:
+# ct = ColumnTransformer(
+#     transformers=[
+#         ('bathroom_processing', StringToFloatTransformer({'bathrooms_text': 'bathrooms'}), ['bathrooms_text']),
+#         ('array_to_cat', ArrayOneHotEncoder('amenities', CAT_COLS), 'amenities')
+#     ],
+#     remainder='passthrough',
+#     n_jobs=-1,
+#     verbose_feature_names_out=False,
+# )
+#
+# ct.set_output(transform='pandas')
+#
+# preprocessing_pipeline = Pipeline(steps=[
+#     ('col_selector', ColumnSelector(COLUMNS_PIPE)),
+#     ('column_transformer', ct),
+#     ('drop_na', DropNan(axis=0)),
+#     ('cast_price', StringToInt('price', r"(\d+)")),
r"(\d+)")), +# ('bin_price', DiscretizerTransformer('price', 'category', bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])), +# ('filter_rows', QueryFilter("price >= 10")), +# ('col_renamer_conditioning', ColumnRenamer(columns={'Air conditioning': 'Air_conditioning', 'neighbourhood_group_cleansed': 'neighbourhood'})), +# ('drop_cols', DropColumns('bathrooms_text')) +# ] +# ) + +preprocessing_pipeline = Pipeline(steps=[ + ('col_selector', ColumnSelector(COLUMNS_PIPE)), + ('bathroom_processing', StringToFloatTransformer({'bathrooms_text': 'bathrooms'})), + ('cast_price', StringToInt('price', r"(\d+)")), + ('filter_rows', QueryFilter("price >= 10")), + ('drop_na', DropNan(axis=0)), + ('bin_price', DiscretizerTransformer('price', 'category', bins=[10, 90, 180, 400, np.inf], labels=[0, 1, 2, 3])), + ('array_to_cat', ArrayOneHotEncoder('amenities', CAT_COLS)), + ('col_renamer_conditioning', ColumnRenamer(columns={'Air conditioning': 'Air_conditioning', 'neighbourhood_group_cleansed': 'neighbourhood'})), + ('drop_cols', DropColumns('amenities')) +]) + +preprocessing_pipeline.set_output(transform='pandas') + +df_processed = preprocessing_pipeline.fit_transform(df_raw) + +t2_new_code = datetime.datetime.now() + +logger.info(f""" +Old code time: {t2_old_code - t1_old_code} +New code time: {t2_new_code - t1_new_code} +Same result: {all(df == df_processed[df.columns])} +""") \ No newline at end of file diff --git a/solution/code/test/develope_tests/test_explore_classifier.py b/solution/code/test/develope_tests/test_explore_classifier.py new file mode 100644 index 0000000..8217ee1 --- /dev/null +++ b/solution/code/test/develope_tests/test_explore_classifier.py @@ -0,0 +1,143 @@ +import os +import sys +from pathlib import Path +import logging +import datetime +import numpy as np +import pandas as pd +from pandas import DataFrame +from sklearn.pipeline import Pipeline +from sklearn.compose import ColumnTransformer +from sklearn.ensemble import RandomForestClassifier +from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix +from sklearn.model_selection import train_test_split + +DIR_REPO = Path(__file__).parent.parent.parent.parent.parent +os.chdir(DIR_REPO) + +# Import custom functions +from code.src.transformer import ( + ColumnSelector, + CustomOrdinalEncoder, + DropNan +) + +LOG_DIR = DIR_REPO / 'solution' / 'logs' +os.makedirs(LOG_DIR, exist_ok=True) + +log_file = os.path.join(LOG_DIR, "test_explore.log") + +# Configure logging +logging.basicConfig( + filename=log_file, + level=logging.DEBUG, + filemode='w+' +) + +logger = logging.getLogger(__name__) + + +DIR_DATA_PROCESSED = Path(DIR_REPO) / "data" / "processed" +DIR_MODELS = Path(DIR_REPO) / "models" +FILEPATH_PROCESSED = DIR_DATA_PROCESSED / "preprocessed_listings.csv" + +MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4} +MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5} +FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms'] +TARGET_VARIABLE = "category" + +df = pd.read_csv(FILEPATH_PROCESSED, index_col=0) + +logging.info("Aplying old code") +df_old = df.copy() + +t1_old_code = datetime.datetime.now() + +df_old = df_old.dropna(axis=0) +# Map categorical features +df_old["neighbourhood"] = df_old["neighbourhood"].map(MAP_NEIGHB) +df_old["room_type"] = df_old["room_type"].map(MAP_ROOM_TYPE) +X_old = df_old[FEATURE_NAMES] +y_old = df_old['category'] + + +X_train_old, X_test_old, 
+clf = RandomForestClassifier(n_estimators=500, random_state=0, class_weight='balanced', n_jobs=4)
+clf.fit(X_train_old, y_train_old)
+
+y_pred_old = clf.predict(X_test_old)
+
+acc_old = accuracy_score(y_test_old, y_pred_old)
+
+y_proba_old = clf.predict_proba(X_test_old)
+roc_old = roc_auc_score(y_test_old, y_proba_old, multi_class='ovr')
+
+maps = {'0.0': 'low', '1.0': 'mid', '2.0': 'high', '3.0': 'lux'}
+
+report = classification_report(y_test_old, y_pred_old, output_dict=True)
+df_report = pd.DataFrame.from_dict(report).T[:-3]
+df_report.index = [maps[i] for i in df_report.index]
+df_report_old = df_report.copy()
+
+t2_old_code = datetime.datetime.now()
+
+logging.info("Applying new code")
+t1_new_code = datetime.datetime.now()
+
+
+ct = ColumnTransformer(
+    [
+        ('ordinal_encoder', CustomOrdinalEncoder(categories=[list(MAP_NEIGHB.keys()), list(MAP_ROOM_TYPE.keys())], start_category=1), ["neighbourhood", "room_type"])
+    ],
+    remainder="passthrough",
+    verbose_feature_names_out=False
+)
+
+processing_pipeline = Pipeline(steps=[
+    ('drop_na', DropNan(axis=0)),
+    ('categorical', ct),
+    ('col_selector', ColumnSelector(FEATURE_NAMES + [TARGET_VARIABLE]))
+])
+
+processing_pipeline.set_output(transform='pandas')
+
+df_processed = processing_pipeline.fit_transform(df)
+
+X_new = df_processed[FEATURE_NAMES]
+y_new = df_processed['category']
+
+
+X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.15, random_state=1)
+
+clf = RandomForestClassifier(n_estimators=500, random_state=0, class_weight='balanced', n_jobs=4)
+clf.fit(X_train_new, y_train_new)
+
+y_pred_new = clf.predict(X_test_new)
+
+acc_new = accuracy_score(y_test_new, y_pred_new)
+
+y_proba_new = clf.predict_proba(X_test_new)
+roc_new = roc_auc_score(y_test_new, y_proba_new, multi_class='ovr')
+
+maps = {'0.0': 'low', '1.0': 'mid', '2.0': 'high', '3.0': 'lux'}
+
+# compare the report built from the NEW predictions against the old one
+report = classification_report(y_test_new, y_pred_new, output_dict=True)
+df_report = pd.DataFrame.from_dict(report).T[:-3]
+df_report.index = [maps[i] for i in df_report.index]
+df_report_new = df_report.copy()
+
+t2_new_code = datetime.datetime.now()
+
+df_old = pd.concat([X_old, y_old], axis=1)
+
+
+logging.info(f"""
+Old code time: {t2_old_code - t1_old_code}
+New code time: {t2_new_code - t1_new_code}
+Same accuracy: {acc_old == acc_new}
+Same roc: {roc_old == roc_new}
+Same result: {all(df_processed == df_old)}
+Same report: {all(df_report_new == df_report_old)}
+""")
\ No newline at end of file
diff --git a/solution/code/test/test_transformers.py b/solution/code/test/test_transformers.py
new file mode 100644
index 0000000..e0c81b6
--- /dev/null
+++ b/solution/code/test/test_transformers.py
@@ -0,0 +1,427 @@
+import os
+from pathlib import Path
+import unittest
+import numpy as np
+import pandas as pd
+import re
+from code.src.transformer import (
+    ArrayOneHotEncoder,
+    ColumnRenamer,
+    ColumnSelector,
+    CustomOrdinalEncoder,
+    DropColumns,
+    DropNan,
+    DiscretizerTransformer,
+    QueryFilter,
+    StringToFloatTransformer,
+    StringToInt
+)
+
+DIR_REPO = Path(__file__).parent.parent.parent.parent
+log_dir = os.path.join(DIR_REPO, "solution", "logs")
+os.makedirs(log_dir, exist_ok=True)
+log_file = os.path.join(log_dir, "unittests.log")
+
+
+class TestStringToFloatTransformer(unittest.TestCase):
+
+    def setUp(self):
+        # Sample data for testing
+        self.data = pd.DataFrame({
+            'bathrooms_text': ['1 private bath', '1 bath', 'NaN', '1.5 baths'],
+        })
+
+    def test_transform_with_dict_column(self):
+        transformer = StringToFloatTransformer(columns={"bathrooms_text": "bathrooms"})
+        transformed_data = transformer.transform(self.data)
+        expected_data = self.data.copy()
+        expected_data["bathrooms"] = pd.Series([1, 1, np.nan, 1.5])
+        pd.testing.assert_frame_equal(transformed_data, expected_data)
+
+    def test_transform_with_list_column(self):
+        transformer = StringToFloatTransformer(columns=["bathrooms_text"])
+        transformed_data = transformer.transform(self.data)
+        expected_data = self.data.copy()
+        expected_data["bathrooms_text"] = pd.Series([1, 1, np.nan, 1.5])
+        pd.testing.assert_frame_equal(transformed_data, expected_data)
+
+    def test_get_feature_names_out(self):
+        transformer = StringToFloatTransformer(columns={"bathrooms_text": "bathrooms"})
+        transformer.transform(self.data)
+        self.assertListEqual(transformer.get_feature_names_out(None), list(transformer.out_cols))
+
+# Test for ColumnSelector
+class TestColumnSelector(unittest.TestCase):
+
+    def setUp(self):
+        self.data = pd.DataFrame({
+            'price': [10, 20, 30],
+            'quantity': [1, 2, 3],
+            'description': ['A', 'B', 'C']
+        })
+
+
+    def test_transform_single_column(self):
+        transformer = ColumnSelector(columns="price")
+        transformed_data = transformer.transform(self.data)
+        expected_output = pd.DataFrame({'price': [10, 20, 30]})
+        pd.testing.assert_frame_equal(transformed_data, expected_output)
+
+    def test_transform_multiple_columns(self):
+        transformer = ColumnSelector(columns=["price", "quantity"])
+        transformed_data = transformer.transform(self.data)
+        expected_output = self.data[["price", "quantity"]]
+        pd.testing.assert_frame_equal(transformed_data, expected_output)
+
+    def test_get_feature_names_out(self):
+        transformer = ColumnSelector(columns=["price", "quantity"])
+        transformer.transform(self.data)
+        self.assertListEqual(list(transformer.get_feature_names_out(None)), transformer.columns)
+
+
+class TestColumnRenamer(unittest.TestCase):
+
+    def setUp(self):
+        self.data = pd.DataFrame({
+            'old_price': [10, 20, 30],
+            'quantity': [1, 2, 3]
+        })
+
+    def test_transform_column_renaming(self):
+        transformer = ColumnRenamer(columns={"old_price": "price", "old_description": "description"})
+        transformed_data = transformer.transform(self.data)
+        expected_output = self.data.rename(columns={"old_price": "price", "old_description": "description"})
+        pd.testing.assert_frame_equal(transformed_data, expected_output)
+
+    def test_transform_invalid_data_type(self):
+        transformer = ColumnRenamer(columns={"old_price": "price"})
+        with self.assertRaises(ValueError):
+            transformer.transform(["Not a DataFrame"])
+
+    def test_get_feature_names_out(self):
+        transformer = ColumnRenamer(columns={"old_price": "price", "old_description": "description"})
+        transformer.transform(self.data)
+        self.assertListEqual(list(transformer.get_feature_names_out(None)), list(transformer.out_cols))
+
+
+class TestDropNan(unittest.TestCase):
+
+    def setUp(self):
+        self.data = pd.DataFrame({
+            'price': [10, np.nan, 30, 40, 50],
+            'quantity': [1, 2, 3, 4, 5]
+        },
+            dtype=np.float32
+        )
+
+    def test_transform_rows(self):
+        transformer = DropNan(axis=0)
+        transformed_data = transformer.transform(self.data)
+        expected_output = pd.DataFrame({
+            'price': [10.0, 30.0, 40.0, 50.0],
+            'quantity': [1.0, 3.0, 4.0, 5.0]
+        },
+            dtype=np.float32
+        )
+        pd.testing.assert_frame_equal(transformed_data.reset_index(drop=True), expected_output)
+
+    def test_transform_columns(self):
+        transformer = DropNan(axis=1)
+        transformed_data = transformer.transform(self.data)
transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'quantity': [1.0, 2.0, 3.0, 4.0, 5.0] + }, + dtype= np.float32 + ) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_transform_invalid_data_type(self): + transformer = DropNan(axis=0) + with self.assertRaises(ValueError): + transformer.transform("This is not valid") + + def test_get_feature_names_out(self): + transformer = DropNan(axis=0) + transformed_data = transformer.transform(self.data) + feature_names = transformer.get_feature_names_out(transformed_data.columns) + self.assertListEqual(list(feature_names), list(self.data.columns)) + + +class TestDropColumns(unittest.TestCase): + + def setUp(self): + self.data = pd.DataFrame({ + 'price': [10, 20, 30, 40, 50], + 'quantity': [1, 2, 3, 4, 5], + 'description': ['A', 'B', 'C', 'D', 'E'] + }) + + def test_transform_single_column(self): + transformer = DropColumns(columns="price") + transformed_data = transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'quantity': [1, 2, 3, 4, 5], + 'description': ['A', 'B', 'C', 'D', 'E'] + }) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_transform_multiple_columns(self): + transformer = DropColumns(columns=["price", "quantity"]) + transformed_data = transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'description': ['A', 'B', 'C', 'D', 'E'] + }) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_transform_invalid_data_type(self): + transformer = DropColumns(columns="price") + with self.assertRaises(ValueError): + transformer.transform("This is not valid") + + def test_get_feature_names_out(self): + transformer = DropColumns(columns="price") + transformed_data = transformer.transform(self.data) + feature_names = transformer.get_feature_names_out(transformed_data.columns) + self.assertListEqual(list(feature_names), list(transformed_data.columns)) + + +class TestStringToInt(unittest.TestCase): + + def setUp(self): + # Sample data for testing + self.data = pd.DataFrame({ + 'price': ["$10", "$20", "$30", "$40", "$50"], + 'quantity': ["1 unit", "2 units", "3 units", "4 units", "5 units"] + }) + + self.columns = ['price', 'quantity'] + self.patterns = [r'\d+', r'\d+'] + + def test_initialization_mismatched_types(self): + with self.assertRaises(ValueError): + StringToInt(columns=['price', 'quantity'], patterns=r'\d+') + + def test_initialization_mismatched_list_lengths(self): + with self.assertRaises(ValueError): + StringToInt(columns=['price', 'quantity'], patterns=[r'\d+']) + + def test_apply_regex(self): + transformer = StringToInt(columns=self.columns, patterns=self.patterns) + match = transformer._apply_regex("$10", re.compile(r'\d+')) + self.assertEqual(match, "10") + + def test_transform_single_column(self): + transformer = StringToInt(columns=self.columns[0], patterns=self.patterns[0]) + transformed_data = transformer.transform(self.data.drop(columns='quantity')) + expected_output = pd.DataFrame({'price': [10, 20, 30, 40, 50]}) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_transform_multiple_columns(self): + transformer = StringToInt(columns=self.columns, patterns=self.patterns) + transformed_data = transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'price': [10, 20, 30, 40, 50], + 'quantity': [1, 2, 3, 4, 5] + }) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + + def test_get_feature_names_out(self): + transformer = 
StringToInt(columns=self.columns[0], patterns=self.patterns[0]) + transformer.transform(self.data.drop(columns='quantity')) + feature_names = transformer.get_feature_names_out(self.data.columns[0]) + self.assertListEqual(list(feature_names), [self.data.columns[0]]) + + +class TestQueryFilter(unittest.TestCase): + + def setUp(self): + # Sample data for testing + self.data = pd.DataFrame({ + 'price': [10, 20, 30, 40, 50], + 'quantity': [1, 2, 3, 4, 5] + }) + + def test_transform_valid_query(self): + filter_transformer = QueryFilter(query_string="price > 20") + transformed_data = filter_transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'price': [30, 40, 50], + 'quantity': [3, 4, 5] + }) + pd.testing.assert_frame_equal(transformed_data.reset_index(drop=True), expected_output) + + + def test_get_feature_names_out(self): + filter_transformer = QueryFilter(query_string="price > 20") + filter_transformer.transform(self.data) + feature_names = filter_transformer.get_feature_names_out(self.data.columns) + self.assertListEqual(list(feature_names), list(self.data.columns)) + + +class TestDiscretizerTransformer(unittest.TestCase): + + def setUp(self): + self.data = pd.DataFrame({ + 'price': [10, 30, 50, 70, 100], + 'age': [20, 40, 60, 80, 100] + }) + + self.bins = [[0, 25, 50, 75, np.inf], [0, 35, 70, np.inf]] + self.labels = [[0, 1, 2, 3], [0, 1, 2]] + self.columns = ['price', 'age'] + self.new_colnames = ['price_category', 'age_category'] + + def test_initialization_invalid_params(self): + with self.assertRaises(ValueError): + DiscretizerTransformer( + columns=['price', 'age'], + new_colnames=['price_category'], + bins=[self.bins[0]], + labels=[self.labels] + ) + + with self.assertRaises(ValueError): + DiscretizerTransformer( + columns='price', + new_colnames='price_category', + bins=self.bins, + labels=[0, 1] + ) + + def test_transform_single_column(self): + transformer = DiscretizerTransformer( + columns='price', + new_colnames='price_category', + bins=self.bins[0], + labels=self.labels[0] + ) + + transformed_data = transformer.fit_transform(self.data.drop(columns=['age'])) + expected_output = pd.DataFrame({ + 'price': [10, 30, 50, 70, 100], + 'price_category': pd.Categorical([0, 1, 1, 2, 3], categories=self.labels[0], ordered=True) + }) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_transform_multiple_columns(self): + transformer = DiscretizerTransformer( + columns=self.columns, + new_colnames=self.new_colnames, + bins=self.bins, + labels=self.labels + ) + transformed_data = transformer.transform(self.data) + expected_output = pd.DataFrame({ + 'price': [10, 30, 50, 70, 100], + 'age': [20, 40, 60, 80, 100], + 'price_category': pd.Categorical([0, 1, 1, 2, 3], categories=self.labels[0], ordered=True), + 'age_category': pd.Categorical([0, 1, 1, 2, 2], categories=self.labels[1], ordered=True) + }) + pd.testing.assert_frame_equal(transformed_data, expected_output) + + def test_get_feature_names_out(self): + transformer = DiscretizerTransformer( + columns=self.columns, + new_colnames=self.new_colnames, + bins=self.bins, + labels=self.labels + ) + + transformer.fit_transform(self.data) + feature_names = transformer.get_feature_names_out(self.columns) + expected_output = list(self.data.columns) + self.new_colnames + self.assertListEqual(list(feature_names), expected_output) + + +class TestArrayOneHotEncoder(unittest.TestCase): + + def setUp(self): + self.data_df = pd.DataFrame({ + 'amenities': [ + '["Extra pillows and blankets", "Baking sheet", "Wifi", 
"Heating", "Dishes and silverware", "Essentials", ]', + '["Extra pillows and blankets", "Luggage dropoff allowed", "Free parking on premises", "Wifi", "Heating"]', + '["Kitchen", "Long term stays allowed", "Heating", "Air conditioning", "Pool"]' + ] + }) + self.categories = ['Wifi', 'Parking', 'Pool', 'Heating'] + self.data_ndarray = self.data_df.amenities.to_numpy() + self.encoder = ArrayOneHotEncoder(column='amenities', categories=self.categories) + + def test_initialization(self): + self.assertEqual(self.encoder.column, 'amenities') + self.assertEqual(self.encoder.categories, self.categories) + + + def test_transform_with_dataframe(self): + transformed_df = self.encoder.transform(self.data_df) + expected_columns = list(self.data_df.columns) + self.categories + self.assertTrue(all(col in transformed_df.columns for col in expected_columns)) + expected_output = pd.DataFrame({ + 'amenities': self.data_df.amenities.tolist(), + 'Wifi': [1, 1, 0], + 'Parking': [0, 0, 0], + 'Pool': [0, 0, 1], + 'Heating': [1, 1, 1] + }) + pd.testing.assert_frame_equal(transformed_df, expected_output) + + def test_transform_with_ndarray(self): + transformed_array = self.encoder.transform(self.data_ndarray) + expected_output = [[1, 0, 0, 1], [1, 0, 0, 1], [0, 0, 1, 1]] + np.testing.assert_array_equal(transformed_array, expected_output) + + def test_get_feature_names_out(self): + self.encoder.transform(self.data_df) + feature_names = self.encoder.get_feature_names_out(['amenities']) + self.assertListEqual(list(feature_names), list(self.data_df.columns) + self.categories) + + +class TestCustomOrdinalEncoder(unittest.TestCase): + + def setUp(self): + self.categories = [['low', 'medium', 'high']] + self.data = pd.DataFrame({'quality': ['low', 'medium', 'high', 'low']}) + + def test_initialization_invalid_categories_type(self): + with self.assertRaises(ValueError) as context: + CustomOrdinalEncoder(categories="not list") + self.assertIn("The categories must be passed as a list os list", str(context.exception)) + + def test_initialization_invalid_categories_format(self): + with self.assertRaises(ValueError) as context: + CustomOrdinalEncoder(categories=["not nested list"]) + self.assertIn("The categories must be passed as a list os list", str(context.exception)) + + def test_initialization_invalid_start_category(self): + with self.assertRaises(ValueError) as context: + CustomOrdinalEncoder(categories=self.categories, start_category="not int") + self.assertIn("The starting category must be a integer", str(context.exception)) + + def test_fit_transform_with_start_category(self): + encoder = CustomOrdinalEncoder(categories=self.categories, start_category=1) + transformed_data = encoder.fit_transform(self.data) + expected_output = np.array([[1], [2], [3], [1]]) + np.testing.assert_array_equal(transformed_data, expected_output) + + def test_fit_transform_no_start_category(self): + encoder = CustomOrdinalEncoder(categories=self.categories) + transformed_data = encoder.fit_transform(self.data) + expected_output = np.array([[0], [1], [2], [0]]) + np.testing.assert_array_equal(transformed_data, expected_output) + + def test_get_feature_names_out(self): + encoder = CustomOrdinalEncoder(categories=self.categories) + encoder.fit(self.data) + feature_names = encoder.get_feature_names_out(['quality']) + self.assertEqual(feature_names, ['quality']) + + + + + +if __name__ == '__main__': + # Open the log file in write mode and run tests with custom TextTestRunner + with open(log_file, "w") as f: + runner = 
unittest.TextTestRunner(stream=f) + unittest.main(testRunner=runner, exit=False) diff --git a/solution/docker/.env b/solution/docker/.env new file mode 100644 index 0000000..0dc84bf --- /dev/null +++ b/solution/docker/.env @@ -0,0 +1 @@ + MLFLOW_TRACKING_URI="http://mlflow:5000" \ No newline at end of file diff --git a/solution/docker/Dockerfile.app b/solution/docker/Dockerfile.app new file mode 100644 index 0000000..bdc1d1d --- /dev/null +++ b/solution/docker/Dockerfile.app @@ -0,0 +1,16 @@ +FROM python:3.12-slim + +COPY solution/app /the-real-mle-challenge/solution/app +COPY solution/code /the-real-mle-challenge/solution/code +COPY solution/requirements.txt requirements.txt + +RUN pip install -r requirements.txt + +ENV PYTHONPATH="${PYTHONPATH}:/the-real-mle-challenge/solution/app:/the-real-mle-challenge/solution\ +:/the-real-mle-challenge/solution/code:/the-real-mle-challenge/solution/code/src:/the-real-mle-challenge/solution" + +WORKDIR /the-real-mle-challenge/solution/app + +EXPOSE 8080 + +CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"] \ No newline at end of file diff --git a/solution/docker/Dockerfile.mlflow b/solution/docker/Dockerfile.mlflow new file mode 100644 index 0000000..2354505 --- /dev/null +++ b/solution/docker/Dockerfile.mlflow @@ -0,0 +1,7 @@ +FROM python:3.12-slim + +RUN pip install mlflow + +EXPOSE 5000 + +CMD mlflow server --host 0.0.0.0 --port 5000 \ No newline at end of file diff --git a/solution/docker/Dockerfile.pipe b/solution/docker/Dockerfile.pipe new file mode 100644 index 0000000..e775cd1 --- /dev/null +++ b/solution/docker/Dockerfile.pipe @@ -0,0 +1,24 @@ +FROM python:3.12-slim + +RUN apt-get update && \ + apt-get upgrade -y && \ + apt-get install -y git + +COPY solution/plots /the-real-mle-challenge/solution/plots +COPY solution/code /the-real-mle-challenge/solution/code +COPY data /the-real-mle-challenge/data +COPY solution/requirements.txt /the-real-mle-challenge/solution/requirements.txt + +RUN pip install -r /the-real-mle-challenge/solution/requirements.txt + +ENV PYTHONPATH="${PYTHONPATH}:/the-real-mle-challenge/solution/" + +WORKDIR /the-real-mle-challenge/solution/ + +CMD sh -c "\ +python code/test/develope_tests/test_eda.py && \ +python code/test/develope_tests/test_explore_classifier.py && \ +python code/test/test_transformers.py && \ +python code/generate_plots.py && \ +python code/pipeline.py" + diff --git a/solution/docker/docker-compose.yml b/solution/docker/docker-compose.yml new file mode 100644 index 0000000..a117085 --- /dev/null +++ b/solution/docker/docker-compose.yml @@ -0,0 +1,32 @@ +services: + + training-pipeline: + container_name: training-pipeline + build: + context: ../../ + dockerfile: ./solution/docker/Dockerfile.pipe + volumes: + - ../models:/the-real-mle-challenge/solution/models + - ../plots:/the-real-mle-challenge/solution/plots + - ../logs:/the-real-mle-challenge/solution/logs + env_file: .env + + app: + container_name: app + build: + context: ../../ + dockerfile: ./solution/docker/Dockerfile.app + volumes: + - ../models:/the-real-mle-challenge/solution/models + ports: + - 8080:8080 + env_file: .env + + mlflow: + container_name: mlflow + build: + context: ../../ + dockerfile: ./solution/docker/Dockerfile.mlflow + ports: + - "5000:5000" + restart: unless-stopped diff --git a/solution/logs/test_eda.log b/solution/logs/test_eda.log new file mode 100644 index 0000000..dd2919b --- /dev/null +++ b/solution/logs/test_eda.log @@ -0,0 +1,7 @@ +INFO:__main__:Aplying old code +INFO:__main__:Aplying new code 
+INFO:__main__: +Old code time: 0:00:00.131001 +New code time: 0:00:00.168337 +Same result: True + diff --git a/solution/logs/test_explore.log b/solution/logs/test_explore.log new file mode 100644 index 0000000..785564a --- /dev/null +++ b/solution/logs/test_explore.log @@ -0,0 +1,10 @@ +INFO:root:Aplying old code +INFO:root:Aplying new code +INFO:root: +Old code time: 0:00:01.193487 +New code time: 0:00:01.176504 +Same accuracy: True +Same roc: True +Same result: True +Same report: True + diff --git a/solution/logs/unittests.log b/solution/logs/unittests.log new file mode 100644 index 0000000..b26bb2b --- /dev/null +++ b/solution/logs/unittests.log @@ -0,0 +1,5 @@ +....................................... +---------------------------------------------------------------------- +Ran 39 tests in 0.016s + +OK diff --git a/solution/models/classifier.joblib b/solution/models/classifier.joblib new file mode 100644 index 0000000..95b3f06 Binary files /dev/null and b/solution/models/classifier.joblib differ diff --git a/solution/models/col_transformer.joblib b/solution/models/col_transformer.joblib new file mode 100644 index 0000000..084167c Binary files /dev/null and b/solution/models/col_transformer.joblib differ diff --git a/solution/models/pipeline.joblib b/solution/models/pipeline.joblib new file mode 100644 index 0000000..cfecdb6 Binary files /dev/null and b/solution/models/pipeline.joblib differ diff --git a/solution/models/preprocessing_pipeline.joblib b/solution/models/preprocessing_pipeline.joblib new file mode 100644 index 0000000..cb65910 Binary files /dev/null and b/solution/models/preprocessing_pipeline.joblib differ diff --git a/solution/models/processing_pipeline.joblib b/solution/models/processing_pipeline.joblib new file mode 100644 index 0000000..8184c50 Binary files /dev/null and b/solution/models/processing_pipeline.joblib differ diff --git a/solution/plots/bathroom.png b/solution/plots/bathroom.png new file mode 100644 index 0000000..18b4bb4 Binary files /dev/null and b/solution/plots/bathroom.png differ diff --git a/solution/plots/cat_encoder.png b/solution/plots/cat_encoder.png new file mode 100644 index 0000000..5c17a2f Binary files /dev/null and b/solution/plots/cat_encoder.png differ diff --git a/solution/plots/pd_cut.png b/solution/plots/pd_cut.png new file mode 100644 index 0000000..ad338d2 Binary files /dev/null and b/solution/plots/pd_cut.png differ diff --git a/solution/plots/price.png b/solution/plots/price.png new file mode 100644 index 0000000..173b9af Binary files /dev/null and b/solution/plots/price.png differ diff --git a/solution/requirements.txt b/solution/requirements.txt new file mode 100644 index 0000000..b3eff92 --- /dev/null +++ b/solution/requirements.txt @@ -0,0 +1,99 @@ +alembic==1.13.3 +annotated-types==0.7.0 +anyio==4.6.2.post1 +appnope==0.1.4 +asttokens==2.4.1 +blinker==1.8.2 +cachetools==5.5.0 +certifi==2024.8.30 +charset-normalizer==3.4.0 +click==8.1.7 +cloudpickle==3.1.0 +comm==0.2.2 +contourpy==1.3.0 +cycler==0.12.1 +databricks-sdk==0.36.0 +debugpy==1.8.7 +decorator==5.1.1 +Deprecated==1.2.14 +docker==7.1.0 +executing==2.1.0 +fastapi==0.115.3 +Flask==3.0.3 +fonttools==4.54.1 +gitdb==4.0.11 +GitPython==3.1.43 +google-auth==2.35.0 +graphene==3.4.1 +graphql-core==3.2.5 +graphql-relay==3.2.0 +gunicorn==23.0.0 +h11==0.14.0 +idna==3.10 +importlib_metadata==8.4.0 +ipykernel==6.29.5 +ipython==8.29.0 +itsdangerous==2.2.0 +jedi==0.19.1 +Jinja2==3.1.4 +joblib==1.4.2 +jupyter_client==8.6.3 +jupyter_core==5.7.2 +kiwisolver==1.4.7 +Mako==1.3.6 +Markdown==3.7 
+MarkupSafe==3.0.2 +matplotlib==3.9.2 +matplotlib-inline==0.1.7 +mlflow==2.17.1 +mlflow-skinny==2.17.1 +nest-asyncio==1.6.0 +numpy==2.1.2 +opentelemetry-api==1.27.0 +opentelemetry-sdk==1.27.0 +opentelemetry-semantic-conventions==0.48b0 +packaging==24.1 +pandas==2.2.3 +parso==0.8.4 +pexpect==4.9.0 +pillow==11.0.0 +platformdirs==4.3.6 +prompt_toolkit==3.0.48 +protobuf==5.28.3 +psutil==6.1.0 +ptyprocess==0.7.0 +pure_eval==0.2.3 +pyarrow==17.0.0 +pyasn1==0.6.1 +pyasn1_modules==0.4.1 +pydantic==2.9.2 +pydantic_core==2.23.4 +Pygments==2.18.0 +pyparsing==3.2.0 +python-dateutil==2.9.0.post0 +pytz==2024.2 +PyYAML==6.0.2 +pyzmq==26.2.0 +requests==2.32.3 +rsa==4.9 +scikit-learn==1.5.2 +scipy==1.14.1 +seaborn==0.13.2 +six==1.16.0 +smmap==5.0.1 +sniffio==1.3.1 +SQLAlchemy==2.0.36 +sqlparse==0.5.1 +stack-data==0.6.3 +starlette==0.41.2 +threadpoolctl==3.5.0 +tornado==6.4.1 +traitlets==5.14.3 +typing_extensions==4.12.2 +tzdata==2024.2 +urllib3==2.2.3 +uvicorn==0.32.0 +wcwidth==0.2.13 +Werkzeug==3.0.6 +wrapt==1.16.0 +zipp==3.20.2