215 changes: 215 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/.gitignore
@@ -0,0 +1,215 @@
dask-worker-space/
.idea/
*.bak

# Created by https://www.gitignore.io/api/data,python,pycharm
# Edit at https://www.gitignore.io/?templates=data,python,pycharm

### Data ###
*.csv
*.dat
*.efx
*.gbr
*.key
*.pps
*.ppt
*.pptx
*.sdf
*.tax2010
*.vcf
*.xml

### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf

# Generated files
.idea/**/contentModel.xml

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr

# CMake
cmake-build-*/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests

# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

### PyCharm Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721

# *.iml
# modules.xml
# .idea/misc.xml
# *.ipr

# Sonarlint plugin
.idea/**/sonarlint/

# SonarQube Plugin
.idea/**/sonarIssues.xml

# Markdown Navigator plugin
.idea/**/markdown-navigator.xml
.idea/**/markdown-navigator/

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# End of https://www.gitignore.io/api/data,python,pycharm
10 changes: 10 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/configs/dtypes.json
@@ -0,0 +1,10 @@
{
"CRSElapsedTime": "float",
"Cancelled": "bool",
"Diverted": "bool",
"UniqueCarrier": "category",
"FlightNum": "category",
"TailNum": "category",
"Origin": "category",
"Dest": "category"
}
14 changes: 14 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/configs/lr.json
@@ -0,0 +1,14 @@
{
"mode": "dask",
"algorithm_name": "dask_ml.linear_model.LogisticRegression",
"params_grid": {
"penalty": ["l1", "l2"],
"C": [0.01, 0.1, 0.5, 1.0],
"solver": ["gradient_descent"],
"fit_intercept": [false],
"solver_kwargs": [{"normalize": false}]
},
"num_folds": 3,
"test_size": 0.7,
"output_path": "lr_results.json"
}
14 changes: 14 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/configs/lr_a.json
@@ -0,0 +1,14 @@
{
"mode": "dask_array",
"algorithm_name": "dask_ml.linear_model.LogisticRegression",
"params_grid": {
"penalty": ["l1", "l2"],
"C": [0.01, 0.1, 0.5, 1.0],
"solver": ["gradient_descent"],
"fit_intercept": [false],
"solver_kwargs": [{"normalize": false}]
},
"num_folds": 5,
"test_size": 0.2,
"output_path": "lr_results.json"
}
63 changes: 63 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/configs/preprocess.json
@@ -0,0 +1,63 @@
{
"preprocessors": [
{
"column": "Cancelled",
"name": "Filter",
"value": false
},
{
"column": "Diverted",
"name": "Filter",
"value": false
},
{
"column": "Distance",
"name": "FillMethod",
"method": "mean"
},
{
"column": "DepTime",
"name": "Drop"
},
{
"column": "ArrTime",
"name": "Drop"
},
{
"column": "DepDelay",
"name": "Drop"
},
{
"column": "AirTime",
"name": "Drop"
},
{
"column": "TaxiIn",
"name": "Drop"
},
{
"column": "TaxiOut",
"name": "Drop"
},
{
"column": "ActualElapsedTime",
"name": "Drop"
},
{
"column": "FlightNum",
"name": "Drop"
},
{
"column": "CRSDepTime",
"name": "CyclicHM"
},
{
"column": "CRSArrTime",
"name": "CyclicHM"
},
{
"column": "TailNum",
"name": "Drop"
}
]
}
21 changes: 21 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/configs/rf.json
@@ -0,0 +1,21 @@
{
"mode": "sklearn",
"algorithm_name": "sklearn.ensemble.RandomForestRegressor",
"params_grid": {
"n_estimators": [
20, 30, 50
],
"max_depth": [
3, 4, 5
],
"criterion": [
"mse", "mae"
],
"max_features": [
"sqrt", "log2"
]
},
"num_folds": 5,
"test_size": 0.2,
"output_path": "rf_results.json"
}
24 changes: 24 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/readme.md
@@ -0,0 +1,24 @@
# Foundation of Data Science 2020 homework 2
Homework 2: large-scale ML. Train `dask` and regular `sklearn` models
on the New York flights dataset.
## Installation
The easiest way is to install the whole directory in editable mode,
since you are checking it out of the repository anyway.

`pip install -r requirements.txt`

`pip install -e .`

An isolated Python environment is required; either `conda` or `virtualenv` is recommended.
Packages are installed and managed using `pip` in either case,
because the latest versions are not available in conda at the time of writing.

## Usage
To run the whole pipeline, use the supplied shell script:

`bash run.sh`

Performance results appear in the console and in separate `json` files,
whose paths are set by `output_path` in the configuration files located at `configs/`.
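
The pipeline code itself is not part of this diff, so the following is only a sketch of how a configuration such as `configs/rf.json` could be consumed. It assumes `algorithm_name` is a dotted import path and `params_grid` maps each parameter to its candidate values. The helper names `resolve_dotted_name` and `expand_grid` are hypothetical and are not taken from the submission.

```python
import importlib
import itertools
import json


def resolve_dotted_name(dotted_name):
    """Import and return the object named by a string like 'package.module.ClassName'."""
    module_path, _, attr_name = dotted_name.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


def expand_grid(params_grid):
    """Return every parameter combination of a {name: [values]} grid as a list of dicts."""
    names = sorted(params_grid)
    value_lists = [params_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Inline copy of the relevant keys from configs/rf.json, so the example is self-contained.
config = json.loads("""
{
  "algorithm_name": "sklearn.ensemble.RandomForestRegressor",
  "params_grid": {
    "n_estimators": [20, 30, 50],
    "max_depth": [3, 4, 5],
    "criterion": ["mse", "mae"],
    "max_features": ["sqrt", "log2"]
  },
  "num_folds": 5
}
""")

candidates = expand_grid(config["params_grid"])
print(len(candidates))                        # 3 * 3 * 2 * 2 = 36 combinations
print(len(candidates) * config["num_folds"])  # 180 model fits with 5 folds

# resolve_dotted_name(config["algorithm_name"]) would return the estimator class
# when scikit-learn is installed; a standard-library name is used here instead.
ordered_dict_cls = resolve_dotted_name("collections.OrderedDict")
```

Keeping the estimator as a dotted string lets one driver handle both the `sklearn` and the `dask_ml` configurations. The count of 36 combinations also gives a rough idea of how long a full grid search over `rf.json` takes.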

## Comments
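`configs/preprocess.json` applies a `CyclicHM` step to `CRSDepTime` and `CRSArrTime`, but its implementation is not shown in this diff. A plausible reading is that it treats the value as an `HHMM` clock time and maps it onto the unit circle, so that 23:55 and 00:05 end up close together. The function below is a minimal sketch under that assumption. The name `cyclic_hm` and the sine/cosine output format are illustrative and not taken from the submission.

```python
import math

MINUTES_PER_DAY = 24 * 60


def cyclic_hm(hhmm):
    """Encode an HHMM integer (e.g. 1430 for 14:30) as a (sin, cos) pair on the unit circle."""
    total_minutes = (hhmm // 100) * 60 + (hhmm % 100)
    # The modulo makes 2400 wrap around to the same point as 0000.
    angle = 2.0 * math.pi * (total_minutes % MINUTES_PER_DAY) / MINUTES_PER_DAY
    return math.sin(angle), math.cos(angle)


late = cyclic_hm(2355)   # five minutes before midnight
early = cyclic_hm(5)     # five minutes after midnight
distance = math.hypot(late[0] - early[0], late[1] - early[1])
print(distance)          # small, even though 2355 and 5 are far apart as integers
```

With this encoding, a linear model such as the logistic regression in `configs/lr.json` sees two features per time column instead of one raw integer with an artificial jump at midnight.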
10 changes: 10 additions & 0 deletions Week 2/Day 3/Submissions/Alexander Fishkov/requirements.txt
@@ -0,0 +1,10 @@
dask[complete]
dask-ml
dask-xgboost
tables
jupyter-server-proxy
setuptools
distributed
joblib
numpy
pandas