Description
The first part of the task is to find a real-life big data problem or Kaggle competition whose solution relies on the functionality available in daal4py-optimized Scikit-learn. That is, the solution should run for at least several minutes and spend more than 70% of its time in the following algorithms, either in one of them or in a combination:
• Linear or ridge regression
• LASSO or elastic net regularization
• Logistic regression
• Principal component analysis (PCA)
• K-Means clustering
• Pairwise distance (cosine or correlation)
• C-support vector classification (SVC)
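One way to check the runtime requirement above is to time each stage of a candidate solution and compute the share spent in the listed algorithms. The following is a minimal sketch using only the standard library. The stage names, the `StageTimer` helper, and the 120-second reading of "several minutes" are illustrative assumptions, not part of the task definition.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage of a solution."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


def target_share(seconds, target_stages):
    """Fraction of total time spent in the stages named in target_stages."""
    total = sum(seconds.values())
    if total == 0:
        return 0.0
    in_target = sum(t for name, t in seconds.items() if name in target_stages)
    return in_target / total


def qualifies(seconds, target_stages, min_total_s=120.0, min_share=0.70):
    """True if the run is long enough and more than min_share of it is in target stages.

    min_total_s is an assumed threshold for "several minutes".
    """
    total = sum(seconds.values())
    return total >= min_total_s and target_share(seconds, target_stages) > min_share
```

In a notebook, each step would be wrapped as `with timer.stage("pca"): ...`, and `qualifies(timer.seconds, {"pca", "kmeans"})` would then report whether the candidate fits the requirement.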
The second part of the task is to reproduce the solution using Intel-optimized Scikit-learn and check the correctness of the new solution. This means that the accuracy of the Intel-optimized solution should not degrade compared to the original solution. Contribute your example to the repository.
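Switching an existing solution to the optimized implementations is typically done by patching Scikit-learn before any estimators are imported. The sketch below assumes the `daal4py` package is installed and exposes `patch_sklearn` in `daal4py.sklearn`; the wrapper function `try_patch_sklearn` is a hypothetical helper that falls back to vanilla Scikit-learn when the package is missing.

```python
def try_patch_sklearn():
    """Enable daal4py optimizations if available; return True when patched."""
    try:
        # Assumption: daal4py is installed and provides this entry point.
        from daal4py.sklearn import patch_sklearn
        patch_sklearn()
    except ImportError:
        # daal4py (or Scikit-learn itself) is not available: stay on vanilla code paths.
        return False
    return True


patched = try_patch_sklearn()

# Import Scikit-learn estimators only after this point, so that the patched
# implementations are the ones picked up, for example:
#   from sklearn.cluster import KMeans
```

Running the same notebook once with the patch and once without it gives the two sets of accuracy and timing numbers needed for the comparison.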
- A data analytics or machine learning task that satisfies the requirements is found; its solution is reproduced and gives satisfactory accuracy with both vanilla Scikit-learn and Intel-optimized Scikit-learn. Outcome: Jupyter notebook.
- The original solution gains one of the following improvements:
a. The trained model’s accuracy improves significantly with Intel-optimized Scikit-learn compared to vanilla Scikit-learn, while training the improved model takes no longer than training the original model with vanilla Scikit-learn.
b. Model training shows at least a 1.5X speedup with Intel-optimized Scikit-learn compared to vanilla Scikit-learn.
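The acceptance criteria above can be written down as a small check over measured accuracies and training times. This is a sketch under stated assumptions: the task does not define "significant" or an accuracy tolerance, so `min_acc_gain` and `tolerance` are hypothetical parameters to be agreed on per task, and all function names are illustrative.

```python
def accuracy_preserved(acc_vanilla, acc_optimized, tolerance=1e-3):
    """Correctness check: optimized accuracy must not degrade beyond an assumed tolerance."""
    return acc_optimized >= acc_vanilla - tolerance


def speedup(t_vanilla_s, t_optimized_s):
    """Ratio of vanilla training time to optimized training time."""
    if t_optimized_s <= 0:
        raise ValueError("optimized training time must be positive")
    return t_vanilla_s / t_optimized_s


def meets_improvement(acc_vanilla, acc_optimized, t_vanilla_s, t_optimized_s,
                      min_acc_gain=0.01, min_speedup=1.5):
    """True if criterion (a) or criterion (b) holds.

    (a) accuracy gain of at least min_acc_gain (assumed meaning of "significant")
        with training time no longer than the vanilla run;
    (b) training speedup of at least min_speedup.
    """
    better_accuracy = (acc_optimized - acc_vanilla >= min_acc_gain
                       and t_optimized_s <= t_vanilla_s)
    faster_training = speedup(t_vanilla_s, t_optimized_s) >= min_speedup
    return better_accuracy or faster_training
```

For example, a run that keeps accuracy at 0.90 while cutting training time from 300 s to 200 s satisfies criterion (b), since the speedup is exactly 1.5X.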