GSoC 2016 Proposal: Adding fused types to Cython files
The current implementation of many algorithms, such as Stochastic Gradient Descent and Coordinate Descent, only allows input with float64 and int64 dtypes, because adopting Cython fused types could lead to an explosion in the size of the generated C code. However, since scikit-learn has removed the generated C files from the repository and now regenerates them on every build, this is a good opportunity to refactor some of the ".pyx" files by introducing Cython fused types. This will allow those algorithms to support float32 and int32 data, which is currently cast to float64 and int64 respectively, and therefore reduce wasted memory.

Cython fused types allow the programmer to have one type definition that can refer to multiple types. This lets the programmer write a single statically typed Cython algorithm that can operate on values of several types. Fused types thus enable generic programming and are akin to templates in C++ or generics in languages like Java and C#.
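As a small, self-contained illustration (a minimal sketch, not code taken from scikit-learn; the names are made up for this example), a fused type covering both precisions and a function that uses it could look like this:

```cython
# Minimal sketch of a Cython fused type; names here are illustrative only.

ctypedef fused floating:
    float     # corresponds to np.float32
    double    # corresponds to np.float64

def add_constant(floating[:] data, double value):
    """Add `value` to every element of `data`, in the array's own precision."""
    # Cython generates one C specialization of this function per type listed
    # in `floating`; the float32 or float64 version is chosen at call time
    # from the dtype of the buffer that is passed in.
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        data[i] = data[i] + <floating> value
```

Calling add_constant with a float32 array runs the float specialization and never allocates a float64 copy. Cython also provides a built-in cython.floating fused type covering float and double, which can be cimported instead of defining one by hand.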
The scikit-learn functions and modules to which I plan to add fused types are listed below:
Motivation:
Dealing with sparse data is fairly common when analyzing large datasets. However, the sparse function utilities in scikit-learn currently do not support float32 and convert input data to float64 internally, a conversion that is invisible to the user. Introducing Cython fused types here would fix a potential memory-waste issue.
Goal:
Make sparse function utils support Cython fused types.
Implementation Plan:
According to #6430, sparsefuncs_fast.pyx needs to be modified first so that other classes can handle sparse float32 input as float32 internally. I will also add tests for the classes that use functions defined in sparsefuncs_fast.pyx to make sure they still work correctly. I have already started working on this in PR #6539.
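To give a feel for the change, here is a rough sketch of an in-place L2 row normalization over CSR data written once for both precisions. The signature and body are simplified and illustrative, not the actual sparsefuncs_fast.pyx code, and the index array is assumed to be int32 here:

```cython
from libc.math cimport sqrt

ctypedef fused floating:
    float
    double

def inplace_csr_row_normalize_l2(floating[:] data, int[:] indptr):
    """Scale each CSR row to unit L2 norm, in place, without a float64 copy."""
    cdef Py_ssize_t row, j
    cdef Py_ssize_t n_rows = indptr.shape[0] - 1
    cdef double norm
    for row in range(n_rows):
        # Accumulate the squared norm of the row in double for accuracy.
        norm = 0.0
        for j in range(indptr[row], indptr[row + 1]):
            norm += data[j] * data[j]
        if norm == 0.0:
            continue
        norm = sqrt(norm)
        for j in range(indptr[row], indptr[row + 1]):
            data[j] = <floating>(data[j] / norm)
```

The accompanying tests would then check that a float32 CSR matrix keeps its dtype after the call and that the results match the float64 path within tolerance.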
Main Related Files:
- sklearn/
  - utils/
    - sparsefuncs_fast.pyx
Motivation:
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This common technique is used in many fields, including image analysis and unsupervised document classification. However, many clustering algorithms in scikit-learn convert float32 input data to float64 before the fitting step, which may cause memory issues.
Goal:
Make clustering algorithms support Cython fused types.
Implementation Plan:
After enhancing the sparse function utilities in scikit-learn, I will refactor KMeans/MiniBatchKMeans and modify _hierarchical.pyx to make these commonly used clustering algorithms also support Cython fused types. I will also modify test_k_means.py and test_hierarchical.py to check that they actually work.
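To illustrate the kind of kernel this touches (a hedged sketch, not the actual _k_means.pyx code), a label-assignment loop written against a fused type keeps float32 data in float32 throughout:

```cython
ctypedef fused floating:
    float
    double

def assign_labels(floating[:, :] X, floating[:, :] centers, int[:] labels):
    """Assign each sample to its closest center by squared Euclidean distance.

    X and centers share the same fused type, so they must have the same dtype.
    """
    cdef Py_ssize_t i, j, k
    cdef Py_ssize_t n_samples = X.shape[0]
    cdef Py_ssize_t n_features = X.shape[1]
    cdef Py_ssize_t n_clusters = centers.shape[0]
    cdef double dist, diff, best_dist
    for i in range(n_samples):
        best_dist = -1.0
        for k in range(n_clusters):
            # Distances are accumulated in double even for float32 input,
            # trading a little speed for numerical robustness.
            dist = 0.0
            for j in range(n_features):
                diff = X[i, j] - centers[k, j]
                dist += diff * diff
            if best_dist < 0.0 or dist < best_dist:
                best_dist = dist
                labels[i] = <int> k
```

The Python-level KMeans code would, in turn, stop forcing a float64 copy during input validation so that the float32 specialization is actually reached.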
Main Related Files:
- sklearn/
  - cluster/
    - k_means_.py
    - _k_means.pyx
    - hierarchical.py
    - _hierarchical.pyx
Motivation:
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as (linear) Support Vector Machines and Logistic Regression. In scikit-learn, however, the current implementations of SGDClassifier and SGDRegressor do not let the user choose the dtype, and training may raise a MemoryError because the input data is copied to np.float64. (See #5776)
Implementation Plan:
I'll start by modifying the base classes in sgd_fast.pyx, including LossFunction, Classification, and Regression; then I'll make several loss functions such as Hinge and Log also support Cython fused types. Finally, I'll refactor stochastic_gradient.py to make it accept both float32 and float64 input.
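The actual sgd_fast.pyx implements losses as cdef classes, so the refactoring is more involved than a single function, but the core idea is captured by this simplified, illustrative sketch of a hinge loss and its derivative accepting either precision:

```cython
ctypedef fused floating:
    float
    double

cdef inline double hinge_loss(floating p, floating y) nogil:
    # Hinge loss max(0, 1 - p * y). The prediction p and label y keep the
    # input precision; the returned loss is accumulated in double.
    cdef double z = p * y
    if z < 1.0:
        return 1.0 - z
    return 0.0

cdef inline double hinge_dloss(floating p, floating y) nogil:
    # Derivative of the hinge loss with respect to the prediction p.
    cdef double z = p * y
    if z < 1.0:
        return -y
    return 0.0
```

Whether the weight vector and loss accumulators should stay in float64 while only the data is float32 is a design question to settle with reviewers; keeping them in double is usually the safer default numerically.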
Goal:
Make SGD algorithms and SGD related learning methods support Cython fused types.
Main Related Files:
- sklearn/
  - linear_model/
    - stochastic_gradient.py
    - sgd_fast.pyx
Motivation:
Stochastic Average Gradient (SAG) is another variant of SGD that uses an averaging strategy, and it is useful for classification with a logistic loss. Like SGD, SAG in scikit-learn currently copies input data to np.float64 internally.
Implementation Plan:
I'll first modify the classes and functions defined in sag_fast.pyx, such as MultinomialLogLoss and _multinomial_grad_loss_all_samples; then I'll refactor sag.py so that the learning methods defined in it support Cython fused types and can therefore handle both float32 and float64 input.
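As another hedged illustration (not the real sag_fast.pyx code), the data array is the only thing that needs the fused type; the running sum of gradients that SAG maintains can stay in double for numerical safety:

```cython
ctypedef fused floating:
    float
    double

def sag_gradient_update(floating[:] x_i, double grad_new, double grad_old,
                        double[:] sum_gradient):
    """Replace sample i's contribution in SAG's running gradient sum.

    For a linear model the gradient w.r.t. the weights is a scalar times the
    sample, so updating the running sum only needs the new and old scalars.
    """
    cdef Py_ssize_t j
    for j in range(x_i.shape[0]):
        # x_i stays in its input precision (float32 or float64); the
        # accumulator is kept in double regardless.
        sum_gradient[j] += (grad_new - grad_old) * x_i[j]
```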
Goal:
Make SAG and SAG related learning methods support Cython fused types.
Main Related Files:
- sklearn/
  - linear_model/
    - sag.py
    - sag_fast.pyx
Motivation:
Coordinate Descent (CD) is a derivative-free optimization algorithm that, at each iteration, performs a line search along one coordinate direction from the current point in order to find a local minimum of a function. However, the current implementations of ElasticNet and Lasso in scikit-learn constrain the input to float64, which is a waste of space. (See #5464)
Implementation Plan:
I'll first modify the functions related to enet_coordinate_descent in cd_fast.pyx; then I'll refactor coordinate_descent.py so that the learning methods defined in it, such as ElasticNet and Lasso, also support Cython fused types and therefore accept both float32 and float64 input.
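The inner loop of enet_coordinate_descent repeatedly applies a soft-thresholding step; here is a hedged sketch of how such helpers could be written once for both precisions (illustrative only, not the actual cd_fast.pyx code):

```cython
ctypedef fused floating:
    float
    double

cdef inline floating fsign(floating x) nogil:
    # Sign of x, returned in the same floating type.
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0

cdef inline floating soft_threshold(floating value, floating threshold) nogil:
    # Proximal operator of the L1 penalty:
    # sign(value) * max(|value| - threshold, 0)
    cdef floating shrunk = (value if value > 0 else -value) - threshold
    if shrunk > 0:
        return fsign(value) * shrunk
    return 0
```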
Goal:
Make CD and CD related learning methods support Cython fused types.
Main Related Files:
- sklearn/
  - linear_model/
    - coordinate_descent.py
    - cd_fast.pyx
Motivation:
The neighbors module in scikit-learn provides functionality for neighbors-based learning methods. The principle behind these methods is to find a predefined number of training samples closest in distance to the new test point and predict the label from them. Neighbors-based methods simply “remember” all of their training data and may use a fast indexing structure such as a Ball Tree or KD Tree to speed up prediction. Ball Tree and KD Tree currently work well, but they only support np.float64; this wastes memory, since results of the same quality should be obtainable with float32.
Goal:
Make neighbors trees and related neighbors-based learning methods support Cython fused types.
Implementation Plan:
I'll introduce Cython fused types into the functions defined in ball_tree.pyx and kd_tree.pyx to make them support both float64 and float32. Then I'll modify dist_metrics.pyx so that it also adopts Cython fused types.
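The distance kernels shared by the trees are a natural starting point; below is a hedged sketch of a Euclidean (reduced) distance written against a fused type, illustrative rather than the actual dist_metrics.pyx code:

```cython
from libc.math cimport sqrt

ctypedef fused floating:
    float
    double

cdef inline double euclidean_rdist(floating[:] x1, floating[:] x2) nogil:
    # "Reduced" distance (squared Euclidean), which the trees use in their
    # inner loops to avoid taking a square root on every comparison.
    cdef double tmp, d = 0.0
    cdef Py_ssize_t j
    for j in range(x1.shape[0]):
        tmp = x1[j] - x2[j]
        d += tmp * tmp
    return d

cdef inline double euclidean_dist(floating[:] x1, floating[:] x2) nogil:
    return sqrt(euclidean_rdist(x1, x2))
```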
Main Related Files:
- sklearn/
  - neighbors/
    - ball_tree.pyx
    - kd_tree.pyx
    - dist_metrics.pyx
    - typedefs.pyx
Motivation:
Latent Dirichlet allocation (LDA) is a generative statistical model that represents documents as mixtures of topics which emit words with certain probabilities. It is an example of a topic model and is very popular for topic discovery. In practice, document and text datasets tend to be very large, so it would be very useful if LDA in scikit-learn could handle both float32 and float64 input.
Goal:
Make LDA in scikit-learn support Cython fused types.
Implementation Plan:
I'll replace the original float64_t arguments defined in _online_lda.pyx with Cython fused types in order to support float32 and prevent a potential memory issue. I'll also add tests to guarantee it works for both float32 and float64.
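For instance, a small helper such as a mean-change check could accept either precision; this is a loose, illustrative sketch, not the actual _online_lda.pyx code:

```cython
from libc.math cimport fabs

ctypedef fused floating:
    float
    double

def mean_change(floating[:] arr_1, floating[:] arr_2):
    """Mean absolute difference between two 1-D arrays, kept in their own dtype."""
    cdef Py_ssize_t i
    cdef Py_ssize_t n = arr_1.shape[0]
    cdef double total = 0.0
    for i in range(n):
        total += fabs(arr_1[i] - arr_2[i])
    return total / n
```

The tests would call such helpers with both float32 and float64 arrays and compare the results within a suitable tolerance.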
Main Related Files:
- sklearn/
  - decomposition/
    - _online_lda.pyx
- The svm module is not included in this plan since it is tightly coupled to the underlying C++ source code. However, I'll make it adopt Cython fused types after the GSoC program if the scikit-learn community thinks it is important.
- The tree module is not included in this plan since it always converts training input samples to float32. See comments here.
- Warm up (March 26, 2016 - April 22, 2016)
- Community Bonding Period (April 22, 2016 - May 22, 2016)
  - Modify sparsefuncs_fast.pyx to make it handle sparse float32
  - Add tests for classes that use functions in sparsefuncs_fast.pyx
- Week 1-3 (May 23, 2016 - June 12, 2016)
  - Refactor KMeans/MiniBatchKMeans to handle sparse float32
  - Make other clustering algorithms support fused types
  - Add tests for the cluster module
- Week 4-5 (June 13, 2016 - June 26, 2016)
  - Make SGD-related learning methods support fused types
  - Add tests for SGD-related learning methods
- Week 6-7 (June 27, 2016 - July 10, 2016)
  - Make SAG-related learning methods support fused types
  - Add tests for SAG-related learning methods
- Week 8-9 (July 11, 2016 - July 24, 2016)
  - Make CD-related learning methods support fused types
  - Add tests for CD-related learning methods
- Week 10-12 (July 25, 2016 - August 14, 2016)
  - Make the neighbors trees support fused types
  - Add tests for the neighbors trees
- Week 13 (August 15, 2016 - August 23, 2016)
  - Make LDA support fused types
  - Add tests
  - Wrap up
I'll graduate in June and will start working as a research assistant in September; therefore, I can work full time on the Google Summer of Code program with scikit-learn.
- #5492 Removing generated C files, adding build cache to Travis
- #5776 Working with float32 data
- #5464 Coordinate descent should work on float32 data (in addition to float64 data)
- #6430 Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types
- #5932 Allow float32 to pass through with copy
- #6539 Use fused type in inplace normalize
- #5973 Add fused type to Cython files
### 1. Personal Information:
Name: Yen-Chen Lin
Alternate Names: Yen
Email: [email protected]
Telephone: +886 905136913
Time Zone: GMT/UTC+8
Blog: http://www.yclin.me/
University: National Tsing Hua University
Major: Computer Science
Current Year: 4th year
Expected Graduation date: 2016
Degree: Bachelor
I am Yen-Chen Lin, a final-year undergraduate at National Tsing Hua University, Taiwan, studying Computer Science. I started doing Machine Learning / Computer Vision research one year ago and have published a workshop paper at CVPR.
In addition to my research experience, I also build interesting side projects such as DeepLearningFlappyBird, which has been featured on Hacker News, and I work as a contract iOS developer for a startup called Forest.
I started using scikit-learn around September 2014 and have been contributing to it since January 2016, improving documentation and fixing minor bugs.
- [MRG+1] Use fused type in inplace normalize (Open)
- [WIP] Add ensemble selection algorithm (Open)
- [MRG+1] Make dump_svmlight_file support sparse y (fixes #6301) (Merged)
- [MRG] Add homogeneous time series cross validation (Open)
- [MRG] Make GradientBoostingClassifier error message more informative (fixes 6433) (Open)
- [MRG] DOC Add LabelKFold in _BaseKFold's docstring (Open)
- Example document_clustering.py has broken (Open)
- [MRG] DOC Add doc of attributes for VotingClassifier (Merged)
- [MRG] DOC Fix misuse of "it's" (Merged)
- [MRG] DOC Remove redundant words in sklearn (Merged)
- Add test for local linear embedding (Merged)
- [MRG] Remove redundant assignment of variable in test_locally_linear.py (Merged)
- [MRG] Fix broken link of Wainwright's paper (fixes #6383) (Merged)
- [MRG+1] Add test for voting classifier's init errors (Merged)
- [MRG+1] Remove redundant assignment of parameter in _split.py (Merged)
- [MRG] DOC Remove redundant that in test_ranking.py (Merged)
- BUG: Normalize LDA transform's return value (fixes #6320) (Merged)
- [MRG] DOC Remove redundant "and" in binary_tree.pxi (Merged)
- [MRG] Fix check_cv error message (Merged)
- [MRG+1] Doc Add doc in DictVectorizer when categorical features are numeric values (fixes #4413) (Merged)
- [MRG] DOC: Fix typo in LDA parameter's doc (Merged)
- [MRG+1]Fix axis label in the plot of multioutput regression tree example (Merged)
- [MRG+1] Make doc for parameter in hierarchical.py consistent (Merged)