add blog about working with tabular data using FastAI.jl #94
manikyabard wants to merge 2 commits into FluxML:main from
| To get started, we need to load our tabular data into an object that supports the interface defined by [Tables.jl](https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1). Most popular packages for loading data from different formats already do this, so you probably won't have to worry about it. | ||
| Here, we have a `path` to a CSV file, which we'll load using the [CSV.jl](https://github.com/JuliaData/CSV.jl) package and convert to a DataFrame using [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl). | ||
| If your data is present in a different format, you could use a package which supports loading that format, provided that the final object created supports the required interface. |
What is this required interface? Can you link it? Or is this a general comment?
Ah, seems like you are referring to the Tables.jl Interface, maybe explicitly note that?
Yes, this was referring to the Tables.jl interface. Sure, I'll do that.
| [FastAI.jl](https://github.com/FluxML/FastAI.jl) is a package inspired by [fastai](https://github.com/fastai/fastai), and its goal is to make it easy to create state-of-the-art models. | ||
| This blog post shows how to get started on working with tabular data using FastAI.jl and related packages. The work being presented here was done as a part of [GSoC'21](https://summerofcode.withgoogle.com/projects/#5088642453733376) under the mentorship of Kyle Daruwalla, Brian Chen and Lorenz Ohly. |
I think you should not undersell your work here. I am truthfully unfamiliar with the deep technical detail, but saying something like "Before my GSoC project, we could only do x and y. Now we can do XY & Z together with this unified interface" would make it very clear why someone should read this post.
Agreed, this project was no small feat. Just look at how long it's taken other frameworks to (not) add support for new modalities!
Thanks for the comments! I'll add this in as well.
Co-authored-by: Logan Kilpatrick <23kilpatrick23@gmail.com>
| julia> path = joinpath(datasetpath("adult_sample"), "adult.csv"); | ||
| julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5) |
| - julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5) | |
| + julia> df = DataFrames.DataFrame(CSV.File(path)) | |
| + julia> first(df, 5) |
| ``` | ||
| This `TableDataset` object lets us get the observation at a particular index with `getindex(td, index)` and the total number of observations with `nobs(td)`. |
Maybe add in a line about why this is cool, and how it generalises the usual getindex based approach for arrays to data frames?
| julia> item = DataAugmentation.TabularItem(row, Tables.columnnames(df)); | ||
| julia> DataAugmentation.apply(normalize, item).data |
Show the TabularItem here to clarify what is written in the next sentence? We never see the TabularItem post normalisation.
Hey @manikyabard, it would be great to get this wrapped up. Let me know if I can help in any way!
Sure @logankilpatrick, I'll get this done (although the next 2-3 weeks look a little busy for me, so this might take a bit). I also wanted to confirm with @darsnack, @ToucheSir, or @lorenzoh that it's fine to put this post here, since we had talked about putting it on the FastAI.jl website as well. I think we discussed this a few ML Community calls ago, but I can't remember what we decided. Another thing is that this blog post previously focused mainly on loading the data and performing some transformations on it (because that was all the code written at the time), but we have come a long way since then and can probably cover more functionality, such as creating and training tabular models with the data.
I think it's fine to post this on FluxML, but I agree the content should be expanded to include the full GSoC. |
The post explores some of the work done on FastAI.jl development as a part of GSoC'21 (container PR, transformation PR) under the mentorship of @darsnack, @ToucheSir and @lorenzoh, and shows how to get started with tabular data by creating a container and performing various transformations on it.