Open
Description
Would it make sense to remove the tests folder from the pandas distribution? It accounts for roughly 33% of the total package size.
This is especially important when using pandas inside AWS Lambda, where the deployment package size is limited to 50 MB zipped and 5 MB can really make a difference.
# Uncompressed
du -h -s pandas*
46.5M pandas
30.9M pandas_no_tests
# Compressed
du -h -s pandas*
14.7M pandas.zip
10.1M pandas_no_tests.zip
Activity
TomAugspurger commented on Jan 6, 2020
I think we've talked about that in the past. We do have pandas.test() as part of the public API however, so we'd need to consider that.
A couple options:
- pandas-slim or something that excludes these files (and docs)
- pandas.test() fetches the source files on demand. That seems a bit messy though.
Just as a note: we do exclude the test data files that are present in the git repository. So we're only talking about source files.
vfilimonov commented on Jan 6, 2020
Hello @TomAugspurger
pandas-slim sounds like a good workaround.
It looks like docs are not a part of the distribution.
And right, it's test code only, without the data files. In terms of size, the tests are second only to _libs and almost equal to the rest of the code.
And what is the reason for having tests as a part of the API? I see that numpy, scipy, matplotlib etc. are doing the same (while many other libs, especially web-oriented ones like flask, requests, jinja, don't).
TomAugspurger commented on Jan 6, 2020
stonecharioteer commented on Jan 7, 2020
I used to run the tests for numpy, scikit-learn and matplotlib after installing, since at times they would fail on Windows. However, this was quite some time ago, perhaps 4 years. Perhaps other users were doing the same?
jonringer commented on Jan 9, 2020
I'm a package manager for nixpkgs, and I'm against removing tests from the sdist package. However, removing them from the wheel would make sense from a packaging standpoint. It's considered best practice in FOSS that if you distribute source, you also distribute tests alongside it.
EDIT:
tests are a nice guarantee that the package is working as intended.
We could also check out the GitHub repo for tests. However, from a quick look at setup.py, pandas was meant to have the CI set correct metadata such as the version. So the version-controlled source can't directly be used to package pandas.
joaoe commented on Jan 26, 2020
Hi. I reported this elsewhere, so I'm pasting my comment here.
My use case
After a pip install pandas, the lib/site-packages/pandas/tests/ directory includes a lot of testing code which is definitely not relevant for me and many other end users of pandas. This bloats the installation and makes installation slower.
I'm working on packaging a Python environment to distribute with a preinstalled set of modules and applications, and there are too many popular 3rd-party modules which include unneeded test code, like numpy, IPython, jupyterlab, etc., which need to be stripped to keep the package size down. I'll be reporting issues to these projects as well.
Suggestion
Therefore, my suggestion is to keep the pandas module streamlined and move the tests out. Perhaps create a pandas-unittests module if people are interested in it, or just expect users to check out the code. Another possibility would be to skip packaging the tests folder and conftest.py when creating packages to upload to pypi.org.
Regarding pandas-slim, everyone and their mothers have a dependency on pandas, which would pull in the whole code again with tests.
That's perfectly fine. The discussion is whether tests are bundled with the pandas module or not.
Since you are now almost releasing 1.0, it might be a bit short notice to include this in such a big release. But for the next major release, it could work.
Thank you very much for your attention.
vfilimonov commented on Jun 3, 2020
A small remark: as a part of a recent commit to pyarrow, @wesm removed pyarrow.tests from the wheel, which to my understanding contributed 2.3 MB of ~60 MB installed size.
In the case of pandas, the tests folder contributed (as of version 1.0.3) 17.9 MB out of 49 MB installed size.
So I'd like to bring the question back to the discussion, and perhaps @wesm could comment on that?
wesm commented on Jun 3, 2020
I think it would be a good idea to not ship the tests in wheels. If you want users to be able to run the tests against their production installs perhaps the tests can be packaged as a separate source wheel. Install size is becoming a problem because of size constraints in things like AWS Lambda.
TomAugspurger commented on Oct 30, 2020
https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html has some information.
I think I've come around to the idea that we can just not ship the test files in the main pandas distributions. We can have a separate pandas-tests package (so that pip install pandas-tests works) that's just the test files plus an __init__.py file that ties things together.
We could even update pandas.test() to check for the presence of the pandas-tests package.
28 remaining items
viccsjain commented on Sep 7, 2023
Splitting the pandas library and its tests would be really useful. We are using this library in our serverless deployment, and there is a size restriction of 250 MB on packages uploaded to AWS Lambda. Removing the test files will reduce the size of our package.
thesamesam commented on Sep 7, 2023
See also the discussion in #54907.
jbsilva commented on Sep 25, 2023
Making docs and tests optional would be great.
In my cloud deployments I repackage it without the tests; 15 MB do make a difference for me.
I've seen many other packages that include tests, but never tests that big.
dolfinus commented on May 30, 2024
I was checking the size of one of my Docker images, and found that tests are about 50% of the size of the installed package:

A complete waste of space for me.
jonas-w commented on Aug 3, 2024
According to https://pypistats.org/packages/pandas the package has 240 million downloads per month.
If the 32 MB tests folder were removed from the package, the package size would be halved. Currently the wheels are roughly 13 MB, so let's say they would be 7 MB after removing the tests. Then PyPI would save ~1.7 petabytes of bandwidth per month, and could have saved ~90 petabytes of traffic since this issue was opened...
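As a rough check of that estimate, the arithmetic can be sketched as follows. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), a constant 240 million downloads per month over the whole period (downloads were presumably lower in earlier years, so the total is an upper-end guess), and about 55 months between January 2020 and August 2024.

```python
def monthly_savings_pb(downloads_per_month, saved_mb_per_download):
    """Bandwidth saved per month in petabytes (decimal units)."""
    bytes_saved = downloads_per_month * saved_mb_per_download * 10**6
    return bytes_saved / 10**15


downloads = 240_000_000
months_open = (2024 - 2020) * 12 + (8 - 1)  # Jan 2020 to Aug 2024

# ~7 MB saved per download reproduces the ~1.7 PB/month figure.
per_month_7mb = monthly_savings_pb(downloads, 7)
# A 13 MB wheel shrinking to 7 MB saves 6 MB, which gives a lower figure.
per_month_6mb = monthly_savings_pb(downloads, 6)

print(f"7 MB saved: {per_month_7mb:.2f} PB/month, "
      f"{per_month_7mb * months_open:.0f} PB over {months_open} months")
print(f"6 MB saved: {per_month_6mb:.2f} PB/month, "
      f"{per_month_6mb * months_open:.0f} PB over {months_open} months")
```

So the ~1.7 PB per month figure corresponds to saving about 7 MB per download; taking the 13 MB to 7 MB wheel sizes literally (6 MB saved) gives closer to 1.4 PB per month. Either way the order of magnitude holds.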
takluyver commented on Sep 17, 2024
I just noticed that on one of our filesystems, which is not set up for lots of small files, pandas' tests end up taking 300 MB of installed size.
Would it work as an initial step to make a script which splits the tests out of the wheels to be uploaded, and makes separate pandas-test wheels which can be uploaded separately? This is obviously not the most elegant way to do it, but I think I can see more or less how to make that work, whereas I'm not sure I can commit the time to figuring out how to rework pandas' build scripts and CI config to produce & use two separate wheels.
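A very rough sketch of the core of such a splitting script, using only the standard library, might look like this. It is not a working wheel splitter: it only partitions the archive entries by path, and it does not regenerate the RECORD file or produce metadata for a second distribution, both of which valid wheels would need. The function name split_wheel_entries and the pandas/tests/ prefix default are assumptions, and conftest.py is not handled.

```python
import io
import zipfile


def split_wheel_entries(wheel_bytes, tests_prefix="pandas/tests/"):
    """Split a wheel (zip) archive into (main_bytes, tests_bytes).

    Entries whose path starts with tests_prefix go to the second
    archive; everything else goes to the first. Metadata such as
    RECORD is copied unchanged, so the outputs are NOT valid wheels.
    """
    main_buf, tests_buf = io.BytesIO(), io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as src, \
            zipfile.ZipFile(main_buf, "w", zipfile.ZIP_DEFLATED) as main, \
            zipfile.ZipFile(tests_buf, "w", zipfile.ZIP_DEFLATED) as tests:
        for info in src.infolist():
            is_test = info.filename.startswith(tests_prefix)
            target = tests if is_test else main
            # Passing the ZipInfo keeps name, timestamp and permissions.
            target.writestr(info, src.read(info))
    return main_buf.getvalue(), tests_buf.getvalue()
```

Usage would be along the lines of reading a built wheel's bytes, calling split_wheel_entries on them, and then post-processing both archives to fix up their metadata before upload.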