
Suggestion: remove tests from the distribution #30741

Open
@vfilimonov

Description

@vfilimonov

Would it make sense to remove the tests folder from the pandas distribution? It accounts for roughly 33% of the whole package size.

This is especially important when using pandas inside AWS Lambda functions, where the deployment package size is limited to 50 MB zipped and 5 MB might really make a difference.

# Uncompressed
du -h -s pandas*
 46.5M	pandas
 30.9M	pandas_no_tests

# Compressed
du -h -s pandas*
 14.7M	pandas.zip
 10.1M	pandas_no_tests.zip

Activity

TomAugspurger

TomAugspurger commented on Jan 6, 2020

@TomAugspurger
Contributor

I think we've talked about that in the past. We do have pandas.test() as part of the public API however, so we'd need to consider that.

A couple options:

  1. Provide a separate distribution like pandas-slim or something that excludes these files (and docs)
  2. Have pandas.test() fetch the source files on demand. That seems a bit messy though.

Just as a note: we do exclude the test data files that are present in the git repository. So we're only talking about source files.

vfilimonov

vfilimonov commented on Jan 6, 2020

@vfilimonov
ContributorAuthor

Hello @TomAugspurger

pandas-slim sounds like a good workaround.

It looks like docs are not a part of the distribution.
And right, it's only the test code without the data files. In terms of size, the tests are second only to _libs and almost equal to the rest of the code.

8.0K	./arrays
 16K	./errors
 24K	./api
 40K	./__pycache__
 76K	./compat
 88K	./_config
248K	./tseries
308K	./util
440K	./plotting
2.3M	./io
6.7M	./core
 17M	./tests
 20M	./_libs

And what is the reason for having tests as part of the API? I see that numpy, scipy, matplotlib, etc. do the same (while many other libraries, especially web-oriented ones like flask, requests, and jinja, don't).

TomAugspurger

TomAugspurger commented on Jan 6, 2020

@TomAugspurger
Contributor
stonecharioteer

stonecharioteer commented on Jan 7, 2020

@stonecharioteer

In particular, I used to run the tests for numpy, scikit-learn and matplotlib after installing, since at times they would fail on Windows. However, this was quite some time ago, perhaps 4 years. Perhaps other users were doing the same?

jonringer

jonringer commented on Jan 9, 2020

@jonringer

I'm a package manager for nixpkgs, and I'm against removing tests from the sdist package. However, removing them from the wheel would make sense from a packaging standpoint. It's considered best practice in FOSS that if you distribute source, you also distribute tests alongside it.

EDIT:
tests are a nice guarantee that the package is working as intended.

We could also check out the GitHub repo for the tests. However, from a quick look at setup.py, pandas was meant to have the CI set the correct metadata, such as the version. So the version-controlled source can't be used directly to package pandas.

joaoe

joaoe commented on Jan 26, 2020

@joaoe

Hi. I reported this elsewhere, so I'm pasting my comment here.

My use case
After a pip install pandas, lib/site-packages/pandas/tests/ includes a lot of testing code which is definitely not relevant for me and many other end users of pandas.
This bloats the installation and makes it slower.
I'm working on packaging a Python environment to distribute with a preinstalled set of modules and applications, and too many popular third-party modules include unneeded test code, like numpy, IPython, jupyterlab, etc., which needs to be stripped to keep the package size down. I'll be reporting issues to these projects as well.

Suggestion
Therefore, my suggestion is to keep the pandas module streamlined and move the tests out. Perhaps create a pandas-unittests module if people are interested in it, or just expect users to check out the code. Another possibility would be to skip packaging the tests folder and conftest.py when creating packages to upload to pypi.org.

Regarding pandas-slim, everyone and their mothers have a dependency on pandas which would pull the whole code again with tests.

"It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it."

That's perfectly fine. The discussion is whether tests are bundled with the pandas module or not.

Since you are now close to releasing 1.0, it might be too short notice to include this in such a big release. But for the next major release, it could work.

Thank you very much for your attention.

added the Build (Library building on various platforms) and Testing (pandas testing functions or related to the test suite) labels on Jan 27, 2020
vfilimonov

vfilimonov commented on Jun 3, 2020

@vfilimonov
ContributorAuthor

A small remark: as part of a recent commit to pyarrow, @wesm removed pyarrow.tests from the wheel, which to my understanding contributed 2.3 MB of the ~60 MB installed size.

In the case of pandas (as of version 1.0.3), the tests folder contributed 17.9 MB out of the 49 MB installed size.

So I'd like to bring the question back to the discussion and perhaps, @wesm could comment on that?

wesm

wesm commented on Jun 3, 2020

@wesm
Member

I think it would be a good idea to not ship the tests in wheels. If you want users to be able to run the tests against their production installs perhaps the tests can be packaged as a separate source wheel. Install size is becoming a problem because of size constraints in things like AWS Lambda.

TomAugspurger

TomAugspurger commented on Oct 30, 2020

@TomAugspurger
Contributor

https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html has some information.

I think I've come around to the idea that we can just not ship the test files in the main pandas distributions. We can have a separate pandas-tests package, installed with pip install pandas-tests, that's just

  1. The test files
  2. A small __init__.py file that ties things together

We could even update pandas.test() to check for the presence of the pandas-tests package.

28 remaining items

self-assigned this
on Apr 15, 2023
viccsjain

viccsjain commented on Sep 7, 2023

@viccsjain

Splitting the pandas library and its tests would be really useful. We are using this library in our serverless deployment, and there is a size restriction of 250 MB on the package uploaded to AWS Lambda. Removing the test files will reduce the size of our package.

thesamesam

thesamesam commented on Sep 7, 2023

@thesamesam
Contributor

See also the discussion in #54907.

jbsilva

jbsilva commented on Sep 25, 2023

@jbsilva

Making docs and tests optional would be great.
In my cloud deployments I repackage it without the tests; 15 MB does make a difference for me.
I've seen many other packages that include tests, but never ones that big.
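
For reference, this kind of repackaging can be sketched in a few lines of Python. The helper name strip_tests is made up for illustration, and it assumes nothing in the deployment imports pandas.tests at runtime; pandas.test() will no longer work afterwards.

```python
import shutil
from pathlib import Path


def strip_tests(site_packages, package="pandas", subdir="tests"):
    """Delete <site_packages>/<package>/<subdir> and return the bytes freed."""
    target = Path(site_packages) / package / subdir
    if not target.is_dir():
        return 0
    # Sum file sizes before deleting so the caller can report the saving.
    freed = sum(p.stat().st_size for p in target.rglob("*") if p.is_file())
    shutil.rmtree(target)
    return freed
```

A typical use would be to run pip install --target build pandas and then call strip_tests("build") before zipping the build directory.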

dolfinus

dolfinus commented on May 30, 2024

@dolfinus

I was checking the size of one of my Docker images and found that the tests are about 50% of the size of the installed package:
[image]

A complete waste of space for me.

jonas-w

jonas-w commented on Aug 3, 2024

@jonas-w

According to https://pypistats.org/packages/pandas the package has 240 million downloads per month.

If the 32 MB tests folder were removed from the package, the package size would be roughly halved. Currently the wheels are roughly 13 MB, so let's say the wheels would be 7 MB after removing the tests; then PyPI would save ~1.7 petabytes of bandwidth per month, and could have saved roughly ~90 petabytes of traffic since this issue was opened...
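
The back-of-the-envelope estimate can be checked with a few lines of Python. This is only a sketch under stated assumptions: a constant 240 million downloads per month for the whole period, decimal units (1 PB = 10^9 MB), and the issue counted as open since January 2020 (the first comment is dated Jan 6, 2020). With exactly 13 MB and 7 MB the result is about 1.4 PB per month and about 79 PB in total, the same order of magnitude as the figures quoted above.

```python
MB_PER_PB = 1_000_000_000  # decimal units: 1 PB = 10**9 MB


def monthly_savings_pb(downloads_per_month, wheel_mb_before, wheel_mb_after):
    """Bandwidth saved per month, in petabytes, if every download shrinks."""
    saved_mb = downloads_per_month * (wheel_mb_before - wheel_mb_after)
    return saved_mb / MB_PER_PB


def months_between(start_year, start_month, end_year, end_month):
    """Whole months elapsed between two (year, month) pairs."""
    return (end_year - start_year) * 12 + (end_month - start_month)


per_month = monthly_savings_pb(240_000_000, 13, 7)
months = months_between(2020, 1, 2024, 8)  # assumed span: Jan 2020 to Aug 2024
total = per_month * months
print(f"{per_month:.2f} PB/month, {total:.0f} PB over {months} months")
```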

takluyver

takluyver commented on Sep 17, 2024

@takluyver
Contributor

I just noticed that on one of our filesystems, which is not set up for lots of small files, pandas' tests end up taking 300 MB of installed size.

Would it work as an initial step to make a script which splits the tests out of the wheels to be uploaded, and makes separate pandas-test wheels which can be uploaded separately? This is obviously not the most elegant way to do it, but I think I can see more or less how to make that work, whereas I'm not sure I can commit the time to figuring out how to rework pandas' build scripts and CI config to produce & use two separate wheels.
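
As a starting point for such a script, here is a minimal sketch of the splitting step only. The function name split_wheel and the pandas/tests/ prefix are assumptions for illustration. It is not a complete solution: a wheel's RECORD file lists every installed file, so it would have to be regenerated for the trimmed wheel, and the tests archive would need its own metadata before either output is a valid wheel.

```python
import zipfile


def split_wheel(src, main_out, tests_out, prefix="pandas/tests/"):
    """Copy the members of src into two zip archives, split on a path prefix.

    Returns (n_main, n_tests). Sketch only: RECORD and the *.dist-info
    metadata are copied unchanged, so the outputs are not valid wheels yet.
    """
    n_main = n_tests = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(main_out, "w", zipfile.ZIP_DEFLATED) as zmain, \
            zipfile.ZipFile(tests_out, "w", zipfile.ZIP_DEFLATED) as ztests:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.startswith(prefix):
                ztests.writestr(info, data)
                n_tests += 1
            else:
                zmain.writestr(info, data)
                n_main += 1
    return n_main, n_tests
```

The arguments can be file paths or file-like objects, which makes the function easy to try on an in-memory archive before pointing it at real wheels.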

