Crazy Awesome Python

A selection of 52 curated data Python libraries and frameworks ordered by stars.

Check out the interactive version that you can filter and sort: https://www.awesomepython.org/

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
https://github.com/scrapy/scrapy
68 stars per week over 621 weeks
42,573 stars, 9,493 forks, 1,809 watches
created 2010-02-22, last commit 2022-01-21, main language Python
crawler, crawling, framework, hacktoberfest, python, scraping

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
https://github.com/pandas-dev/pandas
54 stars per week over 595 weeks
32,435 stars, 13,807 forks, 1,106 watches
created 2010-08-24, last commit 2022-01-23, main language Python
alignment, data-analysis, flexible, pandas, python

A Powerful Spider(Web Crawler) System in Python.
http://docs.pyspider.org/
https://github.com/binux/pyspider
36 stars per week over 413 weeks
15,281 stars, 3,628 forks, 903 watches
created 2014-02-21, last commit 2020-08-02, main language Python
crawler, python

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
https://www.jaided.ai/easyocr
https://github.com/JaidedAI/EasyOCR
140 stars per week over 97 weeks
13,626 stars, 1,815 forks, 268 watches
created 2020-03-14, last commit 2022-01-14, main language Python
cnn, crnn, data-mining, deep-learning, easyocr, image-processing, information-retrieval, lstm, machine-learning, ocr, optical-character-recognition, python, pytorch, scene-text, scene-text-recognition

Faker is a Python package that generates fake data for you.
http://faker.rtfd.org
https://github.com/joke2k/faker
28 stars per week over 479 weeks
13,577 stars, 1,554 forks, 219 watches
created 2012-11-12, last commit 2022-01-13, main language Python
dataset, fake, fake-data, python, test-data, test-data-generator, testing

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
https://github.com/twintproject/twint
51 stars per week over 241 weeks
12,336 stars, 2,013 forks, 294 watches
created 2017-06-10, last commit 2021-03-02, main language Python
elasticsearch, kibana, osint, python, scrape, scrape-followers, scrape-following, scrape-likes, tweep, tweets, twint, twitter

🦉Data Version Control | Git for Data & Models | ML Experiments Management
https://dvc.org
https://github.com/iterative/dvc
35 stars per week over 255 weeks
9,151 stars, 890 forks, 132 watches
created 2017-03-04, last commit 2022-01-22, main language Python
ai, collaboration, data-science, data-version-control, developer-tools, git, hacktoberfest, machine-learning, python, reproducibility

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
https://github.com/apache/arrow
28 stars per week over 309 weeks
8,976 stars, 2,184 forks, 333 watches
created 2016-02-17, last commit 2022-01-22, main language C++
arrow

Create HTML profiling reports from pandas DataFrame objects
https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/
https://github.com/pandas-profiling/pandas-profiling
26 stars per week over 315 weeks
8,421 stars, 1,219 forks, 146 watches
created 2016-01-09, last commit 2022-01-08, main language Jupyter Notebook
big-data-analytics, data-analysis, data-exploration, data-profiling, data-quality, data-science, deep-learning, eda, exploration, exploratory-data-analysis, hacktoberfest, html-report, jupyter, jupyter-notebook, machine-learning, pandas, pandas-dataframe, pandas-profiling, python, statistics

Incredibly fast crawler designed for OSINT.
https://github.com/s0md3v/Photon
42 stars per week over 199 weeks
8,402 stars, 1,242 forks, 318 watches
created 2018-03-30, last commit 2019-12-06, main language Python
crawler, information-gathering, osint, python, spider

SQL databases in Python, designed for simplicity, compatibility, and robustness.
https://sqlmodel.tiangolo.com/
https://github.com/tiangolo/sqlmodel
307 stars per week over 21 weeks
6,667 stars, 256 forks, 102 watches
created 2021-08-24, last commit 2022-01-08, main language Python
fastapi, json, json-schema, pydantic, python, sql, sqlalchemy

An open source multi-tool for exploring and publishing data
https://datasette.io
https://github.com/simonw/datasette
25 stars per week over 221 weeks
5,735 stars, 389 forks, 95 watches
created 2017-10-23, last commit 2022-01-21, main language Python
asgi, automatic-api, csv, datasets, datasette, datasette-io, docker, json, python, sql, sqlite

(JMLR' 19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
http://pyod.readthedocs.io
https://github.com/yzhao062/pyod
23 stars per week over 224 weeks
5,198 stars, 1,025 forks, 151 watches
created 2017-10-03, last commit 2022-01-04, main language Python
anomaly, anomaly-detection, autoencoder, data-analysis, data-mining, data-science, deep-learning, fraud-detection, machine-learning, neural-networks, outlier-detection, outlier-ensembles, outliers, python, python2, python3, unsupervised-learning

Extract keywords from sentences or replace keywords in sentences.
https://github.com/vi3k6i5/flashtext
21 stars per week over 231 weeks
5,040 stars, 578 forks, 141 watches
created 2017-08-15, last commit 2020-05-03, main language Python
data-extraction, keyword-extraction, nlp, search-in-text, word2vec

A next-generation curated knowledge sharing platform for data scientists and other technical professions.
https://github.com/airbnb/knowledge-repo
17 stars per week over 283 weeks
4,991 stars, 677 forks, 186 watches
created 2016-08-17, last commit 2021-09-01, main language Python
data, data-analysis, data-science, knowledge

The Database Toolkit for Python
https://www.sqlalchemy.org
https://github.com/sqlalchemy/sqlalchemy
29 stars per week over 164 weeks
4,805 stars, 777 forks, 82 watches
created 2018-11-27, last commit 2022-01-22, main language Python
python, sql, sqlalchemy

Official Kaggle API
https://github.com/Kaggle/kaggle-api
21 stars per week over 208 weeks
4,534 stars, 888 forks, 180 watches
created 2018-01-25, last commit 2021-03-15, main language Python

A data augmentations library for audio, image, text, and video.
https://ai.facebook.com/blog/augly-a-new-data-augmentation-library-to-help-build-more-robust-ai-models/
https://github.com/facebookresearch/AugLy
130 stars per week over 32 weeks
4,241 stars, 220 forks, 60 watches
created 2021-06-09, last commit 2022-01-20, main language Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
https://github.com/alirezamika/autoscraper
57 stars per week over 72 weeks
4,193 stars, 441 forks, 118 watches
created 2020-08-31, last commit 2021-02-03, main language Python
ai, artificial-intelligence, automation, crawler, machine-learning, python, scrape, scraper, scraping, web-scraping, webautomation, webscraping

Mimesis is a high-performance fake data generator for Python, which provides data for a variety of purposes in a variety of languages.
https://mimesis.name
https://github.com/lk-geimfari/mimesis
12 stars per week over 280 weeks
3,458 stars, 282 forks, 68 watches
created 2016-09-09, last commit 2022-01-22, main language Python
api-mock, data, datascience, dummy, fake, faker, fixtures, generator, json, json-generator, mimesis, mock, python, schema, synthetic-data, testing

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
https://www.amundsen.io/amundsen/
https://github.com/amundsen-io/amundsen
21 stars per week over 140 weeks
2,981 stars, 732 forks, 244 watches
created 2019-05-14, last commit 2022-01-22, main language Python
amundsen, data-catalog, data-discovery, linuxfoundation, metadata

Python tools for geographic data
http://geopandas.readthedocs.io/
https://github.com/geopandas/geopandas
6.62 stars per week over 447 weeks
2,963 stars, 668 forks, 108 watches
created 2013-06-27, last commit 2022-01-20, main language Python

A Python module for creating Excel XLSX files.
https://xlsxwriter.readthedocs.io
https://github.com/jmcnamara/XlsxWriter
5.94 stars per week over 472 weeks
2,804 stars, 558 forks, 119 watches
created 2013-01-04, last commit 2022-01-22, main language Python
charts, libxlsxwriter, pandas, python, spreadsheet, xlsx, xlsx-files, xlsxwriter

A non-validating SQL parser module for Python
https://github.com/andialbrecht/sqlparse
5.3 stars per week over 509 weeks
2,701 stars, 548 forks, 92 watches
created 2012-04-18, last commit 2021-09-10, main language Python

PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's API.
http://praw.readthedocs.io/
https://github.com/praw-dev/praw
4.53 stars per week over 596 weeks
2,699 stars, 427 forks, 67 watches
created 2010-08-19, last commit 2022-01-12, main language Python
api, oauth, praw, python, reddit, reddit-api

Manipulation and analysis of geometric objects
https://shapely.readthedocs.io/en/latest/
https://github.com/Toblerity/Shapely
4.97 stars per week over 525 weeks
2,609 stars, 445 forks, 88 watches
created 2011-12-31, last commit 2022-01-13, main language Python

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-data-wrangler.readthedocs.io
https://github.com/awslabs/aws-data-wrangler
16 stars per week over 151 weeks
2,469 stars, 412 forks, 60 watches
created 2019-02-26, last commit 2022-01-21, main language Python
amazon-athena, amazon-sagemaker-notebook, apache-arrow, apache-parquet, athena, aws, aws-glue, aws-lambda, data-engineering, data-science, emr, etl, glue-catalog, lambda, mysql, pandas, python, redshift

A Pythonic wrapper for the Wikipedia API
https://wikipedia.readthedocs.org/
https://github.com/goldsmith/Wikipedia
5.44 stars per week over 439 weeks
2,392 stars, 482 forks, 81 watches
created 2013-08-20, last commit 2020-10-09, main language Python

xlwings is a BSD-licensed Python library that makes it easy to call Python from Excel and vice versa. It works with Microsoft Excel on Windows and macOS.
https://www.xlwings.org
https://github.com/ZoomerAnalytics/xlwings
5.38 stars per week over 409 weeks
2,207 stars, 396 forks, 120 watches
created 2014-03-17, last commit 2021-12-21, main language Python
automation, excel, python, reporting

A pythonic interface to Amazon's DynamoDB
http://pynamodb.readthedocs.io
https://github.com/pynamodb/PynamoDB
4.34 stars per week over 417 weeks
1,815 stars, 388 forks, 41 watches
created 2014-01-20, last commit 2022-01-10, main language Python
aws, dynamodb, python

PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
https://github.com/graphistry/pygraphistry
4.42 stars per week over 346 weeks
1,532 stars, 150 forks, 43 watches
created 2015-06-02, last commit 2021-12-22, main language Python
analytics, blazingsql, csv, cuda, cudf, cugraph, dashboards, gpu, graph, graphistry, neo4j, networkx, notebooks, pandas, python, rapids, splunk, tigergraph, visualization, webgl

The Orator ORM provides a simple yet beautiful ActiveRecord implementation.
https://orator-orm.com
https://github.com/sdispater/orator
3.76 stars per week over 348 weeks
1,307 stars, 158 forks, 48 watches
created 2015-05-24, last commit 2020-01-06, main language Python
database, orm, python

A very simple Salesforce.com REST API client for Python
https://github.com/simple-salesforce/simple-salesforce
2.61 stars per week over 470 weeks
1,228 stars, 556 forks, 88 watches
created 2013-01-17, last commit 2021-09-09, main language Python
api, api-client, python, salesforce

Official PyTorch repo for JoJoGAN: One Shot Face Stylization
https://github.com/mchong6/JoJoGAN
157 stars per week over 5 weeks
831 stars, 114 forks, 15 watches
created 2021-12-17, last commit 2022-01-21, main language Jupyter Notebook
anime, gans, image-translation

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
https://github.com/pyjanitor-devs/pyjanitor
4.0 stars per week over 203 weeks
812 stars, 136 forks, 23 watches
created 2018-03-04, last commit 2022-01-23, main language Python
cleaning-data, data, data-engineering, dataframe, hacktoberfest, pandas, pydata

Python async ORM with FastAPI in mind and Pydantic validation
https://collerek.github.io/ormar/
https://github.com/collerek/ormar
10 stars per week over 77 weeks
789 stars, 38 forks, 13 watches
created 2020-08-02, last commit 2022-01-17, main language Python
alembic, async-orm, databases, fastapi, orm, pydantic, python-orm, sqlalchemy

bamboolib - a GUI for pandas DataFrames
https://bamboolib.com
https://github.com/tkrabel/bamboolib
5.56 stars per week over 138 weeks
771 stars, 81 forks, 30 watches
created 2019-05-29, last commit 2021-12-21, main language Jupyter Notebook
jupyter-notebook, jupyterlab, pandas, pandas-dataframes, python

🐳 The stupidly simple CLI workspace for your data warehouse.
https://docs.whale.cx
https://github.com/hyperqueryhq/whale
7.74 stars per week over 86 weeks
670 stars, 36 forks, 36 watches
created 2020-05-27, last commit 2022-01-04, main language Python
data-catalog, data-discovery, data-documentation

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
https://github.com/simonw/sqlite-utils
3.36 stars per week over 184 weeks
619 stars, 54 forks, 15 watches
created 2018-07-14, last commit 2022-01-19, main language Python
cli, click, datasette, datasette-io, datasette-tool, python, sqlite, sqlite-database

TorchGeo: datasets, transforms, and models for geospatial data
https://github.com/microsoft/torchgeo
15 stars per week over 35 weeks
555 stars, 51 forks, 24 watches
created 2021-05-21, last commit 2022-01-19, main language Python
datasets, deep-learning, models, pytorch, remote-sensing, torchvision, transforms

Fastest library to load data from DB to DataFrames in Rust and Python
https://github.com/sfu-db/connector-x
10 stars per week over 53 weeks
537 stars, 32 forks, 18 watches
created 2021-01-13, last commit 2022-01-21, main language Rust
database, dataframe, python, rust, sql

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
https://github.com/scikit-hep/awkward-1.0
3.74 stars per week over 127 weeks
477 stars, 46 forks, 15 watches
created 2019-08-14, last commit 2022-01-21, main language Python
apache-arrow, cern-root, columnar-format, data-analysis, jagged-array, json, numba, numpy, pandas, python, ragged-array, scikit-hep

Library for creating dataframes from functions.
https://github.com/stitchfix/hamilton
3.64 stars per week over 86 weeks
316 stars, 9 forks, 12 watches
created 2020-05-26, last commit 2021-12-23, main language Python

Uses tokenized query returned by python-sqlparse and generates query metadata
https://pypi.python.org/pypi/sql-metadata
https://github.com/macbre/sql-metadata
1.08 stars per week over 241 weeks
260 stars, 45 forks, 12 watches
created 2017-06-06, last commit 2022-01-22, main language Python
database, hive, hiveql, metadata, mysql-query, parser, python-package, python3-library, sql, sql-parser, sqlparse

Minimal class to download shared files from Google Drive.
https://github.com/ndrplz/google-drive-downloader
1.06 stars per week over 215 weeks
228 stars, 48 forks, 11 watches
created 2017-12-08, last commit 2019-02-09, main language Python

A toolkit providing a uniform interface for connecting to and extracting data from a wide variety of (potentially remote) data stores (including HDFS, Hive, Presto, MySQL, etc).
https://github.com/airbnb/omniduct
0.87 stars per week over 256 weeks
222 stars, 48 forks, 30 watches
created 2017-02-22, last commit 2021-10-27, main language Python

A Python implementation of Amazon Ion.
http://amzn.github.io/ion-docs/
https://github.com/amzn/ion-python
0.67 stars per week over 302 weeks
203 stars, 46 forks, 21 watches
created 2016-04-07, last commit 2021-12-10, main language Python

Genalog is an open source, cross-platform Python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.
https://microsoft.github.io/genalog/
https://github.com/microsoft/genalog
2.23 stars per week over 83 weeks
187 stars, 18 forks, 10 watches
created 2020-06-15, last commit 2021-08-18, main language Jupyter Notebook
data-generation, data-science, machine-learning, ner, ocr-recognition, python, synthetic-data, synthetic-data-generation, synthetic-images, text-alignment

A package to structure Australian addresses
https://github.com/jasonrig/address-net
0.89 stars per week over 163 weeks
145 stars, 65 forks, 10 watches
created 2018-12-05, last commit 2020-09-09, main language Python
address-parser, deep-learning, machine-learning, rnn

Apache Beam pipelines to make weather data accessible and useful.
https://weather-tools.readthedocs.io/
https://github.com/google/weather-tools
5.31 stars per week over 8 weeks
47 stars, 8 forks, 6 watches
created 2021-11-22, last commit 2022-01-19, main language Python
apache-beam, python, weather

A package for getting cloud products and product descriptions from a cloud provider website.
https://pypi.org/project/cloud-products/
https://github.com/dylanhogg/cloud-products
0.01 stars per week over 77 weeks
1 star, 0 forks, 1 watch
created 2020-08-01, last commit 2021-09-06, main language Python
aws, cloud-products, crawler, data, text-processing

Provides access to Australian legal data
https://github.com/dylanhogg/legaldata
0.0 stars per week over 66 weeks
0 stars, 0 forks, 1 watch
created 2020-10-12, last commit 2020-11-03, main language Python
crawler, data, law, lawtech, legal, legaltech

This file was automatically generated on 2022-01-23.
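
The "stars per week" figure in each entry appears to be the total star count divided by the repository's age in weeks on the generation date. The sketch below illustrates that arithmetic under this assumption; the function name `stars_per_week` is hypothetical and is not taken from the generator's code.

```python
from datetime import date


def stars_per_week(stars: int, created: date, generated: date) -> float:
    """Average stars gained per week between repo creation and list generation.

    Assumption: age is measured in fractional weeks (days / 7).
    """
    age_weeks = (generated - created).days / 7
    if age_weeks <= 0:
        raise ValueError("generated must be later than created")
    return stars / age_weeks


# SQLModel entry: 6,667 stars, created 2021-08-24, list generated 2022-01-23.
rate = stars_per_week(6667, date(2021, 8, 24), date(2022, 1, 23))
print(round(rate))  # 307
```

Under this reading, the "over N weeks" value in each entry is the same age with the fractional part dropped, which would explain why dividing the displayed stars by the displayed weeks does not always reproduce the displayed rate exactly.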

To curate your own GitHub list, simply clone the repository and change the input CSV file.

Inspired by:
https://github.com/vinta/awesome-python
https://github.com/trananhkma/fucking-awesome-python