Package to easily import datasets from the UC Irvine Machine Learning Repository into scripts and notebooks.
Current Version: 0.0.7
In a Jupyter notebook, install with the command
!pip3 install -U ucimlrepo
Restart the kernel and import the module ucimlrepo.
from ucimlrepo import fetch_ucirepo, list_available_datasets
# check which datasets can be imported
list_available_datasets()
# import dataset
heart_disease = fetch_ucirepo(id=45)
# alternatively: fetch_ucirepo(name='Heart Disease')
# access data
X = heart_disease.data.features
y = heart_disease.data.targets
# train model e.g. sklearn.linear_model.LinearRegression().fit(X, y)
# access metadata
print(heart_disease.metadata.uci_id)
print(heart_disease.metadata.num_instances)
print(heart_disease.metadata.additional_info.summary)
# access variable info in tabular format
print(heart_disease.variables)
fetch_ucirepo loads a dataset from the UCI ML Repository, including its dataframes and metadata.
Provide either a dataset ID or a name as a keyword (named) argument; the function does not accept both.
- id: Dataset ID for UCI ML Repository
- name: Dataset name, or substring of name
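The "ID or name, not both" rule can be checked before any network request is made. The sketch below is only an illustration: load_dataset is a hypothetical wrapper, not part of ucimlrepo, and it imports fetch_ucirepo lazily so the argument check runs even where the package is not installed.

```python
def load_dataset(id=None, name=None):
    """Hypothetical wrapper: require exactly one of id or name, then fetch."""
    # Reject both "neither given" and "both given" up front.
    if (id is None) == (name is None):
        raise ValueError("Provide exactly one of id or name")

    # Imported here so the validation above does not depend on the package.
    from ucimlrepo import fetch_ucirepo

    if id is not None:
        return fetch_ucirepo(id=id)
    return fetch_ucirepo(name=name)
```

Calling load_dataset(id=45) or load_dataset(name='Heart Disease') would then behave like the corresponding fetch_ucirepo call in the example at the top of this page.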
Returns:
- dataset
  - data: Contains dataset matrices as pandas dataframes
    - ids: Dataframe of ID columns
    - features: Dataframe of feature columns
    - targets: Dataframe of target columns
    - original: Dataframe consisting of all IDs, features, and targets
    - headers: List of all variable names/headers
  - metadata: Contains metadata information about the dataset
    - See Metadata section below for details
  - variables: Contains variable details presented in a tabular/dataframe format
    - name: Variable name
    - role: Whether the variable is an ID, feature, or target
    - type: Data type e.g. categorical, integer, continuous
    - demographic: Indicates whether the variable represents demographic data
    - description: Short description of variable
    - units: Variable units for non-categorical data
    - missing_values: Whether there are missing values in the variable's column
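Because the variables table carries a role for every column, it can be used to look up column names by role. The following is a minimal sketch under stated assumptions: names_by_role is a hypothetical helper, it works on plain row dictionaries (such as those produced by pandas' DataFrame.to_dict("records")), and it compares roles case-insensitively because the exact spelling of the role values is not specified here.

```python
def names_by_role(variable_rows, role):
    """Return the names of variables whose role matches, ignoring case.

    variable_rows: iterable of dicts with at least 'name' and 'role' keys.
    role: e.g. 'feature', 'target', or 'id' (spelling is an assumption).
    """
    wanted = role.strip().lower()
    return [
        row["name"]
        for row in variable_rows
        if str(row["role"]).strip().lower() == wanted
    ]
```

With a fetched dataset this could be called as names_by_role(heart_disease.variables.to_dict("records"), "target").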
list_available_datasets prints a list of datasets that can be imported via fetch_ucirepo.
- filter: Optional keyword argument to filter available datasets based on a category
  - Valid filters: aim-ahead
- search: Optional keyword argument to search datasets whose name contains the search query
Returns: none
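As a rough illustration of the two optional keyword arguments, the sketch below builds the arguments first and checks the filter against the one value listed above. Both listing_kwargs and show_datasets are hypothetical helpers, not part of ucimlrepo, and VALID_FILTERS reflects only what this page documents; the installed version may accept other values.

```python
# The only filter value named in this document (an assumption about completeness).
VALID_FILTERS = {"aim-ahead"}


def listing_kwargs(filter=None, search=None):
    """Hypothetical helper: build keyword arguments for list_available_datasets."""
    kwargs = {}
    if filter is not None:
        if filter not in VALID_FILTERS:
            raise ValueError(f"Unknown filter: {filter!r}")
        kwargs["filter"] = filter
    if search is not None:
        kwargs["search"] = search
    return kwargs


def show_datasets(filter=None, search=None):
    """Print the matching datasets; needs ucimlrepo and network access."""
    # Lazy import keeps listing_kwargs usable without the package installed.
    from ucimlrepo import list_available_datasets

    list_available_datasets(**listing_kwargs(filter=filter, search=search))
```

For example, show_datasets(search='heart') would print only datasets whose name contains "heart", and show_datasets(filter='aim-ahead') would restrict the listing to that category.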
Metadata
- uci_id: Unique dataset identifier for UCI repository
- name
- abstract: Short description of dataset
- area: Subject area e.g. life science, business
- task: Associated machine learning tasks e.g. classification, regression
- characteristics: Dataset types e.g. multivariate, sequential
- num_instances: Number of rows or samples
- num_features: Number of feature columns
- feature_types: Data types of features
- target_col: Name of target column(s)
- index_col: Name of index column(s)
- has_missing_values: Whether the dataset contains missing values
- missing_values_symbol: Indicates what symbol represents the missing entries (if the dataset has missing values)
- year_of_dataset_creation
- dataset_doi: DOI registered for dataset that links to UCI repo dataset page
- creators: List of dataset creator names
- intro_paper: Information about dataset's published introductory paper
- repository_url: Link to dataset webpage on the UCI repository
- data_url: Link to raw data file
- additional_info: Descriptive free text about dataset
  - summary: General summary
  - purpose: For what purpose was the dataset created?
  - funding: Who funded the creation of the dataset?
  - instances_represent: What do the instances in this dataset represent?
  - recommended_data_splits: Are there recommended data splits?
  - sensitive_data: Does the dataset contain data that might be considered sensitive in any way?
  - preprocessing_description: Was there any data preprocessing performed?
  - variable_info: Additional free text description for variables
  - citation: Citation Requests/Acknowledgements
- external_url: URL to external dataset page. This field will only exist for linked datasets i.e. not hosted by UCI
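Since external_url exists only for linked datasets, code that reads it should tolerate its absence. The sketch below assumes metadata fields are reachable as attributes, as in the usage example (heart_disease.metadata.uci_id). dataset_link is a hypothetical helper, and the SimpleNamespace objects with example.org URLs are stand-ins for real metadata.

```python
from types import SimpleNamespace


def dataset_link(metadata):
    """Return external_url for linked datasets, otherwise repository_url."""
    # getattr with a default covers objects that lack the field entirely;
    # the truthiness check also covers a field that is present but empty.
    external = getattr(metadata, "external_url", None)
    return external if external else metadata.repository_url


# Stand-in metadata objects for illustration only.
hosted = SimpleNamespace(repository_url="https://example.org/hosted")
linked = SimpleNamespace(
    repository_url="https://example.org/listing",
    external_url="https://example.org/external",
)
```

With a fetched dataset, the call would be dataset_link(heart_disease.metadata).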