The Bills Classification dataset is a collection of Congressional Bills from the 101st to 118th Congress with their legislative subjects and policy areas, suitable and intended for use in training and evaluating (large language) classification models and other approaches to classifying Bill texts by subject and policy area.
The dataset is entirely in English.
The dataset is in the form of a dataframe where each bill contains its identifiers (congress, house-billType and bill number-billNumber), most recent version of the bill text, legislative subjects and policy area categories.
An example of the dataframe is as follows:
congress | billType | billNumber | legislativeSubjects | policyArea | billText |
---|---|---|---|---|---|
101 | hconres | 1 | [American economic assistance, American milita...] | International Affairs | <pre>Â \nB37 6-6-89 [OC's]\nHCON 1 IH\n101st C... |
101 | hconres | 10 | [Constitutional law, Meditation, Prayer in the...] | Civil Rights and Liberties, Minority Issues | <pre>Â \nB37 Rosey 1/4/89 [Updated]\nHCON 10 I.. |
101 | hconres | 100 | [Genocide, Human rights, International relief,...] | International Affairs | <pre>Â \nHCON 100 IH\n101st CONGRESS\n1st Sess... |
Note that the billText values in this example have been truncated. Bill texts contain roughly 7000 words on average.
Field | Description |
---|---|
congress | String or integer indicating the congress that the bill belongs to |
billType | String indicating the house that the bill belong to |
billNumber | Integer indicating the bill number |
legislativeSubjects | List of strings containing the legislative subjects assigned to the bill |
policyArea | String containing the policy area assigned to the bill |
billText | HTML string containing the most recent text of the bill |
We are providing multiple version of the bills dataset. These include the original version, with some bills being removed from the set according to set criteria, and versions that have had resampling techniques applied to mitigate the statistical bias (imbalance) present in the original data.
All files are gzipped and range from 0.02GB to 0.84GB in size.
This dataset as been created on behalf of the Library of Congress.
The source data comprises the Congressional Bills relevant files (subjects, texts, metadata) which are available through the Congress.gov API.
Data on individual bills from the 101st to 118th Congresses was gathered from the Congress.gov API. This initial range of source data for the datasets was selected as it could provide a consistent set of subjects along with plaintext for bills suitable for use in classification tasks.
We have made the decision to not perform any text related pre-processing on the bill text itself. This data has been retrieved by filtering the text versions by date and taking the most recent HTML file in its raw format. As such, the text contains HTML tags which may or may not be useful depending on the model the user decides to use. We have therefore left it up to the user of the dataset to determine what, if any, pre-processing they would like to apply to the bill texts.
As part of our initial exploratory data analysis, general and targeted statistical profiling was performed on the data (see notebooks here). It was discovered that the data was highly skewed with regards to legislative subjects, which include Geographic Entities and Organization Names, with the most common occurring term appearing over 30,000 times while the least common only appears once. Due to this discrepancy it is highly likely that any classification models trained on the original data will be biased towards the more frequently occurring terms.
To mitigate this bias caused by statistical imbalance, we have created multiple re-sampled versions of this data and reapplied some statistical profiling to evaluate the subsequent datasets, see notebook for visualization and analysis of data splits. To outline, mitigation methods that have been applied are as follows:
- Random Undersampling
- k-Nearest-Neighbors Undersampling
- NearMiss Undersampling
- Random Oversampling
- Hybrid: Random Undersampling followed by Random Oversampling.
Please see our configurations used for these techniques. The configurations include arguments specific to each approach which may be similar for some due to the nature of each algorithm.
The bills used are public law texts, therefore any identifiable information will pertain to public figures.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
The datasets used in the experiment are public domain, and all data generated has no intellectual property or privacy restrictions.
Instructions for running any specific part of the code can be found in the relevant subdirectory.