Contributing

Catherine Birney edited this page Nov 8, 2021 · 8 revisions

Contributing to FLOWSA

FLOWSA contributions follow the standard GitHub workflow: fork this repository, make your changes, and open a pull request against the main repository for administrator review.

Please use a separate commit for each functional change so that individual changes can be applied with git cherry-pick.

Creating a Flow-By-Activity Dataset

  1. Write "instructions" for how to find the original data source being imported (e.g., a webpage). Write the instructions in a yaml file in the flowbyactivitymethods folder. See the README for the yaml configuration options.

  2. Write any functions needed to pull, parse, and format the data in a single script named with the same SourceName used in the flowbyactivitymethods yaml. Any functions written in this script should be referenced in the method yaml.

  3. Generate the Flow-By-Activity dataset by running flowbyactivity.py. This script calls on the information in the flowbyactivitymethods yaml (step 1) and the functions specific to a data source (step 2).

  4. Update the source catalog yaml with the information specific to the dataset.

FlowByActivity Naming Convention

Source dataset names are consistent across (1) the FlowByActivity dataset 'SourceName' columns, (2) the parquet file names, (3) the Crosswalk file names, and (4) the Source Catalog information. Source names are composed of two or three components. The first component is the agency that published the data. The second is the name or acronym of the published dataset. The third, if it exists, is the topic of the data parsed from the original dataset.

For example, of the four FlowByActivity datasets imported from the U.S. Department of Agriculture (USDA), three are pulled from the same dataset, the Census of Agriculture (CoA). To make data easier to find, the CoA data is separated by topic (Cropland, Livestock, Product Market Value). Because the FlowByActivity datasets are grouped by topic, some of the parquets contain multiple class types, so the Class should be specified when loading the data. The USDA_CoA_Cropland dataframe, for instance, includes acreage information for crops (Class = Land) and the number of farms that grow a particular crop (Class = Other).
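As a rough illustration of the naming convention and of selecting a Class, the following Python sketch splits a source name into its components and filters rows by Class. The helper `parse_source_name`, the `SourceName` tuple, and the sample rows are hypothetical and are not part of FLOWSA; the underscore delimiter is inferred from names such as USDA_CoA_Cropland.

```python
from typing import NamedTuple, Optional


class SourceName(NamedTuple):
    agency: str
    dataset: str
    topic: Optional[str]  # None when the name has only two components


def parse_source_name(name: str) -> SourceName:
    """Split an underscore-delimited source name into two or three components."""
    parts = name.split("_", 2)
    if len(parts) < 2:
        raise ValueError(f"expected at least an agency and a dataset in {name!r}")
    topic = parts[2] if len(parts) == 3 else None
    return SourceName(parts[0], parts[1], topic)


# Hypothetical rows standing in for the USDA_CoA_Cropland dataframe.
rows = [
    {"SourceName": "USDA_CoA_Cropland", "Class": "Land", "FlowName": "example acreage flow"},
    {"SourceName": "USDA_CoA_Cropland", "Class": "Other", "FlowName": "example farm count flow"},
]

# Keep only one class type, since a single parquet can hold several.
land_rows = [row for row in rows if row["Class"] == "Land"]

parsed = parse_source_name("USDA_CoA_Cropland")
print(parsed.agency, parsed.dataset, parsed.topic)
print(len(land_rows))
```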

Creating a Flow-By-Activity Crosswalk

A crosswalk linking a Flow-By-Activity's unique FlowNames to NAICS is required for each Flow-By-Activity.

  1. Generate a csv mapping each activity name in the Flow-By-Activity to NAICS 2012 codes and save it in the activitytosectormapping folder. These mapping files are only necessary for datasets that are not already NAICS based. The Source Catalog must specify whether the activities are NAICS-like.
    • Each mapping file is created with a .py file in the Scripts folder.
    • The mapping files do not include ratios for how an activity is split across NAICS codes when more than one NAICS code relates to an activity. Instead, ratios are created through the Flow-By-Sector methodology.
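A mapping script of the kind described above might look roughly like the following sketch. The column names, the source name "Agency_Dataset", the activity names, and the choice of NAICS codes are all assumptions made for illustration; check an existing file in the activitytosectormapping folder for the actual layout. Note that an activity mapped to several NAICS codes simply gets one row per code, with no ratio column.

```python
import csv
import io

# Assumed column layout; not confirmed by this page.
FIELDNAMES = ["ActivitySourceName", "Activity", "SectorSourceName", "Sector"]

# Hypothetical activity names with illustrative NAICS 2012 codes.
activity_to_naics = {
    "Example Activity A": ["111110"],
    "Example Activity B": ["111120", "111130"],  # one activity, several sectors
}


def build_mapping_rows(source_name, mapping):
    """Return one row per (activity, NAICS code) pair, without allocation ratios."""
    rows = []
    for activity, codes in mapping.items():
        for code in codes:
            rows.append({
                "ActivitySourceName": source_name,
                "Activity": activity,
                "SectorSourceName": "NAICS_2012_Code",
                "Sector": code,
            })
    return rows


def write_mapping_csv(rows, handle):
    """Write the mapping rows, with a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = build_mapping_rows("Agency_Dataset", activity_to_naics)
buffer = io.StringIO()  # swap for open(path, "w", newline="") to write a file
write_mapping_csv(rows, buffer)
print(buffer.getvalue())
```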

Creating a Flow-By-Sector Dataset

  1. Write "instructions" for how to attribute environmental data in the Flow-By-Activity datasets to North American Industry Classification System (NAICS) codes. Write the instructions in a yaml file in the flowbysectormethods folder. See the README for the yaml configuration options.

    • Data in a Flow-By-Activity dataset are converted to NAICS based on the values in the 'FlowName' column. There are two options for identifying which FlowNames to allocate in an activity set in the method yaml. The first is to list the FlowNames manually in the flowbysectormethods yaml. The second is to create a csv of FlowNames and the activity set each belongs to, saved in the flowbysectoractivitysets folder. The scripts that generate the flowbysectoractivitysets files are written in the Scripts folder.
  2. Add any functions required to help allocate a Flow-By-Activity dataset to NAICS in the same py file used to generate the Flow-By-Activity. These functions are optional and depend on the data source.

  3. Generate the Flow-By-Sector dataset by running flowbyfunctions.py, the script that calls on the information and functions specified in the flowbysectormethods yaml.
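To make the csv option from step 1 more concrete, the sketch below reads a small activity-set csv and groups the names by activity set. The column names "activity_set" and "name", the sample values, and the helper `load_activity_sets` are assumptions for illustration only; they are not taken from the FLOWSA source.

```python
import csv
import io
from collections import defaultdict

# Hypothetical csv content; a real file would live in flowbysectoractivitysets.
csv_text = """activity_set,name
activity_set_1,Example Flow A
activity_set_1,Example Flow B
activity_set_2,Example Flow C
"""


def load_activity_sets(handle):
    """Group names by the activity set they are assigned to."""
    sets = defaultdict(list)
    for row in csv.DictReader(handle):
        sets[row["activity_set"]].append(row["name"])
    return dict(sets)


activity_sets = load_activity_sets(io.StringIO(csv_text))
for set_name, names in activity_sets.items():
    print(set_name, names)
```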

Flow-By-Activity and Flow-By-Sector Naming Convention

See explanation here.

NAICS Crosswalk

Included in the package is a NAICS Crosswalk, which maps NAICS codes across the 2002, 2007, 2012, and 2017 versions. All Flow-By-Sector dataframes are mapped to NAICS 2012 codes. The crosswalk is based on USEEIO's mapping, which also maps NAICS to BEA codes.

The NAICS crosswalk contains some 7-digit NAICS 2012 codes, which are not official US Census codes. These 7-digit codes are used temporarily to help link datasets when the available data covers only a component of a 6-digit NAICS (such as the USDA Irrigation and Water Management Survey and the USDA Census of Agriculture). The 7-digit codes are then aggregated to official 6-digit codes.
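The aggregation step can be pictured with the following minimal sketch. It assumes that a 7-digit code's official parent is its first six digits, and the codes and amounts are made up; the package's own aggregation logic may differ.

```python
from collections import defaultdict


def aggregate_to_six_digit(flows):
    """Sum flow amounts by the first six digits of each NAICS code."""
    totals = defaultdict(float)
    for code, amount in flows:
        totals[code[:6]] += amount  # 6-digit codes pass through unchanged
    return dict(totals)


# Hypothetical (code, amount) pairs, for illustration only.
flows = [("1111101", 10.0), ("1111102", 5.0), ("111120", 2.5)]
totals = aggregate_to_six_digit(flows)
print(totals)
```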

If you create any of your own NAICS codes, the NAICS crosswalk must be recreated by running the function update_naics_crosswalk.
