Split legacy single-letter industry codes into industry_letter (#16) #18
Open
fgregg wants to merge 2 commits into
Pre-2005 records carry a single-letter FMCS industry code in the `industry` field instead of a NAICS code or spelled-out sector. Route those single-letter values into a new `industry_letter` column so `industry` holds only the modern spelled-out names. ~180k rows (1996-2004) are affected; letters and NAICS never co-occur, so no data is lost.
What
Splits the legacy pre-2005 single-letter FMCS industry codes out of the `industry` column into a new `industry_letter` column.

Why
Before ~2005 the `industry` field carried a single-letter code (`E`, `J`, `C`, `L`, …) instead of a NAICS code or a spelled-out sector name (see #16 §1), so `industry` mixed two incompatible taxonomies. After this change:

- `industry_letter`: the legacy single-letter code (1996–2004 records only; ~180k rows)
- `industry`: only the modern spelled-out sector names (2005+)
- `naics`: unchanged

Mechanical change in `scripts/to_csv.py`: any `industry` value that is a single A–Z letter is moved to `industry_letter` and blanked from `industry`. Non-letter values (`Not Provided`, `Other…`, spelled-out names) are untouched. Letters and NAICS never co-occur on a record, so no information is lost.
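For reviewers, the rule described above can be sketched roughly as follows. This is an illustration only, not the code in `scripts/to_csv.py`: the function names `split_industry` and `count_letter_naics_overlap` are hypothetical, rows are assumed to be plain dicts keyed by column name, and the match is assumed to be on an exact uppercase letter with no whitespace trimming.

```python
import re

# Assumption: "single A-Z letter" means exactly one uppercase ASCII letter.
_SINGLE_LETTER = re.compile(r"[A-Z]")


def split_industry(row: dict) -> dict:
    """Return a copy of `row` with a single-letter `industry` value
    moved into `industry_letter` and blanked from `industry`."""
    out = dict(row)
    value = out.get("industry") or ""
    if _SINGLE_LETTER.fullmatch(value):
        out["industry_letter"] = value
        out["industry"] = ""
    else:
        # Non-letter values are left untouched; the new column is empty.
        out.setdefault("industry_letter", "")
    return out


def count_letter_naics_overlap(rows: list) -> int:
    """Count rows carrying both a legacy letter and a NAICS code.
    The PR states this never happens, so the expected result is 0."""
    return sum(
        1
        for r in map(split_industry, rows)
        if r["industry_letter"] and (r.get("naics") or "")
    )


if __name__ == "__main__":
    sample = [
        {"industry": "E", "naics": ""},
        {"industry": "Not Provided", "naics": ""},
    ]
    for r in sample:
        print(split_industry(r))
    print("overlap:", count_letter_naics_overlap(sample))
```

A check like `count_letter_naics_overlap` is one way to confirm the "letters and NAICS never co-occur" claim against the full dataset before merging.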