This repository is organized in the following manner:
TREND-Junior-Data-Engineer-Technical-Exercise-VB/
|--- .sql_files/
|--- scripts/
|--- visualizations/
Here is what is contained in each subdirectory:
.sql_files: Containsschema.sql, where the sqlite table is generated fromscripts: Containsanalysis.py, where the data will be downloaded and analyzedvisualizations: Contains ER diagram as well as text files and a plot to answer the questions inanalysis.py
I ended up choosing the 311-Service-Requests-from-2010-to-Present dataset to download and analyze to get a sense of how complaints to public services in NYC, one of the more densely populated cities in the United States, are reported and handled. I have selected a subset of the columns available both for database normalization and relevance to any business questions I believe I could answer.
Issues:
- Socrata API Foundry's directed Python/Pandas code excerpt here times out my requests almost instantly, so my download script does not use an API key
- I have run into what I believe are harsh API rate limits with my approach, even when setting a limit of 500 rows to read from the larger dataset, making each run take longer and increasing the difficulty of debugging
- Clone the repo to your local machine
cdto theC:\TREND-Junior-Data-Engineer-Technical-Exercise-VB\scriptssubdirectory- Run this command in a terminal:
python analysis.py -cli_limit <cli_limit> -cli_chunk <cli_chunk>- Replace
<cli_limit>with the number of rows you want to query from the NYC 311 dataset - Replace
<cli_chunk>with the size of each chunk you want to process - Example usage:
python analysis.py -cli_limit 50 -cli_chunk 10(runs the pipeline on 50 rows from the dataset in sizes of 10)
- Replace
Notes:
- Please try to define your
cli_chunkas a denomination of the value you have chosen forcli_limit:- For example, if you choose
cli_limitas1000, choose a value like100,500, etc. forcli_limit
- For example, if you choose
- There seems to be something awry with the
resolution_time_in_dayscalculation. I am assuming it is the SQLitejuliandayfunction I used to try and get the number of days between dates. Given more time, I would definitely fix this.