GitHub - muchi674/fetch_interview

How to Run

example commands:
docker pull fetchdocker/data-takehome-postgres
docker pull fetchdocker/data-takehome-postgres
docker compose up -d
pip3 install -r requirements.txt
python3 login_etl.py
docker compose down

Questions

How would you deploy this application in production?
Assuming that Fetch's postgreSQL database in hosted on aws RDS, I would deploy this application in production by the following steps:

Create an IAM role to define the aws services and resources this application is allowed to access on Fetch's aws account
launch an EC2 instance with the IAM role (wrapped as profile) created in step 1 using launch template
allocate an elastic ip on aws VPC to the EC2 instance launched in step 2
create a security group on aws VPC and set an inbound rule that whitelists inbound traffic from elastic ip allocated in step 3
add the security group created in step 4 to the aws RDS instance that hosts the postgreSQL database that this application interacts with
containerize this application with a Dockerfile that defines the runtime environment and a Compose file that defines services that make up this application
merge production ready code to master
clone git repo on the EC2 instance launched in step 2, cd into project directory, and run docker compose up -d

What other components would you want to add to make this production ready?

Right now, inputs to this application is hardcoded. However, in production, said inputs should either be passed in through command line or through HTTP request if this application is made into a small Flask web application launched with, for instance, uWSGI
a robust data cleaning component to sanitize raw login data
clearly defined exception handling mechanism
informative logging mechanism. Where to put the logs? When to send the logs?
It is quite unsecure to pass database user passwords around. aws RDS db authentication can also be done through IAM.
an independent database role (user) should also be created for the application for better access control

How can this application scale with a growing dataset?

since aws SQS is distributed and that the etl process on each login data is independent of each other, this application/algorithm can also be distributed and parallel. This can be done through Spark and aws EMR.

How can PII be recovered later on?

I masked the PII by simpling rotating the string by the half of its length. Therefore, PII can be simply recovered by reversing the said rotation.

What are the assumptions you made?

Aside from removing the dots from the app_version values, I assumed that raw login data all have the correct data type.
create_date is relative to the hardware running the application's locale

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
docker-compose.yml		docker-compose.yml
login_etl.py		login_etl.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How to Run

Questions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

How to Run

Questions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages