Table of Contents
- About The Project
- Built With Python and PostgreSQL
- Contributing (UC Berkeley Bootcamp Students Only)
- Contact
- Acknowledgments
The instructions for this mini project are divided into the following subsections:
- Create the Category and Subcategory DataFrames
- Create the Campaign DataFrame
- Create the Contacts DataFrame
- Create the Crowdfunding Database
- Extract and transform the `crowdfunding.xlsx` Excel data to create a category DataFrame that has the following columns:
  - A `category_id` column that has entries going sequentially from `cat1` to `catn`, where n is the number of unique categories
  - A `category` column that contains only the category titles
- Export the category DataFrame as `category.csv`.
- Extract and transform the `crowdfunding.xlsx` Excel data to create a subcategory DataFrame that has the following columns:
  - A `subcategory_id` column that has entries going sequentially from `subcat1` to `subcatn`, where n is the number of unique subcategories
  - A `subcategory` column that contains only the subcategory titles
- Export the subcategory DataFrame as `subcategory.csv`. (A short sketch of both exports follows below.)
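The snippet below is a minimal sketch of these two steps. It assumes the raw workbook has a single `category & sub-category` column holding slash-delimited values (for example `food/food trucks`); that column name and format are assumptions on my part, not something this README specifies.

```python
import pandas as pd

# Read the raw workbook (sheet layout and column name below are assumptions).
crowdfunding_df = pd.read_excel("crowdfunding.xlsx")

# Assumed: a "category & sub-category" column with values like "food/food trucks".
crowdfunding_df[["category", "subcategory"]] = (
    crowdfunding_df["category & sub-category"].str.split("/", n=1, expand=True)
)

# Build sequential IDs cat1..catn and subcat1..subcatn from the unique values.
categories = crowdfunding_df["category"].unique()
subcategories = crowdfunding_df["subcategory"].unique()

category_df = pd.DataFrame({
    "category_id": [f"cat{i}" for i in range(1, len(categories) + 1)],
    "category": categories,
})
subcategory_df = pd.DataFrame({
    "subcategory_id": [f"subcat{i}" for i in range(1, len(subcategories) + 1)],
    "subcategory": subcategories,
})

# Export without the pandas index so the CSVs contain only the two columns.
category_df.to_csv("category.csv", index=False)
subcategory_df.to_csv("subcategory.csv", index=False)
```

Generating the IDs from `unique()` keeps them aligned with the order in which each category first appears in the data.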
We're tasked with extracting specific columns from the crowdfunding dataset, cleaning and transforming them, and creating a new DataFrame called `campaign_df`. This DataFrame will serve as the foundation for further analysis.
- Import pandas: imports the Pandas library for data manipulation.
- Load Excel data: reads the `crowdfunding.xlsx` file into a Pandas DataFrame.
- Create the `campaign_df` DataFrame: builds a new DataFrame with the selected columns and the necessary transformations:
  - The `cf_id`, `contact_id`, `company_name`, `outcome`, `backers_count`, `country`, and `currency` columns are directly assigned.
  - `blurb` is renamed to `description`.
  - `goal` and `pledged` are converted to float data types.
  - `launched_at` and `deadline` are converted to `datetime` format with UTC conversion.
  - `category_id` and `subcategory_id` are assigned.
- Data Cleaning: Ensure the data in the Excel file is clean and consistent before creating the DataFrame.
- Timezones: If the timestamps in `launched_at` and `deadline` are in a different timezone, adjust the `pd.to_datetime` call accordingly.
- Performance: For large datasets, consider using optimized methods for data conversion and creation.
- Mapping `category_id` and `subcategory_id`: These columns are created from the category and subcategory DataFrames built earlier, using mapping or merging operations.
By following these steps, you should have a well-structured campaign DataFrame ready for further analysis; one possible implementation is sketched below.
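Continuing from the category/subcategory sketch above, the listed transformations might look roughly like this. The raw column names, the epoch-seconds interpretation of `launched_at` and `deadline`, and the `campaign.csv` export name are all assumptions rather than facts stated in this README.

```python
import pandas as pd

# Load the workbook and re-derive the category/subcategory split (assumed column name).
crowdfunding_df = pd.read_excel("crowdfunding.xlsx")
crowdfunding_df[["category", "subcategory"]] = (
    crowdfunding_df["category & sub-category"].str.split("/", n=1, expand=True)
)

# Select the columns carried over directly, plus the ones to transform.
campaign_df = crowdfunding_df[
    ["cf_id", "contact_id", "company_name", "blurb", "goal", "pledged",
     "outcome", "backers_count", "country", "currency",
     "launched_at", "deadline", "category", "subcategory"]
].copy()

# blurb is renamed to description.
campaign_df = campaign_df.rename(columns={"blurb": "description"})

# goal and pledged are converted to float data types.
campaign_df[["goal", "pledged"]] = campaign_df[["goal", "pledged"]].astype(float)

# launched_at and deadline are converted to datetime with UTC conversion;
# unit="s" assumes the raw values are Unix epoch seconds.
for col in ["launched_at", "deadline"]:
    campaign_df[col] = pd.to_datetime(campaign_df[col], unit="s", utc=True)

# category_id and subcategory_id are assigned by merging in the category_df and
# subcategory_df DataFrames built in the previous sketch.
campaign_df = (
    campaign_df.merge(category_df, on="category", how="left")
               .merge(subcategory_df, on="subcategory", how="left")
               .drop(columns=["category", "subcategory"])
)

campaign_df.to_csv("campaign.csv", index=False)
```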
Choose one of the following two options for extracting and transforming the data from the `contacts.xlsx` Excel data:
- Option 1: Use Python dictionary methods.
- Option 2: Use regular expressions (regex).
Option 1, using Python dictionary methods, would be suitable for scenarios where:
- Data is irregular or unstructured: If the data within the `contact_info` column is inconsistent or has varying formats, dictionary-based parsing might offer more flexibility.
- Detailed control over data extraction is needed: You require granular control over the extraction process and have a clear understanding of the data structure.
- Performance is not critical: For smaller datasets, or when performance is not a primary concern, dictionary-based processing is a viable approach.
Option 2, using regular expressions (regex), would be suitable for scenarios where:
- Efficiency matters: Regular expressions are generally more efficient for extracting specific patterns from text data than iterating over dictionaries.
- Directness matters: Directly extracting the desired columns using regular expressions is more concise.
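Either option might look roughly like the sketch below. It assumes each `contact_info` entry in `contacts.xlsx` is a JSON-like string with `contact_id`, `name`, and `email` keys, and that the real header row starts a few rows down; both details are assumptions, not something this README confirms.

```python
import json

import pandas as pd

# header=3 is an assumption about where the column headers sit in contacts.xlsx.
contact_info_df = pd.read_excel("contacts.xlsx", header=3)

# Option 1: Python dictionary methods -- parse each JSON-like string into a dict,
# then let pandas turn the list of dicts into columns.
dict_rows = [json.loads(row) for row in contact_info_df["contact_info"]]
contacts_dict_df = pd.DataFrame(dict_rows)  # assumed keys: contact_id, name, email

# Option 2: regular expressions -- pull the same fields out by pattern.
info = contact_info_df["contact_info"]
contacts_regex_df = pd.DataFrame({
    "contact_id": info.str.extract(r'"contact_id":\s*(\d+)')[0].astype(int),
    "name": info.str.extract(r'"name":\s*"([^"]+)"')[0],
    "email": info.str.extract(r'"email":\s*"([^"]+)"')[0],
})

# Either DataFrame can then be exported for the database import step.
contacts_dict_df.to_csv("contacts.csv", index=False)
```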
- Inspect the four CSV files, and then sketch an ERD of the tables by using the QuickDBD site.
- Use the information from the ERD to create a table schema for each CSV file.
- Save the database schema as a Postgres file named `crowdfunding_db_schema.sql`.
- Create a new Postgres database named `crowdfunding_db`.
- Using the database schema, create the tables in the correct order to handle the foreign keys.
- Verify the table creation by running a `SELECT` statement for each table.
- Import each CSV file into its corresponding SQL table.
- Verify that each table has the correct data by running a `SELECT` statement for each (one way to script these checks is sketched below).
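The `SELECT` verification can be done in pgAdmin, but a quick Python check like the one below also works. The `psycopg2` connection settings and the table names (`category`, `subcategory`, `contacts`, `campaign`) are assumptions based on the four CSV files; adjust them to your own setup.

```python
import psycopg2

# Connection details are placeholders; replace them with your local Postgres settings.
conn = psycopg2.connect(
    host="localhost",
    dbname="crowdfunding_db",
    user="postgres",
    password="postgres",
)

tables = ["category", "subcategory", "contacts", "campaign"]  # assumed table names

with conn, conn.cursor() as cur:
    for table in tables:
        # Row count plus a peek at the first few rows for each table.
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(table, "rows:", cur.fetchone()[0])
        cur.execute(f"SELECT * FROM {table} LIMIT 5;")
        for row in cur.fetchall():
            print(row)

conn.close()
```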
ETL is a data integration process that involves extracting data from various sources, transforming it into a desired format, and loading it into a target system for further analysis or processing. It's a critical component of data warehousing and business intelligence solutions.
- Extraction
- Data Sources: This phase involves identifying and connecting to data sources, which can include databases (SQL, NoSQL), flat files (CSV, JSON, XML), APIs, or other systems.
- Data Retrieval: The data is extracted from these sources and fetched into the ETL process.
- Transformation
- Data Cleaning: This phase involves handling missing values, inconsistencies, and errors in the extracted data.
- Data Standardization: Data is formatted consistently across different sources.
- Data Enrichment: Additional data might be added or derived from existing data.
- Data Aggregation: Data is summarized or aggregated as required.
- Data Validation: Data integrity is checked to ensure accuracy.
- Loading
- Target System: The transformed data is loaded into a target system, such as a data warehouse, data mart, or operational database.
- Data Storage: Data is stored in a structured format for efficient querying and analysis.
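As a toy end-to-end illustration of those three phases using this project's tools: the `campaign_raw` table name and the connection string below are placeholders I've assumed, and the real load in this project happens through Postgres CSV imports rather than `to_sql`.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull the raw data out of its source (an Excel workbook here).
raw_df = pd.read_excel("crowdfunding.xlsx")

# Transform: clean and standardize -- e.g. enforce a numeric type on the goal column.
raw_df["goal"] = raw_df["goal"].astype(float)

# Load: write the transformed data into the target system (a Postgres database).
engine = create_engine("postgresql://postgres:postgres@localhost:5432/crowdfunding_db")
raw_df.to_sql("campaign_raw", engine, if_exists="replace", index=False)
```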
Contributing (UC Berkeley Bootcamp Students Only)
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b new-branch-name`)
- Commit your Changes (`git commit -m 'Add some message'`)
- Push to the Branch (`git push origin new-branch-name`)
- Create a pull request.
Forking a repository and creating a pull request on GitHub is a great way to contribute to open-source projects. Here's a breakdown of the process:
- Forking the Repository:
Find the repository you want to contribute to on GitHub. Click on the "Fork" button in the top right corner. This creates a copy of the repository in your own account.
- Clone the Forked Repository to Your Local Machine
You'll need Git installed on your system. Use Git commands to clone your forked repository to your local machine. There will be instructions on the GitHub repository page for cloning.
- Making Changes (Local Work):
Make your changes to the code in your local copy. Use Git commands to track your changes (adding, committing).
- Pushing Changes to Your Fork:
Once you're happy with your changes, use Git commands to push your local commits to your forked repository on GitHub.
- Creating a Pull Request:
Go to your forked repository on GitHub. Click the "Compare & pull request" button (might appear as a yellow banner). Here, you'll see a comparison between your changes and the original repository. Write a clear title and description for your pull request explaining the changes you made. Click "Create Pull Request" to submit it for review.
Distributed under the GNU General Public License. See `LICENSE.txt` for more information.
Thay Chansy - @thaychansy - or [email protected]
Please visit my Portfolio Page: thaychansy.github.io (https://thaychansy.github.io/)
Project Link: thaychansy/etl-project (Extract, Transform, and Load)
Here's a list of resources we found helpful and would like to give credit to.
- ChatGPT
- Google Gemini generative AI
- Stack Overflow: Error Importing CSV File Into PostgreSQL