Docstring-generated documentation for the modules and functions developed in this project can be found at: Project's Github Pages.
Extraction is the first step in an ETL pipeline; it involves ingesting data from a database or other data source. In this project, the data can be read from a local file or from a given HTTP URL. It must be noted, however, that this data must be in CSV format.
The first step in the program's execution is to prompt the user for the order filters, which are read from the terminal by the function parse_user_input:
```
Type origin filter:
0 -> No filter
1 -> Online
2 -> Physical

Type status filter:
0 -> No filter
1 -> Pending
2 -> Completed
```

The provided filters are stored in a record called user_input and are then used in a later step to eliminate orders that don't match the desired fields.
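A minimal sketch of what this step might look like, assuming the two answers are stored as plain integers in fields named origin_filter and status_filter (the real user_input record and parse_user_input may use the status/origin variants directly):

```ocaml
(* Hypothetical shape of the user's filter choices. *)
type user_input = {
  origin_filter : int;  (* 0 = no filter, 1 = Online, 2 = Physical *)
  status_filter : int;  (* 0 = no filter, 1 = Pending, 2 = Completed *)
}

(* Prompt for both filters on the terminal and return them as a record. *)
let parse_user_input () : user_input =
  print_endline "Type origin filter:\n0 -> No filter\n1 -> Online\n2 -> Physical";
  let origin_filter = int_of_string (String.trim (read_line ())) in
  print_endline "Type status filter:\n0 -> No filter\n1 -> Pending\n2 -> Completed";
  let status_filter = int_of_string (String.trim (read_line ())) in
  { origin_filter; status_filter }
```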
The main functions in this step are:
- http_get_string - Downloads a string from the specified URL
- read_file - Reads a local CSV file from the specified path
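The sketches below show one way these two helpers could be written; the HTTP version assumes the Cohttp-lwt-unix client mentioned later in this report, and the actual signatures in the project may differ:

```ocaml
open Lwt.Infix

(* Download the body of the given URL as a single string. *)
let http_get_string (url : string) : string =
  Lwt_main.run
    (Cohttp_lwt_unix.Client.get (Uri.of_string url) >>= fun (_resp, body) ->
     Cohttp_lwt.Body.to_string body)

(* Read a local CSV file into a single string. *)
let read_file (path : string) : string =
  let ic = open_in path in
  let contents = really_input_string ic (in_channel_length ic) in
  close_in ic;
  contents
```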
After the strings containing the CSV data are loaded, the next step is to parse these strings into the corresponding records. In this case, the two record types are:
Item Record:

```ocaml
type item = {
  order_id : int;
  quantity : int;
  price : float;
  tax : float;
}
```

Order Record:

```ocaml
type order = {
  id : int;
  client_id : int;
  order_date : string;
  status : status;
  origin : origin;
}
```

The main functions associated with parsing the CSV data are:
- parse_items - Parses a CSV string of items into a list of item records
- parse_orders - Parses a CSV string of orders into a list of order records
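A simplified sketch of these parsers is shown below. It assumes the item and order records above, a header line, comma-separated fields with no quoting, and the column order used in the pattern matches; the status_of_string and origin_of_string helpers and the string spellings of the variants are assumptions, and the real functions may use a CSV library instead:

```ocaml
type status = Pending | Completed
type origin = Online | Physical

let status_of_string = function
  | "Pending" -> Pending
  | "Completed" -> Completed
  | s -> failwith ("unknown status: " ^ s)

let origin_of_string = function
  | "Online" -> Online
  | "Physical" -> Physical
  | s -> failwith ("unknown origin: " ^ s)

(* Split a CSV string into rows of fields, dropping the header line. *)
let rows_of_csv (csv : string) : string list list =
  String.split_on_char '\n' (String.trim csv)
  |> List.tl
  |> List.filter (fun r -> String.trim r <> "")
  |> List.map (String.split_on_char ',')

let parse_items (csv : string) : item list =
  rows_of_csv csv
  |> List.map (function
       | [ order_id; quantity; price; tax ] ->
           { order_id = int_of_string order_id;
             quantity = int_of_string quantity;
             price = float_of_string price;
             tax = float_of_string tax }
       | _ -> failwith "malformed item row")

let parse_orders (csv : string) : order list =
  rows_of_csv csv
  |> List.map (function
       | [ id; client_id; order_date; st; og ] ->
           { id = int_of_string id;
             client_id = int_of_string client_id;
             order_date;
             status = status_of_string st;
             origin = origin_of_string og }
       | _ -> failwith "malformed order row")
```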
With the lists of item and order records in hand, the list of orders is then filtered using the aforementioned user_input and the filter_orders function, which returns a list of order records that satisfy the user_input's filters.
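A possible shape for filter_orders, assuming the user_input record sketched earlier (raw 0/1/2 answers) and the status/origin variants above; the real implementation may encode the filters differently:

```ocaml
let filter_orders (input : user_input) (orders : order list) : order list =
  let origin_ok o =
    match input.origin_filter with
    | 1 -> o.origin = Online
    | 2 -> o.origin = Physical
    | _ -> true            (* 0 or anything else: no origin filter *)
  in
  let status_ok o =
    match input.status_filter with
    | 1 -> o.status = Pending
    | 2 -> o.status = Completed
    | _ -> true            (* 0 or anything else: no status filter *)
  in
  List.filter (fun o -> origin_ok o && status_ok o) orders
```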
Transformation is the portion of the ETL that transforms the ingested data and produces the output that will later be loaded into the target system, which in this case is a database.
Before the transform portion of the ETL, the last step of preparation is to join the order and item lists of records. These tables are inner joined by the function inner_join. This operation produces a new record type called int_result, which contains the original item information and its corresponding order information:
int_result:

```ocaml
type int_result = {
  order_id : int;
  client_id : int;
  order_date : string;
  status : status;
  origin : origin;
  quantity : int;
  price : float;
  tax : float;
}
```

Each int_result record combines an item's fields with the client_id, order_date, status, and origin fields of its corresponding order, if such an order exists. If an item has no corresponding order, it is dropped.
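One way to express this join, assuming the records above; items whose order cannot be found are discarded because List.filter_map keeps only the Some cases (the real inner_join may be structured differently):

```ocaml
let inner_join (orders : order list) (items : item list) : int_result list =
  items
  |> List.filter_map (fun (it : item) ->
         match List.find_opt (fun (o : order) -> o.id = it.order_id) orders with
         | None -> None                     (* no matching order: drop the item *)
         | Some o ->
             Some
               { order_id = it.order_id;
                 client_id = o.client_id;
                 order_date = o.order_date;
                 status = o.status;
                 origin = o.origin;
                 quantity = it.quantity;
                 price = it.price;
                 tax = it.tax })
```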
These int_result records are then grouped by their order_id field with the function group_by, which results in an IntMap of the form `{int: int_result list}`.

The order results are then calculated. Each result record contains the order's total price, calculated by summing the prices of all of its items, and the total tax paid on that order, calculated by summing each item's price times its tax rate.
These results are accumulated with List.fold_left operations, which build the final record list in reverse order, so a List.rev is needed to restore the original order.
This can all be seen in the code for the get_results function.
```ocaml
type result = {
  order_id : int;
  price : float;
  tax : float;
}
```

By the end of this operation we are left with a list of result objects.
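The sketch below illustrates the grouping and aggregation just described. It assumes the totals are computed exactly as stated above (summing item prices, and summing price times tax rate); the real group_by and get_results may, for example, also factor in quantity:

```ocaml
module IntMap = Map.Make (Int)

(* Group int_result records by order_id into an IntMap of lists. *)
let group_by (rows : int_result list) : int_result list IntMap.t =
  List.fold_left
    (fun acc (r : int_result) ->
      IntMap.update r.order_id
        (function None -> Some [ r ] | Some rs -> Some (r :: rs))
        acc)
    IntMap.empty rows

(* For each order, sum the item prices and the price *. tax amounts.
   The fold builds the list in reverse, hence the final List.rev. *)
let get_results (rows : int_result list) : result list =
  IntMap.fold
    (fun order_id items acc ->
      let price, tax =
        List.fold_left
          (fun (p, t) (it : int_result) ->
            (p +. it.price, t +. (it.price *. it.tax)))
          (0., 0.) items
      in
      { order_id; price; tax } :: acc)
    (group_by rows) []
  |> List.rev
```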
The same int_result records (mentioned in step 2.2) are also grouped by year-month with the function group_by_ym; this operation produces a StringMap of the form `{string: int_result list}`.

The same logic described in 2.3 is applied to the year-month groups in order to obtain ym_results, which contain the same fields as the result records but have a string identifier called date. It must be noted that instead of calculating the total order price and tax as is done in 2.3, this step calculates the average price and average tax for all of the items in orders grouped by year-month.
```ocaml
type ym_result = {
  date : string;
  avg_price : float;
  avg_tax : float;
}
```

The function which performs the avg_price and avg_tax calculations for the year-month groups is called get_ym_results.
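A sketch of this aggregation, assuming order_date is an ISO-style "YYYY-MM-DD" string whose first seven characters give the year-month key, and that the averages are taken over the joined items in each group; group_by_ym and get_ym_results in the project may derive the key and the averages differently:

```ocaml
module StringMap = Map.Make (String)

(* Group int_result records by the "YYYY-MM" prefix of their order_date. *)
let group_by_ym (rows : int_result list) : int_result list StringMap.t =
  List.fold_left
    (fun acc (r : int_result) ->
      let ym = String.sub r.order_date 0 7 in
      StringMap.update ym
        (function None -> Some [ r ] | Some rs -> Some (r :: rs))
        acc)
    StringMap.empty rows

(* Average price and tax over the items of each year-month group. *)
let get_ym_results (rows : int_result list) : ym_result list =
  StringMap.fold
    (fun date items acc ->
      let n = float_of_int (List.length items) in
      let price, tax =
        List.fold_left
          (fun (p, t) (it : int_result) ->
            (p +. it.price, t +. (it.price *. it.tax)))
          (0., 0.) items
      in
      { date; avg_price = price /. n; avg_tax = tax /. n } :: acc)
    (group_by_ym rows) []
  |> List.rev
```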
In principle, the loading step of an ETL consists of loading the processed data into a target system. In this project, the target is an SQLite3 database, and there are two lists of records that must be written into the output DB: the results list and the ym_results list.
The .db file is created by the create_db function, which creates the file itself and calls two other functions, create_table and create_ym_table, to create the tables that store the normal results and the ym_results. This function returns a db handle that is used when writing to the created tables.
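A sketch of this setup using the Sqlite3 OCaml bindings; the table and column names here are assumptions and the project's actual schema may differ:

```ocaml
(* Run a statement and fail loudly if SQLite reports an error. *)
let exec_or_fail db sql =
  match Sqlite3.exec db sql with
  | Sqlite3.Rc.OK -> ()
  | rc -> failwith ("SQL error: " ^ Sqlite3.Rc.to_string rc)

let create_table db =
  exec_or_fail db
    "CREATE TABLE IF NOT EXISTS results (order_id INTEGER PRIMARY KEY, price REAL, tax REAL);"

let create_ym_table db =
  exec_or_fail db
    "CREATE TABLE IF NOT EXISTS ym_results (date TEXT PRIMARY KEY, avg_price REAL, avg_tax REAL);"

(* Create (or open) the .db file, create both tables, and return the handle. *)
let create_db (path : string) : Sqlite3.db =
  let db = Sqlite3.db_open path in
  create_table db;
  create_ym_table db;
  db
```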
The db handle returned by the previous step is then passed to write_to_sqlite alongside the results and ym_results. write_to_sqlite iterates over both lists of results and calls their respective insert functions (insert_result and insert_ym_result) to write each record to its table.
Finally, after writing all of the result and ym_result records, the db is closed and the program execution finishes.
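A sketch of this load step, reusing the exec_or_fail helper and tables from the previous sketch. For brevity the values are spliced in with Printf.sprintf; the project's insert functions could equally use prepared statements (Sqlite3.prepare/bind/step):

```ocaml
let insert_result db (r : result) =
  exec_or_fail db
    (Printf.sprintf
       "INSERT INTO results (order_id, price, tax) VALUES (%d, %f, %f);"
       r.order_id r.price r.tax)

let insert_ym_result db (r : ym_result) =
  exec_or_fail db
    (Printf.sprintf
       "INSERT INTO ym_results (date, avg_price, avg_tax) VALUES ('%s', %f, %f);"
       r.date r.avg_price r.avg_tax)

(* Write both result lists, then close the database handle. *)
let write_to_sqlite db (results : result list) (ym_results : ym_result list) =
  List.iter (insert_result db) results;
  List.iter (insert_ym_result db) ym_results;
  ignore (Sqlite3.db_close db)
```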
To build the project and run it yourself, you can install OCaml and opam as follows:
- installing opam:

```
sudo apt update
sudo apt install -y opam
```

- installing ocaml:

```
opam switch create 5.2.0
eval $(opam env)
```

- installing the project's dependencies:

```
opam install . --deps-only
```

Building the project:

```
dune build
```

Running the project:

```
dune exec bin/main.exe
```

A pure function always produces the same output for a given input and does not read or modify any external state. This means it is both deterministic and free of side effects.
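Two small, hypothetical examples of this distinction in OCaml:

```ocaml
(* Pure: the result depends only on the arguments, with no side effects. *)
let add_tax price tax = price +. (price *. tax)

(* Impure: the result depends on what the user types at the terminal. *)
let ask_filter () = int_of_string (read_line ())
```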
Unit tests are essential for ensuring that business logic behaves as expected. In this project, OCaml's OUnit2 testing library was used to cover all pure functions. These functions are primarily found in the Parsers and Transform modules. A few test cases were initially written for selected functions, and Claude was used to generalize them to cover all functions.
To run said tests, one can run:

```
dune clean
dune build
dune runtest
```

The written tests can be read in the test/ folder.
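For illustration, a minimal OUnit2 test in the style used by the project; the test case and the CSV sample here are hypothetical, and the actual suites in test/ are more complete:

```ocaml
open OUnit2

(* Assumes the parse_items function sketched earlier in this report. *)
let test_parse_items _ =
  let items = parse_items "order_id,quantity,price,tax\n1,2,10.0,0.2\n" in
  assert_equal 1 (List.length items)

let suite = "etl tests" >::: [ "parse_items" >:: test_parse_items ]
let () = run_test_tt_main suite
```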
- Claude was leveraged in the development of unit tests for the pure functions.
- Grok was used in the writing of the docstrings for a good portion of the functions developed for the project.
- Perplexity was used to search GitHub for code snippets for some sparsely documented OCaml libraries such as Sqlite3 and Cohttp-lwt.
- No autocomplete tools such as GitHub Copilot or Cursor were used in this project.
1. The project must be implemented in OCaml.
2. To compute the output, it is necessary to use `map`, `reduce`, and `filter`.
3. The code must include functions for reading and writing CSV files. This will result in impure functions.
4. Separate impure functions from pure functions in the project files.
5. The input must be loaded into a list structure of `Record`.
6. The use of `Helper Functions` is mandatory for loading fields into a `Record`.
7. A project report must be written, explaining how each step was implemented. This serves as a guide for someone who might want to replicate the project in the future. You must declare whether or not Generative AI was used in this report.
1. Read the input data from a static file available on the internet (exposed via HTTP).
2. Save the output data in an SQLite database.
3. It is possible to process input tables separately, but it is preferable to process the data together using an `inner join` operation. In other words, join the tables before starting the `Transform` step.
4. Organize the ETL project using `dune`.
5. Document all generated functions using the `docstring` format.
6. Produce an additional output that contains the average revenue and taxes paid, grouped by month and year.
7. Generate complete test files for the pure functions.