[DEV-14238] - Transaction Loader Refactor #4577
zachflanders-frb wants to merge 22 commits into `qat` from `ftr/dev-14238-transaction-loader-refactor`
Conversation
…n / award keys for relations instead of lookup keys
…mmands to remove unused id columns
…d update awards loader
I left this step largely unchanged, except that I used the unique_award_key/generated_unique_award_id and unique_transaction_id to form the relationships between the awards and normalized transactions.
Co-authored-by: Andrew Guest <110476931+aguest-kc@users.noreply.github.com>
| table_exists = self.spark._jsparkSession.catalog().tableExists(f"int.awards") |
| if not table_exists: |
We could avoid creating a single-use variable here.
| table_exists = self.spark._jsparkSession.catalog().tableExists(f"int.awards") |
| if not table_exists: |
| if not self.spark._jsparkSession.catalog().tableExists(f"int.awards"): |
…ithub.com/fedspendingtransparency/usaspending-api into ftr/dev-14238-transaction-loader-refactor
…n_id back to long type
| super().load_transactions() |
| self.populate_award_ids() |
| self.populate_transaction_normalized_ids() |
| self.link_transactions_to_normalized() |
This involves four sequential executions of merge statements. I need to explore whether I can reduce this to two executions and whether that would be more performant.
…ithub.com/fedspendingtransparency/usaspending-api into ftr/dev-14238-transaction-loader-refactor
sethstoudenmier left a comment
An initial pass on the changes. Overall, I didn't see anything that I would block over. Will take another pass before approving once testing is done.
| @@ -62,7 +62,7 @@ jobs: |
|   with: |
|     cov-report-name: 'spark-load-transactions-fabs-fpds-tests' |
| cov-report-name: 'spark-load-transactions-fabs-fpds-tests' |
| cov-report-name: 'spark-load-transactions-tests' |
I believe we can actually remove the generation of these reports; however, to be consistent we should change the name for now.
Can we update the filename accordingly as well?
| ) |
|   |
| def handle(self, *args, **options): |
|     with self.prepare_spark(): |
Should this use the defined function in usaspending_api/etl/transaction_delta_loaders/context_managers.py instead of the separately defined method below?
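For reference, the shape of such a shared context manager is small. This is a generic sketch only, with a fake session class standing in for Spark; the real helper in usaspending_api/etl/transaction_delta_loaders/context_managers.py is not reproduced here and presumably also handles loader-specific bookkeeping:

```python
from contextlib import contextmanager


@contextmanager
def prepare_spark(session_factory):
    """Yield a session from session_factory, guaranteeing teardown on exit.

    Sketch under stated assumptions: the production helper likely also manages
    configuration and last-load-date bookkeeping, omitted here.
    """
    session = session_factory()
    try:
        yield session
    finally:
        session.stop()  # runs even if the load body raises


class FakeSession:
    """Stand-in for a Spark session, used only to demonstrate the lifecycle."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True
```

A command's `handle()` would then wrap its work in `with prepare_spark(factory) as spark: ...`, and teardown is guaranteed even when the load fails partway.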
| update_last_load_date("awards", next_last_load) |
|   |
| @contextmanager |
| def prepare_spark(self): |
Similar to my comment above; could this be removed and instead use the function in usaspending_api/etl/transaction_delta_loaders/context_managers.py?
| subquery = """ |
|     SELECT awards.generated_unique_award_id AS id_to_remove |
|     FROM int.awards |
|     LEFT JOIN int.transaction_normalized on awards.transaction_unique_id = transaction_normalized.transaction_unique_id |
This probably isn't too bad performance-wise, but it's possible that an EXISTS instead of a JOIN could help somewhat, since we don't care about returning any values from int.transaction_normalized in this query.
I won't add this comment anywhere else, but the same could hold true elsewhere. Being inside a subquery may negate the benefit of using EXISTS over a JOIN, so if performance is looking good, it's probably not worth exploring.
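Assuming the omitted tail of the subquery filters for unmatched rows (the usual `LEFT JOIN ... IS NULL` anti-join), the suggested rewrite would look roughly like this:

```sql
-- Sketch of the EXISTS form; assumes the original subquery keeps only awards
-- with no matching row in int.transaction_normalized.
SELECT awards.generated_unique_award_id AS id_to_remove
FROM int.awards
WHERE NOT EXISTS (
    SELECT 1
    FROM int.transaction_normalized
    WHERE transaction_normalized.transaction_unique_id = awards.transaction_unique_id
)
```

Many engines plan both forms as the same anti-join, which is consistent with the caveat that the benefit may not materialize; comparing the two query plans is the cheap way to check.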
|   |
| def delete_records_sql(self): |
|     id_col = "generated_unique_award_id" |
|     # TODO could do an outer join here to find awards that do not join to transaction fpds or transaction fabs |
Is this comment saying that you could move away from Transaction Normalized and use FPDS and FABS directly? If so, do you mind clarifying that in the comment?
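If the TODO does mean joining to the FPDS/FABS tables directly, the query might look like the following sketch; the table and column names are assumptions based on the keys mentioned elsewhere in this PR, not verified schema:

```sql
-- Hypothetical: find awards whose transaction exists in neither source table.
-- Table/column names (transaction_fpds/fabs, detached_award_proc_unique,
-- afa_generated_unique) are assumed for illustration.
SELECT awards.generated_unique_award_id
FROM int.awards
LEFT JOIN int.transaction_fpds AS fpds
    ON awards.transaction_unique_id = fpds.detached_award_proc_unique
LEFT JOIN int.transaction_fabs AS fabs
    ON awards.transaction_unique_id = fabs.afa_generated_unique
WHERE fpds.detached_award_proc_unique IS NULL
  AND fabs.afa_generated_unique IS NULL
```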
| /* NOTE: In Postgres, the default sorting order sorts NULLs as larger than all other values. |
|    However, in Spark, the default sorting order sorts NULLs as smaller than all other |
|    values. In the Postgres transaction loader the default sorting behavior was used, so to |
|    be consistent with the behavior of the previous loader, we need to reverse the default |
|    Spark NULL sorting behavior for any field that can be NULL. */ |
| /* NOTE: In Postgres, the default sorting order sorts NULLs as larger than all other values. |
|    However, in Spark, the default sorting order sorts NULLs as smaller than all other |
|    values. In the Postgres transaction loader the default sorting behavior was used, so to |
|    be consistent with the behavior of the previous loader, we need to reverse the default |
|    Spark NULL sorting behavior for any field that can be NULL. */ |
| -- NOTE: In Spark, the default sorting order sorts NULLs as smaller than all other values. |
Now that we are so far removed from the Postgres pipeline, I feel that this comment can be updated to include simply how Spark handles NULL values.
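Concretely, the difference only matters when a sort key can be NULL, and Spark SQL supports explicit `NULLS FIRST`/`NULLS LAST` modifiers that pin the behavior either way (`action_date` below is just an example column, not taken from this query):

```sql
-- Spark sorts NULLs as smaller than all other values (ASC => NULLS FIRST by
-- default); Postgres sorts them as larger (ASC => NULLS LAST by default).
-- Writing the modifier out makes the intent explicit regardless of engine:
ORDER BY action_date ASC NULLS LAST    -- Postgres-like ordering
ORDER BY action_date ASC NULLS FIRST   -- Spark's default, stated explicitly
```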
| SELECT |
|     latest.id, |
|     latest.unique_award_key, |
|     0 AS subaward_count, -- for consistency with Postgres table |
| 0 AS subaward_count, -- for consistency with Postgres table |
| 0 AS subaward_count, -- default value that is updated later |
Similar to my other suggestion: since we are now far removed from the Postgres pipeline, it would be good to make this comment more meaningful while we are here.
Description:
This PR refactors the transaction loader flow to remove the lookup tables.
Technical Details:
This PR makes the following changes to the transaction loader flow:

- Uses the transaction unique keys (`afa_generated_unique`/`detached_award_proc_unique`) for the merge with the transaction fabs/fpds/normalized tables.
- Merges the full tables rather than only records newer than `last_etl_date`. This ensures the entire tables are kept in sync.
- Uses `WHEN NOT MATCHED BY SOURCE` in the merge statement itself instead of having a separate delete function for transactions.
- Uses the `unique_award_key` column to form the relationships between awards and transactions instead of using the award lookup table (eliminates the need for the award lookup table).
- Splits `load_transactions_in_delta` into separate commands.

Requirements for PR Merge:
Explain N/A in above checklist: