Conversation

@lkacenja (Contributor) commented on Jan 27, 2026

It's time to update SLC's documents. The SLC team is going to start using the tool to assist in their efforts in early February.

When I tried to run the crawler, I got 403 responses from the SLC site on all URLs. I ported the adaptive requesting technique I had implemented in the document inference component over to the crawler, which seems to have fixed it.
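For reviewers who haven't seen the document inference component, here is a minimal sketch of the general idea behind adaptive requesting, assuming header-profile rotation plus exponential backoff on 403s. The `HEADER_PROFILES` values and the `fetch_with_adaptation` helper are hypothetical illustrations, not the crawler's actual code:

```python
import time
import requests

# Hypothetical sketch: rotate browser-like header profiles and back off
# exponentially whenever the server answers 403.
HEADER_PROFILES = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"},
]

def fetch_with_adaptation(url: str, max_attempts: int = 4) -> requests.Response:
    response = None
    for attempt in range(max_attempts):
        headers = HEADER_PROFILES[attempt % len(HEADER_PROFILES)]
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 403:
            return response
        # Wait 1, 2, 4, ... seconds before retrying with the next profile.
        time.sleep(2 ** attempt)
    response.raise_for_status()
    return response
```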

This PR includes:

  • New site crawl for SLC
  • Updates to the crawler script

What additional steps are required to test this branch locally?

  • Check out the branch
  • Update gems if necessary (recent dependency updates): `bundle install` and `bundle clean --force`
  • Start the app with a fresh setup: `rails db:drop; rails db:migrate; rails db:setup`
  • Import the October cut: `bin/rake documents:import_documents["1", "db/seeds/site_documents_2025_10_07/salt_lake_city.csv", true]`
  • Import the new cut over it: `bin/rake documents:import_documents["1", "db/seeds/site_documents_2026_01_22/salt_lake_city.csv", true]`

Are there any areas where you would like extra review?

Here is a Hex notebook that explores the changes between the two releases. It may help create transparency around the new data; the first table is the most reliable.
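If you want a rough local check without the notebook, a quick pandas comparison of the two cuts can surface added and removed rows. This is only a sketch: it assumes both CSVs share a stable url column, which is a guess about the schema, not something confirmed in this PR:

```python
import pandas as pd

# Compare the October and January Salt Lake City cuts. Assumes a "url"
# column uniquely identifies a document; adjust if the schema differs.
old = pd.read_csv("db/seeds/site_documents_2025_10_07/salt_lake_city.csv")
new = pd.read_csv("db/seeds/site_documents_2026_01_22/salt_lake_city.csv")

added = new[~new["url"].isin(old["url"])]
removed = old[~old["url"].isin(new["url"])]
print(f"{len(added)} added, {len(removed)} removed")
```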

Are there any rake tasks to run on production?

Yes, the new documents will need to be imported.

@lkacenja self-assigned this on Jan 27, 2026

```diff
 def get_crawl_status_display
-  if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.week.ago)
+  if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.month.ago)
```
@lkacenja (Author) commented:

This anticipated much more frequent crawling. By the time the SLC folks log in, the "new" tag won't be visible anymore.



```diff
-def get_file(url: str, output_path: str, wait_to_retry: int = 1000) -> str:
+def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
```
@lkacenja commented on Jan 27, 2026:

This was absurdly high. I think I thought time.sleep(...) took milliseconds.
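For context, `time.sleep()` takes seconds, so the old default paused for over sixteen minutes between attempts. Here is a sketch of how the parameter presumably feeds the retry loop; the body is illustrative, not the crawler's actual implementation:

```python
import time
import requests

# Illustrative only: the real get_file body differs. The point is that
# wait_to_retry is passed to time.sleep(), which counts seconds, so the
# old default of 1000 meant a ~16.7 minute pause per retry.
def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
    for attempt in range(3):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(output_path, "wb") as f:
                f.write(response.content)
            return output_path
        except requests.RequestException:
            time.sleep(wait_to_retry)  # seconds, not milliseconds
    raise RuntimeError(f"failed to fetch {url}")
```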

@lkacenja changed the base branch from main to dev on Jan 27, 2026 at 16:18
@allisonmorgan (Contributor) left a comment:

This looks great to me. I confirmed I could import the new data and run the app locally. Thanks for tackling these complex crawling issues! 🚀

@lkacenja merged commit a224ce8 into dev on Jan 27, 2026
2 checks passed