Conversation

@lkacenja (Contributor) commented on Jan 27, 2026

It's time to update SLC's documents. The SLC team is going to start using the tool to assist in their efforts in early February.

When I tried to run the crawler, I got 403 responses from the SLC site on all URLs. I ported the adaptive requesting technique I had implemented in the document inference component over to the crawler, which seems to have fixed it.
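For reviewers who haven't seen the document inference component, here is a minimal sketch of the general idea behind adaptive requesting, assuming header-profile rotation plus exponential backoff on 403s. The `HEADER_PROFILES` values and the `fetch_with_adaptation` helper are hypothetical illustrations, not the crawler's actual code:

```python
import time
import requests

# Hypothetical sketch: rotate browser-like header profiles and back off
# exponentially whenever the server answers 403.
HEADER_PROFILES = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"},
]

def fetch_with_adaptation(url: str, max_attempts: int = 4) -> requests.Response:
    response = None
    for attempt in range(max_attempts):
        headers = HEADER_PROFILES[attempt % len(HEADER_PROFILES)]
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 403:
            return response
        # Wait 1, 2, 4, ... seconds before retrying with the next profile.
        time.sleep(2 ** attempt)
    response.raise_for_status()
    return response
```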

This PR includes:

  • New site crawl for SLC
  • Updates to the crawler script

What additional steps are required to test this branch locally?

  • Check out the branch
  • Update gems if necessary (recent dependency updates): `bundle install` and `bundle clean --force`
  • Start the app with a fresh setup: `rails db:drop; rails db:migrate; rails db:setup`
  • Import the October cut: `bin/rake documents:import_documents["1", "db/seeds/site_documents_2025_10_07/salt_lake_city.csv", true]`
  • Import the new cut over it: `bin/rake documents:import_documents["1", "db/seeds/site_documents_2026_01_22/salt_lake_city.csv", true]`

Are there any areas where you would like extra review?

Here is a Hex notebook that explores the changes between the two releases. It may help create transparency around the new data; the first table is the most reliable.
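If you want a rough local check without the notebook, a quick pandas comparison of the two cuts can surface added and removed rows. This is only a sketch: it assumes both CSVs share a stable url column, which is a guess about the schema, not something confirmed in this PR:

```python
import pandas as pd

# Compare the October and January Salt Lake City cuts. Assumes a "url"
# column uniquely identifies a document; adjust if the schema differs.
old = pd.read_csv("db/seeds/site_documents_2025_10_07/salt_lake_city.csv")
new = pd.read_csv("db/seeds/site_documents_2026_01_22/salt_lake_city.csv")

added = new[~new["url"].isin(old["url"])]
removed = old[~old["url"].isin(new["url"])]
print(f"{len(added)} added, {len(removed)} removed")
```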

Are there any rake tasks to run on production?

Yes, the new documents will need to be imported.

@lkacenja self-assigned this on Jan 27, 2026

```diff
 def get_crawl_status_display
-  if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.week.ago)
+  if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.month.ago)
```
@lkacenja (Author) commented:

This anticipated much more frequent crawling. By the time the SLC folks log in, the "new" tag won't be visible anymore.



```diff
-def get_file(url: str, output_path: str, wait_to_retry: int = 1000) -> str:
+def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
```
@lkacenja commented on Jan 27, 2026:

This was absurdly high. I think I thought time.sleep(...) took milliseconds.
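For context, `time.sleep()` takes seconds, so the old default paused for over sixteen minutes between attempts. Here is a sketch of how the parameter presumably feeds the retry loop; the body is illustrative, not the crawler's actual implementation:

```python
import time
import requests

# Illustrative only: the real get_file body differs. The point is that
# wait_to_retry is passed to time.sleep(), which counts seconds, so the
# old default of 1000 meant a ~16.7 minute pause per retry.
def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
    for attempt in range(3):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(output_path, "wb") as f:
                f.write(response.content)
            return output_path
        except requests.RequestException:
            time.sleep(wait_to_retry)  # seconds, not milliseconds
    raise RuntimeError(f"failed to fetch {url}")
```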

@lkacenja changed the base branch from main to dev on Jan 27, 2026 at 16:18
@allisonmorgan (Contributor) left a comment:

This looks great to me. I confirmed I could import the new data and run the app locally. Thanks for tackling these complex crawling issues! 🚀

@lkacenja merged commit a224ce8 into dev on Jan 27, 2026
2 checks passed