Add 2026-01 SLC crawl #399
```diff
   def get_crawl_status_display
-    if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.week.ago)
+    if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.month.ago)
```
This anticipated much more frequent crawling. By the time the SLC folks log in, the "new" tag won't be visible anymore.
```diff
-def get_file(url: str, output_path: str, wait_to_retry: int = 1000) -> str:
+def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
```
This was absurdly high. I think I thought `time.sleep(...)` took milliseconds.
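For context, `time.sleep` takes seconds, so the old default of 1000 would have paused roughly 17 minutes between retries. A minimal sketch of what the corrected signature implies; only the signature comes from the diff, and the retry count, `requests` usage, and error handling here are assumptions rather than the PR's actual body:

```python
import time

import requests


def get_file(url: str, output_path: str, wait_to_retry: int = 2) -> str:
    """Download url to output_path, sleeping wait_to_retry SECONDS between retries.

    Hypothetical sketch: attempt count and requests usage are assumptions.
    """
    for _attempt in range(3):
        response = requests.get(url, timeout=30)
        if response.ok:
            with open(output_path, "wb") as f:
                f.write(response.content)
            return output_path
        # time.sleep takes seconds, not milliseconds -- the old default of
        # 1000 meant a ~17-minute pause per retry.
        time.sleep(wait_to_retry)
    response.raise_for_status()
    return output_path
```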
allisonmorgan left a comment:
This looks great to me. I confirmed I could import the new data and run the app locally. Thanks for tackling these complex crawling issues! 🚀
It's time to update SLC's documents. They are going to use the tool to assist in their efforts starting in early February.
When I tried to run the crawler, I got 403 responses from the SLC site on all URLs. I added the adaptive requesting technique that I implemented in the document inference component to the crawler. That seems to have fixed it.
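Roughly, the idea is to retry a rejected request in a way the server will accept. A minimal sketch, assuming a plain `requests` call falls back to browser-like headers on a 403; the `adaptive_get` name, header values, and fallback order are illustrative, not the component's actual code:

```python
import requests

# Header set that mimics a regular browser; the values are illustrative.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def adaptive_get(url: str) -> requests.Response:
    """Hypothetical sketch of adaptive requesting: try a plain request first,
    and on a 403 retry once with browser-like headers."""
    response = requests.get(url, timeout=30)
    if response.status_code == 403:
        response = requests.get(url, headers=BROWSER_HEADERS, timeout=30)
    response.raise_for_status()
    return response
```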
This PR includes:

- A fresh crawl of Salt Lake City's documents (`db/seeds/site_documents_2026_01_22/salt_lake_city.csv`)
- The adaptive requesting technique from the document inference component, now used by the crawler as well
- A wider display window for the "new" document tag (one month instead of one week)
- A corrected `wait_to_retry` default in `get_file` (2 seconds instead of 1000)
What additional steps are required to test this branch locally?
- `bundler install` and `bundler clean --force`
- `rails db:drop ; rails db:migrate; rails db:setup`
- `bin/rake documents:import_documents["1", "db/seeds/site_documents_2025_10_07/salt_lake_city.csv", true]`
- `bin/rake documents:import_documents["1", "db/seeds/site_documents_2026_01_22/salt_lake_city.csv", true]`

Are there any areas you would like extra review?
Here is a Hex notebook that explores the changes between releases a bit. It may be helpful in creating transparency around the new data. The first table is the most reliable.
Are there any rake tasks to run on production?
Yes, the new documents will need to be imported.