
Commit 8ab7241

Asap 225 add comparison to crawler (#318)
* Add a comparison option to the crawler and create some test coverage.
* Add script to make crawling and classification updates easier.
* Clean up and fix script for non-archives.
* Add fields and UI for crawl data.
* Clean up and add previous json option to crawl script.
* Further refine comparison logic and allow resuming scrape with json file.
* Add more robust date handling.
* Make sure last_crawl_date is available before formatting.
* Add crawl date to all rows.
* Make column order irrelevant in test.
* Make document_status optional, but set a default value.
* Get rid of bloated date helper.
* Fix up linting.
* Add lightweight test for import rake.
* General cleanup.
* Temporarily add 2025-09 crawl.
* Fix up linting.
* Fix more linting issues.
* Fix bad logic in default value setter.
* Update the tests.
* Fix tests with old status values.
* Add test for document status tags.
* Add latest crawl and remove intermediate version.
* Add a couple of small tweaks to docs.
* Add a staging credentials file.
* Add updated .gitignore.
1 parent c4845ca commit 8ab7241
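The crawler itself lives in python_components/crawler and its comparison code is not part of this diff, but the document statuses introduced below (New, Active, Removed) suggest the general idea: the current crawl's URL set is compared against a previous crawl's. A rough, hypothetical Ruby illustration of that kind of classification (not the crawler's actual implementation):

# Hypothetical illustration only; not the crawler's actual implementation.
def classify_crawl(previous_urls, current_urls)
  {
    "New"     => current_urls - previous_urls,   # present now, absent from the previous crawl
    "Active"  => current_urls & previous_urls,   # present in both crawls
    "Removed" => previous_urls - current_urls    # present before, gone from the current crawl
  }
end

classify_crawl(["a.pdf", "b.pdf"], ["b.pdf", "c.pdf"])
# => {"New"=>["c.pdf"], "Active"=>["b.pdf"], "Removed"=>["a.pdf"]}

In this commit the result of that comparison reaches Rails as a crawl_status column in the exported CSV, which bin/crawl and Site#process_csv_documents pass through to Document#document_status.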


25 files changed (+741, -252 lines)


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -157,3 +157,5 @@ python_components/*/*/.pytest_cache
 db/schema.rb
 
 cline_docs
+
+/config/credentials/staging.key

app/models/document.rb

Lines changed: 15 additions & 2 deletions
@@ -32,6 +32,11 @@ class Document < ApplicationRecord
 
   COMPLEXITIES = [SIMPLE_STATUS, COMPLEX_STATUS].freeze
 
+  DOCUMENT_STATUS_NEW = "New".freeze
+  DOCUMENT_STATUS_ACTIVE = "Active".freeze
+  DOCUMENT_STATUS_REMOVED = "Removed".freeze
+  DOCUMENT_STATUSES = [DOCUMENT_STATUS_NEW, DOCUMENT_STATUS_ACTIVE, DOCUMENT_STATUS_REMOVED].freeze
+
   belongs_to :site
 
   has_many :document_inferences
@@ -42,7 +47,7 @@ class Document < ApplicationRecord
 
   validates :file_name, presence: true
   validates :url, presence: true, format: {with: URI::DEFAULT_PARSER.make_regexp}
-  validates :document_status, presence: true, inclusion: {in: %w[discovered downloaded]}
+  validates :document_status, inclusion: {in: DOCUMENT_STATUSES, allow_blank: true, allow_nil: true}
   validates :document_category, inclusion: {in: CONTENT_TYPES}
   validates :accessibility_recommendation, inclusion: {in: -> { get_decision_types }}, presence: true
   validates :complexity, inclusion: {in: COMPLEXITIES}, allow_nil: true
@@ -284,6 +289,14 @@ def primary_source
     urls.is_a?(Array) ? urls.first : urls
   end
 
+  def get_crawl_status_display
+    if document_status == DOCUMENT_STATUS_NEW && last_crawl_date.present? && last_crawl_date.after?(1.week.ago)
+      DOCUMENT_STATUS_NEW
+    elsif document_status == DOCUMENT_STATUS_REMOVED
+      DOCUMENT_STATUS_REMOVED
+    end
+  end
+
   private
 
   def recursive_decode(url)
@@ -295,8 +308,8 @@ def recursive_decode(url)
   end
 
   def set_defaults
-    self.document_status = "discovered" unless document_status
     self.accessibility_recommendation = DEFAULT_DECISION unless accessibility_recommendation
+    self.document_status = DOCUMENT_STATUS_ACTIVE unless document_status.present?
   end
 
   def set_complexity
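
The `get_crawl_status_display` helper is what the index view uses to decide whether to render a badge: only documents that became New within the last week, or that are currently Removed, get one. A minimal sketch of the expected behavior, using hypothetical in-memory documents:

# Hypothetical documents built in memory to illustrate the badge rules above.
recent_new = Document.new(document_status: Document::DOCUMENT_STATUS_NEW, last_crawl_date: 2.days.ago)
stale_new  = Document.new(document_status: Document::DOCUMENT_STATUS_NEW, last_crawl_date: 3.weeks.ago)
removed    = Document.new(document_status: Document::DOCUMENT_STATUS_REMOVED)
active     = Document.new(document_status: Document::DOCUMENT_STATUS_ACTIVE)

recent_new.get_crawl_status_display  # => "New"     (crawled within the past week)
stale_new.get_crawl_status_display   # => nil       (the "New" badge expires after a week)
removed.get_crawl_status_display     # => "Removed" (shown regardless of crawl date)
active.get_crawl_status_display      # => nil       (no badge for ordinary active documents)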

app/models/site.rb

Lines changed: 36 additions & 5 deletions
@@ -170,7 +170,6 @@ def discover_documents!(document_data, collect = false)
       url = data[:url]
       modification_date = data[:modification_date]
 
-      # Find existing document - one query per document but minimal memory usage
       existing_document = documents.find_by(url: url)
 
       ActiveRecord::Base.transaction do
@@ -235,6 +234,10 @@ def process_csv_documents(csv_path)
         urls.empty? ? nil : urls
       end
 
+      if row["crawl_date"].present? && row["crawl_date"].is_a?(String)
+        row["crawl_date"] = Time.parse(row["crawl_date"]).to_i
+      end
+
       documents << {
         url: row["url"],
         file_name: row["file_name"],
@@ -251,7 +254,9 @@ def process_csv_documents(csv_path)
         predicted_category_confidence: row["predicted_category_confidence"],
         number_of_pages: row["number_of_pages"]&.to_i,
         number_of_tables: row["number_of_tables"]&.to_i,
-        number_of_images: row["number_of_images"]&.to_i
+        number_of_images: row["number_of_images"]&.to_i,
+        crawl_status: row["crawl_status"].present? ? row["crawl_status"].capitalize : "",
+        crawl_date: row["crawl_date"]
       }
     rescue URI::InvalidURIError => e
       puts "Skipping invalid URL: #{row["url"]}"
@@ -353,12 +358,12 @@ def attributes_from(data)
      document_category: data[:predicted_category] || data[:document_category],
      document_category_confidence: data[:predicted_category_confidence] || data[:document_category_confidence],
      url: data[:url],
-     modification_date: data[:modification_date],
+     modification_date: clean_date(data[:modification_date]),
      file_size: data[:file_size],
      author: clean_string(data[:author]),
      subject: clean_string(data[:subject]),
      keywords: clean_string(data[:keywords]),
-     creation_date: data[:creation_date],
+     creation_date: clean_date(data[:creation_date]),
      producer: clean_string(data[:producer]),
      pdf_version: clean_string(data[:pdf_version]),
      source: if data[:source].nil?
@@ -369,7 +374,8 @@ def attributes_from(data)
      number_of_pages: data[:number_of_pages],
      number_of_tables: data[:number_of_tables],
      number_of_images: data[:number_of_images],
-     document_status: "discovered"
+     document_status: data[:crawl_status],
+     last_crawl_date: clean_date(data[:crawl_date])
    }
  end
 
@@ -378,6 +384,31 @@ def clean_string(str)
    str.to_s.encode("UTF-8", invalid: :replace, undef: :replace, replace: "").strip
  end
 
+  def clean_date(date)
+    if date.nil?
+      return nil
+    end
+    if date.is_a?(String)
+      return nil if date.empty?
+      Time.parse(date)
+    end
+    if date.is_a?(Integer)
+      case date
+      when 0..9_999_999_999
+        return Time.at(date)
+      when 10_000_000_000..9_999_999_999_999
+        return Time.at(date / 1000)
+      when 10_000_000_000_000..9_999_999_999_999_999
+        return Time.at(date / 1_000_000)
+      when 10_000_000_000_000_000..Float::INFINITY
+        return Time.at(date / 1_000_000_000)
+      else
+        return nil
+      end
+    end
+    date
+  end
+
  def ensure_safe_url
    return if primary_url.blank?
 
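`clean_date` exists because crawl exports and PDF metadata can carry timestamps as strings or as integer epochs of varying precision; the case statement buckets integers by magnitude. A minimal sketch of the integer normalization, using hypothetical values for the same instant (2021-01-01 00:00:00 UTC) and assuming `clean_date` is reachable as the private helper shown above:

# Hypothetical epoch values for 2021-01-01 00:00:00 UTC at different precisions.
seconds      = 1_609_459_200                 # up to 10 digits -> Time.at(value)
milliseconds = 1_609_459_200_000             # 11-13 digits    -> Time.at(value / 1000)
microseconds = 1_609_459_200_000_000         # 14-16 digits    -> Time.at(value / 1_000_000)
nanoseconds  = 1_609_459_200_000_000_000     # 17+ digits      -> Time.at(value / 1_000_000_000)

[seconds, milliseconds, microseconds, nanoseconds].map { |v| site.send(:clean_date, v) }
# => four equal Time objects, assuming `site` is an existing Site instance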

app/views/documents/index.html.erb

Lines changed: 12 additions & 0 deletions
@@ -96,6 +96,18 @@
             <i class="fas fa-file-pdf text-sm"></i> <%= document.file_name&.truncate(80) %>
             <span class="sr-only">Show Document Modal</span>
           </button>
+          <% crawl_status = document.get_crawl_status_display %>
+          <% if crawl_status.present? %>
+            <% if crawl_status == Document::DOCUMENT_STATUS_NEW %>
+              <div class="tooltip tooltip-top tooltip-primary" data-tip="Recently uploaded PDF - <%= document.last_crawl_date.present? ? document.last_crawl_date.strftime("%Y-%m-%d") : "Date Unknown" %>">
+                <div class="badge badge-sm badge-primary"><%= crawl_status %></div>
+              </div>
+            <% else %>
+              <div class="tooltip tooltip-top tooltip-neutral" data-tip="PDF removed - <%= document.last_crawl_date.present? ? document.last_crawl_date.strftime("%Y-%m-%d") : "Date Unknown" %>">
+                <div class="badge badge-sm badge-neutral"><%= crawl_status %></div>
+              </div>
+            <% end %>
+          <% end %>
         </div>
         <% source = document_source(document.primary_source) %>
         <div class="text-gray-400">

bin/crawl

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+#!/bin/bash
+
+# Script to crawl a website and extract PDF information
+# Usage: bin/crawl <site_url> <output_dir> [previous_crawl_directory] [previous_link_json]
+# Site site_url must exist in python_components/crawler/config.json.
+# previous_crawl_directory is the path to a previous version of the crawl to compare against.
+# previous_link_json is a list of links and sources produced by the first phase of the crawler.
+# Useful if metadata fetching failed or to resume an otherwise halted process.
+
+# Check if site_url parameter is provided
+if [ $# -eq 0 ]; then
+  echo "Error: site_url parameter is required"
+  echo "Usage: bin/crawl <site_url> <output_dir> [previous_crawl_directory] [previous_link_json]"
+  echo "Example: bin/crawl https://georgia.gov /db/seeds/site_documents_2025_09"
+  echo "Example: bin/crawl https://georgia.gov /db/seeds/site_documents_2025_09 /path/to/previous_crawl_directory/"
+  echo "Example: bin/crawl https://georgia.gov /db/seeds/site_documents_2025_09 /path/to/previous_crawl_directory.zip"
+  echo "Example: bin/crawl https://georgia.gov /db/seeds/site_documents_2025_09 /path/to/previous_crawl_directory.zip /georgia_files.json"
+  exit 1
+fi
+
+SITE_URL="$1"
+OUTPUT_DIR="$2"
+PREVIOUS_CRAWL="$3"
+PREVIOUS_LINK_JSON="$4"
+
+# Find the project root directory by looking for the bin directory
+# This allows the script to work from any subdirectory
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
+
+docker build -t asap_pdf:crawler "$PROJECT_ROOT/python_components/crawler/."
+docker build -t asap_pdf:classifier "$PROJECT_ROOT/python_components/classifier/."
+
+CONFIG_FILE="$PROJECT_ROOT/python_components/crawler/config.json"
+
+if [ ! -f "$CONFIG_FILE" ]; then
+  echo "Error: config.json not found at $CONFIG_FILE"
+  exit 1
+fi
+
+OUTPUT_FILE=$(jq -r --arg site_url "$SITE_URL" '
+  if has($site_url) then
+    .[$site_url].output_file
+  else
+    "output.csv"
+  end
+' "$CONFIG_FILE" 2>/dev/null || echo "output.csv")
+
+if [ -n "$PREVIOUS_CRAWL" ]; then
+  if [ ! -e "$PREVIOUS_CRAWL" ]; then
+    echo "Error: Previous crawl path '$PREVIOUS_CRAWL' does not exist"
+    exit 1
+  fi
+  CRAWLER_TMP_DIR="$PROJECT_ROOT/crawler_tmp"
+  PREVIOUS_CRAWL_DIR="$CRAWLER_TMP_DIR/previous_crawl"
+  mkdir -p $PREVIOUS_CRAWL_DIR
+  if [ -f "$PREVIOUS_CRAWL" ] && [[ "$PREVIOUS_CRAWL" == *.zip ]]; then
+    if command -v unzip >/dev/null 2>&1; then
+      unzip -j -q "$PREVIOUS_CRAWL" -d "$PREVIOUS_CRAWL_DIR"
+      if [ $? -eq 0 ]; then
+        echo "Successfully extracted previous crawl archive"
+      else
+        echo "Error: Failed to extract zip archive"
+        rm -rf "$PREVIOUS_CRAWL_DIR"
+        exit 1
+      fi
+    else
+      echo "Error: unzip command not found. Please install unzip to extract archives"
+      rm -rf "$PREVIOUS_CRAWL_DIR"
+      exit 1
+    fi
+  else
+    cp -r "$PREVIOUS_CRAWL"/* "$PREVIOUS_CRAWL_DIR"
+  fi
+  COMPARISON_FLAG="--comparison_crawl=/data/previous_crawl/$OUTPUT_FILE"
+fi
+
+PREVIOUS_JSON_LINK_FLAG=""
+if [ -n "$PREVIOUS_LINK_JSON" ]; then
+  if [ ! -f "$PREVIOUS_LINK_JSON" ]; then
+    echo "Previous JSON crawl file was specified, but does not exist."
+    exit 1
+  fi
+  mkdir -p "$CRAWLER_TMP_DIR/previous_json_links"
+  mv $PREVIOUS_LINK_JSON "$CRAWLER_TMP_DIR/previous_json_links"
+  PREVIOUS_JSON_LINK_FLAG="--crawled_links_json=/data/previous_json_links/$(basename $PREVIOUS_LINK_JSON)"
+fi
+
+TMP_OUTPUT="$CRAWLER_TMP_DIR/output"
+mkdir -p $TMP_OUTPUT
+mkdir -p $OUTPUT_DIR
+echo "$OUTPUT_DIR"
+
+set -x
+
+crawler_command=(
+  docker run --rm
+  -v "$PROJECT_ROOT/python_components/crawler:/workspace"
+  -v "$CRAWLER_TMP_DIR:/data"
+  asap_pdf:crawler
+  python /workspace/crawler.py "$SITE_URL" "/data/output/$OUTPUT_FILE"
+)
+
+if [ -n "$COMPARISON_FLAG" ]; then
+  crawler_command+=("$COMPARISON_FLAG")
+fi
+
+if [ -n "$PREVIOUS_JSON_LINK_FLAG" ]; then
+  crawler_command+=("$PREVIOUS_JSON_LINK_FLAG")
+fi
+
+"${crawler_command[@]}"
+
+set +x
+
+mv "$TMP_OUTPUT/$OUTPUT_FILE" "$TMP_OUTPUT/$OUTPUT_FILE-crawled"
+
+set -x
+
+docker run --rm \
+  -v "$PROJECT_ROOT/python_components/classifier:/workspace" \
+  -v "$TMP_OUTPUT:/output" \
+  asap_pdf:classifier \
+  python /workspace/classifier.py "/output/$OUTPUT_FILE-crawled" "/output/$OUTPUT_FILE"
+
+set +x
+
+mv "$TMP_OUTPUT/$OUTPUT_FILE" "$OUTPUT_DIR"
+
+rm -rf "$CRAWLER_TMP_DIR"

config/credentials/staging.yml.enc

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+7JOknLbj0E2yC0DwuXhJ2xQSnyyfeOmvwoUw4lRgNFBmPzMvfIXGGqF5TKB9wl6bK8FhA5meY8yTFCTIPFqvKq1WMnoRmcJqDa1wrhujvqrS9R4wExtYlAP7Q/bJyQVWK9D/VTo9gsUW9NeoeKQblqSEpE1PHw7X8w9gHztEygGxPAfYZJFfoGHQChzfMl9ac5sdCTEPOh/JuVJjBY4/dEJv75BEajlY6kv0zPn68frNnS8v/aJ6JMdp6UAbXjuzRYRD3E5rGWqQGxtCR1DQKL0RTFf3PWNemZainDfi7phDwZF2KaSiE2IlWks76fPYwNclxcW55o7FNm8MvpaWu3M4tIVJ4zwSNuq62j44gyUT4XYvUXyxkDf+ESr8Kw2bQrfN4CZMmaEWAw0cxHbj2Je0RTVhhTeu3aUX5qn1+0HBQdJ/gyWiskRFOFybfKb9QUgOSPWobz4hDfZnWqS8BHP0J0sKx2gzUY2svgjbPiK9WpdsWhtK1hIo--mlbA8M5tmXoKuCSX--xUIvSR9NnKJp1ht2UILLUw==

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+class RemoveStatusColumn < ActiveRecord::Migration[8.0]
+  def up
+    remove_column :documents, :status
+  end
+
+  def down
+    add_column :documents, :status, :string
+  end
+end

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+class AddLastCrawlDate < ActiveRecord::Migration[8.0]
+  def up
+    add_column :documents, :last_crawl_date, :datetime
+    change_column_default :documents, :document_status, from: nil, to: Document::DOCUMENT_STATUS_ACTIVE
+  end
+
+  def down
+    remove_column :documents, :last_crawl_date
+    change_column_default :documents, :document_status, from: Document::DOCUMENT_STATUS_ACTIVE, to: nil
+  end
+end
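
Combined with the model's `set_defaults` callback, these migrations mean a newly created document falls back to Active and only gets a `last_crawl_date` when a crawl import supplies one. A minimal sketch of the expected behavior, assuming `set_defaults` runs as a validation/save callback as the model diff above implies (attribute values are hypothetical):

# Hypothetical document; only the fields touched by these migrations are shown.
doc = Document.new(file_name: "budget_2025.pdf",
                   url: "https://example.georgia.gov/budget_2025.pdf")
doc.valid?            # assumed to trigger set_defaults
doc.document_status   # => "Active" (Document::DOCUMENT_STATUS_ACTIVE, also the new column default)
doc.last_crawl_date   # => nil until an imported row provides a crawl_date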
Binary file not shown (1.45 MB).

docs/windows_localdev.md

Lines changed: 4 additions & 2 deletions
@@ -25,7 +25,7 @@ We've tested two approaches for Windows development:
 
 **Prerequisites:**
 - Clone the repository to your WSL home directory (this avoids file permission issues)
-- Install Yarn: `npm install yarn -y`
+- Install Yarn: `npm install yarn -g`
 
 **Configuration Steps:**
 
@@ -36,7 +36,9 @@ We've tested two approaches for Windows development:
    ```
 
 2. **Update database configuration:**
-   - Make sure Postgres credentials in `config/database.yml` match your Postgres setup
+   - If your Postgres service was set up to allow "trust" authentication for local connections, you may not need to change anything.
+   - If you created a user with a password during database setup, add a username and password entry to the development section of `config/database.yml`. See the `staging` section for an example.
+
 
 3. **Install dependencies and setup database:**
    ```bash
