
Commit dae3ca0

reduce logging noise (#141)

* reduce logging noise
* update docs on logging related details

1 parent cec5f04 commit dae3ca0

5 files changed (+26, -40 lines)


docs/extraction_architecture.md

Lines changed: 20 additions & 9 deletions
@@ -192,34 +192,44 @@ flowchart TD
 The pipeline implements **fault-tolerant design** common in production data systems:
 
 1. **Provider-level fault isolation** - Individual provider failures don't cascade; pipeline continues processing other providers
-2. **Extraction error boundaries** - Caught by `@extractor_error_handler` decorator; malformed data is logged but doesn't block the pipeline
-3. **Network resilience** - httpx timeout management (30 second timeout) prevents hanging; timeouts are logged as provider failures
-4. **API error handling** - Google Sheets API errors are logged with full context; transient failures can be retried by re-running the pipeline
+2. **Extraction error boundaries** - Caught by `@extractor_error_handler` decorator; malformed data is silently skipped without blocking the pipeline
+3. **Network resilience** - httpx timeout management (30 second timeout) prevents hanging; HTTP errors silently return None for graceful degradation
+4. **API error handling** - Google Sheets API errors are logged with full context at the application level; transient failures can be retried by re-running the pipeline
+5. **Graceful degradation** - Failed articles are silently skipped (exception caught), allowing the pipeline to process successfully extracted articles
 
 **Idempotent Operations**:
 
 - Deduplication check ensures reruns don't insert duplicates
 - Timestamp updates are overwritten (safe for retries)
 - Sheet sorting is deterministic
 
-All errors are written to stdout for operational visibility (captured in GitHub Actions logs or Docker containers).
+**Logging Strategy**:
+
+- High-level events (provider processing, batch writes) logged in `main.py`
+- Low-level errors (HTTP failures, extraction errors) handled silently in utility modules
+- This reduces log noise while maintaining operational visibility at the application level
 
 ## Logging & Observability
 
 Structured logging enables operational visibility:
 
 - **Level**: INFO (production-grade)
 - **Format**: `%(asctime)s - %(name)s - %(levelname)s - %(message)s`
+- **Date Format**: `%Y-%m-%d %H:%M:%S` (without milliseconds)
 - **Output**: stdout (captured by GitHub Actions logs and Docker)
+- **httpx Logging**: Suppressed to CRITICAL level to reduce noise from HTTP requests
+- **Centralized Setup**: All logging configured in `main.py` for consistency
 
 **Key Log Messages** (Observable Events):
 
-- "Processed {provider}: X new articles found" - Success metric
-- "Failed to fetch page for {provider}" - Network issue indicator
-- "Error processing {provider}: {error}" - Provider-specific failures
-- "Unknown provider: {provider}" - Configuration issue
+- "Processing {provider_url} - X new articles found" - Success metric per provider
+- "Failed to fetch page for {provider_name} from {provider_url}" - Network issue indicator
+- "Error processing {provider_name}: {error}" - Provider-specific failures
+- "Unknown provider: {provider_name}" - Configuration issue
+- "Batch write complete: X articles added to the sheet." - Load completion metric
+- "✅ No new articles found" - No-op scenario indicator
 
-These logs enable downstream monitoring, alerting, and audit trails—essential for operational pipelines.
+These logs enable downstream monitoring, alerting, and audit trails—essential for operational pipelines. Utility modules (`get_page.py`, `extractors.py`) delegate logging to `main.py` for a unified view.
 
 ## Performance & Architecture
 
@@ -228,6 +238,7 @@ These logs enable downstream monitoring, alerting, and audit trails—essential
 - **Sequential processing** - Providers processed one at a time; can be parallelized if needed
 - **Generator-based streaming** - Articles flow through pipeline immediately after extraction (no batch buffering)
 - **Memory efficient** - Generators enable incremental processing without storing all articles in memory
+- **Centralized logging** - Single logging source in `main.py` provides unified observability across all pipeline stages
 
 ### Rate Limiting & Respect
 
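The "Centralized Setup" and "delegate logging to `main.py`" points rest on the standard library's logger hierarchy: a module that only calls `logging.getLogger(__name__)` propagates its records to whatever root handler the entry point configured. A minimal, self-contained sketch of that pattern (the module name and message are illustrative, not taken from the repo):

```python
import logging
import sys

# Root configuration, done once at the application entry point
# (the same shape as the basicConfig call in script/main.py below).
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    stream=sys.stdout,
)

# A utility module needs only a named logger; it inherits the root
# handler and format, so no per-module basicConfig call is required.
util_logger = logging.getLogger("utils.example")  # hypothetical module name
util_logger.info("this record propagates to the stdout handler configured above")
```
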
script/main.py

Lines changed: 4 additions & 4 deletions
@@ -29,8 +29,10 @@
 logging.basicConfig(
     level=logging.INFO,
     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
     stream=sys.stdout,
 )
+logging.getLogger("httpx").setLevel(logging.CRITICAL)
 
 
 async def process_provider(fetcher_state, provider, existing_titles):
@@ -64,7 +66,7 @@ async def process_provider(fetcher_state, provider, existing_titles):
         get_articles(elements, handler["extractor"], existing_titles)
     )
     logger.info(
-        f"Processed {provider_name}: {len(articles_found)} new articles found"
+        f"Processing {provider_url} - {len(articles_found)} new articles found"
     )
     return articles_found, fetcher_state
 
@@ -92,11 +94,9 @@ async def async_main(timestamp):
 
     # Batch write all articles at once
     if all_articles:
-        batch_start = time.time()
         batch_append_articles(articles_sheet, all_articles)
-        batch_time = time.time() - batch_start
         logger.info(
-            f"Batch write complete: {len(all_articles)} articles written in {batch_time:.2f}s"
+            f"Batch write complete: {len(all_articles)} articles added to the sheet."
         )
     else:
         logger.info("\n✅ No new articles found\n")

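The `logging.getLogger("httpx").setLevel(logging.CRITICAL)` line works because httpx emits one INFO-level record per request through its own `httpx` logger; raising that logger's threshold hides those lines while the application's own INFO messages still appear. A standalone sketch of the effect (the URL is a placeholder and the snippet needs network access to run):

```python
import asyncio
import logging
import sys

import httpx

logging.basicConfig(level=logging.INFO, stream=sys.stdout)
# Without the next line, every request adds an INFO record such as:
#   HTTP Request: GET https://example.com "HTTP/1.1 200 OK"
logging.getLogger("httpx").setLevel(logging.CRITICAL)

logger = logging.getLogger("main")


async def main():
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get("https://example.com")  # placeholder URL
    logger.info("fetched %s -> HTTP %s", response.url, response.status_code)


asyncio.run(main())
```
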
script/utils/extractors.py

Lines changed: 1 addition & 8 deletions
@@ -1,18 +1,11 @@
 import re
 import logging
-import sys
 import traceback
 from datetime import datetime
 from utils.format_date import clean_and_convert_date
 
 
 logger = logging.getLogger(__name__)
-# Configure logging to write to stdout for log file capture
-logging.basicConfig(
-    level=logging.INFO,
-    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
-    stream=sys.stdout,
-)
 
 
 # Error handling decorator for extractors
@@ -152,7 +145,7 @@ def get_articles(elements, extract_func, existing_titles):
             if normalized_title not in normalized_existing_titles:
                 yield article_info
         except Exception as e:
-            logger.error(f"Skipping an article due to error: {e}")
+            pass
 
 
 def provider_dict(provider_element):

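The `pass` in `get_articles` means one malformed article no longer stops (or log-spams) the generator; the per-provider totals logged in `main.py` remain the only signal. A minimal sketch of that skip-and-continue pattern, using a toy extractor and the `(date, title, link)` tuple shape visible elsewhere in this diff; it is not the repo's function:

```python
def get_articles_sketch(elements, extract_func, existing_titles):
    """Yield extracted articles, silently skipping elements that fail to extract."""
    normalized_existing = {title.strip().lower() for title in existing_titles}
    for element in elements:
        try:
            article_info = extract_func(element)  # may raise on malformed markup
            normalized_title = article_info[1].strip().lower()
            if normalized_title not in normalized_existing:
                yield article_info
        except Exception:
            # Low-level failure: swallow it here; main.py reports only
            # the per-provider article count, keeping the logs quiet.
            continue


# One bad element does not break the stream: the second item below raises
# AttributeError inside the toy extractor and is simply skipped.
articles = list(
    get_articles_sketch(
        ["first story", None],
        lambda el: ("2024-01-01", el.title(), "https://example.com"),  # toy extractor
        existing_titles=[],
    )
)
print(articles)  # [('2024-01-01', 'First Story', 'https://example.com')]
```
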
script/utils/get_page.py

Lines changed: 0 additions & 9 deletions
@@ -1,18 +1,11 @@
 import httpx
 import asyncio
 import logging
-import sys
 import time
 from bs4 import BeautifulSoup
 from .constants import DEFAULT_REQUEST_INTERVAL, DEFAULT_TIMEOUT
 
 logger = logging.getLogger(__name__)
-# Configure logging to write to stdout for log file capture
-logging.basicConfig(
-    level=logging.INFO,
-    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
-    stream=sys.stdout,
-)
 
 
 def init_fetcher_state():
@@ -52,11 +45,9 @@ async def fetch_page(state, url):
             soup = BeautifulSoup(response.text, "html.parser")
             return soup, state
 
-        logger.error(f"HTTP {response.status_code} from {url}")
         return None, state
 
     except Exception as e:
-        logger.error(f"Error fetching {url}: {str(e)}")
         return None, state
 
 
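With both `logger.error` calls removed, `fetch_page`'s contract is simply "soup on success, `None` on any failure", and the caller decides what deserves a log line. A sketch of that contract under simplifying assumptions (simplified signature, not the repo's exact implementation; the URL and provider name are placeholders):

```python
import asyncio
import logging

import httpx
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("main")


async def fetch_page_sketch(url: str, timeout: float = 30.0):
    """Return parsed HTML on success, or None on any HTTP or network failure."""
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.get(url)
        if response.status_code == 200:
            return BeautifulSoup(response.text, "html.parser")
        return None  # non-200 response: degrade silently, no log line here
    except Exception:
        return None  # timeouts, DNS errors, etc. are also silent


async def main():
    soup = await fetch_page_sketch("https://example.com")  # placeholder URL
    if soup is None:
        # The caller (main.py in this repo) owns the observable message.
        logger.info("Failed to fetch page for example-provider from https://example.com")
    else:
        logger.info("Fetched page with %d links", len(soup.find_all("a")))


asyncio.run(main())
```
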
script/utils/sheet.py

Lines changed: 1 addition & 10 deletions
@@ -76,9 +76,7 @@ def get_all_providers(providers_sheet: Worksheet) -> List[Dict[str, Any]]:
     return providers_sheet.get_all_records()
 
 
-def batch_append_articles(
-    sheet: Worksheet, articles: List[tuple], log_func: Callable = print
-) -> None:
+def batch_append_articles(sheet: Worksheet, articles: List[tuple]) -> None:
     """
     Appends multiple article rows to the given sheet in a single batch operation.
 
@@ -90,13 +88,6 @@ def batch_append_articles(
     if not articles:
         return
 
-    # Log all articles
-    for article_info in articles:
-        date = article_info[0]
-        title = article_info[1]
-        link = article_info[2]
-        log_func(f"==> {title} - {date}\n{link}\n")
-
     # Batch append all rows at once
     rows = [list(article) for article in articles]
     sheet.append_rows(rows)

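With the per-article `log_func` loop removed, the load step's only observable event is the single summary line emitted by the caller. A sketch of that caller/utility split, with the gspread `Worksheet` stubbed out so the example runs without credentials (the stub and the sample row are illustrative):

```python
import logging
from typing import List

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("main")


class FakeWorksheet:
    """Stand-in for gspread.Worksheet so the sketch runs offline."""

    def append_rows(self, rows: List[list]) -> None:
        self.last_rows = rows


def batch_append_articles(sheet, articles: List[tuple]) -> None:
    # Quiet utility: no per-article logging, just the batch write.
    if not articles:
        return
    sheet.append_rows([list(article) for article in articles])


# The caller (main.py-style) owns the single observable log message.
all_articles = [("2024-01-01", "Example title", "https://example.com")]  # illustrative row
batch_append_articles(FakeWorksheet(), all_articles)
logger.info(f"Batch write complete: {len(all_articles)} articles added to the sheet.")
```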