refactor(scraper): replace trafilatura with markitdown for HTML to Markdown conversion#44
Conversation
…rkdown conversion - Removed trafilatura dependency and its usage in scraper.py. - Added markitdown as a dependency in requirements.txt and integrated it for Markdown conversion. - Updated content extraction logic to use markitdown and extract page titles with BeautifulSoup. - Adjusted tests to mock markitdown usage and verify new scraping workflow.
|
""" WalkthroughThe changes replace the use of the Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_scraper.py (1)
42-42: Fix import placement and formatting issues.The import statement should be moved to the top of the file according to PEP 8 guidelines.
-from unittest.mock import patch, MagicMock +# Move this import to the top of the file after other importsAlso ensure proper spacing between function definitions (2 blank lines before function definitions at module level).
src/scraper.py (1)
75-83: Refactor unnecessary else block.The else block after a return statement is unnecessary and can be simplified.
if response.status_code != 200: logger.warning( f"Failed to fetch {url} with status code {response.status_code}" ) return [] - else: - content = response.text + content = response.text
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
requirements.txt(1 hunks)src/scraper.py(4 hunks)tests/test_scraper.py(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scraper.py (1)
src/database_manager.py (1)
DatabaseManager(7-150)
🪛 Pylint (3.3.7)
src/scraper.py
[refactor] 75-81: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
🪛 Flake8 (7.2.0)
tests/test_scraper.py
[error] 42-42: module level import not at top of file
(E402)
[error] 46-46: expected 2 blank lines, found 1
(E302)
[error] 67-67: expected 2 blank lines, found 1
(E302)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build-and-push
🔇 Additional comments (8)
tests/test_scraper.py (3)
46-66: Well-structured test with proper mocking.The test correctly mocks tempfile and os.remove operations to avoid actual file system interactions, and properly tests the new MarkItDown-based scraping functionality.
67-86: Comprehensive test coverage for markitdown integration.This test provides good coverage of the markitdown conversion with more complex HTML and markdown output, ensuring the integration works correctly.
137-137: Properly updated mock object for response handling change.The addition of the
textattribute correctly reflects the change fromresponse.contenttoresponse.textin the scraper implementation.src/scraper.py (5)
5-5: New import correctly added for markitdown integration.The MarkItDown import is appropriate for the new HTML to Markdown conversion functionality.
10-11: Imports added for temporary file handling.The tempfile and os imports are necessary for the new temporary file-based markdown conversion approach.
81-81: Appropriate change from bytes to text for HTML processing.The change from
response.contenttoresponse.textis correct for HTML processing with BeautifulSoup and ensures consistent text handling throughout the scraper.
117-132: Secure and well-implemented markdown conversion.The refactored scrape_page method correctly:
- Uses BeautifulSoup for HTML parsing and title extraction
- Safely handles temporary file creation and cleanup
- Properly converts HTML to markdown using MarkItDown
- Maintains error handling
The temporary file handling is secure with proper cleanup using os.remove().
230-230: Consistent response handling change.The change from
response.contenttoresponse.textmaintains consistency with the fetch_links method and ensures proper text handling.
Docstrings generation was requested by @obeone. * #44 (comment) The following files were modified: * `src/scraper.py` * `tests/test_scraper.py`
|
Note Generated docstrings for this pull request at #45 |
- Changes argument definitions in main.py to multi-line format - Adds detailed docstring and Raises section to main() - Adds minor whitespace and formatting improvements for clarity - No logic changes; improves code style and maintainability
…atible with markitdown)
Implement #38
Summary by CodeRabbit