refactor(scraper): replace trafilatura with markitdown for HTML to Markdown conversion by obeone · Pull Request #44 · obeone/crawler-to-md

obeone · 2025-07-05T07:09:22Z

Removed trafilatura dependency and its usage in scraper.py.
Added markitdown as a dependency in requirements.txt and integrated it for Markdown conversion.
Updated content extraction logic to use markitdown and extract page titles with BeautifulSoup.
Adjusted tests to mock markitdown usage and verify new scraping workflow.
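
The title-extraction step can be illustrated with a small sketch. This is not the PR's code: the PR uses BeautifulSoup (where the lookup is typically `soup.title`), whereas the stand-in below uses the standard library's `html.parser` so it runs without extra packages. `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html):
    """Return the page title, or None when the document has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Returning `None` for a missing title lets the caller decide on a fallback (for example, the URL) instead of storing an empty string.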

Implements #38.

Summary by CodeRabbit

Bug Fixes
- Improved web page scraping and markdown conversion for more accurate extraction of page titles and content.
Chores
- Updated dependencies to replace the markdown conversion library.
- Streamlined command-line interface code for better readability.
- Adjusted build workflow to target fewer platforms for Docker builds.
Tests
- Added new tests to enhance coverage of web scraping and markdown conversion functionality.


coderabbitai · 2025-07-05T07:09:28Z

"""

Walkthrough

The changes replace the use of the trafilatura library with BeautifulSoup and MarkItDown for HTML metadata extraction and markdown conversion in the scraping logic. The requirements are updated accordingly. Tests are enhanced to mock file operations and validate the new scraping approach, with no changes to public interfaces. Minor formatting and docstring improvements were made in main.py. The CI workflow was updated to remove the linux/i386 platform from Docker build targets.

Changes

| File(s) | Change Summary |
|---|---|
| requirements.txt | Removed `trafilatura`, added `markitdown` as a dependency. |
| src/scraper.py | Replaced `trafilatura` with `BeautifulSoup` and `MarkItDown` for metadata extraction and markdown conversion. Adjusted response handling and imports. |
| tests/test_scraper.py | Added tests for the new scraping logic, mocking file operations and updating dummy response classes. |
| main.py | Reformatted and enhanced `main()` function docstring and argument parsing for readability; no logic changes. |
| .github/workflows/build-and-publish.yaml | Removed `linux/i386` platform from Docker multi-arch build targets. |

Poem

In the garden of code, a scraper anew,
MarkItDown hops in, trafilatura bids adieu.
With soup and a file, the pages unfold,
While tests keep us safe from bugs uncontrolled.
Hooray for the changes, so nimble and bright—
The rabbit approves, and all is just right! 🐇✨
"""

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8207a4d and 934efa4.

📒 Files selected for processing (1)

.github/workflows/build-and-publish.yaml (1 hunks)

✅ Files skipped from review due to trivial changes (1)

.github/workflows/build-and-publish.yaml

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: build-and-push


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/test_scraper.py (1)

42-42: Fix import placement and formatting issues.

The import statement should be moved to the top of the file, per PEP 8.
-from unittest.mock import patch, MagicMock
+# Move this import to the top of the file after other imports
Also ensure that two blank lines precede each module-level function definition.

src/scraper.py (1)

75-83: Refactor unnecessary else block.

The else block after a return statement is unnecessary and can be removed.
                 if response.status_code != 200:
                     logger.warning(
                         f"Failed to fetch {url} with status code {response.status_code}"
                     )
                     return []
-                else:
-                    content = response.text
+                content = response.text
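
The same guard-clause pattern, shown as a runnable sketch. `read_body` is a hypothetical helper, not a function from `src/scraper.py`, and the `SimpleNamespace` objects stand in for real HTTP responses.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def read_body(url, response):
    """Return the response text, or None when the status is not 200."""
    if response.status_code != 200:
        logger.warning(
            "Failed to fetch %s with status code %s", url, response.status_code
        )
        return None
    # No else needed: this line only runs when the guard above did not return.
    return response.text


# Stand-in responses for illustration only.
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="")
```

Dropping the `else` keeps the success path at the function's base indentation, which is what Pylint's R1705 check asks for.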

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30cee92 and 9cf3f51.

📒 Files selected for processing (3)

requirements.txt (1 hunks)
src/scraper.py (4 hunks)
tests/test_scraper.py (2 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (1)

src/scraper.py (1)

src/database_manager.py (1)

DatabaseManager (7-150)

🪛 Pylint (3.3.7)

src/scraper.py

[refactor] 75-81: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🪛 Flake8 (7.2.0)

tests/test_scraper.py

[error] 42-42: module level import not at top of file

(E402)

[error] 46-46: expected 2 blank lines, found 1

(E302)

[error] 67-67: expected 2 blank lines, found 1

(E302)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: build-and-push

🔇 Additional comments (8)

tests/test_scraper.py (3)

46-66: Well-structured test with proper mocking.

The test correctly mocks tempfile and os.remove operations to avoid actual file system interactions, and properly tests the new MarkItDown-based scraping functionality.
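
A sketch of what mocking the file operations can look like with `unittest.mock`. The function under test, `save_and_convert`, and the fake path are hypothetical; the PR's actual tests and scraper method may be structured differently.

```python
import os
import tempfile
from unittest.mock import MagicMock, patch


def save_and_convert(html, convert):
    """Write html to a temp file, convert it by path, then delete the file."""
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
        tmp.write(html)
        path = tmp.name
    try:
        return convert(path)
    finally:
        os.remove(path)


def run_mocked_conversion():
    """Exercise save_and_convert without touching the file system."""
    fake_file = MagicMock()
    fake_file.__enter__.return_value = fake_file  # value bound by "with ... as"
    fake_file.name = "/tmp/fake.html"
    fake_convert = MagicMock(return_value="# Title")

    with patch("tempfile.NamedTemporaryFile", return_value=fake_file):
        with patch("os.remove") as fake_remove:
            result = save_and_convert("<h1>Title</h1>", fake_convert)

    return result, fake_file, fake_convert, fake_remove
```

Because both `tempfile.NamedTemporaryFile` and `os.remove` are patched, the test can assert on what would have been written and deleted without creating any file.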

67-86: Comprehensive test coverage for markitdown integration.

This test provides good coverage of the markitdown conversion with more complex HTML and markdown output, ensuring the integration works correctly.

137-137: Properly updated mock object for response handling change.

The addition of the text attribute correctly reflects the change from response.content to response.text in the scraper implementation.
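
For illustration, a test double that exposes both attributes might look like the sketch below. `DummyResponse` here is a hypothetical reconstruction, not the class from `tests/test_scraper.py`; it follows the `requests` convention that `content` holds raw bytes and `text` holds the decoded string.

```python
class DummyResponse:
    """Minimal stand-in for an HTTP response in tests."""

    def __init__(self, body, status_code=200, encoding="utf-8"):
        self.status_code = status_code
        self.encoding = encoding
        self.content = body.encode(encoding)  # raw bytes, as sent on the wire

    @property
    def text(self):
        # Decoded string form, which is what an HTML parser usually wants.
        return self.content.decode(self.encoding)
```

A double that only defined `content` would raise `AttributeError` as soon as the scraper switched to reading `response.text`, which is why the mock needed updating.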

src/scraper.py (5)

5-5: New import correctly added for markitdown integration.

The MarkItDown import is appropriate for the new HTML to Markdown conversion functionality.

10-11: Imports added for temporary file handling.

The tempfile and os imports are necessary for the new temporary file-based markdown conversion approach.

81-81: Appropriate change from bytes to text for HTML processing.

The change from response.content to response.text is correct for HTML processing with BeautifulSoup and ensures consistent text handling throughout the scraper.

117-132: Secure and well-implemented markdown conversion.

The refactored scrape_page method correctly:

- Uses BeautifulSoup for HTML parsing and title extraction
- Safely handles temporary file creation and cleanup
- Properly converts HTML to markdown using MarkItDown
- Maintains error handling

The temporary file handling is secure with proper cleanup using os.remove().
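
A minimal sketch of the temporary-file approach described in this comment, under stated assumptions: `html_to_markdown` is a hypothetical name, the use of `NamedTemporaryFile` is a guess at the implementation, and the default converter assumes `markitdown` is installed and that `MarkItDown().convert(path).text_content` yields the Markdown string. The converter is injectable so the cleanup behavior can be checked without the library.

```python
import os
import tempfile


def html_to_markdown(html, convert=None):
    """Convert an HTML string to Markdown by way of a temporary file."""
    if convert is None:
        # Assumption: markitdown is installed. Imported lazily so this sketch
        # still loads, and can be tested with a fake converter, without it.
        from markitdown import MarkItDown

        def convert(path):
            return MarkItDown().convert(path).text_content

    # delete=False so the file can be reopened by path on every platform.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(html)
        path = tmp.name
    try:
        return convert(path)
    finally:
        # Runs on success and on failure, so the temp file is always removed.
        os.remove(path)
```

Putting `os.remove` in a `finally` clause is the design point: if the converter raises, the exception still propagates to the caller, but no temporary file is left behind.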

230-230: Consistent response handling change.

The change from response.content to response.text maintains consistency with the fetch_links method and ensures proper text handling.

@obeone

Docstrings generation was requested by @obeone.
- #44 (comment)

The following files were modified:
- `src/scraper.py`
- `tests/test_scraper.py`

coderabbitai · 2025-07-05T07:14:00Z

Note

Generated docstrings for this pull request at #45

- Changes argument definitions in main.py to multi-line format
- Adds detailed docstring and Raises section to main()
- Adds minor whitespace and formatting improvements for clarity
- No logic changes; improves code style and maintainability


coderabbitai Bot reviewed Jul 5, 2025


Comment thread requirements.txt Outdated

coderabbitai Bot added a commit that referenced this pull request Jul 5, 2025

📝 Add docstrings to markitdown

0f85763


coderabbitai Bot mentioned this pull request Jul 5, 2025

📝 Add docstrings to markitdown #45

Closed

obeone added 3 commits July 5, 2025 09:16

chore(deps): bump markitdown from 0.1.1 to 0.1.2 in requirements

53c81a5

chore(ci): remove linux/i386 from build platforms in workflow (incompatible with markitdown)

934efa4

obeone merged commit 6cf87a3 into main Jul 5, 2025
6 checks passed

obeone deleted the markitdown branch July 13, 2025 16:03

obeone merged 4 commits into main from markitdown
