
refactor(scraper): replace trafilatura with markitdown for HTML to Markdown conversion #44

Merged
obeone merged 4 commits into main from markitdown on Jul 5, 2025

Conversation

@obeone
Owner

@obeone obeone commented Jul 5, 2025

  • Removed trafilatura dependency and its usage in scraper.py.
  • Added markitdown as a dependency in requirements.txt and integrated it for Markdown conversion.
  • Updated content extraction logic to use markitdown and extract page titles with BeautifulSoup.
  • Adjusted tests to mock markitdown usage and verify new scraping workflow.

Implements #38.

Summary by CodeRabbit

  • Bug Fixes
    • Improved web page scraping and markdown conversion for more accurate extraction of page titles and content.
  • Chores
    • Updated dependencies to replace the markdown conversion library.
  • Tests
    • Added new tests to enhance coverage of web scraping and markdown conversion functionality.
  • Chores
    • Streamlined command-line interface code for better readability.
    • Adjusted build workflow to target fewer platforms for Docker builds.

@coderabbitai
Contributor

coderabbitai Bot commented Jul 5, 2025

Walkthrough

The changes replace the use of the trafilatura library with BeautifulSoup and MarkItDown for HTML metadata extraction and markdown conversion in the scraping logic. The requirements are updated accordingly. Tests are enhanced to mock file operations and validate the new scraping approach, with no changes to public interfaces. Minor formatting and docstring improvements were made in main.py. The CI workflow was updated to remove the linux/i386 platform from Docker build targets.

Changes

| File(s) | Change Summary |
| --- | --- |
| requirements.txt | Removed trafilatura, added markitdown as a dependency. |
| src/scraper.py | Replaced trafilatura with BeautifulSoup and MarkItDown for metadata extraction and markdown conversion. Adjusted response handling and imports. |
| tests/test_scraper.py | Added tests for the new scraping logic, mocking file operations and updating dummy response classes. |
| main.py | Reformatted and enhanced main() function docstring and argument parsing for readability; no logic changes. |
| .github/workflows/build-and-publish.yaml | Removed linux/i386 platform from Docker multi-arch build targets. |

Poem

In the garden of code, a scraper anew,
MarkItDown hops in, trafilatura bids adieu.
With soup and a file, the pages unfold,
While tests keep us safe from bugs uncontrolled.
Hooray for the changes, so nimble and bright—
The rabbit approves, and all is just right! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8207a4d and 934efa4.

📒 Files selected for processing (1)
  • .github/workflows/build-and-publish.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/build-and-publish.yaml
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build-and-push


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_scraper.py (1)

42-42: Fix import placement and formatting issues.

The import statement should be moved to the top of the file according to PEP 8 guidelines.

-from unittest.mock import patch, MagicMock
+# Move this import to the top of the file after other imports

Also ensure proper spacing between function definitions (2 blank lines before function definitions at module level).

src/scraper.py (1)

75-83: Refactor unnecessary else block.

The else block after a return statement is unnecessary and can be simplified.

                 if response.status_code != 200:
                     logger.warning(
                         f"Failed to fetch {url} with status code {response.status_code}"
                     )
                     return []
-                else:
-                    content = response.text
+                content = response.text
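
The guard-clause shape suggested above can be sketched in isolation. This is a minimal illustration, not the code from src/scraper.py: `FakeResponse` and `read_body` are hypothetical names, and returning `None` on failure is an assumption made for the example (the reviewed method returns an empty list).

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    status_code: int
    text: str = ""


def read_body(response, url):
    """Return the response text, or None when the fetch did not succeed."""
    if response.status_code != 200:
        logger.warning(
            "Failed to fetch %s with status code %s", url, response.status_code
        )
        return None
    # No else branch is needed: this line only runs when the guard did not return.
    return response.text
```

Because the failure branch returns, the success path can sit at the outer indentation level, which is what Pylint's R1705 asks for.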
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30cee92 and 9cf3f51.

📒 Files selected for processing (3)
  • requirements.txt (1 hunks)
  • src/scraper.py (4 hunks)
  • tests/test_scraper.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scraper.py (1)
src/database_manager.py (1)
  • DatabaseManager (7-150)
🪛 Pylint (3.3.7)
src/scraper.py

[refactor] 75-81: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🪛 Flake8 (7.2.0)
tests/test_scraper.py

[error] 42-42: module level import not at top of file

(E402)


[error] 46-46: expected 2 blank lines, found 1

(E302)


[error] 67-67: expected 2 blank lines, found 1

(E302)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build-and-push
🔇 Additional comments (8)
tests/test_scraper.py (3)

46-66: Well-structured test with proper mocking.

The test correctly mocks tempfile and os.remove operations to avoid actual file system interactions, and properly tests the new MarkItDown-based scraping functionality.
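
A rough sketch of the mocking approach described here, using only the standard library. The function `convert_html` and the helper `run_mocked_conversion` are hypothetical and do not reproduce the actual test in tests/test_scraper.py; the converter is a `MagicMock` standing in for a MarkItDown-like object with a `convert(path)` method whose result has a `text_content` attribute.

```python
import os
import tempfile
from unittest.mock import MagicMock, patch


def convert_html(html, converter):
    """Hypothetical function under test: write HTML to a temp file and convert it."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html)
        path = handle.name
    try:
        return converter.convert(path).text_content
    finally:
        os.remove(path)


def run_mocked_conversion():
    """Exercise convert_html without touching the real file system."""
    fake_file = MagicMock()
    fake_file.name = "/tmp/fake.html"
    fake_file.__enter__.return_value = fake_file  # "with ... as handle" yields the fake

    converter = MagicMock()
    converter.convert.return_value.text_content = "# Title"

    with patch("tempfile.NamedTemporaryFile", return_value=fake_file), patch(
        "os.remove"
    ) as fake_remove:
        result = convert_html("<h1>Title</h1>", converter)
    return result, fake_file, converter, fake_remove
```

Patching `tempfile.NamedTemporaryFile` and `os.remove` by module attribute works here because `convert_html` looks both up through their modules at call time.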


67-86: Comprehensive test coverage for markitdown integration.

This test provides good coverage of the markitdown conversion with more complex HTML and markdown output, ensuring the integration works correctly.


137-137: Properly updated mock object for response handling change.

The addition of the text attribute correctly reflects the change from response.content to response.text in the scraper implementation.
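
For illustration, a dummy response exposing both attributes might look like the following. `DummyResponse` here is a hypothetical sketch and not the class from tests/test_scraper.py; it assumes the scraper only reads `status_code`, `content`, and `text`.

```python
class DummyResponse:
    """Hypothetical test double exposing raw bytes and decoded text."""

    def __init__(self, html, status_code=200, encoding="utf-8"):
        self.status_code = status_code
        self.encoding = encoding
        self.content = html.encode(encoding)  # raw bytes, like response.content

    @property
    def text(self):
        # Decoded string, like response.text
        return self.content.decode(self.encoding)
```

Code that switched from `response.content` to `response.text` receives a `str` from this double, so a parser fed by the scraper sees the same type it would in production.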

src/scraper.py (5)

5-5: New import correctly added for markitdown integration.

The MarkItDown import is appropriate for the new HTML to Markdown conversion functionality.


10-11: Imports added for temporary file handling.

The tempfile and os imports are necessary for the new temporary file-based markdown conversion approach.


81-81: Appropriate change from bytes to text for HTML processing.

The change from response.content to response.text is correct for HTML processing with BeautifulSoup and ensures consistent text handling throughout the scraper.


117-132: Secure and well-implemented markdown conversion.

The refactored scrape_page method correctly:

  • Uses BeautifulSoup for HTML parsing and title extraction
  • Safely handles temporary file creation and cleanup
  • Properly converts HTML to markdown using MarkItDown
  • Maintains error handling

The temporary file handling is secure with proper cleanup using os.remove().
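
The temporary-file pattern described above can be sketched as follows. This is an assumption-laden illustration rather than the actual scrape_page code: `html_to_markdown` and `markitdown_convert` are hypothetical names, the converter is injected as a callable so the file handling can be exercised on its own, and the MarkItDown call assumes the library's `convert(path)` method returning a result with `text_content`.

```python
import os
import tempfile


def html_to_markdown(html, convert_file):
    """Write html to a temporary file, convert it, and always clean up.

    convert_file is any callable that takes a file path and returns a string.
    """
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html)
        return convert_file(path)
    finally:
        # Runs on success and on error, so the temp file never lingers.
        os.remove(path)


def markitdown_convert(path):
    """Assumed MarkItDown usage; requires the third-party markitdown package."""
    from markitdown import MarkItDown  # imported lazily so the sketch loads without it

    return MarkItDown().convert(path).text_content
```

With markitdown installed, `html_to_markdown(html, markitdown_convert)` would perform the conversion. Placing `os.remove()` in a `finally` block is what keeps cleanup reliable when the converter raises.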


230-230: Consistent response handling change.

The change from response.content to response.text maintains consistency with the fetch_links method and ensures proper text handling.

Comment thread requirements.txt Outdated
coderabbitai Bot added a commit that referenced this pull request Jul 5, 2025
Docstrings generation was requested by @obeone.

* #44 (comment)

The following files were modified:

* `src/scraper.py`
* `tests/test_scraper.py`
@coderabbitai
Contributor

coderabbitai Bot commented Jul 5, 2025

Note

Generated docstrings for this pull request at #45

obeone added 3 commits July 5, 2025 09:16
- Changes argument definitions in main.py to multi-line format
- Adds detailed docstring and Raises section to main()
- Adds minor whitespace and formatting improvements for clarity
- No logic changes; improves code style and maintainability
@obeone obeone merged commit 6cf87a3 into main Jul 5, 2025
6 checks passed
@obeone obeone deleted the markitdown branch July 13, 2025 16:03