Add proxy support option#50
Conversation
WalkthroughThe changes introduce an optional Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI
participant Scraper
participant HTTP_Server
User->>CLI: Run with --proxy <PROXY_URL>
CLI->>Scraper: Instantiate Scraper(proxy=<PROXY_URL>)
Scraper->>Scraper: Configure requests.Session with proxy
Scraper->>HTTP_Server: Make HTTP requests via proxy
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
crawler_to_md/scraper.py (1)
47-50: LGTM! Proxy configuration implemented correctly.The implementation properly:
- Uses
requests.Sessionfor better connection management- Configures proxy for both HTTP and HTTPS protocols
- Stores the proxy reference for potential future use
Consider adding basic URL validation for the proxy parameter to catch common formatting errors early.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md(2 hunks)crawler_to_md/cli.py(2 hunks)crawler_to_md/scraper.py(4 hunks)tests/test_cli.py(1 hunks)tests/test_scraper.py(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_scraper.py (1)
crawler_to_md/scraper.py (1)
Scraper(19-291)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-and-push
🔇 Additional comments (11)
README.md (2)
55-55: LGTM! Documentation updated correctly.The usage example properly includes the new
--proxyoption in the command line example.
70-70: LGTM! Clear option description.The proxy option documentation is concise and appropriately describes its purpose.
crawler_to_md/scraper.py (4)
27-27: LGTM! Proxy parameter added correctly.The proxy parameter is properly added with appropriate default value.
39-39: LGTM! Docstring updated appropriately.The parameter documentation is clear and follows the existing format.
88-88: LGTM! Updated to use session.Correctly replaced
requests.getwithself.session.getto utilize the configured proxy.
242-242: LGTM! Consistent session usage.Correctly replaced
requests.getwithself.session.getfor consistent proxy usage throughout the scraper.tests/test_scraper.py (2)
155-155: LGTM! Mock target updated correctly.The monkeypatch correctly targets
scraper.session.getinstead ofrequests.getto match the implementation change.
176-182: LGTM! Comprehensive proxy test added.The test effectively verifies that:
- Proxy parameter is correctly passed to the Scraper constructor
- Session proxies are configured for both HTTP and HTTPS protocols
- The proxy URL is properly stored in the session
The test follows existing patterns and provides good coverage for the new functionality.
tests/test_cli.py (1)
48-82: LGTM! Thorough CLI proxy test.The test effectively verifies that:
- The
--proxyCLI argument is correctly parsed- The proxy value is properly passed to the Scraper constructor
- The test uses appropriate mocking to isolate the functionality being tested
The test follows existing patterns and provides good coverage for the new CLI option.
crawler_to_md/cli.py (2)
91-95: LGTM! Proxy argument added correctly.The
--proxyargument is properly configured with:
- Appropriate help text describing its purpose
- Default value of
Nonewhen not specified- Consistent with other CLI argument patterns
180-180: LGTM! Proxy argument passed correctly.The proxy argument is correctly passed to the Scraper constructor, completing the integration of the proxy feature through the CLI.
Docstrings generation was requested by @obeone. * #50 (comment) The following files were modified: * `crawler_to_md/cli.py` * `crawler_to_md/scraper.py` * `tests/test_cli.py` * `tests/test_scraper.py`
|
Note Generated docstrings for this pull request at #51 |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/test_cli.py (4)
51-85: Refactor to reduce code duplication.The
fake_initfunction and monkeypatching setup are duplicated across multiple test functions. Consider extracting common setup logic into a helper function or fixture.+def _setup_proxy_test(monkeypatch, captured): + def fake_init( + self, + base_url, + exclude_patterns, + db_manager, + rate_limit=0, + delay=0, + proxy=None, + ): + captured['proxy'] = proxy + + monkeypatch.setattr(Scraper, '__init__', fake_init) + monkeypatch.setattr(Scraper, 'start_scraping', lambda *a, **k: None) + monkeypatch.setattr(ExportManager, 'export_to_markdown', lambda *a, **k: None) + monkeypatch.setattr(ExportManager, 'export_to_json', lambda *a, **k: None) + def test_cli_proxy_option(monkeypatch, tmp_path): captured = {} - - def fake_init( - self, - base_url, - exclude_patterns, - db_manager, - rate_limit=0, - delay=0, - proxy=None, - ): - captured['proxy'] = proxy - - monkeypatch.setattr(Scraper, '__init__', fake_init) - monkeypatch.setattr(Scraper, 'start_scraping', lambda *a, **k: None) - monkeypatch.setattr(ExportManager, 'export_to_markdown', lambda *a, **k: None) - monkeypatch.setattr(ExportManager, 'export_to_json', lambda *a, **k: None) + _setup_proxy_test(monkeypatch, captured)
87-121: Use consistent quote style.The test uses single quotes while the rest of the file uses double quotes. Maintain consistency with the existing codebase.
- cache_folder = tmp_path / 'cache' + cache_folder = tmp_path / "cache" - 'prog', - '--url', - 'http://example.com', - '--output-folder', + "prog", + "--url", + "http://example.com", + "--output-folder",
159-179: Consider adding more specific error validation.The test only verifies that
SystemExitis raised but doesn't validate the error message or exit code. Consider adding assertions for more specific error handling.+import contextlib +import io + def test_cli_proxy_error(monkeypatch, tmp_path): def fake_init(*a, **k): raise ValueError('Proxy unreachable') monkeypatch.setattr(Scraper, '__init__', fake_init) + + # Capture stderr to verify error message + stderr_capture = io.StringIO() + cache_folder = tmp_path / 'cache' args = [ 'prog', '--url', 'http://example.com', '--output-folder', str(tmp_path), '--cache-folder', str(cache_folder), '--proxy', 'http://proxy:8080', ] monkeypatch.setattr(sys, 'argv', args) - with pytest.raises(SystemExit): - cli.main() + + with contextlib.redirect_stderr(stderr_capture): + with pytest.raises(SystemExit) as exc_info: + cli.main() + + assert exc_info.value.code == 1 + assert 'Proxy unreachable' in stderr_capture.getvalue()
51-179: Consider adding edge case tests.The current test suite covers the main scenarios but could benefit from additional edge cases to ensure robustness.
Additional test cases to consider:
- Empty proxy string
- Malformed proxy URLs
- Proxy with authentication
- IPv6 proxy addresses
def test_cli_proxy_empty_string(monkeypatch, tmp_path): # Test behavior with empty proxy string pass def test_cli_proxy_malformed_url(monkeypatch, tmp_path): # Test behavior with malformed proxy URL pass
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md(2 hunks)crawler_to_md/cli.py(2 hunks)crawler_to_md/scraper.py(4 hunks)tests/test_cli.py(2 hunks)tests/test_scraper.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- README.md
- crawler_to_md/cli.py
- tests/test_scraper.py
- crawler_to_md/scraper.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-and-push
🔇 Additional comments (2)
tests/test_cli.py (2)
2-4: LGTM: Import changes are appropriate.The addition of pytest import and formatting adjustment are correct for the new test functionality.
123-157: Good coverage of SOCKS proxy support.This test appropriately verifies that SOCKS proxy URLs are handled correctly, which is important given the
requests[socks]dependency mentioned in the summary.
Summary
--proxyCLI flagScraperScraperviarequests.SessionTesting
pytest -qhttps://chatgpt.com/codex/tasks/task_e_686e78addf04832e9325f74ff9395904
Summary by CodeRabbit
--proxycommand-line option.