📝 Add docstrings to codex/ajouter-option-pour-utiliser-un-proxy
Docstrings generation was requested by @obeone.
* #50 (comment)
The following files were modified:
* `crawler_to_md/cli.py`
* `crawler_to_md/scraper.py`
* `tests/test_cli.py`
* `tests/test_scraper.py`
crawler_to_md/cli.py
+3 -7 (3 additions & 7 deletions)
@@ -19,13 +19,9 @@
 
 def main():
     """
-    Main function to start the web scraper application.
-
-    This function parses command line arguments, initializes necessary components,
-    and manages the scraping and exporting process.
-
-    Raises:
-        ValueError: If neither a URL nor a URLs file is provided.
+    Parses command-line arguments and orchestrates the web scraping, data storage, and export process.
+
+    This function serves as the main entry point for the web scraper application. It handles argument parsing, initializes required components, manages the scraping workflow, and exports the results to Markdown and JSON formats as specified by the user. The function ensures necessary directories exist, validates input, and provides user feedback on output locations.
     """
     logger.info("Starting the web scraper application.")
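For reviewers who want a concrete picture of the "validates input" and "ensures necessary directories exist" steps the new docstring mentions, here is a minimal sketch. The flag names (`--url`, `--urls-file`, `--output-folder`) and both helper functions are assumptions made for illustration; they are not taken from `cli.py`. The `ValueError` mirrors the `Raises:` entry that the old docstring documented.

```python
import argparse
import os


def parse_cli_args(argv):
    """Hypothetical stand-in for the argument handling done inside main()."""
    parser = argparse.ArgumentParser(description="Sketch of a scraper CLI")
    # All flag names below are assumed, not copied from the project.
    parser.add_argument("--url", help="single start URL")
    parser.add_argument("--urls-file", help="file containing one URL per line")
    parser.add_argument("--output-folder", default="output")
    args = parser.parse_args(argv)

    # Input validation: at least one source of URLs is required.
    if not args.url and not args.urls_file:
        raise ValueError("Either --url or --urls-file must be provided.")
    return args


def ensure_output_dir(path):
    """Create the output directory if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path
```

Note that the rewritten docstring drops the `Raises:` section, so if `main()` still raises on missing input, that behavior is no longer documented.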
-            html (str, optional): The HTML content of the page.
-
+        Retrieve all valid links from a specified URL or provided HTML content.
+
+        If HTML is not provided, the method fetches the page content using an HTTP GET request. Extracts and resolves all anchor tag links, removes URL fragments, and filters them using the link validation logic. Returns a set of valid links found on the page. Returns an empty list if the request fails.
+
+        Parameters:
+            url (str): The URL to extract links from.
+            html (str, optional): HTML content to parse instead of fetching from the URL.
+
         Returns:
-            set: Set of valid links found on the page.
+            set: A set of valid, filtered links found on the page, or an empty set if none are found or on error.
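The extraction steps described in this docstring (collect anchor tags, resolve relative links, strip fragments, filter) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `extract_links`, `_AnchorCollector`, and the default same-host filter are hypothetical names and choices, and the HTTP fetch is left out so the sketch only covers the case where HTML is passed in.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html, is_valid=None):
    """Return the set of valid, fragment-free absolute links found in html."""
    if is_valid is None:
        # Assumed default filter: http(s) links on the same host as base_url.
        base_host = urlparse(base_url).netloc

        def is_valid(link):
            parts = urlparse(link)
            return parts.scheme in ("http", "https") and parts.netloc == base_host

    collector = _AnchorCollector()
    collector.feed(html)

    links = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)        # resolve relative links
        clean, _fragment = urldefrag(absolute)    # drop any "#fragment" part
        if is_valid(clean):
            links.add(clean)
    return links
```

The sketch always returns a set, including on empty input. The new docstring says both "empty list" (in the description) and "empty set" (under `Returns:`) for the failure case, so one of the two statements likely needs correcting.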
-        Initiates the scraping process for a single URL or a list of URLs.
-        It validates URLs, logs the scraping process, and manages the
-        progress of scraping through the database.
-
-        Args:
-            url (str, optional): A single URL to start scraping from. Defaults to None.
-            urls_list (list, optional): A list of URLs to scrape.
-        """
+        Starts the web scraping process from a given URL or list of URLs, managing progress, rate limiting, and database integration.
+
+        If a list of URLs is provided, only valid URLs are inserted into the database; otherwise, a single URL is used as the starting point. The method iteratively fetches unvisited links from the database, retrieves and processes each page, stores scraped content and metadata, and discovers new links to continue scraping (unless a predefined list is used). Progress is tracked with a progress bar, and rate limiting and delays are enforced as configured. The process continues until all discovered links have been visited.
         # Validate and insert the provided URLs into the database
         if urls_list:
             # Iterate through the list to check for valid URLs
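The loop this docstring describes (take an unvisited link, fetch it, store the page, queue newly discovered links, wait, repeat until nothing is left) can be sketched as follows. This is a simplified model under stated assumptions: an in-memory queue and dict stand in for the database, there is no progress bar, and `crawl`, `fetch`, `extract_links`, `follow_links`, and `sleep` are hypothetical names introduced for the example rather than the `Scraper` API.

```python
import time
from collections import deque


def crawl(start_urls, fetch, extract_links, delay=0.0, follow_links=True,
          sleep=time.sleep):
    """Visit every reachable URL once and return (pages, visited).

    fetch(url) returns the page content, or None on failure.
    extract_links(url, content) returns a set of URLs found on the page.
    follow_links=False models the "predefined list" case, where no new
    links are discovered beyond start_urls.
    """
    pending = deque(dict.fromkeys(start_urls))  # de-duplicate, keep order
    visited = set()
    pages = {}

    while pending:
        url = pending.popleft()
        if url in visited:
            continue
        visited.add(url)  # mark as visited even if the fetch fails

        content = fetch(url)
        if content is None:
            continue
        pages[url] = content  # stand-in for storing the page in the database

        if follow_links:
            for link in sorted(extract_links(url, content)):
                if link not in visited:
                    pending.append(link)

        if delay:
            sleep(delay)  # fixed delay between successful requests

    return pages, visited
```

Passing `sleep` as a parameter keeps the delay testable without actually waiting, which is the same reason the project's test replaces its HTTP and progress dependencies with mocks.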
tests/test_scraper.py
+14 -0 (14 additions & 0 deletions)
@@ -138,6 +138,11 @@ def get_all_pages(self):
 
 
 def test_start_scraping_process(monkeypatch):
+    """
+    Test the complete scraping process, verifying link tracking, page storage, and integration with mocked dependencies.
+
+    This test ensures that the `Scraper.start_scraping()` method correctly inserts and marks links as visited, stores scraped page content, and interacts properly with mocked HTTP requests and progress tracking.