better normalize_url #1108
Conversation
""" WalkthroughThe Changes
Poem
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. Note ⚡️ Faster reviews with cachingCodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🔇 Additional comments (6)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
crawl4ai/utils.py (2)
1996-2071: Consider caching frequent URLs for performance improvement.
The function handles many complex operations which could be expensive when called repeatedly with the same URLs (common in web crawling). Since URL normalization results are deterministic, this function would benefit from caching.
Apply the `@lru_cache` decorator as done for `efficient_normalize_url_for_deep_crawl` in line 2122:

```diff
+@lru_cache(maxsize=1000)
 def normalize_url(href, base_url):
     """Normalize URLs to ensure consistent format with a better way
     also remove tracking parameters"""
```
1999-2028: Move tracking parameters list outside the function.
The comprehensive list of tracking parameters is recreated every time the function is called, which is inefficient. Consider moving it to a module-level constant.

```diff
+# Common tracking parameters to be removed from URLs
+TRACKING_PARAMETERS = [
+    'utm_source', 'utm_medium', 'utm_campaign',
+    'utm_term', 'utm_content', 'fbclid', 'gclid',
+    # ... rest of the parameters
+]

 def normalize_url(href, base_url):
     """Normalize URLs to ensure consistent format with a better way
     also remove tracking parameters"""
-    params_to_remove = [
-        'utm_source', 'utm_medium', 'utm_campaign',
-        'utm_term', 'utm_content', 'fbclid', 'gclid',
-        # ... rest of the parameters
-    ]
+    params_to_remove = TRACKING_PARAMETERS
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/utils.py
(2 hunks)
🔇 Additional comments (4)
crawl4ai/utils.py (4)
1996-1998: Clear and improved docstring of the function.
The docstring has been updated to explicitly mention both URL normalization and tracking parameter removal, accurately reflecting the function's enhanced capabilities.
1999-2028: Well-organized, comprehensive list of tracking parameters.
The extensive list of tracking parameters covers a wide range of common analytics and marketing tracking identifiers. This is essential for privacy enhancement and will make URLs cleaner and more consistent.
2030-2044: Robust handling of various URL formats.
The implementation now properly handles multiple URL prefix patterns, including:
- URLs starting with "www." or "WWW."
- URLs starting with "/www."
- Protocol-relative URLs starting with "//"
- Absolute URLs with http/https
- Malformed URLs with missing slashes
- Relative URLs
This should significantly improve the robustness of URL processing in the application.
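For illustration, a minimal sketch of how prefix handling along these lines might look; this is not the PR's exact code, and the helper name `_normalize_prefix` is invented for this note:

```python
from urllib.parse import urljoin

def _normalize_prefix(href: str, base_url: str) -> str:
    """Illustrative only: map the prefix patterns listed above onto absolute URLs."""
    href = href.strip()
    lower = href.lower()
    if lower.startswith("www."):                      # www.example.com / WWW.example.com
        return "https://" + href
    if lower.startswith("/www."):                     # /www.example.com
        return "https://" + href.lstrip("/")
    if href.startswith("//"):                         # protocol-relative //example.com
        return "https:" + href
    if lower.startswith(("http:/", "https:/")) and "://" not in href:
        return href.replace(":/", "://", 1)           # repair missing slash (http:/example.com)
    if lower.startswith(("http://", "https://")):     # already absolute
        return href
    return urljoin(base_url, href)                    # relative URL resolved against base_url
```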
2048-2064: ❓ Verification inconclusive
Verify consistent behavior with the previous implementation.
The function now uses `parse_qs` and `urlencode`, which handle parameters differently than the previous implementation might have. Specifically, `parse_qs` returns a dictionary where each value is a list, and `urlencode` with `doseq=True` maintains this structure.
Run the following script to check how the function handles URLs with multiple values for the same parameter:
🏁 Script executed:
```bash
#!/bin/bash
# Test how normalize_url handles multiple values for the same parameter

# Create a Python script for testing
cat > test_normalize_url.py << 'EOF'
from crawl4ai.utils import normalize_url

# Test URLs with multiple values for the same parameter
test_urls = [
    "https://example.com/path?param1=value1&param1=value2&utm_source=test",
    "https://example.com/path?normal=value&utm_source=value1&utm_source=value2",
    "https://www.example.com/path?a=1&b=2&c=3",
    "//example.com/path?param=value&fbclid=123",
    "http:/example.com/path?q=test&gclid=abc123",
    "/path/to/resource?t=now&utm_campaign=spring",
    "example.com/no-protocol?ref=homepage"
]

base_url = "https://baseurl.com"

print("=== Testing normalize_url function ===")
for url in test_urls:
    normalized = normalize_url(url, base_url)
    print(f"\nOriginal: {url}")
    print(f"Normalized: {normalized}")
EOF

# Run the test script
python test_normalize_url.py
```

Length of output: 603
🏁 Script executed:
```bash
#!/bin/bash
set -e

# Install required dependencies
pip install pydantic

# Install the local crawl4ai package in editable mode
pip install -e .

# Recreate and run the normalize_url test
cat > test_normalize_url.py << 'EOF'
from crawl4ai.utils import normalize_url

# Test URLs with multiple values for the same parameter
test_urls = [
    "https://example.com/path?param1=value1&param1=value2&utm_source=test",
    "https://example.com/path?normal=value&utm_source=value1&utm_source=value2",
    "https://www.example.com/path?a=1&b=2&c=3",
    "//example.com/path?param=value&fbclid=123",
    "http:/example.com/path?q=test&gclid=abc123",
    "/path/to/resource?t=now&utm_campaign=spring",
    "example.com/no-protocol?ref=homepage"
]

base_url = "https://baseurl.com"

print("=== Testing normalize_url function ===")
for url in test_urls:
    normalized = normalize_url(url, base_url)
    print(f"\nOriginal: {url}")
    print(f"Normalized: {normalized}")
EOF

python test_normalize_url.py
```

Length of output: 18444
Double-check normalize_url behavior with duplicate query parameters
The updated `normalize_url` in `crawl4ai/utils.py` now uses `parse_qs` (which groups values into lists) and `urlencode(..., doseq=True)` to rebuild the query string. Please confirm this matches the previous behavior for parameters appearing multiple times, especially that:
- Duplicate keys (e.g. `param1=value1&param1=value2`) are preserved in order and count.
- Removing entries in `params_to_remove` (e.g. `utm_source`, `fbclid`) doesn't inadvertently drop or reorder other parameters.

Suggested actions:
- Add unit tests for URLs with repeated parameters, verifying both removal of unwanted keys and retention of all other values.
- Manually compare output of the new implementation against the old one for a set of representative URLs.
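A minimal sketch of such a unit test, assuming pytest and an importable crawl4ai package; the expected values are assumptions about the intended behavior rather than verified output:

```python
from urllib.parse import urlparse, parse_qs

from crawl4ai.utils import normalize_url


def test_duplicate_params_kept_and_trackers_removed():
    url = "https://example.com/path?param1=value1&param1=value2&utm_source=test"
    normalized = normalize_url(url, "https://example.com")
    query = parse_qs(urlparse(normalized).query)

    # Duplicate keys should survive normalization in order and count
    assert query.get("param1") == ["value1", "value2"]
    # Tracking parameters should have been stripped
    assert "utm_source" not in query
```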
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (1)
crawl4ai/utils.py (1)
2047-2047: This line correctly uses `_url` for URL parsing.
Following the previous review feedback, the function now correctly uses the normalized URL variable (`_url`) for parsing instead of the original URL.
🧹 Nitpick comments (2)
crawl4ai/utils.py (2)
2065-2071: Error handling could be improved.
The fallback mechanism is a good practice, but could be enhanced to:
- Catch specific exceptions rather than a generic `Exception`
- Include the problematic URL in the error message
- Implement a more robust URL reconstruction logic
```diff
-    except Exception as e:
-        print(f"Error cleaning URL: {e}")
-        _ss = _url.split('//')
-        if len(_ss) == 2:
-            return '//'.join(_ss)
-        else:
-            return _ss[0] + '//' + '/'.join(_ss[1:])
+    except (ValueError, AttributeError, TypeError) as e:
+        print(f"Error cleaning URL '{_url}': {e}")
+        try:
+            # More robust fallback that preserves the original URL structure
+            _ss = _url.split('//')
+            if len(_ss) == 2:
+                return '//'.join(_ss)
+            else:
+                return _ss[0] + '//' + '/'.join(_ss[1:])
+        except:
+            # Ultimate fallback - return the input URL
+            return href
```
1996-1997: Consider adding more descriptive documentation.
The function comments should be expanded to better explain:
- What tracking parameters are being removed
- How different URL formats are normalized
- The fallback behavior
```diff
 def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format with a better way
-    also remove tracking parameters"""
+    """Normalize URLs to ensure consistent format and remove tracking parameters.
+
+    This function:
+    1. Normalizes URLs with various formats (www., //, http:/, etc.)
+    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
+    3. Handles URL parsing errors with a fallback mechanism
+
+    Args:
+        href (str): The URL to normalize
+        base_url (str): The base URL to use for relative URLs
+
+    Returns:
+        str: The normalized and cleaned URL
+    """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/utils.py
(2 hunks)
🔇 Additional comments (2)
crawl4ai/utils.py (2)
1995-2046: Excellent enhancement to URL normalization, handling various edge cases.
The function now properly handles a wide variety of URL formats (www., //, http:/, etc.) and includes a comprehensive list of tracking parameters to remove. This implementation significantly improves privacy protection and strengthens anti-crawling measures as intended.
2047-2063: Well-implemented tracking parameter removal logic.
The implementation correctly parses the URL, removes tracking parameters, and reconstructs the cleaned URL. This approach is robust and handles URL parsing and cleaning efficiently.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/utils.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py
2086-2086: Do not use bare `except` (E722)
🔇 Additional comments (6)
crawl4ai/utils.py (6)
1994-2023: Comprehensive tracking parameter list enhances privacy protection.
This extensive list of tracking parameters will significantly improve privacy by removing identifiers from various marketing and analytics platforms, including:
- UTM parameters used for campaign tracking
- Facebook, Google, and other social media tracking parameters
- Marketing automation platform identifiers (Mailchimp, etc.)
- Various analytics and tracking IDs
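As a hypothetical illustration (URL invented for this note), a link such as `https://example.com/article?id=42&utm_source=newsletter&fbclid=abc123` would be reduced to `https://example.com/article?id=42`.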
2025-2026: Good optimization with LRU cache.
The LRU cache implementation with a size of 1000 entries is an excellent performance optimization for repeatedly processed URLs, which is common in web crawling scenarios.
2027-2040: Well-documented function with clear purpose.
The docstring clearly explains the function's three primary purposes:
- Normalizing URLs with various formats
- Removing tracking parameters
- Handling URL parsing errors with fallbacks
This documentation helps future maintainers understand the intent and functionality.
2042-2057: Robust handling of URL prefix variations.
The logic correctly handles various URL prefix formats that might be encountered in real-world websites:
- www. and WWW. prefixes
- Protocol-relative URLs (starting with //)
- Malformed http:/ and https:/ URLs (missing slash)
- Relative URLs joined with base_url
2058-2076: Effective tracking parameter removal logic.
The implementation correctly:
- Parses the normalized URL
- Extracts query parameters
- Removes tracking parameters
- Reconstructs the query string
- Rebuilds the full URL
This addresses the PR objective of privacy enhancement.
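A minimal, self-contained sketch of this parse/filter/rebuild pipeline, assuming a module-level set of tracking keys; the names `TRACKING_PARAMETERS` and `strip_tracking_params` are illustrative, not the PR's exact code:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Illustrative subset; the PR's actual list is far more extensive
TRACKING_PARAMETERS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def strip_tracking_params(url: str) -> str:
    """Parse the URL, drop known tracking parameters, and rebuild it."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query, keep_blank_values=True)   # values become lists
    cleaned = {k: v for k, v in query.items() if k not in TRACKING_PARAMETERS}
    new_query = urlencode(cleaned, doseq=True)                # doseq keeps duplicate keys
    return urlunparse(parsed._replace(query=new_query))

# strip_tracking_params("https://example.com/a?x=1&x=2&utm_source=mail")
# -> "https://example.com/a?x=1&x=2"
```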
2059-2059: Fixed previous URL variable discrepancy.
This resolves the discrepancy identified in a previous review where the function was parsing `url` instead of `_url`. The normalized URL is now correctly used for parsing.
solve None input Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
crawl4ai/utils.py (2)
2085-2096: Consider handling additional URL edge cases.
While the error handling is good, there are additional edge cases that could be improved:
- The function doesn't handle data URLs, mailto links, or JavaScript URLs specially
- URLs with authentication (username:password@domain) might be processed differently than expected
Consider adding special handling for these cases to further improve robustness.
```diff
 @lru_cache(maxsize=1000)
 def normalize_url(href, base_url):
     # Guard against None or empty inputs
     if href is None:
         return None
     href = href.strip()
     if not href:
         return None
+    # Handle special URL schemes that shouldn't be modified
+    special_schemes = ['mailto:', 'tel:', 'javascript:', 'data:']
+    if any(href.lower().startswith(scheme) for scheme in special_schemes):
+        return href
     url = href
     # Rest of the function...
```
2025-2025: Consider a larger LRU cache size for heavily accessed sites.
The current `@lru_cache(maxsize=1000)` is a good starting point, but depending on your application's traffic patterns, you might need a larger cache size. If your application crawls many sites with unique URLs, consider increasing this value to reduce repeated normalizations.
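If tuning is needed, `functools.lru_cache` exposes runtime statistics that can guide the choice of `maxsize`; a brief, hypothetical check might look like this (assumes `normalize_url` is decorated with `lru_cache` as in this PR):

```python
from crawl4ai.utils import normalize_url

# Inspect cache effectiveness after a crawl run
info = normalize_url.cache_info()   # CacheInfo(hits, misses, maxsize, currsize)
hit_rate = info.hits / ((info.hits + info.misses) or 1)
print(f"hit rate: {hit_rate:.1%}, cache fill: {info.currsize}/{info.maxsize}")
# A low hit rate while currsize == maxsize suggests raising maxsize.
```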
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/utils.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py
2065-2065: SyntaxError: Got unexpected token …
🔇 Additional comments (3)
crawl4ai/utils.py (3)
1994-2023: The tracking parameter list is comprehensive.
This extensive list of tracking parameters covers a wide range of advertising, analytics, and marketing platforms, including Google UTM parameters, Facebook tracking, email marketing trackers, and many others. This will significantly improve privacy for users by removing these identifiers during URL normalization.
2067-2084: The URL cleaning implementation effectively removes tracking parameters.
The implementation correctly:
- Parses the URL using `urlparse`
- Extracts query parameters using `parse_qs`
- Removes all tracking parameters from the query dict
- Reconstructs the query string with `urlencode`
- Builds the clean URL with `urlunparse`

This approach is robust and will clean URLs as expected.
2025-2040: ⚠️ Potential issue
Fix duplicate function definition causing decorator to be ineffective.
There's a critical error in the code structure. The `@lru_cache` decorator is applied to an empty function definition (lines 2025-2026), while the actual implementation starts on line 2042. This means the caching functionality won't work.
Apply this correction to fix the function structure:
```diff
 @lru_cache(maxsize=1000)
-def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format and remove tracking parameters.
-
-    This function:
-    1. Normalizes URLs with various formats (www., //, http:/, etc.)
-    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
-    3. Handles URL parsing errors with a fallback mechanism
-
-    Args:
-        href (str): The URL to normalize
-        base_url (str): The base URL to use for relative URLs
-
-    Returns:
-        str: The normalized and cleaned URL
-    """
-
-def normalize_url(href, base_url):
+def normalize_url(href, base_url):
+    """Normalize URLs to ensure consistent format and remove tracking parameters.
+
+    This function:
+    1. Normalizes URLs with various formats (www., //, http:/, etc.)
+    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
+    3. Handles URL parsing errors with a fallback mechanism
+
+    Args:
+        href (str): The URL to normalize
+        base_url (str): The base URL to use for relative URLs
+
+    Returns:
+        str: The normalized and cleaned URL
+    """
     # Guard against None or empty inputs
     if href is None:
         return None
```

Likely an incorrect or invalid review comment.
remove mistake ... Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
crawl4ai/utils.py (1)
2084-2095: 🛠️ Refactor suggestion
Improve error handling by avoiding bare except.
Your error handling includes a good first try/except with specific exception types, but contains a bare `except` clause in the fallback which can mask unexpected errors.
Apply this change to capture and log specific exception types in the fallback:
```diff
     except (ValueError, AttributeError, TypeError) as e:
         print(f"Error cleaning URL '{_url}': {e}")
         try:
             # More robust fallback that preserves the original URL structure
             _ss = _url.split('//')
             if len(_ss) == 2:
                 return '//'.join(_ss)
             else:
                 return _ss[0] + '//' + '/'.join(_ss[1:])
-        except:
+        except Exception as fallback_error:
+            print(f"Fallback error for URL '{_url}': {fallback_error}")
             # Ultimate fallback - return the uncleaned url
             return _url
```

🧰 Tools
🪛 Ruff (0.8.2)
2093-2093: Do not use bare `except` (E722)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/utils.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py
2042-2042: Redefinition of unused `normalize_url` from line 2026 (F811)
2093-2093: Do not use bare `except` (E722)
🔇 Additional comments (5)
crawl4ai/utils.py (5)
1994-2023: Comprehensive tracking parameter list enhances privacy.
The extensive list of tracking parameters to remove is well-organized and thorough, covering major analytics platforms (UTM, Facebook, Google), email marketing tools, and numerous other tracking mechanisms. This significantly improves user privacy and reduces tracking footprint.
2025-2041: Good use of caching for performance optimization.
The `@lru_cache` decorator will significantly improve performance for frequently accessed URLs by avoiding redundant processing. The cache size of 1000 seems reasonable for balancing memory usage vs. performance gains.
2051-2065: Excellent URL normalization logic.
The URL normalization logic thoroughly handles a wide range of URL formats, including:
- URLs starting with 'www.' or 'WWW.'
- URLs with missing protocol parts like '/www.'
- Protocol-relative URLs ('//example.com')
- URLs with malformed protocols ('http:/' instead of 'http://')
This robust handling will significantly improve the crawler's ability to properly process URLs found in the wild.
2066-2083: Great implementation of tracking parameter removal.
The function correctly uses `urlparse`, `parse_qs`, and `urlunparse` to extract, clean, and rebuild the URL. This approach properly preserves the URL structure while removing only the unwanted tracking parameters.
34-34: Good addition of required imports.
The added imports from `urllib.parse` are necessary for the URL parsing, manipulation, and reconstruction functionality.
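For reference, the new imports presumably look roughly like the following; the exact line is not shown in the excerpts above, but the review confirms these four names are used:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
```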
Summary
Improved utils.normalize_url functionality: