
better normalize_url #1108


Open
wants to merge 9 commits into base: main

Conversation


@bigbrother666sh bigbrother666sh commented May 13, 2025

Summary

Improved utils.normalize_url functionality:

  1. Handles more edge cases;
  2. Adds removal of tracking parameters, enhancing privacy and anti-crawling capabilities (see the sketch below).
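
For illustration, a minimal sketch of the intended behavior, not the PR's actual implementation; the helper name, the parameter subset, and the output shown here are assumptions:

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Illustrative subset of tracking parameters; the PR uses a much larger list.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def strip_tracking_params(url: str) -> str:
    """Remove known tracking parameters while preserving all other query values."""
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    cleaned = {k: v for k, v in query.items() if k not in TRACKING_PARAMS}
    return urlunparse(parts._replace(query=urlencode(cleaned, doseq=True)))

# Example: tracking parameters are dropped, regular ones survive.
print(strip_tracking_params("https://example.com/page?id=42&utm_source=newsletter"))
# -> https://example.com/page?id=42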

Checklist:

  • [✓] My code follows the style guidelines of this project
  • [✓] I have performed a self-review of my own code
  • [✓] I have commented my code, particularly in hard-to-understand areas
  • [✓] I have made corresponding changes to the documentation
  • [✓] I have added/updated unit tests that prove my fix is effective or that my feature works
  • [✓] New and existing unit tests pass locally with my changes

Summary by CodeRabbit

  • New Features
    • Improved URL normalization to handle various URL formats and prefixes.
    • Automatic removal of common tracking parameters from URLs for cleaner links.
    • Enhanced asynchronous logging with more efficient and reliable file handling using a new logging library.


coderabbitai bot commented May 13, 2025

"""

Walkthrough

The normalize_url function in crawl4ai/utils.py was extensively rewritten. The update introduces logic to remove a comprehensive set of tracking parameters from URLs and improves normalization by handling various URL prefix formats. The function now reconstructs cleaned URLs and includes fallback logic for error scenarios. Additionally, crawl4ai/async_logger.py replaced manual asynchronous file writing with the loguru logging library, removing the _write_to_file method and configuring loguru for async file logging.
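
For context, this is roughly how loguru is configured for asynchronous file logging with rotation; the specific path, level, and rotation values below are assumptions, and the PR's exact settings may differ:

from loguru import logger

# enqueue=True routes writes through a background queue, making file logging
# non-blocking; rotation caps the file size before rolling over to a new file.
logger.add(
    "crawl4ai.log",      # file path is illustrative
    level="DEBUG",
    rotation="50 MB",
    enqueue=True,
    encoding="utf-8",
)

logger.info("normalized URL written asynchronously")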

Changes

  • crawl4ai/utils.py: Rewrote normalize_url to robustly handle URL normalization, explicitly remove many tracking parameters, and add fallback error handling. Function signature unchanged.
  • crawl4ai/async_logger.py: Removed the _write_to_file method; integrated the loguru logger for asynchronous file logging with rotation and diagnostics; added a self.loguru_logger attribute; console logging unchanged.

Poem

A hop and a skip through the URLs we go,
Stripping the trackers, making them slow.
From "utm_source" to "fbclid"—all swept away,
Leaving clean links to brighten the day.
With whiskers twitching, I tidy each string—
Now only pure paths do my rabbit hops bring! 🐇✨

And logs now dance with loguru's light,
Async and swift, they leap through the night! 📜🐰
"""


coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
crawl4ai/utils.py (2)

1996-2071: Consider caching frequent URLs for performance improvement.

The function handles many complex operations which could be expensive when called repeatedly with the same URLs (common in web crawling). Since URL normalization results are deterministic, this function would benefit from caching.

Apply the @lru_cache decorator as done for efficient_normalize_url_for_deep_crawl in line 2122:

+@lru_cache(maxsize=1000)
 def normalize_url(href, base_url):
     """Normalize URLs to ensure consistent format with a better way
     also remove tracking parameters"""
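
One thing worth keeping in mind with this suggestion: lru_cache requires every argument to be hashable (both href and base_url are strings, so that holds), and it exposes cache statistics for verification. A minimal, standalone sketch; cached_normalize is a hypothetical stand-in, not the repo's function:

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_normalize(href: str, base_url: str) -> str:
    # Stand-in body; the real function would do the full normalization here.
    return href.strip()

cached_normalize("https://example.com/?utm_source=x", "https://example.com")
cached_normalize("https://example.com/?utm_source=x", "https://example.com")
print(cached_normalize.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=1000, currsize=1)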

1999-2028: Move tracking parameters list outside the function.

The comprehensive list of tracking parameters is recreated every time the function is called, which is inefficient. Consider moving it to a module-level constant.

+# Common tracking parameters to be removed from URLs
+TRACKING_PARAMETERS = [
+    'utm_source', 'utm_medium', 'utm_campaign', 
+    'utm_term', 'utm_content', 'fbclid', 'gclid',
+    # ... rest of the parameters
+]

 def normalize_url(href, base_url):
     """Normalize URLs to ensure consistent format with a better way
     also remove tracking parameters"""

-    params_to_remove = [
-    'utm_source', 'utm_medium', 'utm_campaign', 
-    'utm_term', 'utm_content', 'fbclid', 'gclid',
-    # ... rest of the parameters
-    ]
+    params_to_remove = TRACKING_PARAMETERS
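
If membership checks dominate, the module-level constant could also be a frozenset, since `name in TRACKING_PARAMETERS` is then O(1); a sketch under that assumption, with the list abbreviated as in the diff above:

# Module-level constant; a frozenset makes `param in TRACKING_PARAMETERS` O(1).
TRACKING_PARAMETERS = frozenset({
    "utm_source", "utm_medium", "utm_campaign",
    "utm_term", "utm_content", "fbclid", "gclid",
    # ... rest of the parameters
})

def is_tracking_param(name: str) -> bool:
    return name in TRACKING_PARAMETERS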
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 897e017 and 3036e17.

📒 Files selected for processing (1)
  • crawl4ai/utils.py (2 hunks)
🔇 Additional comments (4)
crawl4ai/utils.py (4)

1996-1998: Clear and improved function docstring.

The docstring has been updated to explicitly mention both URL normalization and tracking parameter removal, accurately reflecting the function's enhanced capabilities.


1999-2028: Well-organized comprehensive list of tracking parameters.

The extensive list of tracking parameters covers a wide range of common analytics and marketing tracking identifiers. This is essential for privacy enhancement and will make URLs cleaner and more consistent.


2030-2044: Robust handling of various URL formats.

The implementation now properly handles multiple URL prefix patterns, including:

  • URLs starting with "www." or "WWW."
  • URLs starting with "/www."
  • Protocol-relative URLs starting with "//"
  • Absolute URLs with http/https
  • Malformed URLs with missing slashes
  • Relative URLs

This should significantly improve the robustness of URL processing in the application.
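
For reference, a minimal sketch of how such prefix handling might look; this is an approximation of the behavior described above, not the PR's exact code:

from urllib.parse import urljoin

def resolve_prefix(href: str, base_url: str) -> str:
    """Approximate the prefix handling described above (assumes https by default)."""
    if href.lower().startswith("www."):
        return "https://" + href                      # bare www. -> assume https
    if href.startswith("/www."):
        return "https://" + href.lstrip("/")          # stray leading slash before www.
    if href.startswith("//"):
        return "https:" + href                        # protocol-relative URL
    if href.startswith(("http:/", "https:/")) and "://" not in href:
        return href.replace(":/", "://", 1)           # repair the missing slash
    if href.startswith(("http://", "https://")):
        return href                                   # already absolute
    return urljoin(base_url, href)                    # relative URL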


2048-2064:

❓ Verification inconclusive

Verify consistent behavior with the previous implementation.

The function now uses parse_qs and urlencode which handle parameters differently than the previous implementation might have. Specifically, parse_qs returns a dictionary where each value is a list, and urlencode with doseq=True maintains this structure.
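
A quick standalone illustration of that behavior, independent of the repo's code:

from urllib.parse import parse_qs, urlencode

query = parse_qs("param1=value1&param1=value2&utm_source=test")
# parse_qs groups repeated keys into lists:
# {'param1': ['value1', 'value2'], 'utm_source': ['test']}

query.pop("utm_source", None)
print(urlencode(query, doseq=True))
# doseq=True expands the lists back out: param1=value1&param1=value2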

Run the following script to check how the function handles URLs with multiple values for the same parameter:


🏁 Script executed:

#!/bin/bash
# Test how normalize_url handles multiple values for the same parameter

# Create a Python script for testing
cat > test_normalize_url.py << 'EOF'
from crawl4ai.utils import normalize_url

# Test URLs with multiple values for the same parameter
test_urls = [
    "https://example.com/path?param1=value1&param1=value2&utm_source=test",
    "https://example.com/path?normal=value&utm_source=value1&utm_source=value2",
    "https://www.example.com/path?a=1&b=2&c=3",
    "//example.com/path?param=value&fbclid=123",
    "http:/example.com/path?q=test&gclid=abc123",
    "/path/to/resource?t=now&utm_campaign=spring",
    "example.com/no-protocol?ref=homepage"
]

base_url = "https://baseurl.com"

print("=== Testing normalize_url function ===")
for url in test_urls:
    normalized = normalize_url(url, base_url)
    print(f"\nOriginal:  {url}")
    print(f"Normalized: {normalized}")
EOF

# Run the test script
python test_normalize_url.py

Length of output: 603


🏁 Script executed:

#!/bin/bash
set -e

# Install required dependencies
pip install pydantic

# Install the local crawl4ai package in editable mode
pip install -e .

# Recreate and run the normalize_url test
cat > test_normalize_url.py << 'EOF'
from crawl4ai.utils import normalize_url

# Test URLs with multiple values for the same parameter
test_urls = [
    "https://example.com/path?param1=value1&param1=value2&utm_source=test",
    "https://example.com/path?normal=value&utm_source=value1&utm_source=value2",
    "https://www.example.com/path?a=1&b=2&c=3",
    "//example.com/path?param=value&fbclid=123",
    "http:/example.com/path?q=test&gclid=abc123",
    "/path/to/resource?t=now&utm_campaign=spring",
    "example.com/no-protocol?ref=homepage"
]

base_url = "https://baseurl.com"

print("=== Testing normalize_url function ===")
for url in test_urls:
    normalized = normalize_url(url, base_url)
    print(f\"\\nOriginal:  {url}\")
    print(f\"Normalized: {normalized}\")
EOF

python test_normalize_url.py

Length of output: 18444


Double-check normalize_url behavior with duplicate query parameters

The updated normalize_url in crawl4ai/utils.py now uses parse_qs (which groups values into lists) and urlencode(..., doseq=True) to rebuild the query string. Please confirm this matches the previous behavior for parameters appearing multiple times—especially that:

  • Duplicate keys (e.g. param1=value1&param1=value2) are preserved in order and count.
  • Removing entries in params_to_remove (e.g. utm_source, fbclid) doesn’t inadvertently drop or reorder other parameters.

Suggested actions:

  • Add unit tests for URLs with repeated parameters (a possible shape is sketched below), verifying both removal of unwanted keys and retention of all other values.
  • Manually compare output of the new implementation against the old one for a set of representative URLs.
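
A possible shape for such a test, written pytest-style; the expected strings assume https is the default scheme and that parameter order from parse_qs is preserved, which should be verified against the actual implementation:

import pytest
from crawl4ai.utils import normalize_url

@pytest.mark.parametrize("href,expected", [
    # duplicate keys survive, tracking keys are dropped
    ("https://example.com/p?param1=a&param1=b&utm_source=t",
     "https://example.com/p?param1=a&param1=b"),
    # only the tracking key is removed, other values untouched
    ("https://example.com/p?normal=v&fbclid=123",
     "https://example.com/p?normal=v"),
])
def test_normalize_url_repeated_params(href, expected):
    assert normalize_url(href, "https://baseurl.com") == expected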

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
crawl4ai/utils.py (1)

2047-2047: This line correctly uses _url for URL parsing.

Following the previous review feedback, the function now correctly uses the normalized URL variable (_url) for parsing instead of the original URL.

🧹 Nitpick comments (2)
crawl4ai/utils.py (2)

2065-2071: Error handling could be improved.

The fallback mechanism is a good practice, but could be enhanced to:

  1. Catch specific exceptions rather than a generic Exception
  2. Include the problematic URL in the error message
  3. Implement a more robust URL reconstruction logic
-    except Exception as e:
-        print(f"Error cleaning URL: {e}")
-        _ss = _url.split('//')
-        if len(_ss) == 2:
-            return '//'.join(_ss)
-        else:
-            return _ss[0] + '//' + '/'.join(_ss[1:])
+    except (ValueError, AttributeError, TypeError) as e:
+        print(f"Error cleaning URL '{_url}': {e}")
+        try:
+            # More robust fallback that preserves the original URL structure
+            _ss = _url.split('//')
+            if len(_ss) == 2:
+                return '//'.join(_ss)
+            else:
+                return _ss[0] + '//' + '/'.join(_ss[1:])
+        except:
+            # Ultimate fallback - return the input URL
+            return href

1996-1997: Consider adding more descriptive documentation.

The function comments should be expanded to better explain:

  1. What tracking parameters are being removed
  2. How different URL formats are normalized
  3. The fallback behavior
 def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format with a better way
-    also remove tracking parameters"""
+    """Normalize URLs to ensure consistent format and remove tracking parameters.
+    
+    This function:
+    1. Normalizes URLs with various formats (www., //, http:/, etc.)
+    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
+    3. Handles URL parsing errors with a fallback mechanism
+    
+    Args:
+        href (str): The URL to normalize
+        base_url (str): The base URL to use for relative URLs
+        
+    Returns:
+        str: The normalized and cleaned URL
+    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3036e17 and ea1cc39.

📒 Files selected for processing (1)
  • crawl4ai/utils.py (2 hunks)
🔇 Additional comments (2)
crawl4ai/utils.py (2)

1995-2046: Excellent enhancement to URL normalization handling various edge cases.

The function now properly handles a wide variety of URL formats (www., //, http:/, etc.) and includes a comprehensive list of tracking parameters to remove. This implementation significantly improves privacy protection and strengthens anti-crawling measures as intended.


2047-2063: Well-implemented tracking parameter removal logic.

The implementation correctly parses the URL, removes tracking parameters, and reconstructs the cleaned URL. This approach is robust and handles URL parsing and cleaning efficiently.

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea1cc39 and cbf8b46.

📒 Files selected for processing (1)
  • crawl4ai/utils.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py

2086-2086: Do not use bare except

(E722)

🔇 Additional comments (6)
crawl4ai/utils.py (6)

1994-2023: Comprehensive tracking parameter list enhances privacy protection.

This extensive list of tracking parameters will significantly improve privacy by removing identifiers from various marketing and analytics platforms including:

  • UTM parameters used for campaign tracking
  • Facebook, Google, and other social media tracking parameters
  • Marketing automation platform identifiers (Mailchimp, etc.)
  • Various analytics and tracking IDs

2025-2026: Good optimization with LRU cache.

The LRU cache implementation with a size of 1000 entries is an excellent performance optimization for repeatedly processed URLs, which is common in web crawling scenarios.


2027-2040: Well-documented function with clear purpose.

The docstring clearly explains the function's three primary purposes:

  1. Normalizing URLs with various formats
  2. Removing tracking parameters
  3. Handling URL parsing errors with fallbacks

This documentation helps future maintainers understand the intent and functionality.


2042-2057: Robust handling of URL prefix variations.

The logic correctly handles various URL prefix formats that might be encountered in real-world websites:

  • www. and WWW. prefixes
  • Protocol-relative URLs (starting with //)
  • Malformed http:/ and https:/ URLs (missing slash)
  • Relative URLs joined with base_url

2058-2076: Effective tracking parameter removal logic.

The implementation correctly:

  1. Parses the normalized URL
  2. Extracts query parameters
  3. Removes tracking parameters
  4. Reconstructs the query string
  5. Rebuilds the full URL

This addresses the PR objective of privacy enhancement.


2059-2059: Fixed previous URL variable discrepancy.

This resolves the discrepancy identified in a previous review where the function was parsing url instead of _url. The normalized URL is now correctly used for parsing.

solve None input

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
crawl4ai/utils.py (2)

2085-2096: Consider handling additional URL edge cases.

While the error handling is good, there are additional edge cases that could be improved:

  1. The function doesn't handle data URLs, mailto links, or JavaScript URLs specially
  2. URLs with authentication (username:password@domain) might be processed differently than expected

Consider adding special handling for these cases to further improve robustness.

@lru_cache(maxsize=1000)
def normalize_url(href, base_url):
    # Guard against None or empty inputs
    if href is None:
        return None
    href = href.strip()
    if not href:
        return None

+    # Handle special URL schemes that shouldn't be modified
+    special_schemes = ['mailto:', 'tel:', 'javascript:', 'data:']
+    if any(href.lower().startswith(scheme) for scheme in special_schemes):
+        return href

    url = href
    # Rest of the function...

2025-2025: Consider a larger LRU cache size for heavily accessed sites.

The current @lru_cache(maxsize=1000) is a good starting point, but depending on your application's traffic patterns, you might need a larger cache size. If your application crawls many sites with unique URLs, consider increasing this value to reduce repeated normalizations.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cbf8b46 and 7849f98.

📒 Files selected for processing (1)
  • crawl4ai/utils.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py

2065-2065: SyntaxError: Got unexpected token …

🔇 Additional comments (3)
crawl4ai/utils.py (3)

1994-2023: The tracking parameter list is comprehensive.

This extensive list of tracking parameters covers a wide range of advertising, analytics, and marketing platforms including Google UTM parameters, Facebook tracking, email marketing trackers, and many others. This will significantly improve privacy for users by removing these identifiers during URL normalization.


2067-2084: The URL cleaning implementation effectively removes tracking parameters.

The implementation correctly:

  1. Parses the URL using urlparse
  2. Extracts query parameters using parse_qs
  3. Removes all tracking parameters from the query dict
  4. Reconstructs the query string with urlencode
  5. Builds the clean URL with urlunparse

This approach is robust and will clean URLs as expected.


2025-2040: ⚠️ Potential issue

Fix duplicate function definition causing decorator to be ineffective.

There's a critical error in the code structure. The @lru_cache decorator is applied to an empty function definition (lines 2025-2026), while the actual implementation starts on line 2042. This means the caching functionality won't work.

Apply this correction to fix the function structure:

@lru_cache(maxsize=1000)
-def normalize_url(href, base_url):
-    """Normalize URLs to ensure consistent format and remove tracking parameters.
-    
-    This function:
-    1. Normalizes URLs with various formats (www., //, http:/, etc.)
-    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
-    3. Handles URL parsing errors with a fallback mechanism
-    
-    Args:
-        href (str): The URL to normalize
-        base_url (str): The base URL to use for relative URLs
-        
-    Returns:
-        str: The normalized and cleaned URL
-    """
-
-def normalize_url(href, base_url):
+def normalize_url(href, base_url):
+    """Normalize URLs to ensure consistent format and remove tracking parameters.
+    
+    This function:
+    1. Normalizes URLs with various formats (www., //, http:/, etc.)
+    2. Removes common tracking parameters (UTM, Facebook, Google, etc.)
+    3. Handles URL parsing errors with a fallback mechanism
+    
+    Args:
+        href (str): The URL to normalize
+        base_url (str): The base URL to use for relative URLs
+        
+    Returns:
+        str: The normalized and cleaned URL
+    """
     # Guard against None or empty inputs
     if href is None:
         return None

Likely an incorrect or invalid review comment.

remove mistake ...

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
crawl4ai/utils.py (1)

2084-2095: 🛠️ Refactor suggestion

Improve error handling by avoiding bare except.

Your error handling includes a good first try/except with specific exception types, but contains a bare except clause in the fallback which can mask unexpected errors.

Apply this change to capture and log specific exception types in the fallback:

    except (ValueError, AttributeError, TypeError) as e:
        print(f"Error cleaning URL '{_url}': {e}")
        try:
            # More robust fallback that preserves the original URL structure
            _ss = _url.split('//')
            if len(_ss) == 2:
                return '//'.join(_ss)
            else:
                return _ss[0] + '//' + '/'.join(_ss[1:])
-        except:
+        except Exception as fallback_error:
+            print(f"Fallback error for URL '{_url}': {fallback_error}")
            # Ultimate fallback - return the uncleaned url
            return _url
🧰 Tools
🪛 Ruff (0.8.2)

2093-2093: Do not use bare except

(E722)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7849f98 and c792299.

📒 Files selected for processing (1)
  • crawl4ai/utils.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
crawl4ai/utils.py

2042-2042: Redefinition of unused normalize_url from line 2026

(F811)


2093-2093: Do not use bare except

(E722)

🔇 Additional comments (5)
crawl4ai/utils.py (5)

1994-2023: Comprehensive tracking parameter list enhances privacy.

The extensive list of tracking parameters to remove is well-organized and thorough, covering major analytics platforms (UTM, Facebook, Google), email marketing tools, and numerous other tracking mechanisms. This significantly improves user privacy and reduces tracking footprint.


2025-2041: Good use of caching for performance optimization.

The @lru_cache decorator will significantly improve performance for frequently accessed URLs by avoiding redundant processing. The cache size of 1000 seems reasonable for balancing memory usage vs. performance gains.


2051-2065: Excellent URL normalization logic.

The URL normalization logic thoroughly handles a wide range of URL formats including:

  • URLs starting with 'www.' or 'WWW.'
  • URLs with missing protocol parts like '/www.'
  • Protocol-relative URLs ('//example.com')
  • URLs with malformed protocols ('http:/' instead of 'http://')

This robust handling will significantly improve the crawler's ability to properly process URLs found in the wild.


2066-2083: Great implementation of tracking parameter removal.

The function correctly uses urlparse, parse_qs, and urlunparse to extract, clean, and rebuild the URL. This approach properly preserves the URL structure while removing only the unwanted tracking parameters.


34-34: Good addition of required imports.

The added imports from urllib.parse are necessary for the URL parsing, manipulation, and reconstruction functionality.
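
For context, the imports in question are presumably along these lines; the exact list in the PR may differ:

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse, urljoin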

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
crawl4ai/async_logger.py (1)

347-351: Consider refactoring AsyncFileLogger to use loguru for consistency

The AsyncFileLogger class still uses manual file writing while AsyncLogger now uses loguru. This creates inconsistency in the codebase and means AsyncFileLogger misses out on the benefits of loguru (async writing, rotation, etc.).

Consider refactoring AsyncFileLogger to also use loguru for consistency:

class AsyncFileLogger(AsyncLoggerBase):
    def __init__(self, log_file: str):
        self.log_file = log_file
        os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
        
        # Use loguru for consistent async file logging
        self.loguru_logger = loguru_logger.bind(logger_id=id(self))
        try:
            self.loguru_logger.remove(0)
        except ValueError:
            pass
        
        self.loguru_logger.add(
            log_file,
            level="TRACE",
            rotation="50 MB", 
            enqueue=True,
            encoding="utf-8"
        )
    
    def _write_to_file(self, level: str, message: str, tag: str):
        """Write a message to the log file using loguru."""
        formatted_message = f"[{level}] [{tag}] {message}"
        getattr(self.loguru_logger, level.lower())(formatted_message)
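
A brief usage sketch under the same assumptions as the suggestion above; the log path and tag are illustrative:

# Hypothetical usage of the refactored AsyncFileLogger sketched above.
file_logger = AsyncFileLogger("logs/crawl4ai.log")
file_logger._write_to_file("INFO", "fetched https://example.com", "FETCH")
# With enqueue=True, the write is queued and flushed by loguru's background worker.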
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e611c0 and 9852937.

📒 Files selected for processing (1)
  • crawl4ai/async_logger.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
crawl4ai/async_logger.py (1)
tests/loggers/test_logger.py (2)
  • info (31-33)
  • _write_to_file (21-25)
🔇 Additional comments (2)
crawl4ai/async_logger.py (2)

9-9: LGTM: Clean import addition

The loguru import is properly added and will provide better async logging capabilities.


382-382: LGTM: Minor formatting fix

Clean whitespace fix.
