Skip to content

Conversation

@sundarshankar89
Copy link
Collaborator

What does this PR do?

  • Introduces a speculative encoding detector using the chardet library to better handle encoding detection when reading files. The change aims to improve the confidence and accuracy of text decoding, falling back to the system’s preferred encoding if confidence in the detected encoding is low.
  • Updates licensing information to reference chardet (LGPL v2), adding required attribution.
  • Adds chardet as an dependency in pyproject.toml.
  • Existing tests already covers this path extensively.

Question: Should we simplify the detection using chardet instead of current approach for non xml files?

a text-based IO wrapper that will decode the underlying binary-mode file as text.
"""
use_encoding: str | None
_chardet_confidence_threshold: float = 0.6
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be determined by client or controlled common library?

@github-actions
Copy link

github-actions bot commented Oct 3, 2025

✅ 40/40 passed, 2 skipped, 1m34s total

Running from acceptance #363

Copy link
Contributor

@asnare asnare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's talk about this offline. I understand the motivation, but do have some concerns and maybe there's a different way of achieving the same result while assuaging them.

from typing import BinaryIO, Literal, NoReturn, TextIO, TypeVar
from urllib.parse import quote_from_bytes as urlquote_from_bytes

import chardet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import means it's not an optional dependency, which is why the downstream projects are failing.

Comment on lines +14 to +18
This Software contains code from the following open source projects, licensed under the GNU Lesser GPL v2:

chardet - https://github.com/chardet/chardet
Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard
License - https://github.com/chardet/chardet/blob/main/LICENSE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gueniai: If we proceed with this, this will need review.


chardet - https://github.com/chardet/chardet
Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard
License - https://github.com/chardet/chardet/blob/main/LICENSE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sundarshankar89: EOL at EOF

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants