created a speculative encoding detector #312
base: main
| a text-based IO wrapper that will decode the underlying binary-mode file as text. | ||
| """ | ||
| use_encoding: str | None | ||
| _chardet_confidence_threshold: float = 0.6 |
Should this be determined by the client, or controlled by the common library?
✅ 40/40 passed, 2 skipped, 1m34s total. Running from acceptance #363
Let's talk about this offline. I understand the motivation, but do have some concerns and maybe there's a different way of achieving the same result while assuaging them.
| from typing import BinaryIO, Literal, NoReturn, TextIO, TypeVar | ||
| from urllib.parse import quote_from_bytes as urlquote_from_bytes | ||
|
|
||
| import chardet |
This import means it's not an optional dependency, which is why the downstream projects are failing.
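One common way to keep a dependency optional is to defer the import until it is actually needed and tolerate its absence. The sketch below is illustrative only and is not the change proposed in this PR; `detect_encoding_if_available` is a hypothetical helper name.

```python
import importlib
from typing import Optional


def detect_encoding_if_available(sample: bytes) -> Optional[str]:
    """Return chardet's guess for `sample`, or None if chardet is not installed.

    Hypothetical helper: the import happens inside the function, so merely
    importing this module does not require chardet to be present.
    """
    try:
        chardet = importlib.import_module("chardet")
    except ImportError:
        # chardet is treated as optional here; callers decide how to fall back.
        return None
    # chardet.detect returns a dict that includes an "encoding" key (may be None).
    return chardet.detect(sample).get("encoding")
```

With this shape, downstream projects that do not install chardet can still import the module; they simply get `None` and use their existing decoding path.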
| This Software contains code from the following open source projects, licensed under the GNU Lesser GPL v2: | ||
|
|
||
| chardet - https://github.com/chardet/chardet | ||
| Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard | ||
| License - https://github.com/chardet/chardet/blob/main/LICENSE |
@gueniai: If we proceed with this, this will need review.
|
|
||
| chardet - https://github.com/chardet/chardet | ||
| Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard | ||
| License - https://github.com/chardet/chardet/blob/main/LICENSE |
@sundarshankar89: EOL at EOF
What does this PR do?
Adds the chardet library to better handle encoding detection when reading files. The change aims to improve the confidence and accuracy of text decoding, falling back to the system’s preferred encoding if confidence in the detected encoding is low.

Question: Should we simplify the detection by using chardet instead of the current approach for non-XML files?
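The fallback described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `pick_encoding` and its `detect` parameter are hypothetical, the 0.6 default mirrors `_chardet_confidence_threshold` from the diff, and treating the threshold as inclusive is a guess.

```python
import locale
from typing import Any, Callable, Mapping


def pick_encoding(
    sample: bytes,
    detect: Callable[[bytes], Mapping[str, Any]],
    threshold: float = 0.6,
) -> str:
    """Choose an encoding for `sample`.

    `detect` is any callable returning a mapping with "encoding" and
    "confidence" keys (the shape that chardet.detect returns). It is passed
    in so that this sketch does not import chardet itself.
    """
    result = detect(sample)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Assumption: a detection is trusted when confidence meets the threshold.
    if encoding and confidence >= threshold:
        return encoding
    # Low confidence or no guess: fall back to the system's preferred encoding.
    return locale.getpreferredencoding(False)
```

With chardet installed, `pick_encoding(data, chardet.detect)` would apply the same rule to real input. Taking the detector as a parameter also leaves the threshold decision with the caller, which bears on the review question of whether the client or the common library should control it.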