
Commit 0262811

chore(deps): update dependency chardet to v6 (#177)
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [chardet](https://redirect.github.com/chardet/chardet) | `==5.2.0` → `==6.0.0.post1` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/chardet/6.0.0.post1?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/chardet/5.2.0/6.0.0.post1?slim=true) |

---

### Release Notes

<details>
<summary>chardet/chardet (chardet)</summary>

### [`v6.0.0`](https://redirect.github.com/chardet/chardet/releases/tag/6.0.0)

[Compare Source](https://redirect.github.com/chardet/chardet/compare/5.2.0...6.0.0)

##### Features

- **Unified single-byte charset detection**: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding *and* the language for all supported single-byte encodings.
- **38 new languages**: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- **`EncodingEra` filtering**: A new `encoding_era` parameter on `detect()` and `detect_all()` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), letting callers restrict detection to encodings from a specific era. Both functions default to `MODERN_WEB`, which should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- **`--encoding-era` CLI flag**: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- **`max_bytes` and `chunk_size` parameters**: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200KB) and `chunk_size` (default 64KB) parameters for controlling how much data is examined. ([#314](https://redirect.github.com/chardet/chardet/issues/314), [@bysiber](https://redirect.github.com/bysiber))
- **Encoding era preference tie-breaking**: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- **Charset metadata registry**: The new `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- **`should_rename_legacy` now defaults intelligently**: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- **Direct GB18030 support**: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- **EBCDIC detection**: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- **Binary file detection**: Added basic binary file detection to abort analysis earlier on non-text files.
- **Python 3.12, 3.13, and 3.14 support** ([#283](https://redirect.github.com/chardet/chardet/issues/283), [@hugovk](https://redirect.github.com/hugovk); [#311](https://redirect.github.com/chardet/chardet/issues/311))
- **GitHub Codespace support** ([#312](https://redirect.github.com/chardet/chardet/issues/312), [@oxygen-dioxide](https://redirect.github.com/oxygen-dioxide))

##### Fixes

- **Fix CP949 state machine**: Corrected the state machine for Korean CP949 encoding detection. ([#268](https://redirect.github.com/chardet/chardet/issues/268), [@nenw](https://redirect.github.com/nenw))
- **Fix SJIS distribution analysis**: Fixed `SJISDistributionAnalysis` discarding valid second bytes in the >= 0x80 range. ([#315](https://redirect.github.com/chardet/chardet/issues/315), [@bysiber](https://redirect.github.com/bysiber))
- **Fix UTF-16/32 detection for non-ASCII-heavy text**: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- **Fix `get_charset` crash**: Resolved a crash when looking up unknown charset names.
- **Fix GB18030 `char_len_table`**: Corrected the character length table for GB18030 multi-byte sequences.
- **Fix UTF-8 state machine**: Updated to be more spec-compliant.
- **Fix `detect_all()` returning inactive probers**: Results from probers that determined "definitely not this encoding" are now excluded.
- **Fix early cutoff bug**: Resolved an issue where detection could terminate prematurely.
- **Default UTF-8 fallback**: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.

##### Breaking changes

- **Dropped Python 3.7, 3.8, and 3.9 support**: Now requires Python 3.10+. ([#283](https://redirect.github.com/chardet/chardet/issues/283), [@hugovk](https://redirect.github.com/hugovk))
- **Removed `Latin1Prober` and `MacRomanProber`**: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- **Removed EUC-TW support**: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- **`LanguageFilter.NONE` removed**: Use specific language filters or `LanguageFilter.ALL` instead.
- **Enum types changed**: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- **`detect()` default behavior change**: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.

##### Misc changes

- **Switched from Poetry/setuptools to uv + hatchling**: Build system modernized with `hatch-vcs` for version management.
- **License text updated**: Updated LGPLv2.1 license text and FSF notices to use URL instead of mailing address.
  ([#304](https://redirect.github.com/chardet/chardet/issues/304), [#307](https://redirect.github.com/chardet/chardet/issues/307), [@musicinmybrain](https://redirect.github.com/musicinmybrain))
- **CulturaX-based model training**: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher quality bigram frequency models.
- **`Language` class converted to frozen dataclass**: The language metadata class now uses `@dataclass(frozen=True)`, with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- **Test infrastructure**: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.

##### Contributors

Thank you to everyone who contributed to this release!

- [@dan-blanchard](https://redirect.github.com/dan-blanchard) (Dan Blanchard)
- [@bysiber](https://redirect.github.com/bysiber) (Kadir Can Ozden)
- [@musicinmybrain](https://redirect.github.com/musicinmybrain) (Ben Beasley)
- [@hugovk](https://redirect.github.com/hugovk) (Hugo van Kemenade)
- [@oxygen-dioxide](https://redirect.github.com/oxygen-dioxide)
- [@nenw](https://redirect.github.com/nenw)

And a special thanks to [@helour](https://redirect.github.com/helour), whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 5am on monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
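The `EncodingEra` tiers described in the release notes can be pictured as a flag enum whose members combine with `|` and intersect with `&`. The sketch below is a hypothetical reconstruction for illustration: the member names come from the notes, but the implementation, the `CHARSET_ERAS` table, and `allowed_charsets` are not chardet's actual internals (the real metadata lives in `chardet.metadata.charsets`).

```python
from enum import Flag, auto

# Hypothetical reconstruction of the EncodingEra flag enum; member names
# match the release notes, everything else is an illustrative sketch.
class EncodingEra(Flag):
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME

# Illustrative era classification for a few encodings (not the real registry).
CHARSET_ERAS = {
    "utf-8": EncodingEra.MODERN_WEB,
    "windows-1252": EncodingEra.MODERN_WEB,
    "iso-8859-1": EncodingEra.LEGACY_ISO,
    "macroman": EncodingEra.LEGACY_MAC,
    "cp437": EncodingEra.DOS,
    "cp037": EncodingEra.MAINFRAME,
}

def allowed_charsets(era: EncodingEra) -> list[str]:
    """Return charsets whose era intersects the requested filter."""
    return [name for name, e in CHARSET_ERAS.items() if e & era]
```

With a filter like `EncodingEra.MODERN_WEB`, legacy candidates such as CP437 or CP037 never enter the race, which is why the new default should sharply reduce misdetections on ordinary web-era data; passing `EncodingEra.ALL` restores the old consider-everything behavior.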
View the [repository job log](https://developer.mend.io/github/Nextdoor/gogo).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
1 parent b4a410d commit 0262811

File tree

1 file changed: +1 −1 lines changed


resources/requirements.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,7 +1,7 @@
 boto3==1.42.49
 botocore==1.42.49
 certifi==2026.1.4
-chardet==5.2.0
+chardet==6.0.0.post1
 click==8.3.1
 docutils==0.22.4
 Flask==3.1.2
```

0 commit comments
