Merge pull request #1082 from google/py-improve-python

reyammer · web-flow · commit 76281eb8ee63 · 2025-05-02T09:59:09.000-04:00
Many small fixes and improvements (python, docs, link to paper)
diff --git a/README.md b/README.md
@@ -268,8 +268,8 @@ Please consult the [python documentation](./python/README.md) for details on the
 | Artifact                                                                   | Status         | Latest version | Default model                                              |
 | -------------------------------------------------------------------------- | -------------- | -------------- | ---------------------------------------------------------- |
 | [Python `Magika` module](./python/README.md)                               | Stable         | `0.6.1`        | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
-| [Rust `magika` CLI](https://crates.io/crates/magika-cli)                   | Stable         | `0.1.1`        | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
-| [Rust `magika` library](https://docs.rs/magika)                            | Stable         | `0.1.1`        | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
+| [Rust `magika` CLI](https://crates.io/crates/magika-cli)                   | Stable         | `0.1.2`        | [`standard_v3_3`](./assets/models/standard_v3_2/README.md) |
+| [Rust `magika` library](https://docs.rs/magika)                            | Stable         | `0.2.0`        | [`standard_v3_3`](./assets/models/standard_v3_2/README.md) |
 | TypeScript / NPM package ([README](./js/README.md) & [docs](./docs/js.md)) | Stable         | `0.3.2`        | [`standard_v3_3`](./assets/models/standard_v3_3/README.md) |
 | [Demo Website](https://google.github.io/magika/)                           | Stable         | -              | [`standard_v3_3`](./assets/models/standard_v3_3/README.md) |
 | [GoLang](./go/README.md)                                                   | In development | -              | -                                                          |
@@ -298,7 +298,7 @@ See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details.
 
 # Research Paper and Citation
 
-We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. A pre-print of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).
+We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. You can find a copy of the paper [here](./assets/2025_icse_magika.pdf). (A previous version of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).)
 
 If you use this software for your research, please cite it as:
 
diff --git a/assets/2025_icse_magika.pdf b/assets/2025_icse_magika.pdf
diff --git a/js/src/overwrite-reason.ts b/js/src/overwrite-reason.ts
@@ -14,6 +14,6 @@
 
 export enum OverwriteReason {
   NONE = "none",
-  LOW_CONFIDENCE = "low-confidence",
-  OVERWRITE_MAP = "overwrite-map",
+  LOW_CONFIDENCE = "low_confidence",
+  OVERWRITE_MAP = "overwrite_map",
 }
diff --git a/js/src/prediction-mode.ts b/js/src/prediction-mode.ts
@@ -13,7 +13,7 @@
 // limitations under the License.
 
 export enum PredictionMode {
-  BEST_GUESS = "best-guess",
-  MEDIUM_CONFIDENCE = "medium-confidence",
-  HIGH_CONFIDENCE = "high-confidence",
+  BEST_GUESS = "best_guess",
+  MEDIUM_CONFIDENCE = "medium_confidence",
+  HIGH_CONFIDENCE = "high_confidence",
 }
diff --git a/python/CHANGELOG.md b/python/CHANGELOG.md
@@ -14,10 +14,12 @@ semver guidelines for more details about this.
 - Mark python 3.13 as supported.
 - New model `standard_v3_3` model, with better support for TypeScript and non-ascii characters in textual files. See [models' CHANGELOG](../assets/models/CHANGELOG.md) for more information.
 - `identify_stream()` now restores the stream's original position after reading from it, preventing side effects on subsequent stream operations. ([#1020](https://github.com/google/magika/pull/1020))
-- Bugfix: limit the number of bytes we read in case of an input with just many whitespaces. ([#1015](https://github.com/google/magika/pull/1015))
-- Bugfix: do not alter warnings' simplefilter as this has visible side effects for other modules. ([#1017](https://github.com/google/magika/pull/1017))
 - Add `asdict()` utility method to `MagikaResult`.
 - Set `prediction.overwrite_reason` to `Overwrite.NONE` if `output.label` is the same as `dl.label`. ([#1023](https://github.com/google/magika/pull/1023))
+- Bugfix: limit the number of bytes we read in case of an input with just many whitespaces. ([#1015](https://github.com/google/magika/pull/1015))
+- Bugfix: do not alter warnings' simplefilter as this has visible side effects for other modules. ([#1017](https://github.com/google/magika/pull/1017))
+- Bugfix: magika's python client now properly warns for low-confidence predictions.
+- Bugfix: update Magika's StrEnum string representation to be compatible with standard library.
 
 ## [0.6.1] - 2025-03-19
 
diff --git a/python/README.md b/python/README.md
@@ -281,7 +281,7 @@ class ContentTypeLabel(StrEnum):
 
 ## Research Paper and Citation
 
-We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. A pre-print of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).
+We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. You can find a copy of the paper [here](https://github.com/google/magika/blob/main/assets/2025_icse_magika.pdf). (A previous version of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).)
 
 If you use this software for your research, please cite it as:
 
diff --git a/python/src/magika/cli/magika_client.py b/python/src/magika/cli/magika_client.py
@@ -290,7 +290,7 @@ def main(
                             and result.prediction.dl.label
                             != result.prediction.output.label
                             and result.prediction.overwrite_reason
-                            == OverwriteReason.NONE
+                            == OverwriteReason.LOW_CONFIDENCE
                         ):
                             # It seems that we had a low-confidence prediction
                             # from the model. Let's warn the user about our best
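Reviewer note: the hunk above changes which `overwrite_reason` triggers the low-confidence warning. The sketch below models only the part of the condition visible in this hunk, since the earlier clauses of the `if` are not shown. `Prediction` and `should_warn_low_confidence` are hypothetical stand-ins rather than names from the Magika code base, and the enum values are copied from the JS enum changed in this same commit.

```python
from dataclasses import dataclass
from enum import Enum


class OverwriteReason(str, Enum):
    # Values mirror js/src/overwrite-reason.ts after this commit.
    NONE = "none"
    LOW_CONFIDENCE = "low_confidence"
    OVERWRITE_MAP = "overwrite_map"


@dataclass
class Prediction:
    # Hypothetical, flattened stand-in for result.prediction.
    dl_label: str
    output_label: str
    overwrite_reason: OverwriteReason


def should_warn_low_confidence(p: Prediction) -> bool:
    # Warn only when the final label differs from the model's label
    # and the label was overwritten because of low confidence.
    return (
        p.dl_label != p.output_label
        and p.overwrite_reason == OverwriteReason.LOW_CONFIDENCE
    )


low = Prediction("python", "txt", OverwriteReason.LOW_CONFIDENCE)
mapped = Prediction("python", "txt", OverwriteReason.OVERWRITE_MAP)
same = Prediction("python", "python", OverwriteReason.NONE)
```

Under these assumptions only `low` triggers the warning. The previous check compared against `OverwriteReason.NONE`, which is the value the changelog entry for #1023 assigns when the two labels are equal.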
diff --git a/python/src/magika/magika.py b/python/src/magika/magika.py
@@ -382,96 +382,73 @@ def _extract_features_from_seekable(
         block_size: int,
         use_inputs_at_offsets: bool,
     ) -> ModelFeatures:
-        """This implement v2 of the features extraction v2 from a seekable,
-        which is an abstraction about anything that can be "read_at" a specific
-        offset, such as a file or buffer. This is implemented so that we do not
-        need to load the entire file in memory, or scan the entire buffer.
+        """Extract features from an input seekable.
+
+        This implements features extraction v2 from a seekable, which is an
+        abstraction about anything that has a size and that can be "read_at" a
+        specific offset, such as a file or a buffer. This is implemented so that
+        we do not need to load the entire file in memory or scan the entire
+        buffer.
 
         High-level overview on what we do:
-        - We extract blocks of bytes from the beginning ("beg"), the middle
-        ("mid"), and at the end ("end").
-        - We then truncate or add padding to these blocks, depending on whether
-        we have too many or too few.
-
-        Blocks extraction and padding:
-        - beg: we read the first block_size bytes, we lstrip() it, and we use
-        this as the basis to extract beg_size integers. If we have too many
-        bytes, we only consider the first beg_size ones. If we do not have
-        enough, we add padding as suffix (up to beg_size integers).
-        - mid: we determine "where the middle is" by using the entire content's
-        size (before stripping the whitespace-like characters), and we take the
-        mid_size bytes in the middle. If we do not have enough bytes, we add
-        padding to the left and to the right. In case we need to add an odd
-        number of padding integers, we add an extra one to the right.
-        - end: same as "beg", but we read the last block_size bytes, we rstrip()
-        (instead of lstrip()), and, if needed, we add padding as a prefix (and
-        not as a suffix like we do with "beg").
+        - We read (at most) `block_size` bytes from the beginning and from the
+        end.
+        - We normalize these bytes by stripping whitespaces.
+        - We consider `beg_size` and `end_size` bytes as `beg` and `end`
+        features. If we don't have enough bytes, we use `padding_token` as
+        padding.
+
+        See comments below for the specifics and handling of corner cases.
+
+        NOTE: This implementation does not support extraction of `mid` features
+        and `use_inputs_at_offsets`.
         """
 
         assert beg_size < block_size
-        assert mid_size < block_size
+        assert mid_size == 0
         assert end_size < block_size
+        assert not use_inputs_at_offsets
 
         # we read at most block_size bytes
         bytes_num_to_read = min(block_size, seekable.size)
 
         if beg_size > 0:
-            beg_content = seekable.read_at(0, bytes_num_to_read).lstrip()
+            # Read at most `block_size` bytes from the beginning; `lstrip()`
+            # them (or `strip()` them if the file size is less than or equal
+            # to `block_size`); take at most `beg_size` bytes, and optionally
+            # pad them with `padding_token` up to `beg_size` integers.
+            beg_content = seekable.read_at(0, bytes_num_to_read)
+            beg_content = beg_content.lstrip()
             beg_ints = Magika._get_beg_ints_with_padding(
                 beg_content, beg_size, padding_token
             )
         else:
             beg_ints = []
 
-        if mid_size > 0:
-            # mid_idx points to the left-most offset to read for the "mid" component
-            # of the features.
-            mid_bytes_num_to_read = min(seekable.size, mid_size)
-            mid_idx = (seekable.size - mid_bytes_num_to_read) // 2
-            mid_content = seekable.read_at(mid_idx, mid_bytes_num_to_read)
-            mid_ints = Magika._get_mid_ints_with_padding(
-                mid_content, mid_size, padding_token
-            )
-        else:
-            mid_ints = []
-
         if end_size > 0:
+            # Read at most `block_size` bytes from the end; `rstrip()` them (or
+            # `strip()` them if the file size is less than or equal to
+            # `block_size`); take at most `end_size` bytes (from the end), and
+            # optionally pad them (at the beginning) with `padding_token` to get
+            # a list of `end_size` integers.
             end_content = seekable.read_at(
                 seekable.size - bytes_num_to_read, bytes_num_to_read
-            ).rstrip()
+            )
+            end_content = end_content.rstrip()
             end_ints = Magika._get_end_ints_with_padding(
                 end_content, end_size, padding_token
             )
         else:
             end_ints = []
 
-        if use_inputs_at_offsets:
-            offset_0x8000_0x8007 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x8000, 8, padding_token
-            )
-            offset_0x8800_0x8807 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x8800, 8, padding_token
-            )
-            offset_0x9000_0x9007 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x9000, 8, padding_token
-            )
-            offset_0x9800_0x9807 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x9800, 8, padding_token
-            )
-        else:
-            offset_0x8000_0x8007 = []
-            offset_0x8800_0x8807 = []
-            offset_0x9000_0x9007 = []
-            offset_0x9800_0x9807 = []
-
         return ModelFeatures(
             beg=beg_ints,
-            mid=mid_ints,
+            mid=[],
             end=end_ints,
-            offset_0x8000_0x8007=offset_0x8000_0x8007,
-            offset_0x8800_0x8807=offset_0x8800_0x8807,
-            offset_0x9000_0x9007=offset_0x9000_0x9007,
-            offset_0x9800_0x9807=offset_0x9800_0x9807,
+            offset_0x8000_0x8007=[],
+            offset_0x8800_0x8807=[],
+            offset_0x9000_0x9007=[],
+            offset_0x9800_0x9807=[],
         )
 
     @staticmethod
@@ -498,38 +475,6 @@ def _get_beg_ints_with_padding(
 
         return beg_ints
 
-    @staticmethod
-    def _get_mid_ints_with_padding(
-        mid_content: bytes, mid_size: int, padding_token: int
-    ) -> List[int]:
-        """Take a buffer as input and extract mid ints. This returns a list of
-        integers whose length is exactly mid_size. If the buffer is bigger than
-        required, take only its middle part. If the buffer is shorter, add
-        padding to its left and right. If we need to add an odd number of
-        padding integers, add an extra one to the right.
-        """
-
-        if mid_size < len(mid_content):
-            mid_idx = (len(mid_content) - mid_size) // 2
-            mid_content = mid_content[mid_idx : mid_idx + mid_size]
-
-        mid_ints = list(map(int, mid_content))
-
-        if len(mid_ints) < mid_size:
-            # we don't have enough ints, add padding
-            padding_size = mid_size - len(mid_ints)
-            padding_size_left = padding_size // 2
-            padding_size_right = padding_size - padding_size_left
-            mid_ints = (
-                ([padding_token] * padding_size_left)
-                + mid_ints
-                + ([padding_token] * padding_size_right)
-            )
-
-        assert len(mid_ints) == mid_size
-
-        return mid_ints
-
     @staticmethod
     def _get_end_ints_with_padding(
         end_content: bytes, end_size: int, padding_token: int
@@ -554,14 +499,6 @@ def _get_end_ints_with_padding(
 
         return end_ints
 
-    @staticmethod
-    def _get_ints_at_offset_or_padding(
-        seekable: Seekable, offset: int, size: int, padding_token: int
-    ) -> List[int]:
-        if offset + size <= seekable.size:
-            return list(map(int, seekable.read_at(offset, size)))
-        return [padding_token] * size
-
     def _get_model_outputs_from_features(
         self, all_features: List[Tuple[Path, ModelFeatures]]
     ) -> List[Tuple[Path, ModelOutput]]:
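Reviewer note: a minimal sketch of the simplified `beg`/`end` extraction described in the new docstring of `_extract_features_from_seekable`, operating on an in-memory `bytes` object instead of a `Seekable`. It follows the code shown in the hunk (`lstrip()` for `beg`, `rstrip()` for `end`). The helper bodies are assumptions, because the diff shows only the signatures of `_get_beg_ints_with_padding` and `_get_end_ints_with_padding`, and the padding value `256` used in the usage lines is likewise an assumption.

```python
from typing import List, Tuple


def get_beg_ints(content: bytes, beg_size: int, padding_token: int) -> List[int]:
    # Keep the first beg_size bytes; pad on the right if too short.
    ints = list(content[:beg_size])
    return ints + [padding_token] * (beg_size - len(ints))


def get_end_ints(content: bytes, end_size: int, padding_token: int) -> List[int]:
    # Keep the last end_size bytes; pad on the left if too short.
    ints = list(content[-end_size:]) if end_size > 0 else []
    return [padding_token] * (end_size - len(ints)) + ints


def extract_beg_end(
    data: bytes, beg_size: int, end_size: int, block_size: int, padding_token: int
) -> Tuple[List[int], List[int]]:
    assert beg_size < block_size
    assert end_size < block_size

    # Read at most block_size bytes from each side.
    n = min(block_size, len(data))

    beg: List[int] = []
    if beg_size > 0:
        beg = get_beg_ints(data[:n].lstrip(), beg_size, padding_token)

    end: List[int] = []
    if end_size > 0:
        end = get_end_ints(data[len(data) - n:].rstrip(), end_size, padding_token)

    return beg, end


beg, end = extract_beg_end(b"  hello  ", 4, 4, 16, 256)
short_beg, short_end = extract_beg_end(b"ab", 4, 4, 16, 256)
```

With these inputs, `beg` holds the byte values of `hell` and `end` those of `ello`. For the two-byte input, `short_beg` is padded on the right and `short_end` on the left.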
diff --git a/python/src/magika/types/strenum.py b/python/src/magika/types/strenum.py
@@ -16,19 +16,8 @@
 class StrEnum(str, enum.Enum):
     """
     StrEnum is a Python ``enum.Enum`` that inherits from ``str``. The default
-    ``auto()`` behavior uses the member name as its value.
-
-    Example usage::
-
-        class Example(StrEnum):
-            UPPER_CASE = auto()
-            lower_case = auto()
-            MixedCase = auto()
-
-
-        assert Example.UPPER_CASE == "UPPER_CASE"
-        assert Example.lower_case == "lower_case"
-        assert Example.MixedCase == "MixedCase"
+    ``auto()`` behavior uses the lower-case version of the name. This is meant
+    to reflect the behavior of `enum.StrEnum`, available from Python 3.11.
     """
 
     def __new__(cls, value: Union[str, StrEnum], *args, **kwargs):  # type: ignore[no-untyped-def]
@@ -47,4 +36,4 @@ def _generate_next_value_(name, *_):  # type: ignore[no-untyped-def,override]
 
 class LowerCaseStrEnum(StrEnum):
     def _generate_next_value_(name, *_):  # type: ignore[no-untyped-def,override]
-        return name.lower().replace("_", "-")
+        return name.lower()
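Reviewer note: the change to `LowerCaseStrEnum._generate_next_value_` is what moves the serialized values from `best-guess` to `best_guess`. The sketch below is a stripped-down stand-in for the class in `strenum.py`: it omits the custom `__new__` and inherits directly from `str` and `enum.Enum`, so it is an illustration of the `auto()` behavior rather than a copy of Magika's implementation.

```python
import enum


class LowerCaseStrEnum(str, enum.Enum):
    # Simplified stand-in: auto() yields the lower-cased member name,
    # keeping underscores, like enum.StrEnum does from Python 3.11.
    def _generate_next_value_(name, *_):
        return name.lower()


class PredictionMode(LowerCaseStrEnum):
    BEST_GUESS = enum.auto()
    MEDIUM_CONFIDENCE = enum.auto()
    HIGH_CONFIDENCE = enum.auto()


# What the previous implementation would have produced for the same name.
old_value = "BEST_GUESS".lower().replace("_", "-")
new_value = PredictionMode.BEST_GUESS.value
```

Any caller that persisted or compared the old hyphenated strings would need to switch to the underscore form, which is why the TypeScript enums in `js/src` are updated in the same commit.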
diff --git a/tests_data/reference/standard_v3_3-inference_examples_by_content.json.gz b/tests_data/reference/standard_v3_3-inference_examples_by_content.json.gz
diff --git a/tests_data/reference/standard_v3_3-inference_examples_by_path.json.gz b/tests_data/reference/standard_v3_3-inference_examples_by_path.json.gz
