Commit 76281eb

Merge pull request #1082 from google/py-improve-python
Many small fixes and improvements (python, docs, link to paper)
2 parents d730f40 + 42cb389 commit 76281eb

11 files changed, +55 -127 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -268,8 +268,8 @@ Please consult the [python documentation](./python/README.md) for details on the
 | Artifact | Status | Latest version | Default model |
 | -------------------------------------------------------------------------- | -------------- | -------------- | ---------------------------------------------------------- |
 | [Python `Magika` module](./python/README.md) | Stable | `0.6.1` | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
-| [Rust `magika` CLI](https://crates.io/crates/magika-cli) | Stable | `0.1.1` | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
-| [Rust `magika` library](https://docs.rs/magika) | Stable | `0.1.1` | [`standard_v3_2`](./assets/models/standard_v3_2/README.md) |
+| [Rust `magika` CLI](https://crates.io/crates/magika-cli) | Stable | `0.1.2` | [`standard_v3_3`](./assets/models/standard_v3_2/README.md) |
+| [Rust `magika` library](https://docs.rs/magika) | Stable | `0.2.0` | [`standard_v3_3`](./assets/models/standard_v3_2/README.md) |
 | TypeScript / NPM package ([README](./js/README.md) & [docs](./docs/js.md)) | Stable | `0.3.2` | [`standard_v3_3`](./assets/models/standard_v3_3/README.md) |
 | [Demo Website](https://google.github.io/magika/) | Stable | - | [`standard_v3_3`](./assets/models/standard_v3_3/README.md) |
 | [GoLang](./go/README.md) | In development | - | - |
@@ -298,7 +298,7 @@ See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details.
 
 # Research Paper and Citation
 
-We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. A pre-print of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).
+We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. You can find a copy of the paper [here](./assets/2025_icse_magika.pdf). (A previous version of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).)
 
 If you use this software for your research, please cite it as:

assets/2025_icse_magika.pdf

943 KB
Binary file not shown.

js/src/overwrite-reason.ts

Lines changed: 2 additions & 2 deletions
@@ -14,6 +14,6 @@
 
 export enum OverwriteReason {
   NONE = "none",
-  LOW_CONFIDENCE = "low-confidence",
-  OVERWRITE_MAP = "overwrite-map",
+  LOW_CONFIDENCE = "low_confidence",
+  OVERWRITE_MAP = "overwrite_map",
 }

js/src/prediction-mode.ts

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@
 // limitations under the License.
 
 export enum PredictionMode {
-  BEST_GUESS = "best-guess",
-  MEDIUM_CONFIDENCE = "medium-confidence",
-  HIGH_CONFIDENCE = "high-confidence",
+  BEST_GUESS = "best_guess",
+  MEDIUM_CONFIDENCE = "medium_confidence",
+  HIGH_CONFIDENCE = "high_confidence",
 }

python/CHANGELOG.md

Lines changed: 4 additions & 2 deletions
@@ -14,10 +14,12 @@ semver guidelines for more details about this.
 - Mark python 3.13 as supported.
 - New model `standard_v3_3` model, with better support for TypeScript and non-ascii characters in textual files. See [models' CHANGELOG](../assets/models/CHANGELOG.md) for more information.
 - `identify_stream()` now restores the stream's original position after reading from it, preventing side effects on subsequent stream operations. ([#1020](https://github.com/google/magika/pull/1020))
-- Bugfix: limit the number of bytes we read in case of an input with just many whitespaces. ([#1015](https://github.com/google/magika/pull/1015))
-- Bugfix: do not alter warnings' simplefilter as this has visible side effects for other modules. ([#1017](https://github.com/google/magika/pull/1017))
 - Add `asdict()` utility method to `MagikaResult`.
 - Set `prediction.overwrite_reason` to `Overwrite.NONE` if `output.label` is the same as `dl.label`. ([#1023](https://github.com/google/magika/pull/1023))
+- Bugfix: limit the number of bytes we read in case of an input with just many whitespaces. ([#1015](https://github.com/google/magika/pull/1015))
+- Bugfix: do not alter warnings' simplefilter as this has visible side effects for other modules. ([#1017](https://github.com/google/magika/pull/1017))
+- Bugfix: magika's python client now properly warns for low-confidence predictions.
+- Bugfix: update Magika's StrEnum string representation to be compatible with standard library.
 
 ## [0.6.1] - 2025-03-19

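The `identify_stream()` entry above describes restoring the stream position after reading. The following is a minimal sketch of that general save/restore pattern, not Magika's actual implementation; the helper name `read_prefix_preserving_position` is hypothetical and only uses the standard `tell()`/`seek()` calls of binary streams.

```python
import io
from typing import BinaryIO


def read_prefix_preserving_position(stream: BinaryIO, size: int) -> bytes:
    """Hypothetical helper: read `size` bytes from the start of a seekable
    stream, then put the cursor back where the caller left it."""
    original_pos = stream.tell()  # remember the caller's position
    try:
        stream.seek(0)
        return stream.read(size)
    finally:
        # Restore the position even if the read raises.
        stream.seek(original_pos)


stream = io.BytesIO(b"hello world")
stream.seek(6)  # the caller is already part-way through the stream
prefix = read_prefix_preserving_position(stream, 5)
remaining = stream.read()  # continues from the caller's original offset
```

With this pattern, code that reads from the stream after the identification call sees the same bytes it would have seen without it.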
python/README.md

Lines changed: 1 addition & 1 deletion
@@ -281,7 +281,7 @@ class ContentTypeLabel(StrEnum):
 
 ## Research Paper and Citation
 
-We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. A pre-print of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).
+We describe how we developed Magika and the choices we made in our research paper, which was accepted at the International Conference on Software Engineering (ICSE) 2025. You can find a copy of the paper [here](https://github.com/google/magika/blob/main/assets/2025_icse_magika.pdf). (A previous version of our paper is available on arxiv: [https://arxiv.org/abs/2409.13768](https://arxiv.org/abs/2409.13768).)
 
 If you use this software for your research, please cite it as:

python/src/magika/cli/magika_client.py

Lines changed: 1 addition & 1 deletion
@@ -290,7 +290,7 @@ def main(
     and result.prediction.dl.label
     != result.prediction.output.label
     and result.prediction.overwrite_reason
-    == OverwriteReason.NONE
+    == OverwriteReason.LOW_CONFIDENCE
 ):
     # It seems that we had a low-confidence prediction
     # from the model. Let's warn the user about our best

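The one-line fix above changes which overwrite reason triggers the low-confidence warning. The sketch below isolates the two clauses visible in the hunk; the `Prediction` dataclass and the function name are simplified stand-ins (assumptions), and any earlier clauses of the real condition, which the hunk does not show, are omitted.

```python
import enum
from dataclasses import dataclass


class OverwriteReason(str, enum.Enum):
    # Values mirror the underscore spelling used elsewhere in this commit.
    NONE = "none"
    LOW_CONFIDENCE = "low_confidence"
    OVERWRITE_MAP = "overwrite_map"


@dataclass
class Prediction:
    # Flattened stand-in for result.prediction.dl.label / .output.label.
    dl_label: str
    output_label: str
    overwrite_reason: OverwriteReason


def should_warn_low_confidence(prediction: Prediction) -> bool:
    """Warn only when the model's label was replaced because of low confidence."""
    return (
        prediction.dl_label != prediction.output_label
        and prediction.overwrite_reason == OverwriteReason.LOW_CONFIDENCE
    )


low_conf = Prediction("python", "txt", OverwriteReason.LOW_CONFIDENCE)
agreed = Prediction("python", "python", OverwriteReason.NONE)
remapped = Prediction("randomtxt", "txt", OverwriteReason.OVERWRITE_MAP)
```

If `OverwriteReason.NONE` is only reported when the two labels agree (which is how the CHANGELOG entry for #1023 reads), the previous comparison against `NONE` could not hold together with the label mismatch, so the warning would never have been shown.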
python/src/magika/magika.py

Lines changed: 38 additions & 101 deletions
@@ -382,96 +382,73 @@ def _extract_features_from_seekable(
         block_size: int,
         use_inputs_at_offsets: bool,
     ) -> ModelFeatures:
-        """This implement v2 of the features extraction v2 from a seekable,
-        which is an abstraction about anything that can be "read_at" a specific
-        offset, such as a file or buffer. This is implemented so that we do not
-        need to load the entire file in memory, or scan the entire buffer.
+        """Extract features from an input seekable.
+
+        This implements features extraction v2 from a seekable, which is an
+        abstraction about anything that has a size and that can be "read_at" a
+        specific offset, such as a file or a buffer. This is implemented so that
+        we do not need to load the entire file in memory or scan the entire
+        buffer.

         High-level overview on what we do:
-        - We extract blocks of bytes from the beginning ("beg"), the middle
-          ("mid"), and at the end ("end").
-        - We then truncate or add padding to these blocks, depending on whether
-          we have too many or too few.
-
-        Blocks extraction and padding:
-        - beg: we read the first block_size bytes, we lstrip() it, and we use
-          this as the basis to extract beg_size integers. If we have too many
-          bytes, we only consider the first beg_size ones. If we do not have
-          enough, we add padding as suffix (up to beg_size integers).
-        - mid: we determine "where the middle is" by using the entire content's
-          size (before stripping the whitespace-like characters), and we take the
-          mid_size bytes in the middle. If we do not have enough bytes, we add
-          padding to the left and to the right. In case we need to add an odd
-          number of padding integers, we add an extra one to the right.
-        - end: same as "beg", but we read the last block_size bytes, we rstrip()
-          (instead of lstrip()), and, if needed, we add padding as a prefix (and
-          not as a suffix like we do with "beg").
+        - We read (at most) `block_size` bytes from the beginning and from the
+          end.
+        - We normalize these bytes by stripping whitespaces.
+        - We consider `beg_size` and `end_size` bytes as `beg` and `end`
+          features. If we don't have enough bytes, we use `padding_token` as
+          padding.
+
+        See comments below for the specifics and handling of corner cases.
+
+        NOTE: This implementation does not support extraction of `mid` features
+        and `use_inputs_at_offsets`.
         """

         assert beg_size < block_size
-        assert mid_size < block_size
+        assert mid_size == 0
         assert end_size < block_size
+        assert not use_inputs_at_offsets

         # we read at most block_size bytes
         bytes_num_to_read = min(block_size, seekable.size)

         if beg_size > 0:
-            beg_content = seekable.read_at(0, bytes_num_to_read).lstrip()
+            # Read at most `block_size` bytes from the beginning; `lstrip()`
+            # them (or `strip()` them if the file size is less than or equal to
+            # `block_size`); take at most `beg_size` bytes, and optionally pad
+            # them with `padding_token` to get to a list of `beg_size` integers.
+            beg_content = seekable.read_at(0, bytes_num_to_read)
+            beg_content = beg_content.lstrip()
             beg_ints = Magika._get_beg_ints_with_padding(
                 beg_content, beg_size, padding_token
             )
         else:
             beg_ints = []

-        if mid_size > 0:
-            # mid_idx points to the left-most offset to read for the "mid" component
-            # of the features.
-            mid_bytes_num_to_read = min(seekable.size, mid_size)
-            mid_idx = (seekable.size - mid_bytes_num_to_read) // 2
-            mid_content = seekable.read_at(mid_idx, mid_bytes_num_to_read)
-            mid_ints = Magika._get_mid_ints_with_padding(
-                mid_content, mid_size, padding_token
-            )
-        else:
-            mid_ints = []
-
         if end_size > 0:
+            # Read at most `block_size` bytes from the end; `rstrip()` them (or
+            # `strip()` them if the file size is less than or equal to
+            # `block_size`); take at most `end_size` bytes (from the end), and
+            # optionally pad them (at the beginning) with `padding_token` to get
+            # to a list of `end_size` integers.
             end_content = seekable.read_at(
                 seekable.size - bytes_num_to_read, bytes_num_to_read
-            ).rstrip()
+            )
+            end_content = end_content.rstrip()
             end_ints = Magika._get_end_ints_with_padding(
                 end_content, end_size, padding_token
             )
         else:
             end_ints = []

-        if use_inputs_at_offsets:
-            offset_0x8000_0x8007 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x8000, 8, padding_token
-            )
-            offset_0x8800_0x8807 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x8800, 8, padding_token
-            )
-            offset_0x9000_0x9007 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x9000, 8, padding_token
-            )
-            offset_0x9800_0x9807 = Magika._get_ints_at_offset_or_padding(
-                seekable, 0x9800, 8, padding_token
-            )
-        else:
-            offset_0x8000_0x8007 = []
-            offset_0x8800_0x8807 = []
-            offset_0x9000_0x9007 = []
-            offset_0x9800_0x9807 = []
-
         return ModelFeatures(
             beg=beg_ints,
-            mid=mid_ints,
+            mid=[],
             end=end_ints,
-            offset_0x8000_0x8007=offset_0x8000_0x8007,
-            offset_0x8800_0x8807=offset_0x8800_0x8807,
-            offset_0x9000_0x9007=offset_0x9000_0x9007,
-            offset_0x9800_0x9807=offset_0x9800_0x9807,
+            offset_0x8000_0x8007=[],
+            offset_0x8800_0x8807=[],
+            offset_0x9000_0x9007=[],
+            offset_0x9800_0x9807=[],
         )

     @staticmethod
@@ -498,38 +475,6 @@ def _get_beg_ints_with_padding(

         return beg_ints

-    @staticmethod
-    def _get_mid_ints_with_padding(
-        mid_content: bytes, mid_size: int, padding_token: int
-    ) -> List[int]:
-        """Take a buffer as input and extract mid ints. This returns a list of
-        integers whose length is exactly mid_size. If the buffer is bigger than
-        required, take only its middle part. If the buffer is shorter, add
-        padding to its left and right. If we need to add an odd number of
-        padding integers, add an extra one to the right.
-        """
-
-        if mid_size < len(mid_content):
-            mid_idx = (len(mid_content) - mid_size) // 2
-            mid_content = mid_content[mid_idx : mid_idx + mid_size]
-
-        mid_ints = list(map(int, mid_content))
-
-        if len(mid_ints) < mid_size:
-            # we don't have enough ints, add padding
-            padding_size = mid_size - len(mid_ints)
-            padding_size_left = padding_size // 2
-            padding_size_right = padding_size - padding_size_left
-            mid_ints = (
-                ([padding_token] * padding_size_left)
-                + mid_ints
-                + ([padding_token] * padding_size_right)
-            )
-
-        assert len(mid_ints) == mid_size
-
-        return mid_ints
-
     @staticmethod
     def _get_end_ints_with_padding(
         end_content: bytes, end_size: int, padding_token: int
@@ -554,14 +499,6 @@ def _get_end_ints_with_padding(

         return end_ints

-    @staticmethod
-    def _get_ints_at_offset_or_padding(
-        seekable: Seekable, offset: int, size: int, padding_token: int
-    ) -> List[int]:
-        if offset + size <= seekable.size:
-            return list(map(int, seekable.read_at(offset, size)))
-        return [padding_token] * size
-
     def _get_model_outputs_from_features(
         self, all_features: List[Tuple[Path, ModelFeatures]]
     ) -> List[Tuple[Path, ModelOutput]]:

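The simplified `beg`/`end` extraction in the hunks above can be sketched on a plain `bytes` object. This is an illustration under stated assumptions, not the library's code: the bodies of the two padding helpers are not part of this diff, so `get_beg_ints` and `get_end_ints` are written from their described behavior, slicing stands in for `seekable.read_at`, and the padding value `256` is assumed.

```python
from typing import List, Tuple


def get_beg_ints(content: bytes, beg_size: int, padding_token: int) -> List[int]:
    """Keep the first `beg_size` bytes; pad on the right (assumed behavior)."""
    ints = list(content[:beg_size])
    return ints + [padding_token] * (beg_size - len(ints))


def get_end_ints(content: bytes, end_size: int, padding_token: int) -> List[int]:
    """Keep the last `end_size` bytes; pad on the left (assumed behavior)."""
    ints = list(content[len(content) - min(end_size, len(content)):])
    return [padding_token] * (end_size - len(ints)) + ints


def extract_beg_end(
    data: bytes, beg_size: int, end_size: int, block_size: int, padding_token: int
) -> Tuple[List[int], List[int]]:
    assert beg_size < block_size
    assert end_size < block_size

    # Read at most block_size bytes from each side, as in the diff.
    bytes_num_to_read = min(block_size, len(data))
    beg_block = data[:bytes_num_to_read].lstrip()
    end_block = data[len(data) - bytes_num_to_read:].rstrip()

    return (
        get_beg_ints(beg_block, beg_size, padding_token),
        get_end_ints(end_block, end_size, padding_token),
    )


PADDING = 256  # assumed value, chosen outside the 0..255 byte range
beg, end = extract_beg_end(
    b"  hi\n", beg_size=4, end_size=4, block_size=8, padding_token=PADDING
)
# beg == [104, 105, 10, 256]: leading spaces stripped, padded on the right
# end == [32, 32, 104, 105]: trailing newline stripped, last four bytes kept
```

Note that the sketch follows the code in the hunk, which applies `lstrip()` to the beginning block and `rstrip()` to the end block; the new comments also mention a full `strip()` for small inputs, which the visible code does not do.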
python/src/magika/types/strenum.py

Lines changed: 3 additions & 14 deletions
@@ -16,19 +16,8 @@
 class StrEnum(str, enum.Enum):
     """
     StrEnum is a Python ``enum.Enum`` that inherits from ``str``. The default
-    ``auto()`` behavior uses the member name as its value.
-
-    Example usage::
-
-        class Example(StrEnum):
-            UPPER_CASE = auto()
-            lower_case = auto()
-            MixedCase = auto()
-
-
-        assert Example.UPPER_CASE == "UPPER_CASE"
-        assert Example.lower_case == "lower_case"
-        assert Example.MixedCase == "MixedCase"
+    ``auto()`` behavior uses the lower-case version of the name. This is meant
+    to reflect the behavior of `enum.StrEnum`, available from Python 3.11.
     """

     def __new__(cls, value: Union[str, StrEnum], *args, **kwargs):  # type: ignore[no-untyped-def]
@@ -47,4 +36,4 @@ def _generate_next_value_(name, *_):  # type: ignore[no-untyped-def,override]

 class LowerCaseStrEnum(StrEnum):
     def _generate_next_value_(name, *_):  # type: ignore[no-untyped-def,override]
-        return name.lower().replace("_", "-")
+        return name.lower()
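The effect of the `_generate_next_value_` change can be seen with a small standalone enum. This is a sketch that uses only the standard `enum` module rather than Magika's own `StrEnum` class; the `PredictionMode` members are borrowed from the TypeScript enum in this commit for illustration.

```python
import enum


class LowerCaseStrEnum(str, enum.Enum):
    """Standalone stand-in: auto() yields the lower-cased member name."""

    def _generate_next_value_(name, start, count, last_values):
        # Before this commit the equivalent code also replaced "_" with "-".
        return name.lower()


class PredictionMode(LowerCaseStrEnum):
    BEST_GUESS = enum.auto()
    MEDIUM_CONFIDENCE = enum.auto()
    HIGH_CONFIDENCE = enum.auto()


values = [mode.value for mode in PredictionMode]
# values == ["best_guess", "medium_confidence", "high_confidence"]
```

These underscore-separated values match what the standard library's `enum.StrEnum` (Python 3.11+) produces for `auto()`, and they line up with the string values now used in the TypeScript enums changed above.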
