Commit 1e2da6d

fix: ipv4 address regex (#3808)
I noticed the ipv4 regex is wrong: it only captures one- or two-digit octets (e.g. `n.nn.n.nn`). Here's a correction and an updated test for it. If you wish, I can break the ipv4 test out into its own case so that it doesn't interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction test. Side note: the comment at `unstructured/nlp/patterns.py#95` includes a bad ipv4 address example (the last octet is incorrectly left-padded with a zero). I left it as it is because I'm not sure whether the intention is to include "non-conventional" ipv4 addresses, such as ones with octal or hexadecimal octets.
1 parent 4379d88 commit 1e2da6d

3 files changed: +8 -4 lines changed

Diff for: CHANGELOG.md

+4
@@ -1,5 +1,9 @@
 ## 0.16.11-dev1
 
+### Fixes
+
+- Fix ipv4 regex to correctly include up to three digit octets.
+
 ### Enhancements
 
 - **Enhance quote standardization tests** with additional Unicode scenarios

Diff for: test_unstructured/cleaners/test_extract.py

+2 -2
@@ -5,7 +5,7 @@
 from unstructured.cleaners import extract
 
 EMAIL_META_DATA_INPUT = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
-\n ABC.DEF.local ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
+\n ABC.DEF.local ([68.183.71.12]) with mapi id\
 n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
 
 
@@ -37,7 +37,7 @@ def test_extract_email_address():
 def test_extract_ip_address():
     assert extract.extract_ip_address(EMAIL_META_DATA_INPUT) == [
         "ba23::58b5:2236:45g2:88h2",
-        "ba23::58b5:2236:45g2:88h2%25",
+        "68.183.71.12",
     ]
 
 
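The commit message offers to break the ipv4 check out into its own test case. A minimal sketch of what that could look like follows. It assumes extraction is a plain `findall` over the joined pattern; `find_ip_addresses` and `IPV4_META_DATA_INPUT` are hypothetical names introduced here so the example runs without the package installed, and they are not part of the repository.

```python
import re

# Patterns copied from the patterns.py hunk in this commit.
IP_ADDRESS_PATTERN = (
    r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}",
    "[a-z0-9]{4}::[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}%?[0-9]*",
)
IP_ADDRESS_PATTERN_RE = re.compile(f"({'|'.join(IP_ADDRESS_PATTERN)})")

# Hypothetical input, not taken from the repository's test data.
IPV4_META_DATA_INPUT = (
    "from ABC.DEF.local ([68.183.71.12]) by ABC.DEF.local ([192.168.1.1]) with mapi id"
)


def find_ip_addresses(text):
    """Hypothetical stand-in for extract.extract_ip_address (assumed to be a findall)."""
    return IP_ADDRESS_PATTERN_RE.findall(text)


def test_extract_ipv4_address():
    # Only ipv4 addresses here, so the ipv6 test input stays untouched.
    assert find_ip_addresses(IPV4_META_DATA_INPUT) == ["68.183.71.12", "192.168.1.1"]
```

Keeping the ipv4 input separate would let `EMAIL_META_DATA_INPUT` retain its original `%25`-suffixed ipv6 address.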

Diff for: unstructured/nlp/patterns.py

+2 -2
@@ -92,9 +92,9 @@
 ONE_LINE_BREAK_PARAGRAPH_PATTERN = r"^(?:(?!\.\s*$).)*$"
 ONE_LINE_BREAK_PARAGRAPH_PATTERN_RE = re.compile(ONE_LINE_BREAK_PARAGRAPH_PATTERN)
 
-# IP Address examples: ba23::58b5:2236:45g2:88h2 or 10.0.2.01
+# IP Address examples: ba23::58b5:2236:45g2:88h2, 10.0.2.01 or 68.183.71.12
 IP_ADDRESS_PATTERN = (
-    r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}",
+    r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}",
     "[a-z0-9]{4}::[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}:[a-z0-9]{4}%?[0-9]*",
 )
 IP_ADDRESS_PATTERN_RE = re.compile(f"({'|'.join(IP_ADDRESS_PATTERN)})")
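To make the behavior change concrete, here is a small sketch that runs the old and new ipv4 patterns from this hunk side by side. The names `OLD_IPV4`, `NEW_IPV4`, and `compare` are illustrative only and do not exist in the repository; the sample addresses are the ones mentioned in the code comment, plus one two-digit-octet case.

```python
import re

# ipv4 alternatives copied from the removed (-) and added (+) lines above.
OLD_IPV4 = r"[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}"
NEW_IPV4 = r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}"


def compare(text):
    """Return (old_matches, new_matches) for the given text."""
    return re.findall(OLD_IPV4, text), re.findall(NEW_IPV4, text)


# Three-digit octets: missed by the old pattern, found by the new one.
print(compare("68.183.71.12"))  # ([], ['68.183.71.12'])

# Left-padded last octet: the old pattern takes "01" whole, while the
# new one stops after the "0" because [1-9]?\d rejects a leading zero.
print(compare("10.0.2.01"))     # (['10.0.2.01'], ['10.0.2.0'])

# One- and two-digit octets behave the same under both patterns.
print(compare("1.22.3.44"))     # (['1.22.3.44'], ['1.22.3.44'])
```

The second case relates to the side note in the commit message: the `10.0.2.01` example kept in the comment is only partially matched by the new pattern, so whether left-padded octets should be supported remains an open question.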
