Commit cb4725e

Merge branch 'main' into 1373-update-nc
2 parents f2a1afa + 922d291 commit cb4725e

7 files changed

Lines changed: 185 additions & 30 deletions

CHANGES.md

Lines changed: 25 additions & 20 deletions
@@ -11,57 +11,62 @@ words, they're the ones you'll want to watch, and the others are mostly noise.
 Releases are also tagged in git, if that's helpful.

 ## Coming up
-
-- Fix `me` Update maine scraper and add backscraper
-- Update `sd` backscraper and extract from text
+- Fix `bia` scraper and add extract from text test cases
 - update `nc` scraper to OpinionSiteLinear and new website #1373
-- Fix `bia` scraper

 ## Current

-**2.6.65 - 2024-04-11**
+**2.6.66 - 2025-04-29**
+
+- Add backscraper for `dcd` #1336
+- Update `sd` backscraper and extract from text
+- Implement datestring format validation in test_ScraperExtractFromTextTest #838
+- Implement `or` extract_from_text to collection regional citations #1226
+- Fix `bia` scraper
+
+**2.6.65 - 2025-04-11**

 - `nh` was blocking; fixed by updating the user agent string #1370
 - Update `vtsuperct_*` scrapers to inherit `extract_from_text` from `vt` #1150

 ## Past

-**2.6.64 - 2024-04-10**
+**2.6.64 - 2025-04-10**

 - Fix `me` Update maine scraper and add backscraper #1360
 - Sites were blocking `cafc` scrapers. Fixed by passing a browser user agent #1366


-**2.6.63 - 2024-03-25**
+**2.6.63 - 2025-03-25**

 - Make `ga` backscraper take kwargs; fix a bug in 2018 #1349
 - Implement extract from text for `ga` #1349
 - Fix `ill` oral argument scraper #1356

-**2.6.62 - 2024-03-19**
+**2.6.62 - 2025-03-19**

 - Fix `uscgcoca` and `asbca` by replicating browser request headers #1352
 - Fix `uscgcoca` citation regex #1351

-**2.6.61 - 2024-03-06**
+**2.6.61 - 2025-03-06**

 - Fix `ca8` opinion scraper by setting `request.verify = False` #1346

-**2.6.60 - 2024-03-05**
+**2.6.60 - 2025-03-05**

 - Fix `ca7` scrapers url from http to https

-**2.6.59 - 2024-03-04**
+**2.6.59 - 2025-03-04**

 - Change `colo` user agent to prevent site block #1341

-**2.6.58 - 2024-02-26**
+**2.6.58 - 2025-02-26**

 - Fixes:
   - Add backscraper for `mesuperct` #1328
   - Fix `mont` cleanup_content, would fail when content was bytes #1323

-**2.6.57 - 2024-02-25**
+**2.6.57 - 2025-02-25**

 - Fixes:
   - fix cafc oral argument scraper PR (#1325)[https://github.com/freelawproject/juriscraper/pull/1325]
@@ -73,7 +78,7 @@ Releases are also tagged in git, if that's helpful.
   - Add workflow to check for new entries in CHANGES.md file


-**2.6.56 - 2024-02-19**
+**2.6.56 - 2025-02-19**

 - Fixes:
   - n/a
@@ -83,7 +88,7 @@ Releases are also tagged in git, if that's helpful.
   - Add citation extraction and author for MT


-**2.6.55 - 2024-02-10**
+**2.6.55 - 2025-02-10**

 - Fixes:
   - `cafc` opinion scraper now requests using `verify=False` #1314
@@ -94,27 +99,27 @@ Releases are also tagged in git, if that's helpful.
   - recap: improvement to the download_pdf method to handle cases where
     attachment pages are returned instead of the expected PDF documents. #1309

-**2.6.54 - 2024-01-24**
+**2.6.54 - 2025-01-24**

 - Fixes:
   - `ca6` oral argument scraper is no longer failing
   - update the pypi.yml github actions workflow to solve a bug with twine and
     packaging packages interaction. It now forces the update of packaging
   - due to that bug, we discarded the 2.6.53 version

-**2.6.52 - 2024-01-20**
+**2.6.52 - 2025-01-20**

 - Fixes:
   - `AppellateDocketReport.download_pdf` now returns a two-tuple containing the
     response object or None and a str. This aligns with the changes introduced
     in v 2.5.1.

-**2.6.51 - 2024-01-14**
+**2.6.51 - 2025-01-14**

 - Fixes:
   - `extract_from_text` now returns plain citation strings, instead of parsed dicts

-**2.6.50 - 2024-01-10**
+**2.6.50 - 2025-01-10**

 - Fixes:
   - add tests to ensure that `extract_from_text` does not fail
@@ -128,7 +133,7 @@ Releases are also tagged in git, if that's helpful.
 - Features
   - `pacer.email._parse_bankruptcy_short_description` now supports Multi Docket NEFs

-**2.6.49 - 2024-01-08**
+**2.6.49 - 2025-01-08**

 - Fixes:
   - `nh` scrapers no longer depend on harcoded year filter

juriscraper/opinions/united_states/administrative_agency/bia.py

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ def extract_from_text(self, scraped_text: str) -> Dict[str, Any]:
         :return: Metadata to be added to the case
         """
         date = re.findall(
-            r"Decided (?:(?:by (?:Acting\s)?Attorney General|as amended)\s)?(.*\d{4})",
+            r"Decided (?:(?:by (?:(?:Acting\s)?Attorney General|Board)|as amended)\s)?(.*\d{4})",
             scraped_text,
         )
         if not date:
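
The effect of the regex change in the hunk above can be checked in isolation. The following is a minimal sketch, assuming only the two patterns shown in the removed and added lines; the helper `find_decided_date` and the sample strings are hypothetical and are not taken from the scraper or from real BIA decisions.

```python
import re
from typing import Optional

# Patterns copied from the removed (-) and added (+) lines of the hunk.
OLD_PATTERN = r"Decided (?:(?:by (?:Acting\s)?Attorney General|as amended)\s)?(.*\d{4})"
NEW_PATTERN = r"Decided (?:(?:by (?:(?:Acting\s)?Attorney General|Board)|as amended)\s)?(.*\d{4})"


def find_decided_date(pattern: str, text: str) -> Optional[str]:
    """Hypothetical helper: return the first captured date string, if any."""
    found = re.findall(pattern, text)
    return found[0] if found else None


# Invented sample lines, used only to exercise each branch of the pattern.
samples = [
    "Decided April 9, 2025",
    "Decided by Attorney General March 3, 2021",
    "Decided by Acting Attorney General January 5, 2018",
    "Decided as amended June 1, 2020",
    "Decided by Board April 9, 2025",
]

for text in samples:
    old = find_decided_date(OLD_PATTERN, text)
    new = find_decided_date(NEW_PATTERN, text)
    print(f"{text!r}: old={old!r} new={new!r}")
```

With the old pattern, the optional prefix group does not match "by Board", so the prefix leaks into the captured date string. The new pattern consumes it, and the other prefixes behave as before.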

juriscraper/opinions/united_states/federal_district/dcd.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ def __init__(self, *args, **kwargs):
2626
self.court_id = self.__module__
2727
self.url = f"https://ecf.dcd.uscourts.gov/cgi-bin/Opinions.pl?{date.today().year}"
2828
self.status = "Published"
29+
self.make_backscrape_iterable(kwargs)
2930

3031
def _process_html(self):
3132
"""
@@ -72,3 +73,37 @@ def get_docket_document_number_from_url(self, url: str) -> Tuple[str, str]:
7273
doc_number = match.group(6) if match else url
7374

7475
return doc_number
76+
77+
def _download_backwards(self, year: int) -> None:
78+
"""Build URL with year input and scrape
79+
80+
:param year: year to scrape
81+
:return None
82+
"""
83+
self.url = f"https://ecf.dcd.uscourts.gov/cgi-bin/Opinions.pl?{year}"
84+
self.html = self._download()
85+
self._process_html()
86+
87+
def make_backscrape_iterable(self, kwargs: dict) -> None:
88+
"""Checks if backscrape start and end arguments have been passed
89+
by caller, and parses them accordingly
90+
91+
:param kwargs: passed when initializing the scraper, may or
92+
may not contain backscrape controlling arguments
93+
:return None
94+
"""
95+
start_date = kwargs.get("backscrape_start")
96+
end_date = kwargs.get("backscrape_end")
97+
98+
start = (
99+
datetime.strptime(start_date, "%m/%d/%Y").year
100+
if start_date
101+
else date.today().year
102+
)
103+
end = (
104+
datetime.strptime(end_date, "%m/%d/%Y").year + 1
105+
if end_date
106+
else date.today().year
107+
)
108+
109+
self.back_scrape_iterable = range(max(2005, start), end)
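
To see which years `make_backscrape_iterable` would yield, the range computation can be pulled out into a standalone function. This is a sketch under stated assumptions: `backscrape_years` and its `today` parameter are hypothetical names introduced here so that the result does not depend on the current date; the arithmetic mirrors the added lines above.

```python
from datetime import date, datetime
from typing import Optional


def backscrape_years(
    start_date: Optional[str],
    end_date: Optional[str],
    today: Optional[date] = None,
) -> range:
    """Hypothetical standalone version of the year-range logic in the diff.

    Dates use the same "%m/%d/%Y" format as the scraper; `today` is an
    assumption added here to make the example deterministic.
    """
    today = today or date.today()
    start = (
        datetime.strptime(start_date, "%m/%d/%Y").year
        if start_date
        else today.year
    )
    end = (
        datetime.strptime(end_date, "%m/%d/%Y").year + 1
        if end_date
        else today.year
    )
    # 2005 is the floor used in the diff; earlier start years are clamped.
    return range(max(2005, start), end)


fixed_today = date(2025, 4, 29)
print(list(backscrape_years("01/01/2003", "12/31/2007", fixed_today)))
print(list(backscrape_years("06/15/2023", None, fixed_today)))
print(list(backscrape_years(None, None, fixed_today)))
```

Two properties follow from the arithmetic as written: an explicit end date is inclusive of its year, and when no end date is given the range stops before the current year, so calling it with no arguments yields an empty range.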

juriscraper/opinions/united_states/state/or.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,8 +2,10 @@
22
History:
33
- 2014-08-05: Adapted scraper to have year-based URLs.
44
- 2023-11-18: Fixed and updated
5+
- 2025-04-23: implement extract_from_text, grossir
56
"""
67

8+
import re
79
from datetime import datetime, timedelta
810

911
from juriscraper.AbstractSite import logger
@@ -129,3 +131,18 @@ def format_url(self, start_date: datetime, end_date: datetime) -> str:
129131
start = datetime.strftime(start_date, "%Y%m%d")
130132
end = datetime.strftime(end_date, "%Y%m%d")
131133
return self.base_url.format(self.court_code, start, end)
134+
135+
def extract_from_text(self, scraped_text: str) -> dict:
136+
"""Extract citations from text
137+
138+
Be careful with citations referring to other opinions that are
139+
mentioned before the actual citation
140+
141+
See, for example:
142+
https://ojd.contentdm.oclc.org/digital/api/collection/p17027coll5/id/28946/download
143+
"""
144+
regex = r"\n\s+(?P<cite>\d+ P3d \d+)\s+\n"
145+
if match := re.search(regex, scraped_text[:1000]):
146+
return {"Citation": match.group("cite")}
147+
148+
return {}
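
The docstring's warning about citations to other opinions is handled by the shape of the regex: the citation must sit on its own line, and only the first 1000 characters are searched. A minimal sketch of that behavior follows, assuming the regex from the diff; `extract_citation` is a hypothetical free-function version of the method, and the sample texts are invented rather than taken from Oregon opinions.

```python
import re

# Regex copied from the added lines of the diff.
CITE_REGEX = r"\n\s+(?P<cite>\d+ P3d \d+)\s+\n"


def extract_citation(scraped_text: str) -> dict:
    """Hypothetical free-function version of the added extract_from_text."""
    if match := re.search(CITE_REGEX, scraped_text[:1000]):
        return {"Citation": match.group("cite")}
    return {}


# Invented texts: a citation on its own line, a citation embedded in a
# sentence, and a standalone citation that appears after the 1000-char window.
header = "No. 12\n\n   123 P3d 456   \n\nIN THE SUPREME COURT\n"
inline = "The court relied on State v. Doe, 111 P3d 222 (2005).\n"
late = "x" * 1000 + "\n   123 P3d 456   \n"

print(extract_citation(header))
print(extract_citation(inline))
print(extract_citation(late))
```

Only the first sample produces a citation. The inline reference is skipped because it is not surrounded by whitespace and newlines, and the late one falls outside the slice.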
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,14 @@
11
from importlib import import_module
22

3+
from juriscraper.OpinionSite import OpinionSite
4+
35
# `or` is a python reserved keyword; can't import the module as usual
46
oregon_module = import_module("juriscraper.opinions.united_states.state.or")
57

68

79
class Site(oregon_module.Site):
810
court_code = "p17027coll6"
911
days_interval = 120
12+
# prevent test_ScraperExtractFromTextTest failure, given that parent class
13+
# `or` implements Site.extract_from_text
14+
extract_from_text = OpinionSite.extract_from_text
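
The class-attribute assignment above opts a subclass out of its parent's override by rebinding the method name to the grandparent's implementation. A minimal sketch of the pattern with stand-in classes follows. `BaseSite`, `ParentSite`, and `ChildSite` are hypothetical, and the assumption that the base implementation returns an empty dict is made for illustration only, not confirmed by this diff.

```python
class BaseSite:
    """Stand-in for the base class; assumed here to return an empty dict."""

    def extract_from_text(self, scraped_text: str) -> dict:
        return {}


class ParentSite(BaseSite):
    """Stand-in for a parent scraper that implements its own extraction."""

    def extract_from_text(self, scraped_text: str) -> dict:
        return {"Citation": "1 P3d 1"}


class ChildSite(ParentSite):
    # Rebind the name to the base implementation, so the parent's override
    # is skipped for this subclass only.
    extract_from_text = BaseSite.extract_from_text


print(ParentSite().extract_from_text("some text"))
print(ChildSite().extract_from_text("some text"))
```

Because the base method is a plain function on the class, assigning it as a class attribute makes it bind to `ChildSite` instances like any other method, and the parent's behavior is left unchanged.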

setup.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
from setuptools import find_packages, setup
55
from setuptools.command.install import install
66

7-
VERSION = "2.6.65"
7+
VERSION = "2.6.66"
88
AUTHOR = "Free Law Project"
99
EMAIL = "info@free.law"
1010
HERE = os.path.abspath(os.path.dirname(__file__))

tests/local/test_ScraperExtractFromTextTest.py

Lines changed: 101 additions & 8 deletions
Large diffs are not rendered by default.
