fix(nc): update to OpinionSiteLinear and new site #1381

Merged
flooie merged 12 commits into main from 1373-update-nc on Jun 9, 2025
Conversation

grossir commented Apr 23, 2025

Contributor

Solves #1373

flooie commented Apr 28, 2025

Contributor

This is good, but the summaries data isn't quite the right category. It lists the metadata associated with the case but isn't a summary of the case. The court actually has robust headnotes, but as far as I can tell there is no way to link the headnotes for a case from the search engine provided. I emailed the court asking if there is a way and will report back.

If we keep the "summary" information, though, I think it should go into the headnotes field, but it should be cleaned up, or at least titled better.

@flooie flooie assigned grossir and unassigned flooie Apr 28, 2025
@grossir grossir moved this from PRs to Review to Blocked in Sprint (Case Law) Apr 29, 2025
grossir commented Apr 29, 2025

Contributor Author

I will put this in "Blocked" until we get the court's answer.

From your comment, I noticed that the field I was collecting as "summary" makes more sense in nc:

  • "Whether unacceptable personal conduct provided just cause for termination of a state employee.",
  • "Whether the Business Court erred in ruling that the parties' intrafamilial dispute fell outside the scope of their arbitration agreement with Charles Schwab."

This should be an OpinionCluster.headnote in CourtListener, right? From CL:

Headnotes are summary descriptions of the legal issues discussed by the court in the particular case. They appear at the beginning of each case just after the summary and disposition. They are short paragraphs with a heading in bold face type. From Wikipedia - A headnote is a brief summary of a particular point of law that is added to the text of a court decision to aid readers in locating discussion of a legal issue in an opinion. As the term implies, headnotes appear at the beginning of the published opinion. Frequently, headnotes are value-added components appended to decisions by the publisher who compiles the decisions of a court for resale. As handed down by the court, a decision or written opinion does not contain headnotes. These are added later by an editor not connected to the court, but who instead works for a legal publishing house.

For ncctapp it doesn't look like a summary at all, so I think I will just delete it. Examples:

  • Domestic Relations; Attorney Fees Award; American Rule; N.C. Gen. Stat. § 50-13.6 (2023). Required Findings; N.C. Rev. R. Prof. Conduct 1.5(a)-(b); Amount and Reasonableness of time spent
  • first-degree felony murder; assault with a deadly weapon inflicting serious injury; jury instructions; voluntary manslaughter; rule of lenity; hit and run; N.C. Gen. Stat. 20-166; motion to dismiss; accessory after the fact; N.C. Gen. Stat. 14-7; fatal variance; ex mero motu

flooie commented Apr 30, 2025

Contributor

So I think headnotes are too nice an opportunity not to include. Tell me what you think of this.
With something like this, it's possible to call the headnotes digest index, match the URLs found for opinions against the digest, and include the headnotes directly in our capture. It adds just one call.

Currently we aren't set up for headnotes, but that should be an easy addition in OpinionSite.

    # Assumes: from lxml.html import fromstring
    def _download(self, request_dict={}):
        # Fetch the digested index once, before the first regular download
        if self.html is None:
            url = "https://appellate.nccourts.org/opinion-filings/digested-index.php?iCourtNumber=1&sFilingYear=2025"
            r = self.request["session"].get(url)
            self.headnote_html = fromstring(r.text)
        return super()._download(request_dict)

and in _process_html:

        url = url.replace("http:", "https:")
        # divs = self.headnote_html.xpath(f'//div[.//a[@href="{url}"]]')
        p_elt = self.headnote_html.xpath(f'(//a[@href="{url}"]/ancestor::p)[1]')[0]
        # print(tostring(p_elt, pretty_print=True).decode())
        inner_html = "".join(tostring(child, encoding="unicode") for child in p_elt)
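
To make the matching step concrete, here is a minimal, dependency-free sketch of the idea. It deliberately swaps lxml for the standard library's `xml.etree.ElementTree`, which has no `ancestor::` axis, so a parent map is built by hand. The sample markup, the example.org URLs, and the `find_headnote` helper are all hypothetical; the real digest page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the digested index page.
SAMPLE_DIGEST = """
<div>
  <p><b>Arbitration</b> scope of agreement
     <a href="https://example.org/opinions/?c=1&amp;pdf=1">Case A</a></p>
  <p><b>Employment</b> just cause
     <a href="https://example.org/opinions/?c=1&amp;pdf=2">Case B</a></p>
</div>
""".strip()


def find_headnote(digest_root, url):
    """Return the bold heading of the first <p> that links to `url`, or ""."""
    url = url.replace("http:", "https:")
    # ElementTree elements do not know their parent, so map child -> parent.
    parents = {child: parent for parent in digest_root.iter() for child in parent}
    for anchor in digest_root.iter("a"):
        if anchor.get("href") != url:
            continue
        # Walk up from the matching link to its enclosing paragraph.
        node = anchor
        while node is not None and node.tag != "p":
            node = parents.get(node)
        if node is None:
            return ""
        bold = node.find("b")
        return "".join(bold.itertext()).strip() if bold is not None else ""
    return ""


root = ET.fromstring(SAMPLE_DIGEST)
headnote = find_headnote(root, "http://example.org/opinions/?c=1&pdf=2")
missing = find_headnote(root, "https://example.org/opinions/?c=1&pdf=99")
```

Returning an empty string for an unmatched URL keeps the scraper usable when the digest has not yet caught up with newly published opinions.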

@grossir grossir moved this from Blocked to In progress in Sprint (Case Law) Apr 30, 2025
@flooie flooie assigned Luis-manzur and unassigned grossir Jun 2, 2025
@flooie flooie moved this from In progress to PRs to Review in Sprint (Case Law) Jun 2, 2025
@flooie flooie marked this pull request as draft June 2, 2025 19:36
@Luis-manzur Luis-manzur requested a review from flooie June 3, 2025 23:08
@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Jun 3, 2025
flooie commented Jun 5, 2025

Contributor

@Luis-manzur is this still a draft?

@Luis-manzur Luis-manzur marked this pull request as ready for review June 5, 2025 17:36
Comment thread juriscraper/opinions/united_states/state/nc.py Outdated

@flooie flooie left a comment

Contributor

In theory I like the idea of collecting headnotes, but in practice I don't think we should include them here. The court appears to publish the headnotes digest months after it publishes opinions. This means that in reality we will never actually collect them during a scrape, and we have no mechanism right now to merge headnotes back into the system.

What it does do is overly complicate the tool.

"""Goes into OpinionCluster.attorneys, type: string"""
return self._get_optional_field_by_id("attorney")

def _get_headnotes(self):

Contributor

Drop headnotes from this PR. We should add them in a different PR if we want to include headnotes.

-                # Like: viewOpinion("http://appellate.nccourts.org/opinions/?c=1&pdf=31511")
-                if len(urls) != 1 or urls[0].find("viewOpinion") != 0:
-                    continue  # Only interested in cases with a download link
+    def _process_html(self):

Contributor

docstring format is not standard juriscraper

Contributor

this is not resolved

Comment thread juriscraper/opinions/united_states/state/nc.py Outdated
Comment thread juriscraper/opinions/united_states/state/ncctapp.py Outdated
Comment thread juriscraper/opinions/united_states/state/nc.py
Comment on lines +54 to +139
-                path = "./td/span/span[contains(@class,'title')]"
-                txt = html.tostring(
-                    row_el.xpath(path)[0], method="text", encoding="unicode"
-                )
-                case_name, neutral_cite, docket_number = self.parse_title(txt)
-
-                summary = ""
-                path = "./td/span/span[contains(@class,'desc')]/text()"
-                summaries = row_el.xpath(path)
-                try:
-                    summary = summaries[0]
-                except IndexError:
-                    # Not all cases have a summary
-                    pass
-                if case_name.strip() == "":
-                    continue  # A few cases are missing a name
-
-                case_dates.append(case_date)
-                self.my_download_urls.append(download_url)
-                self.my_case_names.append(case_name)
-                self.my_docket_numbers.append(docket_number)
-                self.my_summaries.append(summary)
-                self.my_neutral_citations.append(neutral_cite)
-                self.my_precedential_statuses.append(precedential_status)
-
-            elif precedential_status == "Unpublished":
-                for span in row_el.xpath("./td/span"):
-                    if "onclick" not in span.attrib:
-                        continue
-                    download_url = re.search(
-                        r'viewopinion\("(.*)"',
-                        span.attrib["onclick"],
-                        re.IGNORECASE,
-                    ).group(1)
-
-                    txt = span.text_content().strip()
-                    (
-                        case_name,
-                        neutral_cite,
-                        docket_number,
-                    ) = self.parse_title(txt)
-                    if case_name.strip() == "":
-                        continue  # A few cases are missing a name
-                    case_dates.append(case_date)
-                    self.my_download_urls.append(download_url)
-                    self.my_case_names.append(case_name)
-                    self.my_docket_numbers.append(docket_number)
-                    self.my_summaries.append("")
-                    self.my_neutral_citations.append(neutral_cite)
-                    self.my_precedential_statuses.append(precedential_status)
-
-        return case_dates
-
-    # Parses case titles like:
-    # Fields v. Harnett Cnty., 367 NC 12 (13-761)
-    # Clark v. Clark, (13-612)
-    @staticmethod
-    def parse_title(txt):
-        try:
-            name_and_citation = txt.rsplit("(", 1)[0].strip()
-            docket_number = (
-                re.search(r"(.*\d).*?", txt.rsplit("(", 1)[1]).group(0).strip()
-            )
-            case_name = name_and_citation.rsplit(",", 1)[0].strip()
-            try:
-                neutral_cite = name_and_citation.rsplit(",", 1)[1].strip()
-                if not re.search(r"^\d\d.*\d\d$", neutral_cite):
-                    neutral_cite = ""
-            except IndexError:
-                # Unable to find comma to split on. No neutral cite.
-                neutral_cite = ""
-        except Exception:
-            raise InsanityException(
-                f"Unable to parse: {txt}\n{traceback.format_exc()}"
-            )
-        return case_name, neutral_cite, docket_number
-
-    def _get_download_urls(self):
-        return self.my_download_urls
-
-    def _get_case_names(self):
-        return self.my_case_names
-
-    def _get_docket_numbers(self):
-        return self.my_docket_numbers
-
-    def _get_summaries(self):
-        return self.my_summaries
-
-    def _get_citations(self):
-        return self.my_neutral_citations
-
-    def _get_precedential_statuses(self):
-        return self.my_precedential_statuses
+        Iterates over each row in the HTML, extracting the title, link, summary, headnote,
+        docket, status, author, per curiam status, date, and citation for each opinion.
+        Handles cases where opinions may be withdrawn (no link), and parses additional
+        information from the headnote HTML if available.
+
+        Appends a dictionary of extracted case data to self.cases for each valid row.
+        """
+        for row in self.html.xpath(self.row_xpath):
+            title = row.xpath("string(span[@class='title'])")
+
+            link = row.xpath("span[@class='title']/@onclick")
+            if not link:
+                # some opinions may be withdrawn
+                logger.warning("No link for row %s", title)
+                continue
+
+            url = link[0].split('("')[1].strip('")')
+
+            summary = (
+                row.xpath("string(span[@class='desc'])")
+                if self.collect_summary
+                else ""
+            )
+            headnote = (
+                ""
+                if self.collect_summary
+                else row.xpath("string(span[@class='desc'])")
+            )
+
+            url = url.replace("http:", "https:")
+            divs = self.headnote_html.xpath(
+                f'(//a[@href="{url}"]/ancestor::p)[1]'
+            )
+            if divs:
+                p_elt = divs[0]
+                all_text = p_elt.xpath("text()")
+
+                summary = "".join(
+                    text.replace("—", "")
+                    for text in all_text
+                    if not (text.startswith("<b>") or text.startswith("<a"))
+                ).strip()
+                headnote = p_elt.xpath("./b//text()")[0]
+
+            match = re.search(self.title_regex, title)
+            name = title[: match.start()].strip(" ,")
+
+            state_cite = ""
+            if cite_match := re.search(self.state_cite_regex, name):
+                state_cite = cite_match.group(0)
+                name = name[: cite_match.start()].strip(" ,")
+
+            docket = match.group("docket")
+            status = match.group("status")
+
+            author = row.xpath("string(span[@class='author']/i)").strip()
+            per_curiam = False
+            if author.lower() == "per curiam":
+                per_curiam = True
+                author = ""
+
+            # pick the last preceding-sibling, the most recent date block
+            date_block = row.xpath(self.date_xpath)[-1].text_content()
+
+            if match := re.search(self.date_regex, date_block):
+                date = match.group("date")
+            else:
+                # for ncctapp unpublished opinions
+                date = self.secondary_date_regex.search(date_block).group(
+                    "date"
+                )
+
+            self.cases.append(
+                {
+                    "author": author,
+                    "per_curiam": per_curiam,
+                    "summary": summary,
+                    "headnote": headnote,
+                    "status": status,
+                    "docket": docket,
+                    "name": name,
+                    "url": url,
+                    "date": date,
+                    "citation": state_cite,
+                }
+            )
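
Both the old and the new code pull the download URL out of an `onclick` attribute shaped like `viewOpinion("http://appellate.nccourts.org/opinions/?c=1&pdf=31511")`. As a rough illustration of that step in isolation, the sketch below uses a single regex instead of string splitting or fixed slicing. The `extract_opinion_url` helper and its choice to return `None` on a non-matching attribute are assumptions for the example, not part of this PR.

```python
import re

# Matches viewOpinion("...") case-insensitively and captures the quoted URL.
ONCLICK_RE = re.compile(r'viewOpinion\("(?P<url>[^"]+)"\)', re.IGNORECASE)


def extract_opinion_url(onclick):
    """Return the https URL inside a viewOpinion(...) call, or None."""
    match = ONCLICK_RE.search(onclick or "")
    if match is None:
        return None  # e.g. a withdrawn opinion with no download link
    # Upgrade only the scheme, leaving the rest of the URL untouched.
    return re.sub(r"^http:", "https:", match.group("url"))


url = extract_opinion_url(
    'viewOpinion("http://appellate.nccourts.org/opinions/?c=1&pdf=31511")'
)
```

Compared with slicing at fixed offsets, a pattern like this fails visibly (by returning `None`) if the site changes the function name or quoting.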

Contributor

I think we can simplify this entire function

    def _process_html(self):
        """
        """
        for row in self.html.xpath(self.row_xpath):
            title = row.xpath("string(span[@class='title'])")
            links = row.xpath("span[@class='title']/@onclick")
            summaries = row.xpath("span[@class='desc']/text()")
            summary = summaries[0] if summaries else ""
            if not links:
                logger.warning("No link for row %s", title)
                continue
            url = links[0][13:-2].replace("http:", "https:")
            m = re.search(r"(?P<name>.*),?\s?(?P<cite>\d+ NC \d+)? \((?P<docket>.*) - (?P<status>.*)\)", title)
            name, citation, docket, status = m.groups()

            author = row.xpath("string(span[@class='author']/i)").strip()
            per_curiam = author == "Per Curiam"
            date = row.xpath(self.date_xpath)[-1]

            if date == "Zip File of Published Opinions":
                date_parent = "../../preceding-sibling::tr//a/../text()"
                date = row.xpath(date_parent)[0].strip()[7:]
            elif not isinstance(date, str):
                date = date.xpath(".//text()")[1].split("\n")[0]

            self.cases.append({
                "per_curiam": per_curiam,
                "author": author if not per_curiam else "",
                "docket": docket,
                "status": status,
                "name": name,
                "url": url,
                "date": date,
                "summary": summary,
                "citation": citation if citation else "",
            })

to something like this, with an improved date XPath:
date_xpath = "../../preceding-sibling::tr//a/text()"
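
For readers following the title parsing, here is a small standalone sketch of splitting a title into name, citation, docket, and status with one regex. The title shape ("Name, 367 NC 12 (docket - status)") is assumed from the pattern in the suggestion above, the sample titles are illustrative, and `parse_title` here is a hypothetical helper. Unlike the suggested pattern, it uses a lazy name group so that the optional citation is not absorbed into the name.

```python
import re

# Assumed title shape: "<name>, <optional cite> (<docket> - <status>)"
TITLE_RE = re.compile(
    r"^(?P<name>.*?),?\s*"
    r"(?P<cite>\d+ NC \d+)?\s*"
    r"\((?P<docket>[^()]*?) - (?P<status>[^()]*)\)\s*$"
)


def parse_title(title):
    """Split a row title into its parts; raise ValueError if it does not fit."""
    match = TITLE_RE.search(title.strip())
    if match is None:
        raise ValueError(f"Unable to parse title: {title!r}")
    return {
        "name": match.group("name").strip(),
        "citation": match.group("cite") or "",  # citation is optional
        "docket": match.group("docket").strip(),
        "status": match.group("status").strip(),
    }


with_cite = parse_title("Fields v. Harnett Cnty., 367 NC 12 (13-761 - Published)")
without_cite = parse_title("Clark v. Clark, (13-612 - Unpublished)")
```

Anchoring the final parenthesized group to the end of the string means parentheses inside a case name do not get mistaken for the docket block.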

Contributor

There is some wonkiness around dates for a few edge cases, but I think this code is cleaner and works for both sc and coa.
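
One way to contain date edge cases like these is to isolate date parsing in its own function that tries an ordered list of patterns and fails loudly when none match. The sketch below is only an illustration of that fallback idea: the two date formats, the sample strings, and the `parse_date_block` name are assumptions, not the formats the court site actually uses.

```python
import re
from datetime import date, datetime

# Hypothetical (pattern, strptime format) pairs, tried in order.
DATE_FORMATS = [
    (re.compile(r"(?P<date>\d{1,2} [A-Za-z]+ \d{4})"), "%d %B %Y"),
    (re.compile(r"(?P<date>\d{1,2}/\d{1,2}/\d{4})"), "%m/%d/%Y"),
]


def parse_date_block(text):
    """Return the first date found in `text`, trying each known format."""
    for pattern, fmt in DATE_FORMATS:
        match = pattern.search(text)
        if match:
            return datetime.strptime(match.group("date"), fmt).date()
    # Raising keeps a non-date row (e.g. a zip-file link) from being
    # silently stored as a date.
    raise ValueError(f"No recognizable date in: {text!r}")


primary = parse_date_block("Opinions filed 21 March 2025")
fallback = parse_date_block("Unpublished opinions 03/21/2025")
```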

@flooie flooie assigned Luis-manzur and unassigned flooie Jun 5, 2025
flooie commented Jun 9, 2025

Contributor

@Luis-manzur what is the status here?

@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Jun 9, 2025
# Conflicts:
#	CHANGES.md
@Luis-manzur

Contributor

I Updated the

@Luis-manzur what is the status here?

I made all the recommended changes.

Update tests
Move code to nc when not needed in ncctapp
fix docstrings
Move date parsing to its own function
Enable summaries in both courts
Fix citation extraction
@flooie flooie merged commit 2b169ee into main Jun 9, 2025
9 checks passed
@flooie flooie deleted the 1373-update-nc branch June 9, 2025 20:52
@github-project-automation github-project-automation Bot moved this from PRs to Review to Done in Sprint (Case Law) Jun 9, 2025

3 participants