fix(nc): update to OpinionSiteLinear and new site by grossir · Pull Request #1381 · freelawproject/juriscraper

grossir · 2025-04-23T17:39:28Z

flooie · 2025-04-28T21:12:26Z

This is good, but the summaries data isn't quite the right category. It lists the metadata associated with the case but isn't a summary of the case. They actually have robust headnotes, but as far as I can tell there is no way to link the headnotes for a case from the search engine provided. I emailed the court asking if there is a way and will report back.

If we keep the "summary" information, though, I think it should go into the headnotes field, but it should be cleaned up, or at least titled better.

grossir · 2025-04-29T15:42:10Z

I will put this in "Blocked" until we get the court's answer

From your comment, I noticed that the field I was collecting as "summary" makes more sense in nc.

"Whether unacceptable personal conduct provided just cause for termination of a state employee.",
"Whether the Business Court erred in ruling that the parties' intrafamilial dispute fell outside the scope of their arbitration agreement with Charles Schwab."

This should be an OpinionCluster.headnote in CourtListener, right? From CL:

Headnotes are summary descriptions of the legal issues discussed by the court in the particular case. They appear at the beginning of each case just after the summary and disposition. They are short paragraphs with a heading in bold face type. From Wikipedia - A headnote is a brief summary of a particular point of law that is added to the text of a court decision to aid readers in locating discussion of a legal issue in an opinion. As the term implies, headnotes appear at the beginning of the published opinion. Frequently, headnotes are value-added components appended to decisions by the publisher who compiles the decisions of a court for resale. As handed down by the court, a decision or written opinion does not contain headnotes. These are added later by an editor not connected to the court, but who instead works for a legal publishing house.

For ncctapp it doesn't look like a summary at all, so I think I will just delete it. Examples:

Domestic Relations; Attorney Fees Award; American Rule; N.C. Gen. Stat. § 50-13.6 (2023). Required Findings; N.C. Rev. R. Prof. Conduct 1.5(a)-(b); Amount and Reasonableness of time spent
first-degree felony murder; assault with a deadly weapon inflicting serious injury; jury instructions; voluntary manslaughter; rule of lenity; hit and run; N.C. Gen. Stat. 20-166; motion to dismiss; accessory after the fact; N.C. Gen. Stat. 14-7; fatal variance; ex mero motu

flooie · 2025-04-30T16:16:44Z

So I think headnotes are too nice an opportunity not to include. Tell me what you think of this.
With something like this, it's possible to call the headnotes digest index, match the opinion URLs we find against the digest, and include the headnotes directly in our capture. It adds just one call.

Currently we aren't set up for headnotes, but that should be an easy addition in opinion site.

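    # needs: from lxml.html import fromstring
    def _download(self, request_dict={}):
        # Outside of tests (self.html not preloaded), fetch the headnotes
        # digest index once before running the regular download
        if self.html is None:
            url = "https://appellate.nccourts.org/opinion-filings/digested-index.php?iCourtNumber=1&sFilingYear=2025"
            r = self.request["session"].get(url)
            self.headnote_html = fromstring(r.text)
        return super()._download(request_dict)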

and in process_html

        url = url.replace("http:", "https:")
        # divs = self.headnote_html.xpath(f'//div[.//a[@href="{url}"]]')
        p_elt = self.headnote_html.xpath(f'(//a[@href="{url}"]/ancestor::p)[1]')[0]
        # print(tostring(p_elt, pretty_print=True).decode())
        inner_html = "".join(tostring(child, encoding="unicode") for child in p_elt)

@grossir
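For reference, here is a minimal sketch of the lookup described above: given an opinion URL, find the digest paragraph that links to it and split it into a bold heading and the remaining text. It is only an illustration under stated assumptions. The `DIGEST` markup, the `example.invalid` URLs, and the `find_headnote` helper are hypothetical, and it uses the standard library's `xml.etree.ElementTree` with a parent map as a stand-in for the lxml `ancestor::p` XPath so that it runs without dependencies. The real digest page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the digested index page.
DIGEST = """
<div>
  <p><b>Arbitration</b> Scope of agreement.
     <a href="https://example.invalid/opinions/?c=1&amp;pdf=100">Doe v. Roe</a></p>
  <p><b>Public employees</b> Just cause for termination.
     <a href="https://example.invalid/opinions/?c=1&amp;pdf=200">Smith v. Agency</a></p>
</div>
"""


def find_headnote(digest_root, url):
    """Return (heading, text) for the <p> that links to url, or None."""
    # The search page may hand back http: links; the digest is assumed to use https:
    url = url.replace("http:", "https:")
    # ElementTree has no ancestor axis, so build a child -> parent map
    parents = {child: parent for parent in digest_root.iter() for child in parent}
    for a in digest_root.iter("a"):
        if a.get("href") != url:
            continue
        # Walk up until we reach the enclosing <p>, if there is one
        node = a
        while node is not None and node.tag != "p":
            node = parents.get(node)
        if node is None:
            return None
        bold = node.find("b")
        if bold is None:
            return "", (node.text or "").strip()
        # bold.text is the heading; bold.tail is the text that follows </b>
        return (bold.text or "").strip(), (bold.tail or "").strip()
    return None


root = ET.fromstring(DIGEST)
print(find_headnote(root, "http://example.invalid/opinions/?c=1&pdf=200"))
print(find_headnote(root, "https://example.invalid/opinions/?c=1&pdf=999"))
```

A URL that is missing from the digest simply yields `None`, so the scraper could fall back to empty headnote fields for that row.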


flooie · 2025-06-05T17:06:54Z

@Luis-manzur is this still a draft?

flooie

In theory I like the idea of collecting headnotes, but in practice I don't think we should include them here. They appear to publish the headnotes digest months and months after they publish opinions. This means that in reality we will never actually collect them during a scrape, and we have no mechanism right now to merge headnotes back into the system.

What it does do is overly complicate the tool.

flooie · 2025-06-05T23:28:23Z

        """Goes into OpinionCluster.attorneys, type: string"""
        return self._get_optional_field_by_id("attorney")

+    def _get_headnotes(self):


Drop headnotes from this PR ...

and we should add them in a different PR if we want to include headnotes.

flooie · 2025-06-05T23:28:49Z

-                # Like: viewOpinion("http://appellate.nccourts.org/opinions/?c=1&amp;pdf=31511")
-                if len(urls) != 1 or urls[0].find("viewOpinion") != 0:
-                    continue  # Only interested in cases with a download link
+    def _process_html(self):


docstring format is not standard juriscraper

this is not resolved

flooie · 2025-06-05T23:31:56Z

+        Iterates over each row in the HTML, extracting the title, link, summary, headnote,
+        docket, status, author, per curiam status, date, and citation for each opinion.
+        Handles cases where opinions may be withdrawn (no link), and parses additional
+        information from the headnote HTML if available.

-                path = "./td/span/span[contains(@class,'title')]"
-                txt = html.tostring(
-                    row_el.xpath(path)[0], method="text", encoding="unicode"
-                )
-                case_name, neutral_cite, docket_number = self.parse_title(txt)
-
-                summary = ""
-                path = "./td/span/span[contains(@class,'desc')]/text()"
-                summaries = row_el.xpath(path)
-                try:
-                    summary = summaries[0]
-                except IndexError:
-                    # Not all cases have a summary
-                    pass
-                if case_name.strip() == "":
-                    continue  # A few cases are missing a name
-
-                case_dates.append(case_date)
-                self.my_download_urls.append(download_url)
-                self.my_case_names.append(case_name)
-                self.my_docket_numbers.append(docket_number)
-                self.my_summaries.append(summary)
-                self.my_neutral_citations.append(neutral_cite)
-                self.my_precedential_statuses.append(precedential_status)
-
-            elif precedential_status == "Unpublished":
-                for span in row_el.xpath("./td/span"):
-                    if "onclick" not in span.attrib:
-                        continue
-                    download_url = re.search(
-                        r'viewopinion\("(.*)"',
-                        span.attrib["onclick"],
-                        re.IGNORECASE,
-                    ).group(1)
-
-                    txt = span.text_content().strip()
-                    (
-                        case_name,
-                        neutral_cite,
-                        docket_number,
-                    ) = self.parse_title(txt)
-                    if case_name.strip() == "":
-                        continue  # A few cases are missing a name
-                    case_dates.append(case_date)
-                    self.my_download_urls.append(download_url)
-                    self.my_case_names.append(case_name)
-                    self.my_docket_numbers.append(docket_number)
-                    self.my_summaries.append("")
-                    self.my_neutral_citations.append(neutral_cite)
-                    self.my_precedential_statuses.append(precedential_status)
-
-        return case_dates
-
-    # Parses case titles like:
-    # Fields v. Harnett Cnty., 367 NC 12 (13-761)
-    # Clark v. Clark,  (13-612)
-    @staticmethod
-    def parse_title(txt):
-        try:
-            name_and_citation = txt.rsplit("(", 1)[0].strip()
-            docket_number = (
-                re.search(r"(.*\d).*?", txt.rsplit("(", 1)[1]).group(0).strip()
-            )
-            case_name = name_and_citation.rsplit(",", 1)[0].strip()
-            try:
-                neutral_cite = name_and_citation.rsplit(",", 1)[1].strip()
-                if not re.search(r"^\d\d.*\d\d$", neutral_cite):
-                    neutral_cite = ""
-            except IndexError:
-                # Unable to find comma to split on. No neutral cite.
-                neutral_cite = ""
-        except Exception:
-            raise InsanityException(
-                f"Unable to parse: {txt}\n{traceback.format_exc()}"
-            )
-        return case_name, neutral_cite, docket_number
-
-    def _get_download_urls(self):
-        return self.my_download_urls
+        Appends a dictionary of extracted case data to self.cases for each valid row.
+        """
+        for row in self.html.xpath(self.row_xpath):
+            title = row.xpath("string(span[@class='title'])")

-    def _get_case_names(self):
-        return self.my_case_names
+            link = row.xpath("span[@class='title']/@onclick")
+            if not link:
+                # some opinions may be withdrawn
+                logger.warning("No link for row %s", title)
+                continue

-    def _get_docket_numbers(self):
-        return self.my_docket_numbers
+            url = link[0].split('("')[1].strip('")')

-    def _get_summaries(self):
-        return self.my_summaries
+            summary = (
+                row.xpath("string(span[@class='desc'])")
+                if self.collect_summary
+                else ""
+            )
+            headnote = (
+                ""
+                if self.collect_summary
+                else row.xpath("string(span[@class='desc'])")
+            )

-    def _get_citations(self):
-        return self.my_neutral_citations
+            url = url.replace("http:", "https:")
+            divs = self.headnote_html.xpath(
+                f'(//a[@href="{url}"]/ancestor::p)[1]'
+            )
+            if divs:
+                p_elt = divs[0]
+                all_text = p_elt.xpath("text()")
+
+                summary = "".join(
+                    text.replace("—", "")
+                    for text in all_text
+                    if not (text.startswith("<b>") or text.startswith("<a"))
+                ).strip()
+                headnote = p_elt.xpath("./b//text()")[0]
+
+            match = re.search(self.title_regex, title)
+            name = title[: match.start()].strip(" ,")
+
+            state_cite = ""
+            if cite_match := re.search(self.state_cite_regex, name):
+                state_cite = cite_match.group(0)
+                name = name[: cite_match.start()].strip(" ,")
+
+            docket = match.group("docket")
+            status = match.group("status")
+
+            author = row.xpath("string(span[@class='author']/i)").strip()
+            per_curiam = False
+            if author.lower() == "per curiam":
+                per_curiam = True
+                author = ""
+
+            # pick the last preceding-sibling, the most recent date block
+            date_block = row.xpath(self.date_xpath)[-1].text_content()
+
+            if match := re.search(self.date_regex, date_block):
+                date = match.group("date")
+            else:
+                # # for ncctapp unpublished opinions
+                date = self.secondary_date_regex.search(date_block).group(
+                    "date"
+                )

-    def _get_precedential_statuses(self):
-        return self.my_precedential_statuses
+            self.cases.append(
+                {
+                    "author": author,
+                    "per_curiam": per_curiam,
+                    "summary": summary,
+                    "headnote": headnote,
+                    "status": status,
+                    "docket": docket,
+                    "name": name,
+                    "url": url,
+                    "date": date,
+                    "citation": state_cite,
+                }
+            )


I think we can simplify this entire function

    def _process_html(self):
        """ """
        for row in self.html.xpath(self.row_xpath):
            title = row.xpath("string(span[@class='title'])")
            links = row.xpath("span[@class='title']/@onclick")
            summaries = row.xpath("span[@class='desc']/text()")
            summary = summaries[0] if summaries else ""
            if not links:
                logger.warning("No link for row %s", title)
                continue
            url = links[0][13:-2].replace("http:", "https:")
            m = re.search(r"(?P<name>.*),?\s?(?P<cite>\d+ NC \d+)? \((?P<docket>.*) - (?P<status>.*)\)", title)
            name, citation, docket, status = m.groups()
            author = row.xpath("string(span[@class='author']/i)").strip()
            per_curiam = True if author == "Per Curiam" else False
            date = row.xpath(self.date_xpath)[-1]
            if date == "Zip File of Published Opinions":
                date_parent = "../../preceding-sibling::tr//a/../text()"
                date = row.xpath(date_parent)[0].strip()[7:]
            elif not isinstance(date, str):
                date = date.xpath(".//text()")[1].split("\n")[0]
            self.cases.append({
                "per_curiam": per_curiam,
                "author": author if not per_curiam else "",
                "docket": docket,
                "status": status,
                "name": name,
                "url": url,
                "date": date,
                "summary": summary,
                "citation": citation if citation else "",
            })

to something like this, with an improved date XPath:
date_xpath = "../../preceding-sibling::tr//a/text()"

There is some wonkiness around dates for a few edge cases, but I think this code is cleaner and works for both sc and coa.
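To make the title and link parsing in the suggestion easier to check, here is a small standalone sketch. The sample titles are hypothetical, with a shape assumed from the regex itself (name, optional "NNN NC NNN" citation, then "(docket - status)"). The `LAZY` pattern and the `parse_title` and `url_from_onclick` helpers are illustrative names, not part of the PR. The onclick string follows the `viewOpinion("...")` form shown in the removed code. One thing the sketch shows is that with a greedy `name` group the optional citation is absorbed into the name for a title of this shape, while a lazy, anchored variant separates it.

```python
import re

# Pattern from the suggestion above, unchanged.
GREEDY = re.compile(
    r"(?P<name>.*),?\s?(?P<cite>\d+ NC \d+)? \((?P<docket>.*) - (?P<status>.*)\)"
)
# Variant with a lazy name group and anchors (an assumption, not from the PR).
LAZY = re.compile(
    r"^(?P<name>.*?),?\s?(?P<cite>\d+ NC \d+)? \((?P<docket>.*) - (?P<status>.*)\)$"
)


def parse_title(pattern, title):
    """Return (name, citation, docket, status), or None if the title does not match."""
    m = pattern.search(title)
    if m is None:
        return None
    name, cite, docket, status = m.groups()
    return name, cite or "", docket, status


def url_from_onclick(onclick):
    """Pull the URL out of viewOpinion("...") without relying on fixed offsets."""
    m = re.search(r'viewOpinion\("([^"]+)"\)', onclick, re.IGNORECASE)
    return m.group(1).replace("http:", "https:") if m else None


# Hypothetical title strings
with_cite = "Fields v. Harnett Cnty., 367 NC 12 (13-761 - Published)"
without_cite = "Clark v. Clark (13-612 - Unpublished)"

print(parse_title(GREEDY, with_cite))   # citation stays inside the name
print(parse_title(LAZY, with_cite))     # citation captured separately
print(parse_title(LAZY, without_cite))  # empty citation
print(url_from_onclick('viewOpinion("http://appellate.nccourts.org/opinions/?c=1&pdf=31511")'))
```

Returning `None` for a non-matching title or onclick value lets the caller log and skip the row instead of raising on `m.groups()`.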

…TML processing

flooie · 2025-06-09T17:20:33Z

@Luis-manzur what is the status here?

# Conflicts: # CHANGES.md

Luis-manzur · 2025-06-09T19:15:04Z

I Updated the

@Luis-manzur what is the status here?

I did all the recommended changes

- Update tests
- Move code to nc when not needed in ncctapp
- Fix docstrings
- Move date parsing to its own function
- Enable summaries in both courts
- Fix citation extraction

fix(nc): update to OpinionSiteLinear and new site

adbd7e9

Solves #1373

grossir added this to Sprint (Case Law) Apr 23, 2025

grossir moved this to PRs to Review in Sprint (Case Law) Apr 23, 2025

grossir assigned flooie Apr 23, 2025

flooie assigned grossir and unassigned flooie Apr 28, 2025

grossir moved this from PRs to Review to Blocked in Sprint (Case Law) Apr 29, 2025

Merge branch 'main' into 1373-update-nc

f2a1afa

Merge branch 'main' into 1373-update-nc

cb4725e

grossir moved this from Blocked to In progress in Sprint (Case Law) Apr 30, 2025

flooie assigned Luis-manzur and unassigned grossir Jun 2, 2025

flooie moved this from In progress to PRs to Review in Sprint (Case Law) Jun 2, 2025

flooie marked this pull request as draft June 2, 2025 19:36

Luis-manzur added 5 commits June 3, 2025 15:29

Merge branch 'main' into 1373-update-nc

450bbfb

# Conflicts: # CHANGES.md # juriscraper/opinions/united_states/state/nc.py # juriscraper/opinions/united_states/state/ncctapp.py

chore: add nc update to CHANGES.md

a919fae

feat: enhance NC scraper to collect headnotes and summaries

3945808

feat: update NC scraper to support test mode with headnote HTML files

f637e14

feat: add detailed docstrings for HTML processing and downloading met…

e550c21

…hods in NC scraper

Luis-manzur requested a review from flooie June 3, 2025 23:08

Luis-manzur assigned flooie and unassigned Luis-manzur Jun 3, 2025

fix: improve ncctapp testing time

91dcbc0

Luis-manzur marked this pull request as ready for review June 5, 2025 17:36