Fix parsing problems when silver balls (NEF links) are present. #1784

Open
johnhawkinson wants to merge 6 commits into freelawproject:main from johnhawkinson:2026.01.28-silverballs

Conversation

@johnhawkinson
Contributor

When a docket report is run with the Notices of Electronic Filing checkbox enabled, silver ball links appear in the document number field of the docket report, which breaks our parser.
This checkbox is only available to filing users and non-filing (non-PACER) users in some districts, and it's not usually the default, but it apparently is in D.Minn. Hence my docket reports were not appearing in RECAP.

Fix the parser by ignoring "view" if it appears as text in the document number field; properly raise a ValueError if there is a problematic document number rather than silently throwing away the docket entry; document some of the fragility problems with the parser (that can lead to things like ['pacer_doc_id': 'Dis0layReceipt.pl']); clarify some pretty confusing code; and improve docstrings.

Add a test.
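
As a rough illustration of the text-based fix described above, here is a minimal sketch that works on the text nodes of the document number cell. The function name, the `NEF_LINK_TEXT` constant, and the rule that a valid document number is all digits are assumptions made for this example; they are not taken from the juriscraper source.

```python
import re
from typing import Iterable, Optional

# Assumed label text behind the silver-ball (NEF) icon.
NEF_LINK_TEXT = "view"


def document_number_from_text_nodes(text_nodes: Iterable[str]) -> Optional[str]:
    """Return the first word in the cell that is not the NEF label.

    Returns None when the cell holds no words (an unnumbered entry) and
    raises ValueError when the first remaining word is not all digits.
    """
    words = []
    for node in text_nodes:
        # str.split() with no arguments also splits on non-breaking spaces.
        words.extend(node.split())
    words = [w for w in words if w.lower() != NEF_LINK_TEXT]
    if not words:
        return None
    candidate = words[0]
    if not re.fullmatch(r"\d+", candidate):
        # Fail loudly instead of silently dropping the docket entry.
        raise ValueError(f"Unexpected document number: {candidate!r}")
    return candidate


print(document_number_from_text_nodes(["view", "4", "\xa0"]))
```

Raising here means a caller sees the malformed row rather than quietly losing the docket entry.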

johnhawkinson marked this pull request as draft January 28, 2026 08:37
Do not ever use \\ inside a plain string when you can use \ inside an
r'raw string'.
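
A small illustration of the point in this commit message; the pattern itself is only an example and is not taken from the parser.

```python
import re

escaped = "\\d+\\s*"  # plain string: each backslash has to be doubled
raw = r"\d+\s*"       # raw string: each backslash is written once

# Both literals produce the same pattern text.
print(escaped == raw)
print(re.fullmatch(raw, "42  ") is not None)
```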
Consolidate nbsp regexps into WHITESPACE_WITH_NBSP, don't just repeat
it twice.
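
A sketch of what sharing one whitespace fragment between patterns could look like. Only the name `WHITESPACE_WITH_NBSP` comes from the commit message; its definition here, the two compiled patterns, and `strip_ws` are assumptions for illustration, and the real constant in juriscraper may differ.

```python
import re

# Assumed definition. For str patterns Python's \s already matches U+00A0,
# so naming the non-breaking space explicitly mainly documents the intent.
WHITESPACE_WITH_NBSP = r"[\s\u00a0]"

# Two patterns built from the one shared fragment instead of repeating it.
LEADING = re.compile(rf"^{WHITESPACE_WITH_NBSP}+")
TRAILING = re.compile(rf"{WHITESPACE_WITH_NBSP}+$")


def strip_ws(text: str) -> str:
    """Strip ordinary and non-breaking whitespace from both ends."""
    return TRAILING.sub("", LEADING.sub("", text))


print(repr(strip_ws("\xa0 4 \xa0")))
```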
Too much mental juggling was required.
Adds more lines of code, but makes it more clear and maintainable.
Previously we silently dropped docket entries, instead of flagging
this as an error. Bad.
Don't fail when the Notices of Electronic Filing checkbox is on, which
produces "silver ball" icons with the text "view" behind them, leading
to Document Number "view" when we take the first text word of the
table cell.

Instead, ignore the word "view", and add a discussion of potential
other approaches that require looking at more than text nodes.
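
One of the other approaches alluded to here, looking at the anchors rather than only the text nodes, might be sketched as below. This uses the standard-library `html.parser` so that it is self-contained; the class and function names are hypothetical, the sample cell is the mnd cell quoted in the review thread with its `rel` and `onclick` attributes trimmed, and matching on `DisplayReceipt.pl` in the href is an assumption about how receipt links can be recognized.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a fragment."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def document_anchor(cell_html):
    """Return the first (href, text) pair that is not an NEF receipt link."""
    parser = AnchorCollector()
    parser.feed(cell_html)
    parser.close()
    for href, text in parser.anchors:
        if "DisplayReceipt.pl" not in href:
            return href, text
    return None


CELL = (
    '<td align="right"><span class="iconContainer">'
    '<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26">'
    '<span class="receiptLink">view</span></a>'
    '<a href="https://ecf.mnd.uscourts.gov/doc1/101111363917">4</a>'
    "</span>&#160;</td>"
)

print(document_anchor(CELL))
# -> ('https://ecf.mnd.uscourts.gov/doc1/101111363917', '4')
```

Selecting the anchor by its target keeps the document number and its link together, at the cost of depending on the receipt URL's shape.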

Silver balls are an option only presented to non-PACER ECF accounts (filing
users, generally), and in some courts the checkbox is enabled by default.

Add a test docket sheet (mnd)
There exist unnumbered bankruptcy entries that also link to attached
PDFs. Previously we threw away the entire docket entry, silently.
Now we include it, but we throw away the attachment link (xxx).

A better fix requires a schema change.

Adjust test results for same (lawb_18072.json).
Move comment from interior to docstring and then document the heck out
of how fragile this method is.
johnhawkinson force-pushed the 2026.01.28-silverballs branch from 58b3acb to ead3d6a January 28, 2026 09:44
johnhawkinson marked this pull request as ready for review January 28, 2026 09:52
@johnhawkinson
Contributor Author

Oops, I meant to include the screenshots here.

[Image: Screen Shot 2026-01-27 at 13 12 32] [Image: Screen Shot 2026-01-27 at 13 10 33]

Member

@mlissner mlissner left a comment

Looks pretty good. A couple thoughts/questions for you. Can you also do release notes here, please?

Thank you!

Comment thread juriscraper/pacer/docket_report.py
Comment on lines +1589 to +1607
#
_ = """
# This code can go terribly wrong, resulting in things like:
[ 'pacer_doc_id': 'Dis0layReceipt.pl' ]
# which occurred when this was invoked with document_number as 'view'.

# This happened when the anchors list began:
(Pdb) tostring(anchors[0])
b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>'

# which came from this:
(Pdb) tostring(cell)
b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span>&#160;</td>'


# This code was designed to deal with txsb:
<a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script>

""" # noqa
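
The odd-looking value 'Dis0layReceipt.pl' is what one would get if the ID were derived by taking the last path segment of the first anchor's href and forcing its fourth character to '0'. The helper below is a hypothetical reconstruction that reproduces the value quoted above; the actual juriscraper helper is not shown in this thread and may differ.

```python
def doc_id_from_url(url: str) -> str:
    """Hypothetical: last path segment with its fourth character set to '0'."""
    last_segment = url.split("?", 1)[0].rsplit("/", 1)[-1]
    return last_segment[:3] + "0" + last_segment[4:]


# The receipt link that came first in the anchors list:
print(doc_id_from_url(
    "https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26"
))
# -> Dis0layReceipt.pl

# The document link that should have been used instead:
print(doc_id_from_url("https://ecf.mnd.uscourts.gov/doc1/101111363917"))
# -> 101011363917
```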
Member

I haven't seen this style before.

Suggested change
#
_ = """
# This code can go terribly wrong, resulting in things like:
[ 'pacer_doc_id': 'Dis0layReceipt.pl' ]
# which occurred when this was invoked with document_number as 'view'.
# This happened when the anchors list began:
(Pdb) tostring(anchors[0])
b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>'
# which came from this:
(Pdb) tostring(cell)
b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span>&#160;</td>'
# This code was designed to deal with txsb:
<a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script>
""" # noqa
"""
# This code can go terribly wrong, resulting in things like:
[ 'pacer_doc_id': 'Dis0layReceipt.pl' ]
# which occurred when this was invoked with document_number as 'view'.
# This happened when the anchors list began:
(Pdb) tostring(anchors[0])
b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>'
# which came from this:
(Pdb) tostring(cell)
b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span>&#160;</td>'
# This code was designed to deal with txsb:
<a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script>
"""

What's this approach for? Can you fix it here and elsewhere?

Contributor Author

@johnhawkinson johnhawkinson Jan 29, 2026

The use of the triple-quoted strings as comments is for lengthy comments that exceed the linter's line-length constraints.
The _ = """ assignment is unnecessary; I guess it's just a habit I can't explain.
But the # noqa is necessary to pacify linters like flake8.
Without it:

(juriscraper) jhawk@lrr juriscraper % flake8 juriscraper/pacer/docket_report.py > /tmp/d1
# (manual edit to remove "noqa" from line 1671)
(juriscraper) jhawk@lrr juriscraper % flake8 juriscraper/pacer/docket_report.py > /tmp/d2
(juriscraper) jhawk@lrr juriscraper % diff -u /tmp/d[12]
--- /tmp/d1	2026-01-28 22:22:39.156195760 -0500
+++ /tmp/d2	2026-01-28 22:22:54.287570031 -0500
@@ -14,3 +14,12 @@
 juriscraper/pacer/docket_report.py:1359:80: E501 line too long (80 > 79 characters)
 juriscraper/pacer/docket_report.py:1390:80: E501 line too long (90 > 79 characters)
 juriscraper/pacer/docket_report.py:1391:80: E501 line too long (87 > 79 characters)
+juriscraper/pacer/docket_report.py:1650:80: E501 line too long (147 > 79 characters)
+juriscraper/pacer/docket_report.py:1656:80: E501 line too long (148 > 79 characters)
+juriscraper/pacer/docket_report.py:1660:80: E501 line too long (276 > 79 characters)
+juriscraper/pacer/docket_report.py:1664:80: E501 line too long (544 > 79 characters)
+juriscraper/pacer/docket_report.py:1666:1: W191 indentation contains tabs
+juriscraper/pacer/docket_report.py:1666:1: E101 indentation contains mixed spaces and tabs
+juriscraper/pacer/docket_report.py:1667:1: W191 indentation contains tabs
+juriscraper/pacer/docket_report.py:1667:1: E101 indentation contains mixed spaces and tabs
+juriscraper/pacer/docket_report.py:1669:80: E501 line too long (118 > 79 characters)

I realize you're now using "Ruff" (which confusingly doesn't seem to do proper linting in my dev environment and I'm not sure why), but I would like my code to pass flake8.

(Although the spaces/tabs probably should be fixed)

Member

Cool. Let's just drop the _ = """ business, please.
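
For readers comparing the two styles discussed in this thread, the sketch below uses the standard-library `ast` module to show how Python parses each one. The sample function is invented for the example. A bare triple-quoted string in the middle of a function body is an expression statement whose value is discarded, while the `_ = """..."""` form is an ordinary assignment to a throwaway name.

```python
import ast

SOURCE = '''
def parse():
    x = 1
    """A bare string used as a long comment."""  # noqa
    _ = """The same text, but bound to a throwaway name."""  # noqa
    return x
'''

# Statements inside the body of parse().
body = ast.parse(SOURCE).body[0].body
kinds = [type(node).__name__ for node in body]
print(kinds)
# -> ['Assign', 'Expr', 'Assign', 'Return']
```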

2 participants