Fix parsing problems when silver balls (NEF links) are present. #1784
johnhawkinson wants to merge 6 commits into
Do not ever use \\ inside a plain string when you can use \ inside an r'raw string'. Consolidate the nbsp regexps into WHITESPACE_WITH_NBSP instead of repeating the same pattern twice.
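A minimal sketch of the two points in this commit message, not the code from the PR. Only the name WHITESPACE_WITH_NBSP comes from the commit message; the pattern assigned to it and the helper normalize_whitespace are assumptions made for illustration.

```python
import re

# Escaped backslashes in a plain string and a raw string spell the same
# pattern; the raw form is simply easier to read.
PLAIN = "\\s+"
RAW = r"\s+"
assert PLAIN == RAW

# Hypothetical definition: one shared pattern for runs of whitespace and
# non-breaking spaces, defined once instead of being repeated inline.
# (For str patterns \s already matches U+00A0; it is spelled out here
# only to make the intent explicit.)
WHITESPACE_WITH_NBSP = r"[\s\u00a0]+"


def normalize_whitespace(text):
    """Collapse whitespace/NBSP runs to single spaces (illustrative helper)."""
    return re.sub(WHITESPACE_WITH_NBSP, " ", text).strip()


print(normalize_whitespace("Docket\u00a0 Text\u00a0\u00a0here"))  # Docket Text here
```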
Too much mental juggling was required. Adds more lines of code, but makes it more clear and maintainable.
Previously we silently dropped docket entries, instead of flagging this as an error. Bad.
Don't fail when the Notices of Electronic Filing checkbox is on, which produces "silver ball" icons with the text "view" behind them, leading to Document Number "view" when we take the first text word of the table cell. Instead, ignore the word "view", and add a discussion of potential other approaches that require looking at more than text nodes. Silver balls are an option only presented to non-PACER ECF accounts (filing users, generally), and in some courts the checkbox is enabled by default. Add a test docket sheet (mnd)
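The approach described in this commit message can be sketched as follows. This is an illustration rather than the parser's actual code: it uses the standard library's html.parser instead of the project's own HTML tooling, and the names CellText and document_number_from_cell, the digits-only check, and the empty-string return for an unnumbered entry are all assumptions.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the whitespace-separated words from a cell's text nodes."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # str.split() with no arguments also splits on non-breaking spaces.
        self.words.extend(data.split())


def document_number_from_cell(cell_html):
    """Return the first text word of the cell, skipping the NEF link text."""
    parser = CellText()
    parser.feed(cell_html)
    parser.close()
    # The silver ball link carries the text "view"; it is not a number.
    words = [word for word in parser.words if word != "view"]
    if not words:
        return ""  # assumed convention for an unnumbered entry
    if not words[0].isdigit():
        # Flag the problem instead of silently dropping the docket entry.
        raise ValueError(f"Unexpected document number: {words[0]!r}")
    return words[0]


# Simplified version of the mnd cell quoted in this PR (onclick omitted).
CELL = (
    '<td align="right"><span class="iconContainer">'
    '<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26">'
    '<span class="receiptLink">view</span></a>'
    '<a href="https://ecf.mnd.uscourts.gov/doc1/101111363917">4</a>'
    "</span>\u00a0</td>"
)
print(document_number_from_cell(CELL))  # 4
```

Filtering on the literal word "view" is the text-only fix; an approach that inspects the anchors themselves (for example the receiptLink class) would need more than text nodes, as the commit message notes.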
There exist unnumbered bankruptcy entries that also link to attached PDFs. Previously we threw away the entire docket entry, silently. Now we include it, but we throw away the attachment link (xxx). A better fix requires a schema change. Adjust test results for same (lawb_18072.json).
Move comment from interior to docstring and then document the heck out of how fragile this method is.
58b3acb to ead3d6a
mlissner
left a comment
Looks pretty good. A couple thoughts/questions for you. Can you also do release notes here, please?
Thank you!
| # | |
| _ = """ | |
| # This code can go terribly wrong, resulting in things like: | |
| [ 'pacer_doc_id': 'Dis0layReceipt.pl' ] | |
| # which occurred when this was invoked with document_number as 'view'. | |
| # This happened when the anchors list began: | |
| (Pdb) tostring(anchors[0]) | |
| b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>' | |
| # which came from this: | |
| (Pdb) tostring(cell) | |
| b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span> </td>' | |
| # This code was designed to deal with txsb: | |
| <a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script> | |
| """ # noqa | |
I haven't seen this style before.
| # | |
| _ = """ | |
| # This code can go terribly wrong, resulting in things like: | |
| [ 'pacer_doc_id': 'Dis0layReceipt.pl' ] | |
| # which occurred when this was invoked with document_number as 'view'. | |
| # This happened when the anchors list began: | |
| (Pdb) tostring(anchors[0]) | |
| b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>' | |
| # which came from this: | |
| (Pdb) tostring(cell) | |
| b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span> </td>' | |
| # This code was designed to deal with txsb: | |
| <a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script> | |
| """ # noqa | |
| """ | |
| # This code can go terribly wrong, resulting in things like: | |
| [ 'pacer_doc_id': 'Dis0layReceipt.pl' ] | |
| # which occurred when this was invoked with document_number as 'view'. | |
| # This happened when the anchors list began: | |
| (Pdb) tostring(anchors[0]) | |
| b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>' | |
| # which came from this: | |
| (Pdb) tostring(cell) | |
| b'<td align="right"><span class="iconContainer"><a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a><a href="https://ecf.mnd.uscourts.gov/doc1/101111363917" onclick="goDLS(\'/doc1/101111363917\',\'230820\',\'26\',\'\',\'1\',\'1\',\'\',\'\',\'\');return(false);" rel="noopener noreferrer">4</a></span> </td>' | |
| # This code was designed to deal with txsb: | |
| <a href='/cgi-bin/show_doc.pl?caseid=322636&de_seq_num=2&dm_id=21705446&doc_num=1&pdf_header=0' id='documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0'>1</a><script>DocLink('documentKcaseidV322636Kde_seq_numV2Kdm_idV21705446Kdoc_numV1Kpdf_headerV0');</script> | |
| """ |
What's your approach for? Can you fix it here and elsewhere?
The use of the triple-quoted strings as comments is for lengthy comments that exceed the linter's line-length constraints.
The _ = """ assignment is unnecessary, I guess it's just a habit I can't explain.
But the # noqa is necessary to pacify linters like flake8.
Without it:
(juriscraper) jhawk@lrr juriscraper % flake8 juriscraper/pacer/docket_report.py > /tmp/d1
# (manual edit to remove "noqa" from line 1671)
(juriscraper) jhawk@lrr juriscraper % flake8 juriscraper/pacer/docket_report.py > /tmp/d2
(juriscraper) jhawk@lrr juriscraper % diff -u /tmp/d[12]
--- /tmp/d1 2026-01-28 22:22:39.156195760 -0500
+++ /tmp/d2 2026-01-28 22:22:54.287570031 -0500
@@ -14,3 +14,12 @@
juriscraper/pacer/docket_report.py:1359:80: E501 line too long (80 > 79 characters)
juriscraper/pacer/docket_report.py:1390:80: E501 line too long (90 > 79 characters)
juriscraper/pacer/docket_report.py:1391:80: E501 line too long (87 > 79 characters)
+juriscraper/pacer/docket_report.py:1650:80: E501 line too long (147 > 79 characters)
+juriscraper/pacer/docket_report.py:1656:80: E501 line too long (148 > 79 characters)
+juriscraper/pacer/docket_report.py:1660:80: E501 line too long (276 > 79 characters)
+juriscraper/pacer/docket_report.py:1664:80: E501 line too long (544 > 79 characters)
+juriscraper/pacer/docket_report.py:1666:1: W191 indentation contains tabs
+juriscraper/pacer/docket_report.py:1666:1: E101 indentation contains mixed spaces and tabs
+juriscraper/pacer/docket_report.py:1667:1: W191 indentation contains tabs
+juriscraper/pacer/docket_report.py:1667:1: E101 indentation contains mixed spaces and tabs
+juriscraper/pacer/docket_report.py:1669:80: E501 line too long (118 > 79 characters)
I realize you're now using "Ruff" (which confusingly doesn't seem to do proper linting in my dev environment and I'm not sure why), but I would like my code to pass flake8.
(Although the spaces/tabs probably should be fixed)
Cool. Let's just drop the _ = """ business, please.
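For reference, the style this thread converges on might look like the sketch below: the long sample lines sit in an ordinary docstring, with no `_ =` assignment, and a `# noqa` comment on the closing line of the string is what the flake8 run above relied on to suppress the line-length reports. The function name and body are placeholders, not the method from the PR.

```python
def fragile_lookup_example():
    """Placeholder docstring illustrating the style.

    Long sample data can be pasted verbatim without wrapping it:
    b'<a href="https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26" rel="noopener noreferrer"><span class="receiptLink">view</span></a>'
    """  # noqa
    return None


# The docstring remains available at runtime, unlike a # comment.
print("DisplayReceipt.pl" in fragile_lookup_example.__doc__)  # True
```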


When a docket report is run with the Notices of Electronic Filing checkbox enabled, silver ball links appear in the document number field of the docket report, which breaks our parser.
This checkbox is only available to filing users and non-filing (non-PACER) users in some districts, and it's not usually the default, but it apparently is in D.Minn. Hence my docket reports were not appearing in RECAP.
Fix the parser by ignoring "view" if it appears as text in the document number field; properly raise a ValueError if there is a problematic document number rather than silently throwing away the docket entry; document some of the fragility problems with the parser (that can lead to things like ['pacer_doc_id': 'Dis0layReceipt.pl']); clarify some pretty confusing code. Improve docstrings. Add a test.
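To make the fragility concrete, here is one way a value like 'Dis0layReceipt.pl' could arise. The sketch assumes, without confirmation from this PR, that the ID is taken from the last path segment of the first anchor's href and that its fourth character is then overwritten with '0'. The function name doc_id_from_href is hypothetical.

```python
from urllib.parse import urlsplit


def doc_id_from_href(href):
    """Hypothetical ID extraction: last path segment, 4th character zeroed."""
    last_segment = urlsplit(href).path.rsplit("/", 1)[-1]
    return last_segment[:3] + "0" + last_segment[4:]


# The document link yields a plausible-looking ID ...
print(doc_id_from_href("https://ecf.mnd.uscourts.gov/doc1/101111363917"))
# 101011363917

# ... but the NEF receipt link, which comes first in the cell, yields junk.
print(doc_id_from_href(
    "https://ecf.mnd.uscourts.gov/cgi-bin/DisplayReceipt.pl?230820,26"
))
# Dis0layReceipt.pl
```

Under these assumptions nothing in the extraction itself notices that the receipt URL is not a document URL, which is why raising an error on an unexpected document number is the safer behavior.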