Reddit Post Draft, r/webscraping (RSS Surveillance)

DO NOT PUBLISH THIS FILE or copy to public repo. Draft for Dan's review only.


Title

Every US bankruptcy court publishes a free, unauthenticated RSS feed with every new filing. Here's how to scrape them and what you can catch.


Post Body


PACER (Public Access to Court Electronic Records) is the US federal court system's electronic records service. It charges $0.10/page to access documents. No real API. Legacy CGI interface from the early 2000s. Most people assume you need to pay for everything.

But every bankruptcy court publishes a free, unauthenticated RSS feed at a predictable URL:

https://ecf.{court-code}.uscourts.gov/cgi-bin/rss_outside.pl

No login. No API key. No robots.txt blocking it. Just XML. Every new docket entry (motions, orders, petitions, hearings, new cases) shows up in near-real-time. There are 94 bankruptcy courts and they all use the same endpoint pattern.
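A minimal sketch of that URL pattern. The three court codes are the ones mentioned in this post; `COURTS`, `FEED_TEMPLATE`, and `feed_url` are illustrative names, and a full mapping still has to be built by hand.

```python
# Court codes taken from this post; extend the mapping for other courts.
COURTS = {
    "ilnb": "N.D. Ill.",
    "txsb": "S.D. Tex.",
    "cob": "D. Colo.",
}

FEED_TEMPLATE = "https://ecf.{court}.uscourts.gov/cgi-bin/rss_outside.pl"


def feed_url(court_code: str) -> str:
    """Return the public RSS feed URL for one court code."""
    return FEED_TEMPLATE.format(court=court_code.lower())


for code, name in COURTS.items():
    print(f"{name:10s} {feed_url(code)}")
```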

The catch: the feeds roll every ~24 hours. If you don't poll and store the entries, they're gone.


The basic scraping loop:

  1. Fetch the XML with urllib.request
  2. Parse with xml.etree.ElementTree
  3. Each <item> has a GUID, a title (which contains the case number and docket text), a link (to the PACER document; that's the paid part), and a date
  4. Store each entry in SQLite, keyed by GUID for deduplication
  5. Run on a cron job or Task Scheduler. Once a day catches everything. Twice a day gives you margin.

That's it. You now have a permanent, growing record of every filing in whatever courts you're watching. The feeds are small, maybe 50-150 entries per court per day, so even polling all 94 courts takes about 90 seconds.
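The five steps above can be sketched roughly as follows. This is a minimal illustration, not a finished tool: the table layout, the `rss-poller/0.1` User-Agent string, and the function names are assumptions, and it expects the standard RSS 2.0 element names (`guid`, `title`, `link`, `pubDate`).

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET

# Assumed table layout; the UNIQUE constraint does the deduplication.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    court     TEXT NOT NULL,
    guid      TEXT NOT NULL,
    title     TEXT,
    link      TEXT,
    published TEXT,
    UNIQUE (court, guid)
)
"""


def parse_items(xml_bytes: bytes) -> list:
    """Extract (guid, title, link, pubDate) tuples from an RSS 2.0 document."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        # Fall back to the link when a feed omits <guid>.
        guid = item.findtext("guid") or link
        rows.append((
            guid.strip(),
            item.findtext("title", default="").strip(),
            link.strip(),
            item.findtext("pubDate", default="").strip(),
        ))
    return rows


def store(conn, court: str, rows) -> int:
    """Insert rows, skipping ones already stored; return the count of new rows."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO entries (court, guid, title, link, published) "
        "VALUES (?, ?, ?, ?, ?)",
        [(court, *row) for row in rows],
    )
    conn.commit()
    return conn.total_changes - before


def poll(conn, court: str) -> int:
    """Fetch one court's feed and store any entries not seen before."""
    url = f"https://ecf.{court}.uscourts.gov/cgi-bin/rss_outside.pl"
    request = urllib.request.Request(url, headers={"User-Agent": "rss-poller/0.1"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return store(conn, court, parse_items(resp.read()))
```

To try it, open a database with `conn = sqlite3.connect("filings.db")`, run `conn.execute(SCHEMA)` once, then call `poll(conn, "ilnb")` from cron. Keying on `(court, guid)` rather than the GUID alone is a deliberate choice, in case two courts ever reuse the same identifier.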


Quirks you'll hit:

  • Court codes aren't consistent. N.D. Ill. is ilnb, S.D. Tex. is txsb, D. Colo. is cob. There's no official master list anywhere. You have to map them manually. I did all 94.
  • XML schemas vary between courts. Some use dc:date for timestamps, others bury the date in the title text. Some include the case number in the GUID, others use a numeric ID that tells you nothing. Your parser needs to handle both.
  • Feeds occasionally 503. Federal infrastructure. Build in retries with backoff; three attempts with a 5-second wait handles it. Catch OSError, which also covers URLError and HTTPError.
  • Deduplication matters. GUIDs are unique per entry, but if you poll before the feed rolls, you'll see the same entries again. A state file of seen GUIDs (or a UNIQUE constraint in SQLite) prevents double-processing.
  • Case numbers aren't unique across courts. Case 25-12345 in Northern Illinois is a completely different person than 25-12345 in Southern Texas. Always store the court code alongside the case number. I learned this the hard way when I got a false alarm that scared the hell out of me.
  • email.utils.parsedate_to_datetime handles the RFC 822-style dates that RSS uses. Saved me from writing a manual date parser.
  • Rate limiting is minimal but add polite delays between courts anyway. It's government infrastructure and you want it to keep working.
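Two of these quirks, retries and inconsistent dates, can be sketched like this. It is a hedged example: the linear backoff (5 seconds, then 10), the injectable `get` parameter, and the `MM/DD/YYYY` title pattern are all assumptions made for illustration. The real title formats vary by court and need checking against actual feeds.

```python
import re
import time
import urllib.request
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Callable, Optional


def _http_get(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def fetch_with_retry(
    url: str,
    attempts: int = 3,
    wait_seconds: float = 5.0,
    get: Callable[[str], bytes] = _http_get,
) -> Optional[bytes]:
    """Return the response body, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return get(url)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            if attempt == attempts:
                print(f"giving up on {url}: {exc}")
                return None
            time.sleep(wait_seconds * attempt)  # 5s after the first failure, then 10s
    return None


# Hypothetical pattern for a date buried in the title text, e.g. "01/06/2025".
TITLE_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def parse_entry_date(pub_date: str, title: str = "") -> Optional[datetime]:
    """Try the RSS pubDate first, then fall back to a date found in the title."""
    if pub_date:
        try:
            return parsedate_to_datetime(pub_date)
        except (TypeError, ValueError):  # exception type differs by Python version
            pass
    match = TITLE_DATE.search(title)
    if match:
        month, day, year = (int(part) for part in match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:
            return None
    return None
```

Note that the pubDate path returns a timezone-aware datetime when the feed includes an offset, while the title fallback returns a naive one, so normalize before comparing the two.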

What you can actually catch with this data:

The raw feed is just a stream of docket entries. The value comes from what you layer on top of it. Here's what's possible once you have the data accumulating:

1. Case-specific monitoring. Track specific case numbers. When a new entry matches, check the docket text against keywords: "dismissed," "order," "motion to withdraw," "relief from stay." You get an alert within hours of a filing instead of checking PACER manually.
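A small sketch of that check. The watched case is a made-up example, and the case-number regex assumes the `YY-NNNNN` form used in this post; some courts format case numbers differently, so treat the pattern as a placeholder.

```python
import re

# (court code, case number) pairs to watch; the value here is hypothetical.
WATCHED_CASES = {("ilnb", "25-12345")}
KEYWORDS = ("dismissed", "order", "motion to withdraw", "relief from stay")

# Assumes the title contains a case number shaped like 25-12345.
CASE_NUMBER = re.compile(r"\b\d{2}-\d{5}\b")


def check_entry(court: str, title: str):
    """Return matched keywords for a watched case, or None for other cases."""
    match = CASE_NUMBER.search(title)
    if not match or (court, match.group()) not in WATCHED_CASES:
        return None
    lowered = title.lower()
    return [kw for kw in KEYWORDS if kw in lowered]
```

Matching on the `(court, case number)` pair keeps the same number in another district from triggering an alert.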

2. Attorney portfolio tracking. PACER Case Locator (pcl.uscourts.gov) lets you download free CSV exports of every case a specific attorney has filed. Load those into your database. Now when any of those cases shows activity on the RSS feed, you see it. You can track an attorney's entire active caseload in real time for $0.

3. New filing detection. Nearly every new bankruptcy case starts with "Voluntary Petition" as Doc #1 (involuntary petitions exist but are rare). When you see that in the feed, you know someone just filed. Track filing velocity by attorney: how many new cases per week, per month. Some attorneys file 1-2 a month. Some file 10+ per week. The pattern tells you a lot about their practice.
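One way to sketch the detection and the weekly count. The function names are illustrative, and the code assumes you already know which attorney filed each case (for example from a case list loaded separately); it does not try to extract that from the feed.

```python
from collections import Counter
from datetime import date


def is_new_petition(title: str) -> bool:
    """True when the docket text looks like a new voluntary petition."""
    return "voluntary petition" in title.lower()


def weekly_velocity(filings):
    """Count new cases per (attorney, ISO year, ISO week).

    `filings` is an iterable of (attorney, filed_date) pairs.
    """
    counts = Counter()
    for attorney, filed in filings:
        iso = filed.isocalendar()  # (ISO year, ISO week, weekday)
        counts[(attorney, iso[0], iso[1])] += 1
    return counts
```

Using ISO weeks avoids splitting a week across a month boundary; a per-month count works the same way with `(filed.year, filed.month)` as the key.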

4. Statutory violation screening. This is the one that gets interesting. Federal law (11 U.S.C. § 1328(f)) says a Chapter 13 debtor can't get a discharge if they already received one in a case filed within the previous 2-4 years, depending on the prior chapter: 4 years for a prior Chapter 7, 11, or 12, and 2 years for a prior Chapter 13. It's three data points: the filing date of the prior case that ended in discharge, the new filing date, and the prior chapter. One subtraction. But nobody checks systematically.

When a new Ch. 13 filing hits the feed, you can fuzzy-match the debtor's name against every prior case in your database. If they received a Chapter 7 discharge in a case filed 18 months ago, the new filing can never end in discharge. The attorney either didn't check or didn't care.
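A rough sketch of that subtraction. It assumes the lookback runs from the prior case's filing date to the new filing date, which is how § 1328(f) is worded, and that the prior case is already known to have ended in discharge. The function names and the handling of the exact boundary day are my assumptions.

```python
from datetime import date

# Lookback in years, keyed by the chapter of the prior discharged case.
LOOKBACK_YEARS = {"7": 4, "11": 4, "12": 4, "13": 2}


def years_before(d: date, years: int) -> date:
    """Return the date `years` before `d`, mapping Feb 29 to Feb 28."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def discharge_barred(prior_chapter: str, prior_filed: date, new_filed: date) -> bool:
    """True if a new Chapter 13 filing falls inside the lookback window.

    Only meaningful when the prior case actually ended in a discharge.
    Treating the window start as inclusive is an assumption.
    """
    years = LOOKBACK_YEARS.get(prior_chapter)
    if years is None:
        return False
    return years_before(new_filed, years) <= prior_filed < new_filed
```

A hit from this function is a flag for manual review of the docket, since the feed data alone won't tell you whether the prior case really ended in discharge.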

For name matching, stdlib works fine; it's a structured data problem. Strip suffixes (Jr., III, Sr.), normalize casing, handle "NMN" (no middle name) placeholders, split joint filings ("John Smith and Jane Smith") and match each spouse independently.
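A minimal stdlib sketch of those normalization steps. The suffix list and function names are illustrative, and stripping all punctuation is a simplification (it turns "O'Brien" into "o brien"), so adjust to taste.

```python
import re

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
PLACEHOLDERS = {"nmn"}  # "no middle name"


def normalize_name(raw: str) -> str:
    """Lowercase, drop punctuation, suffixes, and NMN placeholders."""
    tokens = re.sub(r"[^a-z\s]", " ", raw.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES | PLACEHOLDERS)


def split_debtors(raw: str) -> list:
    """Split a joint filing like 'John Smith and Jane Smith' into normalized names."""
    parts = re.split(r"\s+and\s+", raw, flags=re.IGNORECASE)
    return [name for name in (normalize_name(p) for p in parts) if name]
```

For example, `split_debtors("John NMN Smith, Jr. and Jane Smith")` yields two names that can each be matched on their own.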

For classifying what a docket entry means, regex falls apart fast. "Motion to Dismiss" vs "Hearing on Motion to Dismiss" vs "Order Denying Motion to Dismiss" are three completely different events. I ended up using sentence-transformers (all-MiniLM-L6-v2, 384 dimensions, runs locally) for docket text classification. It lazy-loads the model, caches embeddings to disk, and falls back to regex when the model isn't available. One real dependency, but it earned its spot.
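A hedged sketch of the lazy-load-with-fallback idea. The labels, the one-prototype-per-label setup, and the `use_model` flag are assumptions for illustration; disk caching of embeddings is left out to keep it short. It classifies by cosine similarity to prototype phrases, which is one simple way to use the embeddings, not necessarily the best one.

```python
import re

# Example labels with one prototype phrase each; both are illustrative.
PROTOTYPES = {
    "motion_to_dismiss": "Motion to Dismiss Case",
    "hearing_on_motion_to_dismiss": "Hearing on Motion to Dismiss",
    "order_denying_dismissal": "Order Denying Motion to Dismiss",
}

# Regex fallback: most specific patterns first, because all three
# contain the phrase "motion to dismiss".
FALLBACK_RULES = [
    (re.compile(r"\border\s+denying\s+motion\s+to\s+dismiss", re.I), "order_denying_dismissal"),
    (re.compile(r"\bhearing\b.*\bmotion\s+to\s+dismiss", re.I), "hearing_on_motion_to_dismiss"),
    (re.compile(r"\bmotion\s+to\s+dismiss", re.I), "motion_to_dismiss"),
]

_model = None
_prototype_vectors = None


def _load_model():
    """Load the model on first use; return None if it cannot be loaded."""
    global _model, _prototype_vectors
    if _model is None:
        try:
            from sentence_transformers import SentenceTransformer
            model = SentenceTransformer("all-MiniLM-L6-v2")
        except (ImportError, OSError):
            return None
        _prototype_vectors = model.encode(
            list(PROTOTYPES.values()), normalize_embeddings=True
        )
        _model = model
    return _model


def classify_regex(text: str) -> str:
    """Return the label of the first matching rule, or 'other'."""
    for pattern, label in FALLBACK_RULES:
        if pattern.search(text):
            return label
    return "other"


def classify(text: str, use_model: bool = True) -> str:
    """Return the closest label, using embeddings when the model is available."""
    model = _load_model() if use_model else None
    if model is None:
        return classify_regex(text)
    vector = model.encode([text], normalize_embeddings=True)[0]
    scores = _prototype_vectors @ vector  # cosine similarity of unit vectors
    return list(PROTOTYPES)[int(scores.argmax())]
```

As written, the embedding path always returns one of the prototype labels. A real version would want a similarity threshold below which the entry is labeled "other".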

5. Pattern detection. Same fuzzy matching, different questions. Has this debtor filed before with the same attorney? Has this attorney had an unusual number of dismissals this month? Are cases being filed and dismissed in cycles? Cross-court matching catches things that single-court analysis misses: an attorney filing in two districts can move a client between them.
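The first question (same debtor, same attorney, any court) could be sketched with `difflib`. The record layout, the 0.9 threshold, and the greedy clustering are assumptions; names are expected to be normalized already.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy string equality using difflib's similarity ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def repeat_filers(cases):
    """Group cases whose debtor names fuzzy-match under the same attorney.

    `cases` is a list of dicts with 'court', 'case_number', 'debtor', and
    'attorney' keys (an assumed layout). Returns groups of two or more cases.
    """
    by_attorney = defaultdict(list)
    for case in cases:
        by_attorney[case["attorney"]].append(case)

    groups = []
    for attorney_cases in by_attorney.values():
        clusters = []
        for case in attorney_cases:
            # Compare against the first case in each existing cluster.
            for cluster in clusters:
                if similar(case["debtor"], cluster[0]["debtor"]):
                    cluster.append(case)
                    break
            else:
                clusters.append([case])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Because the grouping ignores the court code, a debtor who appears in two districts under the same attorney lands in one group.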

6. Outcome tracking. Over time, your database accumulates the full lifecycle of every case you're watching. Filed → confirmed → dismissed. Filed → discharged. Filed → converted to Ch. 7. You can compute dismissal rates, time-to-dismissal, discharge rates, and compare them across attorneys or firms. The numbers are public. Nobody aggregates them.
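Once outcomes are recorded, the rates are a single GROUP BY. This sketch assumes a hypothetical `cases` table with `attorney` and `outcome` columns, where `outcome` is NULL for cases that are still open.

```python
import sqlite3

# Hypothetical table; one row per case, outcome filled in when the case closes.
CASES_SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    court       TEXT NOT NULL,
    case_number TEXT NOT NULL,
    attorney    TEXT,
    outcome     TEXT,
    UNIQUE (court, case_number)
)
"""


def outcome_rates(conn):
    """Return {attorney: (closed_cases, dismissal_rate, discharge_rate)}."""
    rows = conn.execute(
        """
        SELECT attorney,
               COUNT(*)                    AS closed_cases,
               AVG(outcome = 'dismissed')  AS dismissal_rate,
               AVG(outcome = 'discharged') AS discharge_rate
        FROM cases
        WHERE outcome IS NOT NULL
        GROUP BY attorney
        ORDER BY attorney
        """
    )
    return {a: (n, dismissed, discharged) for a, n, dismissed, discharged in rows}
```

SQLite evaluates each comparison to 0 or 1, so `AVG` over it gives the fraction directly. Open cases are excluded so they don't dilute the rates.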


The authenticated side:

When the RSS feed tells you something interesting happened, you sometimes need the actual document. That means logging into PACER, which now requires login.gov with MFA. Playwright handles the browser automation, cookie persistence, session management, navigating the CGI endpoints. The interface hasn't changed in decades, so the selectors are stable. You pay $0.10/page for what you pull, but the RSS layer means you only pull documents that matter instead of checking dockets blind.


Stack: urllib.request, xml.etree.ElementTree, json, csv, sqlite3, collections, argparse, and optionally sentence-transformers for the text classification layer. Python 3.8+.

One sweep of 3 courts: ~3 seconds. Full 94-court scan: ~90 seconds. SQLite database is about 15MB after months of daily accumulation.

Daily operating cost: $0. The data is public. The feeds are free. The courts publish them intentionally; it's part of the e-government transparency mandate. You're just the first person to actually read them programmatically.


Posting Notes

  • Flair: Use project showcase or similar if available
  • Timing: Different day than r/Python post (at least 3-4 days apart). Tue-Thu morning US time.
  • Tone: Developer sharing scraping infrastructure. Focus on the RSS discovery and government site quirks.
  • Cross-reference: Mention r/Python post naturally in the opening line. Link to it.
  • If asked "what are you monitoring?": "Attorney quality metrics in a specific district. The tool is general-purpose, point it at any bankruptcy court's RSS feed."
  • If asked about scale: "Three courts, ~100 entries/day each. The state file is a few hundred KB after months of polling."
  • If asked about legality: "RSS feeds are public. PACER itself charges for document downloads, but the docket entry text in RSS is free. This reads only the free data."
  • If asked about the authenticated scraping: Keep it high-level. "Playwright + cookie management. PACER's login has MFA via login.gov now. The CGI endpoints haven't changed in decades so the selectors are stable."
  • Do NOT: mention any specific firm, attorney, case, personal connection, or use the word "surveillance" in comments (use "monitoring")
  • Do NOT: volunteer details about the 1328(f) screener unless someone asks