Failed extraction on combined text and images PDF #230

@grossir

Found on a ca11 opinion. The query below finds 71 affected opinions, listed after it.

The PDF has 3 pages, each with a text header. The extracted content is just that text header repeated 3 times.

The actual opinion content is in an image, so the text extraction misses it.

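from django.db.models.functions import Substr

from cl.search.models import OpinionCluster

# ca11 clusters with source "C" whose opinion text starts with a page header
oc = OpinionCluster.objects.filter(
    docket__court__id="ca11",
    source="C",
).annotate(
    # Only inspect the first 500 characters of each sub-opinion's plain text
    op_text=Substr('sub_opinions__plain_text', 1, 500)
).filter(
    # A "2 of " fragment this early suggests a page header, not opinion text
    op_text__contains="2 of "
)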

['https://courtlistener.com/opinion/2674718/x/',
 'https://courtlistener.com/opinion/2655980/x/',
 'https://courtlistener.com/opinion/2657641/x/',
 'https://courtlistener.com/opinion/2646625/x/',
 'https://courtlistener.com/opinion/2681361/x/',
 'https://courtlistener.com/opinion/2677536/x/',
 'https://courtlistener.com/opinion/2678749/x/',
 'https://courtlistener.com/opinion/2681362/x/',
 'https://courtlistener.com/opinion/806351/x/',
 'https://courtlistener.com/opinion/2723719/x/',
 'https://courtlistener.com/opinion/810880/x/',
 'https://courtlistener.com/opinion/1037991/x/',
 'https://courtlistener.com/opinion/2641548/x/',
 'https://courtlistener.com/opinion/2794631/x/',
 'https://courtlistener.com/opinion/2771794/x/',
 'https://courtlistener.com/opinion/2773605/x/',
 'https://courtlistener.com/opinion/2755546/x/',
 'https://courtlistener.com/opinion/2780908/x/',
 'https://courtlistener.com/opinion/2781755/x/',
 'https://courtlistener.com/opinion/4394856/x/',
 'https://courtlistener.com/opinion/2799600/x/',
 'https://courtlistener.com/opinion/2821380/x/',
 'https://courtlistener.com/opinion/3045056/x/',
 'https://courtlistener.com/opinion/3045459/x/',
 'https://courtlistener.com/opinion/3046101/x/',
 'https://courtlistener.com/opinion/3049640/x/',
 'https://courtlistener.com/opinion/3160871/x/',
 'https://courtlistener.com/opinion/3160876/x/',
 'https://courtlistener.com/opinion/2787505/x/',
 'https://courtlistener.com/opinion/2812340/x/',
 'https://courtlistener.com/opinion/3043449/x/',
 'https://courtlistener.com/opinion/3167749/x/',
 'https://courtlistener.com/opinion/3168312/x/',
 'https://courtlistener.com/opinion/3206688/x/',
 'https://courtlistener.com/opinion/3210519/x/',
 'https://courtlistener.com/opinion/3217219/x/',
 'https://courtlistener.com/opinion/4238226/x/',
 'https://courtlistener.com/opinion/4585196/x/',
 'https://courtlistener.com/opinion/4425963/x/',
 'https://courtlistener.com/opinion/4569261/x/',
 'https://courtlistener.com/opinion/4475198/x/',
 'https://courtlistener.com/opinion/4548618/x/',
 'https://courtlistener.com/opinion/9426347/x/',
 'https://courtlistener.com/opinion/4676739/x/',
 'https://courtlistener.com/opinion/6454512/x/',
 'https://courtlistener.com/opinion/9431954/x/',
 'https://courtlistener.com/opinion/10041520/x/',
 'https://courtlistener.com/opinion/4689950/x/',
 'https://courtlistener.com/opinion/9427661/x/',
 'https://courtlistener.com/opinion/9998970/x/',
 'https://courtlistener.com/opinion/7859126/x/',
 'https://courtlistener.com/opinion/9453118/x/',
 'https://courtlistener.com/opinion/9454010/x/',
 'https://courtlistener.com/opinion/9454012/x/',
 'https://courtlistener.com/opinion/9454159/x/',
 'https://courtlistener.com/opinion/9446919/x/',
 'https://courtlistener.com/opinion/9453117/x/',
 'https://courtlistener.com/opinion/9452980/x/',
 'https://courtlistener.com/opinion/9453010/x/',
 'https://courtlistener.com/opinion/9453150/x/',
 'https://courtlistener.com/opinion/9453883/x/',
 'https://courtlistener.com/opinion/9454045/x/',
 'https://courtlistener.com/opinion/9454117/x/',
 'https://courtlistener.com/opinion/9454118/x/',
 'https://courtlistener.com/opinion/9454158/x/',
 'https://courtlistener.com/opinion/9454157/x/',
 'https://courtlistener.com/opinion/9454376/x/',
 'https://courtlistener.com/opinion/10274136/x/',
 'https://courtlistener.com/opinion/10687678/x/',
 'https://courtlistener.com/opinion/9434365/x/',
 'https://courtlistener.com/opinion/10360454/x/']
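
The symptom described above (the same header extracted once per page, with the body left behind in an image) could also be flagged directly from the extracted text. The following is only a sketch: `looks_like_repeated_header` and its parameters are hypothetical names, the sample header format is made up for illustration, and the thresholds are guesses that would need tuning against the real ca11 PDFs.

```python
import re
from collections import Counter


def looks_like_repeated_header(
    text: str, min_repeats: int = 2, max_chars: int = 1000
) -> bool:
    """Heuristic: True when the text is short and every non-empty line,
    after masking digits, occurs at least `min_repeats` times."""
    if len(text) > max_chars:
        # Long text most likely contains a real opinion body.
        return False

    normalized = []
    for line in text.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:
            # Mask digits so "1 of 3" and "2 of 3" compare as equal.
            normalized.append(re.sub(r"\d+", "#", line))

    if not normalized:
        return False

    counts = Counter(normalized)
    return all(n >= min_repeats for n in counts.values())


# Hypothetical sample: a 3-page PDF where only the header was extracted.
sample = "\n".join(f"Case: 00-00000  Page: {i} of 3" for i in (1, 2, 3))
header_only = looks_like_repeated_header(sample)
```

Opinions flagged this way could then be queued for OCR or manual review. A single-page document with only a header would not be caught, since nothing repeats.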
