Word list in eng.traineddata #179

Open
@jbarth-ubhd

Description

Compared with fra, deu, ita, and spa, the word list in eng.traineddata contains relatively many ambiguous words (checked with https://gist.github.com/jbarth-ubhd/8d5ceb4035bf2d89700117a311209f20 ):

AMBIGIOUS (EXCERPT): Abstract;In addRole Alberta.ca AngMarTV AppSight aXe BarCap Betting| BioTalent BOX/VPOWER B|S|T BTsites CafeMom CATEGORY:NONE ChemGrout classi®cation CMDs CyberCoders d’Alzon Disc™ DomainTools EARTHWEBNEWS.COM ebizQ EBV-infected Elly_Brown ESPN.com Fire).gba FishBowlDC GEO's getFieldType GFP-Fes GOV/PGC/A GreatSeats.com HKFlix HMSHost icon.gif IconLover image/file JobList KCAL/MOL kgw.com KrF LFTs liveCD load_five MbePoint McBurney McGrady MESSAGE Metz® MOVIES/HDTV NCN-pincer NetFlix ~NEW NotesViewColumn NowBuy NowVisit om/fresh PollDaddy <POSSIBLE <<PREVIOUS PRICES|TIPS ProGrad QCard Quotes.net RakionSEA Re:finlay RTDs SciencesLocation Security| >see SEOs ServerBeach Services/Armed Solution™ <STDIO.H> TheBlackElf T/L UNjobs.org usawallpaper.com Ventolin® ViewVC VivirLatino vWD WebCopier www.ask.com <?xml

338080 lines
0.00 % lines with »ſ«
27.71 % lines all-UPPERCASE
8.68 % lines ambigious

PS: fra, deu, ita, and spa also contain ~30% all-UPPERCASE words. Is this intended?
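
For anyone who wants to reproduce figures of this kind on their own word list, here is a minimal Python sketch. It is not the linked gist: the function names `is_suspicious` and `word_list_stats` are made up for illustration, and the "suspicious" test is a hypothetical stand-in heuristic (non-letter characters or an uppercase letter right after a lowercase one), so its percentage will not match the "ambiguous" figure reported above. It assumes the word list has already been dumped to plain text with one word per line.

```python
def is_suspicious(word):
    """Hypothetical heuristic, NOT the gist's definition of 'ambiguous'.

    Flags a word if it contains a character other than a letter,
    apostrophe, or hyphen, or if an uppercase letter directly follows
    a lowercase one (e.g. "NetFlix").
    """
    has_odd_char = any(not (ch.isalpha() or ch in "'’-") for ch in word)
    has_inner_cap = any(a.islower() and b.isupper()
                        for a, b in zip(word, word[1:]))
    return has_odd_char or has_inner_cap


def word_list_stats(lines):
    """Return the line count and percentages for a one-word-per-line list."""
    # Strip line endings and skip empty lines.
    words = [w for w in (line.strip() for line in lines) if w]
    total = len(words)

    def pct(count):
        return 100.0 * count / total if total else 0.0

    return {
        "lines": total,
        "pct_long_s": pct(sum("ſ" in w for w in words)),
        "pct_uppercase": pct(sum(w.isupper() for w in words)),
        "pct_suspicious": pct(sum(is_suspicious(w) for w in words)),
    }


if __name__ == "__main__":
    import sys

    # Usage: python wordlist_stats.py wordlist.txt
    with open(sys.argv[1], encoding="utf-8") as fh:
        stats = word_list_stats(fh)
    print(f"{stats['lines']} lines")
    print(f"{stats['pct_long_s']:.2f} % lines with »ſ«")
    print(f"{stats['pct_uppercase']:.2f} % lines all-UPPERCASE")
    print(f"{stats['pct_suspicious']:.2f} % lines suspicious (heuristic)")
```

Note that `str.isupper()` only looks at cased characters, so an entry such as `B|S|T` counts as all-uppercase here; whether the gist counts it the same way is not known.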
