whoisjones/FiNERweb

FiNERweb

A multilingual NER dataset covering 91 languages and 25 scripts. See our paper for details!

Get Started

We host all materials on the Hugging Face Hub! The code for the project can be found here!

How to load the datasets

from datasets import load_dataset

finerweb = load_dataset('whoisjones/finerweb')
finerweb_de = load_dataset('whoisjones/finerweb', split='deu')
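
When several languages are needed, the per-language call above can be wrapped in a small helper. The following is only a sketch: `load_language_splits` and its `loader` parameter are hypothetical names introduced here, and the only split code confirmed by this README is `'deu'`. It assumes every language is exposed as a split named by its language code.

```python
from typing import Callable, Dict, Iterable, Optional

REPO_ID = "whoisjones/finerweb"  # dataset ID taken from the README


def load_language_splits(
    codes: Iterable[str],
    loader: Optional[Callable[..., object]] = None,
) -> Dict[str, object]:
    """Load one split per language code and return them keyed by code.

    Assumption: each language is available as a split named by its code,
    as with 'deu' in the README example.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised without the library.
        from datasets import load_dataset
        loader = load_dataset
    return {code: loader(REPO_ID, split=code) for code in codes}


# Usage (downloads data from the Hub):
# splits = load_language_splits(["deu"])
# print(splits["deu"])
```

Passing a custom `loader` is optional; it is there mainly so the helper can be tested offline with a stand-in function.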

How to load the regression models

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("whoisjones/finerweb-multilabel-classifier-xlmr-4o")
tokenizer = AutoTokenizer.from_pretrained("whoisjones/finerweb-multilabel-classifier-xlmr-4o")

# A clean, well-formed news passage.
good_example = """'Kraft Foods has taken the Cadbury chocolate brand in a new direction, by combining it with cheese for the first time.
The company is bringing together two of its brands and launching Philadelphia with Cadbury, a chilled chocolate spread made from Philadelphia Light and Cadbury chocolate.
Kraft believes the new product has the potential to do very well and is targeting £10m in sales in the first year.
The new cheese and chocolate spread is being launched on 1 February and will be appear in the chilled dairy aisle next to plain Philadelphia Light.
It is launching in a 160g tub and a 120g four-pack of mini tubs, both with an rsp of £1.62.
Kraft is supporting the launch with a £3.2m marketing budget in 2012 and is targeting 2,000 tonnes in volume sales – equivalent to about £10m – in the first year.
If they reached this volume of sales, the new Philadelphia with Cadbury would have the same market value as Garlic & Herb, currently the biggest-selling flavour in the Philadelphia portfolio.
Kraft already offers chocolate variants of Philadelphia in Italy and Germany, using Milka chocolate and targeting the breakfast occasion.
In Germany, Philadelphia with Milka has generated €22.2m in sales since its October 2010 launch and has a 6.6% value share of the chocolate spread market.
Kraft Foods UK marketing manager Bruce Newman said:
“The UK product would be positioned as a snack.
“The breakfast market in countries such as Germany is more developed, and our consumer research firmly identified Philadelphia with Cadbury as a snack.”'"""

# A noisy forum snippet.
bad_example = """'|Viewing Single Post From: Spoilers for the Week of February 11th| |Lil||Feb 1 2013, 09:58 AM| Don\'t care about Chloe/Taniel/Jen-Jen . Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it . And "newfound" feelings . Please . If only . STEFANO!! STEFANO, STEFANO, STEFANO!!!!: cheer: |Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|'"""

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    good_example_inputs = tokenizer(good_example, return_tensors="pt")
    bad_example_inputs = tokenizer(bad_example, return_tensors="pt")
    good_example_outputs = model(**good_example_inputs)
    bad_example_outputs = model(**bad_example_inputs)
    print(good_example_outputs.logits)
    print(bad_example_outputs.logits)
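
The snippet above prints raw logits. As a minimal sketch, and assuming the head was trained as a multi-label classifier (the model name suggests this, but this README does not confirm it), each logit can be mapped independently to a value in (0, 1) with a sigmoid. If the head is a true regression head, the raw values are already the scores and this mapping should not be applied. The helper names `sigmoid` and `logits_to_scores` are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_scores(logits: Sequence[float]) -> List[float]:
    """Map each logit independently to (0, 1).

    Assumption: the model is a multi-label classifier, so labels are
    scored independently rather than normalized with a softmax.
    """
    return [sigmoid(v) for v in logits]


# Usage with the outputs from the snippet above:
# scores = logits_to_scores(good_example_outputs.logits[0].tolist())
```

A sigmoid rather than a softmax is used here because multi-label scores are not expected to sum to one.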

Datasets

Regression Models

Raw Materials

Note: These materials are the raw annotations. We recommend using the datasets above.

Citation

If you find our work useful, please consider citing our paper!

@misc{golde2025finerwebdatasetsartifactsscalable,
      title={FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition}, 
      author={Jonas Golde and Patrick Haller and Alan Akbik},
      year={2025},
      eprint={2512.13884},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.13884}, 
}
