diff --git a/README.md b/README.md index f3d939de..841f06ea 100644 --- a/README.md +++ b/README.md @@ -116,7 +116,7 @@ Above, `use_best=True` implements mitigation so that the uncertainty-minimized r * Exact Match ([Cole et al., 2023](https://arxiv.org/abs/2305.14613); [Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175)) * BERTScore ([Manakul et al., 2023](https://arxiv.org/abs/2303.08896); [Zheng et al., 2020](https://arxiv.org/abs/1904.09675)) * Cosine Similarity ([Shorinwa et al., 2024](https://arxiv.org/abs/2412.05563)) -* Functional Entropy for Code Generation ([Bouchard et al., 2026](https://arxiv.org/abs/2605.28500)) +* Code-adapted scorers via [`CodeGenUQ`](#code-generation-uq). ### White-Box Scorers (Token-Probability-Based) @@ -322,6 +322,29 @@ Above `response` and `entailment` reflect the original response and response-lev * Graph-based scorers ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783)) * Generalized long-form semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0)) + +### Code Generation UQ + +For code-generation tasks, UQLM provides `CodeGenUQ`, a specialized interface for predicting whether LLM-generated code is functionally correct without requiring execution. `CodeGenUQ` includes white-box scorers, code-adapted black-box scorers based on functional equivalence, and reflexive self-evaluation scorers. The white-box methods are the same token-probability-based scorers available through `WhiteBoxUQ`. + +**Example Usage:** + +```python +from langchain_openai import ChatOpenAI +llm = ChatOpenAI(model="gpt-4o-mini") + +from uqlm import CodeGenUQ +cguq = CodeGenUQ( + llm=llm, + scorers=["functional_equivalence_rate"] +) + +results = await cguq.generate_and_score(prompts=prompts, num_responses=5) +results.to_df() +``` + +For a more detailed demo, refer to our [`CodeGenUQ` Demo](./examples/codegen_demo.ipynb). 
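As a rough illustration of what the functional-equivalence scorers measure, the sketch below computes the three scores defined in the scorer documentation (functional equivalence rate, functional negentropy, and functional sets confidence) from cluster labels. It is not UQLM's implementation. The helper `functional_scores` and its `cluster_ids` input are hypothetical, and the sketch assumes the LLM equivalence judgments have already been reduced to one cluster label per response, with the original response listed first.

```python
import math
from collections import Counter


def functional_scores(cluster_ids):
    """Compute FER, functional negentropy, and FSC from cluster labels.

    Hypothetical helper for illustration only (not part of UQLM).
    cluster_ids[0] is the label of the original response; the remaining
    m entries are the labels of the sampled responses. Two responses share
    a label when they were judged functionally equivalent.
    """
    m = len(cluster_ids) - 1
    if m < 1:
        raise ValueError("need the original response plus at least one sample")

    original, sampled = cluster_ids[0], cluster_ids[1:]

    # FER: share of sampled responses in the same cluster as the original.
    fer = sum(label == original for label in sampled) / m

    # Functional entropy over the cluster distribution of all m + 1 responses.
    counts = Counter(cluster_ids)
    total = m + 1
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())

    # Normalize by the maximum entropy log(m + 1) and flip to a confidence.
    negentropy = 1 - entropy / math.log(total)

    # FSC: fewer distinct clusters means higher confidence.
    fsc = (total - len(counts)) / m

    return {
        "functional_equivalence_rate": fer,
        "functional_negentropy": negentropy,
        "functional_sets_confidence": fsc,
    }


if __name__ == "__main__":
    # Original response plus five samples; three samples agree with the original.
    print(functional_scores([0, 0, 0, 1, 0, 2]))
```

All three scores equal 1 when every response falls in a single cluster, and they decrease as the sampled responses split into more clusters.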
More details on code generation scorers are available in [Bouchard et al., 2026](https://arxiv.org/abs/2605.28500). + ## Documentation Check out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more. diff --git a/docs/source/scorer_definitions/code_generation/code_similarity.rst b/docs/source/scorer_definitions/code_generation/code_similarity.rst new file mode 100644 index 00000000..554a5746 --- /dev/null +++ b/docs/source/scorer_definitions/code_generation/code_similarity.rst @@ -0,0 +1,52 @@ +Code Similarity Scorers +======================= + +.. currentmodule:: uqlm.scorers + +Definition +---------- + +Code similarity scorers generate sampled code responses from the same prompt and compare each sampled response with the original response. Higher average similarity indicates higher confidence. + +``cosine_sim`` embeds the original and sampled code responses with a code embedding model, then computes normalized average cosine similarity: + +.. math:: + + NCS(y; \tilde{\mathbf{y}}) = \frac{1}{2} + \frac{1}{2m} \sum_{j=1}^{m} \frac{V(y) \cdot V(\tilde{y}_j)}{\|V(y)\| \cdot \|V(\tilde{y}_j)\|} + +``code_bleu`` computes average CodeBLEU similarity between the original code response and sampled responses: + +.. math:: + + CBC(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \text{CodeBLEU}(y, \tilde{y}_j) + +**Key Properties:** + +- Code-adapted black-box consistency scoring +- Uses structural or embedding-based similarity rather than natural-language entailment +- Score range: :math:`[0, 1]` + +Parameters +---------- + +When using :class:`CodeGenUQ`, specify ``"cosine_sim"`` or ``"code_bleu"`` in the ``scorers`` list. You can also set ``sentence_transformer`` for ``cosine_sim`` and ``language`` for ``code_bleu``. + +Example +------- + +.. 
code-block:: python + + from uqlm import CodeGenUQ + + code_uq = CodeGenUQ( + llm=llm, + scorers=["cosine_sim", "code_bleu"], + language="python", + ) + + results = await code_uq.generate_and_score(prompts=prompts, num_responses=5) + +See Also +-------- + +- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification diff --git a/docs/source/scorer_definitions/code_generation/functional_equivalence.rst b/docs/source/scorer_definitions/code_generation/functional_equivalence.rst new file mode 100644 index 00000000..6ac376f4 --- /dev/null +++ b/docs/source/scorer_definitions/code_generation/functional_equivalence.rst @@ -0,0 +1,76 @@ +Functional Equivalence Scorers +============================== + +.. currentmodule:: uqlm.scorers + +Definition +---------- + +Functional equivalence scorers use an LLM to judge whether two code snippets are functionally equivalent, meaning they would produce the same outputs for valid inputs. These scorers were proposed by Bouchard et al. (2026). + +``functional_equivalence_rate`` estimates the proportion of sampled responses that are functionally equivalent to the original response: + +.. math:: + + FER(y; \tilde{\mathbf{y}}) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}[y \equiv \tilde{y}_j] + +``functional_negentropy`` clusters the original and sampled responses by functional equivalence, computes entropy over the cluster distribution, and normalizes it to a confidence score. Let :math:`\mathcal{C}` denote the set of functional equivalence clusters, and let :math:`P(C)` denote the proportion of responses in cluster :math:`C`. Functional entropy is: + +.. math:: + + FE(y; \tilde{\mathbf{y}}) = -\sum_{C \in \mathcal{C}} P(C) \log P(C) + +The normalized confidence score is: + +.. math:: + + NFN(y; \tilde{\mathbf{y}}) = 1 - \frac{FE(y; \tilde{\mathbf{y}})}{\log(m + 1)} + +``functional_sets_confidence`` counts the number of functional equivalence clusters and normalizes it to :math:`[0, 1]`: + +.. 
math:: + + FSC(y; \tilde{\mathbf{y}}) = \frac{m + 1 - |\mathcal{C}|}{m} + +where :math:`|\mathcal{C}|` is the number of functional equivalence clusters among the original response and :math:`m` sampled responses. + +**Key Properties:** + +- Directly targets functional agreement rather than textual similarity +- Requires an LLM for equivalence judgments +- Score range: :math:`[0, 1]` + +Parameters +---------- + +When using :class:`CodeGenUQ`, specify ``"functional_equivalence_rate"``, ``"functional_negentropy"``, or ``"functional_sets_confidence"`` in the ``scorers`` list. You can set ``equivalence_llm`` to use a separate model for equivalence judgments. + +Example +------- + +.. code-block:: python + + from uqlm import CodeGenUQ + + code_uq = CodeGenUQ( + llm=llm, + equivalence_llm=equivalence_llm, + scorers=[ + "functional_equivalence_rate", + "functional_negentropy", + "functional_sets_confidence", + ], + language="python", + ) + + results = await code_uq.generate_and_score(prompts=prompts, num_responses=5) + +References +---------- + +- Bouchard, D., et al. (2026). `Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification `_. *arXiv*. + +See Also +-------- + +- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification diff --git a/docs/source/scorer_definitions/code_generation/index.rst b/docs/source/scorer_definitions/code_generation/index.rst new file mode 100644 index 00000000..5f7d2233 --- /dev/null +++ b/docs/source/scorer_definitions/code_generation/index.rst @@ -0,0 +1,29 @@ +Code-Generation Scorers +======================= + +.. currentmodule:: uqlm.scorers + +Code-generation uncertainty quantification uses :class:`CodeGenUQ` to score generated code. These scorers either reuse existing short-form UQ methods or adapt black-box consistency scoring to code by comparing structural similarity or functional equivalence across sampled generations. 
+ +**Key Characteristics:** + +- **White-box compatibility:** Token-probability scorers are identical to the corresponding :doc:`white-box scorers <../white_box/index>`. +- **Code-aware consistency:** Black-box scorers compare sampled code generations using code embeddings, CodeBLEU, or LLM-judged functional equivalence. +- **Score range:** :math:`[0, 1]`, where higher values indicate higher confidence. + +**Trade-offs:** + +- **Dependency requirements:** Some code-aware scorers require code-specific models or language tooling. +- **Higher cost:** Functional equivalence scorers require additional LLM calls. + +Code-Generation Scoring Methods +------------------------------- + +There are three main categories of code-generation scoring methods offered by UQLM: + +.. toctree:: + :maxdepth: 1 + + token_probability + code_similarity + functional_equivalence diff --git a/docs/source/scorer_definitions/code_generation/token_probability.rst b/docs/source/scorer_definitions/code_generation/token_probability.rst new file mode 100644 index 00000000..a2a53367 --- /dev/null +++ b/docs/source/scorer_definitions/code_generation/token_probability.rst @@ -0,0 +1,51 @@ +Token-Probability Code Scorers +============================== + +.. currentmodule:: uqlm.scorers + +Definition +---------- + +Token-probability code scorers are the same methods used by :class:`WhiteBoxUQ`, applied to generated code responses through :class:`CodeGenUQ`. + +Available scorers: + +- ``sequence_probability`` +- ``min_probability`` +- ``mean_token_negentropy`` +- ``min_token_negentropy`` +- ``probability_margin`` +- ``monte_carlo_probability`` +- ``p_true`` + +**Key Properties:** + +- Requires token probabilities from the LLM/API +- Uses the same definitions as the corresponding white-box short-form scorers +- Score range: :math:`[0, 1]` + +Parameters +---------- + +When using :class:`CodeGenUQ`, specify one or more token-probability scorer names in the ``scorers`` list. + +Example +------- + +.. 
code-block:: python + + from uqlm import CodeGenUQ + + code_uq = CodeGenUQ( + llm=llm, + scorers=["sequence_probability", "min_probability"], + language="python", + ) + + results = await code_uq.generate_and_score(prompts=prompts) + +See Also +-------- + +- :class:`CodeGenUQ` - Class for code-generation uncertainty quantification +- :class:`WhiteBoxUQ` - Class for white-box uncertainty quantification diff --git a/docs/source/scorer_definitions/index.rst b/docs/source/scorer_definitions/index.rst index 0cbdfb15..a2bcc486 100644 --- a/docs/source/scorer_definitions/index.rst +++ b/docs/source/scorer_definitions/index.rst @@ -15,4 +15,4 @@ For detailed API documentation and usage examples, see the :doc:`API Reference < llm_judges/index ensemble/index long_text/index - + code_generation/index diff --git a/examples/long_text_graph_demo.ipynb b/examples/long_text_graph_demo.ipynb index 44bb63f3..e0065f14 100644 --- a/examples/long_text_graph_demo.ipynb +++ b/examples/long_text_graph_demo.ipynb @@ -16,8 +16,8 @@ "* Betweenness Centrality ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n", "* PageRank ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))\n", "* Degree Centrality ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n", - "* Harmonic Centrality\n", - "* Laplacian Centrality\n", + "* Harmonic Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n", + "* Laplacian Centrality ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n", "\n", "\n", "\n", @@ -106,12 +106,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**. " + "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. 
Alternatively, specify `factscore-stem-geo` for use of FactScore-STEM-Geo from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "tags": [] }, @@ -204,7 +204,11 @@ "source": [ "# Load example dataset (FactScore)\n", "factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n", - "factscore.head()" + "factscore.head()\n", + "\n", + "# # Alternative dataset (FactScore-STEM-Geo)\n", + "# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n", + "# factscore_stem_geo.head()\n" ] }, { diff --git a/examples/long_text_qa_demo.ipynb b/examples/long_text_qa_demo.ipynb index 3c97ca86..133f466c 100644 --- a/examples/long_text_qa_demo.ipynb +++ b/examples/long_text_qa_demo.ipynb @@ -12,7 +12,7 @@ "

\n", " \n", "* Long-form Semantic Entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))\n", - "* Black-Box Generalizations of Long-form Semantic Entropy\n", + "* Black-Box Generalizations of Long-form Semantic Entropy ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n", "\n", "\n", "\n", @@ -92,15 +92,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**. " + "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. Alternatively, specify `factscore-stem-geo` for use of FactScore-STEM-Geo from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, + "execution_count": null, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -182,15 +180,18 @@ "4 Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ... 
" ] }, - "execution_count": 2, "metadata": {}, - "output_type": "execute_result" + "output_type": "display_data" } ], "source": [ "# Load example dataset (FactScore)\n", - "factscore = load_example_dataset(\"factscore\", n=15)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n", - "factscore.head()" + "factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n", + "factscore.head()\n", + "\n", + "# # Alternative dataset (FactScore-STEM-Geo)\n", + "# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n", + "# factscore_stem_geo.head()\n" ] }, { diff --git a/examples/long_text_uq_demo.ipynb b/examples/long_text_uq_demo.ipynb index 4a6758e9..bcb407e4 100644 --- a/examples/long_text_uq_demo.ipynb +++ b/examples/long_text_uq_demo.ipynb @@ -15,7 +15,7 @@ "* Long-text Uncertainty Quantification (LUQ) ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n", "* LUQ-Atomic ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n", "* LUQ-pair ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n", - "* Generalized LUQ-pair ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279))\n", + "* Generalized LUQ-pair ([Bouchard et al., 2026](https://arxiv.org/abs/2602.17431))\n", "\n", "\n", "\n", @@ -94,15 +94,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. To implement with your use case, simply **replace the example prompts with your data**. " + "In this demo, we will illustrate this approach using the [FactScore](https://github.com/shmsw25/FActScore/tree/main/factscore) longform QA dataset. Alternatively, specify `factscore-stem-geo` for use of FactScore-STEM-Geo from [Bouchard et al., 2026](https://arxiv.org/abs/2602.17431). 
To implement with your use case, simply **replace the example prompts with your data**. " ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, + "execution_count": null, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -154,6 +152,16 @@ " Tell me a bio of Iggy Azalea within 100 words.\\n\n", " Amethyst Amelia Kelly (born 7 June 1990), know...\n", " \n", + " \n", + " 3\n", + " Tell me a bio of Fernando da Costa Novaes with...\n", + " Fernando da Costa Novaes (April 6, 1927 – Marc...\n", + " \n", + " \n", + " 4\n", + " Tell me a bio of Jan Zamoyski within 100 words.\\n\n", + " Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ...\n", + " \n", " \n", "\n", "" @@ -163,22 +171,29 @@ "0 Tell me a bio of Suthida within 100 words.\\n \n", "1 Tell me a bio of Miguel Ángel Félix Gallardo w... \n", "2 Tell me a bio of Iggy Azalea within 100 words.\\n \n", + "3 Tell me a bio of Fernando da Costa Novaes with... \n", + "4 Tell me a bio of Jan Zamoyski within 100 words.\\n \n", "\n", " wikipedia_text \n", "0 Suthida Bajrasudhabimalalakshana (Thai: สมเด็จ... \n", "1 Miguel Ángel Félix Gallardo (born January 8, 1... \n", - "2 Amethyst Amelia Kelly (born 7 June 1990), know... " + "2 Amethyst Amelia Kelly (born 7 June 1990), know... \n", + "3 Fernando da Costa Novaes (April 6, 1927 – Marc... \n", + "4 Jan Sariusz Zamoyski (Latin: Ioannes Zamoyski ... 
" ] }, - "execution_count": 2, "metadata": {}, - "output_type": "execute_result" + "output_type": "display_data" } ], "source": [ "# Load example dataset (FactScore)\n", - "factscore = load_example_dataset(\"factscore\", n=3)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n", - "factscore.head()" + "factscore = load_example_dataset(\"factscore\", n=5)[[\"hundredw_prompt\", \"wikipedia_text\"]].rename(columns={\"hundredw_prompt\": \"prompt\"})\n", + "factscore.head()\n", + "\n", + "# # Alternative dataset (FactScore-STEM-Geo)\n", + "# factscore_stem_geo = load_example_dataset(\"factscore-stem-geo\", n=5)\n", + "# factscore_stem_geo.head()\n" ] }, { diff --git a/tests/test_load_example_dataset.py b/tests/test_load_example_dataset.py index 1ee40b46..2f3f350e 100644 --- a/tests/test_load_example_dataset.py +++ b/tests/test_load_example_dataset.py @@ -14,6 +14,8 @@ import os import platform +import sys +import types import pytest import pandas as pd from uqlm.utils.dataloader import load_example_dataset, list_dataset_names, _combine_question_and_choices @@ -24,6 +26,7 @@ def test_list_dataset_names(): datasets = list_dataset_names() assert isinstance(datasets, list) assert "gsm8k" in datasets + assert "factscore-stem-geo" in datasets def test_load_nonexistent_dataset(): @@ -31,6 +34,41 @@ def test_load_nonexistent_dataset(): load_example_dataset("nonexistent_dataset") +def test_load_factscore_stem_geo(monkeypatch): + import uqlm.utils.dataloader as dataloader + + class FakePage: + text = "Hydrogen is a chemical element." 
+ + class FakeWikipedia: + def __init__(self, user_agent, language): + self.user_agent = user_agent + self.language = language + + def page(self, entity): + return FakePage() + + fake_wikipediaapi = types.SimpleNamespace(Wikipedia=FakeWikipedia) + monkeypatch.setitem(sys.modules, "wikipediaapi", fake_wikipediaapi) + monkeypatch.setattr(dataloader.importlib.util, "find_spec", lambda name: object() if name == "wikipediaapi" else None) + monkeypatch.setattr(dataloader, "FACTSCORE_STEM_GEO_ENTITIES", {"chemical element": ["Hydrogen"]}) + + df = load_example_dataset("factscore-stem-geo") + + assert list(df.columns) == ["entity_type", "entity", "question", "wikipedia_text"] + assert df["question"].iloc[0] == "Write a paragraph with some facts about the chemical element Hydrogen." + assert df["wikipedia_text"].iloc[0] == "Hydrogen is a chemical element." + + +def test_load_factscore_stem_geo_warns_without_wikipediaapi(monkeypatch): + import uqlm.utils.dataloader as dataloader + + monkeypatch.setattr(dataloader.importlib.util, "find_spec", lambda name: None if name == "wikipediaapi" else object()) + with pytest.warns(UserWarning, match="wikipedia-api"): + with pytest.raises(ImportError, match="wikipedia-api"): + load_example_dataset("factscore-stem-geo") + + # @unittest.skipIf(((os.getenv("CI") == "true") & (platform.system() == "Darwin")), "Skipping test in macOS CI due to connection issues.") @pytest.mark.skipif(((os.getenv("CI") == "true") & (platform.system() == "Darwin")), reason="Skipping test in macOS CI due to connection issues.") @pytest.mark.flaky(reruns=3) diff --git a/uqlm/utils/dataloader.py b/uqlm/utils/dataloader.py index 43d7e67d..f2cd5757 100644 --- a/uqlm/utils/dataloader.py +++ b/uqlm/utils/dataloader.py @@ -13,12 +13,14 @@ # limitations under the License. 
import pandas as pd -from typing import Optional, Union +from typing import List, Optional, Union from datasets import load_dataset, concatenate_datasets from datasets import disable_progress_bars +import importlib.util import re import ast import numpy as np +import warnings from copy import deepcopy """This module uses the _dataset_default_params dict to control what datasets load_example_dataset can load and how they are loaded. @@ -86,8 +88,11 @@ "hotpotqa": {"load_params": {"path": "hotpotqa/hotpot_qa", "name": "distractor", "split": "validation"}, "extra_processing": {"subset_columns": ["question", "answer"]}}, "simpleqa": {"load_params": {"path": "google/simpleqa-verified", "split": "eval"}, "extra_processing": {"subset_columns": ["question", "answer"], "rename_columns": {"problem": "question"}}}, "livecodebench": {"load_params": {"path": "livecodebench/code_generation_lite", "split": "test"}, "extra_processing": {"subset_columns": ["question_title", "question_content", "platform", "question_id", "starter_code", "public_test_cases", "metadata", "difficulty"]}}, + "factscore-stem-geo": {"load_params": {"loader": "_load_factscore_stem_geo_dataset"}, "extra_processing": {}}, } +USER_AGENT = "uqlm/0.6.0 (https://github.com/cvs-health/uqlm)" + def list_dataset_names() -> list: """ @@ -102,7 +107,7 @@ def list_dataset_names() -> list: ------- >>> from uqlm.utils.dataloader import list_dataset_names >>> list_dataset_names() - ['ai2_arc', 'csqa', 'dialogue_sum', 'gsm8k', 'nq_open', 'popqa', 'svamp', 'triviaqa', 'factscore', 'hotpotqa', 'simpleqa','livecodebench'] + ['ai2_arc', 'csqa', 'gsm8k', 'nq_open', 'popqa', 'svamp', 'factscore', 'hotpotqa', 'simpleqa', 'livecodebench', 'factscore-stem-geo'] """ return list(_dataset_default_params.keys()) @@ -115,11 +120,12 @@ def load_example_dataset(name: str, n: int = None, cols: Optional[Union[list, st ---------- name : str The name of the dataset to load. 
Must be one of "svamp", "gsm8k", "ai2_arc", - "csqa", "nq_open", "popqa" + "csqa", "nq_open", "popqa", "factscore", "hotpotqa", "simpleqa", + "livecodebench", "factscore-stem-geo" n : int, optional - Number of rows to load from the dataset. - + Number of rows to load from the dataset. Ignored for "factscore-stem-geo", + which always returns the longest 100 articles for each of four categories. Returns ------- pd.DataFrame @@ -136,6 +142,15 @@ def load_example_dataset(name: str, n: int = None, cols: Optional[Union[list, st if name in dataset_dict.keys(): # loads from huggingface hub disable_progress_bars() # disable hf tqdm bars b/c it's a little ugly print(f"Loading dataset - {name}...") + if dataset_dict[name]["load_params"].get("loader") == "_load_factscore_stem_geo_dataset": + if isinstance(n, int): + print("""Note: the 'n' parameter is not used for 'factscore-stem-geo' — the longest 100 articles will be returned for four categories: chemical elements, nerves in the human body, mountains on Earth, and scientific laws.""") + print("Fetching Wikipedia articles — this may take a few minutes...") + df = _load_factscore_stem_geo_dataset() + if cols: + df = _dataset_processing(df=df, subset_columns=cols) + print("Dataset ready!") + return df if split: dataset_dict[name]["load_params"]["split"] = split ds = load_dataset(**dataset_dict[name]["load_params"]) @@ -301,3 +316,763 @@ def _combine_question_and_choices(df: pd.DataFrame, question_col: str, choice_co else: raise TypeError(f"'choice_col' must be str or list, but received '{type(choice_col)}'") return df + + +FACTSCORE_STEM_GEO_ENTITIES = { + "nerve": [ + "Abdominal aortic plexus", + "Abducens nerve", + "Accessory nerve", + "Accessory obturator nerve", + "Alderman's nerve", + "Anococcygeal nerve", + "Ansa cervicalis", + "Anterior interosseous nerve", + "Anterior superior alveolar nerve", + "Auerbach's plexus", + "Auriculotemporal nerve", + "Axillary nerve", + "Brachial plexus", + "Buccal branch of the facial
nerve", + "Buccal nerve", + "Cardiac plexus", + "Cavernous nerves", + "Cavernous plexus", + "Celiac ganglia", + "Cervical branch of the facial nerve", + "Cervical plexus", + "Chorda tympani", + "Ciliary ganglion", + "Coccygeal nerve", + "Cochlear nerve", + "Common fibular nerve", + "Common palmar digital nerves of median nerve", + "Deep branch of the radial nerve", + "Deep fibular nerve", + "Deep petrosal nerve", + "Deep temporal nerves", + "Diagonal band of Broca", + "Digastric branch of facial nerve", + "Dorsal branch of ulnar nerve", + "Dorsal nerve of clitoris", + "Dorsal nerve of the penis", + "Dorsal scapular nerve", + "Esophageal plexus", + "Ethmoidal nerves", + "External laryngeal nerve", + "External nasal nerve", + "Facial nerve", + "Femoral nerve", + "Frontal nerve", + "Gastric plexuses", + "Geniculate ganglion", + "Genital branch of genitofemoral nerve", + "Genitofemoral nerve", + "Glossopharyngeal nerve", + "Greater auricular nerve", + "Greater occipital nerve", + "Greater petrosal nerve", + "Hepatic plexus", + "Hypoglossal nerve", + "Iliohypogastric nerve", + "Ilioinguinal nerve", + "Inferior alveolar nerve", + "Inferior anal nerves", + "Inferior cardiac nerve", + "Inferior cervical ganglion", + "Inferior gluteal nerve", + "Inferior hypogastric plexus", + "Inferior mesenteric plexus", + "Inferior palpebral nerve", + "Infraorbital nerve", + "Infraorbital plexus", + "Infratrochlear nerve", + "Intercostal nerves", + "Intercostobrachial nerve", + "Intermediate cutaneous nerve", + "Internal carotid plexus", + "Internal laryngeal nerve", + "Interneuron", + "Jugular ganglion", + "Lacrimal nerve", + "Lateral cord", + "Lateral cutaneous nerve of forearm", + "Lateral cutaneous nerve of thigh", + "Lateral pectoral nerve", + "Lateral plantar nerve", + "Lateral pterygoid nerve", + "Lesser occipital nerve", + "Lingual nerve", + "Long ciliary nerves", + "Long root of the ciliary ganglion", + "Long thoracic nerve", + "Lower subscapular nerve", + "Lumbar nerves", + 
"Lumbar plexus", + "Lumbar splanchnic nerves", + "Lumboinguinal nerve", + "Lumbosacral plexus", + "Lumbosacral trunk", + "Mandibular nerve", + "Marginal mandibular branch of facial nerve", + "Masseteric nerve", + "Maxillary nerve", + "Medial cord", + "Medial cutaneous nerve of arm", + "Medial cutaneous nerve of forearm", + "Medial cutaneous nerve", + "Medial pectoral nerve", + "Medial plantar nerve", + "Medial pterygoid nerve", + "Median nerve", + "Meissner's plexus", + "Mental nerve", + "Middle cardiac nerve", + "Middle cervical ganglion", + "Middle meningeal nerve", + "Motor nerve", + "Muscular branches of the radial nerve", + "Musculocutaneous nerve", + "Mylohyoid nerve", + "Nasociliary nerve", + "Nasopalatine nerve", + "Nerve of pterygoid canal", + "Nerve to obturator internus", + "Nerve to quadratus femoris", + "Nerve to the Piriformis", + "Nerve to the stapedius", + "Nerve to the subclavius", + "Nervus intermedius", + "Nervus spinosus", + "Nodose ganglion", + "Obturator nerve", + "Oculomotor nerve", + "Olfactory nerve", + "Ophthalmic nerve", + "Optic nerve", + "Otic ganglion", + "Ovarian plexus", + "Palatine nerves", + "Palmar branch of the median nerve", + "Palmar branch of ulnar nerve", + "Pancreatic plexus", + "Patellar plexus", + "Pelvic splanchnic nerves", + "Perforating cutaneous nerve", + "Perineal branches of posterior femoral cutaneous nerve", + "Perineal nerve", + "Petrous ganglion", + "Pharyngeal branch of vagus nerve", + "Pharyngeal branches of glossopharyngeal nerve", + "Pharyngeal nerve", + "Pharyngeal plexus", + "Phrenic nerve", + "Phrenic plexus", + "Posterior auricular nerve", + "Posterior branch of spinal nerve", + "Posterior cord", + "Posterior cutaneous nerve of arm", + "Posterior cutaneous nerve of forearm", + "Posterior cutaneous nerve of thigh", + "Posterior scrotal nerves", + "Posterior superior alveolar nerve", + "Proper palmar digital nerves of median nerve", + "Prostatic plexus (nervous)", + "Pterygopalatine ganglion", + "Pudendal 
nerve", + "Pudendal plexus", + "Pulmonary branches of vagus nerve", + "Radial nerve", + "Recurrent laryngeal nerve", + "Renal plexus", + "Sacral plexus", + "Sacral splanchnic nerves", + "Saphenous nerve", + "Sciatic nerve", + "Semilunar ganglion", + "Sensory nerve", + "Short ciliary nerves", + "Sphenopalatine nerves", + "Splenic plexus", + "Stylohyoid branch of facial nerve", + "Subcostal nerve", + "Submandibular ganglion", + "Suboccipital nerve", + "Superficial branch of the radial nerve", + "Superficial fibular nerve", + "Superior cardiac nerve", + "Superior cervical ganglion", + "Superior ganglion of glossopharyngeal nerve", + "Superior ganglion of vagus nerve", + "Superior gluteal nerve", + "Superior hypogastric plexus", + "Superior labial nerve", + "Superior laryngeal nerve", + "Superior lateral cutaneous nerve of arm", + "Superior mesenteric plexus", + "Superior rectal plexus", + "Supraclavicular nerves", + "Supraorbital nerve", + "Suprarenal plexus", + "Suprascapular nerve", + "Supratrochlear nerve", + "Sural nerve", + "Sympathetic trunk", + "Temporal branches of the facial nerve", + "Third occipital nerve", + "Thoracic aortic plexus", + "Thoracic splanchnic nerves", + "Thoraco-abdominal nerves", + "Thoracodorsal nerve", + "Tibial nerve", + "Transverse cervical nerve", + "Trigeminal nerve", + "Trochlear nerve", + "Tympanic nerve", + "Ulnar nerve", + "Upper subscapular nerve", + "Uterovaginal plexus", + "Vagus nerve", + "Ventral ramus", + "Vesical nervous plexus", + "Vestibular nerve", + "Vestibulocochlear nerve", + "Zygomatic branches of facial nerve", + "Zygomatic nerve", + "Zygomaticofacial nerve", + "Zygomaticotemporal nerve", + ], + "scientific law or theorem": [ + "Abel's theorem", + "Ariadne's thread", + "Amdahl's law", + "Ampère's circuital law", + "Archie's law", + "Archimedes's principle", + "Axiom of Archimedes", + "Arrhenius equation", + "Avogadro's law", + "Basquin's Law of Fatigue", + "Bell's theorem", + "Benford's law", + "Beer–Lambert law", + 
"Bernoulli's principle", + "Bernoulli's equation", + "Biot–Savart law", + "Birch's law", + "Bogoliubov–Born–Green–Kirkwood–Yvon hierarchy", + "Bogoliubov transformation", + "Boltzmann equation", + "Born's law", + "Boyle's law", + "Bragg's Law", + "Bradford's law", + "Bruun Rule", + "Buys Ballot's law", + "Byerlee's law", + "Carnot's theorem", + "Cauchy's integral formula", + "Cauchy–Riemann equations", + "Cayley–Hamilton theorem", + "Charles's law", + "Chandrasekhar limit", + "Church–Turing thesis", + "Coulomb's law", + "Law of Charles and Gay-Lussac", + "Clifford's theorem", + "Clifford's circle theorems", + "Curie's law", + "Curie–Weiss law", + "D'Alembert's paradox", + "D'Alembert's principle", + "Dalton's law of partial pressure", + "Darcy's law", + "De Bruijn–Erdős theorem", + "De Morgan's law", + "Dermott's law", + "Descartes's theorem", + "Dirac equation", + "Dirac delta function", + "Dirac comb", + "Dirac spinor", + "Dirac operator", + "Drake equation", + "Doppler effect", + "Ehrenfest's theorem", + "Einstein's general theory of relativity", + "Einstein's special theory of relativity", + "El-Sayed rule", + "Erdős–Anning theorem", + "Erdős–Beck theorem", + "Erdős–Gallai theorem", + "Erdős–Kac theorem", + "Erdős–Ko–Rado theorem", + "Erdős–Nagy theorem", + "Erdős–Rado theorem", + "Erdős–Stone theorem", + "Erdős–Szekeres theorem", + "Erdős–Szemerédi theorem", + "Euclid's theorem", + "Euler's theorem", + "Faraday's law of induction", + "Faraday's law of electrolysis", + "Faxén's law", + "Fermat's principle", + "Fermat's Last Theorem", + "Fermat's little theorem", + "Fermi paradox", + "Fermi's golden rule", + "Fermi acceleration", + "Fermi hole", + "Fermionic field", + "Fermi level", + "Fick's law of diffusion", + "Fitts's law", + "Fourier's law", + "Gauss's law", + "Gauss's law for magnetism", + "Gauss's principle of least constraint", + "Gauss's digamma theorem", + "Gauss's hypergeometric theorem", + "Gaussian function", + "Gay-Lussac's law", + "Gibbs–Helmholtz 
equation", + "Gödel's incompleteness theorems", + "Graham's law", + "Green's law", + "Grimm's law", + "Gustafson's law", + "Heisenberg's uncertainty principle", + "Haüy's law of rational indices", + "Haüy's law of symmetry", + "Heaps' law", + "Hellmann–Feynman theorem", + "Henry's law", + "Hertz observations", + "Hess's law", + "Hilbert's basis theorem", + "Hilbert's axioms", + "Hilbert function", + "Hilbert's irreducibility theorem", + "Hilbert's syzygy theorem", + "Hilbert's Theorem 90", + "Hilbert's theorem", + "Hohenberg–Kohn theorem", + "Helmholtz's theorems", + "Helmholtz theorem", + "Helmholtz free energy", + "Helmholtz decomposition", + "Helmholtz equation", + "Helmholtz resonance", + "Hollomon's law", + "Hooke's law", + "Hopkinson's law", + "Hubble's law", + "Hund's rules", + "Huygens–Fresnel principle", + "Joule's laws", + "Jurin's law", + "Kasha's rule", + "Kepler's laws of planetary motion", + "Kirchhoff's laws", + "Kopp's law", + "Larmor formula", + "Leidenfrost effect", + "Lagrangian point", + "Lagrange reversion theorem", + "Lagrange polynomial", + "Lagrange's four-square theorem", + "Lagrange's theorem", + "Lagrange's theorem (group theory)", + "Lagrange invariant", + "Lagrange multiplier", + "Lambert's cosine law", + "Lamm equation", + "Langmuir equation", + "Laplace transform", + "Laplace's equation", + "Laplace operator", + "Laplace distribution", + "Laplace invariant", + "Laplace expansion", + "Laplace principle", + "Laplace limit", + "Le Chatelier's principle", + "Leibniz's law", + "Lenz's law", + "Leonard–Merritt mass estimator", + "l'Hôpital's rule", + "Llinás's law", + "Ludwik's law", + "Mach principle", + "Mach reflection", + "Marconi's law", + "Markovnikov's rule", + "Maupertuis's principle", + "Maxwell's equations", + "Maxwell relations", + "McCulloch's Iron Laws of Conferences", + "Mendelian inheritance", + "Mendel's laws", + "Metcalfe's law", + "Mikheyev–Smirnov–Wolfenstein effect", + "Milner–Rado paradox", + "Minkowski's theorem", + 
"Mitscherlich's law", + "Moore's law", + "Nash embedding theorem", + "Nash equilibrium", + "Nernst equation", + "Newton's law of cooling", + "Newton's law of universal gravitation", + "Newton's laws of motion", + "Niven's theorem", + "Noether's theorem", + "Nyquist–Shannon sampling theorem", + "Occam's razor", + "Ohm's law", + "Osipkov–Merritt model", + "Ostwald dilution law", + "Paley–Wiener theorem", + "Pareto distribution", + "Pareto efficiency", + "Pareto index", + "Pareto principle", + "Pascal's law", + "Pascal's theorem", + "Pauli exclusion principle", + "Peano axioms", + "Planck's law", + "Poincaré–Bendixson theorem", + "Poincaré–Birkhoff–Witt theorem", + "Poincaré–Hopf theorem", + "Poincaré recurrence theorem", + "Poincaré conjecture", + "Poincaré lemma", + "Poiseuille's law", + "Poisson distribution", + "Poisson's equation", + "Price's theorem", + "Ptolemy's theorem", + "Pythagorean theorem", + "Raman scattering", + "Rado's theorem", + "Ramanujan–Nagell equation", + "Raoult's law", + "Riemann zeta function", + "Riemann hypothesis", + "Riemann integral", + "Riemann lemma", + "Riemannian manifold", + "Riemann sphere", + "Riemann theta function", + "Rolle's theorem", + "Saha ionization equation", + "Schrödinger equation", + "Seebeck effect", + "Sérsic's law", + "Snell's law", + "Sokolov–Ternov effect", + "Sommerfeld–Kossel displacement law", + "Stefan–Boltzmann law", + "Steno's law", + "Stokes' law", + "Stoletov's law", + "Swift's law", + "Tarski's undefinability theorem", + "Tarski's axioms", + "Thales's theorem", + "Titius–Bode law", + "Torricelli's law", + "Umov effect", + "Van der Waals equation", + "Vlasov equation", + "Voce's law", + "Von Neumann bicommutant theorem", + "Von Neumann entropy", + "von Neumann paradox", + "Von Neumann ergodic theorem", + "Von Neumann universe", + "Von Neumann neighborhood", + "Von Neumann's trace inequality", + "Weinberg–Witten theorem", + "Weyl character formula", + "Wien's law", + "Wiener–Khinchin theorem", + 
"Young–Laplace equation", + "Zener-Hollomon law", + "Zipf's law", + ], + "chemical element": [ + "Hydrogen", + "Helium", + "Lithium", + "Beryllium", + "Boron", + "Carbon", + "Nitrogen", + "Oxygen", + "Fluorine", + "Neon", + "Sodium", + "Magnesium", + "Aluminium", + "Silicon", + "Phosphorus", + "Sulfur", + "Chlorine", + "Argon", + "Potassium", + "Calcium", + "Scandium", + "Titanium", + "Vanadium", + "Chromium", + "Manganese", + "Iron", + "Cobalt", + "Nickel", + "Copper", + "Zinc", + "Gallium", + "Germanium", + "Arsenic", + "Selenium", + "Bromine", + "Krypton", + "Rubidium", + "Strontium", + "Yttrium", + "Zirconium", + "Niobium", + "Molybdenum", + "Technetium", + "Ruthenium", + "Rhodium", + "Palladium", + "Silver", + "Cadmium", + "Indium", + "Tin", + "Antimony", + "Tellurium", + "Iodine", + "Xenon", + "Caesium", + "Barium", + "Lanthanum", + "Cerium", + "Praseodymium", + "Neodymium", + "Promethium", + "Samarium", + "Europium", + "Gadolinium", + "Terbium", + "Dysprosium", + "Holmium", + "Erbium", + "Thulium", + "Ytterbium", + "Lutetium", + "Hafnium", + "Tantalum", + "Tungsten", + "Rhenium", + "Osmium", + "Iridium", + "Platinum", + "Gold", + "Mercury", + "Thallium", + "Lead", + "Bismuth", + "Polonium", + "Astatine", + "Radon", + "Francium", + "Radium", + "Actinium", + "Thorium", + "Protactinium", + "Uranium", + "Neptunium", + "Plutonium", + "Americium", + "Curium", + "Berkelium", + "Californium", + "Einsteinium", + "Fermium", + "Mendelevium", + "Nobelium", + "Lawrencium", + "Rutherfordium", + "Dubnium", + "Seaborgium", + "Bohrium", + "Hassium", + "Meitnerium", + "Darmstadtium", + "Roentgenium", + "Copernicium", + "Nihonium", + "Flerovium", + "Moscovium", + "Livermorium", + "Tennessine", + "Oganesson", + ], + "mountain": [ + "Mount Everest", + "K2", + "Kangchenjunga", + "Lhotse", + "Makalu", + "Cho Oyu", + "Dhaulagiri I", + "Manaslu", + "Nanga Parbat", + "Annapurna I", + "Gasherbrum I", + "Broad Peak", + "Gasherbrum II", + "Shishapangma", + "Gyachung Kang", + "Gasherbrum 
III", + "Annapurna II", + "Gasherbrum IV", + "Himalchuli", + "Distaghil Sar", + "Ngadi Chuli", + "Nuptse", + "Khunyang Chhish", + "Masherbrum", + "Nanda Devi", + "Chomo Lonzo", + "Batura Sar", + "Rakaposhi", + "Namcha Barwa", + "Kanjut Sar", + "Kamet", + "Saltoro Kangri", + "Tirich Mir", + "Molamenqing", + "Gurla Mandhata", + "Saser Kangri I", + "Chogolisa", + "Kongur Tagh", + "Shispare", + "Trivor", + "Gangkhar Puensum", + "Gongga Shan", + "Annapurna III", + "Skyang Kangri", + "Changtse", + "Kula Kangri", + "Kongur Tiube", + "Annapurna IV", + "Mamostong Kangri", + "Saser Kangri II E", + "Muztagh Ata", + "Ismoil Somoni Peak", + "Saser Kangri III", + "Noshaq", + "Pumari Chhish", + "Passu Sar", + "Yukshin Gardan Sar", + "Teram Kangri I", + "Jongsong Peak", + "Malubiting", + "Gangapurna", + "Jengish Chokusu", + "Sunanda Devi", + "Yangra", + "Sia Kangri", + "Momhil Sar", + "Kabru N", + "Skil Brum", + "Haramosh Peak", + "Istor-o-Nal", + "Ghent Kangri", + "Ultar", + "Churen Himal", + "Teram Kangri III", + "Sherpi Kangri", + "Labuche Kang", + "Kirat Chuli", + "Abi Gamin", + "Gimmigela Chuli", + "Nangpai Gosum", + "Saraghrar", + "Talung", + "Jomolhari", + "Chamlang", + "Chongtar", + "Baltoro Kangri", + "Siguang Ri", + "The Crown (mountain)", + "Gyala Peri", + "Porong Ri", + "Baintha Brakk", + "Yutmaru Sar", + "K6", + "Kangpenqing", + "Muztagh Tower", + "Mana Peak", + "Diran", + "Putha Hiunchuli", + "Apsarasas Kangri", + "Mukut Parbat", + "Rimo III", + "Langtang Lirung", + "Karjiang", + "Annapurna Dakshin (Annapurna South)", + "Khartaphu", + "Tongshanjiabu", + "Malangutti Sar", + "Noijin Kangsang", + "Langtang Ri", + "Kangphu Kang", + "Singhi Kangri", + "Lupghar Sar", + ], +} + + +def _get_wiki_texts_from_entities(entities: List[str]) -> dict: + """ + Retrieve Wikipedia article text for a list of entities. + + Requires the optional ``wikipedia-api`` package. If more than 100 articles + are retrieved, only the 100 longest article texts are returned. 
+    """
+    if importlib.util.find_spec("wikipediaapi") is None:
+        message = "The optional dependency 'wikipedia-api' is required to load 'factscore-stem-geo'. Install it with `pip install wikipedia-api`."
+        warnings.warn(message, UserWarning, stacklevel=2)
+        raise ImportError(message)
+
+    import wikipediaapi
+
+    wiki_wiki = wikipediaapi.Wikipedia(user_agent=USER_AGENT, language="en")
+    texts = {}
+    for entity in entities:
+        page = wiki_wiki.page(entity)
+        page_text = page.text
+        if page_text:
+            texts[entity] = page_text
+
+    if len(texts) > 100:
+        sorted_entities = sorted(texts.keys(), key=lambda x: len(texts[x]), reverse=True)[:100]
+        texts = {entity: texts[entity] for entity in sorted_entities}
+
+    return texts
+
+
+def _load_factscore_stem_geo_dataset() -> pd.DataFrame:
+    rows = []
+    for entity_type, entities in FACTSCORE_STEM_GEO_ENTITIES.items():
+        wiki_texts = _get_wiki_texts_from_entities(entities)
+        rows.extend({"entity_type": entity_type, "entity": entity, "question": f"Write a paragraph with some facts about the {entity_type} {entity}.", "wikipedia_text": wiki_text} for entity, wiki_text in wiki_texts.items())
+
+    return pd.DataFrame(rows)
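
The truncation rule in `_get_wiki_texts_from_entities` (keep only the 100 longest article texts) can be checked without network access. The following is a minimal sketch, not code from this change: `keep_longest` and the sample dictionary are hypothetical names introduced for illustration, and `heapq.nlargest` stands in for the `sorted(...)[:100]` call used in the diff. Because the loader calls the retrieval function once per entity type, the cap applies to each entity type separately.

```python
import heapq
from typing import Dict


def keep_longest(texts: Dict[str, str], limit: int = 100) -> Dict[str, str]:
    """Return at most `limit` entries of `texts`, keeping the longest values.

    Hypothetical helper: mirrors the truncation step in the diff, which sorts
    entities by article length in descending order and keeps the first 100.
    """
    if len(texts) <= limit:
        # Nothing to drop; return a copy in the original insertion order.
        return dict(texts)
    # nlargest yields keys ordered from longest to shortest text.
    longest = heapq.nlargest(limit, texts, key=lambda name: len(texts[name]))
    return {name: texts[name] for name in longest}


# Stand-in article texts of different lengths (no Wikipedia request is made).
sample = {"Hydrogen": "a" * 50, "Helium": "b" * 10, "Lithium": "c" * 30}
trimmed = keep_longest(sample, limit=2)
print(list(trimmed))  # ['Hydrogen', 'Lithium']
```

As in the diff, the result is reordered by descending text length only when truncation actually happens; otherwise the input order is preserved.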