Add FactScore-STEM-Geo dataset; Include CodeGenUQ in docs #409
if cols:
    df = _dataset_processing(df=df, subset_columns=cols)
if isinstance(n, int):
    df = df.iloc[:n]
What is this slicing used for? I'm calling it out because it happens at the very end: by then you've already done the hard part, all the HTTP calls to Wikipedia, only to throw most of the result away if you're doing .iloc[:5] or something, right?
True, good point. Should we just ignore the n parameter for factscore-stem-geo and let the user call .head() or .sample()? Open to ideas here.
Yeah, either that or pass n to load_factscore_stem_geo_dataset() so that it can decide what to do with n before fetching pages.
How long do you find it takes to load this dataset with this code? It might be worth putting a progress bar on the for loop that calls Wikipedia so the user understands what's taking so long, unless you find it's fast because Wikipedia is fine with the wiki lib hitting the server hard. I originally put FactScore in HF 1) to keep the HF-centric approach and 2) to avoid the issues with scraping on demand. I'm not arguing that this needs to be static in HF; that's just where my motivation for this comment thread is coming from :D
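A minimal sketch of the second option, not the PR's actual code: truncate the entity list to n before any request is made, and print a simple progress line while fetching. The names fetch_wiki_texts and fetch_page are hypothetical; fetch_page stands in for whatever call retrieves one article, and is replaced here by a fake so the example runs offline.

```python
import sys
from typing import Callable, Dict, List, Optional


def fetch_wiki_texts(
    entities: List[str],
    fetch_page: Callable[[str], str],  # hypothetical: returns one article's text
    n: Optional[int] = None,
) -> Dict[str, str]:
    """Fetch article text for at most `n` entities, reporting progress."""
    # Truncate first, so no request is made for rows that would be dropped later.
    selected = entities if n is None else entities[:n]
    texts: Dict[str, str] = {}
    for i, entity in enumerate(selected, start=1):
        texts[entity] = fetch_page(entity)
        # Lightweight progress line on stderr; a tqdm wrapper would also work.
        print(f"\rFetched {i}/{len(selected)} pages", end="", file=sys.stderr)
    if selected:
        print(file=sys.stderr)  # finish the progress line
    return texts


# Stand-in fetcher that records calls instead of hitting Wikipedia.
calls: List[str] = []


def fake_fetch(title: str) -> str:
    calls.append(title)
    return f"text for {title}"


result = fetch_wiki_texts(["Paris", "Entropy", "Nile"], fake_fetch, n=2)
```

With n=2, only the first two entities are ever fetched, which is the saving the comment above is after. Whether "first n" is the right selection (versus a random sample) is a separate design choice.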
updated as discussed!
virenbajaj left a comment
I think one function that should be private is public.
Two documentation nits. Otherwise looks good!
print(f"Loading dataset - {name}...")
if dataset_dict[name]["load_params"].get("loader") == "_load_factscore_stem_geo_dataset":
    if isinstance(n, int):
        print("Note: the 'n' parameter is not used for 'factscore-stem-geo' — all available articles will be returned.")
nit: this note says all available articles will be returned, but this is capped at 100 articles per entity type. Can we say something like: "At most 100 longest articles per entity will be returned"?
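For context, a cap like the one described could look roughly like the sketch below. This is an illustration only: the PR's actual selection code is not shown in this thread, and longest_per_type, the (entity_type, title, text) tuple layout, and the sample data are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def longest_per_type(
    articles: List[Tuple[str, str, str]], cap: int = 100
) -> Dict[str, List[str]]:
    """Return the titles of the `cap` longest articles for each entity type.

    Each item is (entity_type, title, text); this layout is illustrative.
    """
    by_type: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for entity_type, title, text in articles:
        by_type[entity_type].append((title, text))
    return {
        entity_type: [
            title
            # Sort by text length, longest first, then keep at most `cap`.
            for title, _ in sorted(items, key=lambda it: len(it[1]), reverse=True)[:cap]
        ]
        for entity_type, items in by_type.items()
    }


sample = [
    ("geo", "Nile", "a" * 30),
    ("geo", "Paris", "a" * 50),
    ("geo", "Oslo", "a" * 10),
    ("stem", "Entropy", "a" * 20),
]
picked = longest_per_type(sample, cap=2)
```

Under a cap like this, the number of returned rows depends on the cap and the number of entity types rather than on n, which is why the note's wording matters.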
"livecodebench", "factscore-stem-geo"

n : int, optional
    Number of rows to load from the dataset.
nit: change to
"n : int, optional
Number of rows to load from the dataset. Ignored for "factscore-stem-geo",
which always returns all fetched articles."
}


def get_wiki_texts_from_entities(entities: List[str]) -> dict:
Should this be a private helper that starts with an underscore _ like _load_factscore_stem_geo_dataset()?
def _get_wiki_texts_from_entities(entities: List[str]) -> dict:
Description
load_example_dataset utility. This uses the wikipedia-api library to create the long-form answer key.
Type of Change
Checklist
ruff check and ruff format pass locally