Commit cf40a31

SrdjanLL and qn895 authored
[HF Data Loader] Load onechat datasets with externally defined mappings. (elastic#230252)
## Summary

Closes: elastic#227993

This PR introduces OneChat datasets to the HuggingFace dataset loader CLI tool.

### Key Changes

* OneChat Data Loading Support
  - Added support for loading OneChat datasets (CSV-based) from the `elastic/OneChatAgent` repository.
  - OneChat datasets can now be specified using the `onechat/<directory>/<dataset>` syntax.
* Refactored File Download and Parsing
  - Extracted HuggingFace file download logic for reuse.
  - Separated CSV and JSON parsing into dedicated functions for clarity and maintainability. Previously only JSON download was supported. This uses `papaparse` library for csv parsing which was very handy for parsing multi-line csv files.
* Configurable Kibana URL
  - The loader now accepts a `--kibana-url` flag to override auto-discovery of the Kibana instance. This is needed locally for the evaluation framework that uses scout server running on port `5620`.
* Documentation Updates:
  - Expanded the README with clear instructions for loading both HuggingFace and OneChat datasets.
  - Added OneChat eval specific instructions in the corresponding README
* Fixed a bug with the existing dataset loader that was previously only storing one document in ES.

### How to Test

Follow the instructions from the updated README:

1. Prerequisites
   - Ensure you have a running Kibana and Elasticsearch instance.
   - Obtain a HuggingFace access token.
   - (For OneChat datasets) Ensure your HuggingFace account is part of the Elastic organization.
2. Run the Loader
   ```bash
   HUGGING_FACE_ACCESS_TOKEN=<YOUR_TOKEN> \
   node --require ./src/setup_node_env/index.js \
     x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts \
     --datasets onechat/knowledge-base/users,huffpost \
     --limit 100 \
     --clear \
     --kibana-url http://<username>:<password>@localhost:5601
   ```
3. Verify
   - Check the logs for successful ingestion and embedding.
   - Confirm the new indices in your local ES instance.

### Checklist

Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well.

- [x] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: Quynh Nguyen (Quinn) <[email protected]>
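The description notes that multi-line CSV files were the reason for bringing in `papaparse`. The sketch below is not the PR's code and does not use `papaparse`; it is a small hand-rolled parser (the name `parseCsv` is hypothetical) that only illustrates why a quoted field containing a newline cannot be handled by splitting the file on line breaks.

```typescript
// Illustrative only: a minimal RFC 4180-style parser, NOT the papaparse-based
// code from this PR. It shows how quoted fields may span several lines.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          field += '"'; // doubled quote is an escaped quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // newlines are kept verbatim while quoted
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
  }
  // Flush the last record when the input does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Made-up sample: one header row and two records, one of them multi-line.
const csvSample = 'id,body\n1,"line one\nline two"\n2,"say ""hi"""\n';
const naiveRowCount = csvSample.trim().split('\n').length; // over-counts
const parsedRows = parseCsv(csvSample);
console.log(naiveRowCount, parsedRows.length);
```

Splitting on `\n` reports four rows for this sample, while the quote-aware pass recovers the three actual records. A production loader would normally rely on a maintained parser rather than a loop like this one.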
1 parent 44b7f19 commit cf40a31

14 files changed, +655 -149 lines changed


x-pack/platform/packages/shared/kbn-ai-tools-cli/index.ts

Lines changed: 1 addition & 1 deletion
@@ -7,4 +7,4 @@

 export { loadHuggingFaceDatasets } from './src/hf_dataset_loader/load_hugging_face_datasets';
 export type { HuggingFaceDatasetSpec } from './src/hf_dataset_loader/types';
-export { ALL_HUGGING_FACE_DATASETS } from './src/hf_dataset_loader/config';
+export { PREDEFINED_HUGGING_FACE_DATASETS } from './src/hf_dataset_loader/datasets/config';

x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts

Lines changed: 54 additions & 16 deletions
@@ -7,9 +7,14 @@

 import { run } from '@kbn/dev-cli-runner';
 import { createKibanaClient, toolingLogToLogger } from '@kbn/kibana-api-cli';
-import { castArray, keyBy } from 'lodash';
+import { castArray } from 'lodash';
 import { loadHuggingFaceDatasets } from '../src/hf_dataset_loader/load_hugging_face_datasets';
-import { ALL_HUGGING_FACE_DATASETS } from '../src/hf_dataset_loader/config';
+import {
+  PREDEFINED_HUGGING_FACE_DATASETS,
+  getDatasetSpecs,
+} from '../src/hf_dataset_loader/datasets/config';
+import { listAllOneChatDatasets } from '../src/hf_dataset_loader/datasets/onechat';
+import { HuggingFaceDatasetSpec } from '../src/hf_dataset_loader/types';

 interface Flags {
   // the number of rows per dataset to load into ES
@@ -18,11 +23,39 @@ interface Flags {
   datasets?: string | string[];
   // whether all specified dataset's indices should be cleared before loading
   clear?: boolean;
+  // the kibana URL to connect to
+  'kibana-url'?: string;
+}
+
+async function showAvailableDatasets(accessToken: string, logger: any) {
+  let output = 'No datasets specified. Here are the available datasets:\n\n';
+
+  output += 'Pre-defined HuggingFace datasets:\n';
+  output += PREDEFINED_HUGGING_FACE_DATASETS.map((d, index) => ` ${index + 1}. ${d.name}`).join(
+    '\n'
+  );
+  output += '\n\n';
+
+  const oneChatDatasets = await listAllOneChatDatasets(accessToken, logger);
+  output += 'OneChat datasets:\n';
+  if (oneChatDatasets.length > 0) {
+    output += oneChatDatasets.map((dataset, index) => ` ${index + 1}. ${dataset}`).join('\n');
+  } else {
+    output +=
+      ' (none available - you may need to join Elastic organization on HuggingFace to access OneChat datasets)';
+  }
+
+  output += '\n\n';
+  output += 'Usage: Use --datasets to specify which datasets to load\n';
+  output += 'Example: --datasets onechat/knowledge-base/wix_knowledge_base';
+
+  logger.info(output);
 }

 run(
   async ({ log, flags }) => {
     const signal = new AbortController().signal;
+    const logger = toolingLogToLogger({ flags, log });

     const accessToken = process.env.HUGGING_FACE_ACCESS_TOKEN;

@@ -32,38 +65,40 @@ run(
       );
     }

+    // destructure and normalize CLI flags
+    const { limit, datasets, clear } = flags as Flags;
+    const kibanaUrl = typeof flags['kibana-url'] === 'string' ? flags['kibana-url'] : undefined;
+
     const kibanaClient = await createKibanaClient({
       log,
       signal,
+      baseUrl: kibanaUrl,
     });

-    // destructure and normalize CLI flags
-    const { limit, datasets, clear } = flags as Flags;
-
     const datasetNames = !!datasets
       ? castArray(datasets)
           .flatMap((set) => set.split(','))
           .map((set) => set.trim())
           .filter(Boolean)
       : undefined;

-    const specsByName = keyBy(ALL_HUGGING_FACE_DATASETS, (val) => val.name);
+    let specs: HuggingFaceDatasetSpec[];

-    const specs =
-      datasetNames?.map((name) => {
-        if (!specsByName[name]) {
-          throw new Error(`Dataset spec for ${name} not found`);
-        }
-        return specsByName[name];
-      }) ?? ALL_HUGGING_FACE_DATASETS;
+    if (datasetNames) {
+      specs = await getDatasetSpecs(accessToken, logger, datasetNames);
+    } else {
+      // Show available datasets and exit
+      await showAvailableDatasets(accessToken, logger);
+      return;
+    }

     if (!specs.length) {
       throw new Error(`No datasets to load`);
     }

     await loadHuggingFaceDatasets({
       esClient: kibanaClient.es,
-      logger: toolingLogToLogger({ flags, log }),
+      logger,
       clear: Boolean(clear),
       limit: !!limit ? Number(limit) : undefined,
       datasets: specs,
@@ -73,14 +108,17 @@ run(
   {
     description: `Loads HuggingFace datasets into an Elasticsearch cluster`,
     flags: {
-      string: ['limit', 'datasets'],
+      string: ['limit', 'datasets', 'kibana-url'],
       boolean: ['clear'],
       help: `
         Usage: node --require ./src/setup_node_env/index.js x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts [options]

-        --datasets          Comma-separated list of HuggingFace dataset names to load
+        --datasets          Comma-separated list of HuggingFace dataset names to load.
+                            For OneChat datasets, use format: onechat/<directory>/<dataset_name>
+                            Example: --datasets onechat/knowledge-base/wix_knowledge_base
         --limit             Number of rows per dataset to load into Elasticsearch
         --clear             Clear the existing indices for the specified datasets before loading
+        --kibana-url        Kibana URL to connect to (bypasses auto-discovery when provided)
       `,
       default: {
         clear: false,
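For reviewers tracing how `--datasets` values reach `getDatasetSpecs`, here is a dependency-free sketch of the normalization step in the script above. The helper name `normalizeDatasetNames` is hypothetical; the script itself inlines this logic with lodash's `castArray`.

```typescript
// Hypothetical helper mirroring the script's flag normalization without lodash.
// Accepts a single flag value or repeated values, each possibly comma-separated.
function normalizeDatasetNames(datasets?: string | string[]): string[] | undefined {
  if (!datasets) {
    // Flag omitted or empty string: the script lists available datasets and exits.
    return undefined;
  }
  const values = Array.isArray(datasets) ? datasets : [datasets];
  const names: string[] = [];
  for (const value of values) {
    for (const part of value.split(',')) {
      const name = part.trim();
      if (name) {
        names.push(name); // drop empty segments such as a trailing comma
      }
    }
  }
  return names;
}

console.log(normalizeDatasetNames('onechat/knowledge-base/users, huffpost'));
console.log(normalizeDatasetNames(['beir-msmarco,huffpost', 'beir-trec-covid']));
```

Note the two distinct "nothing to do" outcomes: an omitted flag yields `undefined` (list datasets and return), whereas a value such as `" , "` yields an empty array, which the script rejects with `No datasets to load`.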

x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md

Lines changed: 46 additions & 6 deletions
@@ -18,23 +18,63 @@ node --require ./src/setup_node_env/index.js \
   x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts \
   --datasets beir-trec-covid,beir-msmarco \
   --limit 1000 \
-  --clear
+  --clear \
+  --kibana-url http://<username>:<password>@localhost:5601
 ```

 ### CLI flags

-| Flag | Type | Description |
-| ------------ | --------- | ----------------------------------------------------------------------------------------------------- |
-| `--datasets` | `string` | Comma-separated list of dataset **names** to load. Omit the flag to load **all** predefined datasets. |
-| `--limit` | `number` | Max docs per dataset (handy while testing). Defaults to 1k. |
-| `--clear` | `boolean` | Delete the target index **before** indexing. Defaults to `false`. |
+| Flag | Type | Description |
+| -------------- | --------- | ----------------------------------------------------------------------------------------------------- |
+| `--datasets` | `string` | Comma-separated list of dataset **names** to load. Omit the flag to load **all** predefined datasets. |
+| `--limit` | `number` | Max docs per dataset (handy while testing). Defaults to 1k. |
+| `--clear` | `boolean` | Delete the target index **before** indexing. Defaults to `false`. |
+| `--kibana-url` | `string` | Kibana URL to connect to (bypasses auto-discovery when provided). |

 ## Built-in dataset specs

 The script ships with ready-made specifications located in `config.ts`.

 Feel free to extend or tweak these specs in `src/hf_dataset_loader/config.ts`.

+## OneChat datasets
+
+The loader also supports **OneChat datasets** from the `elastic/OneChatAgent` repository. These are CSV-based datasets with predefined mappings stored in `index-mappings.jsonl` files.
+
+**Note**: To access OneChat datasets, you need to be a member of the Elastic organization on HuggingFace. Sign up with your `@elastic.co` email address to request access (automated process).
+
+### OneChat syntax
+
+Use the format `onechat/<directory>/<dataset>` to load OneChat datasets:
+
+```bash
+# Load a single OneChat dataset
+--datasets onechat/knowledge-base/wix_knowledge_base
+
+# Mix OneChat and regular datasets
+--datasets onechat/knowledge-base/wix_knowledge_base,beir-msmarco
+
+# Load multiple OneChat datasets
+--datasets onechat/knowledge-base/wix_knowledge_base,onechat/users/user_profiles
+```
+
+### How it works
+
+1. The loader fetches `<directory>/index-mappings.jsonl` from the OneChat repository
+2. Downloads the corresponding CSV file from `<directory>/datasets/`
+3. Creates Elasticsearch indices with the predefined mappings from `index-mappings.jsonl` file.
+4. Loads the CSV data into the index.
+
+### Available datasets
+
+Run the loader without `--datasets` to see all available OneChat and regular HuggingFace datasets.
+
+### Naming convention
+
+- Repository file: `knowledge-base/datasets/wix_knowledge_base.csv`
+- Loader dataset name: `onechat/knowledge-base/wix_knowledge_base`
+- Elasticsearch index: `wix_knowledge_base`
+
 ## Disabling local cache

 Set the environment variable `DISABLE_KBN_CLI_CACHE=1` to force fresh downloads instead of using the on-disk cache.
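The "How it works" and "Naming convention" sections added to the README imply a simple mapping from a loader dataset name to repository paths and an index name. The sketch below spells that mapping out. The function `locateOneChatDataset` and its return shape are hypothetical, and it assumes exactly three `/`-separated segments; the actual `onechat` module in this commit is not shown in this diff and may behave differently.

```typescript
// Hypothetical illustration of the README naming convention; not the PR's code.
interface OneChatLocation {
  directory: string;
  dataset: string;
  mappingsPath: string; // <directory>/index-mappings.jsonl
  csvPath: string; // <directory>/datasets/<dataset>.csv
  indexName: string; // Elasticsearch index, equal to the dataset name
}

function locateOneChatDataset(name: string): OneChatLocation {
  const parts = name.split('/');
  // Assumption: names always have the form onechat/<directory>/<dataset>.
  if (parts.length !== 3 || parts[0] !== 'onechat' || !parts[1] || !parts[2]) {
    throw new Error(`Expected onechat/<directory>/<dataset>, got '${name}'`);
  }
  const directory = parts[1];
  const dataset = parts[2];
  return {
    directory,
    dataset,
    mappingsPath: `${directory}/index-mappings.jsonl`,
    csvPath: `${directory}/datasets/${dataset}.csv`,
    indexName: dataset,
  };
}

const wixLocation = locateOneChatDataset('onechat/knowledge-base/wix_knowledge_base');
console.log(wixLocation.csvPath, wixLocation.indexName);
```

For the README's own example this yields `knowledge-base/datasets/wix_knowledge_base.csv` and the index `wix_knowledge_base`, matching the three bullets under "Naming convention".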

x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/config.ts renamed to x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/datasets/config.ts

Lines changed: 31 additions & 2 deletions
@@ -5,7 +5,9 @@
  * 2.0.
  */

-import type { HuggingFaceDatasetSpec } from './types';
+import { Logger } from '@kbn/core/server';
+import type { HuggingFaceDatasetSpec } from '../types';
+import { createOneChatDatasetSpec, isOneChatDataset } from './onechat';

 const BEIR_NAMES = [
   'trec-covid',
@@ -81,7 +83,34 @@ const EXTRA_DATASETS: HuggingFaceDatasetSpec[] = [
   },
 ];

-export const ALL_HUGGING_FACE_DATASETS: HuggingFaceDatasetSpec[] = [
+export const PREDEFINED_HUGGING_FACE_DATASETS: HuggingFaceDatasetSpec[] = [
   ...BEIR_DATASETS,
   ...EXTRA_DATASETS,
 ];
+
+/**
+ * Get dataset specifications, including dynamically generated OneChat datasets
+ */
+export async function getDatasetSpecs(
+  accessToken: string,
+  logger: Logger,
+  datasetNames: string[]
+): Promise<HuggingFaceDatasetSpec[]> {
+  const specs: HuggingFaceDatasetSpec[] = [];
+  for (const name of datasetNames) {
+    if (isOneChatDataset(name)) {
+      const spec = await createOneChatDatasetSpec(name, accessToken, logger);
+      specs.push(spec);
+    } else {
+      // Look for static datasets
+      const staticSpec = PREDEFINED_HUGGING_FACE_DATASETS.find((spec) => spec.name === name);
+      if (staticSpec) {
+        specs.push(staticSpec);
+      } else {
+        throw new Error(`Dataset '${name}' not found`);
+      }
+    }
+  }
+
+  return specs;
+}
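To make the resolution behavior of `getDatasetSpecs` easier to reason about, the following is a simplified, synchronous stand-in. Everything in it is hypothetical: the `SpecSketch` type is a trimmed placeholder for `HuggingFaceDatasetSpec`, the `onechat/` prefix check stands in for `isOneChatDataset` (whose implementation is not part of this diff), and the OneChat factory is injected instead of fetching anything from HuggingFace.

```typescript
// Simplified, synchronous stand-in for getDatasetSpecs; all names are hypothetical.
interface SpecSketch {
  name: string;
  source: 'predefined' | 'onechat';
}

// Placeholder for PREDEFINED_HUGGING_FACE_DATASETS with two names from the docs.
const PREDEFINED_SKETCH: SpecSketch[] = [
  { name: 'beir-msmarco', source: 'predefined' },
  { name: 'huffpost', source: 'predefined' },
];

function resolveSpecsSketch(
  names: string[],
  createOneChatSpec: (name: string) => SpecSketch
): SpecSketch[] {
  const specs: SpecSketch[] = [];
  for (const name of names) {
    // Assumption: OneChat names are recognized by their "onechat/" prefix.
    if (name.indexOf('onechat/') === 0) {
      specs.push(createOneChatSpec(name));
      continue;
    }
    const staticSpec = PREDEFINED_SKETCH.filter((spec) => spec.name === name)[0];
    if (!staticSpec) {
      // Fail fast, as the real function does, instead of skipping unknown names.
      throw new Error(`Dataset '${name}' not found`);
    }
    specs.push(staticSpec);
  }
  return specs;
}

const fakeOneChatFactory = (name: string): SpecSketch => ({ name, source: 'onechat' });
const resolvedSketch = resolveSpecsSketch(
  ['onechat/knowledge-base/users', 'huffpost'],
  fakeOneChatFactory
);
console.log(resolvedSketch.map((spec) => `${spec.name} (${spec.source})`).join(', '));
```

Two properties carry over from the real function: the result preserves the order given on the command line, and a single unknown name aborts the whole run before any data is loaded.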
