Commit b38808f

docs: DIA-1776: add QnA synthetic data tutorial (#6945)
Co-authored-by: caitlinwheeless <[email protected]>
Co-authored-by: C L W <[email protected]>

1 parent 908f54d

7 files changed: +121 -15 lines

docs/source/guide/prompts_examples.md (+121 -15)
@@ -83,7 +83,7 @@ This example demonstrates how to set up Prompts to predict image captions.
!!! note
    Prompts does not currently support image data uploaded as raw images. Only image references (HTTP URIs to images) or images imported via cloud storage are supported.

-2. Create a [label config](setup) for image captioning, for example:
+2. Create a [label config](setup) for image captioning (or Ask AI to create one for you), for example:

    ```xml
    <View>
@@ -104,23 +104,21 @@ This example demonstrates how to set up Prompts to predict image captions.
!!! note
    Ensure you include `{image}` in your instructions. Click `image` above the instruction field to insert it.

-    ![Screenshot pointing to how to insert image into your instructions](/images/prompts/example_insert_image.png)
-
!!! info Tip
    You can also automatically generate the instructions using the [**Enhance Prompt** action](prompts_draft#Enhance-prompt). Before you can use this action, you must at least add the variable name `{image}` and then click **Save**.

    ![Screenshot pointing to Enhance Prompt action](/images/prompts/example_enhance_prompt.png)

-5. Run the prompt. View predictions to accept or correct.
+5. Run the prompt! View predictions to accept or correct.

You can [read more about evaluation metrics](prompts_draft#Evaluation-results) and ways to assess your prompt performance.

!!! info Tip
-    Use the drop-down menu above the results field to change the subset of data being used (e.g. only data with Ground Truth annotations, or a small sample of records).
+    You can change the subset of data being used (e.g. only data with Ground Truth annotations, or a small sample of records).

    ![Screenshot pointing to subset dropdown](/images/prompts/example_subset.png)

-6. Accept the [predictions as annotations](prompts_predictions#Create-annotations-from-predictions).
+6. Accept the [predictions as annotations](prompts_predictions#Create-annotations-from-predictions)!


### Evaluate LLM outputs for toxicity
@@ -131,7 +129,7 @@ This example demonstrates how to set up Prompts to evaluate if the LLM-generated

For example, you can use the [jigsaw_toxicity](https://huggingface.co/datasets/tasksource/jigsaw_toxicity) dataset. See [the appendix](#Appendix-Generate-dataset) for how you can pre-process and (optionally) downsample this dataset to use with this guide.

-2. Create a [label config](setup) for toxicity detection, for example:
+2. Create a [label config](setup) for toxicity detection (or Ask AI to create one for you), for example:

    ```xml
    <View>
@@ -192,25 +190,23 @@ This example demonstrates how to set up Prompts to evaluate if the LLM-generated
!!! note
    Ensure you include `{comment_text}` in your instructions. Click `comment_text` above the instruction field to insert it.

-    ![Screenshot pointing to how to insert comment text into your instructions](/images/prompts/example_insert_comment_text.png)
-
!!! info Tip
    You can also automatically generate the instructions using the [**Enhance Prompt** action](prompts_draft#Enhance-prompt). Before you can use this action, you must at least add the variable name `{comment_text}` and then click **Save**.

-    ![Screenshot pointing to Enhance Prompt action](/images/prompts/example_enhance_prompt2.png)
+    ![Screenshot pointing to Enhance Prompt action](/images/prompts/example_enhance_prompt.png)

-5. Run the prompt. View predictions to accept or correct.
+5. Run the prompt! View predictions to accept or correct.

You can [read more about evaluation metrics](prompts_draft#Evaluation-results) and ways to assess your prompt performance.

!!! info Tip
-    Use the drop-down menu above the results field to change the subset of data being used (e.g. only data with Ground Truth annotations, or a small sample of records).
+    You can change the subset of data being used (e.g. only data with Ground Truth annotations, or a small sample of records).

-    ![Screenshot pointing to subset dropdown](/images/prompts/example_subset2.png)
+    ![Screenshot pointing to subset dropdown](/images/prompts/example_subset.png)

-6. Accept the [predictions as annotations](prompts_predictions#Create-annotations-from-predictions).
+6. Accept the [predictions as annotations](prompts_predictions#Create-annotations-from-predictions)!

-### Appendix: Preprocess jigsaw toxicity dataset
+#### Appendix: Preprocess jigsaw toxicity dataset

Download the jigsaw_toxicity dataset, then downsample/format using the following script (modify the `INPUT_PATH` and `OUTPUT_PATH` to suit your needs):
@@ -259,3 +255,113 @@ with open(OUTPUT_PATH, "w") as f:
```

If you choose to, you could also easily change how many records to use (or use the entire dataset by removing the sample step).

### Generate Synthetic Q&A Datasets

#### Overview

Synthetic datasets are artificially generated rather than collected from real-world observations. They encode characteristics similar to real data, but let you scale up data diversity or fill volume gaps for general-purpose applications such as model training and evaluation. Synthetic data also works well for enhancing AI systems whose inputs and outputs are open-ended human language, such as chatbot questions and answers, test datasets for evaluation, and rich knowledge datasets for contextual retrieval. LLMs are particularly effective at generating synthetic datasets for these use cases, and let you improve your AI system's performance by creating more diverse data to learn from.

#### Example

Let's expand on the Q&A use case above with an example demonstrating how to use Prompts to generate synthetic user prompts for a chatbot RAG system. Given a dataset of chatbot answers, we'll generate questions that could retrieve each answer.
1. [Create a new Label Studio project](setup_project) by importing chunks of text that would be meaningful answers from a chatbot.

    You can use a preprocessed sample of the [SQuAD](https://huggingface.co/datasets/rajpurkar/squad) dataset as an example. See [the appendix](#Appendix-Preprocess-SQuAD-Q-A-dataset) for how this was generated.
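    The import file is a JSON list of tasks, each exposing the fields that the label config below references. A minimal sketch of its shape (these values are invented; the real records come from SQuAD):

    ```json
    [
      {
        "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
        "answer": "the Amazon basin of South America"
      }
    ]
    ```

    This is exactly the shape produced by the preprocessing script in the appendix at the end of this section.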
2. Create a [label config](setup) for question generation (or Ask AI to create one for you), for example:

    ```xml
    <View>
      <Header value="Context" />
      <Text name="context" value="$context" />
      <Header value="Answer" />
      <Text name="answer" value="$answer" />

      <Header value="Questions" />
      <TextArea name="question1" toName="context"
                placeholder="Enter question 1"
                rows="2"
                maxSubmissions="1" />

      <TextArea name="question2" toName="context"
                placeholder="Enter question 2"
                rows="2"
                maxSubmissions="1" />

      <TextArea name="question3" toName="context"
                placeholder="Enter question 3"
                rows="2"
                maxSubmissions="1" />
    </View>
    ```
3. Navigate to **Prompts** from the sidebar, and [create a prompt](prompts_create) for the project.

    If you have not yet set up the API keys you want to use, do that now: [API keys](prompts_create#Model-provider-API-keys).
4. Add instructions to create 3 questions:

    *Using the "context" below as context, come up with 3 questions ("question1", "question2", and "question3") for which the appropriate answer would be the "answer" below:*

    *Context:*

    *---*

    *{context}*

    *---*

    *Answer:*

    *---*

    *{answer}*

    *---*
    !!! note
        Ensure you include `{answer}` and `{context}` in your instructions. Click `answer`/`context` above the instruction field to insert them.

    !!! info Tip
        You can also automatically generate the instructions using the [**Enhance Prompt** action](prompts_draft#Enhance-prompt). Before you can use this action, you must at least add a variable name (e.g. `{context}` or `{answer}`) and then click **Save**.

        ![Screenshot pointing to Enhance Prompt action](/images/prompts/example_enhance_prompt.png)

5. Run the prompt! View predictions to accept or correct.

    You can [read more about evaluation metrics](prompts_draft#Evaluation-results) and ways to assess your prompt performance.

    !!! info Tip
        You can change the subset of data being used (e.g. only data with Ground Truth annotations, or a small sample of records).

        ![Screenshot pointing to subset dropdown](/images/prompts/example_subset.png)

6. Accept the [predictions as annotations](prompts_predictions#Create-annotations-from-predictions)! Once accepted, you can turn the annotations into question/answer pairs, as shown in the sketch after this list.
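After accepting the predictions, each task carries annotations for `question1` through `question3`. To flatten a JSON export of the project into (question, answer) pairs for your RAG system, you could use something like the following. This is a minimal sketch, assuming Label Studio's default JSON export layout (`annotations` containing `result` items with `value.text`) and a hypothetical `export.json` file name:

```python
import json

# Hypothetical file name: a standard JSON export of the project.
with open("export.json") as f:
    tasks = json.load(f)

qa_pairs = []
for task in tasks:
    answer = task["data"]["answer"]
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            # Each TextArea result (question1..question3) stores its text
            # under value.text as a list of strings.
            if item["from_name"].startswith("question"):
                for question in item["value"].get("text", []):
                    qa_pairs.append({"question": question, "answer": answer})

with open("synthetic_qa_pairs.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)
```

Because each TextArea result stores its text as a list, a single accepted prediction yields up to three pairs per task.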
#### Appendix: Preprocess SQuAD Q&A dataset

This downloads the SQuAD dataset from Hugging Face and formats it for use in Label Studio.
```python
import pandas as pd
import json

# Where to write the Label Studio import file; modify to suit your needs.
OUTPUT_PATH = "qna-sample-ls-format.json"
N_SAMPLES = 100

# Parquet shards of the SQuAD dataset hosted on Hugging Face.
# Reading hf:// paths requires the huggingface_hub package (plus pyarrow for parquet).
splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'validation': 'plain_text/validation-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/rajpurkar/squad/" + splits["train"])

# Downsample to keep the example project small.
sample = df.sample(n=N_SAMPLES)

# SQuAD stores answers as {'text': [...], 'answer_start': [...]}; keep the first text.
sample['answer'] = sample['answers'].map(lambda item: item['text'][0])
label_studio_tasks = [{"context": row.context, "answer": row.answer} for row in sample.itertuples()]
with open(OUTPUT_PATH, "w") as f:
    json.dump(label_studio_tasks, f)
```

If you choose to, you could also easily change how many records to use (or use the entire dataset by removing the sample step).
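You can load the generated file into a project through the **Import** dialog, or script the import; a minimal sketch, assuming the pre-1.0 `label-studio-sdk` client API (the URL, API key, and project ID are placeholders):

```python
from label_studio_sdk import Client

# Placeholders: point these at your own Label Studio instance and project.
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.get_project(id=1)  # the project created in step 1

# Import the tasks produced by the preprocessing script above.
project.import_tasks("qna-sample-ls-format.json")
```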
(The remaining changed files are binary images and are not shown.)
