docs/readme/indexer-skills.md
Lines changed: 49 additions & 1 deletion
@@ -27,6 +27,19 @@ This document describes all available skills that can be used in the indexer pip
    3. An `embedding` to generate embeddings from the chunks.
    4. A `vector-store` to store the embeddings.

+4. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:
+
+   1. An `exporter` (Scroll Word) or `file-scanner` to get the `.docx` files.
+   2. A `confluence-faq-splitter` to extract Q&A pairs directly from the `.docx` headings.
+   3. An `embedding` to generate embeddings from the Q&A chunks.
+   4. A `vector-store` to store the embeddings.
+
+5. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:
+
+   1. A `teams-qna-loader` to load the enriched Q&A pairs from the JSON file.
+   2. An `embedding` to generate embeddings from the Q&A content.
+   3. A `vector-store` to store the embeddings.
+

 # Available Skills

@@ -103,7 +116,7 @@ Supported file extensions:
 </details>

 <details><summary>Web loaders</summary>
-Load data from web.
+Load data from web or structured files.

 ### Jira Loader
 Loads data from Jira issues
@@ -119,6 +132,18 @@ Loads data from Jira issues
 - JSTAD-XYZ
 - JIRA-1234
 ```
+
+### Teams Q&A Loader
+Loads enriched Q&A pairs from a JSON file produced by the FAQ enrichment pipeline. Each Q&A pair becomes a single document with one chunk. The skill prefers rephrased questions/answers when available, falling back to originals.
+
+```yaml
+- skill: &TeamsQnALoader
+    type: loader
+    name: teams-qna-loader
+    params:
+      file_path: data/processed_output/enriched_qna.json  # Required: path to enriched Q&A JSON file
+      tag: teams-faq                                       # Optional: tag for chunks (default: "enriched-qna")
+```
 </details>


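The "prefer rephrased, fall back to originals" rule described for the Teams Q&A Loader can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: the JSON layout (a top-level array of objects) and the field names `question`, `answer`, `rephrased_question` and `rephrased_answer` are hypothetical, as are the `QnADocument` and `load_enriched_qna` names. Only the default tag `enriched-qna` is taken from the documentation.

```python
import json
from dataclasses import dataclass


@dataclass
class QnADocument:
    """One Q&A pair kept as a single chunk (hypothetical shape)."""
    content: str
    tag: str


def pick(pair: dict, rephrased_key: str, original_key: str) -> str:
    """Return the rephrased value when present and non-empty, else the original."""
    value = pair.get(rephrased_key)
    if isinstance(value, str) and value.strip():
        return value.strip()
    return str(pair.get(original_key, "")).strip()


def load_enriched_qna(raw_json: str, tag: str = "enriched-qna") -> list[QnADocument]:
    """Turn a JSON array of Q&A pairs into one single-chunk document per pair."""
    documents = []
    for pair in json.loads(raw_json):
        # Field names below are assumptions about the enriched JSON schema.
        question = pick(pair, "rephrased_question", "question")
        answer = pick(pair, "rephrased_answer", "answer")
        if not question or not answer:
            continue  # assumption: incomplete pairs are skipped
        documents.append(QnADocument(content=f"Q: {question}\nA: {answer}", tag=tag))
    return documents
```

Calling `load_enriched_qna(text, tag="teams-faq")` on the file contents would mirror the `tag` parameter in the configuration. Whether the real loader skips incomplete pairs or formats the chunk as `Q:`/`A:` lines is not stated in the documentation.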
@@ -151,6 +176,29 @@ Splits text by grouping semantically equivalent chunks together. A bit more adva
 api_version: your-api-version
 deployment_name: your-deployment-name
 ```
+
+### Confluence FAQ Splitter
+Extracts Q&A pairs directly from FAQ `.docx` files exported from Confluence. Each heading that contains a `?` or starts with a problem/question pattern (e.g. "How do I", "I cannot") is treated as a question, and the body content below it becomes the answer. Each Q&A pair is produced as a single atomic chunk. No `file-reader` is needed; this skill reads `.docx` files directly via `python-docx`.
+
+All parameters are optional with sensible defaults.
+
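The heading heuristic described for the Confluence FAQ Splitter can be sketched without `python-docx` by working on an already flattened document. This is a rough illustration under assumptions, not the skill's code: `blocks` is a hypothetical list of `(level, text)` tuples (level 0 for body text, 1 to 6 for headings), and the function names are invented for the example. The question prefixes are only the four the documentation lists, since the full default list is elided. Treating every heading as the end of the current answer, and dropping whole paragraphs that contain a skip pattern, are guesses about behavior the documentation does not spell out.

```python
import re

# Only the prefixes shown in the documentation; the real default list is longer.
DEFAULT_QUESTION_PATTERNS = ("i am ", "i cannot ", "how do i ", "what is ")
DEFAULT_STOP_SECTIONS = ("related articles", "see also")
DEFAULT_SKIP_PATTERNS = ("CONFIDENTIAL", "Search the FAQ", "Search Artifactory FAQ")


def is_question(heading, question_patterns=DEFAULT_QUESTION_PATTERNS):
    """A heading is a question if it contains '?' or starts with a known prefix."""
    text = heading.strip().lower()
    return "?" in text or text.startswith(tuple(question_patterns))


def extract_qna(blocks, min_heading_level=2, max_heading_level=6,
                skip_headings=("summary",), skip_patterns=DEFAULT_SKIP_PATTERNS,
                stop_sections=DEFAULT_STOP_SECTIONS):
    """Group (level, text) blocks into (question, answer) pairs."""
    pairs, question, answer = [], None, []

    def flush():
        # Emit the pair collected so far, if it has both a question and an answer.
        if question is not None and answer:
            pairs.append((question, "\n".join(answer)))

    for level, text in blocks:
        text = text.strip()
        if level == 0:
            # Body text belongs to the open question unless it matches a skip pattern.
            if question is not None and text and not any(p in text for p in skip_patterns):
                answer.append(text)
            continue
        flush()  # assumption: any heading closes the current pair
        question, answer = None, []
        if any(re.search(p, text, re.IGNORECASE) for p in stop_sections):
            break  # stop sections end extraction for the rest of the document
        in_range = min_heading_level <= level <= max_heading_level
        if in_range and text.lower() not in skip_headings and is_question(text):
            question = text
    flush()
    return pairs
```

With `python-docx`, the `(level, text)` tuples could plausibly be derived from each paragraph's style name (for example `Heading 2`), but that mapping is left out here because it depends on how the Confluence export names its styles.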
+```yaml
+- skill: &ConfluenceFAQSplitter
+    type: splitter
+    name: confluence-faq-splitter
+    params:
+      min_heading_level: 2   # Minimum heading level for questions (default: 2)
+      max_heading_level: 6   # Maximum heading level for questions (default: 6)
+      skip_headings:         # Heading titles to skip (default: ['summary'])
+        - summary
+      skip_patterns:         # Text patterns to skip in answer content (default: ['CONFIDENTIAL', 'Search the FAQ', 'Search Artifactory FAQ'])
+        - CONFIDENTIAL
+      question_patterns:     # Prefixes that indicate a question (default: ['i am ', 'i cannot ', 'how do i ', 'what is ', ...])
+        - "how do i "
+        - "i cannot "
+      stop_sections:         # Regex patterns for sections that end Q&A extraction (default: ['related articles', 'see also'])