Commit 54dced7

Merge pull request #49 from martipath/new_skills
Add ConfluenceFAQSplitter and TeamsQnALoader indexer skills
2 parents 28ba943 + 405da53 commit 54dced7

6 files changed

Lines changed: 658 additions & 2 deletions

docs/readme/indexer-skills.md

Lines changed: 49 additions & 1 deletion
@@ -27,6 +27,19 @@ This document describes all available skills that can be used in the indexer pip
     3. An `embedding` to generate embeddings from the chunks.
     4. A `vector-store` to store the embeddings.

+4. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:
+
+    1. An `exporter` (Scroll Word) or `file-scanner` to get the `.docx` files.
+    2. A `confluence-faq-splitter` to extract Q&A pairs directly from the `.docx` headings.
+    3. An `embedding` to generate embeddings from the Q&A chunks.
+    4. A `vector-store` to store the embeddings.
+
+5. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:
+
+    1. A `teams-qna-loader` to load the enriched Q&A pairs from the JSON file.
+    2. An `embedding` to generate embeddings from the Q&A content.
+    3. A `vector-store` to store the embeddings.
+

 # Available Skills

@@ -103,7 +116,7 @@ Supported file extensions:
 </details>

 <details><summary>Web loaders</summary>
-Load data from web.
+Load data from web or structured files.

 ### Jira Loader
 Loads data from Jira issues
@@ -119,6 +132,18 @@ Loads data from Jira issues
 - JSTAD-XYZ
 - JIRA-1234
 ```
+
+### Teams Q&A Loader
+Loads enriched Q&A pairs from a JSON file produced by the FAQ enrichment pipeline. Each Q&A pair becomes a single document with one chunk. The skill prefers rephrased questions/answers when available, falling back to originals.
+
+```yaml
+- skill: &TeamsQnALoader
+    type: loader
+    name: teams-qna-loader
+    params:
+      file_path: data/processed_output/enriched_qna.json  # Required: path to enriched Q&A JSON file
+      tag: teams-faq  # Optional: tag for chunks (default: "enriched-qna")
+```
 </details>

@@ -151,6 +176,29 @@ Splits text by grouping semantically equivalent chunks together. A bit more adva
 api_version: your-api-version
 deployment_name: your-deployment-name
 ```
+
+### Confluence FAQ Splitter
+Extracts Q&A pairs directly from FAQ `.docx` files exported from Confluence. Each heading that contains a `?` or starts with a problem/question pattern (e.g. "How do I", "I cannot") is treated as a question, and the body content below it becomes the answer. Each Q&A pair is produced as a single atomic chunk. No `file-reader` is needed — this skill reads `.docx` files directly via `python-docx`.
+
+All parameters are optional with sensible defaults.
+
+```yaml
+- skill: &ConfluenceFAQSplitter
+    type: splitter
+    name: confluence-faq-splitter
+    params:
+      min_heading_level: 2  # Minimum heading level for questions (default: 2)
+      max_heading_level: 6  # Maximum heading level for questions (default: 6)
+      skip_headings:  # Heading titles to skip (default: ['summary'])
+        - summary
+      skip_patterns:  # Text patterns to skip in answer content (default: ['CONFIDENTIAL', 'Search the FAQ', 'Search Artifactory FAQ'])
+        - CONFIDENTIAL
+      question_patterns:  # Prefixes that indicate a question (default: ['i am ', 'i cannot ', 'how do i ', 'what is ', ...])
+        - "how do i "
+        - "i cannot "
+      stop_sections:  # Regex patterns for sections that end Q&A extraction (default: ['related articles', 'see also'])
+        - "^\\s*related\\s*articles?\\s*$"
+```
 </details>

 <details><summary>Embedding</summary>
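
The question-detection rule documented for `confluence-faq-splitter` in the hunk above can be illustrated with a small, dependency-free sketch. It is not the skill's implementation. The `(level, text)` block list stands in for whatever `python-docx` parsing the real skill performs, `is_question` and `extract_pairs` are hypothetical names, and the `see also` regex is a guess modeled on the one documented pattern. The prefix list holds only the four prefixes the README shows, because the documented default list is elided. The sketch also assumes that any heading ends the current answer, and it leaves out `skip_patterns` filtering.

```python
import re

# Only the prefixes shown in the README; the full default list is elided there.
QUESTION_PATTERNS = ("i am ", "i cannot ", "how do i ", "what is ")
# First regex is taken from the README example; the second is an assumed analogue.
STOP_SECTIONS = (r"^\s*related\s*articles?\s*$", r"^\s*see\s*also\s*$")
SKIP_HEADINGS = ("summary",)


def is_question(heading, patterns=QUESTION_PATTERNS):
    """A heading counts as a question if it contains '?' or starts with a known prefix."""
    text = heading.strip().lower()
    return "?" in text or text.startswith(tuple(patterns))


def extract_pairs(blocks, min_level=2, max_level=6):
    """Group (level, text) blocks into (question, answer) pairs.

    `blocks` is a hypothetical pre-parsed document: level 0 is body text,
    levels 1-9 are headings of that level.
    """
    pairs, question, answer = [], None, []

    def flush():
        # Emit the pending pair only if it has both a question and some answer text.
        if question is not None and answer:
            pairs.append((question, "\n".join(answer)))

    for level, text in blocks:
        if level == 0:
            if question is not None:
                answer.append(text)
            continue
        # Assumption: every heading closes the answer collected so far.
        flush()
        question, answer = None, []
        if any(re.match(p, text, re.IGNORECASE) for p in STOP_SECTIONS):
            break  # a stop section ends extraction for the whole document
        if text.strip().lower() in SKIP_HEADINGS:
            continue
        if min_level <= level <= max_level and is_question(text):
            question = text.strip()
    flush()
    return pairs
```

Each returned tuple corresponds to one atomic chunk in the README's terms. Headings that fall outside the level range or fail the question test simply discard the body text beneath them.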

src/docs2vecs/subcommands/indexer/config/config_schema.yaml

Lines changed: 34 additions & 0 deletions
@@ -105,6 +105,37 @@ definitions:
   type: integer
   required: False
   min: 0
+# ConfluenceFAQSplitter params
+min_heading_level:
+  type: integer
+  required: False
+  min: 1
+  max: 9
+max_heading_level:
+  type: integer
+  required: False
+  min: 1
+  max: 9
+skip_patterns:
+  type: list
+  required: False
+  schema:
+    type: string
+skip_headings:
+  type: list
+  required: False
+  schema:
+    type: string
+question_patterns:
+  type: list
+  required: False
+  schema:
+    type: string
+stop_sections:
+  type: list
+  required: False
+  schema:
+    type: string
 mode:
   type: string
   required: False
@@ -162,6 +193,9 @@ definitions:
 path:
   type: string
   required: False
+file_path:
+  type: string
+  required: False
 embedding_model:
   type: dict
   schema:
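
To make the effect of the new `ConfluenceFAQSplitter` schema rules concrete, here is a hand-rolled sketch of equivalent checks in plain Python. It does not use the project's actual validation machinery, which this diff does not show. `check_splitter_params` is a hypothetical helper, and the final cross-field check (minimum not above maximum) is an extra assumption that the schema rules above do not express.

```python
def check_splitter_params(params):
    """Return a list of error strings; an empty list means the params pass."""
    errors = []

    # Integer params: optional, but bounded to 1..9 when present.
    for key in ("min_heading_level", "max_heading_level"):
        value = params.get(key)
        if value is None:
            continue  # mirrors `required: False`
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"{key}: must be an integer")
        elif not 1 <= value <= 9:
            errors.append(f"{key}: must be between 1 and 9")

    # List params: optional, every element must be a string.
    for key in ("skip_patterns", "skip_headings", "question_patterns", "stop_sections"):
        value = params.get(key)
        if value is None:
            continue
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            errors.append(f"{key}: must be a list of strings")

    # Extra check, NOT part of the schema shown in the diff.
    lo, hi = params.get("min_heading_level"), params.get("max_heading_level")
    if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
        errors.append("min_heading_level must not exceed max_heading_level")
    return errors
```

With the documented defaults (`min_heading_level: 2`, `max_heading_level: 6`) the function returns an empty list. A scalar passed where a list is expected, such as `stop_sections: "see also"`, is reported as an error.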

src/docs2vecs/subcommands/indexer/skills/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -13,6 +13,8 @@
 from .llama_fastembed_embedding_skill import LlamaFastembedEmbeddingSkill
 from .local_document_parser import LocalDocumentParser
 from .faiss_vector_store_skill import FaissVectorStoreSkill
+from .teams_qna_loader_skill import TeamsQnALoaderSkill
+from .confluence_faq_splitter_skill import ConfluenceFAQSplitter


 __all__ = [
@@ -31,4 +33,6 @@
     "LlamaFastembedEmbeddingSkill",
     "LocalDocumentParser",
     "FaissVectorStoreSkill",
+    "TeamsQnALoaderSkill",
+    "ConfluenceFAQSplitter",
 ]
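
The `TeamsQnALoaderSkill` exported here is described in `indexer-skills.md` as producing one single-chunk document per Q&A pair, preferring rephrased text and falling back to the originals. The skill's source is not part of the files shown, so the following is only a minimal sketch of that behavior. The JSON layout (a top-level list of objects), the field names `question`, `answer`, `rephrased_question` and `rephrased_answer`, the `Q:`/`A:` chunk format, and the function name `load_qna_pairs` are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_qna_pairs(file_path, tag="enriched-qna"):
    """Turn each Q&A record into one single-chunk document (a plain dict here)."""
    # Assumption: the file holds a top-level JSON list of Q&A objects.
    records = json.loads(Path(file_path).read_text(encoding="utf-8"))
    documents = []
    for record in records:
        # Hypothetical field names: prefer the rephrased text, fall back to the original.
        question = record.get("rephrased_question") or record.get("question", "")
        answer = record.get("rephrased_answer") or record.get("answer", "")
        if not question or not answer:
            continue  # assumption: incomplete pairs are skipped
        documents.append({"tag": tag, "chunks": [f"Q: {question}\nA: {answer}"]})
    return documents
```

Using `or` for the fallback means an empty rephrased string is treated the same as a missing one. The default `tag` value matches the default documented for the skill's `tag` parameter.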
