This document describes all skills available in the indexer pipeline. Each skill serves a specific purpose, from data collection to vectorization and storage.
You have a bunch of files locally and you would like to vectorize them? You'll typically need the following types of skills in your config file:

- A `file-scanner` to scan your local folder for documents to be indexed.
- A `file-reader` to read the content of the files.
- A `splitter` to split the documents into chunks.
- An `embedding` to generate embeddings from the chunks.
- A `vector-store` to store the embeddings.
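Tied together, such a config might look like the sketch below. The top-level `pipeline` key and the use of YAML aliases to reference skill definitions are assumptions for illustration, not confirmed by this document; the aliases refer to the anchors (`&FileScanner`, `&FileReader`, etc.) defined in the skill sections further down.

```yaml
# Hypothetical wiring - check your pipeline's actual schema.
pipeline:
  - *FileScanner        # find the local files
  - *FileReader         # read their content
  - *TextSplitter       # split into chunks
  - *Ada002Embedding    # embed the chunks
  - *ChromaDbVectorStore # store the embeddings
```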
You have a list of Confluence pages that you'd like to vectorize? You'll typically need the following types of skills in your config file:

- An `exporter` to export the Confluence pages to Word documents.
- A `file-reader` to read the content of the Word documents.
- A `splitter` to split the Word documents into chunks.
- An `embedding` to generate embeddings from the chunks.
- A `vector-store` to store the embeddings.
You have a list of Jira tickets that you'd like to vectorize? You'll typically need the following types of skills in your config file:

- A `jira-loader` to extract the data from the Jira tickets.
- A `splitter` to split the data into chunks.
- An `embedding` to generate embeddings from the chunks.
- A `vector-store` to store the embeddings.
You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:

- An `exporter` (Scroll Word) or `file-scanner` to get the `.docx` files.
- A `confluence-faq-splitter` to extract Q&A pairs directly from the `.docx` headings.
- An `embedding` to generate embeddings from the Q&A chunks.
- A `vector-store` to store the embeddings.
You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:

- A `teams-qna-loader` to load the enriched Q&A pairs from the JSON file.
- An `embedding` to generate embeddings from the Q&A content.
- A `vector-store` to store the embeddings.
You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:

- A `file-scanner` (or `exporter`) to locate/export your source documents.
- A `file-reader` to read their content.
- A `splitter` to split the documents into chunks.
- A `writer` (`json-writer`) with `checksum_path` set: it computes a SHA-256 checksum of each chunk's content individually (keyed by `document_id`); only chunks whose content has changed (or are new) pass downstream, so unchanged chunks are stripped and their embedding and indexing are skipped automatically.
- An `embedding` to generate embeddings (skipped when content is unchanged).
- A `vector-store` to store the embeddings (skipped when content is unchanged).
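In config terms, the change gate simply sits between the splitter and the embedding. A hypothetical wiring (the top-level `pipeline` key and alias-based skill references are illustrative assumptions; the anchors are defined in the skill sections of this document):

```yaml
pipeline:
  - *FileScanner
  - *FileReader
  - *TextSplitter
  - *JSONWriter        # change gate: strips unchanged chunks here
  - *Ada002Embedding   # runs only for new/changed chunks
  - *AzureAISearch     # upserts only new/changed chunks
```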
## Exporter Skills
Export data from one source to another, for example exporting a Confluence page to a Markdown file.

Exports Confluence pages to Microsoft Word documents. Each entry in `page_urls` and `page_ids` supports an optional inline `tag`. Entries without a tag fall back to the top-level `tag` param.
```yaml
- skill: &Exporter
    type: exporter
    name: scrollword-exporter
    params:
      api_url: https://scroll-word.us.exporter.k15t.app/api/public/1/exports
      auth_token: env.SWE_AUTH_TOKEN # Scroll Word API token - can be obtained in Confluence
      poll_interval: 20 # Interval in seconds to check the status of the export
      export_folder: ~/Downloads/sw_export_temp # Path where the exported file(s) should be saved
      scope: current # Possible values: [current | descendants]
      confluence_prefix: https://your/corporate/confluence/prefix
      tag: generic # Optional: default tag for all pages (fallback)
      page_urls:
        - url: https://your/confluence/spaces/SPACE/pages/123/Page+Title
          tag: my-tag # Optional: overrides top-level tag for this page
        - url: https://your/confluence/spaces/SPACE/pages/456/Another+Page
          # no tag - falls back to top-level tag
      page_ids:
        - id: 1774209540
          tag: my-tag # Optional
        - id: 1234567890
          # no tag - falls back to top-level tag
```

## File Scanner Skills
Scans a local folder on disk for documents to be indexed.
```yaml
- skill: &FileScanner
    type: file-scanner
    name: multi-file-scanner
    params:
      path: /path/to/your/documents
      filter: ["*.md"] # Optional. If missing or empty, all files are considered. Use filter to narrow the scope, e.g. ["*.md", "*.txt"] or ["globaldns*.md"]
      recursive: false # false - scans only the folder indicated by `path`; true - also scans all its subfolders
```

## File Reader Skills
Read the content of files.

Uses Azure Document Intelligence to extract the textual content from the input file.
```yaml
- skill: &DocumentIntelligence
    type: file-reader
    name: azure-document-intelligence
    params:
      endpoint: "https://your-form-recognizer-endpoint"
      api_key: env.AZURE_FORM_RECOGNIZER_KEY
```

Supported file extensions:
- `.md`: Markdown files using `UnstructuredMarkdownLoader`
- `.txt`: Text files using `TextLoader`
- `.pdf`: PDF files using `PyPDFLoader`
- `.doc`, `.docx`: Word documents using `UnstructuredWordDocumentLoader`
- `.ppt`, `.pptx`: PowerPoint files using `UnstructuredPowerPointLoader`
- `.xls`, `.xlsx`: Excel files using `UnstructuredExcelLoader`
```yaml
- skill: &FileReader
    type: file-reader # fixed parameter, do not change
    name: multi-file-reader # fixed parameter, do not change
```

## Web Loaders
Load data from the web or from structured files.

Loads data from Jira issues.
```yaml
- skill: &JiraLoader
    type: loader
    name: jira-loader
    params:
      server_url: https://your/corporate/jira/url
      api_token: env.JIRA_PAT # Jira Personal Access Token. Can be obtained from Jira.
      issues: # List Jira issues one by one. This is intentional and gives you exact control over what data goes in.
        - JSTAD-XYZ
        - JIRA-1234
```

Loads enriched Q&A pairs from a JSON file produced by the FAQ enrichment pipeline. Each Q&A pair becomes a single document with one chunk. The skill prefers rephrased questions/answers when available, falling back to the originals.
Each Q&A object in the JSON can optionally include a `tag` field that overrides the skill-level `tag` for that specific chunk, allowing fine-grained tagging within a single file.
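For orientation, an enriched Q&A file might look roughly like the sketch below. The field names (`question`, `answer`, `rephrased_question`) are illustrative assumptions based on the behavior described above, not a confirmed schema; only the optional per-object `tag` field is documented here.

```json
[
  {
    "question": "How do I request access?",
    "answer": "Open a ticket with the platform team.",
    "rephrased_question": "What is the process for requesting access?",
    "tag": "teams-faq"
  }
]
```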
```yaml
- skill: &TeamsQnALoader
    type: loader
    name: teams-qna-loader
    params:
      file_path: data/processed_output/enriched_qna.json # Required: path to enriched Q&A JSON file
      tag: teams-faq # Optional: default tag for chunks (default: "enriched-qna"); can be overridden per Q&A object via a "tag" field in the JSON
```

## Text Splitters
Split large text data into smaller chunks.

Splits text into chunks of a certain size, with overlap. Ideal to get you started.
```yaml
- skill: &TextSplitter
    type: splitter # fixed parameter, do not change
    name: recursive-character-splitter # fixed parameter, do not change
    params:
      chunk_size: 1200 # you can experiment with this value; don't go too big or too small
      overlap: 180 # you can experiment with this value; don't go too big or too small
```

Splits text by grouping semantically similar chunks together. A bit more advanced than the recursive character splitter.
```yaml
- skill: &SemanticSplitter
    type: splitter # fixed parameter, do not change
    name: semantic-splitter # fixed parameter, do not change
    params:
      embedding_model: # Currently only Azure embedding models are supported
        endpoint: https://your-embedding-endpoint
        api_key: env.AZURE_EMBEDDING_KEY
        api_version: your-api-version
        deployment_name: your-deployment-name
```

Extracts Q&A pairs directly from FAQ `.docx` files exported from Confluence. Each heading that contains a `?` or starts with a problem/question pattern (e.g. "How do I", "I cannot") is treated as a question, and the body content below it becomes the answer. Each Q&A pair is produced as a single atomic chunk. No file-reader is needed; this skill reads `.docx` files directly via python-docx.
Each chunk's `document_id` is a SHA-256 hash of the question text only, so the ID stays stable even when the answer is updated. This makes it a reliable unique key for Azure AI Search upserts: changed Q&A pairs are re-indexed in place without creating duplicates, and pairs whose answers haven't changed are skipped by the `json-writer` change gate.
All parameters are optional with sensible defaults.
```yaml
- skill: &ConfluenceFAQSplitter
    type: splitter
    name: confluence-faq-splitter
    params:
      min_heading_level: 2 # Minimum heading level for questions (default: 2)
      max_heading_level: 6 # Maximum heading level for questions (default: 6)
      skip_headings: # Heading titles to skip (default: ['summary'])
        - summary
      skip_patterns: # Text patterns to skip in answer content (default: ['CONFIDENTIAL', 'Search the FAQ', 'Search Artifactory FAQ'])
        - CONFIDENTIAL
      question_patterns: # Prefixes that indicate a question (default: ['i am ', 'i cannot ', 'how do i ', 'what is ', ...])
        - "how do i "
        - "i cannot "
      stop_sections: # Regex patterns for sections that end Q&A extraction (default: ['related articles', 'see also'])
        - "^\\s*related\\s*articles?\\s*$"
```

## Writer Skills
Capture and optionally gate intermediate pipeline state to a file.

Extracts the text content from all chunks and writes it as a sorted JSON array to a file. Useful for inspecting intermediate pipeline state (e.g. after splitting) and as a per-chunk change-detection gate: when `checksum_path` is configured, the skill computes a SHA-256 checksum of each chunk's content individually and stores the results in a JSON map keyed by `document_id`. On subsequent runs, only chunks whose content has changed (or are new) are passed downstream; unchanged chunks are stripped from their documents, so embedding and indexing are skipped for those chunks only.
This works well with Azure AI Search's key-based upsert — changed documents are re-indexed in place without creating duplicates.
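As a rough illustration, the checksum file is a flat JSON map from `document_id` to the SHA-256 hex digest of that chunk's content. The keys and values below are placeholders, not output from a real run:

```json
{
  "<document_id-of-chunk-1>": "<sha256-hex-of-chunk-1-content>",
  "<document_id-of-chunk-2>": "<sha256-hex-of-chunk-2-content>"
}
```

On the next run, a chunk whose recomputed digest matches the stored value is treated as unchanged and stripped before embedding.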
```yaml
- skill: &JSONWriter
    type: writer
    name: json-writer
    params:
      output_path: data/pipeline_output.json # Path to the combined output JSON file (default: "data/pipeline_output.json")
      checksum_path: data/checksums.json # Optional: path to a JSON file storing per-chunk SHA-256 checksums keyed by document_id. Enables per-chunk change detection.
      skip_downstream_if_unchanged: true # Optional: if true (default) and checksum_path is set, strips unchanged chunks from their documents, skipping their embedding/indexing
```

## Embedding
Generate embeddings from text. An embedding is a vector representation of your text data.

Generates embeddings from text using embedding models deployed in the Azure portal.
```yaml
- skill: &Ada002Embedding
    type: embedding
    name: azure-ada002-embedding
    params: # Configuration can be retrieved from your Azure Portal
      endpoint: https://your-embedding-endpoint
      api_key: env.AZURE_EMBEDDING_API_KEY
      api_version: your-api-version
      deployment_name: your-deployment-name
```

Generates embeddings from text using the llama_index library.
```yaml
- skill: &FastEmbed
    type: embedding
    name: llama-fastembed
```

## Vector Store
Store embeddings in a vector store.

Stores embeddings in an Azure AI Search index.
```yaml
- skill: &AzureAISearch
    type: vector-store
    name: azure-ai-search
    params:
      endpoint: http://your-endpoint
      index_name: your-index-name
      api_key: env.AZURE_AI_SEARCH_API_KEY # Optional. If missing, RBAC-based authentication is attempted; you might need to log in to Azure beforehand in your terminal using the `azd` tool.
      field_mapping: # Left-hand side: internal representation. Right-hand side: your index schema. Provide a mapping between the two; if there's a field you don't need, just remove the line.
        document_id: document_id
        content: content
        source_link: source_link
        document_name: document_name
        embedding: embedding
      overwrite_index: true # true - removes all documents from your index before storing; false - appends documents to your index
      batch_size: 50 # Optional: number of documents uploaded per API call (default: 50, max: 50)
```

Stores embeddings in a Chroma vector store. Ideal for prototyping.
```yaml
- skill: &ChromaDbVectorStore
    type: vector-store
    name: chromadb
    params:
      db_path: path/to/where/your/chroma/db/is # if you don't have one yet, a new one will be created at the specified path
      collection_name: replace-this-with-your-collection-name # if the collection doesn't exist yet, it will be created when documents are inserted
```

Stores embeddings in a FAISS vector store.
```yaml
- skill: &FaissDbVectorStore
    type: vector-store
    name: faissdb
    params:
      db_path: path/to/where/your/faiss/db/is # if you don't have one yet, a new one will be created at the specified path
      dimension: replace-with-your-embeddings-dimension # Must match the output dimension of the embedding model you selected
      overwrite_index: true # true - removes all documents from your index before storing; false - appends documents to your index
```
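For example, if you embed with the `azure-ada002-embedding` skill above, the underlying text-embedding-ada-002 model produces 1536-dimensional vectors, so the FAISS store would be configured with:

```yaml
dimension: 1536 # text-embedding-ada-002 emits 1536-dimensional vectors
```

A mismatched dimension will cause inserts to fail, so always check your embedding model's documented output size.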
All contributions are welcome!
The above list of skills and functionality covers the typical use cases identified so far.
If you have any cool ideas to extend the existing skills, or to create new ones, please contribute! Your contributions and feedback are the key to making this project a success!