Indexer Skills Documentation

This document describes all available skills that can be used in the indexer pipeline. Each skill serves a specific purpose in the data processing pipeline, from data collection to vectorization and storage.

Typical use cases

  1. You have a bunch of files locally that you'd like to vectorize? You'll typically need the following types of skills in your config file (see the example config sketch after this list):

    1. A file-scanner to scan your local folder for documents to be indexed.
    2. A file-reader to read the content of the files.
    3. A splitter to split the documents into chunks.
    4. An embedding to generate embeddings from the chunks.
    5. A vector-store to store the embeddings.
  2. You have a list of Confluence pages that you'd like to vectorize? You'll typically need the following types of skills in your config file:

    1. An exporter to export the Confluence pages to Word documents.
    2. A file-reader to read the content of the Word documents.
    3. A splitter to split the Word documents into chunks.
    4. An embedding to generate embeddings from the chunks.
    5. A vector-store to store the embeddings.
  3. You have Confluence pages and want to export them as HTML, convert to Markdown, then vectorize? You'll typically need:

    1. A scrollhtml-exporter to export the Confluence pages as HTML (ZIP).
    2. A confluence-html-to-markdown transformer to convert the HTML export into self-contained Markdown (with images).
    3. A file-scanner to pick up the resulting .md files.
    4. A file-reader to read the Markdown content.
    5. A splitter to split the documents into chunks.
    6. An embedding to generate embeddings from the chunks.
    7. A vector-store to store the embeddings.
  4. You have a list of Jira tickets that you'd like to vectorize? You'll typically need the following types of skills in your config file:

    1. A jira-loader to extract the data from the Jira tickets.
    2. A splitter to split the data into chunks.
    3. An embedding to generate embeddings from the chunks.
    4. A vector-store to store the embeddings.
  5. You have FAQ documents exported from Confluence (.docx files) and want to extract Q&A pairs for vectorization? You'll typically need:

    1. An exporter (Scroll Word) or file-scanner to get the .docx files.
    2. A confluence-faq-splitter to extract Q&A pairs directly from the .docx headings.
    3. An embedding to generate embeddings from the Q&A chunks.
    4. A vector-store to store the embeddings.
  6. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:

    1. A teams-qna-loader to load the enriched Q&A pairs from the JSON file.
    2. An embedding to generate embeddings from the Q&A content.
    3. A vector-store to store the embeddings.
  7. You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a writer (json-writer) skill as a change gate:

    1. A file-scanner (or exporter) to locate/export your source documents.
    2. A file-reader to read their content.
    3. A splitter to split the documents into chunks.
    4. A writer (json-writer) with checksum_path set — it computes a SHA-256 checksum of each chunk's content individually (keyed by document_id); only chunks whose content has changed (or are new) pass downstream, so unchanged chunks are stripped and their embedding and indexing are skipped automatically.
    5. An embedding to generate embeddings (skipped when content is unchanged).
    6. A vector-store to store the embeddings (skipped when content is unchanged).
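
For illustration, here is a minimal sketch of a complete config for use case 1 (local files), assembled from the skill snippets documented below. It assumes the config file is simply the ordered list of skill entries and uses the local llama-fastembed and chromadb skills so it can run without cloud credentials; the paths and collection name are placeholders, and the exact top-level layout of your config may differ.

# Minimal local-files pipeline sketch (use case 1).
# Assumption: skills run in the order they are listed; adjust to your real config layout.
- skill: &FileScanner
    type: file-scanner
    name: multi-file-scanner
    params:
        path: /path/to/your/documents   # placeholder
        filter: ["*.md"]
        recursive: false
- skill: &FileReader
    type: file-reader
    name: multi-file-reader
- skill: &TextSplitter
    type: splitter
    name: recursive-character-splitter
    params:
      chunk_size: 1200
      overlap: 180
- skill: &FastEmbed
    type: embedding
    name: llama-fastembed
- skill: &ChromaDbVectorStore
    type: vector-store
    name: chromadb
    params:
        db_path: path/to/your/chroma/db    # placeholder; created if it doesn't exist yet
        collection_name: local-docs        # placeholder collection name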

Available Skills

Exporter Skills

Export data from one source to another, for example exporting a Confluence page to a Markdown file.

Scroll Word Exporter

Exports Confluence pages to Microsoft Word documents. Each entry in page_urls and page_ids supports an optional inline tag. Entries without a tag fall back to the top-level tag param.

- skill: &Exporter
    type: exporter
    name: scrollword-exporter
    params:
        api_url: https://scroll-word.us.exporter.k15t.app/api/public/1/exports
        auth_token: env.SWE_AUTH_TOKEN  # Scroll Word API token - can be obtained in Confluence
        poll_interval: 20   # Interval in seconds to check the status of the export
        export_folder: ~/Downloads/sw_export_temp   # Path where the exported file(s) should be saved
        scope: current  # Possible values: [current | descendants]
        confluence_prefix: https://your/corporate/confluence/prefix
        tag: generic  # Optional: default tag for all pages (fallback)
        page_urls:
          - url: https://your/confluence/spaces/SPACE/pages/123/Page+Title
            tag: my-tag   # Optional: overrides top-level tag for this page
          - url: https://your/confluence/spaces/SPACE/pages/456/Another+Page
            # no tag — falls back to top-level tag
        page_ids:
          - id: 1774209540
            tag: my-tag   # Optional
          - id: 1234567890
            # no tag — falls back to top-level tag

Scroll HTML Exporter

Exports Confluence pages as HTML via the K15t Scroll HTML Exporter REST API. The export is downloaded as a ZIP and extracted locally. This is typically followed by the confluence-html-to-markdown transformer skill.

API tokens are exporter-specific — you need a Scroll HTML Exporter token (User Profile → Personal settings → Scroll HTML Exporter API Tokens).

Data residency: use the correct regional endpoint:

  • US: https://scroll-html.us.exporter.k15t.app/api/public/1/exports
  • EU/Germany: https://scroll-html.de.exporter.k15t.app/api/public/1/exports
- skill: &ScrollHTMLExporter
    type: exporter
    name: scrollhtml-exporter
    params:
        api_url: https://scroll-html.de.exporter.k15t.app/api/public/1/exports  # Use .us. for US region
        auth_token: env.SCROLL_HTML_EXPORTER_TOKEN  # Scroll HTML Exporter API token
        poll_interval: 2    # Interval in seconds to check the status of the export
        export_folder: ~/Downloads/html_export  # Path where the exported ZIP is extracted
        scope: current  # Possible values: [current | descendants | document]
        template_id: com.k15t.scroll.html.helpcenter  # Optional: defaults to the bundled Help Center template
        confluence_prefix: https://your-instance.atlassian.net/wiki  # Optional: used to build source_url
        tag: my-docs  # Optional: default tag for all pages
        page_ids:
          - id: 1436680207
            tag: copilot-docs  # Optional
        page_urls:
          - url: https://your-instance.atlassian.net/wiki/spaces/SPACE/pages/123/Page+Title

Transformer Skills

Transform data from one format to another on disk. Transformers sit between exporters and file-scanners in the pipeline.

Confluence HTML to Markdown

Converts a Scroll HTML export folder into self-contained Markdown files. Images referenced in the pages are copied into an images/ sub-folder so the output is portable without the original HTML.

Typically used after scrollhtml-exporter and before file-scanner.

- skill: &HtmlToMarkdown
    type: transformer
    name: confluence-html-to-markdown
    params:
        input_dir: ~/Downloads/html_export/1436680207  # Path to the extracted Scroll HTML export
        output_dir: ~/Downloads/html_export/1436680207/markdown  # Optional: defaults to <input_dir>/markdown
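
For illustration, a fragment wiring the exporter, transformer, and scanner together might look like the following. It assumes the export ZIP is extracted into a sub-folder named after the page ID (as in the input_dir example above) and that the scanner then picks up the generated Markdown; the exact folder layout of your export may differ, so check the extracted output before wiring the paths.

# Sketch of use case 3: Scroll HTML export -> Markdown conversion -> file scan.
- skill: &ScrollHTMLExporter
    type: exporter
    name: scrollhtml-exporter
    params:
        api_url: https://scroll-html.de.exporter.k15t.app/api/public/1/exports
        auth_token: env.SCROLL_HTML_EXPORTER_TOKEN
        poll_interval: 2
        export_folder: ~/Downloads/html_export
        scope: current
        page_ids:
          - id: 1436680207
- skill: &HtmlToMarkdown
    type: transformer
    name: confluence-html-to-markdown
    params:
        input_dir: ~/Downloads/html_export/1436680207           # assumed extraction sub-folder (page ID)
        output_dir: ~/Downloads/html_export/1436680207/markdown
- skill: &FileScanner
    type: file-scanner
    name: multi-file-scanner
    params:
        path: ~/Downloads/html_export/1436680207/markdown       # points at the transformer's output_dir
        filter: ["*.md"]
        recursive: false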

File Scanner Skills

Scan a local folder for documents to be indexed.

Multi-File Scanner

Scans the local disk for documents to be indexed.

- skill: &FileScanner
    type: file-scanner
    name: multi-file-scanner
    params:
        path: /path/to/your/documents
        filter: ["*.md"]    # Optional. If missing or empty, all files are considered. Use filter to narrow the scope, e.g. ["*.md", "*.txt"] or ["globaldns*.md"]
        recursive: false    # false - scans only the folder indicated by `path`, true - scans the folder indicated by `path` and all its subfolders

File Reader Skills

Read the content of files.

Azure Document Intelligence

Uses Azure Document Intelligence to extract the textual content from the input file.

- skill: &DocumentIntelligence
    type: file-reader
    name: azure-document-intelligence
    params:
      endpoint: "https://your-form-recognizer-endpoint"
      api_key: env.AZURE_FORM_RECOGNIZER_KEY

Multi File Reader

Reads the content of local files, selecting a loader based on the file extension. Supported file extensions:

  • .md: Markdown files using UnstructuredMarkdownLoader
  • .txt: Text files using TextLoader
  • .pdf: PDF files using PyPDFLoader
  • .doc, .docx: Word documents using UnstructuredWordDocumentLoader
  • .ppt, .pptx: PowerPoint files using UnstructuredPowerPointLoader
  • .xls, .xlsx: Excel files using UnstructuredExcelLoader
- skill: &FileReader
    type: file-reader   # fixed parameter, do not change
    name: multi-file-reader # fixed parameter, do not change

Web Loaders

Load data from the web or from structured files.

Jira Loader

Loads data from Jira issues.

- skill: &JiraLoader
    type: loader
    name: jira-loader
    params:
        server_url: https://your/corporate/jira/url
        api_token: env.JIRA_PAT # Jira Personal Authentication Token. Can be obtained from Jira.
        issues: # You need to list Jira issues one by one. This is intentional and allows you to control exactly what data goes in.
            - JSTAD-XYZ
            - JIRA-1234

Teams Q&A Loader

Loads enriched Q&A pairs from a JSON file produced by the FAQ enrichment pipeline. Each Q&A pair becomes a single document with one chunk. The skill prefers rephrased questions/answers when available, falling back to originals.

Each Q&A object in the JSON can optionally include a tag field that overrides the skill-level tag for that specific chunk, allowing fine-grained tagging within a single file.

- skill: &TeamsQnALoader
    type: loader
    name: teams-qna-loader
    params:
      file_path: data/processed_output/enriched_qna.json   # Required: path to enriched Q&A JSON file
      tag: teams-faq                                        # Optional: default tag for chunks (default: "enriched-qna"); can be overridden per Q&A object via a "tag" field in the JSON

Text Splitters

Split large text data into smaller chunks.

Recursive Character Splitter

Splits text into chunks of a certain size, with overlap. Ideal for getting started.

- skill: &TextSplitter
    type: splitter  # fixed parameter, do not change
    name: recursive-character-splitter # fixed parameter, do not change
    params:
      chunk_size: 1200  # you can experiment with this value. Don't go too big or too small
      overlap: 180  # you can experiment with this value. Don't go too big or too small

Semantic Splitter

Splits text by grouping semantically similar chunks together. A bit more advanced than the Recursive Character Splitter.

- skill: &SemanticSplitter
    type: splitter  # fixed parameter, do not change
    name: semantic-splitter # fixed parameter, do not change
    params:
        embedding_model: # Currently only Azure embedding models are supported
            endpoint: https://your-embedding-endpoint
            api_key: env.AZURE_EMBEDDING_KEY
            api_version: your-api-version
            deployment_name: your-deployment-name

Confluence FAQ Splitter

Extracts Q&A pairs directly from FAQ .docx files exported from Confluence. Each heading that contains a ? or starts with a problem/question pattern (e.g. "How do I", "I cannot") is treated as a question, and the body content below it becomes the answer. Each Q&A pair is produced as a single atomic chunk. No file-reader is needed — this skill reads .docx files directly via python-docx.

Each chunk's document_id is a SHA-256 hash of the question text only, so the ID stays stable even when the answer is updated. This makes it a reliable unique key for Azure AI Search upserts: changed Q&A pairs are re-indexed in place without creating duplicates, and pairs whose answers haven't changed are skipped by the json-writer change gate.

All parameters are optional with sensible defaults.

- skill: &ConfluenceFAQSplitter
    type: splitter
    name: confluence-faq-splitter
    params:
      min_heading_level: 2          # Minimum heading level for questions (default: 2)
      max_heading_level: 6          # Maximum heading level for questions (default: 6)
      skip_headings:                # Heading titles to skip (default: ['summary'])
        - summary
      skip_patterns:                # Text patterns to skip in answer content (default: ['CONFIDENTIAL', 'Search the FAQ', 'Search Artifactory FAQ'])
        - CONFIDENTIAL
      question_patterns:            # Prefixes that indicate a question (default: ['i am ', 'i cannot ', 'how do i ', 'what is ', ...])
        - "how do i "
        - "i cannot "
      stop_sections:                # Regex patterns for sections that end Q&A extraction (default: ['related articles', 'see also'])
        - "^\\s*related\\s*articles?\\s*$"

Writer Skills

Capture intermediate pipeline state to a file and optionally gate downstream processing.

JSON Writer

Extracts text content from all chunks and writes it as a sorted JSON array to a file. Useful for inspecting intermediate pipeline state (e.g. after splitting) and as a per-chunk change-detection gate: when checksum_path is configured, the skill computes a SHA-256 checksum of each chunk's content individually and stores the results in a JSON map keyed by document_id. On subsequent runs, only chunks whose content has changed (or are new) are passed downstream — unchanged chunks are stripped from their documents, so embedding and indexing are skipped for those chunks only.

This works well with Azure AI Search's key-based upsert — changed documents are re-indexed in place without creating duplicates.

- skill: &JSONWriter
    type: writer
    name: json-writer
    params:
      output_path: data/pipeline_output.json       # Path to the combined output JSON file (default: "data/pipeline_output.json")
      checksum_path: data/checksums.json           # Optional: path to a JSON file storing per-chunk SHA-256 checksums keyed by document_id. Enables per-chunk change detection.
      skip_downstream_if_unchanged: true           # Optional: if true (default) and checksum_path is set, strips unchanged chunks from their documents, skipping their embedding/indexing
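
As a rough sketch of use case 7, the writer sits between the splitter and the embedding skill, so that only new or changed chunks reach the embedding and vector-store steps. The paths below are placeholders, and the assumption that skills run in the order they are listed mirrors the full-config sketch shown earlier.

# Sketch: json-writer as a change gate between splitting and embedding.
- skill: &TextSplitter
    type: splitter
    name: recursive-character-splitter
    params:
      chunk_size: 1200
      overlap: 180
- skill: &JSONWriter             # change gate: strips unchanged chunks before embedding
    type: writer
    name: json-writer
    params:
      output_path: data/pipeline_output.json
      checksum_path: data/checksums.json
      skip_downstream_if_unchanged: true
- skill: &Ada002Embedding        # only runs on chunks that passed the gate
    type: embedding
    name: azure-ada002-embedding
    params:
      endpoint: https://your-embedding-endpoint
      api_key: env.AZURE_EMBEDDING_API_KEY
      api_version: your-api-version
      deployment_name: your-deployment-name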

Embedding

Generate embeddings from text. An embedding is a vector representation of your text data.

Azure Embeddings

Generates embeddings from text using embedding models deployed via the Azure portal.

- skill: &Ada002Embedding
    type: embedding
    name: azure-ada002-embedding
    params:     # Configuration can be retrieved from your Azure Portal
      endpoint: https://your-embedding-endpoint
      api_key: env.AZURE_EMBEDDING_API_KEY
      api_version: your-api-version
      deployment_name: your-deployment-name

Fast Embed

Generates embeddings from text using the llama_index library.

- skill: &FastEmbed
    type: embedding
    name: llama-fastembed

AWS Bedrock Titan

Generates embeddings using AWS Bedrock's Titan Embed Text v2 model. AWS credentials are resolved from the standard boto3 credential chain (env vars AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN, AWS_PROFILE, IAM role, ~/.aws/credentials, etc.) — do not put them in the YAML.

- skill: &BedrockTitanEmbedding
    type: embedding
    name: bedrock-titan-embedding
    params:
      region: us-east-1                         # Optional: falls back to AWS_REGION / default profile region
      model_id: amazon.titan-embed-text-v2:0    # Optional (default)
      dimensions: 1024                          # Optional: 256 | 512 | 1024 (default 1024)
      normalize: true                           # Optional (default true)
      max_retries: 3                            # Optional (default 3)
      retry_backoff: 2                          # Optional seconds, linear per attempt (default 2)

Vector Store

Store embeddings in a vector store.

Azure AI Search

Stores embeddings in an Azure AI Search index.

- skill: &AzureAISearch
    type: vector-store
    name: azure-ai-search
    params:
      endpoint: http://your-endpoint
      index_name: your-index-name
      api_key: env.AZURE_AI_SEARCH_API_KEY  # Optional. If missing, RBAC-based authentication is attempted; you might need to log in to Azure beforehand in your terminal using the `azd` tool.
      field_mapping:    # The left-hand side is the internal representation, the right-hand side is your schema. Provide a mapping between the internal representation and your schema; if there's a field you don't need, just remove the line.
        document_id: document_id
        content: content
        source_link: source_link
        document_name: document_name
        embedding: embedding
      overwrite_index: true  # true - before storing data, it will remove all the documents from your index. false - will append documents to your index
      batch_size: 50            # Optional: number of documents uploaded per API call (default: 50, max: 50)

Chroma

Stores embeddings in a Chroma vector store. Ideal for prototyping.

- skill: &ChromaDbVectorStore
    type: vector-store
    name: chromadb
    params:
        db_path: path/to/where/your/chroma/db/is    # if you don't have any yet, a new one will be created at the specified path
        collection_name: replace-this-with-your-collection-name # if you don't have a collection yet, a new one will be created when documents are inserted

FAISS

Stores embeddings in a FAISS vector store.

- skill: &FaissDbVectorStore
    type: vector-store
    name: faissdb
    params:
        db_path: path/to/where/your/faiss/db/is    # if you don't have any yet, a new one will be created at the specified path
        dimension: replace-with-your-embeddings-dimension # Ensure that the correct dimension is provided; it must match the embedding model you have selected
        overwrite_index: true  # true - before storing data, it will remove all the documents from your index. false - will append documents to your index

Contributors

All contributions are welcome!

The above list of skills and functionality covers the typical use cases identified so far.

If you have any cool ideas for extending the existing skills, or for creating new ones, please contribute! Your contributions and feedback are key to making this project a success!

Thank you so much! <3