 The config yaml file is validated against [this schema](./src/docs2vecs/subcommands/indexer/config/config_schema.yaml).

-Please check [sample config file 1](docs/readme/sample-config-file-1.yml), [sample config file 2](docs/readme/sample-config-file-2.yml) for your reference.
+Please check [sample config file 1](docs/readme/sample-config-file-1.yml), [sample config file 2](docs/readme/sample-config-file-2.yml), [sample config file 3](docs/readme/sample-config-file-3.yml) for your reference.

 </details>
@@ -166,7 +166,7 @@ Please note that **api keys** should **NOT** be stored in config files, and shou

 Make sure you export the environment variables before you run the indexer. For convenience you can use the `--env` argument to supply your own `.env` file.

-Generate and use Scroll Word Exporter API tokens from the Personal Settings section of your Confluence profile.
+Generate and use Scroll Word Exporter API tokens from the Personal Settings section of your Confluence profile. For the Scroll HTML Exporter, generate a separate token from Personal Settings → Scroll HTML Exporter API Tokens. Note that each Scroll exporter (Word, HTML, PDF) requires its own token, and you must use the correct regional endpoint (US or EU/Germany).
docs/readme/indexer-skills.md: 61 additions & 4 deletions
@@ -20,27 +20,37 @@ This document describes all available skills that can be used in the indexer pip
    4. An `embedding` to generate embeddings from the chunks.
    5. A `vector-store` to store the embeddings.

-3. You have a list of jira tickets that you'd like to vectorize? You'll typically need the following type of skills in your config file:
+3. You have Confluence pages and want to export them as HTML, convert to Markdown, then vectorize? You'll typically need:
+
+   1. A `scrollhtml-exporter` to export the Confluence pages as HTML (ZIP).
+   2. A `confluence-html-to-markdown` transformer to convert the HTML export into self-contained Markdown (with images).
+   3. A `file-scanner` to pick up the resulting `.md` files.
+   4. A `file-reader` to read the Markdown content.
+   5. A `splitter` to split the documents into chunks.
+   6. An `embedding` to generate embeddings from the chunks.
+   7. A `vector-store` to store the embeddings.
+
+4. You have a list of jira tickets that you'd like to vectorize? You'll typically need the following type of skills in your config file:

    1. A `jira-loader` to extract the data from the jira tickets
    2. A `splitter` to split the data into chunks.
    3. An `embedding` to generate embeddings from the chunks.
    4. A `vector-store` to store the embeddings.

-4. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:
+5. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:

    1. An `exporter` (Scroll Word) or `file-scanner` to get the `.docx` files.
    2. A `confluence-faq-splitter` to extract Q&A pairs directly from the `.docx` headings.
    3. An `embedding` to generate embeddings from the Q&A chunks.
    4. A `vector-store` to store the embeddings.

-5. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:
+6. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:

    1. A `teams-qna-loader` to load the enriched Q&A pairs from the JSON file.
    2. An `embedding` to generate embeddings from the Q&A content.
    3. A `vector-store` to store the embeddings.

-6. You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:
+7. You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:

    1. A `file-scanner` (or `exporter`) to locate/export your source documents.
    2. A `file-reader` to read their content.
@@ -81,6 +91,53 @@ Exports Confluence pages to Microsoft Word documents. Each entry in `page_urls`
 - id: 1234567890
 # no tag — falls back to top-level tag
 ```
+
+### Scroll HTML Exporter
+Exports Confluence pages as HTML via the K15t Scroll HTML Exporter REST API. The export is downloaded as a ZIP and extracted locally. This is typically followed by the `confluence-html-to-markdown` transformer skill.
+
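As a rough illustration of the "downloaded as a ZIP and extracted locally" step, the sketch below unpacks an already-downloaded export with Python's standard `zipfile` module. The function name `extract_export` and all paths are hypothetical. This is not the exporter skill's actual implementation, and the download itself (the Scroll REST API call) is deliberately left out.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path: Path, target_dir: Path) -> list[Path]:
    """Extract an export ZIP into target_dir and return the HTML files found there.

    Hypothetical helper for illustration only; not part of docs2vecs.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall() sanitizes member names (absolute paths and ".."
        # components are stripped), so files stay inside target_dir.
        archive.extractall(target_dir)
    # Collect every extracted HTML page, at any depth, in a stable order.
    return sorted(target_dir.rglob("*.html"))
```

The returned list could then be handed to whatever step converts the pages, which is the role the `confluence-html-to-markdown` transformer plays in the pipeline described here.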
+API tokens are exporter-specific — you need a **Scroll HTML Exporter** token (User Profile → Personal settings → Scroll HTML Exporter API Tokens).
+
+Data residency: use the correct regional endpoint:
+Transform data from one format to another on disk. Transformers sit between exporters and file-scanners in the pipeline.
+
+### Confluence HTML to Markdown
+Converts a Scroll HTML export folder into self-contained Markdown files. Images referenced in the pages are copied into an `images/` sub-folder so the output is portable without the original HTML.
+
+Typically used after `scrollhtml-exporter` and before `file-scanner`.
+
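To make the "self-contained" idea concrete, here is a minimal sketch of the image-copying part only, using nothing but the Python standard library. The names `ImageSrcCollector` and `copy_page_images` are hypothetical, and the transformer's real implementation may differ. The sketch assumes image `src` values are plain relative paths, and it does not handle URL-encoded paths or file-name collisions.

```python
import shutil
from html.parser import HTMLParser
from pathlib import Path


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def copy_page_images(html_file: Path, output_dir: Path) -> dict[str, str]:
    """Copy locally referenced images into output_dir/images.

    Returns a mapping from each original src to its new relative path,
    which a converter could use to rewrite image links in the Markdown.
    """
    collector = ImageSrcCollector()
    collector.feed(html_file.read_text(encoding="utf-8"))

    images_dir = output_dir / "images"
    images_dir.mkdir(parents=True, exist_ok=True)

    mapping: dict[str, str] = {}
    for src in collector.sources:
        if "://" in src or src.startswith("data:"):
            continue  # remote or inline image: nothing to copy
        source_file = html_file.parent / src
        if not source_file.is_file():
            continue  # broken reference: skip it
        shutil.copy2(source_file, images_dir / source_file.name)
        mapping[src] = f"images/{source_file.name}"
    return mapping
```

Because the mapping points at paths relative to the output folder, Markdown that uses those paths stays valid when the folder is moved without the original HTML export.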
+```yaml
+- skill: &HtmlToMarkdown
+    type: transformer
+    name: confluence-html-to-markdown
+    params:
+      input_dir: ~/Downloads/html_export/1436680207 # Path to the extracted Scroll HTML export
+      output_dir: ~/Downloads/html_export/1436680207/markdown # Optional: defaults to <input_dir>/markdown