Commit 2e56a46

Author: Dorin POMIAN
feat: add html export skill for confluence
1 parent 5c031ff commit 2e56a46

9 files changed

Lines changed: 2161 additions & 1728 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -76,7 +76,7 @@ uv run --directory src docs2vecs indexer --config ~/Downloads/sw_export_temp/con
 
 The config yaml file is validated against [this schema](./src/docs2vecs/subcommands/indexer/config/config_schema.yaml).
 
-Please check [sample config file 1](docs/readme/sample-config-file-1.yml), [sample config file 2](docs/readme/sample-config-file-2.yml) for your reference.
+Please check [sample config file 1](docs/readme/sample-config-file-1.yml), [sample config file 2](docs/readme/sample-config-file-2.yml), [sample config file 3](docs/readme/sample-config-file-3.yml) for your reference.
 
 </details>
 
@@ -166,7 +166,7 @@ Please note that **api keys** should **NOT** be stored in config files, and shou
 
 Make sure you export the environment variables before you run the indexer. For convenience you can use the `--env` argument to supply your own `.env` file.
 
-Generate and use Scroll Word Exporter API tokens from the Personal Settings section of your Confluence profile.
+Generate and use Scroll Word Exporter API tokens from the Personal Settings section of your Confluence profile. For the Scroll HTML Exporter, generate a separate token from Personal Settings → Scroll HTML Exporter API Tokens. Note that each Scroll exporter (Word, HTML, PDF) requires its own token, and you must use the correct regional endpoint (US or EU/Germany).
 
 ## Experimental features
 <details><summary>Tracker</summary>
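
The README change above boils down to two settings: a region-specific endpoint and an exporter-specific token read from the environment. A minimal sketch of how a caller might resolve both is shown here. The endpoint URLs and the `SCROLL_HTML_EXPORTER_TOKEN` variable name come from this commit; the function name `scroll_html_settings` and the `"us"`/`"de"` region keys are hypothetical and not part of docs2vecs.

```python
import os

# Regional endpoints as listed in the commit's documentation.
SCROLL_HTML_ENDPOINTS = {
    "us": "https://scroll-html.us.exporter.k15t.app/api/public/1/exports",
    "de": "https://scroll-html.de.exporter.k15t.app/api/public/1/exports",
}


def scroll_html_settings(region, env=None):
    """Return (api_url, token) for a region, failing early with a clear error.

    Hypothetical helper for illustration; not the skill's actual code.
    """
    env = os.environ if env is None else env
    try:
        api_url = SCROLL_HTML_ENDPOINTS[region.lower()]
    except KeyError:
        known = sorted(SCROLL_HTML_ENDPOINTS)
        raise ValueError(f"unknown region {region!r}; expected one of {known}") from None
    token = env.get("SCROLL_HTML_EXPORTER_TOKEN")
    if not token:
        # A Scroll Word or PDF token will not work here: tokens are per exporter.
        raise RuntimeError("SCROLL_HTML_EXPORTER_TOKEN is not set")
    return api_url, token
```

Passing the environment as a plain mapping keeps the helper easy to exercise without touching the real process environment.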

docs/readme/indexer-skills.md

Lines changed: 61 additions & 4 deletions
@@ -20,27 +20,37 @@ This document describes all available skills that can be used in the indexer pip
     4. An `embedding` to generate embeddings from the chunks.
     5. A `vector-store` to store the embeddings.
 
-3. You have a list of jira tickets that you'd like to vectorize? You'll typically need the following type of skills in your config file:
+3. You have Confluence pages and want to export them as HTML, convert to Markdown, then vectorize? You'll typically need:
+
+    1. A `scrollhtml-exporter` to export the Confluence pages as HTML (ZIP).
+    2. A `confluence-html-to-markdown` transformer to convert the HTML export into self-contained Markdown (with images).
+    3. A `file-scanner` to pick up the resulting `.md` files.
+    4. A `file-reader` to read the Markdown content.
+    5. A `splitter` to split the documents into chunks.
+    6. An `embedding` to generate embeddings from the chunks.
+    7. A `vector-store` to store the embeddings.
+
+4. You have a list of jira tickets that you'd like to vectorize? You'll typically need the following type of skills in your config file:
 
     1. A `jira-loader` to extract the data from the jira tickets
     2. A `splitter` to split the data into chunks.
     3. An `embedding` to generate embeddings from the chunks.
     4. A `vector-store` to store the embeddings.
 
-4. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:
+5. You have FAQ documents exported from Confluence (`.docx` files) and want to extract Q&A pairs for vectorization? You'll typically need:
 
     1. An `exporter` (Scroll Word) or `file-scanner` to get the `.docx` files.
     2. A `confluence-faq-splitter` to extract Q&A pairs directly from the `.docx` headings.
     3. An `embedding` to generate embeddings from the Q&A chunks.
     4. A `vector-store` to store the embeddings.
 
-5. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:
+6. You have enriched Q&A JSON output from a Teams FAQ pipeline and want to index it? You'll typically need:
 
     1. A `teams-qna-loader` to load the enriched Q&A pairs from the JSON file.
     2. An `embedding` to generate embeddings from the Q&A content.
     3. A `vector-store` to store the embeddings.
 
-6. You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:
+7. You want to avoid re-running expensive embedding and indexing when the content hasn't changed since the last run? Insert a `writer` (`json-writer`) skill as a change gate:
 
     1. A `file-scanner` (or `exporter`) to locate/export your source documents.
     2. A `file-reader` to read their content.
@@ -81,6 +91,53 @@ Exports Confluence pages to Microsoft Word documents. Each entry in `page_urls`
         - id: 1234567890
           # no tag — falls back to top-level tag
 ```
+
+### Scroll HTML Exporter
+Exports Confluence pages as HTML via the K15t Scroll HTML Exporter REST API. The export is downloaded as a ZIP and extracted locally. This is typically followed by the `confluence-html-to-markdown` transformer skill.
+
+API tokens are exporter-specific — you need a **Scroll HTML Exporter** token (User Profile → Personal settings → Scroll HTML Exporter API Tokens).
+
+Data residency: use the correct regional endpoint:
+- US: `https://scroll-html.us.exporter.k15t.app/api/public/1/exports`
+- EU/Germany: `https://scroll-html.de.exporter.k15t.app/api/public/1/exports`
+
+```yaml
+- skill: &ScrollHTMLExporter
+    type: exporter
+    name: scrollhtml-exporter
+    params:
+      api_url: https://scroll-html.de.exporter.k15t.app/api/public/1/exports # Use .us. for US region
+      auth_token: env.SCROLL_HTML_EXPORTER_TOKEN # Scroll HTML Exporter API token
+      poll_interval: 2 # Interval in seconds to check the status of the export
+      export_folder: ~/Downloads/html_export # Path where the exported ZIP is extracted
+      scope: current # Possible values: [current | descendants | document]
+      template_id: com.k15t.scroll.html.helpcenter # Optional: defaults to the bundled Help Center template
+      confluence_prefix: https://your-instance.atlassian.net/wiki # Optional: used to build source_url
+      tag: my-docs # Optional: default tag for all pages
+      page_ids:
+        - id: 1436680207
+          tag: copilot-docs # Optional
+      page_urls:
+        - url: https://your-instance.atlassian.net/wiki/spaces/SPACE/pages/123/Page+Title
+```
+</details>
+
+<details><summary>Transformer Skills</summary>
+Transform data from one format to another on disk. Transformers sit between exporters and file-scanners in the pipeline.
+
+### Confluence HTML to Markdown
+Converts a Scroll HTML export folder into self-contained Markdown files. Images referenced in the pages are copied into an `images/` sub-folder so the output is portable without the original HTML.
+
+Typically used after `scrollhtml-exporter` and before `file-scanner`.
+
+```yaml
+- skill: &HtmlToMarkdown
+    type: transformer
+    name: confluence-html-to-markdown
+    params:
+      input_dir: ~/Downloads/html_export/1436680207 # Path to the extracted Scroll HTML export
+      output_dir: ~/Downloads/html_export/1436680207/markdown # Optional: defaults to <input_dir>/markdown
+```
 </details>
 
 <details><summary>File Scanner Skills</summary>
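
The `poll_interval` parameter documented above implies that the exporter starts an export job and then checks its status repeatedly until the ZIP is ready. The loop below is a rough sketch of that pattern, not the skill's actual implementation. `wait_for_export`, the `fetch_status` callable, the `timeout` argument, and the `status`/`downloadUrl` field names are all assumptions made for illustration; the real K15t response shape should be checked against the vendor documentation.

```python
import time


def wait_for_export(fetch_status, poll_interval=2, timeout=300,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job completes, fails, or times out.

    fetch_status is a hypothetical callable returning a dict such as
    {"status": "complete", "downloadUrl": "..."}; these keys are assumed.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "complete":
            return job
        if status == "error":
            raise RuntimeError(f"export failed: {job}")
        if clock() >= deadline:
            raise TimeoutError(f"export still {status!r} after {timeout}s")
        # Wait poll_interval seconds before asking again.
        sleep(poll_interval)
```

Injecting `sleep` and `clock` lets the loop be exercised with canned responses instead of real HTTP calls and real waiting.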
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+definitions:
+  - skill: &ScrollHTMLExporter
+      type: exporter
+      name: scrollhtml-exporter
+      params:
+        api_url: https://scroll-html.de.exporter.k15t.app/api/public/1/exports
+        auth_token: env.SCROLL_HTML_EXPORTER_TOKEN
+        poll_interval: 2
+        export_folder: ~/Downloads/html_export
+        scope: current
+        confluence_prefix: https://amadeus.atlassian.net/wiki
+        page_ids:
+          - id: 1436680207
+
+  - skill: &HtmlToMarkdown
+      type: transformer
+      name: confluence-html-to-markdown
+      params:
+        input_dir: ~/Downloads/html_export/1436680207
+        output_dir: ~/Downloads/html_export/1436680207/markdown
+
+  - skill: &FileScanner
+      type: file-scanner
+      name: multi-file-scanner
+      params:
+        path: ~/Downloads/html_export/1436680207/markdown
+        filter: ["*.md"]
+        recursive: false
+
+  - skill: &FileReader
+      type: file-reader
+      name: multi-file-reader
+
+  - skill: &TextSplitter
+      type: splitter
+      name: recursive-character-splitter
+      params:
+        chunk_size: 1200
+        overlap: 200
+
+  - skill: &FastEmbed
+      type: embedding
+      name: llama-fastembed
+
+  - skill: &ChromaDbVectorStore
+      type: vector-store
+      name: chromadb
+      params:
+        db_path: ~/Downloads/html_export/chroma_db
+        collection_name: confluence-html-export
+
+  - skillset: &Pipeline
+      - *ScrollHTMLExporter
+      - *HtmlToMarkdown
+      - *FileScanner
+      - *FileReader
+      - *TextSplitter
+      - *FastEmbed
+      - *ChromaDbVectorStore
+
+indexer:
+  id: confluence-html-to-vectorstore
+  skillset: *Pipeline
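
In the sample config above, three paths must line up: the exporter's `export_folder`, the transformer's `input_dir`/`output_dir`, and the scanner's `path`. The sketch below derives the latter from the former so they cannot drift apart. It assumes the exporter extracts each page into `<export_folder>/<page_id>`, which is how the sample is laid out but is not confirmed by the diff; `derive_pipeline_paths` is a hypothetical helper, not part of docs2vecs.

```python
from pathlib import PurePosixPath


def derive_pipeline_paths(export_folder, page_id):
    """Derive transformer and scanner paths from the exporter settings.

    Assumes one sub-folder per page id under export_folder (an inference
    from the sample config, not a documented guarantee).
    """
    input_dir = PurePosixPath(export_folder) / str(page_id)
    # The docs state that output_dir defaults to <input_dir>/markdown.
    output_dir = input_dir / "markdown"
    return {
        "transformer_input_dir": str(input_dir),
        "transformer_output_dir": str(output_dir),
        "scanner_path": str(output_dir),
    }
```

For the values in the sample, `derive_pipeline_paths("~/Downloads/html_export", 1436680207)` yields the same three paths that are written out by hand in the config.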

pyproject.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ dependencies = [
1313
"azure-identity>=1.15.0",
1414
"azure-search-documents>=11.5.2",
1515
"azure-storage-blob>=12.24.1",
16+
"beautifulsoup4>=4.12",
1617
"cerberus>=1.3.7",
1718
"chromadb>=0.6.3",
1819
"jira>=3.8.0",
@@ -26,6 +27,7 @@ dependencies = [
2627
"llama-index-retrievers-bm25>=0.5.2",
2728
"llama-index-vector-stores-chroma>=0.4.1",
2829
"markdown>=3.7",
30+
"markdownify>=0.14",
2931
"openpyxl>=3.1.5",
3032
"pymongo>=4.11.1",
3133
"pystemmer>=2.2.0.3",

src/docs2vecs/subcommands/indexer/config/config_schema.yaml

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -96,7 +96,7 @@ definitions:
9696
scope:
9797
type: string
9898
required: False
99-
allowed: ['descendants', 'current']
99+
allowed: ['descendants', 'current', 'document']
100100
recursive:
101101
type: boolean
102102
filter:
@@ -265,6 +265,9 @@ definitions:
265265
type: integer
266266
required: False
267267
min: 1
268+
template_id:
269+
type: string
270+
required: False
268271

269272
skillset:
270273
type: list
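
The schema change above widens `scope` to accept `document` and adds an optional string `template_id`. The project validates configs with Cerberus; the snippet below is a dependency-free stand-in that mirrors only these two rules, so the effect of the change can be seen in isolation. `check_exporter_params` and its error wording are hypothetical and do not reproduce Cerberus output.

```python
# Allowed values after this commit (previously only the first two).
ALLOWED_SCOPES = ("descendants", "current", "document")


def check_exporter_params(params):
    """Return a list of error strings; an empty list means the params pass.

    Hand-rolled illustration of the two schema rules, not the real validator.
    """
    errors = []
    scope = params.get("scope")
    if scope is not None and scope not in ALLOWED_SCOPES:
        errors.append(f"scope: unallowed value {scope!r}")
    template_id = params.get("template_id")
    if template_id is not None and not isinstance(template_id, str):
        errors.append("template_id: must be of string type")
    return errors
```

Both keys are optional in the schema (`required: False`), which is why a missing value is not reported as an error here.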

src/docs2vecs/subcommands/indexer/skills/__init__.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from .azure_vector_store_skill import AzureVectorStoreSkill
33
from .document_intelligence_skill import AzureDocumentIntelligenceSkill
44
from .jira_loader_skill import JiraLoaderSkill
5+
from .scrollhtmlexporter_skill import ScrollHTMLExporterSkill
56
from .scrollwordexporter_skill import ScrollWorldExporterSkill
67
from .chromadb_vector_store_skill import ChromaDBVectorStoreSkill
78
from .tracker import VectorStoreTracker
@@ -25,6 +26,7 @@
2526
"AzureDocumentIntelligenceSkill",
2627
"AzureVectorStoreSkill",
2728
"JiraLoaderSkill",
29+
"ScrollHTMLExporterSkill",
2830
"ScrollWorldExporterSkill",
2931
"VectorStoreTracker",
3032
"ChromaDBVectorStoreSkill",

src/docs2vecs/subcommands/indexer/skills/factory.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@
1212
from docs2vecs.subcommands.indexer.skills import JiraLoaderSkill
1313
from docs2vecs.subcommands.indexer.skills import LlamaFastembedEmbeddingSkill
1414
from docs2vecs.subcommands.indexer.skills import RecursiveCharacterTextSplitter
15+
from docs2vecs.subcommands.indexer.skills import ScrollHTMLExporterSkill
1516
from docs2vecs.subcommands.indexer.skills import ScrollWorldExporterSkill
1617
from docs2vecs.subcommands.indexer.skills import SemanticSplitter
1718
from docs2vecs.subcommands.indexer.skills import VectorStoreTracker
@@ -39,6 +40,7 @@ class SkillType(StrEnum):
3940
class AvailableSkillName(StrEnum):
4041
# exporters
4142
SCROLLWORD_EXPORTER = "scrollword-exporter"
43+
SCROLLHTML_EXPORTER = "scrollhtml-exporter"
4244

4345
# file readers
4446
AZ_DOCUMENT_INTELLIGENCE = "azure-document-intelligence"
@@ -79,6 +81,7 @@ class AvailableSkillName(StrEnum):
7981
AVAILABLE_SKILLS = {
8082
SkillType.EXPORTER: {
8183
AvailableSkillName.SCROLLWORD_EXPORTER: ScrollWorldExporterSkill,
84+
AvailableSkillName.SCROLLHTML_EXPORTER: ScrollHTMLExporterSkill,
8285
},
8386
SkillType.FILE_SCANNER: {AvailableSkillName.MULTI_FILE_SCANNER: FileScannerSkill},
8487
SkillType.FILE_READER: {
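
The factory change above registers the new skill in a two-level lookup: skill type first, then skill name. The reduced sketch below shows how a `type`/`name` pair from the config could resolve to a class through such a registry. The two skill classes are empty stubs, `resolve_skill_class` is a hypothetical helper, and `(str, Enum)` stands in for the `StrEnum` used in the source so the snippet also runs on Python versions older than 3.11.

```python
from enum import Enum


class SkillType(str, Enum):
    EXPORTER = "exporter"


class AvailableSkillName(str, Enum):
    SCROLLWORD_EXPORTER = "scrollword-exporter"
    SCROLLHTML_EXPORTER = "scrollhtml-exporter"


class ScrollWorldExporterSkill:
    """Stub standing in for the real skill class."""


class ScrollHTMLExporterSkill:
    """Stub standing in for the real skill class."""


# Same shape as the registry in factory.py, reduced to the exporter entries.
AVAILABLE_SKILLS = {
    SkillType.EXPORTER: {
        AvailableSkillName.SCROLLWORD_EXPORTER: ScrollWorldExporterSkill,
        AvailableSkillName.SCROLLHTML_EXPORTER: ScrollHTMLExporterSkill,
    },
}


def resolve_skill_class(skill_type, name):
    """Map the config's type/name strings to a registered class."""
    try:
        return AVAILABLE_SKILLS[SkillType(skill_type)][AvailableSkillName(name)]
    except (ValueError, KeyError):
        raise ValueError(
            f"no skill registered for type={skill_type!r}, name={name!r}"
        ) from None
```

With this layout, adding a skill needs only an enum member and a registry entry, which matches the three added lines in the factory diff.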
