Commit ab65e6b

feat: add document-to-markdown conversion endpoints for multiple formats
1 parent 4fb9af3 commit ab65e6b

27 files changed: +2529 -0 lines changed
Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
# Document to Markdown Endpoints

This document describes the implemented endpoints for converting various document formats to Markdown.

## Endpoints

### PDF to Markdown
- **URL**: `/api/1/pdf-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### DOCX to Markdown
- **URL**: `/api/1/docx-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### DOC to Markdown
- **URL**: `/api/1/doc-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### EPUB to Markdown
- **URL**: `/api/1/epub-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### Text to Markdown
- **URL**: `/api/1/text-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### PPTX to Markdown
- **URL**: `/api/1/pptx-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### XLSX to Markdown
- **URL**: `/api/1/xlsx-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

### HTML to Markdown (Existing)
- **URL**: `/api/1/html-to-markdown`
- **Method**: `POST`
- **Authentication**: Required (Bearer token)

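The endpoint paths above follow one pattern per format, so a client can pick the endpoint from the file extension. The following is a minimal sketch of that lookup, not part of the commit: `ENDPOINTS_BY_EXTENSION` and `endpointForFilename` are hypothetical names, and the HTML endpoint is left out because the request format below does not list an extension for it.

```javascript
// Extensions taken from the request format section, mapped to the
// endpoint paths listed above.
const ENDPOINTS_BY_EXTENSION = {
  pdf: '/api/1/pdf-to-markdown',
  docx: '/api/1/docx-to-markdown',
  doc: '/api/1/doc-to-markdown',
  epub: '/api/1/epub-to-markdown',
  txt: '/api/1/text-to-markdown',
  pptx: '/api/1/pptx-to-markdown',
  xlsx: '/api/1/xlsx-to-markdown',
};

// Hypothetical helper: returns the endpoint path for a filename, or null
// when the extension is missing or not one of the documented formats.
function endpointForFilename(filename) {
  const dot = filename.lastIndexOf('.');
  if (dot === -1 || dot === filename.length - 1) return null;
  const extension = filename.slice(dot + 1).toLowerCase();
  return Object.hasOwn(ENDPOINTS_BY_EXTENSION, extension)
    ? ENDPOINTS_BY_EXTENSION[extension]
    : null;
}
```

Returning `null` for unknown extensions lets the caller reject a file before uploading it.
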
## Request Format

All endpoints expect a JSON payload with the following structure:

```json
{
  "file": "base64-encoded-file-content",
  "filename": "original-filename.pdf",
  "store": false
}
```

### Parameters

- **file** (required): Base64-encoded content of the document file
- **filename** (required): Original filename with the extension that matches the endpoint (.pdf, .docx, .doc, .epub, .txt, .pptx, .xlsx)
- **store** (optional): Boolean flag to store the converted document in Supabase storage; defaults to `false`

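As an illustration of the payload described above, the sketch below builds the request body from raw file bytes in Node.js. `buildConversionPayload` is a hypothetical helper, not something this commit defines; in a browser the Base64 step would need a different API than `Buffer`.

```javascript
// Hypothetical helper: builds the JSON body from raw bytes.
// `store` defaults to false, matching the documented default.
function buildConversionPayload(fileBytes, filename, store = false) {
  return {
    file: Buffer.from(fileBytes).toString('base64'),
    filename,
    store,
  };
}

// Usage: encode a small in-memory text file.
const payload = buildConversionPayload(Buffer.from('Hello, world!\n'), 'hello.txt');
console.log(JSON.stringify(payload));
```
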
## Response Format

### Success Response
- **Content-Type**: `text/markdown; charset=utf-8`
- **Content-Disposition**: `attachment; filename="converted-filename.md"`
- **Body**: The converted Markdown content as plain text

### Error Response
```json
{
  "error": "Error message",
  "details": "Additional error details"
}
```

80+
## Implementation Details
81+
82+
### Services
83+
- **pdf-to-markdown-service.js**: Handles PDF to Markdown conversion using pandoc
84+
- **docx-to-markdown-service.js**: Handles DOCX to Markdown conversion using pandoc
85+
- **doc-to-markdown-service.js**: Handles DOC to Markdown conversion using pandoc
86+
- **epub-to-markdown-service.js**: Handles EPUB to Markdown conversion using pandoc
87+
- **text-to-markdown-service.js**: Handles text to Markdown conversion using pandoc
88+
- **pptx-to-markdown-service.js**: Handles PPTX to Markdown conversion using pandoc
89+
- **xlsx-to-markdown-service.js**: Handles XLSX to Markdown conversion using pandoc
90+
91+
### Routes
92+
- **pdf-to-markdown.js**: Route handler for PDF conversion endpoint
93+
- **docx-to-markdown.js**: Route handler for DOCX conversion endpoint
94+
- **doc-to-markdown.js**: Route handler for DOC conversion endpoint
95+
- **epub-to-markdown.js**: Route handler for EPUB conversion endpoint
96+
- **text-to-markdown.js**: Route handler for text conversion endpoint
97+
- **pptx-to-markdown.js**: Route handler for PPTX conversion endpoint
98+
- **xlsx-to-markdown.js**: Route handler for XLSX conversion endpoint
99+
- **html-to-markdown.js**: Route handler for HTML conversion endpoint (existing)
100+
101+
### Validation
102+
- **pdfFileContent**: Validates PDF file upload requests
103+
- **docxFileContent**: Validates DOCX file upload requests
104+
- **docFileContent**: Validates DOC file upload requests
105+
- **epubFileContent**: Validates EPUB file upload requests
106+
- **textFileContent**: Validates text file upload requests
107+
- **pptxFileContent**: Validates PPTX file upload requests
108+
- **xlsxFileContent**: Validates XLSX file upload requests
109+
## Usage Examples

### Converting a PDF file

```javascript
// Replace the placeholder with the Base64-encoded PDF content
const pdfFile = '...';
const response = await fetch('/api/1/pdf-to-markdown', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_TOKEN'
  },
  body: JSON.stringify({
    file: pdfFile,
    filename: 'document.pdf',
    store: false
  })
});

const markdownContent = await response.text();
```

### Converting a DOCX file

```javascript
// Replace the placeholder with the Base64-encoded DOCX content
const docxFile = '...';
const response = await fetch('/api/1/docx-to-markdown', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_TOKEN'
  },
  body: JSON.stringify({
    file: docxFile,
    filename: 'document.docx',
    store: true // Store in Supabase
  })
});

const markdownContent = await response.text();
```

## Testing

A test/example script is available at [`examples/test-document-to-markdown-endpoints.js`](examples/test-document-to-markdown-endpoints.js) that demonstrates how to use all of the endpoints.

To run the example:
```bash
node examples/test-document-to-markdown-endpoints.js
```

Note: You'll need to update the script with a valid authentication token and actual file content before testing.

## Dependencies

- **pandoc**: Used for file conversion (must be installed on the server)
- **fs**: Node.js built-in module for file system operations (temporary file handling)
- **uuid**: Generates unique temporary file names
- **os**: Node.js built-in module that provides access to the temporary directory

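To show how these dependencies could fit together, here is a rough sketch of a conversion step: decode the upload into a uniquely named temporary file, run pandoc on it, and delete the file afterwards. It is not the code from the service files in this commit. `tempFileName` and `convertWithPandoc` are hypothetical names, `crypto.randomUUID()` stands in for the `uuid` package, and the sketch assumes pandoc can infer the input format from the file extension.

```javascript
// Hypothetical helper: unique temp file name that keeps only a plain
// alphanumeric extension, so a crafted filename cannot inject path parts.
function tempFileName(uniqueId, filename) {
  const match = /\.([A-Za-z0-9]+)$/.exec(filename);
  const extension = match ? `.${match[1].toLowerCase()}` : '';
  return `${uniqueId}${extension}`;
}

// Hypothetical conversion step. Dynamic imports keep the sketch usable
// from both ESM and CommonJS files.
async function convertWithPandoc(fileBase64, filename) {
  const { writeFile, rm } = await import('node:fs/promises');
  const { tmpdir } = await import('node:os');
  const { join } = await import('node:path');
  const { randomUUID } = await import('node:crypto');
  const { execFile } = await import('node:child_process');
  const { promisify } = await import('node:util');

  const inputPath = join(tmpdir(), tempFileName(randomUUID(), filename));
  await writeFile(inputPath, Buffer.from(fileBase64, 'base64'));
  try {
    // execFile passes arguments without a shell; pandoc writes to stdout.
    const { stdout } = await promisify(execFile)('pandoc', [inputPath, '--to', 'markdown']);
    return stdout;
  } finally {
    // Remove the temporary file whether or not the conversion succeeded.
    await rm(inputPath, { force: true });
  }
}
```

The `finally` block is what makes the cleanup unconditional, which matters when pandoc rejects a file.
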
## Format Support Limitations

### PowerPoint (PPTX) Support
- Pandoc can extract text content from PPTX files, but layout information is lost
- Slide structure may be preserved as headings in the Markdown output
- Images, animations, and complex formatting are not converted

### Excel (XLSX) Support
- **Limited Support**: Pandoc's XLSX support is experimental and may not work reliably
- Simple spreadsheets with text content may convert successfully
- Complex formulas, charts, and formatting are not supported
- Consider converting Excel files to CSV format for better results

### Recommended Alternatives
- For Excel files: Convert to CSV format first for more reliable conversion
- For PowerPoint files: Export as PDF first if layout preservation is important

## Error Handling

The endpoints handle various error scenarios:
- Invalid file format
- Missing required parameters
- Pandoc conversion failures
- File system errors
- Storage service errors (when `store` is `true`)

All errors are logged with appropriate context, and the client receives a meaningful error message.

## Security Considerations

- All endpoints require authentication
- Temporary files are automatically cleaned up after processing
- File content is validated before processing
- Error messages don't expose sensitive system information

examples/README.md

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# Document-to-Markdown Examples

This directory contains example scripts demonstrating how to use each of the document-to-markdown conversion endpoints.

## Available Examples

### Individual Format Examples

1. **[`pdf-to-markdown-example.js`](pdf-to-markdown-example.js)** - Convert PDF files to Markdown
2. **[`docx-to-markdown-example.js`](docx-to-markdown-example.js)** - Convert DOCX files to Markdown
3. **[`doc-to-markdown-example.js`](doc-to-markdown-example.js)** - Convert legacy DOC files to Markdown
4. **[`epub-to-markdown-example.js`](epub-to-markdown-example.js)** - Convert EPUB e-books to Markdown
5. **[`text-to-markdown-example.js`](text-to-markdown-example.js)** - Convert plain text files to Markdown
6. **[`pptx-to-markdown-example.js`](pptx-to-markdown-example.js)** - Convert PowerPoint presentations to Markdown ⚠️
7. **[`xlsx-to-markdown-example.js`](xlsx-to-markdown-example.js)** - Convert Excel spreadsheets to Markdown ⚠️

### Comprehensive Example

- **[`test-document-to-markdown-endpoints.js`](test-document-to-markdown-endpoints.js)** - Test all endpoints in one script

## Quick Start

### Prerequisites

1. **Authentication Token**: Replace `YOUR_AUTH_TOKEN` in any example with a valid bearer token
2. **Server Running**: Ensure the convert2doc server is running (usually on port 3000)
3. **Sample Files**: Place sample files in the examples directory or update the file paths in the scripts

### Running an Example

```bash
# Run a specific format example
node examples/pdf-to-markdown-example.js

# Run the comprehensive test
node examples/test-document-to-markdown-endpoints.js
```

### Using as Modules

All examples export their main conversion functions, which can be imported:

```javascript
import { convertPdfToMarkdown } from './examples/pdf-to-markdown-example.js';

// pdfBase64 holds the Base64-encoded PDF content
const markdownContent = await convertPdfToMarkdown(pdfBase64, 'document.pdf');
```

## Format Support & Limitations

### ✅ Fully Supported Formats

- **PDF** - Reliable text extraction, layout may be lost
- **DOCX** - Excellent support, formatting preserved
- **DOC** - Good support for legacy Word documents
- **EPUB** - Great for e-books, chapter structure preserved
- **Text** - Perfect conversion, paragraph structure maintained

### ⚠️ Limited Support Formats

- **PPTX** - Text-only extraction, no layout/images
- **XLSX** - Experimental, often fails, use CSV instead

## Example File Structure

Each individual example follows this pattern:

```javascript
// Sample base64 content
const sampleFileBase64 = '...';

// Main conversion function
async function convertFileToMarkdown(fileBase64, filename, store = false) {
  // API call implementation
}

// File reading example
async function exampleWithFile() {
  // Read file and convert
}

// Main runner
async function runExample() {
  // Setup instructions and execution
}

// Export for module use
export { convertFileToMarkdown };
```

## Configuration

### API Base URL

Update the `API_BASE_URL` constant in each example:

```javascript
const API_BASE_URL = 'http://localhost:3000'; // Change as needed
```

### Authentication

Replace the placeholder token in each example:

```javascript
'Authorization': 'Bearer YOUR_ACTUAL_TOKEN_HERE'
```

### File Paths

Update file paths to point to your actual documents:

```javascript
const filePath = './your-actual-document.pdf';
```

## Error Handling

All examples include error handling for:

- Network errors
- Authentication failures
- Conversion failures
- File system errors

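The four categories above can be told apart on the client with a few checks. The sketch below is one possible approach rather than the code used in the example scripts. It assumes the server answers authentication failures with HTTP 401 or 403, which this document does not specify; it relies on `fetch` rejecting with a `TypeError` on network failure and on Node.js file system errors carrying a `code` such as `ENOENT`. `describeFailure` is a hypothetical name.

```javascript
// Hypothetical helper: maps a caught error and/or an HTTP status to a
// human-readable category. Either argument may be undefined.
function describeFailure(error, status) {
  // Assumption: the server uses 401/403 for authentication failures.
  if (status === 401 || status === 403) {
    return 'Authentication failure: check the bearer token';
  }
  if (typeof status === 'number' && status >= 400) {
    return `Conversion failure: server responded with ${status}`;
  }
  // Node.js fs errors expose a string code such as ENOENT.
  if (error && error.code === 'ENOENT') {
    return 'File system error: input file not found';
  }
  // fetch rejects with a TypeError when the request cannot be made.
  if (error instanceof TypeError) {
    return 'Network error: could not reach the server';
  }
  return `Unexpected error: ${error ? error.message : 'unknown'}`;
}
```
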
## Output

Each example will:

1. Display conversion progress
2. Save the converted Markdown to a file
3. Show a preview of the converted content
4. Provide helpful error messages if conversion fails

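The save and preview steps can be sketched as follows. The helper names (`outputFilename`, `previewMarkdown`, `saveAndPreview`), the ten-line preview length, and the choice to write next to the current working directory are illustrative assumptions, not details taken from the example scripts.

```javascript
// Hypothetical helper: swaps the last extension for .md,
// or appends .md when there is no extension.
function outputFilename(filename) {
  return filename.replace(/\.[^./\\]+$/, '') + '.md';
}

// Hypothetical helper: first `maxLines` lines of the Markdown, followed by
// an ellipsis line when the content was truncated.
function previewMarkdown(markdown, maxLines = 10) {
  const lines = markdown.split('\n');
  const shown = lines.slice(0, maxLines).join('\n');
  return lines.length > maxLines ? `${shown}\n...` : shown;
}

// Writes the converted Markdown to disk and prints a short preview.
async function saveAndPreview(markdown, filename) {
  const { writeFile } = await import('node:fs/promises');
  const outputPath = outputFilename(filename);
  await writeFile(outputPath, markdown, 'utf8');
  console.log(`Saved ${outputPath}`);
  console.log(previewMarkdown(markdown));
}
```
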
## Tips for Success

### For PowerPoint (PPTX)
- Only text content will be extracted
- Consider converting to PDF first for layout preservation
- Slide structure may become headings

### For Excel (XLSX)
- Conversion often fails; this is expected
- Convert to CSV format first for better results
- Only simple spreadsheets may work

### For All Formats
- Ensure files are not corrupted
- Check file permissions
- Verify the server is running and accessible
- Use valid authentication tokens

## Troubleshooting

### Common Issues

1. **Authentication Error**: Check your bearer token
2. **Server Connection**: Verify the `API_BASE_URL` and server status
3. **File Not Found**: Check file paths and permissions
4. **Conversion Failed**: Some formats have limitations (see above)

### Getting Help

- Check the main documentation: [`README-document-to-markdown-endpoints.md`](../README-document-to-markdown-endpoints.md)
- Review server logs for detailed error messages
- Ensure pandoc is installed on the server

## Contributing

When adding new examples:

1. Follow the established pattern
2. Include comprehensive error handling
3. Add format-specific limitations and tips
4. Update this README with the new example
5. Export the main conversion function for module use
