|
| 1 | +# Document to Markdown Endpoints |
| 2 | + |
| 3 | +This document describes the implemented endpoints for converting various document formats to Markdown. |
| 4 | + |
| 5 | +## Endpoints |
| 6 | + |
| 7 | +### PDF to Markdown |
| 8 | +- **URL**: `/api/1/pdf-to-markdown` |
| 9 | +- **Method**: `POST` |
| 10 | +- **Authentication**: Required (Bearer token) |
| 11 | + |
| 12 | +### DOCX to Markdown |
| 13 | +- **URL**: `/api/1/docx-to-markdown` |
| 14 | +- **Method**: `POST` |
| 15 | +- **Authentication**: Required (Bearer token) |
| 16 | + |
| 17 | +### DOC to Markdown |
| 18 | +- **URL**: `/api/1/doc-to-markdown` |
| 19 | +- **Method**: `POST` |
| 20 | +- **Authentication**: Required (Bearer token) |
| 21 | + |
| 22 | +### EPUB to Markdown |
| 23 | +- **URL**: `/api/1/epub-to-markdown` |
| 24 | +- **Method**: `POST` |
| 25 | +- **Authentication**: Required (Bearer token) |
| 26 | + |
| 27 | +### Text to Markdown |
| 28 | +- **URL**: `/api/1/text-to-markdown` |
| 29 | +- **Method**: `POST` |
| 30 | +- **Authentication**: Required (Bearer token) |
| 31 | + |
| 32 | +### PPTX to Markdown |
| 33 | +- **URL**: `/api/1/pptx-to-markdown` |
| 34 | +- **Method**: `POST` |
| 35 | +- **Authentication**: Required (Bearer token) |
| 36 | + |
| 37 | +### XLSX to Markdown |
| 38 | +- **URL**: `/api/1/xlsx-to-markdown` |
| 39 | +- **Method**: `POST` |
| 40 | +- **Authentication**: Required (Bearer token) |
| 41 | + |
| 42 | +### HTML to Markdown (Existing) |
| 43 | +- **URL**: `/api/1/html-to-markdown` |
| 44 | +- **Method**: `POST` |
| 45 | +- **Authentication**: Required (Bearer token) |
| 46 | + |
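As an illustration of how a client might choose among the document endpoints listed above, here is a minimal sketch that maps a file extension to its endpoint path. The paths come from the list above; the `ENDPOINTS` table and the `endpointForFilename` helper are hypothetical names introduced for this example. The HTML endpoint is left out because the request format below does not list an `.html` extension.

```javascript
// Endpoint paths from the list above, keyed by file extension.
// ENDPOINTS and endpointForFilename are hypothetical names, not part of the API.
const ENDPOINTS = {
  '.pdf': '/api/1/pdf-to-markdown',
  '.docx': '/api/1/docx-to-markdown',
  '.doc': '/api/1/doc-to-markdown',
  '.epub': '/api/1/epub-to-markdown',
  '.txt': '/api/1/text-to-markdown',
  '.pptx': '/api/1/pptx-to-markdown',
  '.xlsx': '/api/1/xlsx-to-markdown',
};

function endpointForFilename(filename) {
  // Take everything from the last dot onward as the extension.
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot).toLowerCase();
  const endpoint = ENDPOINTS[ext];
  if (!endpoint) {
    throw new Error(`No conversion endpoint for "${filename}"`);
  }
  return endpoint;
}

console.log(endpointForFilename('Report.DOCX'));
```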
| 47 | +## Request Format |
| 48 | + |
| 49 | +All endpoints expect a JSON payload with the following structure: |
| 50 | + |
| 51 | +```json |
| 52 | +{ |
| 53 | + "file": "base64-encoded-file-content", |
| 54 | + "filename": "original-filename.pdf", |
| 55 | + "store": false |
| 56 | +} |
| 57 | +``` |
| 58 | + |
| 59 | +### Parameters |
| 60 | + |
| 61 | +- **file** (required): Base64-encoded content of the document file |
| 62 | +- **filename** (required): Original filename with the appropriate extension (.pdf, .docx, .doc, .epub, .txt, .pptx, .xlsx) |
| 63 | +- **store** (optional): Boolean flag to store the converted document in Supabase storage; defaults to `false` |
| 64 | + |
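The payload described above could be assembled along these lines. This is a minimal Node.js sketch, assuming the file bytes are already in a `Buffer`; `buildConversionPayload` is a hypothetical helper name and not part of the API.

```javascript
// Sketch: build the JSON body described above from raw file bytes.
// buildConversionPayload is a hypothetical helper, not part of the API.
function buildConversionPayload(fileBytes, filename, store = false) {
  if (!Buffer.isBuffer(fileBytes)) {
    throw new TypeError('fileBytes must be a Buffer');
  }
  if (typeof filename !== 'string' || filename.length === 0) {
    throw new TypeError('filename must be a non-empty string');
  }
  return {
    file: fileBytes.toString('base64'), // Base64-encode the raw bytes
    filename,                           // original name, including its extension
    store: Boolean(store),              // optional flag, false unless requested
  };
}

const payload = buildConversionPayload(Buffer.from('hello'), 'notes.txt');
console.log(JSON.stringify(payload));
// prints {"file":"aGVsbG8=","filename":"notes.txt","store":false}
```

The result can be passed to `JSON.stringify` and sent as the request body with a `Content-Type: application/json` header.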
| 65 | +## Response Format |
| 66 | + |
| 67 | +### Success Response |
| 68 | +- **Content-Type**: `text/markdown; charset=utf-8` |
| 69 | +- **Content-Disposition**: `attachment; filename="converted-filename.md"` |
| 70 | +- **Body**: The converted Markdown content as plain text |
| 71 | + |
| 72 | +### Error Response |
| 73 | +```json |
| 74 | +{ |
| 75 | + "error": "Error message", |
| 76 | + "details": "Additional error details" |
| 77 | +} |
| 78 | +``` |
| 79 | + |
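Because success responses are Markdown text and error responses are JSON, a client has to branch on the status before reading the body. The sketch below shows one way to do that, assuming (this is not stated above) that errors arrive with a non-2xx HTTP status. `interpretConversionResponse` is a hypothetical helper; it takes the already-read status and body text so that it stays synchronous.

```javascript
// Sketch: interpret a response from one of the conversion endpoints.
// Assumption: error responses use a non-2xx status code.
// interpretConversionResponse is a hypothetical helper name.
function interpretConversionResponse(status, bodyText) {
  if (status >= 200 && status < 300) {
    return { ok: true, markdown: bodyText };
  }
  // Error bodies are documented as JSON with `error` and `details`;
  // fall back gracefully if the body is not a JSON object.
  let parsed;
  try {
    parsed = JSON.parse(bodyText);
  } catch (_) {
    parsed = { error: 'Unexpected error response', details: bodyText };
  }
  if (parsed === null || typeof parsed !== 'object') {
    parsed = {};
  }
  return {
    ok: false,
    status,
    error: parsed.error ?? 'Unknown error',
    details: parsed.details ?? '',
  };
}
```

With `fetch`, this could be called as `interpretConversionResponse(response.status, await response.text())`.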
| 80 | +## Implementation Details |
| 81 | + |
| 82 | +### Services |
| 83 | +- **pdf-to-markdown-service.js**: Handles PDF to Markdown conversion using pandoc |
| 84 | +- **docx-to-markdown-service.js**: Handles DOCX to Markdown conversion using pandoc |
| 85 | +- **doc-to-markdown-service.js**: Handles DOC to Markdown conversion using pandoc |
| 86 | +- **epub-to-markdown-service.js**: Handles EPUB to Markdown conversion using pandoc |
| 87 | +- **text-to-markdown-service.js**: Handles text to Markdown conversion using pandoc |
| 88 | +- **pptx-to-markdown-service.js**: Handles PPTX to Markdown conversion using pandoc |
| 89 | +- **xlsx-to-markdown-service.js**: Handles XLSX to Markdown conversion using pandoc |
| 90 | + |
| 91 | +### Routes |
| 92 | +- **pdf-to-markdown.js**: Route handler for PDF conversion endpoint |
| 93 | +- **docx-to-markdown.js**: Route handler for DOCX conversion endpoint |
| 94 | +- **doc-to-markdown.js**: Route handler for DOC conversion endpoint |
| 95 | +- **epub-to-markdown.js**: Route handler for EPUB conversion endpoint |
| 96 | +- **text-to-markdown.js**: Route handler for text conversion endpoint |
| 97 | +- **pptx-to-markdown.js**: Route handler for PPTX conversion endpoint |
| 98 | +- **xlsx-to-markdown.js**: Route handler for XLSX conversion endpoint |
| 99 | +- **html-to-markdown.js**: Route handler for HTML conversion endpoint (existing) |
| 100 | + |
| 101 | +### Validation |
| 102 | +- **pdfFileContent**: Validates PDF file upload requests |
| 103 | +- **docxFileContent**: Validates DOCX file upload requests |
| 104 | +- **docFileContent**: Validates DOC file upload requests |
| 105 | +- **epubFileContent**: Validates EPUB file upload requests |
| 106 | +- **textFileContent**: Validates text file upload requests |
| 107 | +- **pptxFileContent**: Validates PPTX file upload requests |
| 108 | +- **xlsxFileContent**: Validates XLSX file upload requests |
| 109 | + |
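The validators named above are not shown in this document. As a rough illustration of what such a check might cover, the sketch below validates a request body against the documented parameters. `ALLOWED_EXTENSIONS` and `validateFileContent` are hypothetical names, and the Base64 check is a simple pattern test chosen for this example, not the project's actual rule.

```javascript
// Extensions from the Parameters section, keyed by a hypothetical format name.
const ALLOWED_EXTENSIONS = {
  pdf: '.pdf', docx: '.docx', doc: '.doc', epub: '.epub',
  text: '.txt', pptx: '.pptx', xlsx: '.xlsx',
};

// Hypothetical validator: returns a list of problems (empty when the body is valid).
function validateFileContent(format, body) {
  const expectedExt = ALLOWED_EXTENSIONS[format];
  if (!expectedExt) return [`unsupported format: ${format}`];
  if (body === null || typeof body !== 'object') {
    return ['request body must be a JSON object'];
  }

  const errors = [];
  if (typeof body.file !== 'string' || body.file.length === 0) {
    errors.push('file is required and must be a Base64 string');
  } else if (!/^[A-Za-z0-9+\/]+={0,2}$/.test(body.file) || body.file.length % 4 !== 0) {
    errors.push('file is not valid Base64');
  }
  if (typeof body.filename !== 'string' || body.filename.length === 0) {
    errors.push('filename is required');
  } else if (!body.filename.toLowerCase().endsWith(expectedExt)) {
    errors.push(`filename must end with ${expectedExt}`);
  }
  if (body.store !== undefined && typeof body.store !== 'boolean') {
    errors.push('store must be a boolean when provided');
  }
  return errors;
}
```

Returning a list rather than throwing on the first problem lets a route report every issue in a single error response.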
| 110 | +## Usage Examples |
| 111 | + |
| 112 | +### Converting a PDF file |
| 113 | + |
| 114 | +```javascript |
| 115 | +const pdfFile = '<base64-encoded PDF content>'; // placeholder: the file content as a Base64 string |
| 116 | +const response = await fetch('/api/1/pdf-to-markdown', { |
| 117 | + method: 'POST', |
| 118 | + headers: { |
| 119 | + 'Content-Type': 'application/json', |
| 120 | + 'Authorization': 'Bearer YOUR_TOKEN' |
| 121 | + }, |
| 122 | + body: JSON.stringify({ |
| 123 | + file: pdfFile, |
| 124 | + filename: 'document.pdf', |
| 125 | + store: false |
| 126 | + }) |
| 127 | +}); |
| 128 | + |
| 129 | +const markdownContent = await response.text(); |
| 130 | +``` |
| 131 | + |
| 132 | +### Converting a DOCX file |
| 133 | + |
| 134 | +```javascript |
| 135 | +const docxFile = '<base64-encoded DOCX content>'; // placeholder: the file content as a Base64 string |
| 136 | +const response = await fetch('/api/1/docx-to-markdown', { |
| 137 | + method: 'POST', |
| 138 | + headers: { |
| 139 | + 'Content-Type': 'application/json', |
| 140 | + 'Authorization': 'Bearer YOUR_TOKEN' |
| 141 | + }, |
| 142 | + body: JSON.stringify({ |
| 143 | + file: docxFile, |
| 144 | + filename: 'document.docx', |
| 145 | + store: true // Store in Supabase |
| 146 | + }) |
| 147 | +}); |
| 148 | + |
| 149 | +const markdownContent = await response.text(); |
| 150 | +``` |
| 151 | + |
| 152 | +## Testing |
| 153 | + |
| 154 | +A test/example script is available at [`examples/test-document-to-markdown-endpoints.js`](examples/test-document-to-markdown-endpoints.js) that demonstrates how to call these endpoints. |
| 155 | + |
| 156 | +To run the example: |
| 157 | +```bash |
| 158 | +node examples/test-document-to-markdown-endpoints.js |
| 159 | +``` |
| 160 | + |
| 161 | +Note: You'll need to update the script with valid authentication tokens and actual file content for testing. |
| 162 | + |
| 163 | +## Dependencies |
| 164 | + |
| 165 | +- **pandoc**: Used for file conversion (must be installed on the server) |
| 166 | +- **fs**: Node.js built-in module, used for temporary file handling |
| 167 | +- **uuid**: npm package, used to generate unique temporary file names |
| 168 | +- **os**: Node.js built-in module, used to locate the temporary directory |
| 169 | + |
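The dependencies above suggest a write, convert, clean-up cycle around a temporary file. The sketch below illustrates that pattern; it is not the services' actual code. `withTempFile` and `pandocArgs` are hypothetical helpers, and `crypto.randomUUID` stands in for the `uuid` package so the example needs no installation.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { randomUUID } = require('crypto'); // stand-in for the uuid package

// Hypothetical helper: write Base64 content to a uniquely named temp file,
// run convert(tempPath), and always delete the file afterwards.
function withTempFile(base64Content, extension, convert) {
  const tempPath = path.join(os.tmpdir(), `${randomUUID()}${extension}`);
  fs.writeFileSync(tempPath, Buffer.from(base64Content, 'base64'));
  try {
    return convert(tempPath);
  } finally {
    fs.rmSync(tempPath, { force: true }); // clean up even if convert throws
  }
}

// Hypothetical argument list for a pandoc call: -f, -t and -o select the
// input format, the output format and the output file.
function pandocArgs(inputPath, inputFormat, outputPath) {
  return ['-f', inputFormat, '-t', 'markdown', '-o', outputPath, inputPath];
}
```

In a real service the callback would launch pandoc, for example with `child_process.execFileSync('pandoc', pandocArgs(...))`. Passing arguments as an array rather than a shell string avoids quoting problems with user-supplied filenames.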
| 170 | +## Format Support Limitations |
| 171 | + |
| 172 | +### PowerPoint (PPTX) Support |
| 173 | +- Pandoc can extract text content from PPTX files, but layout information is lost |
| 174 | +- Slide structure may be preserved as headings in the markdown output |
| 175 | +- Images, animations, and complex formatting are not converted |
| 176 | + |
| 177 | +### Excel (XLSX) Support |
| 178 | +- **Limited Support**: Pandoc's XLSX support is experimental and may not work reliably |
| 179 | +- Simple spreadsheets with text content may convert successfully |
| 180 | +- Complex formulas, charts, and formatting are not supported |
| 181 | +- Consider converting Excel files to CSV format for better results |
| 182 | + |
| 183 | +### Recommended Alternatives |
| 184 | +- For Excel files: Convert to CSV format first for more reliable conversion |
| 185 | +- For PowerPoint files: Export as PDF first if layout preservation is important |
| 186 | + |
| 187 | +## Error Handling |
| 188 | + |
| 189 | +The endpoints handle various error scenarios: |
| 190 | +- Invalid file format |
| 191 | +- Missing required parameters |
| 192 | +- Pandoc conversion failures |
| 193 | +- File system errors |
| 194 | +- Storage service errors (when store=true) |
| 195 | + |
| 196 | +All errors are logged with context, and the client receives a descriptive error message. |
| 197 | + |
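One way to turn the scenarios listed above into the documented `{ error, details }` body is sketched here. The error kinds, the `toErrorResponse` helper, and the HTTP status codes are all assumptions made for illustration; this document does not specify which status codes the endpoints return.

```javascript
// Hypothetical error kinds mirroring the scenarios listed above.
// The HTTP status codes are assumptions, not documented behaviour.
const ERROR_STATUS = {
  INVALID_FORMAT: 400,
  MISSING_PARAMETER: 400,
  CONVERSION_FAILED: 500,
  FILE_SYSTEM: 500,
  STORAGE: 502,
};

function toErrorResponse(kind, message, details = '') {
  const status = ERROR_STATUS[kind] ?? 500;
  // Server-side failures get a generic details string so that internal
  // paths and stack traces are never sent to the client.
  const safeDetails = status >= 500 ? 'See server logs for details' : details;
  return { status, body: { error: message, details: safeDetails } };
}
```

Keeping the raw details only for client-side mistakes (4xx) is a simple way to return useful messages without exposing system information.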
| 198 | +## Security Considerations |
| 199 | + |
| 200 | +- All endpoints require authentication |
| 201 | +- Temporary files are automatically cleaned up after processing |
| 202 | +- File content is validated before processing |
| 203 | +- Error messages don't expose sensitive system information |