profullstack
diff --git a/README-document-to-markdown-endpoints.md b/README-document-to-markdown-endpoints.md
Lines changed: 203 additions & 0 deletions
diff --git a/examples/README.md b/examples/README.md
Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,203 @@
+# Document to Markdown Endpoints
+
+This document describes the implemented endpoints for converting various document formats to Markdown.
+
+## Endpoints
+
+### PDF to Markdown
+- **URL**: `/api/1/pdf-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### DOCX to Markdown
+- **URL**: `/api/1/docx-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### DOC to Markdown
+- **URL**: `/api/1/doc-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### EPUB to Markdown
+- **URL**: `/api/1/epub-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### Text to Markdown
+- **URL**: `/api/1/text-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### PPTX to Markdown
+- **URL**: `/api/1/pptx-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### XLSX to Markdown
+- **URL**: `/api/1/xlsx-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
+### HTML to Markdown (Existing)
+- **URL**: `/api/1/html-to-markdown`
+- **Method**: `POST`
+- **Authentication**: Required (Bearer token)
+
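Every path above follows the same `/api/1/<format>-to-markdown` pattern, so a client can choose the endpoint from the file extension. The sketch below shows one way to do that. The helper name `endpointForFilename` is hypothetical, and routing `.html` files to the existing HTML endpoint is an assumption.

```javascript
// Hypothetical lookup table: file extension -> endpoint path listed above.
const ENDPOINT_BY_EXTENSION = new Map([
  ['pdf', '/api/1/pdf-to-markdown'],
  ['docx', '/api/1/docx-to-markdown'],
  ['doc', '/api/1/doc-to-markdown'],
  ['epub', '/api/1/epub-to-markdown'],
  ['txt', '/api/1/text-to-markdown'],
  ['pptx', '/api/1/pptx-to-markdown'],
  ['xlsx', '/api/1/xlsx-to-markdown'],
  ['html', '/api/1/html-to-markdown'] // assumption: .html goes to the existing endpoint
]);

function endpointForFilename(filename) {
  // Use the text after the last dot, compared case-insensitively
  const dot = filename.lastIndexOf('.');
  const extension = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  const endpoint = ENDPOINT_BY_EXTENSION.get(extension);
  if (!endpoint) {
    throw new Error(`No conversion endpoint for "${filename}"`);
  }
  return endpoint;
}

console.log(endpointForFilename('report.PDF')); // /api/1/pdf-to-markdown
```
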
+## Request Format
+
+All endpoints expect a JSON payload with the following structure. The `filename` should carry the extension of the format being converted (`.pdf`, `.docx`, `.doc`, `.epub`, `.txt`, `.pptx`, or `.xlsx`), and `store` is optional and defaults to `false`:
+
+```json
+{
+  "file": "base64-encoded-file-content",
+  "filename": "original-filename.pdf",
+  "store": false
+}
+```
+
+### Parameters
+
+- **file** (required): Base64-encoded content of the document file
+- **filename** (required): Original filename with appropriate extension (.pdf, .docx, .doc, .epub, .txt, .pptx, .xlsx)
+- **store** (optional): Boolean flag to store the converted document in Supabase storage
+
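The payload described above can be assembled from raw file bytes in Node.js roughly as follows. This is a sketch: `buildConversionPayload` is a hypothetical helper name, and it assumes the file has already been read into a `Buffer` or `Uint8Array`.

```javascript
// Hypothetical helper: builds the JSON body the endpoints expect.
function buildConversionPayload(fileBytes, filename, store = false) {
  return {
    // Buffer.from accepts a Buffer or Uint8Array; toString('base64') encodes it
    file: Buffer.from(fileBytes).toString('base64'),
    filename,
    store
  };
}

// Usage: the five bytes of "Hello" encode to "SGVsbG8="
const examplePayload = buildConversionPayload(Buffer.from('Hello'), 'notes.txt');
console.log(JSON.stringify(examplePayload));
// {"file":"SGVsbG8=","filename":"notes.txt","store":false}
```
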
+## Response Format
+
+### Success Response
+- **Content-Type**: `text/markdown; charset=utf-8`
+- **Content-Disposition**: `attachment; filename="converted-filename.md"`
+- **Body**: The converted Markdown content as plain text
+
+### Error Response
+```json
+{
+  "error": "Error message",
+  "details": "Additional error details"
+}
+```
+
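A client has to treat the two response shapes differently: Markdown text on success, a JSON object on failure. The sketch below assumes that failures come back with a non-2xx status code, which the description above does not state explicitly. `readConversionResponse` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the Markdown body, or throws using the JSON error fields.
async function readConversionResponse(response) {
  if (response.ok) {
    return response.text();
  }
  // Assumption: error responses carry the { error, details } JSON shown above
  let message = `Request failed with status ${response.status}`;
  try {
    const body = await response.json();
    if (body.error) {
      message = body.details ? `${body.error}: ${body.details}` : body.error;
    }
  } catch {
    // Body was not the expected JSON; keep the generic status message
  }
  throw new Error(message);
}
```

The function takes the `Response` object returned by `fetch`, so it can be dropped in after any of the endpoint calls.
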
+## Implementation Details
+
+### Services
+- **pdf-to-markdown-service.js**: Handles PDF to Markdown conversion using pandoc
+- **docx-to-markdown-service.js**: Handles DOCX to Markdown conversion using pandoc
+- **doc-to-markdown-service.js**: Handles DOC to Markdown conversion using pandoc
+- **epub-to-markdown-service.js**: Handles EPUB to Markdown conversion using pandoc
+- **text-to-markdown-service.js**: Handles text to Markdown conversion using pandoc
+- **pptx-to-markdown-service.js**: Handles PPTX to Markdown conversion using pandoc
+- **xlsx-to-markdown-service.js**: Handles XLSX to Markdown conversion using pandoc
+
+### Routes
+- **pdf-to-markdown.js**: Route handler for PDF conversion endpoint
+- **docx-to-markdown.js**: Route handler for DOCX conversion endpoint
+- **doc-to-markdown.js**: Route handler for DOC conversion endpoint
+- **epub-to-markdown.js**: Route handler for EPUB conversion endpoint
+- **text-to-markdown.js**: Route handler for text conversion endpoint
+- **pptx-to-markdown.js**: Route handler for PPTX conversion endpoint
+- **xlsx-to-markdown.js**: Route handler for XLSX conversion endpoint
+- **html-to-markdown.js**: Route handler for HTML conversion endpoint (existing)
+
+### Validation
+- **pdfFileContent**: Validates PDF file upload requests
+- **docxFileContent**: Validates DOCX file upload requests
+- **docFileContent**: Validates DOC file upload requests
+- **epubFileContent**: Validates EPUB file upload requests
+- **textFileContent**: Validates text file upload requests
+- **pptxFileContent**: Validates PPTX file upload requests
+- **xlsxFileContent**: Validates XLSX file upload requests
+
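The validators themselves are not shown here. As a rough illustration of what such a check might involve, the sketch below verifies that `file` is a non-empty, padded base64 string, that `filename` carries the expected extension, and that `store`, if present, is a boolean. The function `validateFileContent` and its rules are assumptions for illustration, not the project's actual validation code.

```javascript
// Illustrative only: not the project's validators.
const BASE64_PATTERN = /^[A-Za-z0-9+/]+={0,2}$/;

function validateFileContent(body, expectedExtension) {
  const errors = [];
  if (typeof body.file !== 'string' || body.file.length === 0) {
    errors.push('file is required');
  } else if (body.file.length % 4 !== 0 || !BASE64_PATTERN.test(body.file)) {
    errors.push('file must be base64-encoded');
  }
  if (typeof body.filename !== 'string' ||
      !body.filename.toLowerCase().endsWith(expectedExtension)) {
    errors.push(`filename must end with ${expectedExtension}`);
  }
  if (body.store !== undefined && typeof body.store !== 'boolean') {
    errors.push('store must be a boolean');
  }
  return errors; // an empty array means the request passed every check
}

console.log(validateFileContent({ file: 'SGVsbG8=', filename: 'a.pdf' }, '.pdf')); // []
```
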
+## Usage Examples
+
+### Converting a PDF file
+
+```javascript
+const pdfFile = '...'; // base64-encoded PDF content
+const response = await fetch('/api/1/pdf-to-markdown', {
+  method: 'POST',
+  headers: {
+    'Content-Type': 'application/json',
+    'Authorization': 'Bearer YOUR_TOKEN'
+  },
+  body: JSON.stringify({
+    file: pdfFile,
+    filename: 'document.pdf',
+    store: false
+  })
+});
+
+const markdownContent = await response.text();
+```
+
+### Converting a DOCX file
+
+```javascript
+const docxFile = '...'; // base64-encoded DOCX content
+const response = await fetch('/api/1/docx-to-markdown', {
+  method: 'POST',
+  headers: {
+    'Content-Type': 'application/json',
+    'Authorization': 'Bearer YOUR_TOKEN'
+  },
+  body: JSON.stringify({
+    file: docxFile,
+    filename: 'document.docx',
+    store: true // Store in Supabase
+  })
+});
+
+const markdownContent = await response.text();
+```
+
+## Testing
+
+A test/example script is available at [`examples/test-document-to-markdown-endpoints.js`](examples/test-document-to-markdown-endpoints.js) that demonstrates how to use the conversion endpoints.
+
+To run the example:
+```bash
+node examples/test-document-to-markdown-endpoints.js
+```
+
+Note: You'll need to update the script with valid authentication tokens and actual file content for testing.
+
+## Dependencies
+
+- **pandoc**: Used for file conversion (must be installed on the server)
+- **fs**: Node.js built-in module, used for temporary file handling
+- **uuid**: npm package, used to generate unique temporary file names
+- **os**: Node.js built-in module, used to locate the temporary directory
+
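How the services combine these pieces is not shown here. One plausible arrangement is to write the decoded upload to a uniquely named temporary file and hand it to pandoc. The sketch below only builds the file names and the argument list. `-f`, `-t`, and `-o` are pandoc's flags for input format, output format, and output file; the function names, the `docx` example, and the naming scheme are assumptions.

```javascript
// Illustrative only: tempDir would typically come from os.tmpdir() and id from uuid's v4().
function tempPaths(tempDir, id, inputExtension) {
  return {
    inputPath: `${tempDir}/${id}${inputExtension}`,
    outputPath: `${tempDir}/${id}.md`
  };
}

// Builds the argument list for: pandoc -f <format> -t markdown -o <output> <input>
function buildPandocArgs(inputFormat, inputPath, outputPath) {
  return ['-f', inputFormat, '-t', 'markdown', '-o', outputPath, inputPath];
}

const examplePaths = tempPaths('/tmp', 'example-id', '.docx');
const exampleArgs = buildPandocArgs('docx', examplePaths.inputPath, examplePaths.outputPath);
console.log(['pandoc', ...exampleArgs].join(' '));
// pandoc -f docx -t markdown -o /tmp/example-id.md /tmp/example-id.docx
```

Passing the arguments as an array to `child_process.execFile`, rather than building a shell command string, avoids quoting problems with unusual file names.
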
+## Format Support Limitations
+
+### PowerPoint (PPTX) Support
+- Pandoc can extract text content from PPTX files but layout information is lost
+- Slide structure may be preserved as headings in the markdown output
+- Images, animations, and complex formatting are not converted
+
+### Excel (XLSX) Support
+- **Limited Support**: Pandoc's XLSX support is experimental and may not work reliably
+- Simple spreadsheets with text content may convert successfully
+- Complex formulas, charts, and formatting are not supported
+- Consider converting Excel files to CSV format for better results
+
+### Recommended Alternatives
+- For Excel files: Convert to CSV format first for more reliable conversion
+- For PowerPoint files: Export as PDF first if layout preservation is important
+
+## Error Handling
+
+The endpoints handle various error scenarios:
+- Invalid file format
+- Missing required parameters
+- Pandoc conversion failures
+- File system errors
+- Storage service errors (when store=true)
+
+Every error is logged with context, and the client receives a descriptive error message.
+
+## Security Considerations
+
+- All endpoints require authentication
+- Temporary files are automatically cleaned up after processing
+- File content is validated before processing
+- Error messages don't expose sensitive system information
@@ -0,0 +1,176 @@
+# Document-to-Markdown Examples
+
+This directory contains example scripts demonstrating how to use each of the document-to-markdown conversion endpoints.
+
+## Available Examples
+
+### Individual Format Examples
+
+1. **[`pdf-to-markdown-example.js`](pdf-to-markdown-example.js)** - Convert PDF files to Markdown
+2. **[`docx-to-markdown-example.js`](docx-to-markdown-example.js)** - Convert DOCX files to Markdown
+3. **[`doc-to-markdown-example.js`](doc-to-markdown-example.js)** - Convert legacy DOC files to Markdown
+4. **[`epub-to-markdown-example.js`](epub-to-markdown-example.js)** - Convert EPUB e-books to Markdown
+5. **[`text-to-markdown-example.js`](text-to-markdown-example.js)** - Convert plain text files to Markdown
+6. **[`pptx-to-markdown-example.js`](pptx-to-markdown-example.js)** - Convert PowerPoint presentations to Markdown ⚠️
+7. **[`xlsx-to-markdown-example.js`](xlsx-to-markdown-example.js)** - Convert Excel spreadsheets to Markdown ⚠️
+
+### Comprehensive Example
+
+- **[`test-document-to-markdown-endpoints.js`](test-document-to-markdown-endpoints.js)** - Test all endpoints in one script
+
+## Quick Start
+
+### Prerequisites
+
+1. **Authentication Token**: Replace `YOUR_AUTH_TOKEN` in any example with a valid bearer token
+2. **Server Running**: Ensure the convert2doc server is running (usually on port 3000)
+3. **Sample Files**: Place sample files in the examples directory or update the file paths in the scripts
+
+### Running an Example
+
+```bash
+# Run a specific format example
+node examples/pdf-to-markdown-example.js
+
+# Run the comprehensive test
+node examples/test-document-to-markdown-endpoints.js
+```
+
+### Using as Modules
+
+All examples export their main conversion functions and can be imported:
+
+```javascript
+import { convertPdfToMarkdown } from './examples/pdf-to-markdown-example.js';
+
+const markdownContent = await convertPdfToMarkdown(pdfBase64, 'document.pdf');
+```
+
+## Format Support & Limitations
+
+### ✅ Fully Supported Formats
+
+- **PDF** - Reliable text extraction, though layout may be lost
+- **DOCX** - Excellent support, formatting preserved
+- **DOC** - Good support for legacy Word documents
+- **EPUB** - Great for e-books, chapter structure preserved
+- **Text** - Perfect conversion, paragraph structure maintained
+
+### ⚠️ Limited Support Formats
+
+- **PPTX** - Text-only extraction, no layout/images
+- **XLSX** - Experimental and often fails; use CSV instead
+
+## Example File Structure
+
+Each individual example follows this pattern:
+
+```javascript
+// Sample base64 content
+const sampleFileBase64 = '...';
+
+// Main conversion function
+async function convertFileToMarkdown(fileBase64, filename, store = false) {
+  // API call implementation
+}
+
+// File reading example
+async function exampleWithFile() {
+  // Read file and convert
+}
+
+// Main runner
+async function runExample() {
+  // Setup instructions and execution
+}
+
+// Export for module use
+export { convertFileToMarkdown };
+```
+
+## Configuration
+
+### API Base URL
+
+Update the `API_BASE_URL` constant in each example:
+
+```javascript
+const API_BASE_URL = 'http://localhost:3000'; // Change as needed
+```
+
+### Authentication
+
+Replace the placeholder token in each example:
+
+```javascript
+'Authorization': 'Bearer YOUR_ACTUAL_TOKEN_HERE'
+```
+
+### File Paths
+
+Update file paths to point to your actual documents:
+
+```javascript
+const filePath = './your-actual-document.pdf';
+```
+
+## Error Handling
+
+All examples include comprehensive error handling:
+
+- Network errors
+- Authentication failures
+- Conversion failures
+- File system errors
+
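These failure categories can be told apart in client code. The sketch below is one way to do it, not the examples' actual implementation. It assumes the server answers authentication failures with 401 or 403 and other failures with a different non-2xx status, that `fetch` rejects with a `TypeError` when the request cannot reach the server, and that file system errors carry a `code` such as `ENOENT`, as Node.js `fs` errors do.

```javascript
// Hypothetical helper: classifies a failure into the categories listed above.
function classifyFailure({ error, status }) {
  if (error && typeof error.code === 'string' && error.code.startsWith('E')) {
    return 'file-system'; // e.g. ENOENT or EACCES from fs calls
  }
  if (error instanceof TypeError) {
    return 'network'; // fetch rejects with TypeError when the request itself fails
  }
  if (status === 401 || status === 403) {
    return 'authentication';
  }
  if (typeof status === 'number' && (status < 200 || status >= 300)) {
    return 'conversion'; // assumption: any other non-2xx status is a conversion failure
  }
  return 'unknown';
}

console.log(classifyFailure({ status: 401 })); // authentication
```
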
+## Output
+
+Each example will:
+
+1. Display conversion progress
+2. Save the converted Markdown to a file
+3. Show a preview of the converted content
+4. Provide helpful error messages if conversion fails
+
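Steps 2 and 3 might look roughly like the following. The helper names, the `.md` naming rule, and the ten-line preview length are illustrative choices rather than what the example scripts necessarily do.

```javascript
// Hypothetical helper: "report.pdf" -> "report.md"
function toMarkdownFilename(filename) {
  const dot = filename.lastIndexOf('.');
  // A leading dot (dot === 0) or no dot at all means there is no extension to strip
  const base = dot > 0 ? filename.slice(0, dot) : filename;
  return `${base}.md`;
}

// Hypothetical helper: returns the first maxLines lines of the converted content
function previewMarkdown(markdown, maxLines = 10) {
  const lines = markdown.split('\n');
  const preview = lines.slice(0, maxLines).join('\n');
  return lines.length > maxLines ? `${preview}\n...` : preview;
}

// Writes the Markdown to the current directory and prints a short preview
async function saveAndPreview(markdown, sourceFilename) {
  const { writeFile } = await import('node:fs/promises');
  const outputName = toMarkdownFilename(sourceFilename);
  await writeFile(outputName, markdown, 'utf8');
  console.log(`Saved ${outputName}`);
  console.log(previewMarkdown(markdown));
  return outputName;
}
```
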
+## Tips for Success
+
+### For PowerPoint (PPTX)
+- Only text content will be extracted
+- Consider converting to PDF first for layout preservation
+- Slide structure may become headings
+
+### For Excel (XLSX)
+- Conversion often fails - this is expected
+- Convert to CSV format first for better results
+- Only simple spreadsheets may work
+
+### For All Formats
+- Ensure files are not corrupted
+- Check file permissions
+- Verify the server is running and accessible
+- Use valid authentication tokens
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Authentication Error**: Check your bearer token
+2. **Server Connection**: Verify the API_BASE_URL and server status
+3. **File Not Found**: Check file paths and permissions
+4. **Conversion Failed**: Some formats have limitations (see above)
+
+### Getting Help
+
+- Check the main documentation: [`README-document-to-markdown-endpoints.md`](../README-document-to-markdown-endpoints.md)
+- Review server logs for detailed error messages
+- Ensure pandoc is installed on the server
+
+## Contributing
+
+When adding new examples:
+
+1. Follow the established pattern
+2. Include comprehensive error handling
+3. Add format-specific limitations and tips
+4. Update this README with the new example
+5. Export the main conversion function for module use