Multi-Content-Type Support

Last Updated: 2025-11-30

The Harmony proxy supports automatic parsing of multiple content types beyond JSON, enabling seamless data transformation across different formats in healthcare and data integration pipelines.

Overview

The HTTP adapter automatically detects and parses incoming request bodies based on the Content-Type header, converting them to a normalized JSON structure for pipeline processing. This enables middleware and backends to work with a consistent data model regardless of the original format.

Supported Content Types

1. JSON (Default)

Content-Type Headers:

  • application/json
  • application/fhir+json (FHIR resources)
  • application/dicom+json (DICOM JSON)

Example Request:

curl -X POST http://localhost:8080/api/data \
  -H "Content-Type: application/json" \
  -d '{"name": "Alice", "age": 30}'

Normalized Structure:

{
  "name": "Alice",
  "age": 30
}

Notes:

  • Default format when Content-Type is missing or unrecognized
  • Direct pass-through to normalized_data
  • Validates JSON syntax

2. XML

Content-Type Headers:

  • application/xml
  • text/xml
  • application/soap+xml

Example Request:

curl -X POST http://localhost:8080/api/data \
  -H "Content-Type: application/xml" \
  -d '<person><name>Bob</name><age>25</age></person>'

Normalized Structure:

{
  "person": {
    "name": "Bob",
    "age": "25"
  }
}

XML Features:

  • Text-only elements: Converted to simple string values
  • Attributes: Prefixed with @ (e.g., "@id": "123")
  • Nested elements: Preserved as nested objects
  • Multiple elements with same name: Converted to arrays
  • Mixed content: Text stored in #text field when attributes present

Example with Attributes:

<person id="123" type="customer">
  <name>Charlie</name>
</person>

Becomes:

{
  "person": {
    "@id": "123",
    "@type": "customer",
    "name": "Charlie"
  }
}

Security: XXE (XML External Entity) attacks are prevented - quick-xml does not support external entities by default.
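The conversion rules above (attributes prefixed with `@`, text-only elements as strings, repeated names as arrays, mixed content in `#text`) can be sketched in Python. This is illustrative only; Harmony's actual implementation uses quick-xml in Rust, and `xml_to_json`/`normalize` are hypothetical names:

```python
import xml.etree.ElementTree as ET

def xml_to_json(element):
    # Attributes become "@"-prefixed keys.
    node = {"@" + k: v for k, v in element.attrib.items()}
    text = (element.text or "").strip()
    children = list(element)
    if not children and not node:
        return text  # text-only element -> simple string value
    if text:
        node["#text"] = text  # mixed content -> #text field
    for child in children:
        value = xml_to_json(child)
        if child.tag in node:  # repeated element names -> arrays
            existing = node[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                node[child.tag] = [existing, value]
        else:
            node[child.tag] = value
    return node

def normalize(xml_text):
    root = ET.fromstring(xml_text)
    return {root.tag: xml_to_json(root)}
```

Running `normalize` on the examples above reproduces the documented normalized structures.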

3. CSV

Content-Type Header:

  • text/csv

Example Request:

curl -X POST http://localhost:8080/api/data \
  -H "Content-Type: text/csv" \
  -d 'name,age,city
Alice,30,NYC
Bob,25,LA'

Normalized Structure:

{
  "rows": [
    {"name": "Alice", "age": "30", "city": "NYC"},
    {"name": "Bob", "age": "25", "city": "LA"}
  ],
  "row_count": 2
}

CSV Features:

  • First row treated as header
  • All values parsed as strings
  • Empty fields supported
  • Handles quoted fields with commas

Security: Formula injection prevention - fields starting with =, +, -, or @ are automatically prefixed with a single quote (') to prevent execution in spreadsheet applications.

Example:

name,formula
Alice,=SUM(A1:A10)

Becomes:

{
  "rows": [
    {"name": "Alice", "formula": "'=SUM(A1:A10)"}
  ]
}
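The header-row parsing and formula-injection sanitization can be sketched in Python (an illustrative stand-in for the Rust implementation; `normalize_csv` is a hypothetical name):

```python
import csv
import io

def normalize_csv(body: str) -> dict:
    def sanitize(value: str) -> str:
        # Fields that would execute as formulas in spreadsheet apps
        # get a leading single quote, as described above.
        return "'" + value if value.startswith(("=", "+", "-", "@")) else value

    # First row is treated as the header; all values stay strings.
    reader = csv.DictReader(io.StringIO(body))
    rows = [{k: sanitize(v) for k, v in row.items()} for row in reader]
    return {"rows": rows, "row_count": len(rows)}
```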

4. Form URL-Encoded

Content-Type Header:

  • application/x-www-form-urlencoded

Example Request:

curl -X POST http://localhost:8080/api/data \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'name=Alice&age=30&city=NYC'

Normalized Structure:

{
  "name": "Alice",
  "age": "30",
  "city": "NYC"
}

Array Support: Use [] notation for arrays:

curl -X POST http://localhost:8080/api/data \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'name=Alice&interests[]=coding&interests[]=music'

Becomes:

{
  "name": "Alice",
  "interests": ["coding", "music"]
}
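The `[]` array convention can be sketched in Python: repeated `key[]` pairs are collected into a list under `key`, while plain keys stay scalar strings (illustrative only; `normalize_form` is a hypothetical name):

```python
from urllib.parse import parse_qsl

def normalize_form(body: str) -> dict:
    out = {}
    for key, value in parse_qsl(body, keep_blank_values=True):
        if key.endswith("[]"):
            # "interests[]" -> append to the "interests" array
            out.setdefault(key[:-2], []).append(value)
        else:
            out[key] = value
    return out
```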

5. Multipart Form Data

Content-Type Header:

  • multipart/form-data; boundary=<boundary>

Example Request:

curl -X POST http://localhost:8080/api/upload \
  -F "name=Alice" \
  -F "age=30" \
  -F "file=@document.pdf"

Normalized Structure:

{
  "fields": {
    "name": "Alice",
    "age": "30"
  },
  "files": [
    {
      "name": "file",
      "filename": "document.pdf",
      "content_type": "application/pdf",
      "size": 12345,
      "checksum": "a1b2c3d4..."
    }
  ]
}

File Handling:

  • Files are NOT saved to disk automatically
  • File metadata captured for pipeline processing
  • SHA256 checksum computed for integrity verification
  • Middleware/backends can access file data via envelope
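A rough sketch of the fields/files split and checksum capture, using Python's stdlib `email` parser in place of Harmony's Rust multipart handling (`normalize_multipart` is a hypothetical name):

```python
import hashlib
from email.parser import BytesParser
from email.policy import default

def normalize_multipart(body: bytes, content_type: str) -> dict:
    # Re-attach the Content-Type header so the stdlib parser can
    # split the body on the declared boundary.
    raw = b"Content-Type: " + content_type.encode() + b"\r\n\r\n" + body
    msg = BytesParser(policy=default).parsebytes(raw)
    fields, files = {}, []
    for part in msg.iter_parts():
        name = part.get_param("name", header="content-disposition")
        filename = part.get_filename()
        payload = part.get_payload(decode=True) or b""
        if filename:  # parts carrying a filename are treated as file uploads
            files.append({
                "name": name,
                "filename": filename,
                "content_type": part.get_content_type(),
                "size": len(payload),
                "checksum": hashlib.sha256(payload).hexdigest(),
            })
        else:
            fields[name] = payload.decode()
    return {"fields": fields, "files": files}
```

Note that, as in Harmony, only metadata and a checksum are captured; the file bytes are not written to disk.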

6. Binary Content

Content-Type Headers:

  • image/* (JPEG, PNG, GIF, etc.)
  • video/*
  • audio/*
  • application/pdf
  • application/zip
  • application/octet-stream

Example Request:

curl -X POST http://localhost:8080/api/upload \
  -H "Content-Type: image/jpeg" \
  --data-binary @photo.jpg

Normalized Structure:

{
  "format": "binary",
  "content_type": "image/jpeg",
  "size": 45678,
  "checksum": "abc123..."
}

Notes:

  • Binary data preserved in original_data field of envelope
  • Checksum allows integrity verification
  • Middleware can process binary data directly
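Building the binary metadata structure amounts to recording size and a SHA256 digest; a minimal Python sketch (`normalize_binary` is a hypothetical name):

```python
import hashlib

def normalize_binary(body: bytes, content_type: str) -> dict:
    # The raw bytes stay in original_data; only metadata is normalized.
    return {
        "format": "binary",
        "content_type": content_type,
        "size": len(body),
        "checksum": hashlib.sha256(body).hexdigest(),
    }
```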

Configuration

Content Limits

Configure size and complexity limits to prevent resource exhaustion:

[proxy.content_limits]
max_body_size = 10485760      # 10MB maximum request body
max_csv_rows = 10000           # Maximum CSV rows to parse
max_xml_depth = 100            # Maximum XML nesting depth
max_multipart_files = 10       # Maximum files in multipart upload
max_form_fields = 1000         # Maximum form fields

Defaults:

  • max_body_size: 10MB (10,485,760 bytes)
  • max_csv_rows: 10,000 rows
  • max_xml_depth: 100 levels
  • max_multipart_files: 10 files
  • max_form_fields: 1,000 fields

Example Configuration

[proxy]
id = "content-aware-proxy"
store_dir = "./data"

[proxy.content_limits]
max_body_size = 20971520  # 20MB for larger uploads
max_csv_rows = 50000      # Support larger CSV files
max_xml_depth = 50        # Limit XML complexity

[logging]
log_level = "info"

[network.default]
enable_wireguard = false
interface = "wg0"

[network.default.http]
bind_address = "0.0.0.0"
bind_port = 8080

[pipelines.api]
description = "Multi-format API pipeline"
networks = ["default"]
endpoints = ["api_endpoint"]
backends = ["processing_backend"]
middleware = []

[endpoints.api_endpoint]
service = "http"
[endpoints.api_endpoint.options]
path_prefix = "/api"

[backends.processing_backend]
service = "http"
[backends.processing_backend.options]
base_url = "http://backend-service:8080"

[services.http]
module = ""

Content Metadata

Each request includes content metadata in the envelope for tracking parsing status and format details:

pub struct ContentMetadata {
    pub content_type: String,      // Original Content-Type header
    pub charset: Option<String>,    // Character encoding if specified
    pub format: String,             // Detected format: json, xml, csv, etc.
    pub parse_status: ParseStatus,  // Success, Failed, NotAttempted, Unsupported
    pub original_size: usize,       // Size of original payload in bytes
    pub checksum: Option<String>,   // SHA256 checksum (for binary content)
}

Parse Status Values:

  • Success: Content parsed successfully
  • Failed: Parsing attempted but failed (malformed data)
  • NotAttempted: No parsing attempted (empty payload)
  • Unsupported: Content-Type not supported

Accessing in Middleware:

async fn process(envelope: RequestEnvelope<Vec<u8>>) -> Result<RequestEnvelope<Vec<u8>>, Error> {
    if let Some(metadata) = &envelope.request_details.content_metadata {
        tracing::info!(
            "Processing {} content ({}), parse_status: {:?}",
            metadata.format,
            metadata.content_type,
            metadata.parse_status
        );
    }
    Ok(envelope)
}

Security Considerations

XXE Prevention (XML)

Threat: XML External Entity (XXE) attacks allow attackers to read local files or perform SSRF attacks.

Mitigation: The quick-xml parser does not support external entities by default. External entity declarations are ignored.

Example Attack (Blocked):

<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>

This will parse as if the entity doesn't exist, preventing file disclosure.

Formula Injection Prevention (CSV)

Threat: CSV formula injection occurs when spreadsheet applications execute formulas in CSV cells.

Mitigation: Fields starting with =, +, -, or @ are automatically prefixed with a single quote.

Before:

name,command
Alice,=cmd|'/c calc'!A1

After Parsing:

{"name": "Alice", "command": "'=cmd|'/c calc'!A1"}

XML Bomb Prevention

Threat: Billion Laughs attack (XML bomb) causes exponential entity expansion.

Mitigation:

  • Maximum XML depth limit (default: 100)
  • No entity expansion support
  • Maximum body size limit (default: 10MB)

Multipart Security

Threats:

  • Path traversal via malicious filenames
  • Resource exhaustion via many small files
  • Memory exhaustion via large files

Mitigations:

  • Filename sanitization (automatic by multer)
  • Maximum file count limit (default: 10)
  • Maximum body size limit (default: 10MB)
  • Files not automatically written to disk

Size Limits

All content types respect the max_body_size limit. Additional per-format limits:

  • CSV: Row count limit prevents memory exhaustion
  • XML: Depth limit prevents stack overflow
  • Multipart: File count limit prevents descriptor exhaustion
  • Form: Field count limit prevents hash collision attacks

Error Handling

Malformed Content

When content parsing fails:

  1. parse_status set to Failed
  2. Warning logged with error details
  3. normalized_data set to None
  4. Pipeline continues with original_data available
  5. Middleware can check parse_status and handle accordingly

Example log:

WARN harmony: Failed to parse XML: XML parsing error: unexpected EOF

Unsupported Content-Type

When Content-Type is unknown:

  1. Attempts to parse as JSON (fallback behavior)
  2. If JSON parsing fails, parse_status set to Unsupported
  3. Request continues through pipeline
  4. Backend receives raw data in original_data

Size Limit Exceeded

When limits are exceeded:

  1. Parsing terminates immediately
  2. Error returned to client (400 Bad Request)
  3. Descriptive error message includes limit value

Example error:

{
  "error": "CSV row count exceeds limit of 10000"
}
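The limit check itself is simple: parsing aborts with a descriptive error that the adapter surfaces as a 400 Bad Request. A sketch of the CSV row-count case (`check_csv_rows` is a hypothetical name; the actual implementation is in Rust):

```python
def check_csv_rows(row_count: int, max_csv_rows: int = 10_000) -> None:
    # Exceeding a configured limit terminates parsing immediately;
    # the message echoes the limit value back to the client.
    if row_count > max_csv_rows:
        raise ValueError(f"CSV row count exceeds limit of {max_csv_rows}")
```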

Fallback Behavior

Missing Content-Type: Defaults to application/json

Unknown Content-Type:

  1. Attempts JSON parsing
  2. If JSON parsing fails, marks as Unsupported
  3. Pipeline continues with raw data

Empty Payload:

  • parse_status set to NotAttempted
  • normalized_data set to None
  • No parsing attempted
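Taken together, the fallback rules can be sketched as a small dispatcher (illustrative only; `detect_format` is a hypothetical name and the parse-status strings mirror the `ParseStatus` values above):

```python
import json
from typing import Optional, Tuple

KNOWN = {
    "application/json": "json", "application/xml": "xml",
    "text/xml": "xml", "text/csv": "csv",
    "application/x-www-form-urlencoded": "form",
}

def detect_format(content_type: Optional[str], body: bytes) -> Tuple[str, str]:
    if not body:
        return ("none", "NotAttempted")  # empty payload: no parsing attempted
    # Missing Content-Type defaults to application/json.
    mime = (content_type or "application/json").split(";")[0].strip().lower()
    if mime in KNOWN:
        return (KNOWN[mime], "Success")
    try:
        json.loads(body)  # unknown type: attempt JSON as a fallback
        return ("json", "Success")
    except ValueError:
        return ("unknown", "Unsupported")  # raw data still flows downstream
```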

Best Practices

1. Always Set Content-Type

Explicitly set the Content-Type header to ensure correct parsing:

# Good
curl -H "Content-Type: text/csv" -d @data.csv http://...

# Avoid (will try JSON parsing)
curl -d @data.csv http://...

2. Validate in Middleware

Check parse status in middleware before processing:

if envelope.request_details.content_metadata
    .as_ref()
    .map_or(false, |m| m.parse_status != ParseStatus::Success) 
{
    return Err(Error::from("Content parsing failed"));
}

3. Configure Appropriate Limits

Set limits based on your use case:

  • Small API endpoints: Lower limits (1MB, 100 rows)
  • File upload services: Higher limits (100MB, more files)
  • Untrusted inputs: Conservative limits
  • Internal services: Relaxed limits

4. Monitor Parse Failures

Track parsing failures in logs and metrics:

if let Some(metadata) = &envelope.request_details.content_metadata {
    if metadata.parse_status != ParseStatus::Success {
        metrics::counter!("parse_failures", "format" => &metadata.format).increment(1);
    }
}

5. Use Binary Content for Large Files

For large files, use binary content types rather than base64-encoded JSON:

# Efficient
curl -H "Content-Type: application/pdf" --data-binary @large.pdf http://...

# Inefficient (33% overhead)
curl -H "Content-Type: application/json" -d '{"file":"<base64>"}' http://...

Troubleshooting

Issue: CSV not parsing correctly

Symptoms: CSV data appears in original_data but not in normalized_data

Solutions:

  1. Verify Content-Type: text/csv header is set
  2. Check CSV has valid header row
  3. Ensure CSV is properly formatted (no unquoted commas in fields)
  4. Check row count doesn't exceed max_csv_rows limit

Issue: XML parsing fails

Symptoms: parse_status: Failed for XML content

Solutions:

  1. Validate XML syntax with external tool
  2. Check for unsupported features (external entities, DTDs)
  3. Verify XML depth doesn't exceed max_xml_depth limit
  4. Ensure proper UTF-8 encoding

Issue: Multipart files not detected

Symptoms: Files array is empty in normalized_data

Solutions:

  1. Verify Content-Type: multipart/form-data with boundary parameter
  2. Ensure boundary in header matches boundary in body
  3. Check that fields have filename attribute for file detection
  4. Verify file count doesn't exceed max_multipart_files limit

Issue: Request rejected with "exceeds limit"

Symptoms: 400 Bad Request with size limit error

Solutions:

  1. Increase relevant limit in [proxy.content_limits] configuration
  2. Split large requests into smaller chunks
  3. Use streaming endpoints for very large files
  4. Compress data before upload

Performance Considerations

Parsing Overhead

Approximate parsing overhead by content type:

  • JSON: ~10-20μs for small payloads (<1KB)
  • XML: ~50-100μs (includes structure conversion)
  • CSV: ~100μs per 100 rows
  • Form URL-encoded: ~20-30μs
  • Multipart: ~500μs per file (includes checksum)
  • Binary: ~1ms per MB (checksum calculation)

Memory Usage

Memory overhead during parsing:

  • JSON: ~1x payload size (serde_json)
  • XML: ~2-3x payload size (DOM structure)
  • CSV: ~2x payload size (row objects)
  • Multipart: ~1.5x payload size (field buffers)

Optimization Tips

  1. Use JSON when possible: Fastest parsing, lowest overhead
  2. Stream large files: Don't parse entire body if processing in chunks
  3. Disable checksums: For trusted sources, skip checksum calculation
  4. Tune limits: Set limits to actual use case requirements
  5. Monitor metrics: Track parsing times and adjust limits

Examples

See examples/content-types/ directory for:

  • Sample configuration files
  • Example requests for each content type
  • JOLT transforms for format conversion
  • Integration test examples

Related Documentation