Problem Description
When using partition_html() and extracting table metadata via chunk.metadata.text_as_html, numeric values are being automatically converted to exponential notation.
Example
- Input Number: 478923
- Converted Output: 4.7e+05
Steps to Reproduce
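- Run partition_html() on an HTML file that contains a table
- Chunk the resulting elements with the chunk-by-title function (chunk_by_title) and extract the tabular data
- Access chunk.metadata.text_as_html
- Observe that numeric values have been converted to exponential notation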
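
The steps above can be sketched roughly as follows. This is a minimal reproduction outline, not a verified script: the helper names find_exponent_tokens and report_tables and the html_path argument are hypothetical, and the regular expression is just one way to spot exponent-style tokens in the table HTML. The unstructured imports are done lazily so the detection helper can be used on its own.

```python
import re

# Matches tokens such as 4.7e+05 or 1E-3 (a mantissa followed by an exponent).
EXPONENT_RE = re.compile(r"(?<![\w.])[+-]?\d+(?:\.\d+)?[eE][+-]?\d+(?![\w.])")


def find_exponent_tokens(html: str) -> list[str]:
    """Return every exponent-notation number found in an HTML string."""
    return EXPONENT_RE.findall(html)


def report_tables(html_path: str) -> None:
    """Hypothetical helper: partition, chunk by title, and flag suspicious cells."""
    # Imported here so find_exponent_tokens works without unstructured installed.
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.html import partition_html

    elements = partition_html(filename=html_path)
    for chunk in chunk_by_title(elements):
        table_html = getattr(chunk.metadata, "text_as_html", None)
        if table_html:
            # Any hit here is a value that no longer matches the source HTML.
            print(find_exponent_tokens(table_html))
```

Calling report_tables() on the affected file should print the converted values for each table chunk; an empty list means no exponent-style tokens were found in that chunk.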
Expected Behavior
- Numeric values should be preserved in their original format
- No automatic scientific notation conversion
Environment Details
- Unstructured Library Version: 0.10.28
- Python Version: 3.11.0rc1
- Environment: databricks runtime 15.4 LTS ML
Potential Impact
This automatic conversion loses precision (in the example above, only two significant digits of 478923 survive), which can cause data integrity issues, especially in financial or scientific data processing.
Suggested Investigation
- Review number parsing/serialization logic
- Check type conversion mechanisms in metadata handling
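
As a starting point for that investigation, here is a small plain-Python illustration of how a cell value can end up in exponential notation once it is parsed to a float and re-serialized with a low-precision general format. This is an assumption about the general mechanism, not a confirmed diagnosis of the library. Note that rounding here produces a slightly different mantissa than the one reported above, so it only shows the class of problem.

```python
from decimal import Decimal

raw = "478923"  # cell text as it appears in the source HTML

# Parsing to float and formatting with two significant digits
# switches to exponent notation and discards the remaining digits.
lossy = format(float(raw), ".2g")

# Keeping the value as text (or as a Decimal) preserves every digit.
exact = str(Decimal(raw))

print(lossy)
print(exact)
```

If something like this is happening, keeping cell contents as strings through the metadata path would avoid the conversion.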