Number getting converted into scientific notation in metadata.text_as_html #3871

Open
@sahil0094

Problem Description

When an HTML file is processed with partition_html() and the table HTML is read from chunk.metadata.text_as_html, numeric cell values come back in scientific notation instead of their original form.

Example

  • Input Number: 478923
  • Converted Output: 4.7e+05

Steps to Reproduce

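  1. Run partition_html() on an HTML file that contains a table with numeric cells
  2. Chunk the resulting elements with chunk_by_title() and pick out the table chunks
  3. Read chunk.metadata.text_as_html for each table chunk
  4. Compare the numbers in that HTML with the numbers in the source file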
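The steps above can be sketched as a small script. This is only a sketch under assumptions: the original input file is not included in the report, so `SAMPLE_HTML` is a hypothetical table, and `find_scientific_notation` and `reproduce` are illustrative helper names, not part of the library. Only `partition_html`, `chunk_by_title`, and `metadata.text_as_html` come from unstructured.

```python
import re

# Hypothetical sample input; the report does not include the original HTML file.
SAMPLE_HTML = """
<html><body>
<h1>Report</h1>
<table>
  <tr><th>Item</th><th>Amount</th></tr>
  <tr><td>Revenue</td><td>478923</td></tr>
</table>
</body></html>
"""

# Matches standalone tokens such as 4.7e+05 or 1E-3.
SCI_NOTATION = re.compile(r"(?<![\w.])[+-]?\d+(?:\.\d+)?[eE][+-]?\d+(?![\w.])")


def find_scientific_notation(html: str) -> list[str]:
    """Return every scientific-notation token found in the given HTML string."""
    return SCI_NOTATION.findall(html)


def reproduce(html: str = SAMPLE_HTML) -> list[str]:
    """Partition, chunk by title, and collect scientific-notation tokens
    from the text_as_html of every chunk that has one."""
    # Imported lazily so the helper above can be used without unstructured installed.
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.html import partition_html

    elements = partition_html(text=html)   # step 1
    chunks = chunk_by_title(elements)      # step 2
    hits: list[str] = []
    for chunk in chunks:
        table_html = getattr(chunk.metadata, "text_as_html", None)  # step 3
        if table_html:
            hits.extend(find_scientific_notation(table_html))       # step 4
    return hits
```

Calling `reproduce()` on an affected install would be expected to return a non-empty list; an empty list means no scientific-notation tokens were found for this particular sample.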

Expected Behavior

  • Numeric values should be preserved in their original format
  • No automatic scientific notation conversion

Environment Details

  • Unstructured Library Version: 0.10.28
  • Python Version: 3.11.0rc1
  • Environment: Databricks Runtime 15.4 LTS ML

Potential Impact

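The conversion is lossy: 4.7e+05 no longer carries the digits of 478923, so the original value cannot be recovered from the output. This can cause data integrity issues, especially in financial or scientific data processing.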

Suggested Investigation

  • Review number parsing/serialization logic
  • Check type conversion mechanisms in metadata handling
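To narrow down where the change happens, it may help to compare table cells in the source HTML against the cells in text_as_html. The sketch below uses only the Python standard library; `CellCollector`, `table_cells`, and `changed_cells` are hypothetical helper names, and the comparison assumes both tables list their cells in the same order.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the stripped text of every <td> and <th> cell, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: list[str] = []
        self._in_cell = False
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buf = []

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.cells.append("".join(self._buf).strip())
            self._in_cell = False


def table_cells(html: str) -> list[str]:
    """Return the text of all table cells found in the HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def changed_cells(source_html: str, output_html: str) -> list[tuple[str, str]]:
    """Pair cells positionally and return the (source, output) pairs that differ.

    Assumption: both tables contain the same cells in the same order.
    """
    pairs = zip(table_cells(source_html), table_cells(output_html))
    return [(src, out) for src, out in pairs if src != out]
```

For the value from this report, passing a source row containing `478923` and an output row containing `4.7e+05` yields the single pair `("478923", "4.7e+05")`, which points at the affected cell.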


Labels: bug
