Singer Encodings

A Python library for reading various file formats (CSV, JSON, JSONL, Avro, Parquet, and Excel) commonly used in Singer taps and targets.

Installation

pip install singer-encodings

Supported Formats

CSV
JSON
JSONL (JSON Lines)
Avro
Parquet
Excel (.xlsx, .xls)

Usage

Excel Files

The Excel module supports reading .xlsx and .xls files with advanced features including hyperlink preservation, comment extraction, and automatic date/time normalization.

Basic Usage

from singer_encodings.excel_reader import get_excel_row_iterator

# Open and read an Excel file
with open('data.xlsx', 'rb') as excel_file:
    row_iterator = get_excel_row_iterator(excel_file)

    if row_iterator:
        for sheet_name, row_dict in row_iterator:
            print(f"Sheet: {sheet_name}")
            print(f"Row: {row_dict}")

Read Specific Sheet

from singer_encodings.excel_reader import get_excel_row_iterator

with open('data.xlsx', 'rb') as excel_file:
    options = {'sheet_name': 'Sales Data'}
    row_iterator = get_excel_row_iterator(excel_file, options=options)

    if row_iterator:
        for sheet_name, row_dict in row_iterator:
            print(f"Row from {sheet_name}: {row_dict}")

Validate Required Headers

from singer_encodings.excel_reader import get_excel_row_iterator

with open('data.xlsx', 'rb') as excel_file:
    options = {
        'key_properties': ['id', 'email'],  # Required columns
        'date_overrides': ['created_at', 'updated_at']  # Expected date columns
    }

    try:
        row_iterator = get_excel_row_iterator(excel_file, options=options)
        if row_iterator:
            for sheet_name, row_dict in row_iterator:
                print(row_dict)
    except Exception as e:
        print(f"Validation error: {e}")

Filter by Catalog Headers

from singer_encodings.excel_reader import get_excel_row_iterator

# Only include specific columns; others go to _sdc_extra
headers_in_catalog = ['id', 'name', 'email', 'created_at']

with open('data.xlsx', 'rb') as excel_file:
    row_iterator = get_excel_row_iterator(
        excel_file,
        headers_in_catalog=headers_in_catalog
    )

    if row_iterator:
        for sheet_name, row_dict in row_iterator:
            # Columns not in catalog are stored in _sdc_extra
            if '_sdc_extra' in row_dict:
                print(f"Extra data: {row_dict['_sdc_extra']}")

Excel Features

Hyperlink Preservation

Cells with hyperlinks are represented as lists with structured data:

# Cell with hyperlink returns:
{
    "website": [{"text": "Visit Site", "url": "https://example.com"}]
}

Comment Extraction

Cell comments (both legacy and threaded) are preserved:

# Cell with comment returns:
{
    "notes": [{"text": "Important note", "comment": {"text": "Review this", "author": "John Doe"}}]
}

# Cell with hyperlink and comment:
{
    "link": [{"text": "Click here", "url": "https://example.com", "comment": {"text": "Updated link"}}]
}

Date/Time Normalization

All date and datetime values are automatically normalized to ISO-8601 format:

# Datetime values: 'YYYY-MM-DDTHH:MM:SSZ'
# Date values: 'YYYY-MM-DDT00:00:00Z'
# Time values: 'HH:MM:SS'

# Example output:
{
    "created_at": "2024-01-15T14:30:00Z",  # From Excel datetime
    "birth_date": "1990-05-20T00:00:00Z",  # From Excel date
    "start_time": "09:30:00"                # From Excel time
}

Duplicate Headers

Duplicate headers are automatically detected and stored in _sdc_extra:

# Excel with headers: id, name, email, name
# Returns:
{
    "id": "123",
    "name": "John",
    "email": "john@example.com",
    "_sdc_extra": [{"name": "Additional Name"}]
}

CSV Files

from singer_encodings.csv import get_row_iterator

csv_data = [b"name,email,age", b"John,john@example.com,30"]
row_iterator = get_row_iterator(csv_data)

for row in row_iterator:
    print(row)
# Output: {'name': 'John', 'email': 'john@example.com', 'age': '30'}

JSONL Files

from singer_encodings.jsonl import get_row_iterator

with open('data.jsonl', 'r') as jsonl_file:
    row_iterator = get_row_iterator(jsonl_file)

    for row in row_iterator:
        print(row)

Parquet Files

from singer_encodings.parquet import get_row_iterator

with open('data.parquet', 'rb') as parquet_file:
    row_iterator = get_row_iterator(parquet_file)

    for row in row_iterator:
        print(row)

Development

Running Tests

pytest tests/

Running Specific Test

pytest tests/test_excel.py

License

See LICENSE file for details.

Contributing

Contributions are welcome! Please ensure:

All tests pass
New features include tests
Update CHANGELOG.md with your changes

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.circleci		.circleci
.github		.github
singer_encodings		singer_encodings
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Singer Encodings

Installation

Supported Formats

Usage

Excel Files

Basic Usage

Read Specific Sheet

Validate Required Headers

Filter by Catalog Headers

Excel Features

Hyperlink Preservation

Comment Extraction

Date/Time Normalization

Duplicate Headers

CSV Files

JSONL Files

Parquet Files

Development

Running Tests

Running Specific Test

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 18

Uh oh!

Languages

License

singer-io/singer-encodings

Folders and files

Latest commit

History

Repository files navigation

Singer Encodings

Installation

Supported Formats

Usage

Excel Files

Basic Usage

Read Specific Sheet

Validate Required Headers

Filter by Catalog Headers

Excel Features

Hyperlink Preservation

Comment Extraction

Date/Time Normalization

Duplicate Headers

CSV Files

JSONL Files

Parquet Files

Development

Running Tests

Running Specific Test

License

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 18

Uh oh!

Languages

Packages