Use the same column data types for all engines in benchmarks #101

Open
@MrPowers

Description

Here's a snippet from the Polars groupby benchmarks:

pl.read_csv(src_grp, schema_overrides={"id4":pl.Int32, "id5":pl.Int32, "id6":pl.Int32, "v1":pl.Int32, "v2":pl.Int32

Looks like id4, id5, id6, v1 and v2 are being read as Int32 columns.

Other engines, like Spark, are just inferring the column types:

x = spark.read.csv(src_grp, header=True, inferSchema='true')

I think we should either have all the engines infer the column data types or have all the engines specify them explicitly, so the comparison is fair. It's not an apples-to-apples comparison if some engines are using int32 and others are using int64.
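One possible way to do the "specify everywhere" option is to keep a single shared schema and translate it per engine. The sketch below is only an illustration: `SHARED_SCHEMA`, `polars_dtype_names` and `spark_ddl` are hypothetical names, not anything in the benchmark repo, and only the five columns visible in the Polars snippet above are listed (a real schema would need every column in the CSV).

```python
# Hypothetical single source of truth for column types (assumption:
# not part of the benchmark code). Only the columns visible in the
# Polars snippet are listed here.
SHARED_SCHEMA = {
    "id4": "int32",
    "id5": "int32",
    "id6": "int32",
    "v1": "int32",
    "v2": "int32",
}

# Logical type -> name of the Polars dtype attribute (e.g. pl.Int32).
POLARS_DTYPE_NAMES = {"int32": "Int32", "int64": "Int64", "float64": "Float64"}

# Logical type -> Spark SQL DDL type name.
SPARK_DDL_TYPES = {"int32": "INT", "int64": "BIGINT", "float64": "DOUBLE"}


def polars_dtype_names(schema):
    """Map each column to the name of the matching Polars dtype."""
    return {col: POLARS_DTYPE_NAMES[t] for col, t in schema.items()}


def spark_ddl(schema):
    """Build a DDL-formatted schema string for Spark from the shared schema."""
    return ", ".join(f"{col} {SPARK_DDL_TYPES[t]}" for col, t in schema.items())


if __name__ == "__main__":
    print(polars_dtype_names(SHARED_SCHEMA))
    print(spark_ddl(SHARED_SCHEMA))
```

On the Polars side the names could be resolved with something like `{c: getattr(pl, n) for c, n in polars_dtype_names(SHARED_SCHEMA).items()}` and passed as `schema_overrides`. On the Spark side the DDL string could be passed as `schema=` to `spark.read.csv` in place of `inferSchema`, with the caveat that an explicit Spark schema has to cover all columns in the file, not just these five.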
