💡 Feature Request: Add `to_spark()` to the dataset class #173

There's an interesting use case here: you could use PyAirbyte directly in data pipelines that run on Spark.

Currently, to do this you need to call `to_pandas()` and then `spark_session.createDataFrame(issues_df, schema=my_schema)`. This seems inefficient, and you also have to define the schema manually (for example, JSON blobs are `object` in pandas but need to be `StringType` in Spark, and there are other idiosyncrasies, such as pandas having 64-bit ints while Spark distinguishes Int and Long).
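To make the workaround concrete, here is a rough sketch of what I end up writing today. The helper names (`spark_ddl_schema`, `stringify_json`, `pandas_to_spark`) and the dtype mapping are my own assumptions, not part of PyAirbyte or Spark. It also assumes simple column names with no spaces or special characters, and falls back to `string` for any dtype it does not recognize.

```python
import json

# Assumed mapping from pandas dtype names to Spark SQL DDL type names.
# Not exhaustive; unknown dtypes fall back to "string".
PANDAS_TO_SPARK_DDL = {
    "int64": "bigint",       # pandas 64-bit ints map to Spark LongType
    "int32": "int",
    "float64": "double",
    "bool": "boolean",
    "datetime64[ns]": "timestamp",
    "object": "string",      # JSON blobs and plain strings
}


def spark_ddl_schema(dtypes: dict) -> str:
    """Build a DDL schema string such as 'id bigint, title string'."""
    return ", ".join(
        f"{name} {PANDAS_TO_SPARK_DDL.get(dtype, 'string')}"
        for name, dtype in dtypes.items()
    )


def stringify_json(value):
    """Serialize dicts and lists to JSON text; leave other values unchanged."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


def pandas_to_spark(spark_session, pandas_df):
    """Convert a pandas DataFrame to a Spark DataFrame with an explicit schema."""
    dtypes = {col: str(dtype) for col, dtype in pandas_df.dtypes.items()}
    cleaned = pandas_df.copy()
    for col, dtype in dtypes.items():
        if dtype == "object":
            # Make nested values fit a Spark string column.
            cleaned[col] = cleaned[col].map(stringify_json)
    # createDataFrame accepts a DDL-formatted string as the schema.
    return spark_session.createDataFrame(cleaned, schema=spark_ddl_schema(dtypes))
```

With this, the conversion becomes `pandas_to_spark(spark_session, dataset.to_pandas())`, where `dataset` stands for whatever PyAirbyte dataset you read. It still round-trips everything through pandas on the driver, which is exactly the inefficiency a built-in `to_spark()` could avoid.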

Or maybe a Spark DataFrame cache would be more efficient here?

