
Update pyarrow and microdata-tools#156

Merged
pawbu merged 1 commit into main from update-pyarrow
Mar 27, 2025

@pawbu
Contributor

@pawbu pawbu commented Mar 26, 2025

One test for partitioned datasets read the .parquet file directly, which also returns the partition column. Added a test for the usual case, where we ask only for "start_year=123" and not "start_year=123/04f4164ec1f247f2ad392fa9c03e71fe-0.parquet".

@pawbu pawbu requested a review from a team as a code owner March 26, 2025 13:13

@DanielElisenberg
Collaborator

DanielElisenberg commented Mar 26, 2025

so reading the whole directory in as a table doesn't include the column that was partitioned upon, but reading one of the partitions directly yields this column?🤔

EDIT:

  • Reading the whole dataset would yield the partitioned column status_date
  • Reading the partition folder has status_date
  • Reading the parquet in the partition folder does not have status_date

Would this be correct? And why do we want to test reading from a single partition, since the app never does? 👀 To document pyarrow behavior?

@pawbu
Contributor Author

pawbu commented Mar 26, 2025

start_year is the partitioned column in this case, so:

  • Reading the whole dataset would yield the partitioned column start_year
  • Reading the partition folder does not yield start_year
  • Reading the (single) parquet file in the partition folder would yield start_year

And why do we want to test reading from a single partition, since the app never does?

I haven't checked the reason for the test in question. Can do that later, after merging this PR, since microdata-tools is out and we should not be running a different version here in job-executor for too long 👍

Collaborator

@DanielElisenberg DanielElisenberg left a comment

That makes sense 👍🏻 Reasonable conclusion. Let's look at whether reading the separate folders and containing parquet files is reasonable in a new PR 💯

@pawbu pawbu merged commit 180c62e into main Mar 27, 2025
6 checks passed
@pawbu pawbu deleted the update-pyarrow branch March 27, 2025 07:58