Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor changes to Python fileio #24363

Closed
wants to merge 6 commits into from
Closed

Conversation

iht
Copy link
Contributor

@iht iht commented Nov 27, 2022

This PR fixes #24362 and adds a slight change to the documentation, to make it clear that the file_naming function also receives destination as a parameter (it is mentioned at the beginning of the documentation page, but in WriteToFiles that parameter description is missing).

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

iht added 2 commits November 27, 2022 13:30
The file naming function also takes the destination as an input variable.
When fileio writes several shards, the last shard will have as name
"00123-of-00124" which may look like incomplete processing.

This commit starts the first shard name with 1, and the last shard name with
N (for N shards).
@codecov
Copy link

codecov bot commented Nov 27, 2022

Codecov Report

Merging #24363 (67865bc) into master (37fb90c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #24363   +/-   ##
=======================================
  Coverage   73.35%   73.35%           
=======================================
  Files         718      718           
  Lines       97033    97033           
=======================================
+ Hits        71177    71180    +3     
+ Misses      24525    24522    -3     
  Partials     1331     1331           
Flag Coverage Δ
python 82.99% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/python/apache_beam/io/fileio.py 96.05% <100.00%> (ø)
...ks/python/apache_beam/runners/worker/sdk_worker.py 89.08% <0.00%> (-0.17%) ⬇️
...che_beam/runners/interactive/interactive_runner.py 91.77% <0.00%> (+1.26%) ⬆️
.../python/apache_beam/testing/test_stream_service.py 92.85% <0.00%> (+4.76%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.
R: @johnjcasey for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@johnjcasey
Copy link
Contributor

run python precommit

@johnjcasey
Copy link
Contributor

LGTM, just rerunning python

@Abacn
Copy link
Contributor

Abacn commented Nov 28, 2022

This makes more sense, obviously. Just have worried that if, ever, some user apps could assume the (00123-of-00124) naming pattern and we will break them mysteriously. It may deserve to be noted in "breaking change".

@johnjcasey
Copy link
Contributor

Agreed with Yi. Can you add a note to the changes file?

@Abacn
Copy link
Contributor

Abacn commented Nov 28, 2022

I also suspect same issue also exists elsewhere in the code base. e.g. here:

self.shard_name_format %

* <p>Eg: [prefix]-00000-of-00100[suffix] and [prefix]-00001-of-00100[suffix]

unfortunately finding and changing all these would be tedious...

@Abacn
Copy link
Contributor

Abacn commented Nov 29, 2022

Sorry, I mean the finding is great, however due to same issue elsewhere, partial fix would then make the indexing inconsistent throughout the code base.

@iht
Copy link
Contributor Author

iht commented Nov 30, 2022

Sorry, I mean the finding is great, however due to same issue elsewhere, partial fix would then make the indexing inconsistent throughout the code base.

Ah, ok. I have checked that this is used in fileio and in the dataframes module. I will update this PR with more details about the potential impact. So far, in my tests, this does not break anything anywhere else in the Python SDK.

Or do you mean that the behavior should be the same in the Java and Python SDKs? I can also add a similar change to the Java SDK, and check where is that used in the rest of the SDK to evaluate potential impact.

@Abacn
Copy link
Contributor

Abacn commented Nov 30, 2022

Or do you mean that the behavior should be the same in the Java and Python SDKs? I can also add a similar change to the Java SDK, and check where is that used in the rest of the SDK to evaluate potential impact.

Yes, ideally naming should be consistent throughout SDKs. Currently it is consistent (though weird). Would appreciate if you are willing to find the appearance in both the Java and Python SDKs.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 8, 2022

Reminder, please take a look at this pr: @AnandInguva @johnjcasey

@github-actions
Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@AnandInguva
Copy link
Contributor

stop reviewer notifications

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: requested by reviewer

@zhenglin-charlie-li
Copy link
Contributor

Or do you mean that the behavior should be the same in the Java and Python SDKs? I can also add a similar change to the Java SDK, and check where is that used in the rest of the SDK to evaluate potential impact.

Yes, ideally naming should be consistent throughout SDKs. Currently it is consistent (though weird). Would appreciate if you are willing to find the appearance in both the Java and Python SDKs.

@Abacn

I'd like to keep working on and change the Go and Java SDK. What should I do with Git? Should I create a new branch from master? Seams that I can not create a branch based on this branch(iht:minor_python_doc)

@Abacn
Copy link
Contributor

Abacn commented Jan 7, 2023

thanks for the interests. I would expect many places to change and imo limited benefit compared to the risk.

@iht
Copy link
Contributor Author

iht commented Jan 11, 2023

thanks for the interests. I would expect many places to change and imo limited benefit compared to the risk.

I am going to close this PR now, and I will be sending a new one with consistent changes between the Python and Java SDKs. After the holidays, this branch is outdated, and I prefer just to start over than resolving the conflcits.

@iht iht closed this Jan 11, 2023
@iht iht deleted the minor_python_doc branch April 20, 2023 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FileIO shard numbering would be more user-friendly if it were 1-based instead of 0-based
5 participants