you will need a java version 8 or 11 according to the compatibility matrix for pyspark. this is most easily done using sdkman. after installing it, including adding the sourcing of sdkman to e.g. your .zshrc, you can simply

```sh
sdk install java 11.0.24-zulu
sdk use java 11.0.24-zulu
```

the Makefile has it all. mostly, you will need homebrew and pyenv on your machine. then, you run `make setup`. it may be that you have to restart your shell and add a section to your .bashrc or .zshrc to properly install pyenv.
you can run the sample dummy df via `python -i main.py` and interactively play in the session with the `df` object.
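for orientation, a minimal sketch of what such a dummy-df script can look like (the actual contents of main.py are not reproduced here; names and schema are made up):

```python
from pyspark.sql import SparkSession

# local spark session plus a tiny dummy dataframe to play with
spark = SparkSession.builder.master("local[*]").appName("dummy-df").getOrCreate()

df = spark.createDataFrame(
    [(1, "foo"), (2, "bar"), (3, "baz")],
    schema=["id", "label"],
)

# run via `python -i`: the session stays open and df can be explored interactively,
# e.g. df.show(), df.printSchema(), df.count()
```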
it took a really long time to figure out this config, LEARNINGS:

- s3a has to be used to access files
- most configs have to be set in the instantiated spark context (see the sketch after this list)
  - they have to have their `fs.s3X.impl` set
- you need hadoop in the correct version matching pyspark
  - `aws-java-sdk-bundle` is also needed
- or manually set the env variables accordingly:

  ```sh
  export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
  export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  export AWS_DEFAULT_REGION=us-west-2
  ```
- IF using SSO:
  - `ProfileCredentialsProvider` does not work with sso atm, you need to:
    - use such a script to insert the above mentioned values into the default credentials stanza of your `.aws/credentials` (a sketch of such a script follows at the end of the recipe below)
    - set the env variable accordingly: `export AWS_PROFILE=default`
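to make the config learnings above concrete, here is a minimal sketch of a spark session wired up for s3a. bucket, prefix and package versions are placeholders/assumptions; pick the hadoop-aws version that matches the hadoop bundled with your pyspark:

```python
from pyspark.sql import SparkSession

# hadoop-aws has to match the hadoop version that ships with your pyspark;
# aws-java-sdk-bundle is needed alongside it (versions below are examples)
spark = (
    SparkSession.builder
    .appName("s3a-demo")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    # the s3a filesystem implementation has to be set explicitly
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_PROFILE etc. from the env
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    )
    .getOrCreate()
)

# note the s3a:// scheme -- plain s3:// does not work with this setup
# (bucket and prefix are placeholders)
df = spark.read.parquet("s3a://some-bucket/some/prefix/")
df.show()
```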
full recipe when chaining roles:

- get credentials for the source profile, add them to its stanza in `~/.aws/credentials`
- get credentials for the target profile, add them to its stanza in `~/.aws/credentials`
- set `AWS_PROFILE` to the target profile
- run the script (sketched below)
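the script referred to above is not reproduced here; as a rough sketch, something along these lines does the job -- resolve credentials for a profile via boto3 (sso or chained role) and write them into the default stanza of `~/.aws/credentials` (profile handling and paths here are assumptions):

```python
#!/usr/bin/env python3
"""copy resolved credentials for an aws profile (e.g. an sso or chained-role
profile) into the `default` stanza of ~/.aws/credentials, so that tools that
only understand static keys can pick them up via AWS_PROFILE=default."""
import configparser
import pathlib
import sys

import boto3

profile = sys.argv[1] if len(sys.argv) > 1 else "default"

# boto3 resolves sso / role-chaining profiles defined in ~/.aws/config
session = boto3.Session(profile_name=profile)
creds = session.get_credentials().get_frozen_credentials()

creds_path = pathlib.Path.home() / ".aws" / "credentials"
config = configparser.ConfigParser()
config.read(creds_path)

# note: configparser drops comments that were in the existing file
config["default"] = {
    "aws_access_key_id": creds.access_key,
    "aws_secret_access_key": creds.secret_key,
}
if creds.token:
    config["default"]["aws_session_token"] = creds.token

with creds_path.open("w") as fh:
    config.write(fh)

print(f"wrote credentials of profile '{profile}' into [default] of {creds_path}")
```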