heuermh
Member

@heuermh heuermh commented May 10, 2019

Fixes #34

Updates the ADAM dependency version to 0.27.0. Note the workaround in the maven-shade-plugin configuration to prevent runtime conflicts with Parquet and Avro versions.

@heuermh
Member Author

heuermh commented May 10, 2019

@mlinderm Would adding Scala 2.12 support be useful? Spark 2.4.3 supports Scala 2.12 but getting things running on the binary distribution is a nightmare.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/deca-prb/32/

@heuermh changed the title from "Update ADAM dependency version to 0.27.0-SNAPSHOT" to "Update ADAM dependency version to 0.27.0" on May 23, 2019
@heuermh marked this pull request as ready for review on May 23, 2019 15:45
@heuermh requested review from fnothaft and mlinderm on May 23, 2019 15:45
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/deca-prb/33/

@mlinderm
Collaborator

@heuermh I am observing a substantial performance degradation with the new version. On a 16-core workstation, the time for calling CNVs in 2535 samples went from 9m17s with the old version to 12m22s with the new version (ADAM 0.27 and Spark 2.4.3). The difference seems to be in the PCA/SVD step. Do you have any guesses as to what might have changed between the old Spark/ADAM version and the current version?

@heuermh
Member Author

heuermh commented May 27, 2019

> The difference seems to be in the PCA/SVD step. Do you have any guesses as to what might have changed between the old Spark/ADAM version and the current version?

There are a lot of code changes between 0.24.0 and 0.27.0 in ADAM, but not much that would affect the performance of numerical methods:

bigdatagenomics/adam@maint_spark2_2.11-0.24.0...maint_spark2_2.11-0.27.0

I imagine differences between Spark 2.1.x and 2.4.3 might be more significant though. Is there a smaller benchmark we could use to reproduce what you are seeing, say with fewer samples?

@mlinderm
Collaborator

There are datasets ranging from 500 samples to 2535 samples on the AMP BDG cluster at /user/mlinderman/deca/DATA.<samples>.RD.txt (if those are not easy to obtain, I can post input files for you). The 500-sample dataset should run in a few minutes or less. On the workstation, the old version with Spark 2.1.0 took 1m34s for 500 samples; the new version took 2m17s. I can start working through different Spark versions to see where the change appears.

@heuermh
Member Author

heuermh commented May 27, 2019

> There are datasets ranging from 500 samples to 2535 samples on the AMP BDG cluster

Great, thanks. I'll also take a look tomorrow.

@mlinderm
Collaborator

I tried several Spark distributions with the original code base (older ADAM). The performance degradation seems to occur between 2.2.3 and 2.3.3: for 500 samples, 2.2.3 took 1m38s while 2.3.3 took 2m13s. One guess was that the cause was an upgrade to Breeze, but changing the Breeze dependency to 0.13.2 with Spark 2.4.3 did not improve performance.
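
To compare the timings quoted in this thread on one scale, the wall-clock strings can be converted to seconds and expressed as a relative slowdown. This is only a rough sketch: the helper names `to_seconds` and `slowdown_percent` are made up for illustration, the run labels are paraphrased from the comments above, and each figure appears to be a single run, so run-to-run variance is not accounted for.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a wall-clock string such as '9m17s' to whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def slowdown_percent(before: str, after: str) -> float:
    """Percentage increase in runtime going from `before` to `after`."""
    old, new = to_seconds(before), to_seconds(after)
    return (new - old) / old * 100.0


# Timings as reported in this thread: (label, baseline run, slower run).
RUNS = [
    ("2535 samples, old version vs ADAM 0.27 + Spark 2.4.3", "9m17s", "12m22s"),
    ("500 samples, old version (Spark 2.1.0) vs new version", "1m34s", "2m17s"),
    ("500 samples, Spark 2.2.3 vs Spark 2.3.3 (older ADAM)", "1m38s", "2m13s"),
]

for label, before, after in RUNS:
    print(f"{label}: {slowdown_percent(before, after):.0f}% slower")
```

On these figures, the slowdown is roughly 33% for 2535 samples and 46% for 500 samples when comparing old and new versions, and about 36% for the 2.2.3 to 2.3.3 step alone on 500 samples.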
