Replies: 1 comment 3 replies
I don't think that'll work given how Hadoop works. The Hadoop master is typically configured with a list of all the worker nodes its allowed to schedule jobs on. Magpie would (hypothetically) have to dynamically update that Hadoop master's configuration whenever it spins up (and tears down) a new allocation on an HPC cluster. That is likely impossible [1] (ssh off cluster and have privileged access to re-configure Hadoop) and minimally not reasonable to do. Edit: That said, any job you start on an HPC cluster should be able to access the HDFS networked storage, since that is likely independent of the Hadoop master / Yarn scheduler. [1] - clarification, I should say impossible given the average HPC cluster environment. I can't speak for all environments. |
If I have an existing Hadoop cluster and HDFS networked storage, can I point to it instead and have all the nodes and cores for the job that are running Hadoop point to the existing Hadoop master node and become worker nodes of it?