Possible improvement for Spark Node2Vec

Hello Mercury-Graph team,

I took a lot of inspiration from your RandomWalks implementation while working on GraphFrames, so I wanted to share something back that might be useful.

One implementation limitation I ran into is in Spark MLlib’s built-in Word2Vec: `embedding_dim x vocab_size < Int.MaxValue`. For graph workloads, that becomes a real bottleneck surprisingly quickly. With the default `100`, the theoretical upper bound is only around 21 million vertices.

In practice, the issue shows up much earlier. Spark’s Word2Vec was built for NLP-style vocabularies, where the number of tokens is usually in the hundreds of thousands, not hundreds of millions or billions. It collects the full vocabulary on the driver, builds both a flat array of size `embedding_dim x vocab_size` and the corresponding tree structure, broadcasts them, and synchronizes them at every training iteration.

So even before hitting the JVM array limit, the memory and communication costs can already become prohibitive. For example, with a graph of 15 million vertices and the default embedding size of 100, the flat matrix alone is already over 10 GB, which is a serious serialization and communication burden.

At the same time, the random-walk part itself scales much further on Spark. Your approach -- and the one I used in GraphFrames -- can scale to billions of vertices with the right cluster sizing and tuning. In other words, for large graphs, Word2Vec becomes the limiting factor much sooner than walk generation.

o address that, I implemented an alternative seq2vec method in GraphFrames: [Hash2Vec](https://graphframes.io/api/scaladoc/org/graphframes/embeddings/Hash2Vec.html). It does not require a global vocabulary and runs in linear time in a single pass over the data.

Since Mercury-Graph already depends on GraphFrames, it seems like this could be added fairly easily as an alternative backend for Spark Node2Vec. There is no direct public Python API for Hash2Vec in GraphFrames, but it is part of the public JVM API, so it should still be straightforward to call through `py4J`.

It scales well to very large graphs on large clusters. One small practical note: the default `numPartitons` is really only suitable for toy examples -- for real workloads, I would recommend setting it closer to the number of workers.

In my experiments, Hash2Vec by itself gives reasonably good embeddings, although the quality was about 20% worse than Word2Vec:

<img width="388" height="240" alt="Image" src="https://github.com/user-attachments/assets/a316260a-0a53-4572-843d-28cd49e8a352" />

That said, if you add a convolution layer on top -- either your own implementation or the GraphFrames [built-in](https://graphframes.io/api/scaladoc/org/graphframes/convolutions/SamplingConvolution.html)) -- and concatenate the vertex embedding with the average neighbor embedding, the quality of Hash2Vec + convolution becomes roughly comparable to Word2Vec on the same subset of node-classification benchmarks.

I thought this might be useful for large-scale graph workloads, so I wanted to share it in case it helps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible improvement for Spark Node2Vec #41

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Possible improvement for Spark Node2Vec #41

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions