Single-Agent Graph Embedding for Quantum Computers via Reinforcement Learning - Master Thesis @ Politecnico di Milano
pytorch > gymnasium > stable-baselines3
Full thesis at Thesis_Complete.pdf
Thesis summary at Thesis_Executive_Summary.pdf
Supercomputers alone are insufficient for tackling NP-hard problems; one of them is the Graph Minor Embedding problem, which is crucial for quantum computing with D-Wave systems. This work uses Reinforcement Learning (RL) techniques to enhance heuristics such as D-Wave's minorminer for embedding problems onto a Quantum Processing Unit (QPU). The study explores PPO and DQN RL models to optimize embeddings, aiming to reduce the number of qubits needed. Performance is compared against the state of the art, with evaluations on datasets of 15-, 30-, and 50-node graphs.
- Install the conda packages from the requirements file requirements.txt (stable-baselines3[extra] needs to be installed separately: `pip install "stable-baselines3[extra]"`).
- Launch with `python rel-embedder.py [args]` (example invocations are shown after the argument list below).
- --train1: used when we want to train a new model; pass a string representing the name under which the model will be saved.
- --graph_set: with this parameter we can pass a graph set on which to train the model. Available sets and associated strings are "n30c20x10" (a set of 10 graphs with 30 nodes and 20% connectivity), "n30c20x20", "n50c20x10" and "n50c20x20", following the same logic
- --ts: number of timesteps to train the model. Default is 100000
- --lr: with this parameter we can set the learning rate. Default is 0.0003
- --ent_coef: accepts a float that represents the entropy coefficient of the model. Default is 0
- --gamma: sets the gamma parameter of the model. Default is 0.99
- --norm: if set to True, the heat values will be normalized to a float between 0 and 1; otherwise the absolute values will be kept. Default is False
- --avg_heat: when set to True, the heat value of auxiliary nodes is included in the calculation of the average heat of the graph. Default is True
- --ep_perc: with this parameter we can set the percentage of heat to reach in order to end the episode. Default is 0.1 (10%)
- --subrun: number of training runs to perform with the same settings. Default is 1
- --algo: accepts either strings "PPO" or "DQN" to set the algorithm for the training. Default is "PPO"
- --test: used when testing a model. The name of the model is passed to this parameter.
- --graph_set: with this parameter we can pass a graph_set on which to test the model
- --ep: sets the total number of episodes per run
- --subrun: number of testing runs to perform on the same model
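For instance, a training run followed by a test run could look like the following (the model name `ppo_n30` and the numeric values are placeholders chosen for illustration):

```bash
# Train a new PPO model on the 10-graph, 30-node set
python rel-embedder.py --train1 ppo_n30 --graph_set n30c20x10 --algo PPO --ts 100000 --lr 0.0003

# Test the saved model on the 20-graph, 30-node set
python rel-embedder.py --test ppo_n30 --graph_set n30c20x20 --ep 50 --subrun 3
```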
D-Wave’s Quantum Processing Unit (QPU) architecture is at the core of D-Wave’s quantum computers: it exploits the principles of quantum annealing and is suited to solving optimization problems. The D-Wave QPU treated in this work, composed of superconducting qubits, is organized in what is known as a Chimera graph pattern, shown below.
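For reference, the Chimera topology can be generated with the dwave-networkx package (a minimal sketch; the 16x16 grid of K4,4 unit cells mirrors the 2000Q-era layout):

```python
import dwave_networkx as dnx

# 16x16 grid of Chimera unit cells, each a K4,4 bipartite block
chimera = dnx.chimera_graph(16, 16, 4)
print(chimera.number_of_nodes(), "qubits,", chimera.number_of_edges(), "couplers")
```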
The minorminer heuristic is designed to address the complex challenge of identifying a smaller graph H as a minor of a larger graph G. Given the NP-hard nature of this problem, traditional exact solutions become computationally infeasible for graphs of substantial size. The heuristic's logic is explained in https://arxiv.org/abs/1406.2741.
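A minimal usage sketch of the heuristic, embedding a small complete graph H into a 2x2 Chimera block G (sizes chosen arbitrarily for illustration):

```python
import networkx as nx
import dwave_networkx as dnx
import minorminer

H = nx.complete_graph(5)        # smaller problem graph to embed
G = dnx.chimera_graph(2, 2, 4)  # larger target graph (a 2x2 Chimera block)

# find_embedding returns a dict mapping each node of H to its chain of qubits in G
embedding = minorminer.find_embedding(H.edges(), G.edges())
print(embedding)
print("qubits used:", sum(len(chain) for chain in embedding.values()))
```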
The main goal of our agent is not to fully embed a graph onto D-Wave’s Quantum Processing Unit (QPU), but rather to ease the embedding procedure for the minorminer heuristic. This is accomplished by passing to the heuristic a modified version of the original graph in which the nodes identified as requiring more qubits to be embedded onto the QPU are preemptively expanded into structures known as chains.
Nodes’ heat gives an indicative measure of how many qubits are necessary to embed the current graph into a QPU. In this way we are able to understand, at each timestep and after each modification made by the agent, how well the graph is approaching a target average heat.
Local embeddings are smaller embeddings calculated on subsets of the input graph in order to provide the heat function described in 2.1 with the number of qubits needed to embed each node onto the QPU. A local embedding is calculated on the subgraph formed by the node under examination and all nodes at 1-hop distance from it.
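A minimal sketch of how such a local-embedding-based heat could be computed (the exact heat definition used in the thesis may differ; here the heat of a node is taken to be the length of its chain in the local embedding, and the helper names are hypothetical):

```python
import networkx as nx
import dwave_networkx as dnx
import minorminer

def node_heat(graph, node, target):
    """Assumed heat: qubits (chain length) the node needs when its 1-hop
    neighborhood subgraph is embedded onto the target QPU graph."""
    local = graph.subgraph([node] + list(graph.neighbors(node)))
    local_embedding = minorminer.find_embedding(local.edges(), target.edges())
    return len(local_embedding.get(node, []))

def average_heat(graph, target):
    """Average heat over all nodes of the graph."""
    return sum(node_heat(graph, n, target) for n in graph.nodes()) / graph.number_of_nodes()

# Illustrative usage with arbitrary sizes
if __name__ == "__main__":
    source = nx.gnp_random_graph(15, 0.3, seed=0)
    chimera = dnx.chimera_graph(4, 4, 4)
    print("average heat:", average_heat(source, chimera))
```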
What the agent is able to see is an array of heat values related to the nodes adjacent to the selected node. The idea is that the agent will learn to select adjacent nodes that lower the average heat of the graph. What happens when the agent selects a node is explained in Subsection 2.4.
The agent can take discrete actions, defined as a Discrete space in the custom Gym environment with the same length of 10 elements as the observation space described in 2.3. When the agent selects an action corresponding to an integer value ranging from 0 to 9, it is selecting the index of the i-th node listed in the observation space.
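In gymnasium terms, the two spaces could be declared as follows (a sketch; the [0, 1] bound assumes normalized heat values, which is just one of the configurations described above):

```python
import numpy as np
from gymnasium import spaces

N_NEIGHBORS = 10  # fixed-size window of nodes adjacent to the priority node

# Observation: heat values of the (up to) 10 adjacent nodes, assuming normalization
observation_space = spaces.Box(low=0.0, high=1.0, shape=(N_NEIGHBORS,), dtype=np.float32)

# Action: index (0-9) of the adjacent node to involve in the next Node Expansion
action_space = spaces.Discrete(N_NEIGHBORS)
```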
Once we have a priority node, selected as the node with the highest heat, and a node chosen by the agent that is adjacent to the priority node, we can proceed with the Node Expansion process. The arc shared by the priority node and the adjacent node is removed and an auxiliary node is attached to the priority node. In this way initial chains are formed and the overall heat of the graph is lowered step by step.
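A networkx sketch of one expansion step; the description above only states that the arc is removed and an auxiliary node is attached, so rerouting the removed arc through the auxiliary node (to preserve the original adjacency) is an assumption made here, and the node labeling is arbitrary:

```python
import itertools

_aux_ids = itertools.count()

def expand_node(graph, priority, selected):
    """One Node Expansion step: seed a chain at the priority node."""
    aux = f"aux_{next(_aux_ids)}"   # arbitrary auxiliary-node label
    graph.remove_edge(priority, selected)
    graph.add_edge(priority, aux)   # intra-chain edge: priority node + auxiliary node
    graph.add_edge(selected, aux)   # assumed rerouting of the removed arc
    return aux
```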
The ideal reward would be a reward that at each timestep can show the progress of the process of lowering the heat of the graph and that is invariant to the graph that is analyzed by the agent.
For this reason the reward function has been defined as: log(prevAvgHeat / currAvgHeat), where prevAvgHeat represents the average heat of the graph at the previous timestep (t − 1) and currAvgHeat the average heat of the graph at the current timestep.
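In code, this is simply:

```python
import math

def step_reward(prev_avg_heat, curr_avg_heat):
    # Positive when the last action lowered the average heat of the graph
    return math.log(prev_avg_heat / curr_avg_heat)
```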
As stated in 2.6, the reward function expresses at each timestep the progress made in lowering the heat with respect to the previous timestep, so it is convenient to base the episode termination condition on the same concept. If we set this condition as a percentage of heat to reach, we obtain a total reward per episode that is very similar across graphs of different complexity. Given a starting average heat startAvgHeat, a current average heat currAvgHeat, and a progress percentage of, e.g., 10% passed as a parameter to the agent, the episode termination condition is defined as: currAvgHeat ≤ startAvgHeat ∗ (1 − 0.1)
The proposed concept consists of a single agent that is only able to view a neighborhood of nodes at each timestep: it hops to the neighborhood where the heat is higher, lowers it using the Node Expansion technique, and processes a new neighborhood at the next timestep. In this way the agent has a limited view of the graph, which keeps the observation space contained, while the parameters of the same agent are updated and improved at each timestep. The pseudocode below sketches the general behavior of the RL agent, which corresponds to a large portion of the step function of the custom Gym environment.
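A hedged reconstruction of that step logic, assuming the node_heat, average_heat and expand_node helpers sketched earlier (the function signature itself is hypothetical):

```python
import math

def rl_step(graph, target, action, prev_avg_heat, start_avg_heat, ep_perc):
    """Hypothetical reconstruction of one environment step; relies on the
    node_heat / average_heat / expand_node helpers sketched earlier."""
    # 1. Priority node: the node with the highest heat in the current graph
    priority = max(graph.nodes(), key=lambda n: node_heat(graph, n, target))

    # 2. The action indexes one of the (up to) 10 nodes adjacent to the priority node
    neighbors = list(graph.neighbors(priority))[:10]
    selected = neighbors[min(action, len(neighbors) - 1)]

    # 3. Node Expansion: remove the shared arc and attach an auxiliary node
    expand_node(graph, priority, selected)

    # 4. Reward: log-ratio of the average heat before and after the expansion
    curr_avg_heat = average_heat(graph, target)
    reward = math.log(prev_avg_heat / curr_avg_heat)

    # 5. The episode terminates once the average heat dropped by the given percentage
    terminated = curr_avg_heat <= start_avg_heat * (1 - ep_perc)

    # 6. Next observation: heat values of the neighbors of the new priority node
    next_priority = max(graph.nodes(), key=lambda n: node_heat(graph, n, target))
    observation = [node_heat(graph, n, target) for n in list(graph.neighbors(next_priority))[:10]]
    return observation, reward, terminated, curr_avg_heat
```

The custom environment is configured through the following parameters: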
- Episode percentage: this parameter, expressed as a percentage of heat to reach, sets the limit that terminates the episode.
- Average heat: boolean parameter; if set to False, the heat of auxiliary nodes is excluded when calculating the average heat of the graph.
- Normalized heat: boolean parameter; if set to True, all heat values are normalized, otherwise their absolute values are kept.
For ease of understanding, let’s first consider a graph taken from the dataset of graphs composed of 15 nodes and, on average, 5 chains of 2 qubits. The RL agent takes as input the source graph shown below and starts forming initial chains at each timestep.
Once the episode is concluded, the resulting graph will be passed to the minorminer heuristic. The output produced will be the following:
Nodes having the same coloring belong to the same chain. Nodes labeled with an index represent the starting node of each chain, while nodes without an index represent nodes added to the chain. In this case the RL model added to the embedding of the 15 chains numbered from 0 to 14 a new chain, 16, composed of one node.
Now we proceed to assign the chain formed by the RL agent to the original node it should belong to.
We notice how the node with index 16 has been reassigned to node 0, changing color and now being part of node 0’s chain.
What the minorminer heuristic produces without the help of the RL model is the embedding below:
Assessing the result for this example, the RL model + heuristic performed better than the heuristic alone, needing 20 qubits compared to the 23 qubits needed by minorminer.
Another example showing the same steps but with bigger graphs is shown below.
Source graph passed to the RL model:
Output graph from the RL model + heuristic:
Assigning chains formed by the RL model:
Output graph from the minorminer heuristic alone:
The cumulative reward of the models just analyzed is shown in the figure below: a PPO model trained on dataset n30c20, a DQN model trained on dataset n30c20, and a PPO model trained on dataset n50c20.
We notice how the DQN model on n30c20 reaches a lower reward at convergence than PPO on n30c20, and how the PPO model trained on the n50c20 dataset apparently reaches a good reward in spite of the drop in embedding performance. This supports the idea that the algorithm managing the generation of auxiliary nodes does not scale well to large graphs: it creates nodes that make the minorminer heuristic underperform and use more qubits than necessary, even though the agent obtained a good reward and thus lowered the heat of the graph efficiently.
For an overview of the experimental setup, more in-depth quantitative and qualitative results, and future improvements, please refer to Chapter 4 of the Executive Summary or Chapter 5 of the complete thesis linked at the beginning of this README.