Runnable utilities for a multi-view extension of Active Learning for Graph Embedding (AGE; Cai, Zheng, and Chang, 2017) using Multi-view Graph Convolutional Networks with Attention Mechanism (MAGCN; Yao, Liang, Liang, Li, and Cao, 2022) and Multiplex PageRank (Halu, Mondragon, Panzarasa, and Bianconi, 2013). See Core References.
Use this method when you want strong classification performance on large datasets while keeping human labeling cost low. MV_AGE prioritizes the next most useful unlabeled item by combining model uncertainty, MAGCN embedding-space density, and Multiplex PageRank centrality across ingredient and nutrient views, so each label can improve the classifier efficiently.
MV_AGE uses the AGE active-learning scoring method, adapted for multi-view data. The centrality term comes from Multiplex PageRank over the ingredient and nutrient graph views, and the density term comes from the network embeddings after MAGCN fuses those views.
Download or clone this repository before using MV_AGE. Run the commands below from the repository root so Python can find the local mv_age/ source folder.
The prepared project is meant to be ready for food-item labeling. During labeling, each terminal round shows the next food item name and its ingredient list, then prompts for the correct class label. With the standard food_names.csv file, the food_name values are shown alongside the ingredient text so the item is easier to identify.
The workflow is a simple two-step process:
- put input CSV files in the repository's input_data/ folder
- prepare a project, then start labeling
The labeling loop queries one item at a time, which keeps the exact single-sample active-learning behavior described in the AGE paper.
Figure: AGE active-labeling framework from Cai, Zheng, and Chang (2017), Active Learning for Graph Embedding.
Figure: MAGCN multi-view attention architecture from Yao, Liang, Liang, Li, and Cao (2022), Multi-view graph convolutional networks with attention mechanism.
- Python 3.10 or 3.11
- MATLAB available from the command line
- the Python dependencies from requirements.txt
Python 3.12 is not currently advertised because the TensorFlow 2.15 runtime used by the exact MAGCN backend does not provide compatible wheels across the supported target platforms. On Windows, the runtime depends on tensorflow-intel; on Linux and macOS it depends on tensorflow.
The default MATLAB command is matlab.
If needed, set a different command with:
set MATLAB_CMD=matlab
or pass --matlab-cmd to the prepare command.
--matlab-cmd names the MATLAB executable or cluster wrapper command. Use it when the default matlab command is not the right command for your environment.
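The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code: the helper names `resolve_matlab_cmd` and `matlab_on_path` are hypothetical, and the precedence (flag first, then the `MATLAB_CMD` environment variable, then the default `matlab`) is an assumption.

```python
import os
import shlex
import shutil


def resolve_matlab_cmd(cli_value=None, env=None):
    """Pick the MATLAB command (hypothetical helper).

    ASSUMPTION: --matlab-cmd wins over MATLAB_CMD, which wins over "matlab".
    """
    env = os.environ if env is None else env
    return cli_value or env.get("MATLAB_CMD") or "matlab"


def matlab_on_path(cmd):
    """Return True if the first token of the command is an executable on PATH."""
    tokens = shlex.split(cmd)
    return bool(tokens) and shutil.which(tokens[0]) is not None


if __name__ == "__main__":
    cmd = resolve_matlab_cmd()
    print(cmd, "found" if matlab_on_path(cmd) else "not found on PATH")
```

Checking only the first token keeps the check usable when the command is a cluster wrapper followed by arguments.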
Users provide four CSV files:
- ingredients.csv. Required columns: item_id, ingredient_list
- nutrients.csv. Required columns: item_id, plus one or more numeric nutrient columns
- initial_labels.csv. Required columns: item_id, label
- food_names.csv. Required columns: item_id, food_name
Notes:
- initial_labels.csv may contain only the initially labeled subset, or all rows with blanks for unlabeled items.
- zero-label starts are not supported; the exact AGE/MAGCN workflow needs starting labels.
- the starting labels should include at least one example for every class MV_AGE should use.
- food_names.csv is required for project preparation and is used for display during labeling. The model still encodes the ingredient text.
- optionally, provide a full_labels.csv with item_id and true_label to report prediction accuracy during evaluation or demo runs
- the easiest workflow is to put those files in the included input_data/ folder using the standard filenames above, then run prepare --input-dir input_data
- the default sentence encoder is all-MiniLM-L6-v2
- the default sentence-encoder device is auto; it uses cuda when PyTorch can see a GPU, otherwise cpu
- the code creates the ingredient embeddings and graph files during prepare
- later labels must come from the same class set that appears in the starting labels
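Before running prepare, it can help to check the input folder against the column requirements above. The following is a minimal standard-library sketch, not part of the package: `check_inputs` and `REQUIRED` are hypothetical names, and the check covers only file presence, required columns, and the existence of at least one starting label. It does not verify that nutrient columns are numeric.

```python
import csv
from pathlib import Path

# Required columns per file, taken from the list above.
REQUIRED = {
    "ingredients.csv": {"item_id", "ingredient_list"},
    "nutrients.csv": {"item_id"},
    "initial_labels.csv": {"item_id", "label"},
    "food_names.csv": {"item_id", "food_name"},
}


def check_inputs(input_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, required in REQUIRED.items():
        path = Path(input_dir) / name
        if not path.is_file():
            problems.append(f"{name}: file not found")
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            header = set(reader.fieldnames or [])
            rows = list(reader)
        missing = required - header
        if missing:
            problems.append(f"{name}: missing columns {sorted(missing)}")
            continue
        if name == "nutrients.csv" and not (header - {"item_id"}):
            problems.append(f"{name}: needs at least one nutrient column")
        if name == "initial_labels.csv":
            # Blank labels are allowed, but zero-label starts are not.
            if not any((row.get("label") or "").strip() for row in rows):
                problems.append(f"{name}: no starting labels found")
    return problems


if __name__ == "__main__":
    for problem in check_inputs("input_data"):
        print(problem)
```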
From the repository root:
pip install -r requirements.txt
GPU setup is for the sentence-embedding step that encodes ingredient text. On Linux clusters, PyTorch and TensorFlow may need extra environment-specific setup. To verify both frameworks after install:
python - <<'PY'
import torch, tensorflow as tf
print("torch cuda available:", torch.cuda.is_available())
print("tf gpus:", tf.config.list_physical_devices("GPU"))
PY
To require GPU for sentence embeddings, add --device cuda to the prepare command. That fails fast with a clear error if PyTorch cannot access CUDA. To force CPU, pass --device cpu.
--device controls only the sentence-transformers encoder used to embed ingredient text. The default is auto, which uses cuda when PyTorch can see a GPU and otherwise uses cpu.
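The device rule can be written as a small function. This is a sketch of the behavior described here, not the package's implementation: `resolve_device` is a hypothetical name, and the `cuda_available` parameter exists only so the logic can be exercised without PyTorch installed.

```python
def resolve_device(requested="auto", cuda_available=None):
    """Map a --device value to "cuda" or "cpu" (hypothetical helper)."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    if cuda_available is None:
        try:
            import torch  # imported lazily so the sketch runs without PyTorch
            cuda_available = torch.cuda.is_available()
        except Exception:  # missing package or a CUDA/NCCL load error
            cuda_available = False
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        # Fail fast instead of silently falling back to CPU.
        raise RuntimeError("--device cuda requested, but PyTorch cannot access CUDA")
    return requested


if __name__ == "__main__":
    print(resolve_device("auto"))
```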
If torch cuda available is False or import torch fails with a CUDA or NCCL error, reinstall a matching PyTorch build for the machine using the official commands from:
https://docs.pytorch.org/get-started/previous-versions/
tensorflow 2.15.x also requires numpy < 2.0, and the requirements files pin that automatically. If the environment already has numpy 2.x, reinstall the dependencies so pip can bring numpy back into a TensorFlow-compatible range.
Add your CSV files to the included input_data/ folder:
input_data/
ingredients.csv
nutrients.csv
initial_labels.csv
food_names.csv
full_labels.csv # optional
Those files must follow the required columns listed above.
python -m mv_age prepare --input-dir input_data --project my_project
Arguments:
- --input-dir input_data tells MV_AGE to read the standard CSV filenames from the input_data/ folder.
- --project my_project tells MV_AGE where to save the prepared project files.
This step:
- encodes ingredient lists with all-MiniLM-L6-v2
- prepares the nutrient feature matrix
- builds the two exact kNN graph views
- runs MATLAB Multiplex PageRank
- saves the project folder
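To make the graph-building step concrete, here is a minimal brute-force kNN sketch in NumPy. It is illustrative only: the function name `knn_graph`, the cosine similarity metric, and the symmetrization step are all assumptions, and the result is a dense array where the project stores sparse .npz graphs.

```python
import numpy as np


def knn_graph(features, k):
    """Exact (brute-force) kNN adjacency matrix, symmetrized.

    ASSUMPTIONS: cosine similarity, and an edge is kept if either
    endpoint lists the other among its k nearest neighbors.
    Requires k < number of rows.
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[0]
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.maximum(norms, 1e-12)        # unit-length rows
    sim = x @ x.T                           # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1.0
    return np.maximum(adj, adj.T)           # make the graph undirected


if __name__ == "__main__":
    demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    print(knn_graph(demo, k=1))
```

The same function would be applied once to the ingredient embeddings and once to the nutrient features to get the two views.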
python -m mv_age label --project my_project --rounds 3
Arguments:
- --project my_project tells MV_AGE which prepared project folder to open.
- --rounds 3 runs three labeling rounds in this terminal session. Each round asks for one label.
Each labeling round follows the multi-view AGE workflow:
- one training epoch
- one queried item
- user enters one label
- repeat
Each round prints the allowed label options before prompting for input.
For a compact 48-row demo dataset to test the CLI before using project data, use the same prepare command shape:
python -m mv_age prepare --input-dir demo --project example_project
python -m mv_age label --project example_project --rounds 3
Arguments:
- --input-dir demo uses the bundled demo CSV files when the demo folder is missing or empty.
- --project example_project saves the demo project in example_project, then opens that same project for labeling.
- --rounds 3 runs three demo labeling rounds.
If demo is missing or empty, prepare falls back to the bundled demo CSV files. If the folder contains ingredients.csv, nutrients.csv, initial_labels.csv, and food_names.csv, those are used instead. Partial folders fail with an error so the code does not silently mix user data with demo data. The same missing-folder demo fallback also works with demo_inputs.
The bundled demo keeps 16 starting labels, four per class, so the workflow has all classes available at initialization.
By default, the labeling screen shows:
- example predictions
- current best class guesses for each queried item
- when full_labels.csv was included during prepare, current prediction accuracy on the available truth labels
For a cleaner human-labeling screen without guesses, use:
python -m mv_age label --project my_project --rounds 3 --hide-guesses
Arguments:
- --hide-guesses hides example predictions and class-probability guesses during labeling. The food item name and ingredient list are still shown.
- --project my_project and --rounds 3 have the same meanings as in the labeling command above.
The bundled four-class demo uses these branded food category groups:
- cat1: Pepperoni, Salami & Cold Cuts; Sausages, Hotdogs & Brats
- cat2: Canned Soup; Other Soups
- cat3: Cookies & Biscuits; Crackers & Biscotti
- cat4: Cakes, Cupcakes, Snack Cakes; Croissants, Sweet Rolls, Muffins & Other Pastries
MV_AGE scores unlabeled food items with:
- uncertainty from class-probability entropy
- density from the post-fusion MAGCN embedding space
- centrality from Multiplex PageRank over the ingredient and nutrient graph views
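The three terms above can be combined in the style of AGE, which ranks each term by percentile and takes a weighted sum. The sketch below is a rough illustration under stated assumptions, not the package's scoring code: the function names are hypothetical, the cluster centers are passed in rather than fitted, and the deterministic weight schedule `gamma = basef ** epoch` with the remaining weight split evenly is an assumption about how a time-sensitive schedule could work.

```python
import numpy as np


def percentile_rank(values):
    """Rank of each value scaled to (0, 1]; ties are broken by position."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks / len(values)


def entropy(probs):
    """Row-wise entropy of class probabilities (higher = more uncertain)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)


def density(embeddings, centers):
    """1 / (1 + distance to the nearest cluster center)."""
    diffs = embeddings[:, None, :] - centers[None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    return 1.0 / (1.0 + nearest)


def age_scores(probs, embeddings, centers, centrality, epoch, basef=0.995):
    # ASSUMPTION: centrality weight decays as basef ** epoch, and the
    # remaining weight is split evenly between entropy and density.
    gamma = basef ** epoch
    alpha = beta = (1.0 - gamma) / 2.0
    return (alpha * percentile_rank(entropy(probs))
            + beta * percentile_rank(density(embeddings, centers))
            + gamma * percentile_rank(centrality))
```

In a single-sample loop, the item to query would be the unlabeled item with the highest score, for example `int(np.argmax(scores))` over the unlabeled subset.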
The package defaults are:
- Graph construction: graph_k_view = 30
- MAGCN hidden size: hidden1 = 32
- MAGCN learning rate: learning_rate = 0.01
- MAGCN weight decay: weight_decay = 5e-4
- MAGCN dropout: dropout = 0.5
- AGE time-sensitive schedule: age_basef = 0.995
- Multiplex PageRank alpha: mpr_alpha = 0.85
- Multiplex PageRank beta: mpr_beta = 1.0
- Multiplex PageRank gamma: mpr_gamma = 1.0
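To show what the alpha, beta, and gamma parameters control, here is a small NumPy sketch of two-layer Multiplex PageRank in the form given by Halu et al. (2013): the PageRank of the first layer biases both the random walk (exponent beta) and the teleportation (exponent gamma) on the second layer. The package itself runs a MATLAB implementation; this Python version is only an illustration, the function names are hypothetical, and which view (ingredient or nutrient) acts as the first layer is an assumption left to the caller.

```python
import numpy as np


def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Plain PageRank by power iteration; adj[j, i] = 1 for an edge j -> i."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out = adj.sum(axis=1)
    out[out == 0] = 1.0                      # avoid division by zero
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = alpha * (adj.T @ (x / out)) + (1.0 - alpha) / n
        done = np.abs(new - x).sum() < tol
        x = new
        if done:
            break
    return x


def multiplex_pagerank(adj_a, adj_b, alpha=0.85, beta=1.0, gamma=1.0,
                       tol=1e-12, max_iter=1000):
    """Rank nodes in layer B, biased by their PageRank in layer A."""
    adj_b = np.asarray(adj_b, dtype=float)
    n = adj_b.shape[0]
    x = pagerank(adj_a, alpha, tol, max_iter)
    bias = x ** beta                         # walk bias toward central nodes
    norm = adj_b @ bias                      # G_j = sum_r B[j, r] * x_r ** beta
    norm[norm == 0] = 1.0
    teleport = (1.0 - alpha) * x ** gamma / np.sum(x ** gamma)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = alpha * bias * (adj_b.T @ (rank / norm)) + teleport
        done = np.abs(new - rank).sum() < tol
        rank = new
        if done:
            break
    return rank
```

With beta = gamma = 1.0, as in the defaults above, both the walk and the teleportation are biased by the first layer. Setting both to 0 reduces the second pass to ordinary PageRank on layer B.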
- Cai, H., Zheng, V. W., and Chang, K. C.-C. (2017). Active Learning for Graph Embedding. arXiv preprint arXiv:1705.05085. https://arxiv.org/abs/1705.05085
- Yao, K., Liang, J., Liang, J., Li, M., and Cao, F. (2022). Multi-view graph convolutional networks with attention mechanism. Artificial Intelligence, 307, 103708. https://doi.org/10.1016/j.artint.2022.103708
- Halu, A., Mondragon, R. J., Panzarasa, P., and Bianconi, G. (2013). Multiplex PageRank. PLOS ONE, 8(10), e78293. https://doi.org/10.1371/journal.pone.0078293
From a clean checkout, the lightweight unit suite can be run without MATLAB, TensorFlow, or sentence-transformer model downloads:
pip install numpy pandas scipy scikit-learn
python -m unittest discover -s tests -v
These checks are also captured in .github/workflows/tests.yml for the standalone repository.
This repository includes vendored research code for the exact backend. The repository license note uses GPL-3.0-or-later because the bundled Multiplex PageRank MATLAB implementation is GPL-3.0-or-later. See THIRD_PARTY_NOTICES.md before publishing or redistributing the code.
Each labeling round shows:
- current label counts
- labeled percentage
- label options that are allowed for this project
- current model accuracy, if the project was prepared with truth labels
- a few example predictions, unless --hide-guesses is used
- the next item to label
- the food item name
- the ingredient list text
- current class-probability guesses, unless --hide-guesses is used
The project folder stores:
- foods.csv - merged working food table with current labels and optional truth labels
- ingredient_embeddings.npy - sentence-embedding matrix for the ingredient text
- nutrient_features.npy - numeric nutrient feature matrix
- x_dense.npy - combined ingredient and nutrient features used by MAGCN
- ingredient_graph.npz - sparse kNN graph built from ingredient embeddings
- nutrient_graph.npz - sparse kNN graph built from nutrient features
- centrality_values.npy - Multiplex PageRank centrality score for each item
- metadata.json - project configuration, column names, class names, and AGE settings
- state.json - labeling progress, checkpoint round, and random-number generator state
- query_history.csv - log of queried items, entered labels, and AGE scoring values
- predictions.csv - most recent prediction and query-score table
- checkpoints/ - TensorFlow checkpoint files used to resume the model state

