---
name: pygraphistry-ai
description: >-
  PyGraphistry graph ML/AI: UMAP, DBSCAN, embeddings, and anomaly detection
  workflows. Use when asked to "run UMAP on my graph", "cluster nodes",
  "find anomalies in my network data", "embed nodes", "fit-transform pipeline",
  "semantic search over graph nodes", or "graph AI". Also triggers on
  "graphistry umap", "dbscan clusters", "node embeddings", "featurize", or
  "anomaly triage". Proactively suggest when the user has node feature columns
  and asks about outliers, clusters, or similarity without yet using UMAP or
  DBSCAN.
---

PyGraphistry AI

Doc routing (local + canonical)

  • First route with ../pygraphistry/references/pygraphistry-readthedocs-toc.md.
  • Use ../pygraphistry/references/pygraphistry-readthedocs-top-level.tsv for section-level shortcuts.
  • Only scan ../pygraphistry/references/pygraphistry-readthedocs-sitemap.xml when a needed page is missing.
  • Use one batched discovery read before deep-page reads; avoid cat * and serial micro-reads.
  • In user-facing answers, prefer canonical https://pygraphistry.readthedocs.io/en/latest/... links.

Typical workflow

  1. Build graph from nodes/edges.
  2. Run feature/embedding method (umap, embed, optional dbscan).
  3. Inspect derived columns/features and visualize.
  4. Iterate on feature columns and sampling strategy.
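Steps 1 and 4 above are plain dataframe work before PyGraphistry enters the picture. A minimal sketch of preparing an explicit, reproducible feature list (the table and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical node table: one row per entity, numeric feature columns plus metadata
df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'f1': [0.1, 0.9, 0.2, 0.8],
    'f2': [1.0, 0.0, 1.1, 0.1],
    'label': ['x', 'y', 'x', 'y'],  # non-feature metadata, kept out of X
})

# Keep an explicit feature list rather than "everything numeric";
# this is what gets passed as X=... in the umap/featurize calls below
feature_cols = ['f1', 'f2']
X = df[feature_cols]
```

Iterating on the feature list then means editing `feature_cols` in one place, not re-deriving columns per run.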

Baseline examples

```python
import graphistry  # assumes graphistry.register(...) has been called for .plot()

# Similarity embedding / projection
g2 = graphistry.nodes(df, 'id').umap(X=['f1', 'f2', 'f3'])
g2.plot()

# Fit/transform flow for consistent projection on new batches
g_train = graphistry.nodes(df_train, 'id').umap(X=['f1', 'f2'])
g_batch = g_train.transform_umap(df_batch, return_graph=True)
g_batch.plot()

# Semantic search over embedded features
g2 = graphistry.nodes(df, 'id').umap(X=['text_col'])
results_df, query_vector = g2.search('suspicious login pattern')

# Text-first workflow: featurize then search/cluster
g2 = (
    graphistry.nodes(df, 'id')
    .featurize(kind='nodes', X=['title', 'body'])
    .umap(kind='nodes')
    .dbscan()
)
hits, qv = g2.search('credential stuffing campaign')

# Precomputed embedding columns
embedding_cols = [c for c in df.columns if c.startswith('emb_')]
g2 = graphistry.nodes(df, 'id').umap(X=embedding_cols)
g_new = g2.transform_umap(df_new, return_graph=True)
```
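After dbscan(), cluster labels land in the node table, with -1 marking noise points per the usual DBSCAN convention. A sketch of triaging those labels, using a stand-in series since the exact derived column name (e.g. _dbscan on g2._nodes) can vary by PyGraphistry version:

```python
import pandas as pd

# Stand-in for the cluster-label column dbscan() writes to the node table
# (assumed name '_dbscan'; -1 is DBSCAN's conventional noise label)
labels = pd.Series([0, 0, 1, -1, 1, 0, -1], name='_dbscan')

cluster_sizes = labels[labels != -1].value_counts()  # real clusters only
noise_frac = (labels == -1).mean()                   # candidate anomalies

print(cluster_sizes.to_dict())   # cluster id -> member count
print(round(noise_frac, 3))      # share of nodes flagged as noise
```

A high noise fraction usually means eps/min-samples need tuning before the noise points are worth triaging as anomalies.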

Practical guardrails

  • Start with small/representative samples before full runs.
  • Keep explicit feature lists (X=...) for reproducibility.
  • Track engine/dataframe type for CPU vs GPU behavior.
  • For anomaly workflows, document thresholds and false-positive assumptions.
  • For graph ML tasks, route deeper model workflows to RGCN/link-prediction references.
  • For text workflows, prefer featurize(...).umap(...).search(...) when queries are natural language.
  • If users already have embeddings, reuse them via explicit embedding column lists (X=[...]) before recomputing.
  • When user asks for a concise workflow snippet, prefer one short code block and avoid long narrative wrappers.
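The first guardrail is cheap to apply before any fit. A sketch of a reproducible down-sample (frame, sample size, and seed are illustrative):

```python
import pandas as pd

# Hypothetical large node table
df = pd.DataFrame({'id': range(100_000), 'f1': range(100_000)})

# Reproducible small sample for a first umap/dbscan pass;
# rerun on the full frame only once parameters look sane
df_small = df.sample(n=5_000, random_state=42)

# g_small = graphistry.nodes(df_small, 'id').umap(X=['f1'])  # first pass
```

The fixed random_state keeps the sample stable across iterations, so parameter changes, not sampling noise, explain differences between runs.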

Canonical docs