
[Feature request] docs on chunked reading for large databases? #3702

@nick-youngblut

Description


Is your feature request related to a problem? Please describe.

For large tiledb-soma databases hosted on GCP or AWS, reading the obs dataframe can be quite slow (minutes) and can require a substantial amount of memory.

Describe the solution you'd like

I've implemented the following chunked-read approach:

from typing import Optional, List
import pandas as pd
import tiledbsoma

def get_obs(
        db_uri: str,
        columns: Optional[List[str]] = None,
        obs_query: Optional[tiledbsoma.AxisQuery] = None,
        axis: str = "RNA",
        group_by: Optional[List[str]] = None,
        agg_name: str = "count_all"
    ) -> pd.DataFrame:
    """
    Retrieve the obs table from a tiledbsoma Experiment.
    
    If group_by is provided, aggregated counts by the specified grouping columns are returned.
    Otherwise, the full obs subset with the specified columns is returned.
    
    Parameters:
      db_uri: URI for the database.
      columns: List of column names to retrieve (required if group_by is None).
      obs_query: Optional AxisQuery to filter the obs data.
      axis: The axis to query (default "RNA").
      group_by: Optional list of columns to group by for aggregation.
      agg_name: Name for the aggregated count column (default "count_all").
    
    Returns:
      A pandas DataFrame containing either the subset of obs data or the aggregated counts.
    """
    if group_by is not None:
        # When grouping, we only need the group_by columns.
        group_cols = group_by
        chunk_results: List[pd.DataFrame] = []
        with tiledbsoma.Experiment.open(db_uri) as exp:
            # Use an axis query if provided; otherwise read obs directly.
            if obs_query is not None:
                query = exp.axis_query(axis, obs_query=obs_query)
                try:
                    chunk_iter = query.obs(column_names=group_cols)
                except TypeError:
                    # Older tiledbsoma versions may not accept column_names here.
                    chunk_iter = query.obs()
            else:
                chunk_iter = exp.obs.read(column_names=group_cols)
            # Process each chunk.
            for chunk in chunk_iter:
                if chunk.num_rows == 0:
                    continue
                df_chunk = chunk.to_pandas()
                # Group and count within this chunk.
                df_group = (
                    df_chunk.groupby(group_cols, as_index=False)
                    .size()
                    .rename(columns={"size": agg_name})
                )
                chunk_results.append(df_group)
        if not chunk_results:
            return pd.DataFrame(columns=group_cols + [agg_name])
        # Combine chunk results and re-aggregate to get overall counts.
        df_all = pd.concat(chunk_results, ignore_index=True)
        df_final = df_all.groupby(group_cols, as_index=False)[agg_name].sum()
        return df_final.sort_values(by=agg_name, ascending=False)
    else:
        # Without grouping, ensure columns is provided.
        if columns is None:
            raise ValueError("When group_by is not specified, the 'columns' parameter must be provided.")
        chunks: List[pd.DataFrame] = []
        with tiledbsoma.Experiment.open(db_uri) as exp:
            if obs_query is not None:
                query = exp.axis_query(axis, obs_query=obs_query)
                try:
                    chunk_iter = query.obs(column_names=columns)
                except TypeError:
                    # Older tiledbsoma versions may not accept column_names here.
                    chunk_iter = query.obs()
            else:
                chunk_iter = exp.obs.read(column_names=columns)
            for chunk in chunk_iter:
                if chunk.num_rows == 0:
                    continue
                df_chunk = chunk.to_pandas()[columns]
                chunks.append(df_chunk)
        if not chunks:
            return pd.DataFrame(columns=columns)
        return pd.concat(chunks, ignore_index=True)
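The key property the group_by branch relies on is that per-chunk group counts can be summed afterwards to recover the exact global counts, so no chunk ever has to hold the full obs table. A minimal pandas-only sketch of that two-pass pattern (toy dataframes standing in for the Arrow chunks yielded by obs.read()):

```python
import pandas as pd

# Toy chunks standing in for the Arrow batches yielded by obs.read().
chunks = [
    pd.DataFrame({"cell_type": ["T", "T", "B"]}),
    pd.DataFrame({"cell_type": ["B", "T"]}),
]

# Pass 1: count within each chunk (memory bounded by one chunk).
partials = [
    c.groupby("cell_type", as_index=False)
    .size()
    .rename(columns={"size": "count_all"})
    for c in chunks
]

# Pass 2: sum the partial counts to recover the global group counts.
totals = pd.concat(partials, ignore_index=True).groupby(
    "cell_type", as_index=False
)["count_all"].sum()

print(totals.sort_values("count_all", ascending=False))
```

This works because counting is associative: summing per-chunk counts gives the same result as one groupby over the concatenated data, but only the small partial tables are ever kept in memory.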

This approach seems to substantially reduce query time and memory usage. Maybe something similar could be implemented in the package? Or maybe it already is, and I've just missed that code?

At the very least, it might be helpful for the docs to point out alternatives to materializing everything with .concat(), as in:

obs = experiment.obs
table = obs.read().concat()
table

...for large databases.
