first commit with new neighborhood_connectivity function into _shap.py #66

LukasHats · 2025-01-06T13:08:53Z

As we discussed, @marcovarrone: I have not yet fully understood how the functions are organized in the package, but I thought it made sense to put this one into _shape.py.
The function can only be run after gr.connected_components. It takes the components stored in adata.obs and calculates, per image, how many cells from a neighborhood are inside a connected component. The library_key (e.g. image_ID) is required because the calculation is done per image. If users pass a condition, the different conditions are plotted as hue, but that is not strictly necessary. Users can also set show=False to get the dataframe (we can discuss whether to return the figure object rather than the df); the standard plotting function currently also returns the ax object and plots the graph. I am happy to adjust all of that.

It's also currently set to a violin plot, which makes sense if you have many images, but it's probably a bit odd if users have only one or a few images.

I don't know what else needs to be added to turn this into a function like cc.pl.neighborhood_connectivity, or whether you want to implement a test for it.

marcovarrone · 2025-01-12T21:07:01Z

Thank you for the pull request @LukasHats!
The way it currently works is that for every shape metric there is a function in cellcharter.tl (e.g., cellcharter.tl.linearity, now renamed cellcharter.tl.linearity_metric) that computes the metric for every component and stores the values under a key in a dictionary called shape_component inside adata.uns. The way it should be implemented is that you have a cellcharter.tl.nhood_connectivity_metric function that, similarly to linearity_metric, curl_metric, etc., computes the metric for every component and adds the corresponding key to the shape_component dictionary.

After that, the user should use the cc.pl.shape_metrics function to plot boxplots of the metric values. The shape_metrics function was quite convoluted and developed specifically for the purpose of the paper. With the commits that I added to the pull request I simplified it a bit, so it should now be more understandable.
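
A minimal sketch of the storage convention described in this comment, under stated assumptions: a plain dict stands in for adata.uns, `store_component_metric` is a hypothetical helper name (not a CellCharter function), and the metric values are made up for illustration.

```python
# Illustrative sketch only: a plain dict stands in for `adata.uns`, and
# `store_component_metric` is a hypothetical helper, not part of CellCharter.

def store_component_metric(uns, metric_name, values_by_component):
    """Store one value per connected component under uns['shape_component']."""
    shape_component = uns.setdefault("shape_component", {})
    shape_component[metric_name] = dict(values_by_component)
    return uns


uns = {}
# Made-up values: component id -> metric value.
store_component_metric(uns, "linearity", {0: 0.82, 1: 0.35})
store_component_metric(uns, "curl", {0: 0.10, 1: 0.47})

# Every metric is keyed by the same component ids, so a plotting function
# can collect the values of each metric and draw one box per metric.
print(sorted(uns["shape_component"]))  # ['curl', 'linearity']
```

The point of the layout is that any new metric only has to produce a `{component: value}` mapping to be picked up by the same plotting code.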

In summary, what needs to be done is:

By @LukasHats:
- Transfer the metric computation from cc.pl.neighborhood_connectivity to cc.tl.nhood_connectivity_metric and store the result as in the other metric functions.
By me:
- Add tests for the new version of the shape_metrics function and for nhood_connectivity_metric when complete
- Update the CODEX tutorial

I hope everything is clear! If you have any doubts, feel free to let me know :)

LukasHats · 2025-01-13T08:39:50Z

Perfect, I will do it this week @marcovarrone . Thanks a lot for rewriting the whole shape metric part for it!

Also now I understand how the shape metric used to work! Really like the new approach!

LukasHats · 2025-01-16T10:35:57Z

@marcovarrone I am having trouble integrating the nhood_connectivity score in the same way the other metrics work. Let's take the example of purity:

adata.uns['shape_component']['purity']

{2476: 0.6571428571428571,
 2331: 0.5389221556886228,
 1366: 0.7424242424242424,

So for each connected component you have a quantification of the metric.

However, the neighborhood_connectivity idea is different. It is rather a measure of how many cells in an image are located inside a connected component. Or, extended further: how many cells from a neighborhood are inside such a connected component, per image. So it's rather a meta-score, not something defined for each component.

So I cannot really deliver a metric per component, as is done for purity etc. The .uns entry would look different from how .uns['shape_component'] works. I know this is a big problem, as the plotting function needs that format.
Maybe we could set the same score for every component, but would that be a good idea? Just as an example:

adata.uns['shape_component']['nhood_connectivity']

{2476: 0.3,
 2331: 0.3,
 1366: 0.3,

This holds only if 2476, 2331 and 1366 are in the same image, of course. But I don't think this will give the score that I want to achieve with the connectivity score. Otherwise we would maybe need to create a new .uns entry and plotting function for the nhood_connectivity case.
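
To make the mismatch concrete, here is a small sketch of a per-image, per-neighborhood score and of what happens when it is copied onto components. It is an illustration under assumptions, not the code in this PR: a list of tuples stands in for adata.obs, `None` marks cells outside any component, and `connectivity_per_image` is a hypothetical name.

```python
from collections import defaultdict

# Toy stand-in for adata.obs: one (image, neighborhood, component) tuple per cell.
# component is None for cells that are not part of any connected component.
cells = [
    ("img1", "A", 10), ("img1", "A", 10), ("img1", "A", 11), ("img1", "A", None),
    ("img1", "B", 12), ("img1", "B", None),
]


def connectivity_per_image(cells):
    """Fraction of cells of each (image, neighborhood) that lie inside a component."""
    inside = defaultdict(int)
    total = defaultdict(int)
    for image, nhood, component in cells:
        total[(image, nhood)] += 1
        if component is not None:
            inside[(image, nhood)] += 1
    return {key: inside[key] / total[key] for key in total}


score = connectivity_per_image(cells)

# Copying the score onto components: every component of the same image and
# neighborhood receives the same value, so nothing is component-specific.
group_of = {c: (img, nh) for img, nh, c in cells if c is not None}
per_component = {c: score[g] for c, g in group_of.items()}
print(per_component)  # {10: 0.75, 11: 0.75, 12: 0.5}
```

Components 10 and 11 end up with identical values, which is why a box plot over components would carry no extra information for this score.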

marcovarrone · 2025-01-21T15:33:27Z

That's a very good point @LukasHats, I didn't realize that!

I see two possible solutions:

1. We find a way to combine the purity and neighborhood connectivity metrics. In this way, even if the neighborhood connectivity is the same for all components of the same domain, its combination with purity will still lead to a value that is unique for every component. This can make sense since purity and neighborhood connectivity represent complementary views of the same thing. However, I don't know how to combine them in a sensible way other than taking the mean of the two values.
2. We change the structure of adata.uns['shape_component'] and add a sort of key, entity=domain or entity=component, so that a metric related to domains is plotted as a box plot, while a metric related to components is shown as a single bar plot. It's not the most elegant solution and it may take some effort to implement the plotting part.

What do you think?

LukasHats · 2025-01-27T12:45:18Z

Thanks for your suggestions @marcovarrone, I will play around with both ideas in my dataset to see what the output might look like. For the purity idea, one would have to consider a biologically meaningful combination of both metrics.

LukasHats · 2025-02-03T15:04:40Z

To keep you posted @marcovarrone,

I realized that the approach I described above suffers from a general problem. If we use a specific min_cells threshold, the output of my approach (i.e. how many cells of a specific neighborhood reside in a connected component) will depend strongly on the abundance of that neighborhood in an image. Since the abundances of cells from different neighborhoods may differ hugely anyway, we can't use this approach to compare them.
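
The abundance problem can be illustrated with a toy calculation. This is a sketch under assumptions: component sizes are given directly as lists, `fraction_in_components` is a hypothetical helper, and components with fewer than min_cells cells are assumed to be discarded.

```python
def fraction_in_components(component_sizes, min_cells):
    """Fraction of cells that remain in components of at least `min_cells` cells."""
    total = sum(component_sizes)
    kept = sum(size for size in component_sizes if size >= min_cells)
    return kept / total


# Same relative spatial pattern, but a tenfold difference in abundance.
abundant = [400, 400, 200]  # 1000 cells of an abundant neighborhood
rare = [40, 40, 20]         # 100 cells of a rare neighborhood

print(fraction_in_components(abundant, min_cells=50))  # 1.0
print(fraction_in_components(rare, min_cells=50))      # 0.0
```

With a fixed threshold, the rare neighborhood scores 0 purely because it has few cells, not because its cells are more scattered.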

That's why I thought of a new approach together with a team member (https://github.com/gesavoigt/):
We set min_cells=1 in cc.gr.connected_components. This way we get all cells that reside in at least a minimal connected component. As an example:

adata[(adata.obs['image_ID'] == 'TS-373_IMC01_UB_001.csv') & (adata.obs['cellcharter_CN'] == 'bone_myeloid')].obs['component'].nunique()
105

Means we have 105 unique connected components in this specific image and neighborhood. Now we can get the total number of cells of that neighborhood by:

adata[(adata.obs['image_ID'] == 'TS-373_IMC01_UB_001.csv') & (adata.obs['cellcharter_CN'] == 'bone_myeloid')].obs['component'].count()
436

So in theory we have, on average, about 4 cells per connected component. This average is of course misleading, as there might be a lot of cells that are not connected at all. However, we can use the information in adata.obs['component'] to count the real number of cells that make up each unique connected component and plot the distribution. I will think of a good metric that summarizes this distribution.
For now, what we could at least report is the number of cells that make up each unique connected component (also if users use a different min_cells). This way we have a unique value for each component, an 'absolute cell number' that one can report. I will then think of a way to relate this to the actual neighborhoods.
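
A short sketch of why the naive average hides the real distribution, using made-up component labels for the cells of one neighborhood in one image (the numbers are not from the dataset above):

```python
from collections import Counter
from statistics import median

# Made-up component labels, one per cell (min_cells=1, so singletons count too).
labels = [0] * 8 + [1] + [2] + [3] * 2

sizes = Counter(labels)                 # cells per component
naive_mean = len(labels) / len(sizes)   # total cells / number of components

print(dict(sizes))             # {0: 8, 1: 1, 2: 1, 3: 2}
print(naive_mean)              # 3.0
print(median(sizes.values()))  # 1.5
```

The mean of 3 cells per component is driven by a single large component; most components here contain one or two cells, which is the information the per-component count keeps.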

LukasHats · 2025-02-09T14:12:04Z

Okay so @marcovarrone,

I came up with an approach now. Let's go through an example:

First, we run classic cellcharter (still old API):

cc.gr.connected_components(adata, cluster_key='cellcharter_CN', min_cells=50)
cc.tl.boundaries(adata, min_hole_area_ratio=0.1)
cc.tl.purity(adata, library_key='image_ID')

My idea is now to add an absolute cell count for each component (how many cells make up that component). As you said, we need one metric per component that is stored in the dictionary.

This could be, for example, the function cc.tl.component_cell_count:

count = adata.obs['component'].value_counts().to_dict()
adata.uns['shape_component']['count'] = count

We now get exactly a number per component:

adata.uns['shape_component']['count']

{2244: 9279,
 2139: 6197,
 2215: 5235,
 2135: 4877,
...

Now for plotting, we can come up with our own idea. We can first construct a dataframe, that holds all information we want for plotting/calculation. For example:

df = pd.DataFrame(adata.uns['shape_component']['count'].items(), columns=['component', 'count'])
df = pd.merge(df, adata.obs[['component', 'image_ID']].drop_duplicates().dropna(), on='component')
df = pd.merge(df, adata.obs[['component', 'cellcharter_CN']].drop_duplicates().dropna(), on='component')
df = pd.merge(df, adata.obs[['component', 'disease2']].drop_duplicates().dropna(), on='component')

counts = adata.obs.groupby(['image_ID', 'cellcharter_CN']).size().reset_index(name='total_neighborhood_cells_image')
df = df.merge(counts, on=['image_ID', 'cellcharter_CN'], how='left')

unique_counts = (
    adata.obs.groupby(["image_ID", "cellcharter_CN"])["component"]
    .nunique()
    .reset_index()
    .rename(columns={"component": "unique_components_neighborhood_image"})
)
df = df.merge(unique_counts, on=["image_ID", "cellcharter_CN"], how="left")

df

| component | count | image_ID                   | cellcharter_CN             | disease2 | total_neighborhood_cells_image | unique_components_neighborhood_image |
|-----------|-------|----------------------------|----------------------------|----------|--------------------------------|--------------------------------------|
| 2244      | 9279  | TS-373_IMC21_UB_001.csv    | focal_pc_oxphos            | MM_noBD  | 9317                           | 1                                    |
| 2139      | 6197  | TS-373_IMC71_B_002.csv     | focal_pc_oxphos            | MM_BD    | 6553                           | 4                                    |
| 2215      | 5235  | TS-373_IMC84_B_002.csv     | focal_pc_oxphos            | MM_BD    | 5443                           | 1                                    |
| 2135      | 4877  | TS-373_IMC50_B_002.csv     | focal_pc_oxphos            | MM_BD    | 5359                           | 3                                    |
| 2213      | 4870  | TS-373_IMC89_B_001.csv     | focal_pc_oxphos            | MM_BD    | 5054                           | 2                                    |
| ...       | ...   | ...                        | ...                        | ...      | ...                            | ...                                  |
| 1695      | 50    | TS-373_IMC77_B_002.csv     | bone_myeloid               | MM_BD    | 1004                           | 4                                    |
| 1694      | 50    | TS-373_IMC77_B_002.csv     | bone_myeloid               | MM_BD    | 1004                           | 4                                    |
| 266       | 50    | TS-373_IMC66_B_002.csv     | stroma_adipocyte           | MM_BD    | 831                            | 3                                    |
| 2709      | 50    | TS-373_IMC29_UB_001.csv    | proliferating_glycolytic   | MM_noBD  | 1508                           | 6                                    |
| 1052      | 50    | TS-373_IMC69_B_002.csv     | pc_myeloid                 | MM_BD    | 1711                           | 7                                    |

Now here we finally have a lot of information. We see for example that component 2244 has 9279 cells from the neighborhood called focal_pc_oxphos, and we have a total of 9317 cells of that neighborhood in that image. We can also see that there is only 1 unique connected component for this neighborhood in that image.

We now only need to find a good way of plotting this in a relative manner so we also address smaller neighborhoods (most likely we need to take into account the total_neighborhood_cells_image) and maybe also integrate the information about unique connected components.
What do you say to this approach? Any other suggestions?

marcovarrone · 2025-02-18T06:48:13Z

Hi @LukasHats, looks great!
The last remaining part now is to find a way to combine all this information preferably into a single metric.

We have to answer some questions about what the metric should capture before designing it:

1. can you say in a sentence what the metric should represent?
2. if it's related to how scattered are the cells in the neighborhood, imagine the extreme case in which there are multiple components in a slide and all the cells are inside components (there are no cells scattered around the sample). Then, should the metric be different if there is only one component or multiple ones in the sample?
3. Related to question 2, is it going to be a metric for components or a metric for neighborhoods?
4. it helps to reason for extremes. What would be the situation for which the metric is 0 and for which the metric is 1?

LukasHats · 2025-03-12T09:35:04Z

> Hi @LukasHats, looks great! The last remaining part now is to find a way to combine all this information preferably into a single metric.
>
> We have to answer some questions about what the metric should capture before designing it:
>
> 1. can you say in a sentence what the metric should represent?
> 2. if it's related to how scattered are the cells in the neighborhood, imagine the extreme case in which there are multiple components in a slide and all the cells are inside components (there are no cells scattered around the sample). Then, should the metric be different if there is only one component or multiple ones in the sample?
> 3. Related to question 2, is it going to be a metric for components or a metric for neighborhoods?
> 4. it helps to reason for extremes. What would be the situation for which the metric is 0 and for which the metric is 1?

@marcovarrone
So I finally came up with a metric that we can compute per component. I am using the dataframe mentioned above, generated by counting the cell numbers of each component.

Proposed metric: Normalized Component Contribution (NCC)

1. NCC = Component_Size / (Total_Neighborhood_Cells / Unique_Components), where both the total and the number of unique components refer to the neighborhood the component belongs to. The NCC compares a component's cell count to the average component size in its cellular neighborhood, indicating whether it is larger or smaller than expected given the neighborhood's total cells and component count.
2. Yes, the metric should be different if there is only one component or multiple ones per sample; see the equation in 1.
3. It's going to be a metric for components that also depends on the neighborhood labels, because it incorporates the average component size of the neighborhood the component comes from. That's why we need the neighborhood_key in the function I am implementing right now.
4. The NCC equals 1 when a component's cell count matches the average size for its cellular neighborhood (i.e., count = total_neighborhood_cells_image / unique_components_neighborhood_image), indicating typical clustering. The metric would theoretically reach 0 only if a component contained zero cells, which is impossible since components represent connected groups of cells. In practice, NCC approaches but never reaches 0, with smaller values indicating components far below the neighborhood average.

So it's a metric per component. I will now write the function to store it in the dictionary, just like the purity function etc. The only difference is that the user also needs to provide the CellCharter neighborhood label key! I am excited to implement it with you!

Edit (25.03.2025):
The good thing is that if users decide to plot different neighborhoods separately, the average for each neighborhood will still differ! That's what I wanted to achieve initially.
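
As a worked example of the NCC definition, the sketch below applies the formula to three rows of the dataframe shown earlier in this thread. The helper name `ncc` is hypothetical; only the numbers come from the table.

```python
def ncc(count, total_neighborhood_cells, unique_components):
    """Component size divided by the average component size of its neighborhood."""
    average_component_size = total_neighborhood_cells / unique_components
    return count / average_component_size


# (component, count, total_neighborhood_cells_image, unique_components_neighborhood_image)
rows = [
    (2244, 9279, 9317, 1),
    (2139, 6197, 6553, 4),
    (1695, 50, 1004, 4),
]

values = {component: ncc(count, total, n) for component, count, total, n in rows}
for component, value in values.items():
    print(component, round(value, 3))
```

Component 2244 lands just below 1 because it holds almost all cells of its neighborhood in an image with a single component, component 2139 is several times larger than the average implied by four components, and component 1695 is well below average.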

LukasHats · 2025-03-12T10:13:41Z

@marcovarrone I now added the function, it should work, tried it out in a jupyter notebook on my anndata. I will implement the plotting function later today/tomorrow!

marcovarrone · 2025-04-08T12:25:44Z

Hi @LukasHats, thank you very much for the contribution!
Glad to see that things are finally shaping up :)

I am quite busy at the moment, but I should be able to review it in 1-2 weeks!

marcovarrone · 2025-04-28T14:58:00Z

Hi @LukasHats sorry for taking a while.
I think the NCC metric looks very good!

In the last commit, I just replaced the name of the saved metric from component_counts to ncc since the value saved is the normalized version.

Thank you very much for contributing to CellCharter :)

EDIT: I was thinking about a more explanatory name for the metric. What do you think about "Relative component size (RCS)", i.e., how big is this component relative to the expected size?
The term "contribution" feels a little bit too generic for me.

LukasHats · 2025-04-28T18:35:01Z

Hey @marcovarrone !

Thanks for your help! Yes, I agree, RCS sounds better! I would add: "how big is this component relative to its neighborhood-expected size?", although of course a component is by definition always from one specific neighborhood. Feel free to rename it. I am happy and excited to contribute! Thanks for your help and input :)

marcovarrone · 2025-04-29T15:44:41Z

@LukasHats good point on the "neighborhood", that's very important.
I was writing the test for the function and created some toy examples, but one result doesn't match what I would expect. I think it's because I may be misinterpreting some things.

I created a dataset with cells divided into 66% neighborhood 0 and 33% neighborhood 1.

- Neighborhood 0 is composed of one component (component 0)
- Neighborhood 1 is split 50% into component 1 and 50% into component 2

These are the counts:

domain  component    count
0       0            46416
1       1            11604
1       2            11604

When I compute the relative size I get:

component 0: 1
component 1: 0.25
component 2: 0.25

While the value of component 0 is what I expect, I didn't expect that value for the others. Shouldn't it be 1 for both components 1 and 2? They all have the same size within the same neighborhood, so they should have a size equal to the expected.

LukasHats · 2025-04-29T19:41:40Z

Hey @marcovarrone ,

first of all, you are right, they should be 1. I am not quite sure how you implemented the test, but if this is your ground truth data:

domain  component    count
0       0            46416
1       1            11604
1       2            11604

You are lacking an essential column, which is needed for calculation, which is the unique_components_neighborhood_image in my function. Therefore the table should be:

domain  component    count       unique_components
0       0            46416       1
1       1            11604       2
1       2            11604       2

Technically, we also need the total_neighborhood_cells_image column, but in this example we can infer it as 2*11604.

Now if we apply the function (line 602 _shape.py):
df['ncc'] = df['count'] / df['total_neighborhood_cells_image'] / df['unique_components_neighborhood_image']

Therefore here:

count=11604
total_neighborhood_cells_image=2*11604
unique_components_neighborhood_image=2

11604/(2*11604)/2 = 1

But I don't have an idea how 0.25 should be a result. Did you use the implemented function and provided an example anndata?

EDIT: Oh I think I found the mistake. I forgot brackets inside the function. It should be:

df['ncc'] = df['count'] / (df['total_neighborhood_cells_image'] / df['unique_components_neighborhood_image'])

In our example:

11604/(2*11604)/2 = 0.25
11604/(2*11604/2) = 1

The brackets are important ... math ;)
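
The effect of the missing brackets can be checked directly with the toy counts from this thread. Division in Python is left-associative, so `a / b / c` is evaluated as `(a / b) / c`:

```python
count = 11604
total_neighborhood_cells_image = 2 * 11604
unique_components_neighborhood_image = 2

# Without brackets: (count / total) / n_components
without_brackets = count / total_neighborhood_cells_image / unique_components_neighborhood_image

# With brackets: count / (total / n_components), i.e. count / average component size
with_brackets = count / (total_neighborhood_cells_image / unique_components_neighborhood_image)

print(without_brackets)  # 0.25
print(with_brackets)     # 1.0

# Component 0 is alone in its neighborhood, so dividing by 1 hides the bug.
print(46416 / 46416 / 1, 46416 / (46416 / 1))  # 1.0 1.0
```

This also explains why component 0 looked correct in the test: with a single component per neighborhood both expressions give the same result.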

I just pushed the revised version to the PR and will be excited if this now fits with your tests. Sorry that I missed the brackets

EDIT2: Sorry, I changed one function back and forth and had a small merge conflict, but I resolved it.

marcovarrone · 2025-06-24T08:32:22Z

With an inexcusable delay, I finally merged the pull request! Thank you very much @LukasHats for all the work :)

I noticed that your implementation computed the average per sample by default.
I decided to remove this default, but the user can still enable it by setting library_key. I believe it's more likely that the user wants to compute the average size of a component across all samples and then check whether the sizes of the components in one sample (or condition) differ from those in another.
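
The difference between the two modes can be sketched as follows. This is a simplified illustration, not the merged implementation: `relative_component_size` is a hypothetical function, rows are plain dicts, and the average component size is computed only from cells that belong to components.

```python
from collections import defaultdict


def relative_component_size(rows, library_key=None):
    """Component size relative to the average size in its neighborhood.

    If `library_key` is None the average is pooled across all samples,
    otherwise it is computed separately per sample.
    """
    def group(row):
        if library_key is None:
            return (row["neighborhood"],)
        return (row["neighborhood"], row[library_key])

    total = defaultdict(int)
    n_components = defaultdict(int)
    for row in rows:
        total[group(row)] += row["count"]
        n_components[group(row)] += 1

    return {
        row["component"]: row["count"] / (total[group(row)] / n_components[group(row)])
        for row in rows
    }


rows = [
    {"component": 0, "neighborhood": "A", "sample": "s1", "count": 100},
    {"component": 1, "neighborhood": "A", "sample": "s1", "count": 100},
    {"component": 2, "neighborhood": "A", "sample": "s2", "count": 300},
    {"component": 3, "neighborhood": "A", "sample": "s2", "count": 300},
]

print(relative_component_size(rows))                        # pooled across samples
print(relative_component_size(rows, library_key="sample"))  # per sample
```

Pooled across samples, the components of s2 score 1.5 and those of s1 score 0.5, so the difference between samples is visible; per sample, every component scores 1.0 and that difference disappears.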

LukasHats · 2025-06-25T13:28:57Z

Hey @marcovarrone ,

no worries, I am happy that you put in that much work to make it happen. It's also fine to change it to non-default. Thanks for pushing it across the finish line! And in case of future errors/problems, please include me in the issue!

We are about to submit a paper in which CellCharter is one of the cornerstones and in which I used a measure similar to the RCS. Will keep you updated!

LukasHats and others added 9 commits December 23, 2024 17:20

first commit with new neighborhood_connectivity function into _shap.py

b6cd81b

[pre-commit.ci] auto fixes from pre-commit.com hooks

65b266d

for more information, see https://pre-commit.ci

Add metric suffix to shape metrics functions

bd742b0

Simplify shape metrics plotting

8f3304a

[pre-commit.ci] auto fixes from pre-commit.com hooks

63f5171

Fix shape metric plotting

63111b2

Remove print

4a86a80

Merge branch 'nhood_connectivity' of github.com:LukasHats/cellcharter into nhood_connectivity

231b612

[pre-commit.ci] auto fixes from pre-commit.com hooks

9c27daf

LukasHats and others added 2 commits March 12, 2025 11:12

implementation of the NCC metric. TODO: Plotting function

5b84831

[pre-commit.ci] auto fixes from pre-commit.com hooks

faef8b4

LukasHats and others added 2 commits March 25, 2025 12:44

Added the new 'ncc-metric' to the plotting function info. Plotting should already work as its already implemented in the .tl module similar to the existing metrics

b1bda0a

[pre-commit.ci] auto fixes from pre-commit.com hooks

f29d6fa

LukasHats marked this pull request as ready for review March 25, 2025 11:53

marcovarrone and others added 2 commits April 28, 2025 16:54

Plot shape metrics using subplots

080473c

[pre-commit.ci] auto fixes from pre-commit.com hooks

50b1de1

LukasHats and others added 10 commits April 29, 2025 22:32

added brackets to the function of normalized_component_contribution in _shape.py, adapted the neighborhood cell calculation to only incorporate cells from connected components

d715561

[pre-commit.ci] auto fixes from pre-commit.com hooks

4a4740f

reverting changes to total counting

b65ddba

solve merge conflict

fa4d22f

Rename Normalized Component Contribution (NCC) into Relative Component Size (RCS)

4ce624f

Add RCS tests

517d6cc

Fix obs names warning in test dataset

988ff85

Make library_key optional by default

26f6ecc

Remove useless comments

19e0310

Merge branch 'main' into nhood_connectivity

a2db831

marcovarrone merged commit 3d54819 into CSOgroup:main Jun 24, 2025
1 check was pending

LukasHats deleted the nhood_connectivity branch June 25, 2025 17:13
