Skip to content

first commit with new neighborhood_connectivity function into _shap.py #66

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

LukasHats
Copy link

@LukasHats LukasHats commented Jan 6, 2025

As we discussed here @marcovarrone : So I did not yet fully understand how the functions are built in the package, but I thought it might make sense to put it into _shape.py.
The function can only be run after using gr.connected_components and takes the adata.obs components and calculates per image how many cells from a neighborhood are inside a connected component. The library_key (e.g. image_ID) is necessary as we have to calculate that per image. If users input a condition, it will plot the different conditions as hue, but that's not strictly necessary. Also users can set show=False to get the dataframe (here we can discuss if we want to return the figure object rather than the df), however the standard plotting function currently also gives the ax object and plots the graph. But happy to adjust all of that.

Its also currently set to violinplot, which makes sense if you have many images. But its probably a bit odd if users have only 1 or a few images

I don't know what else needs to be added so this will turn into a function like cc.pl.neighborhood_connectivity and if you want to implement a test for it.

@marcovarrone
Copy link
Collaborator

Thank you for the pull request @LukasHats!
The way it currently works is that for every shape metric, there is a function in cellcharter.tl, (e.g., cellcharter.tl.linarity, now renamed cellcharter.tl.linearity_metric) that computes the metric for every component and stores them as a key in a dictionary called shape_componentinsideadata.uns. The way it should be implemented is that you have a cellcharter.tl.nhood_connectivity_metricfunction that, similarly tolinearity_metric, curl_metric, etc..., computes the metric for every component and adds the relative key to the shape_component` dictionary.

After that, the user should use the cc.pl.shape_metrics function to plot the boxplots of the metric values. The function shape_metrics function was quite convoluted and developed specifically for the purpose of the paper. With the commits that I added to the pull requests I simplified it a bit, so now it should be more understandable.

In summary, what need to be done is:

  • By @LukasHats:

    • Transfer the metric computation from cc.pl.neighborhood_connectivity to cc.tl.nhood_connectivity_metric and store the result as in the other metric functions.
  • By me:

    • Add tests to the new version of theshape_metrics function and the nhood_connectivity_metric when complete
    • Update the CODEX tutorial

I hope everything is clear! If you have any doubts, feel free to let me know :)

@LukasHats
Copy link
Author

LukasHats commented Jan 13, 2025

Perfect, I will do it this week @marcovarrone . Thanks a lot for rewriting the whole shape metric part for it!

Also now I understand how the shape metric used to work! Really like the new approach!

@LukasHats
Copy link
Author

LukasHats commented Jan 16, 2025

@marcovarrone I am having trouble of integrating the nhood_connectivity score in the same way as the other metrics work. Lets take the example of purity:

adata.uns['shape_component']['purity']

{2476: 0.6571428571428571,
 2331: 0.5389221556886228,
 1366: 0.7424242424242424,

So for each connected component you have a quantification of the metric.

However, the neighborhood_connectivity idea is different from that. It's rather a measure of how many cells in an image are located inside a connected component or not. Or further extended: how many cells from a neighborhood are inside such a connected component or not, per image. So its rather a meta_score, not something for each component.

So I can not really deliver a metric per component, as its done for purity etc. The .uns would look different from how the .uns['shape_component'] works. I know this is a huge problem as the plotting function needs that format.
Maybe we could set the same score for every component, but would that be an idea? E.g. just as an example:

adata.uns['shape_component']['nhood_connectivity']

{2476: 0.3,
 2331: 0.3
 1366: 0.3,

Only if 2476, 2331, 1366 are in the same image of course. But I don't think this will give the results/score that I want toa achieve with the connectivity score. Otherwise we would maybe need to create a new uns and plotting function for the nhood_connectivity case

@marcovarrone
Copy link
Collaborator

That's a very good point @LukasHats, I didn't realize that!

I see two possible solutions:

  • We find a way to combine the purity and neighborhood connectivity metrics. In this way, even if the neighborhood connectivity for the components of the same domains is the same, its combination with purity will still lead to a value that is unique for every component. This can make sense since purity and neighborhood connectivity represent complementary views of the same thing. However, I don't know how to combine them in a sensible way rather than taking the mean between the two values.
  • We change the structure of adata.uns['shape_component'] and add a sort of key entity=domain or entity=component so that if it is a metric related to domains is plotted as a box plot, but if it's a metric related to components, it's showed as a single bar plot. It's not the most elegant solution and it may take some effort to implement the plotting part.

What do you think?

@LukasHats
Copy link
Author

Thanks for your suggestions @marcovarrone, I will play around with both ideas in my dataset to see what the output might look like. For the purity idea, one would have to consider a biologically meaningful combination of both metrics.

@LukasHats
Copy link
Author

To keep you posted @marcovarrone,

I realized that the approach I described above suffers from a general problem. If we use a specific threshold of min_cells, the output of my approach (i.e. how many cells of a specific neighborhood reside in a connected component or not) will highly depend on the abundance of a specific neighborhood in an image. If we anyway might have a huge difference in abundances of cells from different neighborhood, we cant use this approach to compare them.

Thats why I thought of a new approach together with a teammember (https://github.com/gesavoigt/):
We set min_cells=1 in cc.gr.connected_components. This way we will get all cells that reside at least in a minimal connected component. As an example:

adata[(adata.obs['image_ID'] == 'TS-373_IMC01_UB_001.csv') & (adata.obs['cellcharter_CN'] == 'bone_myeloid')].obs['component'].nunique()
105

Means we have 105 unique connected components in this specific image and neighborhood. Now we can get the total number of cells of that neighborhood by:

adata[(adata.obs['image_ID'] == 'TS-373_IMC01_UB_001.csv') & (adata.obs['cellcharter_CN'] == 'bone_myeloid')].obs['component'].count()
436

So theoretically on average, we have 4 cells per connected components. Which is of course wrong as there might be a lot of cells that are not connected at all. However, we can use the information in adata.obs['component'] to count the real number of cells that make up a unique connected component and plot the distribution of this. I will think of a good metric that represents this distribution and might give us a good representation.
But for now what we could at least report is the number of cells that make up a unique connected component (also if users use different min_cells). This way we have a unique value for each component, which is now 'Absolute cell number' that one can report. I will then think of a way on how to relate this to the actual neighborhoods

@LukasHats
Copy link
Author

LukasHats commented Feb 9, 2025

Okay so @marcovarrone,

I came up with an approach now. Lets go through an example:

First, we run classic cellcharter (still old API):

cc.gr.connected_components(adata, cluster_key='cellcharter_CN', min_cells=50)
cc.tl.boundaries(adata, min_hole_area_ratio=0.1)
cc.tl.purity(adata, library_key='image_ID')

My idea is now to add an absolute cell_count for each component (how many cells make up that component). As you said, we need 1 metric per component that is stored in the dictionary:

so this could be for example the function cc.tl.component_cell_count

count = adata.obs['component'].value_counts().to_dict()
adata.uns['shape_component']['count'] = count

We now get exactly a number per component:

adata.uns['shape_component']['count']

{2244: 9279,
 2139: 6197,
 2215: 5235,
 2135: 4877,
...

Now for plotting, we can come up with our own idea. We can first construct a dataframe, that holds all information we want for plotting/calculation. For example:

df = pd.DataFrame(adata.uns['shape_component']['count'].items(), columns=['component', 'count'])
df = pd.merge(df, adata.obs[['component', 'image_ID']].drop_duplicates().dropna(), on='component')
df = pd.merge(df, adata.obs[['component', 'cellcharter_CN']].drop_duplicates().dropna(), on='component')
df = pd.merge(df, adata.obs[['component', 'disease2']].drop_duplicates().dropna(), on='component')

counts = adata.obs.groupby(['image_ID', 'cellcharter_CN']).size().reset_index(name='total_neighborhood_cells_image')
df = df.merge(counts, on=['image_ID', 'cellcharter_CN'], how='left')

unique_counts = (
    adata.obs.groupby(["image_ID", "cellcharter_CN"])["component"]
    .nunique()
    .reset_index()
    .rename(columns={"component": "unique_components_neighborhood_image"})
)
df = df.merge(unique_counts, on=["image_ID", "cellcharter_CN"], how="left")

df

| component | count | image_ID                   | cellcharter_CN             | disease2 | total_neighborhood_cells_image | unique_components_neighborhood_image |
|-----------|-------|----------------------------|----------------------------|----------|--------------------------------|--------------------------------------|
| 2244      | 9279  | TS-373_IMC21_UB_001.csv    | focal_pc_oxphos            | MM_noBD  | 9317                           | 1                                    |
| 2139      | 6197  | TS-373_IMC71_B_002.csv     | focal_pc_oxphos            | MM_BD    | 6553                           | 4                                    |
| 2215      | 5235  | TS-373_IMC84_B_002.csv     | focal_pc_oxphos            | MM_BD    | 5443                           | 1                                    |
| 2135      | 4877  | TS-373_IMC50_B_002.csv     | focal_pc_oxphos            | MM_BD    | 5359                           | 3                                    |
| 2213      | 4870  | TS-373_IMC89_B_001.csv     | focal_pc_oxphos            | MM_BD    | 5054                           | 2                                    |
| ...       | ...   | ...                        | ...                        | ...      | ...                            | ...                                  |
| 1695      | 50    | TS-373_IMC77_B_002.csv     | bone_myeloid               | MM_BD    | 1004                           | 4                                    |
| 1694      | 50    | TS-373_IMC77_B_002.csv     | bone_myeloid               | MM_BD    | 1004                           | 4                                    |
| 266       | 50    | TS-373_IMC66_B_002.csv     | stroma_adipocyte           | MM_BD    | 831                            | 3                                    |
| 2709      | 50    | TS-373_IMC29_UB_001.csv    | proliferating_glycolytic   | MM_noBD  | 1508                           | 6                                    |
| 1052      | 50    | TS-373_IMC69_B_002.csv     | pc_myeloid                 | MM_BD    | 1711                           | 7                                    |


Now here we finally have a lot of information. We see for example that component 2244 has 9279 cells from the neighborhood called focal_pc_oxphos, and we have a total of 9317 cells of that neighborhood in that image. We can also see that there is only 1 unique connected component for this neighborhood in that image.

We now only need to find a good way of plotting this in a relative manner so we also address smaller neighborhoods (most likely we need to take into account the total_neighborhood_cells_image) and maybe also integrate the information about unique connected components.
What do you say to this approach? Any other suggestions?

@marcovarrone
Copy link
Collaborator

Hi @LukasHats, looks great!
The last remaining part now is to find a way to combine all this information preferably into a single metric.

We have to answer some questions about what the metric should capture before designing it:

  1. can you say in a sentence what the metric should represent?
  2. if it's related to how scattered are the cells in the neighborhood, imagine the extreme case in which there are multiple components in a slide and all the cells are inside components (there are no cells scattered around the sample). Then, should the metric be different if there is only one component or multiple ones in the sample?
  3. Related to question 2, is it going to be a metric for components or a metric for neighborhoods?
  4. it helps to reason for extremes. What would be the situation for which the metric is 0 and for which the metric is 1?

@LukasHats
Copy link
Author

LukasHats commented Mar 12, 2025

Hi @LukasHats, looks great! The last remaining part now is to find a way to combine all this information preferably into a single metric.

We have to answer some questions about what the metric should capture before designing it:

1. can you say in a sentence what the metric should represent?

2. if it's related to how scattered are the cells in the neighborhood, imagine the extreme case in which there are multiple components in a slide and all the cells are inside components (there are no cells scattered around the sample). Then, should the metric be different if there is only one component or multiple ones in the sample?

3. Related to question 2, is it going to be a metric for components or a metric for neighborhoods?

4. it helps to reason for extremes. What would be the situation for which the metric is 0 and for which the metric is 1?

@marcovarrone
So I finally came up with a metric that we can put in per component. I am using my above mentioned dataframe, generated with counting cell numbers of each component.

Proposed Metric: Normalized Component Contribution (NCC)

  1. NCC = (Component_Size) / (Total_Neighborhood_Cells of the associated neighborhood / Unique_Components of that neighborhood). The Normalized Component Contribution (NCC) metric compares a component’s cell count to the average component size in its cellular neighborhood, indicating whether it is larger or smaller than expected given the neighborhood’s total cells and component count.

  2. Yes the metric should be different, if there is only on component or multiple ones per sample, see the equation in 1)

  3. Its going to be a metric for components, that however depends also on the neighborhood labels, as its incorporating the average component size of the neighborhood the component comes from. Thats why we will need the neighborhood_key in the function I am implementing right now

  4. The Normalized Component Contribution (NCC) metric equals 1 when a component’s cell count matches the average size for its cellular neighborhood (i.e., count = total_neighborhood_cells_image / unique_components_neighborhood_image), indicating typical clustering. The metric would theoretically reach 0 only if a component contained zero cells, though this scenario is biologically impossible since components represent connected cell groups. In practice, NCC approaches but never actually reaches 0, with smaller values indicating components far below neighborhood averages.

So its a metric per component, I will now write the function to store it in the dictionary just like the purity function etc. The only difference is, that the user also needs to provide the cellcharter neighborhood label key! I am excited to implement it with you!

Edit(25.03.2025)
The good thing is, if users decide to plot different neighborhoods seperately, the average for each neighborhood will still differ! Thats what I wanted to achieve initially.

@LukasHats
Copy link
Author

@marcovarrone I now added the function, it should work, tried it out in a jupyter notebook on my anndata. I will implement the plotting function later today/tomorrow!

LukasHats and others added 2 commits March 25, 2025 12:44
…ould already work as its already implemented in the .tl module similar to the existing metrics
@LukasHats LukasHats marked this pull request as ready for review March 25, 2025 11:53
@marcovarrone
Copy link
Collaborator

Hi @LukasHats, thank you very much for the contribution!
Glad to see that things are finally shaping up :)

I am quite busy at the moment, but I should be able to review it in 1-2 weeks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants