Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearest Genes Function #172

Open
wants to merge 16 commits into
base: dev
Choose a base branch
from
Open

Nearest Genes Function #172

wants to merge 16 commits into from

Conversation

nleroy917
Copy link
Member

Third times a charm? I think my branch/base locally got messed up, so I am just opening a new one.

Given a query and set of annotations, this function will calculate
the nearest annotation to each region in the region set, as well
as the nearest gene type and the distance to the nearest gene.

This is updated from the previous version which wasn't functioning
properly!

The only issue I am having is that the function requires that the
annotation file has the name of the gene defined as gene_id
and the type of gene as gene_biotype. It would be nice to
introduce a keyword parameter that defaults to these two,
but can be changed to extract out this information should an
annotation file be given with different schema or naming convention.

However, I am having a hard time attempting to dynamically access these attributes. For example:

query$gene_type = annotation[nearestIds]$gene_type

versus

query$gene_type = annotations[nearestIds][key_name]

@nleroy917
Copy link
Member Author

nleroy917 commented Dec 3, 2021

@kkupkova I fixed the typo you pointed out and I also added support for GRangesList input. The merge conflict is due to NAMESPACE. I forked from master instead of dev on accident. Is there a function in dev not in master?

@nsheff nsheff mentioned this pull request Jan 8, 2022
@nsheff
Copy link
Member

nsheff commented Jan 8, 2022

@nleroy917 can you address the merge conflicts here please?

@nsheff
Copy link
Member

nsheff commented Jan 8, 2022

Did you fix your issue with dynamically accessing attributes?

@nleroy917
Copy link
Member Author

NAMESPACE conflicts should be resolved. Looking into dynamic attribute access now.

@nleroy917
Copy link
Member Author

Looking at this StackOverflow article it seems if you want to access a column named gene_type, you can store that string in a variable and then use it to get your data like so:

col_name = "gene_type"
result1 = df$gene_type
result2 = df[, col_name]
result3 = df[[col_name]]

And all results will be identical (result = result2 = result3). However... this doesn't work with GRanges objects. So, I am going to convert the data to a DataFrame prior to accessing that data using grToDt():

  # calculate the nearest annotations to given query
  nearestIds = nearest(query, annotations)
  
  # annotate nearest gene and type
  nearestGenes = annotations[nearestIds]
  
  #
  # convert nearestGenes GRange object to data-table
  # and dynamically access the column that way...
  # this is used to circumvent the fact that we cannot
  # dynamically access metadata columns inside a GRanges
  # object like we can a datatable:
  #   col = "gene_id"
  #   dt[[col]]
  #   ^^^ This doesnt work in a GRanges object.
  #
  query$nearest_gene = grToDt(nearestGenes)[[gene_name_key]]
  query$nearest_gene_type = grToDt(nearestGenes)[[gene_type_key]]

@nsheff
Copy link
Member

nsheff commented Jan 10, 2022

GRanges uses a separate entity called 'metadata columns' for this. It used to be a function called elementMetadata, now it looks like you should use mcols.

You should not convert to a data table just to extract a column from the GRanges object.

@nleroy917
Copy link
Member Author

You should not convert to a data table just to extract a column from the GRanges object.

Is it inaccurate or non-performant?

@nleroy917
Copy link
Member Author

Added mcols function to dynamically extract columns instead of grToDt

@nsheff
Copy link
Member

nsheff commented Jan 10, 2022

You should not convert to a data table just to extract a column from the GRanges object.

Is it inaccurate or non-performant?

non-performant

@kkupkova
Copy link
Contributor

Hi!

  1. The function still does not work on list of region sets - GRangesList. Please make sure that it does.
  2. The input to the function are TSSs. Getting a list of TSSs already requires an effort. I think that the input should be either a GenomicRanges object with (not only TSSs) and extract the TSSs from the gene coordinates or there should be a Ref function like in other functions, where you just pass a string identifying genome assembly and all is done automatically.

@nleroy917
Copy link
Member Author

Do you have example data inputs for both of these? Presently, this is what I am doing to get both a query and "annotations"

queryFile = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions")
query = rtracklayer::import(queryFile)
data(TSS_hg19)

Does a GRnagesList object exist inside extdata as well that I could test?

@kkupkova
Copy link
Contributor

Look in the vignettes, there are examples there. And maybe read the vignette, it might help you better understand the purpose of this package and then design your function

@nsheff
Copy link
Member

nsheff commented Jan 24, 2022

@nleroy917 will you be able to finish this today? We need to make a new release shortly.

@nleroy917
Copy link
Member Author

I can try. However,

The input to the function are TSSs. Getting a list of TSSs already requires an effort. I think that the input should be either a GenomicRanges object with (not only TSSs) and extract the TSSs from the gene coordinates or there should be a Ref function like in other functions, where you just pass a string identifying genome assembly and all is done automatically.

this is still ambiguous to me. The function input need not be TSS's. It can be any GRanges object. If you give an internal set with annotations (and designate them by their identifier), then it will be annotated. It could be anything. It just takes a query and some annotated set.

Regarding the first point, I included this code I found in other similar functions:

  .validateInputs(list(query=c("GRanges","GRangesList")))
  if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    x = lapply(query, calcFeatureDist, features)
    return(x)
  }

Which, to me, looks like it's is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

@nsheff
Copy link
Member

nsheff commented Jan 24, 2022

Which, to me, looks like it's is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list.

@nsheff
Copy link
Member

nsheff commented Jan 24, 2022

this is still ambiguous to me. The function input need not be TSS's. It can be any GRanges object.

I think she may be referring to your annotations element, not your query element. Are both GRanges objects?

You should pattern your function after the way the current functions work.

@nleroy917
Copy link
Member Author

Which, to me, looks like it's is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list.

Which, to me, looks like it's is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list.

I see now. Would this be proper handling of a GRangesList?

# calcNearestGenes.R
if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    annots = lapply(
      query,
      function(x) {
        calcNearestGenes(x, annotations, gene_name_key=gene_name_key, gene_type_key=gene_type_key)
        }
      )
    return(annots)
  }

I think she may be referring to your annotations element, not your query element. Are both GRanges objects?

Both should be GRanges objects.

@kkupkova
Copy link
Contributor

Your input to this function are TSS coordinates - I would include in the example how to extract those from gene annotations. It might be also good to make a ref function for our available gene models. I offered to talk to clarify everything. Never heard back.

@nleroy917
Copy link
Member Author

The latest commit should address GRangesList compatibility.

The input to the function are TSSs. Getting a list of TSSs already requires an effort. I think that the input should be either a GenomicRanges object with (not only TSSs) and extract the TSSs from the gene coordinates or there should be a Ref function like in other functions, where you just pass a string identifying genome assembly and all is done automatically.

Regarding this, are you referring to something like this where the calling signature becomes:

calcNearestGenes =  function(query, refAssembly) {
   ...
}

and the TSSs are extracted through:

getTSSs(refAssembly)

Then the TSSs can be used to annotate the nearest genes? Should the function be able to support someone coming with their own annotations? I.e. you either bring your own, or designate a refAssembly?

@kkupkova
Copy link
Contributor

Sorry if I was not clear. I did not realize we had also function for extracting TSSs from a GTF file. So I guess it is ok if the annotation object is the TSS_hg19 object that we have associated with the package. I am just thinking that there should be a calcNearestGenesRef function, where a string indicating genome is passed to it (just like in e.g. calcFeatureDistRefTSS function) and then there should be another function calcFeatureDistRefTSS where a user can provide their own annotation file in form of GRanges object (like the TSS_hg19 one).

But that can be solved when the function is working. Now when I pass a query file, where the query is a GRangesList, I get an error. So the function is now not able to handle multiple inputs at once. Please make sure that the function is able to handle both GRanges and GRangesList query objects.

Also, at this point the tests are not passing, so please make sure that they are.

Another thing, can you check that the function output gives the same distances as calcFeatureDist or calcFeatureDistRefTSS function. I can see that the nearest_distance values in the output are always >= 0, which is not how the other similar functions work. There should be a distinction between where the nearest gene is downstream or upstream from the region of interest.

@nleroy917
Copy link
Member Author

But that can be solved when the function is working. Now when I pass a query file, where the query is a GRangesList, I get an error. So the function is now not able to handle multiple inputs at once. Please make sure that the function is able to handle both GRanges and GRangesList query objects.

Make sure you have pulled down my latest changes. I addressed this yesterday and it works for me with both a GRanges and GRangesList object. If not... could you give me some code so that I may reproduce what you are doing to ensure it works for both object types. It becomes a lot easier to work through issues when I have reproducible examples.

Also, at this point the tests are not passing, so please make sure that they are.

The linter was failing due to code outside my commits (Something in partition-plots.R)... I was aware but since it wasn't. my code, I was ignoring for now. I fetched upstream and that seems to have resolved it.

I am just thinking that there should be a calcNearestGenesRef function, where a string indicating genome is passed to it (just like in e.g. calcFeatureDistRefTSS function) and then there should be another function calcFeatureDistRefTSS where a user can provide their own annotation file in form of GRanges object (like the TSS_hg19 one).

I need to make sure I am understanding you correctly here, since you used "calcFeatureDistRefTSS" in two separate contexts. It sounds like I need to do three things:

  1. Create a version of my function that accepts an arbitrary GRanges (or GRangesList) object with annotated regions to calculate the nearest distance to.
  2. Create a version of my function that accepts a reference assembly string and will do the above automatically for the user.
  3. Ensure both the above two functions are properly calculating distances such that downstream elements are represented with a negative distance and upstream elements are represented with a positive distance.

I understand there is a time crunch with the release, so please let me know what I need to do, and I will prioritize it to get it done ASAP

@nsheff
Copy link
Member

nsheff commented Jan 25, 2022

I need to make sure I am understanding you correctly here

I think your list is accurate. All functions in the package have the 2 versions you describe. It was one of the main philosophies of the package. The Ref version should call the generic version after doing what it needs to get the TSS annotation.

@nleroy917
Copy link
Member Author

Ok. I have been pondering the +/- value for distances... What if I did something like this:

# get distance to upsream and downstream
# with proper sign
distToUpstream = -1 * distance(query, annotations[precede(query, annotations)])
distToDownstream = distance(query, annotations[follow(query, annotations)])

# calculate absolute distance and find nearest
nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))

# coerce upstream back to negative by
# finding where the upstream distance was
# chosen and force it back to negative
nearestDist[nearestDist == distToUpstream] = -1 * nearestDist[nearestDist == distToUpstream]

query$distance_to_nearest = nearestDist

@nleroy917
Copy link
Member Author

This is what I came up with:

.directionalDistanceToNearest = function(x, y) {
  # get distance to upsream and downstream
  # with proper sign
  distToUpstream = -1 * distance(x, y[precede(query, y)])
  distToDownstream = distance(x, y[follow(query, y)])
  
  # calculate absolute distance and find nearest
  nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))
  
  # coerce upstream back to negative by
  # finding where the upstream distance was
  # chosen and force it back to negative
  nearestDist[nearestDist == abs(distToUpstream)] = -1 * nearestDist[nearestDist == abs(distToUpstream)]
  
  return(nearestDist)
}

Testing on the same data, I get this result. Where TEST_nearest_distance is the output of my above function. It seems to be working well compared to the Granges distance() function which is in column nearest_distance:

       chr     start       end    nearest_gene nearest_gene_type nearest_distance TEST_nearest_distance
   1: chr1   3190582   3191428 ENSG00000130762    protein_coding           179560               -179560
   2: chr1   8130440   8131887 ENSG00000116285    protein_coding            44070                -44070
   3: chr1  10593124  10594209 ENSG00000160049    protein_coding            60539                -60539
   4: chr1  10732071  10733118 ENSG00000130940    protein_coding           123588                123588
   5: chr1  10757665  10758631 ENSG00000130940    protein_coding            98075                 98075
  ---                                                                                                  
1335: chrX 139380917 139382199 ENSG00000134595    protein_coding           205025                205025
1336: chrX 139593503 139594774 ENSG00000134595    protein_coding             6276                 -6276
1337: chrX 139674500 139675403 ENSG00000134595    protein_coding            87273                -87273
1338: chrX 147829017 147830159 ENSG00000155966    protein_coding           246877                246877
1339: chrX 150407693 150409052 ENSG00000102195    protein_coding            62567                 62567

@kkupkova
Copy link
Contributor

The code now works, so that is awesome. But I still have few things to add.

  1. Please make sure that the function documentation is complete. .directionalDistanceToNearest is not described at all and inputs to calcNearestGenes are not all described.

  2. Please make a calcNearestGenesRef function, that will take query and string with genome it should be used on: look at calcFeatureDistRefTSS to see how it should be done.

  3. While the function works, as you can see in our paper, we are using data.table functions for most of the operations for speed. This should be done uniformly across package. You are calculating the distance with nearest function from GenomicRanges package. While it calculates the distances correctly it is slower and there might be even slight differences in the distances calculated. I would just recommend taking our calcFeatureDist and calcFeatureDistRefTSS functions and just tweak them in a way that those won't return only the distances, but will return all the features that you have in your function.

  4. I would remove TEST_nearest_distance column from the output, just return these values in the nearest_distance column. There is no reason for having a "directional" and "not-directional" column.

@nleroy917
Copy link
Member Author

  1. Please make sure that the function documentation is complete. .directionalDistanceToNearest is not described at all and inputs to calcNearestGenes are not all described.

Should be done.

  1. Please make a calcNearestGenesRef function, that will take query and string with genome it should be used on: look at calcFeatureDistRefTSS to see how it should be done.

Should be done

  1. While the function works, as you can see in our paper, we are using data.table functions for most of the operations for speed. This should be done uniformly across package. You are calculating the distance with nearest function from GenomicRanges package. While it calculates the distances correctly it is slower and there might be even slight differences in the distances calculated. I would just recommend taking our calcFeatureDist and calcFeatureDistRefTSS functions and just tweak them in a way that those won't return only the distances, but will return all the features that you have in your function.

So the reason I wasn't converting to a data table was that I thought it was non-performant. See this comment by @nsheff

  1. I would remove TEST_nearest_distance column from the output, just return these values in the nearest_distance column. There is no reason for having a "directional" and "not-directional" column.

Should be done

Am I correct that I need to run roxygen and push again?

@nsheff
Copy link
Member

nsheff commented Jan 28, 2022

So the reason I wasn't converting to a data table was that I thought it was non-performant. See this comment by @nsheff

You were converting for the purpose of adding a metadata column.

This is talking about converting it to use object for a computation. The data.table computation will be much faster. Kristyna is correct.

@kkupkova
Copy link
Contributor

I am sorry! I completely forgot that the issues are fixed. I thought it was just forgotten. There are still few problems:

  1. Missing description for gene_name_key and gene_biotype - I also guess this should be somehow mentioned in the annotations object, that it should contain these.
  2. When I run calcFeatureDistRefTSS I get different results - we are calculating distance from the middle of the peak, here you are calculating shortest distance. Since this should be essentially the same function, it should definitely produce the same results.
  3. This is more of an idea - I am not even sure if we should have another new function, or just edit calcFeatureDist function, so it returns i) original coordinates, ii) distance, iii) gene name if available. We should then add few lines to the plotting function, so it's compatible. But I think that this would be the most user friendly option. Creating another function that does essentially the same thing is just a bit redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants