Nearest Genes Function #172

nleroy917 · 2021-12-03T18:05:59Z

Third time's the charm? I think my local branch/base got messed up, so I am just opening a new PR.

Given a query and set of annotations, this function will calculate
the nearest annotation to each region in the region set, as well
as the nearest gene type and the distance to the nearest gene.

This is updated from the previous version which wasn't functioning
properly!

The only issue I am having is that the function requires that the
annotation file has the name of the gene defined as gene_id
and the type of gene as gene_biotype. It would be nice to
introduce a keyword parameter that defaults to these two,
but can be changed to extract out this information should an
annotation file be given with different schema or naming convention.

However, I am having a hard time attempting to dynamically access these attributes. For example:

query$gene_type = annotations[nearestIds]$gene_type

versus

query$gene_type = annotations[nearestIds][key_name]

nleroy917 · 2021-12-03T18:07:23Z

@kkupkova I fixed the typo you pointed out and I also added support for GRangesList input. The merge conflict is due to NAMESPACE. I forked from master instead of dev by accident. Is there a function in dev that is not in master?

nsheff · 2022-01-08T15:23:37Z

@nleroy917 can you address the merge conflicts here please?

nsheff · 2022-01-08T15:27:30Z

Did you fix your issue with dynamically accessing attributes?

nleroy917 · 2022-01-09T19:22:18Z

NAMESPACE conflicts should be resolved. Looking into dynamic attribute access now.

nleroy917 · 2022-01-09T21:12:16Z

Looking at this StackOverflow article it seems if you want to access a column named gene_type, you can store that string in a variable and then use it to get your data like so:

col_name = "gene_type"
result1 = df$gene_type
result2 = df[, col_name]
result3 = df[[col_name]]

And all results will be identical (result1 == result2 == result3). However... this doesn't work with GRanges objects. So, I am going to convert the data to a data.table using grToDt() before accessing it:

  # calculate the nearest annotations to given query
  nearestIds = nearest(query, annotations)
  
  # annotate nearest gene and type
  nearestGenes = annotations[nearestIds]
  
  #
  # convert the nearestGenes GRanges object to a data.table
  # and dynamically access the column that way...
  # this is used to circumvent the fact that we cannot
  # dynamically access metadata columns inside a GRanges
  # object like we can in a data.table:
  #   col = "gene_id"
  #   dt[[col]]
  #   ^^^ This doesn't work on a GRanges object.
  #
  query$nearest_gene = grToDt(nearestGenes)[[gene_name_key]]
  query$nearest_gene_type = grToDt(nearestGenes)[[gene_type_key]]

nsheff · 2022-01-10T12:13:19Z

GRanges uses a separate entity called 'metadata columns' for this. It used to be a function called elementMetadata, now it looks like you should use mcols.

You should not convert to a data table just to extract a column from the GRanges object.
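A minimal sketch of what the mcols route could look like, assuming the GenomicRanges package is installed. The toy coordinates and gene names are made up for illustration, and gene_name_key / gene_type_key simply mirror the keyword parameters discussed above:

```r
library(GenomicRanges)

# Toy annotation set with two metadata columns (values are illustrative only).
annotations <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 500, 200), width = 1),
  gene_id = c("geneA", "geneB", "geneC"),
  gene_biotype = c("protein_coding", "lincRNA", "protein_coding")
)

# Toy query regions.
query <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(450, 10), width = 20)
)

# Column names held in variables, as a keyword argument would supply them.
gene_name_key <- "gene_id"
gene_type_key <- "gene_biotype"

# Index of the nearest annotation for each query region.
nearestIds <- nearest(query, annotations)
nearestGenes <- annotations[nearestIds]

# mcols() returns a DataFrame, which supports [[ with a character name,
# so no conversion to a data.table is needed.
query$nearest_gene <- mcols(nearestGenes)[[gene_name_key]]
query$nearest_gene_type <- mcols(nearestGenes)[[gene_type_key]]
```

This sketch does not handle query regions for which nearest() returns NA (for example, a chromosome with no annotations), which a real implementation would need to cover.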

nleroy917 · 2022-01-10T16:01:59Z

> You should not convert to a data table just to extract a column from the GRanges object.

Is it inaccurate or non-performant?

nleroy917 · 2022-01-10T16:08:40Z

Added mcols function to dynamically extract columns instead of grToDt

nsheff · 2022-01-10T16:12:45Z

> > You should not convert to a data table just to extract a column from the GRanges object.
>
> Is it inaccurate or non-performant?

non-performant

kkupkova · 2022-01-11T00:23:37Z

Hi!

1. The function still does not work on a list of region sets (a GRangesList). Please make sure that it does.
2. The input to the function is a set of TSSs. Getting a list of TSSs already requires some effort. I think the function should either accept a GenomicRanges object of gene coordinates (not only TSSs) and extract the TSSs from it, or there should be a Ref function as in the other functions, where you just pass a string identifying the genome assembly and everything is done automatically.

nleroy917 · 2022-01-12T16:47:54Z

Do you have example data inputs for both of these? Presently, this is what I am doing to get both a query and "annotations"

queryFile = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions")
query = rtracklayer::import(queryFile)
data(TSS_hg19)

Does a GRangesList object exist inside extdata as well that I could test with?

kkupkova · 2022-01-12T16:50:16Z

Look in the vignettes, there are examples there. And maybe read the vignette, it might help you better understand the purpose of this package and then design your function

nsheff · 2022-01-24T13:04:17Z

@nleroy917 will you be able to finish this today? We need to make a new release shortly.

nleroy917 · 2022-01-24T13:16:13Z

I can try. However,

> The input to the function is a set of TSSs. Getting a list of TSSs already requires some effort. I think the function should either accept a GenomicRanges object of gene coordinates (not only TSSs) and extract the TSSs from it, or there should be a Ref function as in the other functions, where you just pass a string identifying the genome assembly and everything is done automatically.

this is still ambiguous to me. The function input need not be TSSs. It can be any GRanges object. If you give it a set of annotated regions (and designate the metadata columns by their identifiers), then the query will be annotated. It could be anything. It just takes a query and some annotated set.

Regarding the first point, I included this code I found in other similar functions:

  .validateInputs(list(query=c("GRanges","GRangesList")))
  if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    x = lapply(query, calcFeatureDist, features)
    return(x)
  }

Which, to me, looks like it is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

nsheff · 2022-01-24T13:20:00Z

> Which, to me, looks like it is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?

No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list.
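To illustrate the recursion pattern in isolation, here is a toy sketch in base R. It is not the package code: plain numeric vectors stand in for GRanges objects, a plain list stands in for a GRangesList, and nearestFeatureDist is a hypothetical name.

```r
# Toy stand-in: a "region set" is a numeric vector of positions and a
# "list of region sets" is a plain list of such vectors.
nearestFeatureDist <- function(query, features) {
  if (is.list(query)) {
    # Recurse over each element; lapply() passes `features` through unchanged.
    # The result is a list with one entry per input element, not a merged object.
    return(lapply(query, nearestFeatureDist, features))
  }
  # Base case: absolute distance from each query position to its nearest feature.
  vapply(query, function(q) min(abs(features - q)), numeric(1))
}

features <- c(100, 500, 900)
single <- c(120, 480)
multi <- list(a = c(120, 480), b = c(890))

singleResult <- nearestFeatureDist(single, features)
multiResult <- nearestFeatureDist(multi, features)
```

Calling the function on `multi` returns a named list with one numeric vector per element, which is the same shape of result the lapply branch produces for a GRangesList.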

nsheff · 2022-01-24T13:21:31Z

> this is still ambiguous to me. The function input need not be TSSs. It can be any GRanges object.

I think she may be referring to your annotations element, not your query element. Are both GRanges objects?

You should pattern your function after the way the current functions work.

nleroy917 · 2022-01-24T13:53:27Z

> > Which, to me, looks like it is coercing any GRangesList to a GRanges object. Am I interpreting this incorrectly?
>
> No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list.

I see now. Would this be proper handling of a GRangesList?

# calcNearestGenes.R
if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    annots = lapply(
      query,
      function(x) {
        calcNearestGenes(x, annotations, gene_name_key=gene_name_key, gene_type_key=gene_type_key)
        }
      )
    return(annots)
  }

> I think she may be referring to your annotations element, not your query element. Are both GRanges objects?

Both should be GRanges objects.

kkupkova · 2022-01-24T15:48:36Z

Your input to this function is a set of TSS coordinates. I would include in the example how to extract those from gene annotations. It might also be good to make a Ref function for our available gene models. I offered to talk to clarify everything, but never heard back.

nleroy917 · 2022-01-24T17:52:12Z

The latest commit should address GRangesList compatibility.

> The input to the function is a set of TSSs. Getting a list of TSSs already requires some effort. I think the function should either accept a GenomicRanges object of gene coordinates (not only TSSs) and extract the TSSs from it, or there should be a Ref function as in the other functions, where you just pass a string identifying the genome assembly and everything is done automatically.

Regarding this, are you referring to something like this where the calling signature becomes:

calcNearestGenes =  function(query, refAssembly) {
   ...
}

and the TSSs are extracted through:

getTSSs(refAssembly)

Then the TSSs can be used to annotate the nearest genes? Should the function be able to support someone coming with their own annotations? I.e. you either bring your own, or designate a refAssembly?

kkupkova · 2022-01-25T00:39:32Z

Sorry if I was not clear. I did not realize we also had a function for extracting TSSs from a GTF file. So I guess it is OK if the annotation object is the TSS_hg19 object that we have associated with the package. I am just thinking that there should be a calcNearestGenesRef function, where a string indicating the genome is passed to it (just like in e.g. the calcFeatureDistRefTSS function), and then there should be another function calcFeatureDistRefTSS where a user can provide their own annotation file in the form of a GRanges object (like the TSS_hg19 one).

But that can be solved when the function is working. Now when I pass a query file, where the query is a GRangesList, I get an error. So the function is now not able to handle multiple inputs at once. Please make sure that the function is able to handle both GRanges and GRangesList query objects.

Also, at this point the tests are not passing, so please make sure that they are.

Another thing: can you check that the function output gives the same distances as the calcFeatureDist or calcFeatureDistRefTSS function? I can see that the nearest_distance values in the output are always >= 0, which is not how the other similar functions work. There should be a distinction between whether the nearest gene is downstream or upstream of the region of interest.

nleroy917 · 2022-01-25T18:01:15Z

> But that can be solved when the function is working. Now when I pass a query file, where the query is a GRangesList, I get an error. So the function is now not able to handle multiple inputs at once. Please make sure that the function is able to handle both GRanges and GRangesList query objects.

Make sure you have pulled down my latest changes. I addressed this yesterday and it works for me with both a GRanges and GRangesList object. If not... could you give me some code so that I may reproduce what you are doing to ensure it works for both object types. It becomes a lot easier to work through issues when I have reproducible examples.

> Also, at this point the tests are not passing, so please make sure that they are.

The linter was failing due to code outside my commits (something in partition-plots.R)... I was aware of it, but since it wasn't my code, I was ignoring it for now. I fetched upstream and that seems to have resolved it.

> I am just thinking that there should be a calcNearestGenesRef function, where a string indicating the genome is passed to it (just like in e.g. the calcFeatureDistRefTSS function), and then there should be another function calcFeatureDistRefTSS where a user can provide their own annotation file in the form of a GRanges object (like the TSS_hg19 one).

I need to make sure I am understanding you correctly here, since you used "calcFeatureDistRefTSS" in two separate contexts. It sounds like I need to do three things:

1. Create a version of my function that accepts an arbitrary GRanges (or GRangesList) object with annotated regions to calculate the nearest distance to.
2. Create a version of my function that accepts a reference assembly string and will do the above automatically for the user.
3. Ensure both of the above functions calculate distances properly, such that downstream elements are represented with a negative distance and upstream elements are represented with a positive distance.

I understand there is a time crunch with the release, so please let me know what I need to do, and I will prioritize it to get it done ASAP

nsheff · 2022-01-25T18:33:53Z

> I need to make sure I am understanding you correctly here

I think your list is accurate. All functions in the package have the 2 versions you describe. It was one of the main philosophies of the package. The Ref version should call the generic version after doing what it needs to get the TSS annotation.
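Roughly, the two-version pattern looks like the toy sketch below. This is base R with made-up data: refTSS, calcNearestToy and calcNearestToyRef are hypothetical names used only for illustration, and numeric positions stand in for GRanges objects.

```r
# Hypothetical lookup table standing in for reference data shipped with a package.
refTSS <- list(
  hg19 = c(100, 500, 900),
  mm10 = c(50, 750)
)

# Generic version: the caller supplies the annotation positions directly.
calcNearestToy <- function(query, annotations) {
  vapply(
    query,
    function(q) annotations[which.min(abs(annotations - q))],
    numeric(1)
  )
}

# Ref version: resolve the annotation from an assembly string, then delegate
# to the generic version so the actual computation lives in one place.
calcNearestToyRef <- function(query, refAssembly) {
  if (!refAssembly %in% names(refTSS)) {
    stop("No reference annotation available for assembly: ", refAssembly)
  }
  calcNearestToy(query, refTSS[[refAssembly]])
}

nearestHg19 <- calcNearestToyRef(c(120, 800), "hg19")
```

Keeping the Ref version as a thin wrapper means users who bring their own annotations and users who name an assembly both go through the same calculation.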

nleroy917 · 2022-01-25T19:02:27Z

Ok. I have been pondering the +/- value for distances... What if I did something like this:

# get distance to upstream and downstream
# with proper sign
distToUpstream = -1 * distance(query, annotations[precede(query, annotations)])
distToDownstream = distance(query, annotations[follow(query, annotations)])

# calculate absolute distance and find nearest
nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))

# coerce upstream back to negative by
# finding where the upstream distance was
# chosen and force it back to negative
upstreamChosen = nearestDist == abs(distToUpstream)
nearestDist[upstreamChosen] = -1 * nearestDist[upstreamChosen]

query$distance_to_nearest = nearestDist

nleroy917 · 2022-01-27T18:31:25Z

This is what I came up with:

.directionalDistanceToNearest = function(x, y) {
  # get distance to upstream and downstream
  # with proper sign
  distToUpstream = -1 * distance(x, y[precede(x, y)])
  distToDownstream = distance(x, y[follow(x, y)])
  
  # calculate absolute distance and find nearest
  nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))
  
  # coerce upstream back to negative by
  # finding where the upstream distance was
  # chosen and force it back to negative
  nearestDist[nearestDist == abs(distToUpstream)] = -1 * nearestDist[nearestDist == abs(distToUpstream)]
  
  return(nearestDist)
}

Testing on the same data, I get the result below, where TEST_nearest_distance is the output of the function above. It seems to work well compared to the GRanges distance() function, whose output is in the nearest_distance column:

       chr     start       end    nearest_gene nearest_gene_type nearest_distance TEST_nearest_distance
   1: chr1   3190582   3191428 ENSG00000130762    protein_coding           179560               -179560
   2: chr1   8130440   8131887 ENSG00000116285    protein_coding            44070                -44070
   3: chr1  10593124  10594209 ENSG00000160049    protein_coding            60539                -60539
   4: chr1  10732071  10733118 ENSG00000130940    protein_coding           123588                123588
   5: chr1  10757665  10758631 ENSG00000130940    protein_coding            98075                 98075
  ---                                                                                                  
1335: chrX 139380917 139382199 ENSG00000134595    protein_coding           205025                205025
1336: chrX 139593503 139594774 ENSG00000134595    protein_coding             6276                 -6276
1337: chrX 139674500 139675403 ENSG00000134595    protein_coding            87273                -87273
1338: chrX 147829017 147830159 ENSG00000155966    protein_coding           246877                246877
1339: chrX 150407693 150409052 ENSG00000102195    protein_coding            62567                 62567

kkupkova · 2022-01-27T23:59:41Z

The code now works, so that is awesome. But I still have a few things to add.

1. Please make sure that the function documentation is complete. .directionalDistanceToNearest is not described at all and inputs to calcNearestGenes are not all described.
2. Please make a calcNearestGenesRef function, that will take query and string with genome it should be used on: look at calcFeatureDistRefTSS to see how it should be done.
3. While the function works, as you can see in our paper, we are using data.table functions for most of the operations for speed. This should be done uniformly across package. You are calculating the distance with nearest function from GenomicRanges package. While it calculates the distances correctly it is slower and there might be even slight differences in the distances calculated. I would just recommend taking our calcFeatureDist and calcFeatureDistRefTSS functions and just tweak them in a way that those won't return only the distances, but will return all the features that you have in your function.
4. I would remove TEST_nearest_distance column from the output, just return these values in the nearest_distance column. There is no reason for having a "directional" and "not-directional" column.

nleroy917 · 2022-01-28T19:19:22Z

> Please make sure that the function documentation is complete. .directionalDistanceToNearest is not described at all and inputs to calcNearestGenes are not all described.

Should be done.

> Please make a calcNearestGenesRef function, that will take query and string with genome it should be used on: look at calcFeatureDistRefTSS to see how it should be done.

Should be done

> While the function works, as you can see in our paper, we are using data.table functions for most of the operations for speed. This should be done uniformly across package. You are calculating the distance with nearest function from GenomicRanges package. While it calculates the distances correctly it is slower and there might be even slight differences in the distances calculated. I would just recommend taking our calcFeatureDist and calcFeatureDistRefTSS functions and just tweak them in a way that those won't return only the distances, but will return all the features that you have in your function.

So the reason I wasn't converting to a data table was that I thought it was non-performant. See this comment by @nsheff

> I would remove TEST_nearest_distance column from the output, just return these values in the nearest_distance column. There is no reason for having a "directional" and "not-directional" column.

Should be done

Am I correct that I need to run roxygen and push again?

nsheff · 2022-01-28T20:17:36Z

> So the reason I wasn't converting to a data table was that I thought it was non-performant. See this comment by @nsheff

You were converting for the purpose of adding a metadata column.

This comment is about converting the object in order to use it for a computation. The data.table computation will be much faster. Kristyna is correct.

kkupkova · 2022-07-27T23:14:14Z

I am sorry! I completely forgot that the issues are fixed. I thought it was just forgotten. There are still a few problems:

1. The description for gene_name_key and gene_biotype is missing. I guess the documentation of the annotations object should also mention that it needs to contain these columns.
2. When I run calcFeatureDistRefTSS I get different results: we are calculating the distance from the middle of the peak, whereas here you are calculating the shortest distance. Since this should be essentially the same function, it should definitely produce the same results.
3. This is more of an idea: I am not even sure if we should have another new function, or just edit the calcFeatureDist function so that it returns i) the original coordinates, ii) the distance, and iii) the gene name if available. We should then add a few lines to the plotting function so that it is compatible. I think this would be the most user-friendly option. Creating another function that does essentially the same thing is a bit redundant.
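To make the difference in point 2 concrete, here is a toy sketch in base R with made-up coordinates. edgeDistance and midpointDistance are hypothetical helper names, and the sign convention (positive when the feature lies at a higher coordinate than the region midpoint) is an assumption for this sketch only, not necessarily the one used by calcFeatureDist.

```r
# Shortest ("edge") distance: gap between the feature and the nearest edge of
# the region, or 0 when the feature falls inside the region. Always >= 0.
edgeDistance <- function(start, end, feature) {
  pmax(start - feature, feature - end, 0)
}

# Midpoint distance: signed offset of the feature from the region midpoint.
midpointDistance <- function(start, end, feature) {
  feature - (start + end) / 2
}

# One region spanning 1000-2000, compared against three single-base features.
regionStart <- 1000
regionEnd <- 2000
features <- c(2300, 1500, 700)

edgeResult <- edgeDistance(regionStart, regionEnd, features)
midResult <- midpointDistance(regionStart, regionEnd, features)
```

For the feature at 2300 the edge distance is 300 while the midpoint distance is 800, so two functions that differ only in this choice will report different values for the same region and gene.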

nturaga and others added 3 commits October 26, 2021 17:06

bump x.y.z version to even y prior to creation of RELEASE_3_14 branch

2a030e7

bump x.y.z version to odd y following creation of RELEASE_3_14 branch

d00b0f1

Pass serialized S4 instances thru updateObject()

bebda48

nsheff mentioned this pull request Jan 8, 2022

Nearest Genes Function #170

Closed

nleroy917 added 3 commits January 9, 2022 16:18

ad nearest-genes function

9f866a1

add suport for GRangesList

b416148

dynamic access of gene_id and gene_biotype

1ed7a63

nleroy917 force-pushed the dev-nathan-2 branch from abdf9cf to 1ed7a63 Compare January 9, 2022 21:23

swap out for mcols

e82d7f2

kkupkova added 5 commits January 20, 2022 18:57

Merge pull request databio#183 from databio/dev

496d06b

Version 1.3.1

Merge pull request databio#184 from databio/dev

67664d9

1.3.1 update NEWS

Merge pull request databio#185 from databio/dev

4e9b7a7

Update full-power.Rmd

Merge pull request databio#186 from databio/dev

68e6aca

increase version number - correct

Merge remote-tracking branch 'upstream/master'

37e8c30

update calcNearestGenes to support GRangesList

534e7b0

Merge branch 'databio:master' into dev-nathan-2

66570a2

add the directional distances function

4a5a00a

add calcNearestGenesRef. document directional distances function. rem…

408592c

…ove test code.
