Nearest Genes Function #172
base: dev
@kkupkova I fixed the typo you pointed out and I also added support for |
@nleroy917 can you address the merge conflicts here please? |
Did you fix your issue with dynamically accessing attributes? |
|
Looking at this StackOverflow article it seems if you want to access a column named
col_name = "gene_type"
result1 = df$gene_type
result2 = df[, col_name]
result3 = df[[col_name]]
And all results will be identical.
# calculate the nearest annotations to given query
nearestIds = nearest(query, annotations)
# annotate nearest gene and type
nearestGenes = annotations[nearestIds]
#
# convert nearestGenes GRanges object to data.table
# and dynamically access the column that way...
# this is used to circumvent the fact that we cannot
# dynamically access metadata columns inside a GRanges
# object like we can a data.table:
# col = "gene_id"
# dt[[col]]
# ^^^ This doesn't work in a GRanges object.
#
query$nearest_gene = grToDt(nearestGenes)[[gene_name_key]]
query$nearest_gene_type = grToDt(nearestGenes)[[gene_type_key]] |
abdf9cf to 1ed7a63
GRanges uses a separate entity called 'metadata columns' for this. It used to be a function called You should not convert to a data table just to extract a column from the GRanges object. |
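A minimal sketch of reading a metadata column by a name held in a variable, without converting to a data.table. It assumes the GenomicRanges package is installed; the toy ranges and gene names are made up for illustration, and `gene_name_key` simply mirrors the parameter name used in this PR.

```r
library(GenomicRanges)

# Toy annotations; coordinates and gene names are illustrative only.
annotations = GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(100, 500), width = 50),
    gene_id = c("geneA", "geneB"),
    gene_biotype = c("protein_coding", "lncRNA")
)
query = GRanges("chr1", IRanges(start = c(120, 600), width = 10))

gene_name_key = "gene_id"

# mcols() returns the metadata columns as a DataFrame, which supports
# [[ ]] with a character column name.
geneNames = mcols(annotations)[[gene_name_key]]

# Index the extracted column by the nearest-annotation ids.
nearestIds = nearest(query, annotations)
query$nearest_gene = geneNames[nearestIds]
```

This keeps everything inside the GRanges object, so no intermediate table is built just to pull out one column.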
Is it inaccurate or non-performant? |
Added |
non-performant |
Hi!
|
Do you have example data inputs for both of these? Presently, this is what I am doing to get both a query and "annotations":
queryFile = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions")
query = rtracklayer::import(queryFile)
data(TSS_hg19)
Does a |
Look in the vignettes, there are examples there. And maybe read the vignette, it might help you better understand the purpose of this package and then design your function |
Version 1.3.1
1.3.1 update NEWS
Update full-power.Rmd
increase version number - correct
@nleroy917 will you be able to finish this today? We need to make a new release shortly. |
I can try. However,
this is still ambiguous to me. The function input need not be TSS's. It can be any
Regarding the first point, I included this code I found in other similar functions:
.validateInputs(list(query=c("GRanges","GRangesList")))
if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    x = lapply(query, calcFeatureDist, features)
    return(x)
}
Which, to me, looks like it is coercing any |
No, it's not coercing a GRangesList to a GRanges object; it is recursing across the individual GRanges components of the GRangesList object, and returning the results as a list. |
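To make the recursion concrete, here is a small sketch assuming the GenomicRanges package is installed. `countRegions` is a hypothetical stand-in for a real per-GRanges calculation and is not a function from this package.

```r
library(GenomicRanges)

# countRegions is a hypothetical stand-in for a per-GRanges calculation.
countRegions = function(query) {
    if (is(query, "GRangesList")) {
        # Recurse over each GRanges element; the result is a list
        # with one entry per element, not a single merged object.
        return(lapply(query, countRegions))
    }
    length(query)
}

gr = GRanges("chr1", IRanges(start = c(1, 100), width = 10))
grl = GRangesList(a = gr, b = gr[1])
res = countRegions(grl)
```

Here `res` is a plain list with elements `a` and `b`, one result per GRanges in the input list.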
I think she may be referring to your You should pattern your function after the way the current functions work. |
I see now. Would this be proper handling of a GRangesList?
# calcNearestGenes.R
if (is(query, "GRangesList")) {
    # Recurse over each GRanges object
    annots = lapply(
        query,
        function(x) {
            calcNearestGenes(x, annotations, gene_name_key=gene_name_key, gene_type_key=gene_type_key)
        }
    )
    return(annots)
}
Both should be GRanges objects. |
The input to this function is a set of TSS coordinates, so I would include in the example how to extract those from gene annotations. It might also be good to make a ref function for our available gene models. I offered to talk to clarify everything. Never heard back. |
The latest commit should address GRangesList compatibility.
Regarding this, are you referring to something like this where the calling signature becomes:
calcNearestGenes = function(query, refAssembly) {
    ...
}
and the TSSs are extracted through:
getTSSs(refAssembly)
Then the TSSs can be used to annotate the nearest genes? Should the function be able to support someone coming with their own annotations? I.e. you either bring your own, or designate a |
Sorry if I was not clear. I did not realize we also had a function for extracting TSSs from a GTF file. So I guess it is ok if the annotation object is the But that can be solved when the function is working. Now when I pass a query file, where the query is a GRangesList, I get an error. So the function is currently not able to handle multiple inputs at once. Please make sure that the function is able to handle both GRanges and GRangesList query objects. Also, at this point the tests are not passing, so please make sure that they are. Another thing, can you check that the function output gives the same distances as |
Make sure you have pulled down my latest changes. I addressed this yesterday and it works for me with both a
The linter was failing due to code outside my commits (Something in
I need to make sure I am understanding you correctly here, since you used "
I understand there is a time crunch with the release, so please let me know what I need to do, and I will prioritize it to get it done ASAP |
I think your list is accurate. All functions in the package have the 2 versions you describe. It was one of the main philosophies of the package. The |
Ok. I have been pondering the +/- value for distances... What if I did something like this:
# get distance to upstream and downstream
# with proper sign
distToUpstream = -1 * distance(query, annotations[precede(query, annotations)])
distToDownstream = distance(query, annotations[follow(query, annotations)])
# calculate absolute distance and find nearest
nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))
# coerce upstream back to negative by
# finding where the upstream distance was
# chosen and force it back to negative
nearestDist[nearestDist == abs(distToUpstream)] = -1 * nearestDist[nearestDist == abs(distToUpstream)]
query$distance_to_nearest = nearestDist |
This is what I came up with:
.directionalDistanceToNearest = function(x, y) {
    # get distance to upstream and downstream
    # with proper sign
    distToUpstream = -1 * distance(x, y[precede(x, y)])
    distToDownstream = distance(x, y[follow(x, y)])
    # calculate absolute distance and find nearest
    nearestDist = pmin(abs(distToUpstream), abs(distToDownstream))
    # coerce upstream back to negative by
    # finding where the upstream distance was
    # chosen and force it back to negative
    nearestDist[nearestDist == abs(distToUpstream)] = -1 * nearestDist[nearestDist == abs(distToUpstream)]
    return(nearestDist)
}
Testing on the same data, I get this result. Where
       chr     start       end    nearest_gene nearest_gene_type nearest_distance TEST_nearest_distance
   1: chr1   3190582   3191428 ENSG00000130762    protein_coding           179560               -179560
   2: chr1   8130440   8131887 ENSG00000116285    protein_coding            44070                -44070
   3: chr1  10593124  10594209 ENSG00000160049    protein_coding            60539                -60539
   4: chr1  10732071  10733118 ENSG00000130940    protein_coding           123588                123588
   5: chr1  10757665  10758631 ENSG00000130940    protein_coding            98075                 98075
  ---
1335: chrX 139380917 139382199 ENSG00000134595    protein_coding           205025                205025
1336: chrX 139593503 139594774 ENSG00000134595    protein_coding             6276                 -6276
1337: chrX 139674500 139675403 ENSG00000134595    protein_coding            87273                -87273
1338: chrX 147829017 147830159 ENSG00000155966    protein_coding           246877                246877
1339: chrX 150407693 150409052 ENSG00000102195    protein_coding            62567                 62567 |
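For readers following the sign logic, here is a simplified base-R sketch of a signed nearest distance on plain numeric positions. It is not the GRanges-based function from this PR: `signedNearestDistance` is a hypothetical name, strand is ignored, and the sign convention (negative when the nearest feature sits at a smaller coordinate than the query) is an assumption made for the illustration.

```r
# Signed distance from each query position to its nearest feature position.
# Assumed convention: negative when the nearest feature lies at a smaller
# coordinate than the query, positive when it lies at a larger one.
signedNearestDistance = function(queryPos, featurePos) {
    vapply(queryPos, function(q) {
        offsets = featurePos - q
        # which.min returns the first minimum, so ties resolve to the
        # feature listed first in featurePos
        offsets[which.min(abs(offsets))]
    }, numeric(1))
}

result = signedNearestDistance(c(10, 90), c(0, 40, 100))
```

Computing the signed offset first and then picking the smallest absolute value avoids the separate step of restoring the sign afterwards.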
The code now works, so that is awesome. But I still have a few things to add.
|
Should be done.
Should be done
So the reason I wasn't converting to a data table was that I thought it was non-performant. See this comment by @nsheff
Should be done
Am I correct that I need to run |
You were converting for the purpose of adding a metadata column. This is talking about converting it to use object for a computation. The data.table computation will be much faster. Kristyna is correct. |
I am sorry! I completely forgot that the issues are fixed. I thought it was just forgotten. There are still a few problems:
|
Third time's a charm? I think my branch/base locally got messed up, so I am just opening a new one.
Given a query and set of annotations, this function will calculate
the nearest annotation to each region in the region set, as well
as the nearest gene type and the distance to the nearest gene.
This is updated from the previous version which wasn't functioning
properly!
The only issue I am having is that the function requires that the
annotation file has the name of the gene defined as gene_id
and the type of gene as gene_biotype.
It would be nice to introduce a keyword parameter that defaults to these two,
but can be changed to extract this information should an
annotation file be given with a different schema or naming convention.
However, I am having a hard time attempting to dynamically access these attributes. For example:
versus