The CDS and protein FASTA files were downloaded from UCSC on the same day. Running the following code produced the warning message shown after the progress output:
library(PGA)
annotation_path <- tempdir()
pepfasta <- "~/Downloads/hg19_refGenePro.fa"
CDSfasta <- "~/Downloads/hg19_refGeneCDS.fa"
PrepareAnnotationRefseq2(genome='hg19', CDSfasta, pepfasta, annotation_path,
                         dbsnp=NULL, splice_matrix=FALSE, COSMIC=FALSE)
Build TranscriptDB object (txdb.sqlite) ...
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
done
Prepare gene/transcript/protein id mapping information (ids.RData) ... done
Prepare exon annotation information (exon_anno.RData) ... done
Prepare protein sequence (proseq.RData) ... done
Prepare protein coding sequence (procodingseq.RData)... done
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable) :
UCSC data anomaly in 433 transcript(s): the cds cumulative length is not a multiple of 3
for transcripts ‘NM_033425’ ‘NM_006510’ ‘NM_001146344’ ‘NM_001010890’ ‘NM_001300891’
‘NM_001300891’ ‘NM_017940’ ‘NM_002537’ ‘NM_003954’ ‘NM_006510’ ‘NM_001278563’
‘NM_001291815’ ‘NM_001359231’ ‘NM_001354658’ ‘NM_001350198’ ‘NM_001243042’
‘NM_001243042’ ‘NM_002570’ ‘NM_001128590’ ‘NM_001271870’ ‘NM_001271872’ ‘NM_001329984’
‘NM_001037501’ ‘NM_001037675’ ‘NM_001277444’ ‘NM_001351365’ ‘NM_001297654’
‘NM_001288952’ ‘NM_001134939’ ‘NM_001301371’ ‘NM_153334’ ‘NM_001348286’ ‘NM_001348208’
‘NM_001348208’ ‘NM_001348208’ ‘NM_001348208’ ‘NM_001348208’ ‘NM_001289152’ ‘NM_199349’
‘NM_138324’ ‘NM_138323’ ‘NM_138322’ ‘NM_138319’ ‘NM_005671’ ‘NM_001143962’ ‘NM_000500’
‘NM_145171’ ‘NM_001318833’ ‘NM_006904 [... truncated]
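The warning is raised by the internal function .extractCdsLocsFromUCSCTxTable, which appears to work on the refGene table coordinates rather than on the downloaded FASTA files. As a rough, independent sanity check, the CDS FASTA itself could be scanned for records whose total length is not a multiple of 3. The sketch below is only an illustration in base R: the helper name cds_not_multiple_of_3 and the toy records are hypothetical, and it assumes the first line of the input is a FASTA header.

```r
# Hypothetical helper (not part of PGA or GenomicFeatures): return the names
# of FASTA records whose total sequence length is not a multiple of 3.
# Assumes the first line of `fasta_lines` is a header line starting with ">".
cds_not_multiple_of_3 <- function(fasta_lines) {
  is_header <- startsWith(fasta_lines, ">")
  record_id <- cumsum(is_header)                 # record index for every line
  headers   <- sub("^>", "", fasta_lines[is_header])
  # Sum the sequence characters per record; header lines contribute 0.
  seq_len <- as.vector(tapply(ifelse(is_header, 0L, nchar(fasta_lines)),
                              record_id, sum))
  headers[seq_len %% 3 != 0]
}

# Toy input for illustration only, not real UCSC data.
toy <- c(">tx_ok", "ATGAAA", "TAA",   # 9 nt
         ">tx_bad", "ATGAA",          # 5 nt
         ">tx_ok2", "ATGTAA")         # 6 nt
bad <- cds_not_multiple_of_3(toy)
print(bad)  # [1] "tx_bad"
```

On the real file this would be called as cds_not_multiple_of_3(readLines(CDSfasta)). Whether the records flagged this way match the transcripts listed in the warning is not something this sketch can confirm.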
Output of sessionInfo():
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2018.03
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] PGA_1.13.3 rTANDEM_1.22.1 Rcpp_1.0.1
[4] XML_3.98-1.20 data.table_1.12.2 Biostrings_2.50.2
[7] XVector_0.22.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[10] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] Biobase_2.42.0 httr_1.4.0
[3] bit64_0.9-7 assertthat_0.2.1
[5] BiocManager_1.30.4 blob_1.1.1
[7] BSgenome_1.50.0 GenomeInfoDbData_1.2.0
[9] Rsamtools_1.34.1 remotes_2.0.4
[11] progress_1.2.2 pillar_1.4.1
[13] RSQLite_2.1.1 lattice_0.20-38
[15] glue_1.3.1 digest_0.6.19
[17] RColorBrewer_1.1-2 colorspace_1.4-1
[19] Matrix_1.2-17 plyr_1.8.4
[21] pkgconfig_2.0.2 pheatmap_1.0.12
[23] customProDB_1.22.1 biomaRt_2.38.0
[25] zlibbioc_1.28.0 purrr_0.3.2
[27] scales_1.0.0 processx_3.3.1
[29] BiocParallel_1.16.6 tibble_2.1.3
[31] ggplot2_3.2.0 AhoCorasickTrie_0.1.0
[33] SummarizedExperiment_1.12.0 GenomicFeatures_1.34.8
[35] lazyeval_0.2.2 magrittr_1.5
[37] crayon_1.3.4 memoise_1.1.0
[39] ps_1.3.0 MASS_7.3-51.4
[41] RMariaDB_1.0.6.9000 tools_3.5.3
[43] prettyunits_1.0.2 hms_0.4.2
[45] matrixStats_0.54.0 stringr_1.4.0
[47] munsell_0.5.0 DelayedArray_0.8.0
[49] AnnotationDbi_1.44.0 ade4_1.7-13
[51] compiler_3.5.3 rlang_0.3.4
[53] grid_3.5.3 RCurl_1.95-4.12
[55] VariantAnnotation_1.28.13 bitops_1.0-6
[57] gtable_0.3.0 curl_3.3
[59] DBI_1.0.0.9001 R6_2.4.0
[61] GenomicAlignments_1.18.1 Nozzle.R1_1.1-1
[63] dplyr_0.8.1 rtracklayer_1.42.2
[65] seqinr_3.4-5 bit_1.1-14
[67] readr_1.3.1 stringi_1.4.3
[69] tidyselect_0.2.5