
Puriney's Notes

Puriney=purine+Y, my Wonderland

 
 
 

[R][BioC] building a refseq-based transcriptDb: warnings of interest?  

2013-02-17 12:59:59 | Category: Bio

Hi Vince,

On 07/23/2010 02:50 AM, Vincent Carey wrote:
>> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> warnings()
> Warning messages:
> 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... :
>    UCSC data anomaly in transcript NM_017940: the cds cumulative length is not a multiple of 3
> 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... :
>    UCSC data anomaly in transcript NM_001037675: the cds cumulative length is not a multiple of 3
> 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... :
>    UCSC data anomaly in transcript NM_001039703: the cds cumulative length is not a multiple of 3
> 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... :
>
> and so on.  Does this need to be reported to UCSC?

Glad you brought this up. If you look at the schema:

   http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema

the refGene table has cdsStartStat and cdsEndStat cols (in addition to the cdsStart and cdsEnd cols) which describe the status of each CDS. My understanding is that only CDSs with status 'cmpl' (complete) are guaranteed to have a length that is a multiple of 3.

Currently makeTranscriptDbFromUCSC() imports all CDSs into the TranscriptDb object, regardless of their status, and issues a warning for each CDS that doesn't look right. Maybe not the best approach. Should we allow the user to filter CDSs based on this status? Or should we import only complete CDSs? Or should we import all the CDSs but record in the metadata table of the TranscriptDb object (and then display in the show method) the fact that not all the CDSs are complete? Then all TranscriptDb objects made with makeTranscriptDbFromUCSC() would be marked that way, except those obtained from the knownGene table where, AFAIK, all the CDSs are guaranteed to be complete.

One difficulty with the design of TranscriptDb objects was to come up with a db schema that would accommodate data coming from very different places like UCSC and biomaRt, and then to implement methods for extracting features from the db that would not be specific to one source or another. This is why the idea of adding the cdsStartStat and cdsEndStat cols to our own db was discarded: those cols are specific to UCSC (IIRC biomaRt/Ensembl doesn't provide this info). Not even all transcript-like tables at UCSC have them. And tables that have them don't necessarily use the same set of values for these cols (they use a MySQL enum type).

I guess it all depends on what people want to do with those CDSs.

Cheers,
H.
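
Note: the check behind the warning, and the "keep only complete CDSs" option discussed above, can be sketched in base R without any Bioconductor package. This is only an illustration under stated assumptions, not what makeTranscriptDbFromUCSC() does internally. The transcript names and coordinates are invented, the status values other than 'cmpl' ("incmpl", "none") are assumed enum values, and parse_coords() and cds_length() are hypothetical helpers.

```r
# Toy rows mimicking a few refGene columns. All values are invented for
# illustration; they are not real UCSC records.
ref_gene <- data.frame(
  name         = c("TX_A", "TX_B", "TX_C"),
  cdsStart     = c(110L, 205L, 300L),
  cdsEnd       = c(250L, 390L, 300L),
  exonStarts   = c("100,200,", "200,350,", "300,"),
  exonEnds     = c("150,280,", "250,400,", "400,"),
  cdsStartStat = c("cmpl", "incmpl", "none"),  # values besides "cmpl" are assumed
  cdsEndStat   = c("cmpl", "cmpl", "none"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse a UCSC-style comma-separated list ("100,200,").
parse_coords <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

# Hypothetical helper: cumulative CDS length, i.e. the summed overlap between
# each exon and the CDS interval. UCSC starts are 0-based and ends are
# exclusive, so an interval length is simply end - start.
cds_length <- function(cds_start, cds_end, exon_starts, exon_ends) {
  starts <- pmax(parse_coords(exon_starts), cds_start)
  ends   <- pmin(parse_coords(exon_ends), cds_end)
  sum(pmax(ends - starts, 0L))
}

ref_gene$cds_len <- mapply(cds_length,
                           ref_gene$cdsStart, ref_gene$cdsEnd,
                           ref_gene$exonStarts, ref_gene$exonEnds,
                           USE.NAMES = FALSE)

# A CDS counts as complete only if both status columns say "cmpl".
ref_gene$complete <- ref_gene$cdsStartStat == "cmpl" &
                     ref_gene$cdsEndStat == "cmpl"

# The condition that triggers the warning: length not a multiple of 3.
ref_gene$in_frame <- ref_gene$cds_len %% 3L == 0L

# Option "import only complete CDSs": filter on the status columns.
complete_cds <- ref_gene[ref_gene$complete, ]

print(ref_gene[, c("name", "cds_len", "complete", "in_frame")])
```

Filtering on the status columns rather than on the length keeps the decision tied to what UCSC itself reports, which matters because a non-coding record (cdsStart equal to cdsEnd) has length 0 and would pass a bare multiple-of-3 test.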

--
https://stat.ethz.ch/pipermail/bioconductor/2010-July/034568.html