注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Puriney's Notes

Puriney=purine+Y, my Wonderland

 
 
 

日志

 
 

[BIO] intersectBED/coverageBED -split option  

2012-09-15 22:24:10|  分类: Bio |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

I had posted a question on BioStar. http://www.biostars.org/post/show/53032/intersectbedcoveragebed-split-purify-exon/. I was wondering how to calculate RPKM for three situations: 

  1. For given RefSeq gene, all reads overlapping with gene are included. (Q1)
  2. For given RefSeq gene, only reads overlapping exon are inclued.(Q3)
  3. For given Exon, all reads overlapping with exon are inclued.(Q2)

 

Questions are followed: 

all.reads.bam file records mapped RNA-seq reads data, including:

  1. exon:exon junction
  2. exon body
  3. intron body
  4. exon:intron junction

Q1: When calculating RPKM for given RefSeq gene including all the position reads, will the following command just calculate exon:exon junction reads and at same time ignore all other reads?

coverageBED -abam all.reads.bam -b refseq.genes.BED12.bed -s -split >coverage.bed

I'm confused by the mannual (Page 62):

When dealing with RNA-seq reads, for example, one typically wants to only tabulate coverage for the portions of the reads that come from exons (and ignore the interstitial intron seqeunce), The -split command allows for such coverage to be performed.

If "-split" is set, the exon:exon read (for example, 30M3000N46M") exists in -abam bam file, and the 3000N will NOT be wrongly intersected when running intersectBED command. But what about coverageBED command? I do hope the 3000N will be not calculated which makes sense, and I also hope the intron body reads and other reads will be NOT ignored.

Q2: If one just want to calculate exon's RPKM, does it mean one should prepare -b file to record all the exon information, and run like 

coverageBED -abam all.reads.bam -b all.exons.bed -s >coverage.bed

Q3: How to calculate RPKM for given genes whose reads overlap with exons? We all known BED12 format file record the exon information (start and length) for given RefSeq gene. Could BEDtools do the magic things? Note this calculation is different from Q1.

 

As I don't have time to wait for others' replies, dummy data is produced to help me understand how exactly intersectBED/coverageBED -split works? 

"reads.bed" input file records all the reads information in BED12 format: 

chr1    30    50    a    0    +    30    50    0    1    20,    30
chr1    110   130   b    0    +    110   130   0    1    20,    110
chr1    180   210   c    0    +    180   210   0    1    30,    180
chr1    110   230   d    0    +    110   230   0    2    20,10,    110,220
chr1    60    230   e    0    +    60    70    0    2    50,10,    60,220,

read_a: within exon1

read_b: within intron1

read_c: overlap with 5' end of exon2

read_d: spliced read, 5' half within intron1, and 3' half within exon2 ( in the real situation, both half reads will be within exon)

read_e: spliced read, 5' half overlap with 3' end of exon1, and 3' half within exon2

 

"ref.bed" input file records the given refseq gene with exon information in BED12 format: 

chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,

Now let's see what is going on with -split option? 

1. No "-split" is set at either step.

head -1 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed

then we get: 

chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    1   20    500    0.0400000

read_a (number is 1) managed to be calculated, and the base coverage length is 20 nt as expected. 

The same thing happens when we run :

head -2 reads.bed|intersectBed -a stdin -b ref.bed -s | coverageBed -a stdin -b ref.bed
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    2    40    500    0.0800000

This means as long as the read is mapped within the gene, it will be calculated as expected. 

2. intersectBED -split will ignore all the intron reads

Having run this command, we will get:

intersectBed -a reads.bed -b ref.bed -s

 

chr1    30    50    a    0    +    30    50    0    1    20,    30
chr1    110    130    b    0    +    110    130    0    1    20,    110
chr1    180    210    c    0    +    180    210    0    1    30,    180
chr1    110    230    d    0    +    110    230    0    2    20,10,    110,220
chr1    60    230    e    0    +    60    70    0    2    50,10,    60,220,

All the reads are included. 

However, when we add -split

intersectBed -a reads.bed -b ref.bed -s -split
chr1    30    50    a    0    +    30    50    0    1    20,    30
chr1    180    210    c    0    +    180    210    0    1    30,    180
chr1    110    230    d    0    +    110    230    0    2    20,10,    110,220
chr1    60    230    e    0    +    60    70    0    2    50,10,    60,220,

Only read_b which is mapped within intron is gone. Though read_d has 5' half located within intron, it still passes and gets calculated. 

3. coverageBED -split will calculate sub-region for spliced read, and will NOT ignore intron reads. 

How about the next command coverage

Having run as followd, then we will get the gene reads number. 

intersectBed -a reads.bed -b ref.bed -s |coverageBed -a stdin -b ref.bed -s 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    5    190    500    0.3800000

How to get 190? Let's check it step by step.

head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s 
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    3    70    500    0.1400000

Apparently, the read_a to read_c pass and get calculated. 30~50, 110~130, and 180~210 are used to calculate the base coverage. 

Try add read_d into the list.

head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    4    140    500    0.2800000

read_a to read_d pass. read_d has 2 blockes, one is 110~130 while the other is 220~230. Indeed, 110~230 will be included, but the repeat will not be calculated repeatedly. Thus, 30~50, 110~130, 130~180, 180~210, 210~-230, in sum, 140nt will be the anwser. 

 

After adding -split.

awk '{if ($4 == "d") print $0}' reads.bed | intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split
chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    1    30    500    0.0600000
head -3 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    3    70    500    0.1400000
head -4 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split  chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    4    100    500    0.2000000

NOTE: Thus, coverageBED -split will calculate the repeated region 110~130 (within intron). Read_b is 110~130, and read_e 's 5'half is same. 

This bug-like result probably is due to the 5'half of read_d will NOT be produced in the real RNA-seq experiments. 

 

NOTE: intron repeat will calculated twice, but the exon repeat will NOT

head -5 reads.bed |intersectBed -a stdin -b ref.bed -s |coverageBed -a stdin -b ref.bed -s -split  chr1    0    500    gene    0    +    0    500    0    3    100,100,100,    0,200,400,    5    150    500    0.3000000

The 220~230 region is covered by read_d and read_e twice. if this region is calculated twice, the answer will be 160, but it is 150 because 220~230 is calculated only once. 

  评论这张
 
阅读(1565)| 评论(0)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017