SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
SAM/BAM format to wiggle format pinki999 Bioinformatics 19 08-12-2015 12:35 AM
Anyone know how Snowshoes-FTD finds junction-spanning reads? thondeboer Bioinformatics 2 09-04-2013 11:55 PM
Reads identifiers of reads spanning a SNP pm2012 Bioinformatics 5 04-09-2013 12:49 PM
Are intron size and softclipped reads included in TLEN field of SAM format? okyoon Bioinformatics 2 07-17-2012 02:47 PM
Tophat: find junction spanning reads thurisaz RNA Sequencing 4 11-14-2011 04:23 AM

Reply
 
Thread Tools
Old 05-08-2013, 10:26 AM   #1
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 40
Default Intron spanning reads result in plateaus in wiggle format ?

Hello everybody,

I would like to visualise the amount of mapped reads of custom RNAseq samples in the UCSC Genome Browser.

For this purpose, I can start from already sorted and indexed BAM files, which we obtained from a former collaborator. Since we are all wetlab scientists with limited insight into RNAseq, I unfortunately can’t tell you, which platform and aligner was used.

To generate the wiggle format tracks, I tried the following code, but despite using the –split option of genomeCoverageBed, I can’t reproduce the tracks as they have been previously created by the collaborator.

Code:
genomeCoverageBed –bg -split -ibam  ~/Bioinfo/BAM/RNASeq/unique_mapped/RNASeq_HYPO_L1_unique.bam  -g ./chromInfo-mm9.txt  >   ~/Bioinfo/Bioinformatics/HypoL1.bedgraph

wigToBigWig ./Bioinformatics/HypoL1.bedgraph ./chromInfo-mm9.txt ./Bioinformatics/HypoL1.bigwig
Namely, there seem to be reads which map across an intron and cause a plateau in the tracks wiggle representation - have a look at the second black track, which is the one I generated.

Somehow, the former collaborator either excluded those reads or split them to into two pieces – the blue track at least doesn’t show those plateaus. (The track in the middle shows the collaborators bedgraph track after conversion into BigWig - so its not the different format)


Link to public example session

Does anybody have a hint for me ?

Thanks a lot
Thias
Thias is offline   Reply With Quote
Old 01-22-2014, 06:32 AM   #2
john789
Junior Member
 
Location: England

Join Date: Jan 2011
Posts: 5
Default

Thias, did you find a solution? I have a similar problem with two data sets I created. I can't see what I did different that would result in there being values for the introns in the ucsc browser for one experiment but not the other.

thanks
John
john789 is offline   Reply With Quote
Old 01-23-2014, 03:08 AM   #3
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 40
Default Some pondering about this problem

Dear John,

As mentioned in the previous post, I have no bioinformatic education (apart from self-taught stuff) and depend on forums like that to get answers. So I unfortunately can't provide you with a satisfactory, expert-bioinformatician-approved, ready-to-use solution to this problem.

However I spent some time trying and will summarise what I did in the end. I decided that there should be an possibility to filter out the undesired reads, which cause the plateaus.

Therefore I first obtained a BED file with the coordinates of all introns. If you have a known organism with a released reference genome on the UCSC Genome Browser this is fairly easy:

Navigate to the UCSC Table tool and decide for your appropriate organism and assembly. Then select as group "Gene and Gene Prediction Tracks" and the gene annotation track of your choice (e.g. " Refseq Genes" or "Ensembl Genes"). Choose the region "genome" and the output format "BED - browser extensible data". Name the file and click on the "get output" button. In the radio button form select "Introns plus 0" for the exact intron regions.

I used this introns.bed file to intersect my aligned reads with. To do so, I applied intersectBed from the bedtools suite and retained only those reads, which didn't overlap more than 30% of the read length with an intron (-v and -f 0.3 options).

Code:
bedtools bamtobed -split -i Alignment.bam | bedtools intersect -v -f 0.30 -a stdin  -b refseq_introns.bed > alignment-piped.bed 

bedtools intersect -v -f 0.30 -abam -split -a Alignment.bam  -b refseq_introns.bed > alignment-direct.bed
Attention: Both of the above commands work, but they don't produce the same output as detailed below!

The second commands execution time is much faster (because it is not piped) and it retains for my file less, but longer fragments. (BED file entries: command 1: 24797290, command 2: 19152916)

I used
Code:
awk < alignment-piped.bed  '{print $3-$2 '\n'}' >> pipedlength.txt
to get an idea about the length distribution of the BED entries and summarised that in R out of curiosity:

Quote:
> summary(pipedlength)
X36
Min. : 1.00
1st Qu.:36.00
Median :36.00
Mean :31.89
3rd Qu.:36.00
Max. :36.00
> summary(directlength)
X36
Min. : 36.0
1st Qu.: 36.0
Median : 36.0
Mean : 37.9
3rd Qu.: 36.0
Max. :427053.0
>
So in case of the second command, some long reads are retained, while they are probably chopped into smaller fragments with command 1. I guess that those long reads cause the plateaus in the wiggle format.

So you have the choice: Apply the first command (despite the longer processing time) or use the second command and additionally filter those long reads.

Code:
awk '$3-$2 <= 36 { print }' alignment-direct.bed  > alignment-direct-filtered.bed
Obviously it would be interesting to resolve the nature of those long reads. I think - since our samples originate from cancer cells - this could indicate some chromosomal rearrangements. However I didn't proceed further at this point yet.



I hope I could help you a bit and if you find out about the reason of this differences, I'd be happy to read you reply here.

Best
Matthias

Last edited by Thias; 01-23-2014 at 05:50 AM.
Thias is offline   Reply With Quote
Old 05-09-2014, 03:02 AM   #4
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 40
Default

Forget what I wrote in the last post (I will not delete it though).

The solution is much simpler. I didn't align the reads myself, but recieved those files from a former collaborator and had no clue about the aligner which was used. The SAM/BAM format specification allow for two different letters to indicate a gap in the mapping: N for Introns and D for deletions.

However the -split option of bedtools only splits reads, if the gaps are recorded as N in the CIGAR string of the BAM file. (there is the special option -splitD to force bedtools to split also reads seperated by D). In my BAM file, I have had some intron-spanning reads (the longest introns) which were recorded with Ds and therefore not splitted.
Thias is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:52 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO