SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Stat summary on reads alignment after bowtie2 lgiloteaux RNA Sequencing 2 09-19-2014 12:42 PM
miRNA alignment goutham.atla RNA Sequencing 1 04-22-2014 04:54 AM
Try to further accelerate Bowtie2 alignment Tonnny97 Bioinformatics 1 01-17-2014 02:30 AM
Local alignment of short reads: Mosaik, Bowtie2, BWA? carmeyeii Bioinformatics 1 04-19-2013 03:39 PM
bowtie2 stops during alignment moritzhess Bioinformatics 3 01-13-2013 03:53 AM

Reply
 
Thread Tools
Old 03-30-2015, 12:48 AM   #1
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default miRNA reads alignment with bowtie2

Hi everyone,

I'm working on miRNA data right now (which is very new field for me). I'm trying to filter out only those reads which belongs to miRBase. Here is my procedure.
Align all reads with bowtie2 (default settings) to the reference build index. Which gives me 92.23% overall alignment rate - looks good so far.
Building new index with miRBase hairpin data (with U/T conversion) and then bowtie2 alignment to this index. Last alignment provides only 5.17% overall alignment rate.
Also I've checked first alignment with genome browser using miRNA annotations which clearly shows that reads rarely align to mRNA so it should be fine.

Do you have any clue how to solve this problem? Is it common to have only ~5% coverage of miRNA sequences while data originate from miRNA sequencing?
Ahaswer is offline   Reply With Quote
Old 03-30-2015, 01:23 AM   #2
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82
Default

For miRNA data, I would suggest using bowtie instead of bowtie2, since the author claims bowtie to be more sensitive for reads <50bps.

Also, I would suggest you make a database from Rfam and ncbi/Ensembl's known non-coding RNA (mostly rRNA,tRNA,snoRNA, etc). Map reads against this first and you'll see how much contamination you have in the data.
yueluo is offline   Reply With Quote
Old 03-30-2015, 01:31 AM   #3
mziemann
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 10
Default

Don't use Bowtie1 unless you've checked it with simulated reads yourself. With perfect 20nt reads Bowtie1 has a 15% error rate compared to Bowtie2 (~2%)
http://genomespot.blogspot.com.au/20...-compared.html
mziemann is offline   Reply With Quote
Old 03-30-2015, 01:46 AM   #4
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82
Default

Quote:
Originally Posted by mziemann View Post
Don't use Bowtie1 unless you've checked it with simulated reads yourself. With perfect 20nt reads Bowtie1 has a 15% error rate compared to Bowtie2 (~2%)
http://genomespot.blogspot.com.au/20...-compared.html
Thanks for the input, so bowtie1 has the highest mapping rate but also significantly high incorrect rate. Sound like a terrible tradeoff.
yueluo is offline   Reply With Quote
Old 03-30-2015, 01:57 AM   #5
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

I've made some research before and found opinions that bowtie1 is not better than bowtie2 while considering miRNA data. So I'm convinced to use bowtie2 for this purpose, but still I have this problem.
Ahaswer is offline   Reply With Quote
Old 03-30-2015, 02:05 AM   #6
mziemann
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 10
Default

Regarding the original Qn its really puzzling that your alignment to miRbase gave such low alignment rates. Did you align to the mature.fa or the hairpin.fa? I would think that you would get higher alignment rates for the hairpin.fa because it will capture most isomiRs. Removing special characters from the fasta headers might help too.
mziemann is offline   Reply With Quote
Old 03-30-2015, 02:11 AM   #7
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

Yes, I've aligned reads to the mature.fa also but this alignment provides me only 4.39% of overall alignment rate. As you suggested I'll try remove special characters from those files.
Ahaswer is offline   Reply With Quote
Old 03-30-2015, 01:53 PM   #8
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

Still the same overall alignment rate. Does anyone have some suggestions about bowtie2 parameters for this task?
Ahaswer is offline   Reply With Quote
Old 03-30-2015, 02:12 PM   #9
mziemann
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 10
Default

One explantion could be contamination as Yueluo suggested. Use featureCounts or HTSeq to see where the reads are landing in the reference genome. If these are mostly mRNA and rRNA, then you have a contamination issue.
mziemann is offline   Reply With Quote
Old 03-31-2015, 12:38 AM   #10
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

As you suggested I have used featureCounts to check reads location and it is more or less the same (4.6%). So it is an issue of contamination, however I will try to get some valuable data. Guess I can still get additional 1-2% through novel miRNA research.

Thank you yueluo and mziemann for your interest and help

But I still wonder what is typical overall alignment rate or content of miRNA sequences in miRNA sequencing data. Can anyone provide such information?
Ahaswer is offline   Reply With Quote
Old 03-31-2015, 12:46 AM   #11
peterawe
Member
 
Location: Oslo, Norway

Join Date: Mar 2013
Posts: 11
Default

Quote:
Originally Posted by mziemann View Post
Don't use Bowtie1 unless you've checked it with simulated reads yourself. With perfect 20nt reads Bowtie1 has a 15% error rate compared to Bowtie2 (~2%)
http://genomespot.blogspot.com.au/20...-compared.html
Hi, I read the great blog regularly. I was however a little puzzled by this analysis. My concern is what is meant by incorrectly and correctly mapped reads. If I got it right, the sequences for alignment are randomly cut out from miRNA-hairpins and mapped back against the reference genome. A possible problem is that, by chance, some of these sequences may match other loci as well. For 100% specificity, the aligner can discard such multi-mappers and thus seem very accurate, while for 100% sensitivity it could just report mapping on all corresponding loci. Thus, the large differences between the aligners may be due to different ways (default parameters) of assigning/reporting multi-mappers and not to actual errors. What's your thought on this?
peterawe is offline   Reply With Quote
Old 03-31-2015, 01:31 AM   #12
mziemann
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 10
Default

@Ahaswer
"But I still wonder what is typical overall alignment rate or content of miRNA sequences in miRNA sequencing data. Can anyone provide such information?"

Your small RNA seq should be >75% microRNA as a rule of thumb. If it is lower than that, then there is an issue about contamination which happens three ways (1) lots of adapter-only reads (2) degradation of mRNA/rRNAs into small fragments that swamp the miRNAs (3) Size selection of the miRNA-containing fragments went wrong.

@peterawe
Regarding my analysis, I use featureCounts "-R" option which outputs information regarding which feature each read is mapped to. I purposely use a map quality threshold of 20 to remove multimappers. Reads which are mapped back to the correct microRNA gene (mapq>=20) are "correct" and reads that are mapped to some other part of the genome (mapq>=20) are "incorrect". Reads that are unassigned due to mapq threshold are neither.

The % correct/incorrect proportions are relative to the starting number of reads, so they are accurate way of showing the performance. One thing I didn't explore was whether mapq20 was really the right threshold or not;. Who knows, maybe bowtie1 is really good at mapq30.
mziemann is offline   Reply With Quote
Old 03-31-2015, 02:36 AM   #13
peterawe
Member
 
Location: Oslo, Norway

Join Date: Mar 2013
Posts: 11
Default

Quote:
Originally Posted by mziemann View Post
@Ahaswer
"But I still wonder what is typical overall alignment rate or content of miRNA sequences in miRNA sequencing data. Can anyone provide such information?"

Your small RNA seq should be >75% microRNA as a rule of thumb. If it is lower than that, then there is an issue about contamination which happens three ways (1) lots of adapter-only reads (2) degradation of mRNA/rRNAs into small fragments that swamp the miRNAs (3) Size selection of the miRNA-containing fragments went wrong.

@peterawe
Regarding my analysis, I use featureCounts "-R" option which outputs information regarding which feature each read is mapped to. I purposely use a map quality threshold of 20 to remove multimappers. Reads which are mapped back to the correct microRNA gene (mapq>=20) are "correct" and reads that are mapped to some other part of the genome (mapq>=20) are "incorrect". Reads that are unassigned due to mapq threshold are neither.

The % correct/incorrect proportions are relative to the starting number of reads, so they are accurate way of showing the performance. One thing I didn't explore was whether mapq20 was really the right threshold or not;. Who knows, maybe bowtie1 is really good at mapq30.
Thanks for your reply, I think this may explain the difference. Running bowtie with default parameters returns a MAPQ of 255 for all alignable reads, regardless of multi-mapping. Thus, this part of the bowtie output is not very informative for downstream filtering.

@Ahaswer
miRNA content depends on biological sample, RNA quality, library preparation protocol and accuracy in gel-cutting. Hoen et al. did smRNA-seq on 465 lymphoblastoid cell lines and reported
Quote:
...the relative miRNA content in our samples ranged from 2% to 62% of mapped reads, with a median of 19%...
Hoen (2013) Nat Biotech. I guess this was an automated protocol, doing manual preparations we routinely get over 50%.
peterawe is offline   Reply With Quote
Old 03-31-2015, 06:12 AM   #14
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

Thanks everybody for your replies. That satisfies my curiosity, especially those papers.
Also I've checked reads alignment count for each miRNA sequence, based on miRBase annotations, which gives me subset ranging from 5 - 20 000 reads per sequence. Guess despite of mentioned before small alignment rate of overall reads I can still work on this dataset.
Ahaswer is offline   Reply With Quote
Old 03-31-2015, 10:56 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Since this was never mentioned in the thread, are the reads being adapter-trimmed prior to alignment? If not, that would cause a bias toward long rather than short sequences, and a low alignment rate.
Brian Bushnell is offline   Reply With Quote
Old 03-31-2015, 08:22 PM   #16
Ahaswer
Member
 
Location: PL

Join Date: Nov 2013
Posts: 13
Default

Hi Brian. Yes, the reads have been adapter-trimmed. Moreover alignment toward long sequences (whole reference) gives good alignment rate and low while using short (miRBase) sequences.
Ahaswer is offline   Reply With Quote
Old 03-31-2015, 10:18 PM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Ahhh... to clarify, what I meant was NOT adapter trimming reads could cause the bias, but I guess that's out if the reads were successfully trimmed.
Brian Bushnell is offline   Reply With Quote
Old 08-19-2015, 09:41 AM   #18
moses122
Junior Member
 
Location: MA

Join Date: Aug 2013
Posts: 2
Default

I got this problem too. Did you figure out why exactly alignment is better for long sequences than miRBase?
moses122 is offline   Reply With Quote
Old 08-20-2015, 04:45 AM   #19
mziemann
Member
 
Location: Australia

Join Date: Aug 2010
Posts: 10
Default

Quote:
Originally Posted by moses122 View Post
I got this problem too. Did you figure out why exactly alignment is better for long sequences than miRBase?
Please don't align directly to the mature.fa, only about 50% of miR reads are the exact mature sequence, you are missing a lot of sequence diversity that will map clearly to the harpin.fa. Calling 3p and 5p arm switching on the other hand, that's a different story.
mziemann is offline   Reply With Quote
Old 08-20-2015, 05:10 AM   #20
moses122
Junior Member
 
Location: MA

Join Date: Aug 2013
Posts: 2
Default

Quote:
Originally Posted by mziemann View Post
Please don't align directly to the mature.fa, only about 50% of miR reads are the exact mature sequence, you are missing a lot of sequence diversity that will map clearly to the harpin.fa. Calling 3p and 5p arm switching on the other hand, that's a different story.
Thanks for replying. I did mapped to hairpin.fa. The alignment rate is only around 0.01% and it's 90% for genome. I used bowtie and allowed one mismatch.
moses122 is offline   Reply With Quote
Reply

Tags
bowtie2, filtering, mirna seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:56 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO