SEQanswers

Old 08-06-2014, 06:58 AM   #1
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Bowtie2 detecting human transcripts that STAR misses

Hi everyone,

I'm having the problem mentioned in the title and it's not making any sense to me. With the RNA-Seq dataset that I have, I run STAR and then look at the leftover transcripts, usually by blasting some of them. Often they are still mostly human (they align to hg20 using bowtie2). I can't understand this at all: STAR, being a spliced aligner, should be aligning far more than bowtie2 does. I was thinking it could indicate human DNA contamination, but even then shouldn't STAR still align contiguous sequences? Here are two such reads that weren't aligned by STAR but are by bowtie2 (they're not paired-end, so these are two different reads). I'd hate to stop using STAR; I love that speed.

TATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGT

ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC

Hopefully someone can point me in the right direction!
Thanks!
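
For anyone wanting to reproduce this comparison, here is a minimal Python sketch of how one might list the reads that bowtie2 aligned but STAR did not, given the two SAM outputs. The function names and the tiny example records are made up for illustration; only the SAM conventions (header lines start with `@`, flag bit 0x4 means "unmapped") come from the SAM format itself.

```python
def mapped_read_names(sam_lines):
    """Return the names of reads that have at least one mapped record."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & 0x4:  # 0x4 = read is unmapped
            names.add(fields[0])
    return names


def only_in_second(first_sam, second_sam):
    """Reads mapped in second_sam that have no mapped record in first_sam."""
    return mapped_read_names(second_sam) - mapped_read_names(first_sam)


# Made-up records, purely illustrative.
star = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\t*",
]
bowtie2 = [
    "r1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t16\tchr2\t500\t30\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
]
print(sorted(only_in_second(star, bowtie2)))
```

In practice the two lists would be replaced by open file handles on the STAR and bowtie2 SAM files; for very large files a streaming comparison on name-sorted input would use less memory.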
Old 08-07-2014, 01:11 AM   #2
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

Did you mean you look at the leftover reads (as opposed to transcripts)?

Also, what's the quality like on those reads, and what do the bowtie alignments look like?
Old 08-07-2014, 01:59 AM   #3
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59

Quote:
Originally Posted by mikep
Did you mean you look at the leftover reads (as opposed to transcripts)?

Also, what's the quality like on those reads, and what do the bowtie alignments look like?

Yeah, the leftover reads are what I meant. The quality varies a bit; there are some bad ones in there, but plenty of good ones too. The quality of all of these reads should be enough to allow an accurate alignment.

The alignments look fine. As I said in the previous post, I blasted a lot of these reads first and they were hitting human sequences, so that's when I decided to run bowtie2. So I think the bowtie2 alignments are accurate, or relatively accurate anyway. I just don't understand why STAR didn't align these reads.
Old 08-07-2014, 02:16 AM   #4
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

Well, I dunno what bowtie2 is doing, but that first sequence you posted above has a 100% hit to various bacterial sequences, and no hits to human using megablast, so I'd be rather glad STAR isn't aligning it. The 2nd seems to hit some random stretch of the human genome not associated with any gene, and it looks chimeric. It needed BLAST against nr to find it; megablast vs the human genome found no hits.

I wouldn't worry about them. What % of your reads fall in this category?

Any chance your username comes from Arrested Development?
Old 08-07-2014, 02:46 AM   #5
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59

Quote:
Originally Posted by mikep
Well, I dunno what bowtie2 is doing, but that first sequence you posted above has a 100% hit to various bacterial sequences, and no hits to human using megablast, so I'd be rather glad STAR isn't aligning it. The 2nd seems to hit some random stretch of the human genome not associated with any gene, and it looks chimeric. It needed BLAST against nr to find it; megablast vs the human genome found no hits.

I wouldn't worry about them. What % of your reads fall in this category?

Any chance your username comes from Arrested Development?

Oh sorry, my bad, that first sequence must be from some other source.

Well, that's the problem: in some files it's as high as 50%. I've had problems with contamination in this dataset before, though, so I wouldn't be surprised if there was more.
Old 08-07-2014, 06:48 AM   #6
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59

Quote:
Originally Posted by mikep
Well, I dunno what bowtie2 is doing, but that first sequence you posted above has a 100% hit to various bacterial sequences, and no hits to human using megablast, so I'd be rather glad STAR isn't aligning it. The 2nd seems to hit some random stretch of the human genome not associated with any gene, and it looks chimeric. It needed BLAST against nr to find it; megablast vs the human genome found no hits.

I wouldn't worry about them. What % of your reads fall in this category?

Any chance your username comes from Arrested Development?

And yeah, it comes from Arrested Development. Bob Loblaw's Law Blog.

You know, come to think of it, I have seen something like this in RNA-Seq datasets before, even published ones: one sequences the transcriptome of human or mouse or whatever, but not all of it aligns back to the reference (in my experience, sometimes as much as 10 or 15% doesn't). I was never really able to find an answer as to why that was; I always just figured it was chimeric reads and the like. Perhaps that is the case here and bowtie2 is able to align them where STAR is not... or maybe I'm grasping at straws.
Old 08-07-2014, 09:43 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Perhaps STAR has trouble with reads containing sequencing errors. Do the alignments in bowtie2 but not STAR contain lots of mismatches and/or clipping?
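
One way to answer that question is to tally soft/hard clipping and the `NM` tag for each bowtie2 record. The sketch below is only an illustration: the function name and the example record are invented, and it assumes the aligner wrote an `NM:i:` tag (note that `NM` is edit distance, so it counts indel bases as well as mismatches).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipping_and_mismatches(sam_line):
    """Summarize clipping and edit distance for one SAM record."""
    fields = sam_line.split()
    ops = CIGAR_OP.findall(fields[5])
    # S and H are the clipped parts of the read.
    clipped = sum(int(n) for n, op in ops if op in "SH")
    # M, I, S, = and X consume read bases, so their sum is the read length.
    read_len = sum(int(n) for n, op in ops if op in "MIS=X")
    nm = None
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
    return {"clipped": clipped, "read_len": read_len, "nm": nm}


# Made-up record: 61 aligned bases, 40 soft-clipped, edit distance 2.
example = "read1 0 chr1 1000 30 61M40S * 0 0 * * NM:i:2"
stats = clipping_and_mismatches(example)
print(stats, round(stats["clipped"] / stats["read_len"], 2))
```

A large clipped fraction on the reads that only bowtie2 places (in local mode) would point at partial or chimeric alignments rather than sequencing errors.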
Old 08-07-2014, 08:16 PM   #8
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45

I normally get about a 10% miss rate with mapping; I finished a bunch of STAR runs this morning and found a miss rate of 25%.

If I find anything in it I'll get back, otherwise 'fraid I got nothing.
Old 08-07-2014, 09:30 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If you want a higher mapping rate... you might give BBMap a try. It's splice-aware and substantially more sensitive than TopHat.
Old 08-14-2014, 03:38 PM   #10
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

hi @bob-loblaw,

As @mikep pointed out, the second sequence maps chimerically. You would need to enable chimeric output with --chimSegmentMin 20, and then STAR will output it into Chimeric.out.sam:

1 0 chr10 110358273 3 61M40S * 0 0 ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC * NH:i:2 HI:i:1 AS:i:62 NM:i:0 MD:Z:61
1 272 chr10 110358218 3 40M61S * 0 0 GACCATCTGCAGGTACATGATTGCTTTCTACTCAGGGGTGCTGGCATTTATTCAGCACACATTAAATGACAAAAGGTCTCAAGTAAACACCACTAGAAGGT * NH:i:2 HI:i:2 AS:i:43 NM:i:0 MD:Z:40
I believe this is the same as the BLAST alignment. This is a strange chimeric sequence, with two pieces mapping in the same locus on the opposite strands.
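
To make the two records above easier to read, here is a small Python sketch that decodes the flag, strand and reference span of each one and checks that the second sequence is the reverse complement of the read. The helper names are made up for illustration; the records are copied from the output above, and the flag bits (0x10 = reverse strand, 0x100 = not primary) are from the SAM format.

```python
import re


def revcomp(seq):
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def summarize(sam_line):
    """Pull strand, secondary flag and reference span out of one SAM record."""
    f = sam_line.split()
    flag, pos, cigar = int(f[1]), int(f[3]), f[5]
    # Only M, D, N, = and X consume reference bases; S (soft clip) does not.
    ref_len = sum(int(n) for n, _ in re.findall(r"(\d+)([MDN=X])", cigar))
    return {
        "strand": "-" if flag & 0x10 else "+",
        "secondary": bool(flag & 0x100),
        "start": pos,
        "end": pos + ref_len - 1,
        "seq": f[9],
    }


READ = (
    "ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGA"
    "ATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC"
)
MATE = (
    "GACCATCTGCAGGTACATGATTGCTTTCTACTCAGGGGTGCTGGCATTTAT"
    "TCAGCACACATTAAATGACAAAAGGTCTCAAGTAAACACCACTAGAAGGT"
)
rec_a = f"1 0 chr10 110358273 3 61M40S * 0 0 {READ} * NH:i:2 HI:i:1 AS:i:62 NM:i:0 MD:Z:61"
rec_b = f"1 272 chr10 110358218 3 40M61S * 0 0 {MATE} * NH:i:2 HI:i:2 AS:i:43 NM:i:0 MD:Z:40"

a, b = summarize(rec_a), summarize(rec_b)
print(a["strand"], a["start"], a["end"])
print(b["strand"], b["start"], b["end"], b["secondary"])
print(revcomp(READ) == b["seq"])
```

Read this way, the first 61 bases of the read sit on the plus strand and the remaining 40 bases sit on the minus strand a few bases upstream, which is the same-locus, opposite-strand arrangement described above.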

You can also allow the output of the longer segment into the Aligned.out.sam file by reducing the minimum mapped score/length requirement, e.g. --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.5:
1 0 chr10 110358273 255 63M38S * 0 0 ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC * NH:i:1 HI:i:1 AS:i:62 NM:i:0 MD:Z:63
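
As a rough illustration of why this record only appears after relaxing the filters, the Python sketch below models the two thresholds as simple fractions of the read length. This is a simplified assumption, not STAR's actual code: the function name is invented, and the 0.66 defaults are what the STAR manual lists for both parameters as far as I know.

```python
def passes_length_filters(score, matched, read_len,
                          score_min_over_lread=0.66,
                          match_nmin_over_lread=0.66):
    """Simplified model: keep an alignment only if both its score and its
    number of matched bases reach the given fraction of the read length."""
    return (score >= score_min_over_lread * read_len
            and matched >= match_nmin_over_lread * read_len)


# Values taken from the record above: AS:i:62, 63M, 101-base read.
score, matched, read_len = 62, 63, 101

print(passes_length_filters(score, matched, read_len))
print(passes_length_filters(score, matched, read_len,
                            score_min_over_lread=0.0,
                            match_nmin_over_lread=0.5))
```

With the assumed defaults the 63 matched bases fall short of 0.66 x 101 (about 67), so the alignment is dropped; with 0 and 0.5 it is kept.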

The low mapping rate may be caused by various factors. The Log.final.out file can give you some hints about mapped length, error rate, multi-mappers etc. (if you post it I can have a look at it). You can try to reduce the --outFilterMatchNminOverLread value to check whether only small portions of the reads can be mapped. The most typical reasons for low mappability are
(i) rRNA. Normally they appear as multimappers; make sure that you include unplaced scaffolds in the genome, since one of them contains very highly expressed rRNA loci.
(ii) poor sequencing quality of the read ends (then reducing --outFilterMatchNminOverLread will help)
(iii) contamination
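
A quick way to pull those hints out of Log.final.out is sketched below in Python. It assumes the usual `name | value` layout of that file; the function name is made up and the numbers in the sample text are invented purely for illustration, not taken from this dataset.

```python
def parse_log_final(text):
    """Turn 'name | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines have no '|'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


# Invented numbers, only to show the layout.
sample = """\
                          Number of input reads |\t1000000
                                  UNIQUE READS:
                        Uniquely mapped reads % |\t50.00%
             % of reads mapped to multiple loci |\t5.00%
                 % of reads unmapped: too short |\t40.00%
                     % of reads unmapped: other |\t5.00%
"""

stats = parse_log_final(sample)
too_short = float(stats["% of reads unmapped: too short"].rstrip("%"))
print(stats["Uniquely mapped reads %"], too_short)
```

A high "unmapped: too short" percentage would suggest that only part of each read can be aligned, which is the case where lowering --outFilterMatchNminOverLread is worth trying.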

Hopefully, that strange chimeric sequence is not representative of the reads that cannot be mapped - if so, it would mean some strange library making artifact.

Cheers
Alex

Tags
rnaseq star bowtie2
