SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Lots of unmapped reads - SOLiD bacterial RNA-seq and bowtie mapping Jean RNA Sequencing 10 01-17-2013 11:11 AM
Bowtie - only 4.12% reads aligned to transcriptome mcek RNA Sequencing 0 11-15-2011 02:10 AM
mapping transcriptome reads to assembled ESTs MadraghRua Bioinformatics 0 06-28-2011 08:54 AM
mapping reads against luciferase with bowtie mjp Bioinformatics 5 04-15-2011 09:57 PM
Tophat...Mapping reads against Reference with Bowtie [FAILED] Brajbio Bioinformatics 0 06-02-2010 12:33 AM

Reply
 
Thread Tools
Old 08-10-2010, 05:40 AM   #1
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Question transcriptome mapping with bowtie -- only 22% reads hit

To check how well the sequencing parts were done, I used bowtie to map all the reads from RNA-seq directly to human refseqs, and I got only 22% (of ) of reads have at least one hit, 78% failed to align, which I think is impossible.

my command is:
"bowtie -p 5 -m 10 -f trx -1 mate1.fa -2 mate2.fa >output"

The output:
# reads processed: 130653415
# reads with at least one reported alignment: 28765542 (22.02%)
# reads that failed to align: 101882364 (77.98%)
# reads with alignments suppressed due to -m: 5509 (0.00%)
Reported 28765542 paired-end alignments to 1 output stream(s)


Any idea why so few reads got mapped? what did I do wrong?

Thank you very much,

Iris
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 06:05 AM   #2
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200
Default

Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben
Ben Langmead is offline   Reply With Quote
Old 08-10-2010, 06:30 AM   #3
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by Ben Langmead View Post
Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben
Thanks Ben.
I see. I'll map the two mates separately.

Iris
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 11:33 AM   #4
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

A more useful way to pose this question in the future is to find one example of a read or a pair of reads that you expected would map, but did not.
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Old 08-10-2010, 11:45 AM   #5
Lee Sam
Member
 
Location: Ann Arbor, MI

Join Date: Oct 2008
Posts: 57
Default

Quote:
Originally Posted by Ben Langmead View Post
Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben
If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?
Lee Sam is offline   Reply With Quote
Old 08-10-2010, 11:58 AM   #6
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by Lee Sam View Post
If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?
I am supposed to. That's why I don't use tophat.
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 12:02 PM   #7
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

===mate1=====
reads processed: 130653415
# reads with at least one reported alignment: 47875358 (36.64%)
# reads that failed to align: 82659307 (63.27%)
# reads with alignments suppressed due to -m: 118750 (0.09%)
Reported 47875358 alignments to 1 output stream(s)

====mate2=====
# reads processed: 130653415
# reads with at least one reported alignment: 35773578 (27.38%)
# reads that failed to align: 94798249 (72.56%)
# reads with alignments suppressed due to -m: 81588 (0.06%)
Reported 35773578 alignments to 1 output stream(s)

If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
Any comments are welcome

Iris
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 12:03 PM   #8
Lee Sam
Member
 
Location: Ann Arbor, MI

Join Date: Oct 2008
Posts: 57
Default

Quote:
Originally Posted by IrisZhu View Post
I am supposed to. That's why I don't use tophat.
that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?

Quote:
Originally Posted by IrisZhu View Post
Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

===mate1=====
reads processed: 130653415
# reads with at least one reported alignment: 47875358 (36.64%)
# reads that failed to align: 82659307 (63.27%)
# reads with alignments suppressed due to -m: 118750 (0.09%)
Reported 47875358 alignments to 1 output stream(s)

====mate2=====
# reads processed: 130653415
# reads with at least one reported alignment: 35773578 (27.38%)
# reads that failed to align: 94798249 (72.56%)
# reads with alignments suppressed due to -m: 81588 (0.06%)
Reported 35773578 alignments to 1 output stream(s)

If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
Any comments are welcome

Iris
Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?

Last edited by Lee Sam; 08-10-2010 at 12:06 PM.
Lee Sam is offline   Reply With Quote
Old 08-10-2010, 12:13 PM   #9
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Anyone knows how to make bowtie report unmapped reads?
What's the option pamameter?
Thanks!
Iris
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 12:17 PM   #10
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by Lee Sam View Post
that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?



Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?
I mapped two mates seperately, i.e. in a unpaired manner. So no inset size here.
IrisZhu is offline   Reply With Quote
Old 08-10-2010, 12:52 PM   #11
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

Quote:
Originally Posted by IrisZhu View Post
Anyone knows how to make bowtie report unmapped reads?
What's the option pamameter?
Thanks!
Iris
no but that's easy enough to infer:

#get defline, strip >
grep '>' mate1.fa | sed -e 's/>//' > in.seqs

#get first col with seqnames
cut -f1 output.bowtie > out.seqs

#get symmetric difference
sort in.seqs out.seqs | uniq -u
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Old 08-10-2010, 12:59 PM   #12
babati
Junior Member
 
Location: USA

Join Date: Aug 2010
Posts: 1
Default

IrisZhu,

to make bowtie report the unmapped reads, use the command line option:
--un <output_file.fa>

Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!

Last edited by babati; 08-10-2010 at 01:15 PM.
babati is offline   Reply With Quote
Old 08-10-2010, 01:12 PM   #13
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

Quote:
Originally Posted by babati View Post
IrisZhu,
to make bowtie report the unmapped reads, use the command line option:
--unmapped <output_file.fa>
can you point me to where that feature is described? it is not working for me.
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Old 08-10-2010, 04:17 PM   #14
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by babati View Post
IrisZhu,

to make bowtie report the unmapped reads, use the command line option:
--un <output_file.fa>

Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!
My sample is supposed to be mRNA. I mapped them to the whole genome, ~30% was mapped. If i used bowtie properly , then it must be the problems of sequencing data itself.
And I just tried mapping another batch of data to human refseqs, the results are much more reasonble:
reads processed: 76269225
# reads with at least one reported alignment: 46230181 (60.61%)
# reads that failed to align: 29912348 (39.22%)
# reads with alignments suppressed due to -m: 126696 (0.17%)


BTW, is my command "bowtie -p 10 -m 10 -f hg19 myfile.fa (for unpaired reads)" right?
Thanks!
IrisZhu is offline   Reply With Quote
Old 08-11-2010, 10:46 AM   #15
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by Zigster View Post
no but that's easy enough to infer:

#get defline, strip >
grep '>' mate1.fa | sed -e 's/>//' > in.seqs

#get first col with seqnames
cut -f1 output.bowtie > out.seqs

#get symmetric difference
sort in.seqs out.seqs | uniq -u
Oh yes we can do it this way .... Thank you!
IrisZhu is offline   Reply With Quote
Old 08-11-2010, 12:01 PM   #16
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Hi Iris,

There is another potential issue I can think of which might cause such a low mapping percentage, and that is quite long reads paired with a high error rate.

As you are apparently using fasta sequences a Phred score of 40 is assumed for each base (making it impossible to look for base call error rates). If the sequencing run had a high error rate towards the end of the run you will potentially accumulate too many high quality mismatches which might result in the rejection of alignments. If this is the case you could try and raise the ceiling of cumulative mismatch scores (default 70) to 150 or so (-e 150). Alternatively you could trim the read length down to a value which should not have a high error rate, e.g. trim 100bp reads down to 50 or 60 bases which should be plenty to allow unique mapping.

Good luck!
fkrueger is offline   Reply With Quote
Old 08-11-2010, 12:14 PM   #17
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by fkrueger View Post
Hi Iris,

There is another potential issue I can think of which might cause such a low mapping percentage, and that is quite long reads paired with a high error rate.

As you are apparently using fasta sequences a Phred score of 40 is assumed for each base (making it impossible to look for base call error rates). If the sequencing run had a high error rate towards the end of the run you will potentially accumulate too many high quality mismatches which might result in the rejection of alignments. If this is the case you could try and raise the ceiling of cumulative mismatch scores (default 70) to 150 or so (-e 150). Alternatively you could trim the read length down to a value which should not have a high error rate, e.g. trim 100bp reads down to 50 or 60 bases which should be plenty to allow unique mapping.

Good luck!
Thanks for your reply.
Is what you suggested (raise the mismatch score) equivalent to increasing the # of mismatches? Now the default is 2 (allowing 2 mismatch bases), so I can increase it to, say, 5? But either way will compromise the accuracy, right? --- more unqualified reads will get mapped.
IrisZhu is offline   Reply With Quote
Old 08-11-2010, 12:15 PM   #18
Lee Sam
Member
 
Location: Ann Arbor, MI

Join Date: Oct 2008
Posts: 57
Default

Quote:
Originally Posted by fkrueger View Post
Hi Iris,

There is another potential issue I can think of which might cause such a low mapping percentage, and that is quite long reads paired with a high error rate.

As you are apparently using fasta sequences a Phred score of 40 is assumed for each base (making it impossible to look for base call error rates). If the sequencing run had a high error rate towards the end of the run you will potentially accumulate too many high quality mismatches which might result in the rejection of alignments. If this is the case you could try and raise the ceiling of cumulative mismatch scores (default 70) to 150 or so (-e 150). Alternatively you could trim the read length down to a value which should not have a high error rate, e.g. trim 100bp reads down to 50 or 60 bases which should be plenty to allow unique mapping.

Good luck!
A high error rate doing bowtie alignments shouldn't make a difference using the parameters she specifies since a seed region from the 5' end of the read is used to do the alignment (if memory serves me correctly).

The bowtie manual (http://bowtie-bio.sourceforge.net/manual.shtml) says:
Quote:
-l/--seedlen <int>
The "seed length"; i.e., the number of bases on the high-quality end of the read to which the -n ceiling applies. The lowest permitted setting is 5 and the default is 28. bowtie is faster for larger values of -l.
If she was using the end-to-end alignment mode (which at least used to be an option in the past) errors in the reads may be an issue causing this, but the default is the seed-and-extend method. It shouldn't be that hard to find unique 28-mers...

Last edited by Lee Sam; 08-11-2010 at 12:28 PM.
Lee Sam is offline   Reply With Quote
Old 08-11-2010, 12:24 PM   #19
IrisZhu
Member
 
Location: Maryland

Join Date: Jul 2010
Posts: 25
Default

Quote:
Originally Posted by Lee Sam View Post
A high error rate doing bowtie alignments shouldn't make a difference using the parameters she specifies since a seed region from the 5' end of the read is used to do the alignment (if memory serves me correctly).

The bowtie manual (http://bowtie-bio.sourceforge.net/manual.shtml) says:


If she was using the end-to-end alignment mode (which at lest used to be an option) errors in the reads may be an issue causing this, but the default is the seed-and-extend method. It shouldn't be that hard to find unique 28-mers...
Oh I see i got it wrong. The 2 mismatches are for the seed length (28 bp from the high-quality end). But does it mean that it allows lots of mismatches in the other end?
Thank you so much guys. I learned a lot!
IrisZhu is offline   Reply With Quote
Old 08-11-2010, 12:54 PM   #20
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

I wasn't talking about an error rate during the bowtie alignment, but the error rate of the sequencing run. We have seen quite a few runs where the error rate increased drastically towards the end of long runs (such as 75 bp). If actualy errors from the sequencing run get transformed into fastA files, a Phred score of 40, and therefore a very reliable (but wrong) basecall is assumed for that base.

I was talking about the -e ceiling (not the -n ceiling which applies for the seed length, which can be anything between 0 and 3), which is defined as follows:

-e/--maqerr <int>
Maximum permitted total of quality values at all mismatched read positions throughout the entire alignment, not just in the "seed". The default is 70. Like Maq, bowtie rounds quality values to the nearest 10 and saturates at 30; rounding can be disabled with --nomaqround.

This means that each mismatch in a fastA file counts as 30, i.e. 3 mismatches (even if it is quite close to the 3' end of the sequence) willexceed the default value of 70 and cause the sequence to be removed (and scored as not aligned).
fkrueger is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:04 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO