SEQanswers

08-16-2010, 01:29 PM   #1
days369
Junior Member
Location: US
Join Date: Aug 2009
Posts: 8

Do I need to trim the sequences like this?

When I checked my Solexa sequencing reads, I found that some of them look like this:


NNNNNNNNAGGNNNNNGGAGNGNNGNNNCAGNGNTGNNNNNNNNNNNNNANNNNNNGNNNNNNNTGGNGGNNNNNNNN
+
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


First, there are runs of N (poly-N) in the middle of the sequence as well as at the end.
Second, all of the base calls are low quality (I guess that % is the lowest quality score in this format, right?).
Third, in some other cases, I can see poly-A at the end of a sequence.

How should I deal with reads that have these features? Should I just get rid of them, or do some trimming? If trimming is recommended in some cases, what software is suitable for Solexa reads?
08-16-2010, 02:03 PM   #2
mrawlins
Member
Location: Retirement - Not working with bioinformatics anymore.
Join Date: Apr 2010
Posts: 63

I count about 21 bases that are not N in that sequence, so you may not have enough bases for a unique match to your genome (this depends on the genome size). On the Phred+33 encoding, '%' (ASCII 37) is the fifth-lowest score possible, which corresponds to a quality of 4, because the lowest character, '!', encodes 0. It cannot be a valid score on an offset-64 encoding, where it would decode to a large negative number.
Personally, I would throw this read out, because most of the bases aren't called and none of them have reasonable scores.
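
As a quick check of the score arithmetic, here is a minimal Python sketch, assuming the file uses the Phred+33 (Sanger) encoding. The function names are made up for illustration and do not come from any particular library. It decodes a quality character, converts the score to an error probability, and counts the bases that were actually called.

```python
def phred33_quality(char: str) -> int:
    """Decode one FASTQ quality character, ASSUMING Phred+33 encoding."""
    return ord(char) - 33


def error_probability(q: int) -> float:
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def count_called_bases(seq: str) -> int:
    """Count bases that are not N (case-insensitive)."""
    return sum(1 for base in seq.upper() if base != "N")


q = phred33_quality("%")          # ord('%') is 37, so q is 37 - 33
print(q, error_probability(q))
print(count_called_bases("NNACGNnT"))
```

Under that assumption '%' decodes to 4, which means roughly a 40% chance that each call is wrong.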
08-16-2010, 02:29 PM   #3
days369
Junior Member
Location: US
Join Date: Aug 2009
Posts: 8

Hi mrawlins,

Thanks for answering. Do you know what quality score threshold people commonly use to trim sequences?

08-16-2010, 03:12 PM   #4
mrawlins
Member
Location: Retirement - Not working with bioinformatics anymore.
Join Date: Apr 2010
Posts: 63

I don't know what scores people use to trim or reject reads. We use SOLiD machines, so base calling is done differently than on Solexa, and the scores are different. For one thing, we never see N's. I would probably throw out any read that didn't have at least 20 contiguous base calls and 25 base calls total (though I might require at least 25 contiguous base calls to be safe). That makes it unlikely for a read to match the genome by random chance, so if the low-quality reads are mis-called they will likely not map to the genome.
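
The rule of thumb above (at least 20 contiguous calls and at least 25 calls in total) could be sketched in Python roughly as follows. The function name and default thresholds are only an illustration of that rule, not an established tool, and the sketch assumes that N is the only no-call symbol in the reads.

```python
import re


def passes_call_filter(seq: str, min_contiguous: int = 20, min_total: int = 25) -> bool:
    """Return True if the read has enough called (non-N) bases.

    Assumes 'N' (or 'n') is the only symbol used for an uncalled base.
    """
    # Each match is one uninterrupted run of called bases.
    runs = re.findall(r"[^Nn]+", seq)
    total_called = sum(len(run) for run in runs)
    longest_run = max((len(run) for run in runs), default=0)
    return longest_run >= min_contiguous and total_called >= min_total


reads = ["A" * 20 + "N" + "C" * 5, "A" * 19 + "N" + "C" * 19, "NNNN"]
kept = [r for r in reads if passes_call_filter(r)]
print(len(kept), "of", len(reads), "reads kept")
```

Raising `min_contiguous` to 25 gives the stricter variant mentioned in the post.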
08-16-2010, 09:19 PM   #5
Torst
Senior Member
Location: The University of Melbourne, AUSTRALIA
Join Date: Apr 2008
Posts: 275

Quote:
Originally Posted by mrawlins
I would probably throw out any read that didn't have at least 20 contiguous base calls and 25 base calls total (though I might require at least 25 contiguous base calls to be safe). That makes it unlikely for a read to match the genome by random chance, so if the low-quality reads are mis-called they will likely not map to the genome.
This is reasonable, BUT you have to make sure your software can actually handle ambiguous/unknown bases like 'N'. For example, some fast read aligners will NOT align a read if it contains an 'N', and some assembly software ignores N's or converts them to 'A'.

After trimming from the 3' end, we throw away all reads that still contain any N. This usually rejects only about 1% to 5% of the total.
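
A rough Python sketch of that two-step policy is below. The post does not say how the 3' trimming is done, so stripping trailing N's is used here purely as a stand-in assumption, and the function names are hypothetical.

```python
from typing import Optional, Tuple


def trim_3prime_n(seq: str, qual: str) -> Tuple[str, str]:
    """Strip trailing N's from the 3' end and cut the quality string to match.

    ASSUMPTION: the real trimming step may be quality-based instead.
    """
    trimmed = seq.rstrip("Nn")
    return trimmed, qual[: len(trimmed)]


def keep_read(seq: str, qual: str) -> Optional[Tuple[str, str]]:
    """Return the trimmed read, or None if it is empty or still contains an N."""
    trimmed_seq, trimmed_qual = trim_3prime_n(seq, qual)
    if not trimmed_seq or "N" in trimmed_seq.upper():
        return None
    return trimmed_seq, trimmed_qual


print(keep_read("ACGTNN", "IIII%%"))    # trailing N's removed, read kept
print(keep_read("ACNGTNN", "II%II%%"))  # internal N remains, read rejected
```

Rejecting every read with an internal N is stricter than the contiguous-call rule quoted above, but it avoids depending on how a given aligner or assembler treats N.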