SEQanswers

Old 11-07-2009, 12:02 PM   #1
swarbre
Member
 
Location: CA

Join Date: Sep 2009
Posts: 11
programs for filtering low complexity

This question was asked elsewhere, but I have the same question (quoted below). Any help would be appreciated.

> 1) I have many "low-complexity" reads. Some are simply polyA, polyC,
> etc., but others are runs of "ATATAT" or "CACACACA", etc. Previously
> I would have used "dust" on the command line to filter out this kind of
> read in a FASTA file. Any ideas on how to achieve similar functionality
> in the ShortRead world?
Old 12-14-2009, 06:19 AM   #2
olus
Member
 
Location: milan, italy

Join Date: Aug 2008
Posts: 22

Look at:

https://stat.ethz.ch/pipermail/bioc-...ch/000191.html

http://www.mail-archive.com/bioc-sig.../msg00148.html

Though not a ShortRead package:
http://genome.gsc.riken.jp/osc/engli...rc/tagdust.tgz

Cheers
Old 02-04-2012, 09:46 PM   #3
mdaskal
Junior Member
 
Location: Israel

Join Date: Feb 2012
Posts: 2

Hi,

I am looking for a definition of low complexity reads for reads of variable lengths (about 100 nucleotides long).

Right now, I am using the following definition:
- Divide the read into subsegments of 32 nucleotides (the last subsegment overlaps the one before it).
- Count the number of unique tri-nucleotides in each segment.
- If the number of unique tri-nucleotides is smaller than 5, the segment is of "low complexity".
- If there is at least one "low complexity" segment, the read is considered "low complexity".
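
The four steps above can be sketched in Python roughly as follows. This is only an illustration of the definition as stated, not the poster's actual code: the function names are hypothetical, and treating a read shorter than 32 nucleotides as a single segment is an assumption the post does not cover.

```python
SEGMENT_LEN = 32  # subsegment length from the definition above
KMER_LEN = 3      # tri-nucleotides
MIN_UNIQUE = 5    # a segment with fewer unique tri-nucleotides is "low complexity"


def segments(read, seg_len=SEGMENT_LEN):
    """Split a read into consecutive seg_len windows.

    If the read length is not a multiple of seg_len, the last window is
    taken from the end of the read, so it overlaps the one before it.
    Assumption: a read shorter than seg_len is one segment.
    """
    if len(read) <= seg_len:
        return [read]
    starts = list(range(0, len(read) - seg_len + 1, seg_len))
    if starts[-1] + seg_len < len(read):
        starts.append(len(read) - seg_len)
    return [read[s:s + seg_len] for s in starts]


def unique_kmers(segment, k=KMER_LEN):
    """Count the distinct k-mers in a segment (overlapping k-mers, step 1)."""
    return len({segment[i:i + k] for i in range(len(segment) - k + 1)})


def is_low_complexity(read, min_unique=MIN_UNIQUE):
    """A read is low complexity if any of its segments is low complexity."""
    return any(unique_kmers(seg) < min_unique for seg in segments(read.upper()))
```

For example, `is_low_complexity("AT" * 50)` is true under this sketch, because every 32-nucleotide segment contains only the two tri-nucleotides ATA and TAT.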

Comments on the relevance of this definition would be appreciated.

Regards, Michael.
Old 02-05-2012, 05:05 AM   #4
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 266

Quote:
Originally Posted by mdaskal View Post
Hi,

I am looking for a definition of low complexity reads for reads of variable lengths (about 100 nucleotides long).

Right now, I am using the following definition:
- Divide the read into subsegments of 32 nucleotides (the last subsegment overlaps the one before it).
- Count the number of unique tri-nucleotides in each segment.
- If the number of unique tri-nucleotides is smaller than 5, the segment is of "low complexity".
- If there is at least one "low complexity" segment, the read is considered "low complexity".

Comments on the relevance of this definition would be appreciated.

Regards, Michael.

Why 5?

How does your operational definition compare to SEG / Dust?
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
Old 02-05-2012, 05:46 AM   #5
mdaskal
Junior Member
 
Location: Israel

Join Date: Feb 2012
Posts: 2
Low complexity reads

Hi Dan,

I am not familiar with the definition of SEG / Dust.
Where can I find some details about it?

Would you suggest a limit other than 5 unique tri-nucleotides (higher? lower?)?

Regards, Michael.

Last edited by mdaskal; 02-05-2012 at 06:07 AM.
Old 02-05-2012, 07:04 AM   #6
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 266

I guess they are both in PubMed? The Internet is slow here at the moment, or else I'd link...

I wouldn't suggest an alternative without data, so my question really was: did you analyse / benchmark your metric? That is, what fraction of reads is flagged as low complexity at thresholds of 3, 4, 5, 6, etc.?
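
A benchmark of that kind could be sketched in Python as below. Everything here is illustrative: the function names are hypothetical, the toy reads are made up (not data from this thread), and the segmenting rule assumes the 32-nucleotide, tri-nucleotide definition from post #3, with a short read treated as a single segment.

```python
def min_unique_trinucleotides(read, seg_len=32, k=3):
    """Smallest number of distinct k-mers found in any seg_len window of the read.

    Windows are consecutive; if the read length is not a multiple of seg_len,
    a final window is taken from the end of the read (so it overlaps the
    previous one). Assumption: a shorter read is treated as one window.
    """
    read = read.upper()
    if len(read) <= seg_len:
        starts = [0]
    else:
        starts = list(range(0, len(read) - seg_len + 1, seg_len))
        if starts[-1] + seg_len < len(read):
            starts.append(len(read) - seg_len)
    counts = []
    for s in starts:
        seg = read[s:s + seg_len]
        counts.append(len({seg[i:i + k] for i in range(len(seg) - k + 1)}))
    return min(counts)


def flagged_fraction_by_threshold(reads, thresholds=(3, 4, 5, 6)):
    """Fraction of reads flagged when 'fewer than t unique k-mers' means low complexity.

    Assumes reads is a non-empty list of sequence strings.
    """
    scores = [min_unique_trinucleotides(r) for r in reads]
    return {t: sum(s < t for s in scores) / len(scores) for t in thresholds}


# Made-up toy reads, only to show the shape of the output.
toy_reads = [
    "A" * 100,                   # homopolymer
    "AT" * 50,                   # dinucleotide repeat
    "ACG" * 34,                  # trinucleotide repeat
    "ACGT" * 25,                 # tetranucleotide repeat
    "ACGTAGCTTGCAATCGGATC" * 5,  # more varied sequence
]
fractions = flagged_fraction_by_threshold(toy_reads)
```

Running this over a real read set and looking at how the flagged fraction changes between thresholds would show whether 5 sits at a natural break or is an arbitrary cut-off.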
All times are GMT -8.