SEQanswers

05-29-2012, 11:27 PM   #1
Bastian
Junior Member
Location: Oslo
Join Date: Apr 2012
Posts: 6

Summarizing sequences of unequal length

Hi,

I have a data set containing millions of sequences, ranging from 50 to 600 bp in length.

Many of the sequences are redundant in the sense that they are fragments of longer sequences in the same set.

For example:
>12124334
ABCDEFGHIJKLMNOPQRSTUVXYZ
>121
ABCD
>2343456
ABCDEFGHIJKLMNOPQRSTUV
>23123443556
CDEFGHIJKLMNOPQRSTUV

I am looking for a way to check (with BLAST?) whether each sequence is a fragment of another one (perfect, full-length hits only) and to remove those fragments. In the example above, only >12124334 would be kept.

Thanks a lot!
05-30-2012, 04:18 AM   #2
Bastian
Junior Member
Location: Oslo
Join Date: Apr 2012
Posts: 6

Or could I just use grep?
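
A grep-style exact-containment check could be sketched in Python roughly like this. It is only an illustration, not a tested pipeline: the function names `parse_fasta` and `remove_contained` and the seed length `k` are made up for this example, it compares sequences on the given strand only (no reverse complements), and it holds everything in memory, which may not be practical for millions of reads.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_contained(records, k=4):
    """Drop every sequence that is an exact substring of a longer (or equal) kept one.

    k is a hypothetical seed length: each kept sequence is indexed by its
    k-mers, so a candidate is only compared against kept sequences that
    share its first k-mer instead of against all of them.
    """
    kept = []
    index = defaultdict(set)  # k-mer -> positions in `kept`
    # Longest first, so any possible container is already in `kept`.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if len(seq) >= k:
            candidates = index.get(seq[:k], ())
        else:
            # Too short to seed: fall back to scanning every kept sequence.
            candidates = range(len(kept))
        if any(seq in kept[i][1] for i in candidates):
            continue  # exact fragment (or duplicate) of a kept sequence
        kept.append((header, seq))
        pos = len(kept) - 1
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].add(pos)
    return kept


example = """>12124334
ABCDEFGHIJKLMNOPQRSTUVXYZ
>121
ABCD
>2343456
ABCDEFGHIJKLMNOPQRSTUV
>23123443556
CDEFGHIJKLMNOPQRSTUV
"""
print(remove_contained(parse_fasta(example)))
# [('12124334', 'ABCDEFGHIJKLMNOPQRSTUVXYZ')]
```

For real reads, `k` could be raised up to the shortest read length (50 here) to make the index more selective. A plain grep of every sequence against every other one does the same comparison without the index, so its running time grows quadratically with the number of sequences.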
05-30-2012, 06:51 AM   #3
kmcarr
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,165

CD-HIT will do what you want. It is designed to take a large number of input sequences and cluster them to produce a non-redundant set of the longest sequences (i.e. it eliminates duplicates and sub-sequences). You can adjust the identity threshold required to cluster sequences.
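
As a rough sketch of how such a run might be launched from Python: this assumes nucleotide data, so it calls the `cd-hit-est` program from the CD-HIT package, and it sets the identity threshold `-c` to 1.0 so that only perfect matches are clustered. The helper names `build_cdhit_command` and `run_cdhit` and the file names are hypothetical, and the options are worth checking against the CD-HIT user guide for the installed version.

```python
import shutil
import subprocess


def build_cdhit_command(fasta_in, fasta_out, identity=1.0, program="cd-hit-est"):
    """Build the argument list for a CD-HIT run (nothing is executed here).

    -i is the input FASTA, -o the output FASTA of cluster representatives,
    -c the sequence identity threshold (1.0 = perfect matches only).
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [program, "-i", fasta_in, "-o", fasta_out, "-c", str(identity)]


def run_cdhit(fasta_in, fasta_out, identity=1.0):
    """Run CD-HIT if it is installed and return the output FASTA path."""
    cmd = build_cdhit_command(fasta_in, fasta_out, identity)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    subprocess.run(cmd, check=True)
    return fasta_out


# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(build_cdhit_command("reads.fasta", "reads_nr.fasta")))
# cd-hit-est -i reads.fasta -o reads_nr.fasta -c 1.0
```

The output FASTA should then contain one representative (the longest sequence) per cluster, and CD-HIT also writes a `.clstr` file next to it listing which input sequences were absorbed into each cluster.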
05-30-2012, 07:18 AM   #4
Bastian
Junior Member
Location: Oslo
Join Date: Apr 2012
Posts: 6

Cheers mate, this does the job! Thanks a lot!
All times are GMT -8.

