SEQanswers

Old 03-09-2010, 08:28 AM   #1
k-gun12
Member
 
Location: NJ

Join Date: Feb 2010
Posts: 55
Filtering SOLiD reads

I've got 120 million 50 bp SOLiD reads from a eukaryote, and I'd like to remove anything plastid-related. I've got the assembled plastid genome, but I need to do the matching in color space, correct? Normally I'd just do this with BLAST. Is there a tool in Corona that will do this?

Thanks!
Old 03-09-2010, 02:14 PM   #2
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

1. Align your reads against the plastid genome (I personally like bwa and bfast).
2. Once you have the alignments, it is trivial to separate reads that come
from one organism or the other.

If you want to go the ABI way, use Bioscope instead of Corona.
__________________
-drd
Old 03-10-2010, 12:27 AM   #3
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

Quote:
Originally Posted by drio View Post
2. Once you have the alignments, it is trivial to separate reads that come
from one organism or the other.
Hmm, I beg to differ that it's trivial to separate the reads.
Getting the IDs of the reads that map to the two different organisms is simple,

but working with that many reads isn't.
You will have to use disk-based hash tables or load the sequences into MySQL to sort and extract the reads effectively.
Old 03-10-2010, 05:29 AM   #4
drio
Senior Member
 
Location: 41°17'49"N / 2°4'42"E

Join Date: Oct 2008
Posts: 323
Default

Quote:
Originally Posted by KevinLam View Post
Hmm, I beg to differ that it's trivial to separate the reads.
Getting the IDs of the reads that map to the two different organisms is simple,

but working with that many reads isn't.
You will have to use disk-based hash tables or load the sequences into MySQL to sort and extract the reads effectively.
Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.
__________________
-drd
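A minimal Python sketch of that sorted walk, for anyone who wants something concrete. It is not drio's code. It assumes the csfasta headers look like `>panel_x_y_F3`, that the plastid hit list holds one read ID per line, and that both inputs are in the same numeric order; `read_key`, `csfasta_records` and `drop_listed_reads` are made-up names.

```python
import io
import re


def read_key(header):
    """Turn '>12_34_567_F3' (assumed ID layout) into (12, 34, 567)."""
    panel, x, y = re.match(r">?(\d+)_(\d+)_(\d+)", header).groups()
    return int(panel), int(x), int(y)


def csfasta_records(handle):
    """Yield (header, sequence) pairs, skipping blank and '#' comment lines."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            header = line
        elif header is not None:
            yield header, line
            header = None


def drop_listed_reads(csfasta, hit_ids, out):
    """Write every read whose ID is NOT in hit_ids; return how many were kept.

    Both inputs must already be sorted by read_key. Only one read and one
    hit ID are held in memory at a time.
    """
    hits = (read_key(line) for line in hit_ids if line.strip())
    hit = next(hits, None)
    kept = 0
    for header, seq in csfasta_records(csfasta):
        key = read_key(header)
        # Advance the hit list until it catches up with the current read.
        while hit is not None and hit < key:
            hit = next(hits, None)
        if hit == key:
            continue  # read mapped to the plastid, so drop it
        out.write(f"{header}\n{seq}\n")
        kept += 1
    return kept


if __name__ == "__main__":
    reads = io.StringIO(
        "# comment\n>1_5_10_F3\nT0123\n>1_5_200_F3\nT3210\n>2_1_1_F3\nT0000\n"
    )
    plastid_hits = io.StringIO("1_5_200_F3\n")
    kept_reads = io.StringIO()
    print(drop_listed_reads(reads, plastid_hits, kept_reads))
    print(kept_reads.getvalue(), end="")
```

With real data the `StringIO` objects would be replaced by open file handles; the walk itself reads each file once.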
Old 03-10-2010, 11:20 PM   #5
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

Quote:
Originally Posted by drio View Post
Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.
I would love to look at your code if you got it working the way you describe.
As for me:

I needed to extract 40 million IDs from a 70-million-read csfasta.
Looping through the csfasta is simple,
but I ran into memory issues when I stored 40 million IDs in a normal hash.
So I split the IDs into batches of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

My next implementation will use a disk-based hash so that I need to loop through the csfasta only once.

So if you got it working the way you said, I would really love to see where I went wrong.
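For reference, a rough Python sketch of the disk-based lookup idea. It swaps in the standard-library `sqlite3` module as the on-disk table rather than a dedicated hash library or MySQL, and the names `build_id_index`, `is_wanted` and `extract_reads` are invented for the example. It also assumes each csfasta header is `>` followed by the read ID exactly as it appears in the ID list.

```python
import os
import sqlite3
import tempfile


def build_id_index(db_path, ids, batch_size=100_000):
    """Store read IDs in an on-disk SQLite table keyed by read ID."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS wanted (read_id TEXT PRIMARY KEY)")
    batch = []
    for read_id in ids:
        batch.append((read_id,))
        if len(batch) >= batch_size:
            conn.executemany("INSERT OR IGNORE INTO wanted VALUES (?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT OR IGNORE INTO wanted VALUES (?)", batch)
    conn.commit()
    return conn


def is_wanted(conn, read_id):
    """Look up one ID through the primary-key index."""
    row = conn.execute(
        "SELECT 1 FROM wanted WHERE read_id = ?", (read_id,)
    ).fetchone()
    return row is not None


def extract_reads(conn, csfasta_lines):
    """Single pass over csfasta lines; yield (header, sequence) for wanted IDs."""
    header = None
    for line in csfasta_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            header = line
        elif header is not None:
            if is_wanted(conn, header[1:]):
                yield header, line
            header = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = build_id_index(
            os.path.join(tmp, "ids.sqlite"), ["1_5_200_F3", "2_1_1_F3"]
        )
        reads = [">1_5_10_F3", "T0123", ">1_5_200_F3", "T3210"]
        print(list(extract_reads(conn, reads)))
        conn.close()
```

The trade-off is one indexed lookup per read instead of many passes over the csfasta; whether that is faster in practice depends on the disk and was not measured here.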
Old 03-10-2010, 11:43 PM   #6
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by KevinLam View Post
I would love to look at your code if you got it working the way you describe.
As for me:

I needed to extract 40 million IDs from a 70-million-read csfasta.
Looping through the csfasta is simple,
but I ran into memory issues when I stored 40 million IDs in a normal hash.
So I split the IDs into batches of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

My next implementation will use a disk-based hash so that I need to loop through the csfasta only once.

So if you got it working the way you said, I would really love to see where I went wrong.
If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.
Old 03-11-2010, 12:02 AM   #7
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 203
Default

Quote:
Originally Posted by nilshomer View Post
If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.
I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
I am actually not sure whether the reads are already sorted when they come off the machine.
Old 03-11-2010, 12:16 AM   #8
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by KevinLam View Post
I didn't actually try to sort the csfasta by read name. I just assumed that task was doomed to fail (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
I am actually not sure whether the reads are already sorted when they come off the machine.
They are sorted coming off the machine, so no need to resort.
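If you want to confirm the order on your own file before relying on it, a single-pass check is cheap. The sketch below assumes headers of the form `>panel_x_y_F3` and treats "sorted" as numeric order on (panel, x, y); that ordering and the function name `first_out_of_order` are assumptions made for the example, not something taken from the instrument documentation.

```python
import re

# Assumed header layout: '>' then panel, x and y as integers, then the tag.
ID_PATTERN = re.compile(r">(\d+)_(\d+)_(\d+)")


def first_out_of_order(lines):
    """Return the first header that sorts before its predecessor, else None."""
    previous = None
    for line in lines:
        match = ID_PATTERN.match(line)
        if match is None:
            continue  # comment or sequence line
        key = tuple(int(part) for part in match.groups())
        if previous is not None and key < previous:
            return line.rstrip("\n")
        previous = key
    return None


if __name__ == "__main__":
    ordered = [">1_9_1_F3", "T0123", ">1_10_1_F3", "T3210"]
    shuffled = [">1_10_1_F3", "T3210", ">1_9_1_F3", "T0123"]
    print(first_out_of_order(ordered))
    print(first_out_of_order(shuffled))
```

One thing to watch: numeric order and plain text order disagree (as text, `1_10_1` sorts before `1_9_1`), so an ID list sorted as plain text would not line up with a numerically ordered csfasta.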
Old 03-12-2010, 09:51 PM   #9
sci_guy
Member
 
Location: Sydney, Australia

Join Date: Jan 2008
Posts: 83
Default

I agree with drio. It's an old, classic computer science problem. Google "Intersection of sorted lists". If your lists aren't sorted, then use GNU sort beforehand. You only need to write a shell script; there is no requirement for huge hashes in RAM.
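For anyone without GNU sort to hand, the same sort-with-bounded-memory idea can be sketched in Python: sort fixed-size chunks, spill each to a temporary file, then merge the sorted runs with `heapq.merge`. This is only an illustration, not a replacement for GNU sort. The `sort_key` function assumes IDs of the form `panel_x_y_F3`, and `external_sort` and `chunk_size` are names chosen for the example.

```python
import heapq
import itertools
import tempfile


def sort_key(read_id):
    """Numeric key for an ID assumed to look like 'panel_x_y_F3'."""
    panel, x, y = read_id.split("_")[:3]
    return int(panel), int(x), int(y)


def external_sort(ids, chunk_size=1_000_000):
    """Yield IDs in sorted order, sorting at most chunk_size IDs at a time."""
    runs = []
    iterator = iter(ids)
    try:
        while True:
            chunk = sorted(itertools.islice(iterator, chunk_size), key=sort_key)
            if not chunk:
                break
            # Spill the sorted chunk to disk and rewind it for the merge step.
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(read_id + "\n" for read_id in chunk)
            run.seek(0)
            runs.append(run)
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        # heapq.merge reads the runs lazily, one line per run at a time.
        yield from heapq.merge(*streams, key=sort_key)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    unsorted_ids = ["1_10_1_F3", "1_9_1_F3", "2_1_1_F3", "1_9_0_F3"]
    print(list(external_sort(unsorted_ids, chunk_size=2)))
```

Once both lists are in the same order, the intersection (or difference) is a single linear walk over the two files.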