SEQanswers

Old 10-11-2012, 07:10 AM   #1
Mona
Member
 
Location: Uppsala

Join Date: Feb 2010
Posts: 27
Filtering Illumina data to reduce file size

Hello,
I have paired-end data from an Illumina HiSeq run for a bacterial genome, sequenced with three different insert sizes: 160, 305 and 505 bp. My task is to perform de novo assembly of the genome, but every file contains more than 60 million reads, and it is not possible to run an assembly on files this large. Is there any way to reduce the size of the files, either by removing some reads or by performing some kind of filtering?
Old 10-11-2012, 07:20 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Look for Titus Brown's 'diginorm' program. It does an intelligent reduction of data. Seems to work well for genomic data. Perhaps not so well for transcriptome data.
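For readers unfamiliar with the idea behind digital normalization, here is a minimal Python sketch. It is not the diginorm code itself: it assumes exact k-mer counting in an in-memory dictionary, ignores reverse complements and read pairing, and the function names and default values (`k=20`, `cutoff=20`) are chosen only for illustration. A read is kept only while the median abundance of its k-mers, among reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Counts are updated only from reads that are kept, so regions that are
    already covered about `cutoff` times stop accepting new reads.
    """
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k; skipped in this sketch
        if median(counts[x] for x in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept
```

Unlike random subsampling, this discards reads mostly from regions that are already deeply covered, while reads from low-coverage regions keep passing the filter. A real run on 60 million reads would need a memory-efficient counter rather than a plain dictionary.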
Old 10-11-2012, 07:25 AM   #3
Mona
Member
 
Location: Uppsala

Join Date: Feb 2010
Posts: 27

Thanks for the suggestion. I will try that and come back if I run into further problems.
Old 10-11-2012, 08:16 AM   #4
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356

Subsample reads from your files using Heng Li's seqtk program (https://github.com/lh3/seqtk) and the "sample" command.
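With paired-end data the two files have to be subsampled consistently, otherwise mates get separated (with seqtk this is usually handled by giving both files the same random seed). As a rough illustration of the same idea, the Python sketch below reads both files in lockstep and reservoir-samples whole pairs. It is not how seqtk is implemented; the function names, the default seed and the assumption of plain 4-line FASTQ records with both files in the same read order are all mine.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def sample_pairs(handle1, handle2, n, seed=100):
    """Reservoir-sample n read pairs so that mates always stay together.

    Both handles are read in lockstep, so only the n sampled pairs are
    held in memory, never the whole files.
    """
    rng = random.Random(seed)
    reservoir = []
    pairs = zip(read_fastq(handle1), read_fastq(handle2))
    for i, pair in enumerate(pairs):
        if i < n:
            reservoir.append(pair)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < n:
                reservoir[j] = pair     # replace with probability n / (i + 1)
    return reservoir
```

The handles would typically come from `open("reads_1.fastq")` and `open("reads_2.fastq")` (hypothetical file names). The sampled pairs do not come back in input order, which most assemblers do not care about.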
Old 10-11-2012, 09:44 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Quote:
Originally Posted by nickloman View Post
Subsample reads from your files using Heng Li's seqtk program (https://github.com/lh3/seqtk) and the "sample" command.
If you are just going to throw away reads at random, then you might as well go the cheap route and not do as much sequencing in the first place. No disrespect to Li's program, but since diginorm provides an intelligent reduction of reads, I suggest using it instead of random selection.
Old 10-11-2012, 05:19 PM   #6
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

At the Boston Illumina User's Group meeting today, Illumina mentioned that BaseSpace will have an option for "quality-binning" -- by reducing quality scores to a small number of bins, the data compresses quite a bit (they claimed 50% reduction in compressed FASTQ size). An underlying assumption is that quality scores offer more gradation than programs really find useful.

Pretty trivial to implement in Perl, though I leave that as an exercise for the student :-)
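For anyone who wants a starting point, here is a small sketch of quality binning in Python rather than Perl. The bin boundaries and representative scores in `BINS` are made up for illustration and are not Illumina's scheme, and the code assumes Phred+33 encoded quality strings.

```python
# Hypothetical bins: (lowest Phred score in the bin, representative score).
# Must be sorted by the lowest score, ascending.
BINS = [(0, 2), (10, 15), (20, 25), (30, 35), (40, 40)]


def bin_score(q):
    """Map a Phred score to the representative score of its bin."""
    rep = BINS[0][1]
    for low, value in BINS:
        if q >= low:
            rep = value
    return rep


def build_table(offset=33):
    """Build a character translation table for every printable quality value."""
    mapping = {chr(q + offset): chr(bin_score(q) + offset) for q in range(94)}
    return str.maketrans(mapping)


TABLE = build_table()


def bin_quality_line(qual):
    """Replace each quality character with its bin's representative."""
    return qual.translate(TABLE)
```

Applied to the fourth line of each FASTQ record, `bin_quality_line("!+5?@I")` gives `"#0:DDI"`. The line length is unchanged, so the saving only shows up after compression (gzip and similar), because the quality strings now use far fewer distinct symbols.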