SEQanswers

07-13-2017, 02:47 AM   #1
cyanoevo
Member
 
Location: Bristol

Join Date: Jan 2015
Posts: 16
Independent assemblies from NextSeq FASTQs

Haven't found a thread answering this question yet, apologies if it already exists somewhere.

I'm assembling some very large metagenomes in SPAdes from NextSeq data. I understand that the four FASTQ files, one from each of the flowcell lanes, are typically concatenated to make a single file. However, SPAdes is running out of memory on my server mid-assembly.

My question is this: is there any technical reason for concatenating the FASTQs prior to analysis, rather than doing four assemblies and merging the scaffolds later? Doing the latter would save me memory, but I don't want to do it if it's bad form.

Still learning this stuff so any pointers welcome...

Cheers,
Nathan
07-13-2017, 05:21 AM   #2
Markiyan
Member
 
Location: Cambridge

Join Date: Sep 2010
Posts: 88
Try assembling less data first... Use MiSeq 2x250 or 2x300...

First I would try assembling less data and see which species are the most abundant in the datasets... Then filter those out and repeat with more data...

I would also use 4-channel Illumina sequencing in 2x250 bp mode (MiSeq or HiSeq 2500), which has 3-4 times fewer raw read errors than 2-channel NextSeq.

The amount of RAM/CPU used by most de novo assemblers can grow exponentially with increasing raw read error rates... Also, high-coverage noisy data is much more resource-demanding than low-coverage, good-quality data.
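To make the link between read errors and assembler workload concrete, here is a toy simulation. It is only a sketch under stated assumptions: a random 5 kbp "genome", uniformly sampled 100 bp reads, uniform substitution errors, and k = 21 are all illustrative choices, and the function names are hypothetical. It does not model a real NextSeq error profile or any assembler's memory use; it only counts distinct k-mers, which is roughly what a de Bruijn graph has to store. A single substitution can introduce up to k k-mers that are absent from the genome.

```python
import random

def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly; substitute each base with probability error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads

def distinct_kmers(reads, k):
    """Count distinct k-mers across all reads."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return len(seen)

def compare_error_rates(error_rates=(0.0, 0.005, 0.02), k=21, seed=1):
    """Return {error_rate: distinct k-mer count} at equal coverage (about 20x)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    results = {}
    for rate in error_rates:
        reads = simulate_reads(genome, n_reads=1000, read_len=100,
                               error_rate=rate, rng=rng)
        results[rate] = distinct_kmers(reads, k)
    return results

if __name__ == "__main__":
    for rate, count in compare_error_rates().items():
        print(f"error rate {rate:.3f}: {count} distinct k-mers")
```

With error-free reads the count cannot exceed the number of k-mers in the genome itself, whatever the coverage; with errors it keeps growing as more reads are added.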

NextSeq should be used as a resequencing platform, not as a de novo sequencing one...

While data from the above platforms is more expensive than NextSeq on a per-Gbp basis, the extra sequencing cost of a good-quality input dataset is usually far less than the cost of the scientists' time, experiments, and reagents wasted on analysing bad assembly results...
07-13-2017, 06:38 AM   #3
cyanoevo
Member
 
Location: Bristol

Join Date: Jan 2015
Posts: 16

Thanks for your thoughts. Unfortunately our sequencing centre has seen fit to swap their HiSeq 2500 for a NextSeq, so I am stuck with it. Funnily enough, I had no problems when I was working with HiSeq data...
07-13-2017, 08:01 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,638

Spurious kmers increase memory consumption; you can get rid of a lot of these via preprocessing: adapter-trimming, error-correction, discarding reads with singleton kmers, normalization, overlap-based read merging, and so forth. If SPAdes still runs out of memory, you can try Megahit instead. Don't assemble the lanes independently and try to merge them; that won't be beneficial.
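One of the preprocessing steps listed above, discarding reads that contain singleton k-mers, can be illustrated with a small sketch. This is a toy in-memory version for intuition only, not how BBTools or any other package implements it; the function names and the choice of k are hypothetical. Note that in a metagenome, reads from low-abundance organisms also carry genuinely rare k-mers, so a filter like this removes real low-coverage data along with errors.

```python
from collections import Counter

def kmers(seq, k):
    """Return all substrings of length k in seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def drop_reads_with_singletons(reads, k):
    """Keep only reads whose every k-mer occurs at least twice in the dataset.

    Reads shorter than k contain no k-mers and are therefore kept.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return [read for read in reads
            if all(counts[km] > 1 for km in kmers(read, k))]

if __name__ == "__main__":
    reads = [
        "ACGTACGTAC",
        "ACGTACGTAC",
        "ACGTACGTAC",
        "ACGTTCGTAC",  # same read with one substitution (A -> T)
    ]
    print(drop_reads_with_singletons(reads, k=5))
```

In the example, the read with the substitution contains k-mers such as `ACGTT` that appear nowhere else, so it is dropped while the three clean copies are kept.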

NextSeq has a much higher error rate than HiSeq 2500. You may want to try FilterByTile to get rid of the lowest-quality reads by flowcell position.
07-13-2017, 08:08 AM   #5
cyanoevo
Member
 
Location: Bristol

Join Date: Jan 2015
Posts: 16

Thanks Brian, that's very helpful. Was actually about to try normalizing with bbnorm to see if that improved things.
07-13-2017, 08:12 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,411

Even though NextSeq has 4 "lanes" that are optically distinct, they share the same fluidic path. If you are going to normalize the data, do it on all 4 "lanes" at the same time.
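Treating the four "lanes" as one dataset usually just means concatenating the per-lane FASTQ files before any downstream step. A minimal sketch, assuming gzip-compressed per-lane files; the file names in the usage comment are hypothetical, and the helper names are not from any particular tool. Concatenated gzip members form a valid gzip stream, so a plain byte-level copy is enough. For paired-end data, R1 and R2 must be concatenated in the same lane order so that read pairs stay in sync.

```python
import gzip
import shutil
from pathlib import Path

def concatenate_lanes(lane_files, merged_path):
    """Append the lane files, in the order given, into one file (raw byte copy)."""
    with open(merged_path, "wb") as merged:
        for lane in lane_files:
            with open(lane, "rb") as src:
                shutil.copyfileobj(src, merged)
    return Path(merged_path)

def count_records(fastq_gz):
    """Count FASTQ records in a gzip file, assuming four lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4

# Usage with hypothetical file names:
#   lanes_r1 = [f"sample_L00{i}_R1.fastq.gz" for i in range(1, 5)]
#   lanes_r2 = [f"sample_L00{i}_R2.fastq.gz" for i in range(1, 5)]
#   concatenate_lanes(lanes_r1, "sample_R1.fastq.gz")
#   concatenate_lanes(lanes_r2, "sample_R2.fastq.gz")
```

Comparing `count_records` on the merged R1 and R2 files is a cheap sanity check that both mates were merged completely.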