SEQanswers

Old 05-14-2015, 06:11 AM   #1
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
SPAdes: erroneous kmer threshold

Hello Members,

I'm working with paired-end Illumina data from E. coli, using SPAdes version 3.5.
The environment is SLURM, on a node with 128 GB of RAM.

I checked the QUAST report for one of the isolates, and the assembly size was alarmingly large for an E. coli: 6.592727 Mb, with 918 contigs (>= 1000 bp).

Code:
zcat file_R1_001.fastq.gz | awk '{if(NR%4==2) print length($1)} ' | head -1
Read length: 103
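
The one-liner above reports the length of the first read only. A minimal sketch for tabulating the lengths of all reads follows, assuming a standard four-line-per-record FASTQ; the function name `read_length_hist` is hypothetical and not part of any tool mentioned in this thread.

```shell
# Tabulate read lengths across a FASTQ stream read from stdin.
# Assumes exactly 4 lines per record (no wrapped sequence lines).
# Prints "length<TAB>count", sorted by length.
read_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' | sort -n
}

# Usage (file name taken from the post above):
# zcat file_R1_001.fastq.gz | read_length_hist
```

A wide spread of lengths would suggest the reads were trimmed before assembly; a single length suggests raw reads.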

I traced this back to the warnings.log file for that assembly run, which says:

Quote:
=== Error correction and assembling warnings:
* 0:03:13.851 484M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 1
* 0:02:26.558 620M / 9G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 4
* 0:02:26.565 620M / 9G WARN General (kmer_coverage_model.cpp : 366) Failed to determine erroneous kmer threshold. Threshold set to: 4
======= Warnings saved to $HOME/Docs/warnings.log
I have no idea what is causing this.
Meanwhile, I will check the FastQC report to see whether it explains anything.

Any guidance would be of great help.

Thank you.

Last edited by bio_informatics; 05-14-2015 at 07:04 AM.
Old 05-14-2015, 09:13 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

With such a large assembly, I'd suspect contamination. I suggest you look at the GC vs coverage distribution of contigs; you may get two distinct clouds for different organisms. Also blasting them may help figure it out.

If this is an isolate rather than single cell, a kmer frequency histogram could also indicate the presence of multiple organisms. None of these will help much if it's two strains of E. coli, though.
Old 05-14-2015, 09:31 AM   #3
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

Hi Brian,

Thanks for your reply.
I checked the FastQC reports, and they looked reasonably good, without much of a quality drop.
It's an isolate.

- Is there any free tool to check the GC vs coverage distribution?
- Should I BLAST the whole assembly at NCBI?
- Would a kmer frequency histogram from KmerGenie be useful?

I apologize for such naive questions. I have not come across this situation before.

Thanks for your guidance.

Old 05-14-2015, 09:40 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Was this single cell or an isolate? A kmer-frequency histogram won't help for single-cell data, but it will for isolates. KmerGenie produces the wrong kind of histogram; what you need is the number of unique kmers at each depth for a fixed kmer length. You can generate that using BBNorm like this:

khist.sh in=reads.fq hist=histogram.txt
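
Once the histogram exists, peaks can be picked out with a short awk pass. This is a rough sketch only: it assumes histogram.txt is tab-separated with a `#`-prefixed header containing a `Unique_Kmers` column and the depth in the first column (check your file's header), and the function name `khist_peaks` is hypothetical. It does no smoothing, so noisy histograms may need a wider window.

```shell
# Report local maxima in a kmer depth histogram.
# $1 = histogram file
# $2 = minimum depth to consider (default 5, to skip the low-depth error kmers)
# $3 = half-window in rows that a peak must dominate (default 3)
khist_peaks() {
    awk -F'\t' -v min="${2:-5}" -v w="${3:-3}" '
        /^#/ {                        # header: locate the Unique_Kmers column
            sub(/^#/, "", $1)
            for (i = 1; i <= NF; i++) if ($i == "Unique_Kmers") col = i
            next
        }
        col { n++; depth[n] = $1; cnt[n] = $col }
        END {
            for (i = 1; i <= n; i++) {
                if (depth[i] < min || cnt[i] == 0) continue
                peak = 1
                for (j = i - w; j <= i + w; j++) {
                    if (j < 1 || j > n || j == i) continue
                    # strict on the left, non-strict on the right, so a
                    # plateau is reported once
                    if ((j < i && cnt[j] >= cnt[i]) || (j > i && cnt[j] > cnt[i])) peak = 0
                }
                if (peak) printf "%d\t%d\n", depth[i], cnt[i]
            }
        }' "$1"
}

# Usage:
# khist_peaks histogram.txt 5 3
```

Two well-separated peaks at unrelated depths would be consistent with two organisms at different abundances, though repeats and plasmids can also produce secondary peaks.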

The GC versus coverage plot can be generated fairly easily with BBMap:

bbmap.sh ref=assembly.fa in=reads.fq covstats=covstats.txt fast

The covstats file will list the length, coverage, and gc content of all contigs.
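
Short of plotting, the covstats table can be screened for contigs whose GC content is far from what is expected for the target organism. The sketch below assumes covstats.txt is tab-separated with a `#`-prefixed header that includes columns named `Avg_fold`, `Length`, and `Ref_GC`, and that `Ref_GC` is a fraction between 0 and 1; verify this against your own file. The function name `gc_cov_outliers` and its thresholds are hypothetical.

```shell
# List contigs whose GC fraction deviates from an expected value.
# $1 = covstats file
# $2 = expected GC fraction (e.g. 0.508 for E. coli)
# $3 = allowed absolute deviation (e.g. 0.05)
# $4 = minimum contig length to report (default 1000)
# Prints: contig <TAB> length <TAB> average coverage <TAB> GC fraction
gc_cov_outliers() {
    awk -F'\t' -v gc="$2" -v tol="$3" -v minlen="${4:-1000}" '
        NR == 1 {                     # map column names to positions
            sub(/^#/, "", $1)
            for (i = 1; i <= NF; i++) c[$i] = i
            if (!("Avg_fold" in c) || !("Length" in c) || !("Ref_GC" in c)) {
                print "unexpected covstats header" > "/dev/stderr"
                exit 1
            }
            next
        }
        $(c["Length"]) >= minlen {
            d = $(c["Ref_GC"]) - gc
            if (d < 0) d = -d
            if (d > tol)
                printf "%s\t%d\t%.2f\t%.4f\n", $1, $(c["Length"]), $(c["Avg_fold"]), $(c["Ref_GC"])
        }' "$1"
}

# Usage:
# gc_cov_outliers covstats.txt 0.508 0.05
```

Contigs that are outliers in GC and also sit at a clearly different coverage from the bulk of the assembly are the ones worth checking first.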

The simplest thing to do, though, is probably to BLAST the entire assembly against nt and see what you get. I have never personally done that, though; I think when we do it here we use some kind of wrapper that summarizes which taxa are hit and in what amounts. I'm not sure how complicated that wrapper is; I only interact with BLAST via a browser, one sequence at a time.
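
A taxon summary of that kind could be sketched as below. This is not the wrapper referred to above, only an illustration. It assumes a local BLAST+ installation with the nt database and the taxonomy database (taxdb) available, tabular output with the three fields shown, and that hits for each query are listed best first. The function name `top_hit_taxa` and the file name hits.tsv are hypothetical.

```shell
# Assumed BLAST+ invocation (needs a local nt database plus taxdb):
# blastn -query assembly.fa -db nt -max_target_seqs 5 \
#        -outfmt "6 qseqid sscinames bitscore" -out hits.tsv

# Count how many contigs have each taxon as their first-listed hit.
# $1 = tabular BLAST output: qseqid <TAB> sscinames <TAB> bitscore
# Prints: contig count <TAB> taxon name, most frequent first
top_hit_taxa() {
    awk -F'\t' '!seen[$1]++ { taxa[$2]++ }
                END { for (t in taxa) printf "%d\t%s\n", taxa[t], t }' "$1" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage:
# top_hit_taxa hits.tsv
```

Counting contigs treats a 1 kbp contig the same as a 200 kbp one, so weighting by contig length may give a fairer picture of how much of the assembly each taxon accounts for.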
Old 05-14-2015, 09:53 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,059
Default

Not sure if I am following the thought here. Blasting will identify contamination, i.e. some of those 918 contigs would not have E. coli as the best hit?
Old 05-14-2015, 09:57 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by GenoMax View Post
Not sure if I am following the thought here. Blasting will identify contamination, i.e. some of those 918 contigs would not have E. coli as the best hit?
Yep. E. coli should only account for ~4.5 Mbp of the assembly, so the remaining ~2 Mbp is probably either misassembly or contaminant. That's enough for another complete (small) genome.
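
With the numbers in this thread, the excess is 6,592,727 - 4,500,000 = 2,092,727 bp, or about 2.09 Mbp. The same check can be run on any assembly FASTA with the sketch below; the function name `assembly_excess` is hypothetical, and it counts every contig regardless of length, so its total may differ slightly from a QUAST report that applies a minimum contig length.

```shell
# Compare total assembly length with an expected genome size.
# $1 = assembly FASTA, $2 = expected genome size in bp
assembly_excess() {
    awk -v expected="$2" '
        !/^>/ { total += length($0) }     # sum sequence lines, skip headers
        END {
            printf "total_bp=%d expected_bp=%d excess_bp=%d\n",
                   total, expected, total - expected
        }' "$1"
}

# Usage:
# assembly_excess assembly.fa 4500000
```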
Old 05-14-2015, 10:43 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,059
Default

Or one or more plasmids.

But this may get complicated since, depending on how good the assembly is, E. coli may not be the top/best/only hit. We shall see what OP finds.
Old 05-15-2015, 09:50 AM   #8
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

@GenoMax, Brian: Thank you for your input.
I spoke with my teammates. The data we have is contaminated. The isolate will be re-sequenced in a couple of weeks.

I skipped the kmer histogram, the GC vs coverage plot, and BLAST.

Thanks again for your valuable suggestions, and guidance.
Old 05-17-2015, 11:08 AM   #9
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

The plot that Brian mentioned can be produced with GAEMR (http://www.broadinstitute.org/softwa...erence-manual/). See Figure 6.1, 'Blast Bubbles', for an example of the plot.
Old 05-17-2015, 05:03 PM   #10
bio_informatics
Senior Member
 
Location: USA

Join Date: Nov 2013
Posts: 182
Default

@boetsie:
Thanks for pointing to this tool.
It looks like it has a lot of useful functionality.

Tags
assembly, spades
