#1
Senior Member
Location: USA Join Date: Nov 2013
Posts: 182
Hello Members,
I'm working with paired-end Illumina data from E. coli, using SPAdes 3.5 in a SLURM environment on a node with 128 GB RAM. I checked the QUAST report for one of the isolates, and the assembly size was alarmingly high for an E. coli: 6.592727 Mb, with 918 contigs (>= 1000 bp).
Code:
zcat file_R1_001.fastq.gz | awk '{if(NR%4==2) print length($1)} ' | head -1
I traced this back to the warnings.log file for the assembly run, which says:
Quote:
Meanwhile, I shall check the FastQC report to see if it makes sense to me. Any guidance would be of great help. Thank you.
Last edited by bio_informatics; 05-14-2015 at 08:04 AM.
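The one-liner above reports the length of the first read only. A minimal sketch, assuming a standard four-line-per-record FASTQ, of how the same check could tally the lengths of every read; `read_length_hist` is a hypothetical helper name, not part of any tool mentioned here:

```shell
# Sketch only: assumes each FASTQ record is exactly four lines,
# with the sequence on the second line of every record.
read_length_hist() {
  # Count how many reads have each length, then sort by length.
  awk 'NR % 4 == 2 { n[length($0)]++ }
       END { for (l in n) print l "\t" n[l] }' | sort -n
}

# Usage (file name taken from the post above):
#   zcat file_R1_001.fastq.gz | read_length_hist
```

Each output line is a read length followed by the number of reads of that length, which makes mixed or truncated read lengths easier to spot than a single value.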
#2
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
With such a large assembly, I'd suspect contamination. I suggest you look at the GC vs coverage distribution of contigs; you may get two distinct clouds for different organisms. BLASTing them may also help figure it out.
If this is an isolate rather than single cell, a kmer frequency histogram could also indicate the presence of multiple organisms. None of these will help much if it's two strains of E. coli, though.
#3
Senior Member
Location: USA Join Date: Nov 2013
Posts: 182
Hi Brian,
Thanks for your reply. I checked the FastQC reports and they looked reasonably good, without much of a quality drop. It's an isolate.
- Is there any free tool to check the GC vs coverage distribution?
- Should I BLAST the whole assembly at NCBI?
- Would a kmer frequency plot from KmerGenie be useful?
I apologize for such naive questions; I've not come across this situation before. Thanks for your guidance.
Quote:
#4
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Was this single cell, or isolate? If single-cell, a kmer-frequency histogram won't help, but it will for isolates. KmerGenie produces the wrong kind of histogram; the kind you need is the number of unique kmers per depth for a fixed kmer length. You can generate that using BBNorm like this:
khist.sh in=reads.fq hist=histogram.txt
The GC versus coverage plot can be generated fairly easily with BBMap:
bbmap.sh ref=assembly.fa in=reads.fq covstats=covstats.txt fast
The covstats file will list the length, coverage, and GC content of all contigs.
The simplest thing to do, though, is probably to BLAST the entire assembly against nt and see what you get. I have never personally done that, though; I think when we do it here we use some kind of wrapper that summarizes which taxa are hit in which amounts. Not sure how complicated that wrapper is; I only interact with BLAST via a browser, one sequence at a time.
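One way to eyeball the GC vs coverage relationship is to pull just those columns out of the covstats file. A minimal sketch, assuming the file is tab-delimited with a header line that names columns `Length`, `Ref_GC`, and `Avg_fold` (an assumption about the BBMap output that is worth checking against the first line of your own file); `gc_cov_table` is a hypothetical helper name:

```shell
# Sketch only: the column names Length, Ref_GC and Avg_fold are assumed;
# adjust them if the header of your covstats file differs.
gc_cov_table() {
  # Print contig ID, length, GC fraction and average fold coverage,
  # sorted by ascending coverage.
  awk -F'\t' '
    NR == 1 {
      # Locate columns by header name rather than by fixed position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("Length" in col) || !("Ref_GC" in col) || !("Avg_fold" in col)) {
        print "expected columns not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    { printf "%s\t%d\t%.3f\t%.2f\n", $1, $(col["Length"]), $(col["Ref_GC"]), $(col["Avg_fold"]) }
  ' "$1" | sort -t "$(printf '\t')" -k4,4n
}

# Usage: gc_cov_table covstats.txt > gc_cov.tsv
```

The resulting table can be loaded into any plotting tool; contigs that cluster at a clearly different GC fraction or coverage from the bulk of the assembly would be the first candidates to BLAST.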
#5
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Not sure if I am following the thought here. BLASTing will identify contamination, i.e. some of those 918 contigs would not have E. coli as the best hit?
#6
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Yep. E. coli should only account for ~4.5 Mbp of the assembly, so the remaining ~2 Mbp is probably either misassemblies or contaminant. That's enough for another complete (small) genome.
#7
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Or one or more plasmids.
But this may get complicated since, depending on how good the assembly is, E. coli may not be the top/best/only hit. We shall see what OP finds.
#8
Senior Member
Location: USA Join Date: Nov 2013
Posts: 182
@Genomax, Brian: Thank you for your inputs.
I spoke with my teammates. The data we have is contaminated. The isolate will be re-sequenced in a couple of weeks. I skipped the kmer and GC vs coverage plots, and BLAST. Thanks again for your valuable suggestions and guidance.
#9
Senior Member
Location: NL, Leiden Join Date: Feb 2010
Posts: 245
The plot that Brian mentioned can be produced with GAEMR (http://www.broadinstitute.org/softwa...erence-manual/). See an example of the plot at Figure 6.1 'Blast Bubbles'
#10
Senior Member
Location: USA Join Date: Nov 2013
Posts: 182
@boetsie:
Thanks for pointing to this tool. It looks like it has a whole lot of utility.
Quote:
Tags: assembly, spades