#1
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
Hi all - I am very new to metagenomics analyses, so please bear with me. I have a metagenome generated from 2 L of Sterivex-filtered stream water, prepped with Nextera XT, dual indexed, and run on a NextSeq 500 alongside other samples. The run produced 29,570,310 raw paired-end reads (2 x 150 bp), which I know is not necessarily a lot for a metagenome. In the code snippets below I did not use the real file names, since I have four sets of merged reads and four sets of unmerged reads (two files for each of four lanes) with long names that I need to shorten, but the commands worked with the real file names. For read processing I used bbduk and bbmerge as follows:

Adapter and quality trimming:

Code:
bbduk.sh in1=read1.fq in2=read2.fq out1=name_R1_Q15_k29.fq out2=name_R2_Q15_k29.fq ref=/usr/local/bbmap/resources/nextera.fa qtrim=r trimq=15 k=29 mink=11 hdist=1 ktrim=r tpe tbo

Merging overlapping pairs:

Code:
bbmerge.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out=name_Q15_K29_merged.fq outu=name_Q15_k29_unmerged.fq mininsert=20

Assembly with Megahit:

Code:
./megahit --12 name_Q15_k29_unmerged.fq -r name_Q15_k29_merged.fq --k-max=127 -o output_directory

The assembly produced:
- 40,747 contigs totalling 19,663,614 bp
- minimum contig size of 200 bp, maximum of 7,816 bp, average of 483 bp
- N50 = 458 bp

From what I have read, that N50 is low and likely reflects insufficient coverage of the sampled community. I should be shooting for an N50 >= 1000 bp, correct? At this point additional sequencing is not an option, so I am trying to work with what I have. Now that I have contigs from a de novo assembly, I should have a large proportion of reads that were not incorporated into those contigs but are still potentially valuable for taxonomic and functional annotation. Megahit does not output the subset of reads that were not incorporated into the contigs, so I followed the tutorial at the following URL for recovering unassembled reads: https://github.com/voutcn/megahit/wi...-real-assembly

I mapped the reads as follows:

Code:
bbwrap.sh ref=name_sterivex_assembly/final.contigs.fa in=name_Q15_k29_unmerged.fq,name_Q15_k29_merged.fq out=aln.sam.gz kfilter=22 subfilter=15 maxindel=80

Code:
pileup.sh in=aln.sam.gz out=cov.txt

The following is the first few lines of the coverage file:

Code:
$ head cov.txt
#ID                                  Avg_fold  Length  Ref_GC  Covered_percent  Covered_bases  Plus_reads  Minus_reads  Read_GC
k127_4 flag=1 multi=2.0000 len=391   0.0000    391     0.0000  0.0000           0              0           0            0.0000
k127_5 flag=1 multi=2.0000 len=414   0.0000    414     0.0000  0.0000           0              0           0            0.0000
k127_6 flag=1 multi=1.0000 len=440   0.3432    440     0.0000  34.3182          151            0           1            0.6424

Some contigs (e.g. k127_4 and k127_5) show zero coverage even though they were assembled from these same reads, which seems odd to me. Finally, I extracted the unmapped reads with:

Code:
samtools view -u -f4 aln.sam.gz | samtools bam2fq -s unmapped.se.fq - > unmapped.pe.fq
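A quick way to quantify how many reads made it back onto the contigs is to summarize the alignment itself; a minimal sketch, assuming the aln.sam.gz produced by the bbwrap command above:

Code:
# overall mapping summary for the bbwrap alignment (sketch)
samtools flagstat aln.sam.gz
# count just the unmapped reads (SAM flag 4)
samtools view -c -f4 aln.sam.gz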
#2
rhinoceros
Senior Member
Location: sub-surface moon base
Join Date: Apr 2013
Posts: 372
In my experience, IDBA-UD assembles metagenomes way better than Megahit. I can't comment on your QC procedures; for Illumina QC I've used Trim Galore exclusively. As for the odd coverage results, perhaps you should try mapping with e.g. BWA and see if the results differ?
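A minimal BWA-MEM sketch for that comparison, assuming bwa and samtools are installed; the file names follow the commands in the first post, and the thread count and output names are arbitrary:

Code:
# index the contigs, then map merged reads (single-end) and unmerged pairs (interleaved)
bwa index name_sterivex_assembly/final.contigs.fa
bwa mem -t 8 name_sterivex_assembly/final.contigs.fa name_Q15_k29_merged.fq | samtools sort -o bwa_merged.bam -
bwa mem -t 8 -p name_sterivex_assembly/final.contigs.fa name_Q15_k29_unmerged.fq | samtools sort -o bwa_unmerged.bam -
samtools index bwa_merged.bam
samtools index bwa_unmerged.bam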
__________________
savetherhino.org
#3
trexbob
Member
Location: Atlanta, GA
Join Date: Feb 2013
Posts: 17
You might try normalizing your data first, using bbnorm. It helped me tremendously. Like rhinoceros above, I've found MEGAHIT not as good as some others; I suggest you try OMEGA. Also, some assemblers that can use native paired reads (OMEGA among them) do not like the reads being merged beforehand - I can't explain why, but OMEGA does better if you feed it the paired fastq files rather than the already-merged pairs. Maybe Brian Bushnell can answer why.
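A minimal bbnorm.sh sketch, using the trimmed file names from the first post; target=100 and min=5 are just the commonly cited starting values, not settings tuned for this dataset:

Code:
# digital normalization to ~100x target k-mer depth, discarding reads below depth 5 (sketch)
bbnorm.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out1=name_R1_norm.fq out2=name_R2_norm.fq target=100 min=5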
#4
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
Thanks for the replies.
rhinoceros - I do have IDBA-UD installed and have played around with it a little on smaller datasets, but I have not tried to assemble this dataset with it yet. Comparing output from different assemblers is something I definitely need to do.

trexbob - I have read a little about read normalization but have not tried it yet; that should be relatively easy to do with tools from the bbmap suite. I think I saw a thread specifically about normalization for metagenomes on this forum somewhere... It's also interesting that whether or not overlapping pairs are merged can have a significant impact on the resulting assembly - I will try it both ways and see how the results differ.

I downloaded BWA today and am going to compare those results to the ones I get using bbmap. I'll report back with my findings. Thanks for the suggestions so far.
#5
vout
Junior Member
Location: Hong Kong
Join Date: Jun 2015
Posts: 4
sapidissima -
I think it is true that the coverage is insufficient to obtain a good assembly. I suggest using "--min-count 1" and "--k-step 10" with Megahit to rescue more low-coverage regions. You can apply the same tricks when trying out IDBA-UD.

For the 0-coverage contigs, I suspect it is because you are using relatively long reads (possibly 200+ bp after merging?), and a de Bruijn graph assembler may use only some of the k-mers of a read. BBMap outputs global alignments by default, so running it with "local=t" is probably better in your case.

rhinoceros - Megahit is actually heavily influenced by IDBA-UD. As the developer of Megahit, my experience is that the current version (v1.0+) should match the quality of IDBA-UD. Generally speaking, Megahit's contigs are a bit shorter than IDBA-UD's but contain fewer misassemblies. The big advantage of Megahit is its time and memory efficiency. We are still actively maintaining and developing Megahit, so feedback and suggestions are most welcome.

Last edited by vout; 01-05-2016 at 07:54 PM.
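A sketch of what those re-runs might look like, reusing the file names from the first post; the output directory name is arbitrary:

Code:
# Megahit re-run keeping low-coverage k-mers and using a finer k-mer step (sketch)
./megahit --12 name_Q15_k29_unmerged.fq -r name_Q15_k29_merged.fq --min-count 1 --k-step 10 --k-max 127 -o megahit_mincount1_out

# re-map with bbwrap in local mode so clipped read ends are not penalized
bbwrap.sh ref=megahit_mincount1_out/final.contigs.fa in=name_Q15_k29_unmerged.fq,name_Q15_k29_merged.fq out=aln_local.sam.gz local=t maxindel=80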
#6
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Whether merging helps an assembly depends on several things:

1) The assembler. Some (like Ray and, from what I hear, MIRA) work well with merged reads, others (like ALLPATHS) do their own merging internally, and still others (like SOAP) work better with the unmerged pairs. BBMerge is currently part of the recommended OMEGA workflow, though; in their testing it substantially improves the assemblies. That team also made an alternative merger for non-overlapping reads.

2) The merging program (and its stringency setting). The latest version of BBMerge is the most accurate read merger, with the lowest false-positive rate of all overlap-based mergers, but earlier versions, particularly before mid-2015 or so, were less accurate.

3) The insert-size distribution of the reads. If they mostly don't overlap, the fraction of false-positive merges increases. So don't merge reads if the merge rate is low (under 20% or so).

I wish I had time, but I've never gotten around to an extensive benchmark of multiple assemblers with merged versus unmerged reads on multiple datasets. So the results of merging for different assemblers are just things I have noticed and/or been told by others; I do not have enough data points to be conclusive. With SPAdes, for example, I've had datasets that did better with merging (particularly low-coverage ones) and others that did worse. I do not know whether Megahit does better or worse as a result of merging; most of my testing has focused on single cells or isolates, while Megahit is used in our metagenome assembly pipeline, which I don't directly interact with. Perhaps vout can give a recommendation?

Last edited by Brian Bushnell; 01-05-2016 at 10:13 PM.
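As a concrete illustration of points 2 and 3, one approach is to merge at the default and at a stricter preset, and only commit to the merged reads if the reported join rate is reasonable; a sketch using the file names from the first post (strict=t is a BBMerge stringency preset):

Code:
# compare merge rates at default vs. strict settings before committing to merged reads (sketch)
bbmerge.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out=merged_default.fq outu=unmerged_default.fq
bbmerge.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out=merged_strict.fq outu=unmerged_strict.fq strict=t
# if the reported "Joined" percentage is low (roughly under 20%), assemble the unmerged pairs instead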
#7
trexbob
Member
Location: Atlanta, GA
Join Date: Feb 2013
Posts: 17
Thanks Brian, I appreciate the response! In my experience, merging reads has not improved assemblies with OMEGA, but it has helped with Megahit (our reads have been merging at 70-90%).

More generally, it seems that getting a good assembly from a metagenome is not an easy task. I've been running multiple assemblers on multiple data sets, mostly low diversity, some highly mixed eukaryote/prokaryote (according to MetaPhlAn). The pre-processing steps (quality trimming, merging, normalization, binning, etc.), the parameters of each assembler, and what you feed into them all produce vastly different results. Each data set seems to have its own 'best' parameters, and even its own 'best' assembler, unfortunately for us researchers.
#8
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
I tried replying last Friday and got a message that my post would appear after moderator approval. I think it never made it that far, so I am posting my reply again...
vout - thanks for the suggestions to change the input parameters to "--min-count 1" and "--k-step 10" for Megahit; I will try those. You asked about my read lengths, so I figure it would help to provide some additional information about my data.

Here is a picture of the Nextera library on the Bioanalyzer after index PCR. It is somewhat under-tagmented:

[image]

The following is the raw read-length distribution (sorry for the ugly Excel graphs):

[image]

These are the reads after adapter removal and quality trimming at Q15 in bbduk:

[image]

Finally, here are the read lengths after merging overlapping pairs:

[image]

I am surprised at how many reads were merged, given that the proportion of inserts under 300 bp appears to be very small. However, it could be that a library like this does not cluster large fragments efficiently, so I still ended up with a lot of mergeable pairs - or something is wrong with how I did my merging.

For mapping, I discovered that when using the interleaved fastq files with bbwrap, only the unmerged reads were used in the mapping. I re-did the mapping by running the mapping step separately for the unmerged pairs and the merged reads and then merging the resulting alignment files (a sketch of this is shown below). Doing this I ended up with:

- 360,392 mapped reads
- 19,278,578 unmapped reads
- average coverage = 2.8
- percent of scaffolds with any coverage = 99.75
- percent of reference bases covered = 97.79

I had 100 contigs with zero coverage out of 40,747 total contigs, and ~99% of those zero-coverage contigs were 250 bp or less in length. This seems like a much more reasonable result, and I often see that people elect not to even report contigs of these lengths in final assemblies. Do these mapping results seem reasonable, then?
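A sketch of the separate-mapping approach described above, with placeholder output names; it assumes samtools is on the path so bbmap.sh can write BAM directly and pileup.sh can read it:

Code:
# map unmerged pairs (interleaved) and merged reads separately, then combine the alignments (sketch)
bbmap.sh ref=name_sterivex_assembly/final.contigs.fa in=name_Q15_k29_unmerged.fq interleaved=t out=aln_unmerged.bam
bbmap.sh ref=name_sterivex_assembly/final.contigs.fa in=name_Q15_k29_merged.fq out=aln_merged.bam
samtools sort -o aln_unmerged.sorted.bam aln_unmerged.bam
samtools sort -o aln_merged.sorted.bam aln_merged.bam
samtools merge aln_combined.bam aln_unmerged.sorted.bam aln_merged.sorted.bam
pileup.sh in=aln_combined.bam out=cov_combined.txt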
#9
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
#10
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
Hi Brian - Thanks for looking at my results. Yeah, I guess I'll be running the NextSeq some more over the next few weeks to generate additional data for these libraries. It's pretty amazing to me how diverse crystal-clear mountain stream water can be.
#11
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,080
You may want to think HiSeq for a "lot more".
It may actually be more cost-effective to get this sequenced elsewhere once you calculate the number of runs required on the NextSeq.
#12
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Also, it's hard to say because the axes are different, but your adapter-trimming results look strange... since merging results in ~200k reads in the 100 bp insert-size bin, adapter-trimming should also result in ~200k reads of that length (or ~400k if you are counting the two reads of a pair independently) - but your graph indicates perhaps under 50k, which is odd. Can you post the screen output of the BBDuk run?
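To get length histograms that are directly comparable (same bins, same counting), the BBMap suite's readlength.sh can be run on the files before and after trimming; a sketch with placeholder file names for the raw data:

Code:
# read-length histograms before and after trimming, in the same text format (sketch)
readlength.sh in=raw_R1.fq in2=raw_R2.fq out=lengths_raw.txt
readlength.sh in=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out=lengths_trimmed.txt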
#13
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
Hey GenoMax - yeah, we own the NextSeq, so it is convenient, but farming it out to a higher-throughput sequencer may be more cost-effective for our study. This project is a comparative analysis among several samples, so we are trying to balance the amount of data per sample against the number of samples. For a rigorous comparative analysis, we are coming to the realization that the sequencing just isn't going to be cheap and more data is needed.
Thanks for taking a closer look at my results, Brian. I may have gotten confused and generated the bbduk histogram in my post above from one lane of data instead of all four lanes combined. I'll start from scratch here and post the results of each QC step using all of the data.

First I concatenated the files from each of the four lanes into single R1 and R2 .fq files, and ran bbduk.sh as follows:

Code:
$ bbduk.sh in1=Boone_sterivex_R1_raw.fq in2=Boone_sterivex_R2_raw.fq out1=Boone_sterivex_R1_Q15_k29.fq out2=Boone_sterivex_R2_Q15_k29.fq ref=/usr/local/bbmap/resources/nextera.fa qtrim=r trimq=15 k=29 mink=11 hdist=1 ktrim=r tpe tbo

HTML Code:
BBDuk version 35.82
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=8958m, free=8818m, used=140m

Added 143423 kmers; time: 0.125 seconds.
Memory: max=8958m, free=8491m, used=467m

Input is being processed as paired
Started output streams: 0.017 seconds.
Processing time: 183.722 seconds.

Input:                29570310 reads           3876898301 bases.
QTrimmed:             8761068 reads (29.63%)   37713560 bases (0.97%)
KTrimmed:             430993 reads (1.46%)     13990985 bases (0.36%)
Trimmed by overlap:   247652 reads (0.84%)     5692612 bases (0.15%)
Result:               29462726 reads (99.64%)  3819501144 bases (98.52%)

Time:               183.878 seconds.
Reads Processed:    29570k    160.81k reads/sec
Bases Processed:    3876m     21.08m bases/sec

[image]

Here is the command and output of bbmerge:

Code:
$ bbmerge.sh in1=Boone_sterivex_R1_Q15_k29.fq in2=Boone_sterivex_R2_Q15_k29.fq out=Boone_sterivex_Q15_k29_merged.fq outu=Boone_sterivex_Q15_k29_unmerged.fq mininsert=20

HTML Code:
BBMerge version 8.9
Writing mergable reads merged.
Started output threads.
Total time: 148.209 seconds.

Pairs:          14731363
Joined:         9889176    67.130%
Ambiguous:      4530331    30.753%
No Solution:    311856     2.117%
Too Short:      0          0.000%

Avg Insert:           150.6
Standard Deviation:   66.2
Mode:                 111

Insert range:         20 - 293
90th percentile:      247
75th percentile:      202
50th percentile:      145
25th percentile:      96
10th percentile:      65

[image]

Finally, here is the length distribution of the merged reads alone:

[image]
#14
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Hmmm, this looks like the "raw" reads were already adapter-trimmed. The first histogram in that last post indicates millions of reads with post-trimming lengths of 40-140 bp, but the trimming results indicate that only about 1.5% of the bases were trimmed. These cannot both be true. What does the length histogram of the raw reads look like?
#15
sapidissima
Member
Location: West Virginia, Eastern Panhandle
Join Date: Apr 2014
Posts: 11
Hey Brian - Here is a histogram of the reads as downloaded from Basespace, with no processing outside of what may have been done by the basecalling software on the NextSeq:
[image]

Data for the histogram:

HTML Code:
#Reads:    29570310
#Bases:    3876898301
#Max:      151
#Min:      35
#Avg:      131.1
#Median:   150
#Mode:     150
#Std_Dev:  32.5
#Read Length Histogram:
#Length  reads  pct_reads  cum_reads  cum_pct_reads  bases  pct_bases  cum_bases  cum_pct_bases
0  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
5  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
10  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
15  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
20  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
25  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
30  0  0.000%  29570310  100.000%  0  0.000%  3876898301  100.000%
35  306855  1.038%  29570310  100.000%  11230931  0.290%  3876898301  100.000%
40  268760  0.909%  29263455  98.962%  11297484  0.291%  3865667370  99.710%
45  325316  1.100%  28994695  98.053%  15311599  0.395%  3854369886  99.419%
50  347713  1.176%  28669379  96.953%  18082799  0.466%  3839058287  99.024%
55  387908  1.312%  28321666  95.777%  22131776  0.571%  3820975488  98.558%
60  409809  1.386%  27933758  94.466%  25405243  0.655%  3798843712  97.987%
65  439474  1.486%  27523949  93.080%  29467832  0.760%  3773438469  97.331%
70  461149  1.560%  27084475  91.593%  33195498  0.856%  3743970637  96.571%
75  477018  1.613%  26623326  90.034%  36752397  0.948%  3710775139  95.715%
80  500909  1.694%  26146308  88.421%  41070670  1.059%  3674022742  94.767%
85  502759  1.700%  25645399  86.727%  43755986  1.129%  3632952072  93.708%
90  527245  1.783%  25142640  85.027%  48501710  1.251%  3589196086  92.579%
95  520416  1.760%  24615395  83.244%  50492252  1.302%  3540694376  91.328%
100  538683  1.822%  24094979  81.484%  54943605  1.417%  3490202124  90.026%
105  525914  1.779%  23556296  79.662%  56276407  1.452%  3435258519  88.608%
110  540789  1.829%  23030382  77.883%  60566193  1.562%  3378982112  87.157%
115  525258  1.776%  22489593  76.055%  61451710  1.585%  3318415919  85.595%
120  534775  1.808%  21964335  74.278%  65244167  1.683%  3256964209  84.010%
125  516625  1.747%  21429560  72.470%  65606483  1.692%  3191720042  82.327%
130  512322  1.733%  20912935  70.723%  67621417  1.744%  3126113559  80.634%
135  515765  1.744%  20400613  68.990%  70652744  1.822%  3058492142  78.890%
140  483569  1.635%  19884848  67.246%  68622099  1.770%  2987839398  77.068%
145  1633076  5.523%  19401279  65.611%  241998874  6.242%  2919217299  75.298%
150  17768203  60.088%  17768203  60.088%  2677218425  69.056%  2677218425  69.056%
155  0  0.000%  0  0.000%  0  0.000%  0  0.000%
#16
Brian Bushnell
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
Yep, your reads were already adapter-trimmed. Some versions of Illumina software will do this on the machine if configured to do so. I don't recommend that, though, as it's less accurate. Re-trimming does not really hurt, but it's always best to start with the raw data.
Merging is even more useful because of the high proportion of short reads - those are the ones that overlap 100%, and thus get fully error-corrected when merged because each base was called twice. So, you should certainly do it.

The reads under 150 bp had adapter sequence detected by Illumina's software, meaning they had insert sizes under 150 bp. Unlike most other platforms, Illumina reads all come off the machine at the same length (in this case 150 bp) unless they are explicitly trimmed during a post-processing step.
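If an assembler ends up preferring unmerged pairs, BBMerge's overlap information can still be used for error correction without joining the reads; a sketch based on its documented ecco mode, using the file names from the earlier posts:

Code:
# error-correct the overlapping portion of each pair by consensus, but keep the reads paired (sketch)
bbmerge.sh in1=Boone_sterivex_R1_Q15_k29.fq in2=Boone_sterivex_R2_Q15_k29.fq out=Boone_sterivex_ecco.fq ecco=t mix=t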