Seqanswers Leaderboard Ad

**rhinoceros** · 01-04-2016, 11:52 PM

In my experience, IDBA-UD assembles metagenomes way better than Megahit. Can't comment on your QC procedures. For Illumina QC, I've used Trim Galore exclusively. As to odd coverage results, perhaps you should try to map with e.g. BWA and see if the results differ?

**trexbob** · 01-05-2016, 10:44 AM

You might try normalizing your data first, using bbnorm. It helped me tremendously. Like rhinoceros above, I've found MEGAHIT not as good as some others. I suggest you might try OMEGA. Also some assemblers that can use native paired reads (OMEGA) do not like the reads paired before - I can't explain why, but it will do better if you feed it the paired fastq files rather than the already merged pairs. Maybe Brian Bushnell can answer why.

**A_sapidissima** · 01-05-2016, 01:57 PM

Thanks for the replies.

Rhinoceros - I do have IDBA-UD installed and played around with it a little with smaller datasets, but have not tried to assemble this dataset with it yet. Comparing output from different assemblers is something I definitely need to try.

Trexbob - I have read a little about read normalization, but have not tried it yet. That should be relatively easy to do using tools from the bbmap suite. I think I saw a thread specifically about normalization for metagenomes on this forum somewhere...It's interesting that assembly without merging overlapping pairs can have a significant impact on the resulting assembly - I will try both ways and see how the results differ.

I downloaded BWA today, and am going to compare those results to the ones I get using bbmap. I'll report back with my findings. Thanks for the suggestions so far.

**vout** · 01-05-2016, 07:49 PM

sapidissima -

I think it is true that the coverage insufficient to obtain a good assembly. I suggest to use "--min-count 1" and "--k-step 10" for Megahit to rescue more low coverage regions. You may do the same tricks when trying out IDBA-UD.

For the 0-coverage contigs, I suspect that because you are using relatively long reads (could be 200+ after merging?) and it is possible that a DBG assembler makes use of just some kmers of a read. BBMap outputs global alignments by default, and probably running it with "local=t" is better in your case.

rhinoceros -

Megahit is actually vastly influenced by IDBA-UD. As the developer of Megahit, I have the experience that the current version of Megahit (v1.0+) should have matched the quality IDBA-UD. Generally speaking, Megahit's contig is a bit shorter than IDBA-UD but contains fewer misassemblies. The absolute advantage of Megahit is its time-/memory-efficiency.

We are still actively maintaining & developing Megahit, so feedbacks/suggestions are most welcomed.

**Brian Bushnell** · 01-05-2016, 10:09 PM

Originally posted by trexbob View Post

Also some assemblers that can use native paired reads (OMEGA) do not like the reads paired before - I can't explain why, but it will do better if you feed it the paired fastq files rather than the already merged pairs. Maybe Brian Bushnell can answer why.

This depends three things:

1) The assembler; some (like Ray and from what I hear, Mira) work well with merged reads, others (like Allpaths) do their own merging internally, and still others (like SOAP) work better with the unmerged pairs. BBMerge is currently part of the recommended OMEGA workflow, though; in their testing, it substantially improves the assemblies. That team also made an alternative merger, though, for non-overlapping reads.

2) The merging program (and stringency setting). The latest version of BBMerge is the most accurate read merger, with the lowest false-positive rate of all overlap-based mergers. But earlier versions, particularly before mid-2015 or so, were less accurate.

3) The insert-size distribution of the reads. If they mostly don't overlap, then the fraction of false-positive merges increases. So, don't merge reads if the merge rate is low (under 20% or so).

I wish I had time, but I've never gotten around to an extensive benchmark of multiple assemblers with merged versus unmerged reads, on multiple datasets. So the results of merging for different assemblers are just things I have noticed and/or been told by others, but do not have enough data points to be conclusive. With Spades, for example, I've had datasets that did better with merging (particularly low-coverage ones), and others that did worse. I do not know whether Megahit does better or worse as a result of merging; most of my testing has focused on single-cells or isolates, while Megahit is used in our metagenome assembly pipeline that I don't directly interact with. Perhaps vout can give a recommendation?

**trexbob** · 01-06-2016, 05:43 AM

Thanks Brian, I appreciate the response! It's been my experience that merging reads has not improved the datasets with OMEGA, but has helped with Megahit (our reads have been merging 70-90%).

More generally, it seems that getting a good assembly from a metagenome is not an easy task. I've been running multiple assemblers with multiple data sets, mostly low diversity, some highly-mixed eukaryote/prokaryote (according to metaphlan). All the pre-processing steps (quality trimming, merging, normalization, binning, etc), tweaking the parameters of each assembler, as well as tweaking what you feed into them, produce vastly different results. Each data set seems to have its own 'best' parameters, and even its 'best' assembler, unfortunately for us researchers.

**A_sapidissima** · 01-11-2016, 07:01 AM

I tried replaying last Friday, and got a message that my message would appear after moderator approval. I think it never made it that far, so I am posting a reply again...

Vout, thanks for the suggestions for changing the input parameters "--min-count 1" and --k-step 10" for Megahit. I will try those. You asked about my read lengths, and I figure it would help to provide some additional information about my data. Here is a pic of the Nextera library on the Bioanalyzer after index PCR. It is somewhat under-tagmented:
[IMG][/IMG]

The following is the raw read length distribution (sorry for the ugly Excel graphs):
[IMG][/IMG]

These are the reads after adapter removal and trimming using Q15 in bbduk:
[IMG][/IMG]

Finally, here are the read lengths after merging overlapping pairs:
[IMG][/IMG]

I am surprised at how many reads were merged given that the proportion of reads with an insert less than 300 bp is very small. However, it could be that a library like that does not cluster large fragments efficiently, so I still ended up with a lot of mergeable pairs, or something is wrong with how I did my merging.

For mapping, I discovered when using interleaved fasta files with bbwrap, only the unmerged reads were used in the mapping. I re-did the mapping by performing the mapping step separate for unmerged and merged read pairs separately, and merging the sam files. Doing this I ended up with:
- 360,392 mapped reads
- 19.278,578 unmapped reads.
- avg coverage = 2.8
- percent scaffolds with any coverage = 99.75
- percent reference bases covered = 97.79

I had 100 contigs with zero coverage out of 40,747 total contigs. ~99% of these zero coverage contigs were 250 bp or less in length. This seems like a much more reasonable result, and I often see that people elect to not even report contigs of these lengths in final assemblies. Do these mapping results seem reasonable then?

**Brian Bushnell** · 01-11-2016, 07:07 PM

Originally posted by A_sapidissima View Post

Doing this I ended up with:
- 360,392 mapped reads
- 19.278,578 unmapped reads.
- avg coverage = 2.8

Sounds like you need to sequence more - a lot more

I had 100 contigs with zero coverage out of 40,747 total contigs. ~99% of these zero coverage contigs were 250 bp or less in length. This seems like a much more reasonable result, and I often see that people elect to not even report contigs of these lengths in final assemblies. Do these mapping results seem reasonable then?

Yes, they do.

**A_sapidissima** · 01-12-2016, 09:42 AM

Hi Brian - Thanks for looking at my results. Yeah, I guess I'll be running the NextSeq some more over the next few weeks to generate additional data for these libraries. Its pretty amazing to me how diverse crystal clear mountain stream water can be.

**GenoMax** · 01-12-2016, 09:45 AM

May want to think HiSeq for a "lot more".

It actually may be more cost effective to get this sequenced elsewhere after you calculate the number of runs required on NextSeq.

**Brian Bushnell** · 01-12-2016, 06:40 PM

Originally posted by A_sapidissima View Post

I am surprised at how many reads were merged given that the proportion of reads with an insert less than 300 bp is very small. However, it could be that a library like that does not cluster large fragments efficiently, so I still ended up with a lot of mergeable pairs, or something is wrong with how I did my merging.

I forgot to comment on this. It looks like the merging was done correctly. However, regardless of your sonication settings / bioanalyzer results / etc, you won't know the true insert size distribution until you get the sequenced pairs and merge (or map). For one thing, short sequences can greatly out-compete long sequences during cluster formation. If you sequence more, be sure to do more robust size-selection; with the current library, you are wasting quite a lot of data due to short inserts.

Also, it's hard to say because the axes are different, but your adapter-trimming results look strange... since merging results in ~200k reads in the 100bp insert size bin, adapter-trimming should also result in ~200k (or ~400k if you are counting the two reads of pairs independently) reads with that length - but your graph indicates perhaps under 50k. Which is odd. Can you post the screen output of the BBDuk run?

**A_sapidissima** · 01-13-2016, 08:12 AM

Hey GenoMax - yeah, we own the NextSeq so it is convenient, but farming it out to a higher throughput sequencer may be more cost effective for our study. This project is a comparative analysis among several samples, so we are trying to balance the amount of data per sample and the number of samples simultaneously. For a rigorous comparative analysis, we are coming to the realization that sequencing just isn't going to be cheap and more data is needed.

Thanks for taking a closer look at my results, Brian. I may have gotten confused and generated a histogram of data from one lane of data instead of all four lanes combined for bbduk in my post above. I'll start from scratch here again and I'll post the results of each QC step using all of the data. First I concatenated the files for each of the 4 lanes into a F and R .fq file, and ran the bbduk.sh script as follows:

Code:

$ bbduk.sh in1=Boone_sterivex_R1_raw.fq in2=Boone_sterivex_R2_raw.fq out1=Boone_sterivex_R1_Q15_k29.fq out2=Boone_sterivex_R2_Q15_k29.fq ref=/usr/local/bbmap/resources/nextera.fa qtrim=r trimq=15 k=29 mink=11 hdist=1 ktrim=r tpe tbo

I had thought that just trimming the right side of the reads was okay since the quality tends to be okay at the start of the reads, but maybe I should change that option to 'qtrim=rl'. Also, I am only doing the adapter trimming on the right side and not the left, e.g. 'ktrim=r'. Here is the output from this step:

HTML Code:

BBDuk version 35.82
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=8958m, free=8818m, used=140m

Added 143423 kmers; time: 	0.125 seconds.
Memory: max=8958m, free=8491m, used=467m

Input is being processed as paired
Started output streams:	0.017 seconds.
Processing time:   		183.722 seconds.

Input:                  	29570310 reads 		3876898301 bases.
QTrimmed:               	8761068 reads (29.63%) 	37713560 bases (0.97%)
KTrimmed:               	430993 reads (1.46%) 	13990985 bases (0.36%)
Trimmed by overlap:     	247652 reads (0.84%) 	5692612 bases (0.15%)
Result:                 	29462726 reads (99.64%) 	3819501144 bases (98.52%)

Time:   			183.878 seconds.
Reads Processed:      29570k 	160.81k reads/sec
Bases Processed:       3876m 	21.08m bases/sec

Here is a histogram of the trimmed reads:
[IMG][/IMG]

Here is the command and output of bbmerge:

Code:

$ bbmerge.sh in1=Boone_sterivex_R1_Q15_k29.fq in2=Boone_sterivex_R2_Q15_k29.fq out=Boone_sterivex_Q15_k29_merged.fq outu=Boone_sterivex_Q15_k29_unmerged.fq mininsert=20

HTML Code:

BBMerge version 8.9
Writing mergable reads merged.
Started output threads.
Total time: 148.209 seconds.

Pairs:               	14731363
Joined:              	9889176   	67.130%
Ambiguous:           	4530331   	30.753%
No Solution:         	311856   	2.117%
Too Short:           	0       	0.000%

Avg Insert:          	150.6
Standard Deviation:  	66.2
Mode:                	111

Insert range:        	20 - 293
90th percentile:     	247
75th percentile:     	202
50th percentile:     	145
25th percentile:     	96
10th percentile:     	65

Here is a read length histogram of all merged and unmerged reads together:
[IMG][/IMG]
Finally, here are just the length distribution of merged reads alone:

**Brian Bushnell** · 01-13-2016, 09:57 PM

Hmmm, this looks like the "raw" reads were already adapter-trimmed. The first histogram in that last post indicates millions of reads with lengths after trimming of 40bp-140bp. But the trimming results indicate only 1.5% of the bases were trimmed. These cannot both be true. What does the length histogram look like of the raw reads?

**A_sapidissima** · 01-14-2016, 07:29 AM

Hey Brian - Here is a histogram of the reads as downloaded from Basespace, with no processing outside of what may have been done by the basecalling software on the NextSeq:

[IMG][/IMG]

Data for the histogram:

HTML Code:

#Reads:	29570310
#Bases:	3876898301
#Max:	151
#Min:	35
#Avg:	131.1
#Median:	150
#Mode:	150
#Std_Dev:	32.5
#Read Length Histogram:
#Length	reads	pct_reads	cum_reads	cum_pct_reads	bases	pct_bases	cum_bases	cum_pct_bases
0	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
5	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
10	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
15	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
20	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
25	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
30	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
35	306855	1.038%	29570310	100.000%	11230931	0.290%	3876898301	100.000%
40	268760	0.909%	29263455	98.962%	11297484	0.291%	3865667370	99.710%
45	325316	1.100%	28994695	98.053%	15311599	0.395%	3854369886	99.419%
50	347713	1.176%	28669379	96.953%	18082799	0.466%	3839058287	99.024%
55	387908	1.312%	28321666	95.777%	22131776	0.571%	3820975488	98.558%
60	409809	1.386%	27933758	94.466%	25405243	0.655%	3798843712	97.987%
65	439474	1.486%	27523949	93.080%	29467832	0.760%	3773438469	97.331%
70	461149	1.560%	27084475	91.593%	33195498	0.856%	3743970637	96.571%
75	477018	1.613%	26623326	90.034%	36752397	0.948%	3710775139	95.715%
80	500909	1.694%	26146308	88.421%	41070670	1.059%	3674022742	94.767%
85	502759	1.700%	25645399	86.727%	43755986	1.129%	3632952072	93.708%
90	527245	1.783%	25142640	85.027%	48501710	1.251%	3589196086	92.579%
95	520416	1.760%	24615395	83.244%	50492252	1.302%	3540694376	91.328%
100	538683	1.822%	24094979	81.484%	54943605	1.417%	3490202124	90.026%
105	525914	1.779%	23556296	79.662%	56276407	1.452%	3435258519	88.608%
110	540789	1.829%	23030382	77.883%	60566193	1.562%	3378982112	87.157%
115	525258	1.776%	22489593	76.055%	61451710	1.585%	3318415919	85.595%
120	534775	1.808%	21964335	74.278%	65244167	1.683%	3256964209	84.010%
125	516625	1.747%	21429560	72.470%	65606483	1.692%	3191720042	82.327%
130	512322	1.733%	20912935	70.723%	67621417	1.744%	3126113559	80.634%
135	515765	1.744%	20400613	68.990%	70652744	1.822%	3058492142	78.890%
140	483569	1.635%	19884848	67.246%	68622099	1.770%	2987839398	77.068%
145	1633076	5.523%	19401279	65.611%	241998874	6.242%	2919217299	75.298%
150	17768203	60.088%	17768203	60.088%	2677218425	69.056%	2677218425	69.056%
155	0	0.000%	0	0.000%	0	0.000%	0	0.000%

So there are only 60% of reads that are 150 bp to start with from the beginning as output by the machine. I assumed that a range of read sizes would be normal, but perhaps this proportion of 150 bp reads is unusually poor. Given the large proportion of short reads, should I avoid trying to merge overlapping pairs as I have done already?

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Mapping reads to metagenomic contigs questions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News