SEQanswers

Old 01-04-2016, 10:57 AM   #1
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default Mapping reads to metagenomic contigs questions

Hi all - I am very new to metagenomics analyses, so please bear with me. I have a metagenome generated from 2 L of Sterivex-filtered stream water, run on a NextSeq 500 using Nextera XT library prep, dual-indexed and run alongside other samples. I got 29,570,310 raw paired-end reads (2 × 150 bp), which I know is not necessarily a lot for a metagenome. In the code snippets below I did not use the real file names, since I have four sets of merged reads and four sets of unmerged reads (2 files for each of four lanes) with long names that I need to shorten, but the commands worked with the real file names. For read processing I used BBDuk and BBMerge as follows:

Adapter and Quality trimming:
Code:
>bbduk.sh in1=read1.fq in2=read2.fq out1=name_R1_Q15_k29.fq out2=name_R2_Q15_k29.fq ref=/usr/local/bbmap/resources/nextera.fa qtrim=r trimq=15 k=29 mink=11 hdist=1 ktrim=r tpe tbo
Merging overlapping pairs:
Code:
>bbmerge.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq out=name_Q15_K29_merged.fq outu=name_Q15_k29_unmerged.fq mininsert=20
I have taken these quality-trimmed reads and submitted both the merged and unmerged reads to MG-RAST. I also want to do a de novo assembly, which is largely where my questions arise. I used Megahit as follows:
Code:
>./megahit --12 name_Q15_k29_unmerged.fq -r name_Q15_k29_merged.fq --k-max=127 -o output_directory
I ended up with:
- 40,747 contigs totalling 19663614 bp
- min contig size of 200, max size of 7816 bp, avg 483 bp
- N50 = 458 bp

From what I have read, that N50 is low and likely reflects insufficient coverage of the sampled community. I should be shooting for an N50 >= 1,000 bp, correct? At this point additional sequencing is not an option, so I am trying to work with what I have.
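For reference, N50 can be checked directly from the contig FASTA. This is a minimal sketch using a toy file created inline (the file name and contigs are made up for illustration; swap in the real final.contigs.fa):

```shell
# Toy FASTA with contigs of length 10, 8, 6, 4, 2 bp (hypothetical demo data).
cat > demo_contigs.fa <<'EOF'
>c1
AAAAAAAAAA
>c2
AAAAAAAA
>c3
AAAAAA
>c4
AAAA
>c5
AA
EOF
# Extract per-contig lengths, sort descending, then report the length at which
# the cumulative sum first reaches half of the total assembly size (the N50).
awk '/^>/{if(len)print len; len=0; next}{len+=length($0)}END{if(len)print len}' demo_contigs.fa \
  | sort -rn \
  | awk '{lens[NR]=$1; total+=$1}
         END{half=total/2; sum=0
             for(i=1;i<=NR;i++){sum+=lens[i]; if(sum>=half){print "N50 = " lens[i]; exit}}}'
# Total = 30 bp, half = 15; cumulative sums 10, 18 -> N50 = 8
```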

Now that I have contigs from a de novo assembly, there should be a large proportion of reads not incorporated into those contigs that are potentially still valuable for taxonomic and functional annotation. The Megahit assembly does not give me the subset of reads that were not incorporated into the contigs, so I followed the tutorial at the following URL for recovering unassembled reads:
https://github.com/voutcn/megahit/wi...-real-assembly

I mapped the reads as follows:
Code:
>bbwrap.sh ref=name_sterivex_assembly/final.contigs.fa in=name_Q15_k29_unmerged.fq,name_Q15_k29_merged.fq out=aln.sam.gz kfilter=22 subfilter=15 maxindel=80
Then, I output coverage per contig using pileup.sh from the bbmap suite as follows:
Code:
>pileup.sh in=aln.sam.gz out=cov.txt

The following is the first few lines of the coverage file:
Code:
aaron@aaron-VirtualBox:~/PineCreek/Boone_sterivex_2014-30975030$ head cov.txt
#ID	Avg_fold	Length	Ref_GC	Covered_percent	Covered_bases	Plus_reads	Minus_reads	Read_GC
k127_4 flag=1 multi=2.0000 len=391	0.0000	391	0.0000	0.0000	0	0	0	0.0000
k127_5 flag=1 multi=2.0000 len=414	0.0000	414	0.0000	0.0000	0	0	0	0.0000
k127_6 flag=1 multi=1.0000 len=440	0.3432	440	0.0000	34.3182	151	0	1	0.6424
Taking the first contig, k127_4, as an example, the listed coverage (Covered_percent) is 0. However, at least some reads had to be joined to make that contig - shouldn't each contig have coverage greater than zero? I have output a list of unmapped reads as follows (admittedly, at this point I am just following the Megahit tutorial and am still learning what the Samtools options mean):

Code:
>samtools view -u -f4 aln.sam.gz | samtools bam2fq -s unmapped.se.fq - > unmapped.pe.fq
How can I be sure these unmapped reads are not part of the assembled contigs when the coverage listed for many of the contigs is zero? My end goal here is a list of contigs, and a list of unmapped reads that can both be Blastx'd using DIAMOND and visualized in MEGAN. If this is the wrong approach to attain this goal, I would appreciate any guidance or comments. Thanks for your time -
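As an aside, pulling the IDs of zero-coverage contigs out of a pileup.sh table is a one-liner. The sketch below runs against a toy cov.txt built inline (the real file is tab-delimited, with Covered_percent in column 5 and spaces inside the ID field):

```shell
# Toy pileup.sh-style table (tab-delimited); IDs contain spaces, so split on tabs only.
printf '#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\tCovered_bases\tPlus_reads\tMinus_reads\tRead_GC\n' > demo_cov.txt
printf 'k127_4 flag=1 multi=2.0000 len=391\t0.0000\t391\t0.0000\t0.0000\t0\t0\t0\t0.0000\n' >> demo_cov.txt
printf 'k127_6 flag=1 multi=1.0000 len=440\t0.3432\t440\t0.0000\t34.3182\t151\t0\t1\t0.6424\n' >> demo_cov.txt
# Keep IDs where Covered_percent (column 5) is zero, skipping the header line.
awk -F'\t' '!/^#/ && $5==0 {print $1}' demo_cov.txt
# -> k127_4 flag=1 multi=2.0000 len=391
```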
Old 01-04-2016, 10:52 PM   #2
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

In my experience, IDBA-UD assembles metagenomes way better than Megahit. Can't comment on your QC procedures. For Illumina QC, I've used Trim Galore exclusively. As to odd coverage results, perhaps you should try to map with e.g. BWA and see if the results differ?
__________________
savetherhino.org
Old 01-05-2016, 09:44 AM   #3
trexbob
Member
 
Location: Atlanta, GA

Join Date: Feb 2013
Posts: 17
Default

You might try normalizing your data first, using BBNorm - it helped me tremendously. Like rhinoceros above, I've found MEGAHIT not as good as some others; I suggest you try OMEGA. Also, some assemblers that can use native paired reads (OMEGA) do not like reads that have already been merged - I can't explain why, but it will do better if you feed it the paired fastq files rather than the already-merged pairs. Maybe Brian Bushnell can answer why.
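For what it's worth, a minimal BBNorm invocation might look like the sketch below. The file names and the target depth are placeholders, not from this thread; check bbnorm.sh's built-in help for the flags available in your version:

```shell
# Hypothetical sketch: normalize to ~40x depth and discard reads below 2x
# (in/out names and parameter values are illustrative placeholders).
bbnorm.sh in1=name_R1_Q15_k29.fq in2=name_R2_Q15_k29.fq \
    out1=name_R1_norm.fq out2=name_R2_norm.fq target=40 min=2
```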
Old 01-05-2016, 12:57 PM   #4
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default

Thanks for the replies.

Rhinoceros - I do have IDBA-UD installed and played around with it a little with smaller datasets, but have not tried to assemble this dataset with it yet. Comparing output from different assemblers is something I definitely need to try.

Trexbob - I have read a little about read normalization, but have not tried it yet. That should be relatively easy to do using tools from the bbmap suite. I think I saw a thread specifically about normalization for metagenomes on this forum somewhere...It's interesting that assembly without merging overlapping pairs can have a significant impact on the resulting assembly - I will try both ways and see how the results differ.

I downloaded BWA today, and am going to compare those results to the ones I get using bbmap. I'll report back with my findings. Thanks for the suggestions so far.
Old 01-05-2016, 06:49 PM   #5
vout
Junior Member
 
Location: Hong Kong

Join Date: Jun 2015
Posts: 4
Default

sapidissima -

I think it is true that the coverage is insufficient to obtain a good assembly. I suggest using "--min-count 1" and "--k-step 10" with Megahit to rescue more low-coverage regions. You may try the same tricks when trying out IDBA-UD.

For the 0-coverage contigs, I suspect it is because you are using relatively long reads (could be 200+ bp after merging?), and it is possible that a DBG assembler makes use of only some k-mers of a read. BBMap outputs global alignments by default, and running it with "local=t" is probably better in your case.

rhinoceros -

Megahit is actually heavily influenced by IDBA-UD. As the developer of Megahit, my experience is that the current version of Megahit (v1.0+) should match the quality of IDBA-UD. Generally speaking, Megahit's contigs are a bit shorter than IDBA-UD's but contain fewer misassemblies. The clear advantage of Megahit is its time- and memory-efficiency.

We are still actively maintaining and developing Megahit, so feedback and suggestions are most welcome.

Last edited by vout; 01-05-2016 at 06:54 PM.
Old 01-05-2016, 09:09 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by trexbob View Post
Also, some assemblers that can use native paired reads (OMEGA) do not like reads that have already been merged - I can't explain why, but it will do better if you feed it the paired fastq files rather than the already-merged pairs. Maybe Brian Bushnell can answer why.
This depends on three things:

1) The assembler; some (like Ray and, from what I hear, Mira) work well with merged reads, others (like Allpaths) do their own merging internally, and still others (like SOAP) work better with unmerged pairs. BBMerge is currently part of the recommended OMEGA workflow, though; in their testing, it substantially improves the assemblies. That team also wrote an alternative merger for non-overlapping reads.

2) The merging program (and stringency setting). The latest version of BBMerge is the most accurate read merger, with the lowest false-positive rate of all overlap-based mergers. But earlier versions, particularly before mid-2015 or so, were less accurate.

3) The insert-size distribution of the reads. If they mostly don't overlap, then the fraction of false-positive merges increases. So, don't merge reads if the merge rate is low (under 20% or so).
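The merge rate itself is just joined pairs over total pairs, so it is easy to check from a BBMerge summary. A toy calculation (the counts here are illustrative, not from any particular run):

```shell
# Toy example: 9,889,176 of 14,731,363 pairs joined -> well above a ~20% threshold.
joined=9889176
pairs=14731363
awk -v j="$joined" -v p="$pairs" 'BEGIN{printf "merge rate = %.1f%%\n", 100*j/p}'
# -> merge rate = 67.1%
```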

I wish I had time, but I've never gotten around to an extensive benchmark of multiple assemblers with merged versus unmerged reads, on multiple datasets. So the results of merging for different assemblers are just things I have noticed and/or been told by others, but do not have enough data points to be conclusive. With Spades, for example, I've had datasets that did better with merging (particularly low-coverage ones), and others that did worse. I do not know whether Megahit does better or worse as a result of merging; most of my testing has focused on single-cells or isolates, while Megahit is used in our metagenome assembly pipeline that I don't directly interact with. Perhaps vout can give a recommendation?

Last edited by Brian Bushnell; 01-05-2016 at 09:13 PM.
Old 01-06-2016, 04:43 AM   #7
trexbob
Member
 
Location: Atlanta, GA

Join Date: Feb 2013
Posts: 17
Default

Thanks Brian, I appreciate the response! It's been my experience that merging reads has not improved the datasets with OMEGA, but has helped with Megahit (our reads have been merging 70-90%).

More generally, it seems that getting a good assembly from a metagenome is not an easy task. I've been running multiple assemblers with multiple data sets, mostly low diversity, some highly-mixed eukaryote/prokaryote (according to metaphlan). All the pre-processing steps (quality trimming, merging, normalization, binning, etc), tweaking the parameters of each assembler, as well as tweaking what you feed into them, produce vastly different results. Each data set seems to have its own 'best' parameters, and even its 'best' assembler, unfortunately for us researchers.
Old 01-11-2016, 06:01 AM   #8
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default

I tried replying last Friday and got a message that my post would appear after moderator approval. I don't think it ever made it that far, so I am posting a reply again...

Vout, thanks for the suggestions to change the Megahit input parameters to "--min-count 1" and "--k-step 10"; I will try those. You asked about my read lengths, so I figure it would help to provide some additional information about my data. Here is a picture of the Nextera library on the Bioanalyzer after index PCR; it is somewhat under-tagmented:
[image not shown]


The following is the raw read length distribution (sorry for the ugly Excel graphs):
[image not shown]

These are the reads after adapter removal and trimming using Q15 in bbduk:
[image not shown]

Finally, here are the read lengths after merging overlapping pairs:
[image not shown]

I am surprised at how many reads were merged, given that the proportion of reads with an insert size less than 300 bp looks very small. However, it could be that a library like this does not cluster large fragments efficiently, so I still ended up with a lot of mergeable pairs - or something is wrong with how I did my merging.

For mapping, I discovered that when using interleaved fasta files with bbwrap, only the unmerged reads were used in the mapping. I re-did the mapping by running the mapping step separately for the unmerged and merged reads, then merging the SAM files. Doing this I ended up with:
- 360,392 mapped reads
- 19,278,578 unmapped reads
- avg coverage = 2.8
- percent scaffolds with any coverage = 99.75
- percent reference bases covered = 97.79
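As a rough sanity check, those numbers hang together: average fold coverage should be about (mapped reads × mean mapped read length) / assembly size. Assuming ~150 bp per mapped read (a slight underestimate, since merged reads are longer):

```shell
# ~360k mapped reads x ~150 bp over the ~19.66 Mbp assembly reported above:
awk 'BEGIN{printf "expected avg fold ~ %.1f\n", 360392*150/19663614}'
# -> expected avg fold ~ 2.7 (close to the reported 2.8)
```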

I had 100 contigs with zero coverage out of 40,747 total contigs; ~99% of these zero-coverage contigs were 250 bp or less in length. This seems like a much more reasonable result, and I often see that people elect not to even report contigs of these lengths in final assemblies. Do these mapping results seem reasonable, then?
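For anyone hitting the same interleaving issue, the workaround described above can be sketched roughly as follows. File names are hypothetical, the bbwrap flags mirror the earlier command, and the SAM-to-BAM handling may need adjusting for your samtools version:

```shell
# Map unmerged pairs and merged reads in two separate runs (placeholder names),
# then combine the alignments before computing per-contig coverage.
bbwrap.sh ref=final.contigs.fa in=name_Q15_k29_unmerged.fq out=aln_unmerged.sam.gz \
    kfilter=22 subfilter=15 maxindel=80
bbwrap.sh ref=final.contigs.fa in=name_Q15_k29_merged.fq out=aln_merged.sam.gz \
    kfilter=22 subfilter=15 maxindel=80
# Sort each alignment file, merge the two BAMs, then recompute coverage.
samtools sort -o aln_unmerged.bam aln_unmerged.sam.gz
samtools sort -o aln_merged.bam aln_merged.sam.gz
samtools merge aln_all.bam aln_unmerged.bam aln_merged.bam
pileup.sh in=aln_all.bam out=cov.txt
```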
Old 01-11-2016, 06:07 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by A_sapidissima View Post
Doing this I ended up with:
- 360,392 mapped reads
- 19,278,578 unmapped reads
- avg coverage = 2.8
Sounds like you need to sequence more - a lot more.

Quote:
I had 100 contigs with zero coverage out of 40,747 total contigs. ~99% of these zero coverage contigs were 250 bp or less in length. This seems like a much more reasonable result, and I often see that people elect to not even report contigs of these lengths in final assemblies. Do these mapping results seem reasonable then?
Yes, they do.
Old 01-12-2016, 08:42 AM   #10
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default

Hi Brian - thanks for looking at my results. Yeah, I guess I'll be running the NextSeq some more over the next few weeks to generate additional data for these libraries. It's pretty amazing to me how diverse crystal-clear mountain stream water can be.
Old 01-12-2016, 08:45 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

You may want to think HiSeq for a "lot more".

It actually may be more cost effective to get this sequenced elsewhere after you calculate the number of runs required on NextSeq.
Old 01-12-2016, 05:40 PM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by A_sapidissima View Post
I am surprised at how many reads were merged given that the proportion of reads with an insert less than 300 bp is very small. However, it could be that a library like that does not cluster large fragments efficiently, so I still ended up with a lot of mergeable pairs, or something is wrong with how I did my merging.
I forgot to comment on this. It looks like the merging was done correctly. However, regardless of your sonication settings / Bioanalyzer results / etc., you won't know the true insert-size distribution until you get the sequenced pairs and merge (or map) them. For one thing, short sequences can greatly out-compete long sequences during cluster formation. If you sequence more, be sure to do more robust size selection; with the current library, you are wasting quite a lot of data on short inserts.

Also, it's hard to say because the axes are different, but your adapter-trimming results look strange... since merging results in ~200k reads in the 100bp insert size bin, adapter-trimming should also result in ~200k (or ~400k if you are counting the two reads of pairs independently) reads with that length - but your graph indicates perhaps under 50k. Which is odd. Can you post the screen output of the BBDuk run?
Old 01-13-2016, 07:12 AM   #13
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default

Hey GenoMax - yeah, we own the NextSeq so it is convenient, but farming it out to a higher-throughput sequencer may be more cost-effective for our study. This project is a comparative analysis among several samples, so we are trying to balance the amount of data per sample against the number of samples. For a rigorous comparative analysis, we are coming to the realization that sequencing just isn't going to be cheap, and more data is needed.

Thanks for taking a closer look at my results, Brian. I may have gotten confused and generated a histogram from one lane of data instead of all four lanes combined for bbduk in my post above. I'll start from scratch here and post the results of each QC step using all of the data. First I concatenated the files for each of the four lanes into forward and reverse .fq files, and ran the bbduk.sh script as follows:
Code:
$ bbduk.sh in1=Boone_sterivex_R1_raw.fq in2=Boone_sterivex_R2_raw.fq out1=Boone_sterivex_R1_Q15_k29.fq out2=Boone_sterivex_R2_Q15_k29.fq ref=/usr/local/bbmap/resources/nextera.fa qtrim=r trimq=15 k=29 mink=11 hdist=1 ktrim=r tpe tbo
I had thought that trimming just the right side of the reads was okay, since quality tends to be fine at the start of the reads, but maybe I should change that option to 'qtrim=rl'. Also, I am only doing the adapter trimming on the right side and not the left, i.e. 'ktrim=r'. Here is the output from this step:
HTML Code:
BBDuk version 35.82
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=8958m, free=8818m, used=140m

Added 143423 kmers; time: 	0.125 seconds.
Memory: max=8958m, free=8491m, used=467m

Input is being processed as paired
Started output streams:	0.017 seconds.
Processing time:   		183.722 seconds.

Input:                  	29570310 reads 		3876898301 bases.
QTrimmed:               	8761068 reads (29.63%) 	37713560 bases (0.97%)
KTrimmed:               	430993 reads (1.46%) 	13990985 bases (0.36%)
Trimmed by overlap:     	247652 reads (0.84%) 	5692612 bases (0.15%)
Result:                 	29462726 reads (99.64%) 	3819501144 bases (98.52%)

Time:   			183.878 seconds.
Reads Processed:      29570k 	160.81k reads/sec
Bases Processed:       3876m 	21.08m bases/sec
Here is a histogram of the trimmed reads:
[image not shown]


Here is the command and output of bbmerge:
Code:
$ bbmerge.sh in1=Boone_sterivex_R1_Q15_k29.fq in2=Boone_sterivex_R2_Q15_k29.fq out=Boone_sterivex_Q15_k29_merged.fq outu=Boone_sterivex_Q15_k29_unmerged.fq mininsert=20
HTML Code:
BBMerge version 8.9
Writing mergable reads merged.
Started output threads.
Total time: 148.209 seconds.

Pairs:               	14731363
Joined:              	9889176   	67.130%
Ambiguous:           	4530331   	30.753%
No Solution:         	311856   	2.117%
Too Short:           	0       	0.000%

Avg Insert:          	150.6
Standard Deviation:  	66.2
Mode:                	111

Insert range:        	20 - 293
90th percentile:     	247
75th percentile:     	202
50th percentile:     	145
25th percentile:     	96
10th percentile:     	65
Here is a read length histogram of all merged and unmerged reads together:
[image not shown]
Finally, here is the length distribution of the merged reads alone:
Old 01-13-2016, 08:57 PM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hmmm, this looks like the "raw" reads were already adapter-trimmed. The first histogram in that last post indicates millions of reads with post-trimming lengths of 40-140 bp, but the trimming results indicate only ~1.5% of the bases were trimmed. These cannot both be true. What does the length histogram of the raw reads look like?
Old 01-14-2016, 06:29 AM   #15
A_sapidissima
Member
 
Location: West Virginia, Eastern Panhandle

Join Date: Apr 2014
Posts: 11
Default

Hey Brian - Here is a histogram of the reads as downloaded from Basespace, with no processing outside of what may have been done by the basecalling software on the NextSeq:

[image not shown]

Data for the histogram:
HTML Code:
#Reads:	29570310
#Bases:	3876898301
#Max:	151
#Min:	35
#Avg:	131.1
#Median:	150
#Mode:	150
#Std_Dev:	32.5
#Read Length Histogram:
#Length	reads	pct_reads	cum_reads	cum_pct_reads	bases	pct_bases	cum_bases	cum_pct_bases
0	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
5	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
10	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
15	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
20	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
25	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
30	0	0.000%	29570310	100.000%	0	0.000%	3876898301	100.000%
35	306855	1.038%	29570310	100.000%	11230931	0.290%	3876898301	100.000%
40	268760	0.909%	29263455	98.962%	11297484	0.291%	3865667370	99.710%
45	325316	1.100%	28994695	98.053%	15311599	0.395%	3854369886	99.419%
50	347713	1.176%	28669379	96.953%	18082799	0.466%	3839058287	99.024%
55	387908	1.312%	28321666	95.777%	22131776	0.571%	3820975488	98.558%
60	409809	1.386%	27933758	94.466%	25405243	0.655%	3798843712	97.987%
65	439474	1.486%	27523949	93.080%	29467832	0.760%	3773438469	97.331%
70	461149	1.560%	27084475	91.593%	33195498	0.856%	3743970637	96.571%
75	477018	1.613%	26623326	90.034%	36752397	0.948%	3710775139	95.715%
80	500909	1.694%	26146308	88.421%	41070670	1.059%	3674022742	94.767%
85	502759	1.700%	25645399	86.727%	43755986	1.129%	3632952072	93.708%
90	527245	1.783%	25142640	85.027%	48501710	1.251%	3589196086	92.579%
95	520416	1.760%	24615395	83.244%	50492252	1.302%	3540694376	91.328%
100	538683	1.822%	24094979	81.484%	54943605	1.417%	3490202124	90.026%
105	525914	1.779%	23556296	79.662%	56276407	1.452%	3435258519	88.608%
110	540789	1.829%	23030382	77.883%	60566193	1.562%	3378982112	87.157%
115	525258	1.776%	22489593	76.055%	61451710	1.585%	3318415919	85.595%
120	534775	1.808%	21964335	74.278%	65244167	1.683%	3256964209	84.010%
125	516625	1.747%	21429560	72.470%	65606483	1.692%	3191720042	82.327%
130	512322	1.733%	20912935	70.723%	67621417	1.744%	3126113559	80.634%
135	515765	1.744%	20400613	68.990%	70652744	1.822%	3058492142	78.890%
140	483569	1.635%	19884848	67.246%	68622099	1.770%	2987839398	77.068%
145	1633076	5.523%	19401279	65.611%	241998874	6.242%	2919217299	75.298%
150	17768203	60.088%	17768203	60.088%	2677218425	69.056%	2677218425	69.056%
155	0	0.000%	0	0.000%	0	0.000%	0	0.000%
So only 60% of the reads are 150 bp to begin with, as output by the machine. I assumed that a range of read sizes would be normal, but perhaps this proportion of 150 bp reads is unusually poor. Given the large proportion of short reads, should I avoid merging overlapping pairs as I have done?
Old 01-14-2016, 06:37 PM   #16
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Yep, your reads were already adapter-trimmed. Some versions of Illumina software will do this on the machine if configured to do so. I don't recommend that, though, as it's less accurate. Re-trimming does not really hurt, but it's always best to start with the raw data.

Merging is even more useful because of the high proportion of short reads - those are the ones that overlap 100%, and thus get fully error-corrected when merged because each base was called twice. So, you should certainly do it. The reads under 150bp had adapter sequence detected by Illumina's software, meaning they had insert sizes under 150bp.

Unlike most other platforms, Illumina reads all come out from the machine the same length (in this case 150bp) unless they are explicitly trimmed during a post-processing step.