SEQanswers

Old 08-16-2013, 09:22 AM   #1
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default Please help me understand what went wrong

Hello,

So, I've been working with some assemblies. Recently, I submitted our paired-end reads to MG-RAST and discovered that only a small percentage (~5%) of them were merged by FastqJoin (-m 8 -p 10).

Now, I managed to get my hands on some actual QC data:



Our metagenomic samples are from deep underground, so low DNA yields are not super surprising. However, I learned that prior to sending, our samples were amplified by MDA, so

Question 1: Perhaps the low amount of DNA is somewhat surprising?



For some reason Macrogen ended up making TruSeq library from one sample, and Nextera libraries from the other samples:



To me, Sample 1 looks pretty good, I'd expect that a large percentage of the reads could be merged later on. However, this was not the case. In fact, I think this sample had the smallest percentage of merged reads in MG-RAST.

Question 2: Is there any rational explanation for this?






As far as I can tell (I'm really more of a computer guy), all the other samples look rather awful.

Question 3: Does it make sense that for these samples merging failed so hard? I mean, the insert sizes are clearly too large, yes?

Question 4: Sample 3 had the highest concentration and amount of DNA in the beginning, then all of a sudden it became rather bad. Can the blame be assigned to Macrogen, or are there other possible explanations for this?


So yeah, I'd really appreciate it if somebody with more experience could share some thoughts on this whole thing..
Old 08-16-2013, 09:31 AM   #2
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Regardless, none of the inserts in your libraries appear to be small enough to overlap with a 2x150 run.

Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.
Old 08-16-2013, 10:05 AM   #3
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by mcnelson.phd View Post
Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Regardless, none of the inserts in your libraries appear to be small enough to overlap with a 2x150 run.

Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.
You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of people who were involved with the project have moved on. In MG-RAST, post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.

Last edited by rhinoceros; 08-16-2013 at 10:09 AM.
Old 08-16-2013, 10:26 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,087
Default

If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.

Last edited by GenoMax; 08-16-2013 at 10:30 AM.
Old 08-16-2013, 10:29 AM   #5
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by GenoMax View Post
If this was done 2 years ago then I am not so sure that 2 x 150 bp were available at that time.

Was nextera around 2 years ago?
By my recollection, Nextera didn't come out till about 1.5 years ago, and HiSeq Rapid Run mode didn't come out till about 1 year ago.
Old 08-16-2013, 10:34 AM   #6
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by rhinoceros View Post
You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of people who were involved with the project have moved on. In MG-RAST, post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.
Do you have the raw fastq files? If so, they will tell you whether this was HiSeq/MiSeq/GA-IIx. MiSeq read headers begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq headers begin with @HWI, but there's a thread on here that lists the IDs for HiSeq and GA-IIx.

Also, the raw files will easily tell you the read lengths.

Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.
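
A minimal sketch of the check described above, assuming standard four-line FASTQ records and Illumina-style headers in which the first colon-separated field after `@` is the instrument ID. The function name `summarize_fastq` and the `max_records` limit are illustrative and not part of any tool mentioned in this thread.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def summarize_fastq(path, max_records=10000):
    """Count instrument IDs and read lengths in the first max_records records."""
    instruments = Counter()
    lengths = Counter()
    with open_fastq(path) as handle:
        for _ in range(max_records):
            # Assumption: each record is exactly four lines
            # (header, sequence, '+', qualities), with no wrapped sequences.
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            header = record[0].rstrip("\n")
            sequence = record[1].rstrip("\n")
            # Assumption: the instrument ID is the first colon-separated
            # field after the leading '@'.
            instruments[header[1:].split(":")[0]] += 1
            lengths[len(sequence)] += 1
    return instruments, lengths
```

Calling `summarize_fastq("sample1_R1.fastq.gz")` (a hypothetical file name) returns two counters. The most common read length in untrimmed data should match the number of cycles per read, and the instrument IDs can be compared against the prefixes mentioned above.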
Old 08-16-2013, 10:40 AM   #7
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by GenoMax View Post
If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.
Well the order date is July 20 2011. I think post QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, post QC mean sequence length of 2x100bp would be a lot shorter than 100bp, no?
Old 08-16-2013, 10:44 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,087
Default

Quote:
Originally Posted by rhinoceros View Post
Well the order date is July 20 2011. I think post QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, post QC mean sequence length of 2x100bp would be a lot shorter than 100bp, no?
It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?

Not necessarily. With good libraries/sequencing you may not lose a single base.
Old 08-16-2013, 10:46 AM   #9
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by mcnelson.phd View Post
Do you have the raw fastq files? If so, they will tell you whether this was HiSeq/MiSeq/GA-IIx. MiSeq read headers begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq headers begin with @HWI, but there's a thread on here that lists the IDs for HiSeq and GA-IIx.

Also, the raw files will easily tell you the read lengths.
That's good to know, thanks. Unfortunately I can't VPN from home to work (only with my work laptop and that's at work), so I have to get back to this on Monday.

Quote:
Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.
I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense.

Last edited by rhinoceros; 08-16-2013 at 10:59 AM.
Old 08-16-2013, 10:58 AM   #10
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by GenoMax View Post
It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?

Not necessarily. With good libraries/sequencing you may not lose a single base.
Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start..
Old 08-16-2013, 11:05 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,087
Default

Quote:
Originally Posted by rhinoceros View Post
Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start..
It would be enlightening to see what you find with your own QC. Perhaps there will be some other issues you will discover. If those are accounted for then MG-RAST results may improve.

You will easily be able to tell the machine and number of cycles as long as you have the "original" data. Otherwise, post a read ID here and we can tell.
Old 08-16-2013, 11:20 AM   #12
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by rhinoceros View Post
I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense.
For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x read length because you don't want the reads to overlap. It's only fairly recently that overlapping the reads into longer single reads with higher quality has been shown to be useful.

As for the Nextera libraries, you really have no control over the fragment size because it's a transposon that fragments the DNA. That's why the distribution of fragment sizes is much wider in the Bioanalyzer traces for the Nextera libraries compared to the TruSeq library, which is fragmented mechanically (e.g. Covaris).

This would go a ways to explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.

As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.
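
The overlap argument above comes down to simple arithmetic, sketched here under the assumption that the fragment length means the sequenced insert without adapters (Bioanalyzer traces of finished libraries include the adapters, so the inserts are somewhat shorter than the trace suggests). The function names are illustrative, and the default `min_overlap=8` mirrors the `-m 8` setting from the first post on the assumption that it sets the minimum overlap in bases.

```python
def expected_overlap(read_length, fragment_length):
    """Bases shared by the two reads of a pair.

    A negative value is the size of the unsequenced gap between the reads.
    The overlap can never exceed the fragment itself, hence the min().
    """
    return min(fragment_length, 2 * read_length - fragment_length)


def can_merge(read_length, fragment_length, min_overlap=8):
    """True if the pair overlaps by at least min_overlap bases."""
    return expected_overlap(read_length, fragment_length) >= min_overlap


if __name__ == "__main__":
    # Illustrative fragment sizes only, not values taken from the QC report.
    for fragment_length in (150, 250, 300, 500):
        for read_length in (100, 150):
            print(
                fragment_length,
                read_length,
                expected_overlap(read_length, fragment_length),
                can_merge(read_length, fragment_length),
            )
```

With 2x100bp reads, any fragment longer than 192 bp falls below an 8-base overlap, which is consistent with very few pairs merging when nearly all fragments are >300bp.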
Old 08-16-2013, 11:33 AM   #13
rhinoceros
Senior Member
 
Location: sub-surface moon base

Join Date: Apr 2013
Posts: 372
Default

Quote:
Originally Posted by mcnelson.phd View Post
For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x read length because you don't want the reads to overlap. It's only fairly recently that overlapping the reads into longer single reads with higher quality has been shown to be useful.

As for the Nextera libraries, you really have no control over the fragment size because it's a transposon that fragments the DNA. That's why the distribution of fragment sizes is much wider in the Bioanalyzer traces for the Nextera libraries compared to the TruSeq library, which is fragmented mechanically (e.g. Covaris).

This would go a ways to explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.

As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.
Thanks for this very insightful post. I'm blissfully ignorant of the details of the sequencing part of NGS. Our assemblies were done with, I think, 4 different programs. Overall, the Meta-IDBA ones turned out 'the best'. Now I'm thinking I should read up on assemblers to see if there are any that can make use of the positional information of non-overlapping paired-end reads, given the min and max genomic distance between the pairs.