**mcnelson.phd** · 08-16-2013, 08:31 AM

Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Either way, none of the inserts in your libraries appears to be small enough for the reads to overlap, even with a 2x150 run.

Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.

**rhinoceros** · 08-16-2013, 09:05 AM

> Originally posted by mcnelson.phd:
>
> Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Either way, none of the inserts in your libraries appears to be small enough for the reads to overlap, even with a 2x150 run.
>
> Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.

You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of the people who were involved with the project have moved on. In MG-RAST, the post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.

**GenoMax** · 08-16-2013, 09:26 AM

If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.

**mcnelson.phd** · 08-16-2013, 09:29 AM

> Originally posted by GenoMax:
>
> If this was done 2 years ago then I am not so sure that 2 x 150 bp were available at that time.

Was Nextera around 2 years ago?

By my recollection, Nextera didn't come out till about 1.5 years ago, and HiSeq Rapid Run mode didn't come out till about 1 year ago.

**mcnelson.phd** · 08-16-2013, 09:34 AM

> Originally posted by rhinoceros:
>
> You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of the people who were involved with the project have moved on. In MG-RAST, the post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.

Do you have the raw FASTQ files? If so, the read headers will tell you whether this was HiSeq, MiSeq, or GA-IIx. MiSeq reads begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq reads begin with @HWI, but there's a thread on here that lists the IDs for HiSeq and GA-IIx.

Also, the raw files will easily tell you the read lengths.

Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.
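For anyone following along, here is a rough Python sketch of that check. It assumes plain four-line FASTQ records and that the instrument ID is the first colon-separated field of the header; the function names, the file name, and the demo headers are made up for illustration, and the platform guess only encodes the two prefixes mentioned above.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, gzip-compressed or not."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def instrument_id(header):
    """First colon-separated field of a FASTQ header, without the leading '@'."""
    return header.lstrip("@").split()[0].split(":")[0]


def guess_platform(instrument):
    """Very rough guess based only on the prefixes mentioned in this thread."""
    if instrument.startswith("HWI"):
        return "HiSeq or GA-IIx (HWI prefix)"
    if instrument.startswith("M") and instrument[1:].isdigit():
        return "MiSeq"
    return "unknown"


def summarize(lines, max_reads=100_000):
    """Count instrument IDs and read lengths over the first max_reads records."""
    instruments, lengths = Counter(), Counter()
    for i, line in enumerate(islice(lines, 4 * max_reads)):
        line = line.rstrip("\r\n")
        if i % 4 == 0:      # header line
            instruments[instrument_id(line)] += 1
        elif i % 4 == 1:    # sequence line
            lengths[len(line)] += 1
    return instruments, lengths


# Hypothetical records, only to show the shape of the input.
demo = [
    "@M01234:7:000000000-A1B2C:1:1101:15589:1332 1:N:0:1", "A" * 150, "+", "I" * 150,
    "@HWI-ST1234:100:C0ABCACXX:1:1101:1234:2000 1:N:0:1", "C" * 100, "+", "I" * 100,
]
instruments, lengths = summarize(demo)
for name in instruments:
    print(name, "->", guess_platform(name))
print("read lengths:", dict(lengths))

# On real data (file name is a placeholder):
# with open_fastq("reads_R1.fastq.gz") as fh:
#     print(summarize(fh))
```

If the length counter shows a single value, that is almost certainly the number of cycles; a spread of lengths suggests the file has already been trimmed.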

**rhinoceros** · 08-16-2013, 09:40 AM

> Originally posted by GenoMax:
>
> If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.

Well, the order date is July 20, 2011. I think a post-QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, the post-QC mean sequence length of a 2x100bp run would be a lot shorter than 100bp, no?

**GenoMax** · 08-16-2013, 09:44 AM

> Originally posted by rhinoceros:
>
> Well, the order date is July 20, 2011. I think a post-QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, the post-QC mean sequence length of a 2x100bp run would be a lot shorter than 100bp, no?

It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?

Not necessarily. With good libraries/sequencing you may not lose a single base.
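To spell out the bound being argued here: trimming only removes bases, so the post-QC mean can equal the raw read length but cannot exceed it. A tiny sketch of that check, assuming the reported mean comes from trimmed, unmerged reads (the function name and the candidate lengths are just for illustration):

```python
def consistent_read_lengths(post_qc_mean, candidates=(100, 150)):
    """Return the candidate raw read lengths that could yield this post-QC mean.

    Quality trimming can only shorten reads, so mean length after QC
    must be <= the raw read length.
    """
    return [n for n in candidates if post_qc_mean <= n]


print(consistent_read_lengths(105))  # [150]
print(consistent_read_lengths(98))   # [100, 150]
```

So a mean of ~105 bp rules out 2x100bp under that assumption, while a mean just under 100 bp would not distinguish the two.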

**rhinoceros** · 08-16-2013, 09:46 AM

> Originally posted by mcnelson.phd:
>
> Do you have the raw FASTQ files? If so, the read headers will tell you whether this was HiSeq, MiSeq, or GA-IIx. MiSeq reads begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq reads begin with @HWI, but there's a thread on here that lists the IDs for HiSeq and GA-IIx.
>
> Also, the raw files will easily tell you the read lengths.

That's good to know, thanks. Unfortunately I can't VPN from home to work (only with my work laptop and that's at work), so I have to get back to this on Monday.

> Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.

I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense.

**rhinoceros** · 08-16-2013, 09:58 AM

> Originally posted by GenoMax:
>
> It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?
>
> Not necessarily. With good libraries/sequencing you may not lose a single base.

Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start.

**GenoMax** · 08-16-2013, 10:05 AM

> Originally posted by rhinoceros:
>
> Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start.

It would be enlightening to see what you find with your own QC. Perhaps you will discover some other issues. If those are accounted for, the MG-RAST results may improve.

You will easily be able to tell the machine and the number of cycles as long as you have the "original" data. Otherwise, post a read ID here and we can tell you.

**mcnelson.phd** · 08-16-2013, 10:20 AM

> Originally posted by rhinoceros:
>
> I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense.

For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x the read length because you don't want the reads to overlap. It's only fairly recently that overlapping the reads into longer single reads with higher quality has been shown to be useful.

As for the Nextera libraries, you really have no control over the fragment size because it's a transposon that fragments the DNA. That's why the distribution of fragment sizes is much broader in the Bioanalyzer traces for the Nextera libraries compared to the TruSeq library, which is fragmented mechanically (e.g. Covaris).

This would go some way toward explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.

As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.
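The arithmetic behind the overlap argument can be sketched in a few lines of Python. This assumes `insert_size` is the fragment length without adapters and uses a 10 bp minimum overlap as a stand-in for whatever threshold a read merger would actually apply; both function names are made up for illustration.

```python
def expected_overlap(insert_size, read_length):
    """Bases covered by both mates of a pair; zero or negative means a gap.

    The overlap cannot exceed the fragment itself, hence the min().
    """
    return min(2 * read_length - insert_size, insert_size)


def can_merge(insert_size, read_length, min_overlap=10):
    """True if the two mates share at least min_overlap bases (assumed threshold)."""
    return expected_overlap(insert_size, read_length) >= min_overlap


for insert in (150, 250, 300, 500):
    print(insert, expected_overlap(insert, 100), can_merge(insert, 100))
# 150 50 True
# 250 -50 False
# 300 -100 False
# 500 -300 False
```

With 2x100bp reads, only fragments shorter than about 190 bp clear that threshold, which is why a library whose fragments are nearly all >300bp yields almost no mergeable pairs.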

**rhinoceros** · 08-16-2013, 10:33 AM

> Originally posted by mcnelson.phd:
>
> For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x the read length because you don't want the reads to overlap. It's only fairly recently that overlapping the reads into longer single reads with higher quality has been shown to be useful.
>
> As for the Nextera libraries, you really have no control over the fragment size because it's a transposon that fragments the DNA. That's why the distribution of fragment sizes is much broader in the Bioanalyzer traces for the Nextera libraries compared to the TruSeq library, which is fragmented mechanically (e.g. Covaris).
>
> This would go some way toward explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.
>
> As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.

Thanks for this very insightful post. I'm blissfully ignorant of the details concerning the sequencing part of NGS. Our assemblies were done with, I think, 4 different programs. Overall, the Meta-IDBA ones turned out 'the best'. Now I'm thinking I should read up on assemblers to see if there are any that can make some use of the positional information of non-overlapping paired-end reads, given the min and max genomic distance between the pairs.


Please help me understand what went wrong
