
  • Please help me understand what went wrong

    Hello,

    So, I've been working with some assemblies. Recently, I submitted our paired-end reads to MG-RAST and discovered that only a small percentage (~5%) of them were merged when running FastqJoin with -m 8 -p 10.

    Now, I managed to get my hands on some actual QC data:



    Our metagenomic samples are from deep underground, so low DNA yields are not super surprising. However, I learned that our samples were amplified by MDA before they were sent, so

    Question 1: Perhaps the low amount of DNA is somewhat surprising after all?



    For some reason Macrogen ended up making a TruSeq library from one sample, and Nextera libraries from the other samples:



    To me, Sample 1 looks pretty good, so I'd expect that a large percentage of the reads could be merged later on. However, this was not the case. In fact, I think this sample had the smallest percentage of merged reads in MG-RAST.

    Question 2: Is there any rational explanation for this?






    As far as I can tell (I'm really more of a computer guy), all the other samples look rather awful.

    Question 3: Does it make sense that for these samples merging failed so hard? I mean, the insert sizes are clearly too large, yes?

    Question 4: Sample 3 had the highest concentration and amount of DNA in the beginning, then all of a sudden it became rather bad. Can the blame be assigned to Macrogen, or are there other possible explanations for this?


    So yeah, I'd really appreciate it if somebody with more experience could share some thoughts on this whole thing..
    savetherhino.org

  • #2
    Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Regardless, none of the inserts in your libraries appear to be small enough to overlap with a 2x150 run.

    Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.



    • #3
      Originally posted by mcnelson.phd
      Wow, lots of things going on here, but before I begin an in-depth answer I'll need to ask how the libraries were sequenced. If it was HiSeq, which I presume, then was it a 2x100bp run or a 2x150bp run? Regardless, none of the inserts in your libraries appear to be small enough to overlap with a 2x150 run.

      Overall I'd say all of the libraries except Sample 5 look good as far as the Bioanalyzer traces go. You have a valid argument for one library being TruSeq with the others being Nextera, but there's technically nothing wrong with doing either.
      You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of the people who were involved with the project have moved on. In MG-RAST, the post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.
      Last edited by rhinoceros; 08-16-2013, 09:09 AM.
      savetherhino.org



      • #4
        If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.
        Last edited by GenoMax; 08-16-2013, 09:30 AM.



        • #5
          Originally posted by GenoMax
          If this was done 2 years ago then I am not so sure that 2 x 150 bp were available at that time.

          Was Nextera around 2 years ago?
          By my recollection, Nextera didn't come out till about 1.5 years ago, and HiSeq Rapid Run mode didn't come out till about 1 year ago.



          • #6
            Originally posted by rhinoceros
            You'd think that such information would be readily available. Unfortunately, these samples were done ~2 years ago, and a lot of the people who were involved with the project have moved on. In MG-RAST, the post-QC mean sequence length is ca. 105 bp when non-overlapping reads are retained (and they make up the vast majority of reads), so I'm fairly certain that it was 2x150bp.
            Do you have the raw fastq files? If so, then that will tell you if this was HiSeq/MiSeq/GA-IIx. MiSeq reads begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq are @HWI, but here's a thread on here that lists the IDs for HiSeq and GA-IIx.

            Also, the raw files will easily tell you the read lengths.

            Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.
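
A minimal sketch of that check, assuming an uncompressed FASTQ with four lines per record. The function name `summarize_fastq` and the file path in the usage comment are hypothetical. It takes the text between `@` and the first `:` of each header as the instrument field and tallies read lengths. Header layouts differ between pipeline versions, so treat the instrument field as a hint rather than proof.

```python
from collections import Counter
from itertools import islice


def summarize_fastq(lines, max_reads=10000):
    """Tally instrument IDs and read lengths from FASTQ text lines.

    Assumes 4-line records (header, sequence, '+', qualities).
    """
    instruments = Counter()
    lengths = Counter()
    # Group the line iterator into consecutive 4-line records.
    records = islice(zip(*[iter(lines)] * 4), max_reads)
    for header, seq, _plus, _qual in records:
        # Instrument field: text between '@' and the first ':' of the header.
        instruments[header.strip().lstrip("@").split(":")[0]] += 1
        lengths[len(seq.strip())] += 1
    return instruments, lengths


# Usage (hypothetical path):
#   with open("sample1_R1.fastq") as fh:
#       instruments, lengths = summarize_fastq(fh)
#   print(instruments.most_common(3), sorted(lengths.items()))
```

If nearly all reads share one length, that is most likely the number of cycles per read; a spread of shorter lengths suggests the file has already been trimmed.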



            • #7
              Originally posted by GenoMax
              If this was done 2 years ago then I am not sure that 2 x 150 bp was possible at that time.
              Well the order date is July 20 2011. I think post QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, post QC mean sequence length of 2x100bp would be a lot shorter than 100bp, no?
              savetherhino.org



              • #8
                Originally posted by rhinoceros
                Well the order date is July 20 2011. I think post QC mean sequence length of ~105 bp means that it had to be 2x150bp? I mean, post QC mean sequence length of 2x100bp would be a lot shorter than 100bp, no?
                It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?

                Not necessarily. With good libraries/sequencing you may not lose a single base.



                • #9
                  Originally posted by mcnelson.phd
                  Do you have the raw fastq files? If so, then that will tell you if this was HiSeq/MiSeq/GA-IIx. MiSeq reads begin with @M##### where ##### is the serial ID of the instrument. I believe HiSeq are @HWI, but here's a thread on here that lists the IDs for HiSeq and GA-IIx.

                  Also, the raw files will easily tell you the read lengths.
                  That's good to know, thanks. Unfortunately I can't VPN from home to work (only with my work laptop and that's at work), so I have to get back to this on Monday.

                  Overall, I wouldn't expect these reads to overlap, so you'll have to drop that step from your analysis.
                  I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense..
                  Last edited by rhinoceros; 08-16-2013, 09:59 AM.
                  savetherhino.org



                  • #10
                    Originally posted by GenoMax
                    It sounds like you have the raw data (or is it in some processed form)? What about your own QC results?

                    Not necessarily. With good libraries/sequencing you may not lose a single base.
                    Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start..
                    savetherhino.org



                    • #11
                      Originally posted by rhinoceros
                      Yeah, I have the raw data (but not at hand). I never did any QC as that was done long before I started at the job. God, I hate it so much when I can't do all the work from the start..
                      It would be enlightening to see what you find with your own QC. Perhaps there will be some other issues you will discover. If those are accounted for then MG-RAST results may improve.

                      You will easily be able to tell the machine/number of cycles as long as you have the "original" data. If you can't, post a read ID here and we can tell you.



                      • #12
                        Originally posted by rhinoceros
                        I'm just wondering why they went for such a big fragment size at Macrogen, considering it was obvious that the reads wouldn't overlap. I understand this might be desirable with single genomes, but with metagenomic samples it doesn't make any sense..
                        For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x read length because you don't want the reads to overlap. It's only fairly recently that merging overlapping reads into longer, higher-quality single reads has been shown to be useful.

                        As for the Nextera libraries, you really have no control over the fragment size because the DNA is fragmented enzymatically by a transposase. That's why the fragment size distribution is much broader in the Bioanalyzer traces for the Nextera libraries than for the TruSeq library, which is fragmented mechanically (e.g. Covaris).

                        This would go a long way toward explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.
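
To put rough numbers on that, here is a small sketch of the overlap arithmetic. The function names and the example fragment sizes are made up for illustration, and `min_overlap=8` simply mirrors the `-m 8` setting mentioned in the first post. Fragment length here means the insert without adapters; library traces include the adapters, so sizes read off a Bioanalyzer trace would need to be reduced accordingly.

```python
def pair_overlap(fragment_len, read_len):
    """Bases shared by the two reads of a pair (negative means a gap)."""
    return 2 * read_len - fragment_len


def fraction_mergeable(fragment_lens, read_len, min_overlap=8):
    """Fraction of fragments short enough for their two reads to overlap."""
    ok = sum(1 for f in fragment_lens if pair_overlap(f, read_len) >= min_overlap)
    return ok / len(fragment_lens)


# Hypothetical fragment sizes (bp), not taken from the QC report.
fragments = [180, 250, 320, 400, 550]
for read_len in (100, 150):
    print(read_len, fraction_mergeable(fragments, read_len))
```

A 300bp fragment sequenced 2x100bp leaves a 100bp gap between the reads, so no merger can join them regardless of its settings.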

                        As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.



                        • #13
                          Originally posted by mcnelson.phd
                          For TruSeq libraries, as is noted in the Macrogen QC you posted, traditionally the fragment size is larger than 2x read length because you don't want the reads to overlap. It's only fairly recently that merging overlapping reads into longer, higher-quality single reads has been shown to be useful.

                          As for the Nextera libraries, you really have no control over the fragment size because the DNA is fragmented enzymatically by a transposase. That's why the fragment size distribution is much broader in the Bioanalyzer traces for the Nextera libraries than for the TruSeq library, which is fragmented mechanically (e.g. Covaris).

                          This would go a long way toward explaining why the TruSeq library had the lowest number of overlapping reads while the Nextera libraries had more. To put it simply, >98% of the TruSeq fragments being sequenced should be >300bp, so 2x100bp reads will rarely overlap. Conversely, because there's a greater chance of small fragments being generated and sequenced with Nextera, a higher proportion of the reads would be expected to overlap.

                          As I noted before, your best bet is to stop focusing on read merging, because it doesn't look like it's going to happen. Even though you're doing metagenomics, there have been a lot of papers that have used 2x100bp HiSeq data for assemblies where the reads do not overlap. I'd recommend taking a look at what they're doing instead of just relying on MG-RAST.
                          Thanks for this very insightful post. I'm blissfully ignorant on details concerning the sequencing part of NGS. Our assemblies were done with, I think, 4 different programs. Overall, Meta-IDBA ones turned out 'the best'. Now I'm thinking I should read up on assemblers to see if there are any that can make some use of the positional information of non-overlapping paired-end reads given min and max genomic distance between the pairs..
                          savetherhino.org
