
  • #16
    This sounds really cool!

    Now if the data is MiSeq and the 250bp reads overlap, should I perform merging first?

    Also, to what coverage do you suggest normalizing? Is 50x enough?

    What if my genome is heterozygous? One of the genomes I work with is tetraploid, and some kmers occur at 25% of the frequency of those from homozygous regions. How would normalization handle that?

    Thank you,
    Adrian



    • #17
      After merging, you will have both single reads (merged) and paired reads (the ones that couldn't merge). BBNorm cannot normalize single and paired reads simultaneously; it has to do them separately. You can, theoretically, just do an assembly with merged reads only, but that may not be a good idea because some reads won't merge. For example, if you have a read pair like this:
      r1: ACTGTACTGAACACACACACACACACACACAC
      r2: ACACACACACACACACTGACTGGTGAACTGCA

      ...those won't merge, because the low-complexity "ACACACAC..." part can overlap at many different insert lengths, so it will be classified as ambiguous. If you throw away all unmerged reads, you might not be able to assemble these low complexity areas, because all the reads with them might have been thrown away. Of course they don't assemble well anyway so it might not matter.

      So, if you want to merge data:

      Remove adapters first.

      If you want to error-correct, do error-correction before merging (it increases the merging rate). You can error-correct without normalization using the script "ecc.sh".

      If you want to normalize and assemble using only merged reads, do normalization after merging, on just the merged reads.

      If you want to normalize and assemble using both merged and unmerged reads, do normalization before merging.

      The coverage you should target depends on the assembler you use. I find that 40x works well with Velvet and 50x for Soap. 100-125x seems to work well for AllPathsLG, but AllPathsLG does not really perform better with normalization. Other assemblers probably have different ideal coverage.
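
      Putting those steps together, a rough sketch with placeholder file names might look like the following (these are the commonly documented flags for these scripts; check each script's built-in help for the exact options in your version):
      # 1) Adapter-trim first (adapters.fa is a placeholder adapter reference)
      bbduk.sh in1=r1.fq in2=r2.fq out1=trim_r1.fq out2=trim_r2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1
      # 2) Optionally error-correct without normalization, before merging
      ecc.sh in=trim_r1.fq in2=trim_r2.fq out=ecc_r1.fq out2=ecc_r2.fq
      # 3) Merge overlapping pairs
      bbmerge.sh in1=ecc_r1.fq in2=ecc_r2.fq out=merged.fq outu1=unmerged_r1.fq outu2=unmerged_r2.fq
      # 4) If assembling from merged reads only, normalize just the merged reads (pick the target for your assembler)
      bbnorm.sh in=merged.fq out=merged_norm.fq target=50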

      As for polyploidy... that's tricky. Normalization may be useful, or it may make things worse. I mainly work with prokaryote and fungal data, and generally normalization improves the assemblies of haploid fungi more than diploid fungi. It probably depends on the SNP rate.

      The normalization is done using the median of the frequency of 31-mers (though you can change it to e.g. 25 with the flag 'k=25'). If you have a SNP rate of 1 in 1000, like human, then a typical 250bp read will contain zero or one SNPs. The read will have 220 kmers; if it has a SNP, then it will have 31 kmers of low depth and 189 kmers of normal depth, so the median will be unchanged, and the fact that there was a SNP will not affect normalization. However, if you have an organism with a very high SNP rate, you will get some reads normalized to the overall depth, and some reads normalized to the depth of one ploidy (and thus potentially quadruple the desired coverage of highly polymorphic areas in a tetraploid). That's probably not good, but the only way to tell is to try the assembly both ways. Subsampling, of course, will not have this effect.
      Last edited by Brian Bushnell; 04-20-2014, 10:23 AM. Reason: Clarified guidance on normalization/merging order



      • #18
        Very interesting! Too bad the polymorphic genomes are a tough cookie. Indeed, the one I am currently working with has at least one polymorphic region every 100bp, in every read. So normalization could create some serious discrepancies. On the other hand, normalizing could be an interesting way to investigate homozygous regions, since their coverage would be low.

        Another project I am currently working on involves a 150 Mb haploid genome. The problem is that sequencing was done from a single cell, with whole-genome amplification, so coverage seems a bit uneven at times. Would normalization be beneficial in this case?



        • #19
          Just finished my first round of trimming (bbduk) using just the one adapter for starters ( GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ) and all of the PE files behaved similarly (including the much longer MiSeq PE reads). However, I also had the one set of Mate Pair data which I had already reverse comped. It was handled much more aggressively as indicated below by the output from bbduk:

          Input is being processed as paired
          Started output stream: 0.070 seconds.

          Input: 167158152 reads 16882973352 bases.
          KTrimmed: 27218034 reads (16.28%) 339257047 bases (2.01%)
          Result: 167158152 reads (100.00%) 16543716305 bases (97.99%)

          Time: 111.038 seconds.
          Reads Processed: 167m 1505.41k reads/sec
          Bases Processed: 16882m 152.05m bases/sec

          Thinking about it, I guess the proper steps would be to trim adapters from the original files, THEN reverse comp?



          • #20
            Originally posted by AdrianP View Post
            Another project I am currently working on involves a 150 Mb haploid genome. The problem is that sequencing was done from a single cell, with whole-genome amplification, so coverage seems a bit uneven at times. Would normalization be beneficial in this case?
            BBNorm was specifically designed to handle the highly uneven coverage of single-cell amplified data, for both normalization and error-correction, when coverage can vary by over 100000x. So that would probably be the ideal application.
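
            As a minimal sketch with placeholder file names (target sets the normalization depth, min discards reads whose apparent depth falls below that value, and ecc=t error-corrects in the same pass; double-check bbnorm.sh's help for the exact flags in your version):
            bbnorm.sh in=sc_r1.fq in2=sc_r2.fq out=norm_r1.fq out2=norm_r2.fq target=100 min=2 ecc=t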



            • #21
              Originally posted by jpummil View Post
              Just finished my first round of trimming (bbduk) using just the one adapter for starters ( GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ) and all of the PE files behaved similarly (including the much longer MiSeq PE reads). However, I also had the one set of Mate Pair data which I had already reverse comped. It was handled much more aggressively as indicated below by the output from bbduk:

              Input is being processed as paired
              Started output stream: 0.070 seconds.

              Input: 167158152 reads 16882973352 bases.
              KTrimmed: 27218034 reads (16.28%) 339257047 bases (2.01%)
              Result: 167158152 reads (100.00%) 16543716305 bases (97.99%)

              Time: 111.038 seconds.
              Reads Processed: 167m 1505.41k reads/sec
              Bases Processed: 16882m 152.05m bases/sec

              Thinking about it, I guess the proper steps would be to trim adapters from the original files, THEN reverse comp?
              Long-mate-pair data generally has very high adapter rates, so don't worry about the high rate of trimming. Also, you're going to use it for scaffolding, not contig generation, so biasing the dataset by aggressively trimming won't cause holes in the assembly.

              BBDuk can find kmers whether or not they are reverse-complemented. However, the trimming direction is still important, and reverse-complementing beforehand will reverse that! First, you must know some details about the mate-pair protocol, because that will determine whether you do "ktrim=r" (find the adapter and trim everything to the right) or "ktrim=l" (trim everything to the left). There's also a "ktrimexclusive" flag, which will do trimming excluding, rather than including, the kmer (we use this mode for our "clip" libraries). For all of our long-mate-pair protocols, I think, we trim to the right, without reverse-complementing first.
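
              For example, on the original (non-reverse-complemented) mate-pair files that might look something like this, using the adapter sequence from earlier in the thread as a literal and placeholder file names (swap ktrim=r for ktrim=l if your protocol calls for left-trimming, and verify the flags against bbduk.sh's help):
              bbduk.sh in1=mp_r1.fq in2=mp_r2.fq out1=mp_trim_r1.fq out2=mp_trim_r2.fq literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG ktrim=r k=23 mink=11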



              • #22
                So, a short update. Did some adapter trimming of the input files and re-ran at the same kmer (57) in Ray

                Before:
                Assembly Contigs
                # contigs (>= 0 bp) 594352
                # contigs (>= 1000 bp) 349257
                Total length (>= 0 bp) 1155057304
                Total length (>= 1000 bp) 1014358551
                # contigs 487964
                Largest contig 40714
                Total length 1117432969
                GC (%) 38.10
                N50 3249
                N75 1732
                L50 98136
                L75 216568
                # N's per 100 kbp 0.00

                After: (Mild trimming, just GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG)
                Assembly Contigs
                # contigs (>= 0 bp) 591737
                # contigs (>= 1000 bp) 348698
                Total length (>= 0 bp) 1156020758
                Total length (>= 1000 bp) 1016231498
                # contigs 486826
                Largest contig 40372
                Total length 1118838141
                GC (%) 38.10
                N50 3266
                N75 1740
                L50 97683
                L75 215624
                # N's per 100 kbp 0.00

                So, essentially, no change. I have another more aggressively trimmed version running now. Used literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,AGATCGGAAGAGC kmer=13

                This will complete by tomorrow (running on 64 core node with 512GB of memory).

                Next, I will try merging the reads together and re-running again in Ray (probably still with kmer set at 57 for comparison). Quick question...as the process for bbmerge is as follows:
                bbmerge.sh in1=<read1> in2=<read2> out=<merged reads> outu1=<unmerged1> outu2=<unmerged2>

                I expect the merged reads will then be considered a single read as input. What about the unmerged1 and unmerged2 data? Single reads as well? I doubt they'll still be accurately paired after the merging process.



                • #23
                  Originally posted by jpummil View Post
                  So, essentially, no change.
                  Of course, contiguity is not the only way of looking at assembly quality; it's also useful to map the input reads to the assembly to determine the percent mapped (higher is better) and the number of mismatches/indels (lower is better). Also, running gene prediction to try to find broken genes can sometimes help indicate assembly quality.
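
                  A quick mapping check along those lines might look like this, with placeholder file names (the percent-mapped and error-rate summary is printed at the end of the BBMap run, and mhist gives per-position mismatch rates):
                  bbmap.sh in1=r1.fq in2=r2.fq ref=Contigs.fasta nodisk out=null mhist=mhist.txt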

                  Next, I will try merging the reads together and re-running again in Ray (probably still with kmer set at 57 for comparison). Quick question...as the process for bbmerge is as follows:
                  bbmerge.sh in1=<read1> in2=<read2> out=<merged reads> outu1=<unmerged1> outu2=<unmerged2>

                  I expect the merged reads will then be considered a single read as input. What about the unmerged1 and unmerged2 data? Single reads as well? I doubt they'll still be accurately paired after the merging process.
                  The outu1 and outu2 streams will still be properly paired.
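
                  So a rough sketch of feeding both into Ray would be something like the line below, assuming Ray's -s (single), -p (paired), -k (kmer), and -o (output directory) options and a hypothetical output directory name; adjust the MPI launch to your node:
                  mpiexec -n 64 Ray -k 57 -s merged.fq -p unmerged1.fq unmerged2.fq -o RayOut57-Merged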



                  • #24
                    Hmmm...so, bbmerge had some success with three of the data sets. Two others, however, merged almost nothing. I didn't apply any additional flags to the command, just went with the defaults:

                    Executing jgi.BBMerge [in1=TRIM-CONS/MiSeq_PE_R1.fastq, in2=TRIM-CONS/MiSeq_PE_R2.fastq, out=TRIM-CONS-MERGED/MiSeq_PE_Merged.fastq, outu1=TRIM-CONS-MERGED/MiSeq_PE_R1_UM.fastq, outu2=TRIM-CONS-MERGED/MiSeq_PE_R2_UM.fastq]

                    BBMerge version 3.0
                    Writing mergable reads merged.
                    Started output threads.
                    Finished reading
                    Align time: 104.682 seconds.
                    Total time: 104.768 seconds.

                    Reads: 25879379
                    Joined: 15748510 60.854%
                    Ambiguous: 3327775 12.859%
                    No Solution: 6803094 26.288%
                    Avg Insert: 407.9

                    Insert range: 17 - 589
                    90th percentile: 506
                    50th percentile: 461
                    10th percentile: 190

                    Executing jgi.BBMerge [in1=TRIM-CONS/New_PE_R1.fastq, in2=TRIM-CONS/New_PE_R2.fastq, out=TRIM-CONS-MERGED/New_PE_Merged.fastq, outu1=TRIM-CONS-MERGED/New_PE_R1_UM.fastq, outu2=TRIM-CONS-MERGED/New_PE_R2_UM.fastq]

                    BBMerge version 3.0
                    Writing mergable reads merged.
                    Started output threads.
                    Finished reading
                    Align time: 260.237 seconds.
                    Total time: 260.424 seconds.

                    Reads: 75551653
                    Joined: 4281758 5.667%
                    Ambiguous: 16369683 21.667%
                    No Solution: 54900212 72.666%
                    Avg Insert: 141.1

                    Insert range: 17 - 189
                    90th percentile: 180
                    50th percentile: 146
                    10th percentile: 97

                    Executing jgi.BBMerge [in1=TRIM-CONS/Old_MP_RC01.fastq, in2=TRIM-CONS/Old_MP_RC02.fastq, out=TRIM-CONS-MERGED/Old_MP_Merged.fastq, outu1=TRIM-CONS-MERGED/Old_MP_RC01_UM.fastq, outu2=TRIM-CONS-MERGED/Old_MP_RC02_UM.fastq]

                    BBMerge version 3.0
                    Writing mergable reads merged.
                    Started output threads.
                    Finished reading
                    Align time: 149.105 seconds.
                    Total time: 149.299 seconds.

                    Reads: 83579070
                    Joined: 2561231 3.064%
                    Ambiguous: 27670694 33.107%
                    No Solution: 53347145 63.828%
                    Avg Insert: 76.8

                    Insert range: 17 - 189
                    90th percentile: 135
                    50th percentile: 72
                    10th percentile: 27
                    Executing jgi.BBMerge [in1=TRIM-CONS/Old_PE_s_1_1.fastq, in2=TRIM-CONS/Old_PE_s_1_2.fastq, out=TRIM-CONS-MERGED/Old_PE_s_1_Merged.fastq, outu1=TRIM-CONS-MERGED/Old_PE_s_1_1_UM.fastq, outu2=TRIM-CONS-MERGED/Old_PE_s_1_2_UM.fastq]

                    BBMerge version 3.0
                    Writing mergable reads merged.
                    Started output threads.
                    Changed from ASCII-33 to ASCII-64 on input quality 88 while prescanning.
                    Finished reading
                    Align time: 148.187 seconds.
                    Total time: 148.376 seconds.

                    Reads: 98578987
                    Joined: 54619897 55.407%
                    Ambiguous: 38068745 38.618%
                    No Solution: 5890345 5.975%
                    Avg Insert: 146.5

                    Insert range: 17 - 189
                    90th percentile: 175
                    50th percentile: 147
                    10th percentile: 118


                    Executing jgi.BBMerge [in1=TRIM-CONS/Old_PE_s_2_1.fastq, in2=TRIM-CONS/Old_PE_s_2_2.fastq, out=TRIM-CONS-MERGED/Old_PE_s_2_Merged.fastq, outu1=TRIM-CONS-MERGED/Old_PE_s_2_1_UM.fastq, outu2=TRIM-CONS-MERGED/Old_PE_s_2_2_UM.fastq]

                    BBMerge version 3.0
                    Writing mergable reads merged.
                    Started output threads.
                    Changed from ASCII-33 to ASCII-64 on input quality 88 while prescanning.
                    Finished reading
                    Align time: 135.654 seconds.
                    Total time: 135.844 seconds.

                    Reads: 84969978
                    Joined: 48005380 56.497%
                    Ambiguous: 31726426 37.338%
                    No Solution: 5238172 6.165%
                    Avg Insert: 146.5

                    Insert range: 17 - 189
                    90th percentile: 175
                    50th percentile: 147
                    10th percentile: 118



                    • #25
                      The sets that did not work probably had insert sizes that were too long, so they didn't overlap. You can determine the insert size distribution by mapping to one of your assemblies:

                      bbmap.sh -Xmx31g in1=r1.fq in2=r2.fq ref=snake.fa nodisk out=null ihist=ihist.txt mhist=mhist.txt qhist=qhist.txt bhist=bhist.txt

                      The insert size histogram will be in "ihist.txt", but you may find the other histograms useful as well. BBMerge can also produce an insert size histogram with the "hist=" flag, but it will be biased, as it will use only the reads that overlap. If you just want a general idea, you can make the mapping go faster with the flag "reads=1000000", which will just use the first million read pairs - that should be good enough for the insert size histogram.
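
                      For a quick pass, that might look like this (placeholder file names, same flags as above):
                      bbmap.sh -Xmx31g in1=r1.fq in2=r2.fq ref=snake.fa nodisk out=null ihist=ihist.txt reads=1000000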

                      If the insert sizes are short enough that the 2 libraries that don't merge well actually DO overlap, then the problem could be a high error rate or adapter contamination; merging works best if you remove adapters first, then error-correct, then merge. But I suspect it's just that those libraries mostly don't overlap.



                      • #26
                        Thanks Brian! This will be another interesting exercise

                        As mentioned in my first post, I'm not a bioinformatician. So this is all very interesting to me, and I certainly appreciate your patience and sage advice.



                        • #27
                          So, if I am interpreting these correctly, it looks like the PE library probably has an insert length of around 500, but the MP library is all over the map! I used the following commands and read all of the data, not just a sample:

                          bbmap.sh -Xmx40g in1=$DATA/New_PE_R1.fastq in2=$DATA/New_PE_R2.fastq ref=/scratch/jpummil/Snake_v2/RayOut57-TRIM-CONS/Contigs.fasta nodisk out=null ihist=ihist_PE.txt mhist=mhist_PE.txt qhist=qhist_PE.txt bhist=bhist_PE.txt
                          bbmap.sh -Xmx40g in1=$DATA/Old_MP_RC01.fastq in2=$DATA/Old_MP_RC02.fastq ref=/scratch/jpummil/Snake_v2/RayOut57-TRIM-CONS/Contigs.fasta nodisk out=null ihist=ihist_MP.txt mhist=mhist_MP.txt qhist=qhist_MP.txt bhist=bhist_MP.txt

                          ihist for both attached...



                          • #28
                            Oops, despite their seemingly tiny size (36k and 85k), they are regarded as too big for attachments on this forum.



                            • #29
                              Originally posted by jpummil View Post
                              Oops, despite their seemingly tiny size (36k and 85k), they are regarded as too big for attachments on this forum.
                              You can probably attach them if you gzip them first; raw text files don't work. But it's not surprising that the MP insert distribution is odd. Fragment (PE) reads tend to have a fairly tight insert size, which is generally bell-shaped and cannot really range above 800. LMP (Long mate pair) libraries tend to have wildly erratic insert-size distributions, depending on the protocol - they might have 50% under 200bp (failed), and 50% ranging from 1kbp to 8kbp.

                              BBMap should still be able to handle those. But many LMP libraries have some "inward" and some "outward" reads, so it's best to run with the "rcs=f" flag (requireCorrectStrand=false) for LMP libraries if you want accurate pairing data.
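
                              For instance, roughly your earlier command with that flag added (placeholder paths):
                              bbmap.sh -Xmx40g in1=Old_MP_RC01.fastq in2=Old_MP_RC02.fastq ref=Contigs.fasta nodisk out=null ihist=ihist_MP.txt rcs=f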



                              • #30
                                So, I do recall the group who sequenced the MP data mentioning an insert size of 6600? You also mentioned a few posts back using the MP data for scaffolding...did you intend for me to omit it from the initial assemblies and perform a secondary scaffolding step with that data?

                                Ahhh...yup, gzip allowed me to attach the files. If I can figure out why the PE file in question didn't merge well, I want to correct that and then try an assembly in Ray using the merged data (as single reads) and the unmerged files as paired ends as you indicated.

