  • Minimum length for read trimming

    Hi everyone,
    I will be performing de novo and reference guided assembly of mitochondrial genomes using Geneious, and have a question regarding the minimum recommended read length to trim my reads to.

    The libraries I have are Nextera XT prepared from purified organellar DNA (~450 kb genome), paired-end sequenced on MiSeq V3 with 300 bp reads. I am using Trimmomatic to trim my reads, and I notice that after quality trimming (without specifying a min length) some of the reads are very short.

    I am wondering what a good minimum length for trimming would be. I tried trimming to a min length of 250 bp, but that removes a lot of reads and seems too stringent. I am thinking a min length of 100-150 bp may be better.

    Any suggestions?

  • #2
    Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?



    • #3
      Originally posted by GenoMax View Post
      Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?
      I have posted the FastQC results for 2 different libraries, #1 and #10. For each library, I have the FastQC results for the raw untrimmed data, quality trimmed data, and quality trimmed + 250 bp minimum read length data. I will upload the 6th file in a separate reply.

      You can see that for library #1, quality trimming plus a min read length of 250 bp removes about half of the reads, whereas for library #10 only 1/6th of the reads remain after the same trimming (although the total read count and quality were a little lower for #10 than #1 to begin with). Out of the 14 libraries I sequenced, only #10 showed this drastic reduction in reads after trimming with a min length specified.

      As for the phiX spike-in, I'm not actually sure but I am asking our genomics center to see if they added it in. I will let you know when I find out.
      Attached Files



      • #4
        Originally posted by GenoMax View Post
        Can you post FastQC plots of this run, if you have them available? It would be useful to see if the problem with low quality bases is due to low nucleotide diversity libraries (which leads to characteristic drops in Q-scores later in read cycles). Was a phiX spike-in used for this run?
        Below is the 6th file.
        Attached Files



        • #5
          For a 300-cycle run this data looks pretty good. I think you are being way too aggressive in trimming (if that is only being done on Q-score). What Q-score cut-off are you using? Even for library #10 the median score for the last cycles is still almost >30 in raw data.

          You may want to just trim to remove adapters (if any are present) and then give the downstream analysis a try. If you expect the R1 and R2 reads to overlap in the middle then you should use BBMerge or FLASH to do that first.



          • #6
            Hi Marisa,

            From the FastQC stats it looks like your 'raw data' has already been trimmed (probably on the MiSeq, to remove the Illumina adapters), with a min length setting of 32.

            In the supplementary data for the Trimmomatic paper, they used a min length setting of 36. You may want to set the min length slightly longer if you are doing assembly, depending on what kmer lengths you want to use for the assembly.
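As a rough illustration of why the assembly k-mer length matters when choosing a min length: a read of length L contributes L - k + 1 k-mers, so any read shorter than k contributes none. This is a minimal sketch; the function name and the value k=55 are made up for the example and are not a setting taken from Geneious or from this thread.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers a single read of this length can contribute."""
    return max(0, read_length - k + 1)


# Compare a few candidate minimum lengths against an arbitrary k of 55.
for length in (32, 36, 100, 250):
    print(length, kmers_per_read(length, k=55))
```

Under this assumption, reads kept by a min length of 32 or 36 add nothing to a k=55 graph, which is the reason to set the min length at or above the largest k you plan to use.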



            • #7
              Hi Marisa,

              The minimum length you keep should depend on the assembler you are using. Since you intend to use Geneious, there is probably no harm in keeping all the short reads since it will automatically ignore reads that are too short to be useful to it. However, if you have more data than you need (e.g. over 100 fold mitochondrial coverage), then discarding the shortest reads might be worthwhile to improve performance.



              • #8
                Originally posted by GenoMax View Post
                For a 300-cycle run this data looks pretty good. I think you are being way too aggressive in trimming (if that is only being done on Q-score). What Q-score cut-off are you using? Even for library #10 the median score for the last cycles is still almost >30 in raw data.

                You may want to just trim to remove adapters (if any are present) and then give the downstream analysis a try. If you expect the R1 and R2 reads to overlap in the middle then you should use BBMerge or FLASH to do that first.
                These are the trimmomatic settings I was using
                Code:
                ILLUMINACLIP:$TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:250

                It seems like there aren't any adapters present even before trimming, so perhaps I will just try the unclipped vs clipped data (without specifying a minimum read length) for mapping and assembly. I thought about using FLASH, but I don't necessarily expect my reads to overlap because the bioanalyzer profiles of the libraries show an average size of 1,527 bp.

                Also, I checked with the genomics facility here who made my libraries, and yes, they included a 1% PhiX spike-in in the run.

                Thanks again for all the helpful suggestions!
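To see how SLIDINGWINDOW:4:20 and MINLEN:250 in the settings above interact, here is a simplified sketch of window-based trimming. It is an assumption-laden toy, not Trimmomatic's actual implementation: it cuts at the start of the first window whose mean quality falls below the threshold, and the quality profile is invented.

```python
def sliding_window_trim(quals, window=4, min_avg=20):
    """Return how many leading bases are kept.

    Simplified model: scan 5'->3' and cut at the start of the first
    window whose mean quality is below min_avg.
    """
    for start in range(0, len(quals) - window + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / window < min_avg:
            return start
    return len(quals)


def survives(quals, window, min_avg, minlen):
    """True if the trimmed read is still at least minlen bases long."""
    return sliding_window_trim(quals, window, min_avg) >= minlen


# Invented 300-cycle read: Q35 for 240 cycles, then a Q18 tail.
quals = [35] * 240 + [18] * 60
print(sliding_window_trim(quals, 4, 20), survives(quals, 4, 20, 250))
print(sliding_window_trim(quals, 4, 10), survives(quals, 4, 10, 250))
```

In this toy case a modest quality dip in the last 60 cycles is enough to discard the whole read at a threshold of 20 with a min length of 250, while a threshold of 10 keeps it intact.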



                • #9
                  Originally posted by mastal View Post
                  Hi Marisa,

                  From the FastQC stats it looks like your 'raw data' has already been trimmed (probably on the MiSeq, to remove the Illumina adapters), with a min length setting of 32.

                  In the supplementary data for the Trimmomatic paper, they used a min length setting of 36. You may want to set the min length slightly longer if you are doing assembly, depending on what kmer lengths you want to use for the assembly.
                  Thanks for the suggestion, I will try both "raw data" and min length trimmed data for assembly to see what the results look like.



                  • #10
                    Originally posted by Matt Kearse View Post
                    Hi Marisa,

                    The minimum length you keep should depend on the assembler you are using. Since you intend to use Geneious, there is probably no harm in keeping all the short reads since it will automatically ignore reads that are too short to be useful to it. However, if you have more data than you need (e.g. over 100 fold mitochondrial coverage), then discarding the shortest reads might be worthwhile to improve performance.
                    I do have a ridiculous amount of coverage, even after trimming with a min length specified (~400X for most libraries), so perhaps some less stringent trimming will improve performance.



                    • #11
                      Originally posted by Marisa_Miller View Post
                      I don't necessarily expect my reads to overlap because the bioanalyzer profiles of the libraries show an average size of 1,527 bp.
                      I would not trust that number. Perhaps molecules in the library have that kind of size, but they would not be expected to amplify and cluster correctly on the sequencing machine; I am told that 800bp is kind of the limit. It's best to make a draft assembly and map to it to determine the real insert size distribution. You can alternately get a quick idea of the insert distribution by merging, which is really fast -

                      bbmerge.sh in=reads.fq ihist=ihist.txt reads=100000

                      ...will take about a second. The fraction joined and the position of the peak in the graph will make it clear what the real distribution is like. If the graph is still rising then abruptly drops to zero just before 2x(read length) then the insert sizes are generally too long for merging.
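If you would rather summarize the histogram than eyeball it, something like the following might do. This is a sketch under an assumption about the file layout: comment or header lines starting with `#`, followed by two whitespace-separated columns (insert size, count). Check your own ihist.txt before relying on it; the function name and the example rows are invented.

```python
def summarize_ihist(lines):
    """Summarize an insert-size histogram given as an iterable of text lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        size, count = line.split()[:2]
        counts[int(size)] = int(count)
    total = sum(counts.values())
    if total == 0:
        return None
    mode = max(counts, key=counts.get)
    mean = sum(s * c for s, c in counts.items()) / total
    return {"merged_pairs": total, "mode": mode, "mean": mean}


# Invented rows in the assumed layout.
example = ["#InsertSize\tCount", "200\t10", "250\t30", "300\t20"]
print(summarize_ihist(example))

# For a real file:
# with open("ihist.txt") as fh:
#     print(summarize_ihist(fh))
```

Note that the histogram only covers pairs that merged, so a summary like this says nothing about inserts too long to overlap.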

                      Oh, and I agree with GenoMax that your trimming is too aggressive. Aggressive trimming can cause bias, since some bases and motifs yield lower quality scores, which can cause a worse overall assembly. I don't typically trim to over Q16, and normally stay at Q10 or below.

                      Also, bear in mind that too much coverage can cause poor assembly (depending on the assembler), so subsampling / normalization to 100x or lower will often improve your results.
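A back-of-the-envelope way to pick a subsampling fraction, as a sketch only: mean coverage is roughly (number of reads x mean read length) / genome size, and the fraction to keep is target / current. The ~450 kb genome size comes from the first post; the read count and mean length below are hypothetical. Random subsampling is also not the same thing as coverage normalization, which is depth-aware.

```python
def mean_coverage(n_reads, mean_read_len, genome_size):
    """Approximate mean fold coverage from total sequenced bases."""
    return n_reads * mean_read_len / genome_size


def keep_fraction(current_cov, target_cov=100.0):
    """Fraction of reads to keep to reach target_cov (never more than 1)."""
    return min(1.0, target_cov / current_cov)


# Hypothetical library: 720,000 reads averaging 250 bp on a 450 kb genome.
cov = mean_coverage(720_000, 250, 450_000)
print(cov, keep_fraction(cov))
```

With numbers like these, keeping about a quarter of the reads would bring a ~400x library down to roughly 100x.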



                      • #12
                        Originally posted by Marisa_Miller View Post
                        Hi everyone,
                        I will be performing de novo and reference guided assembly of mitochondrial genomes using Geneious, and have a question regarding the minimum recommended read length to trim my reads to.

                        The libraries I have are Nextera XT prepared from purified organellar DNA (~450 kb genome), paired-end sequenced on MiSeq V3 with 300 bp reads. I am using Trimmomatic to trim my reads, and I notice that after quality trimming (without specifying a min length) some of the reads are very short.

                        I am wondering what a good minimum length for trimming would be. I tried trimming to a min length of 250 bp, but that removes a lot of reads and seems too stringent. I am thinking a min length of 100-150 bp may be better.

                        Any suggestions?
                        For quality trimming, use a Phred score cutoff of 5, not 20.
                        See here: http://genomebio.org/nail-in-the-qua...imming-coffin/
                        and paper here :



                        • #13
                          I would not call that conclusive evidence that 5 is a good value for trimming. It highly depends on what you're doing, the trimming algorithm you use, your read length, read quality profile, coverage depth, and a variety of other factors. Not to mention that the paper's granularity was too coarse.
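For perspective on the cutoffs mentioned in this thread (Q5, Q10, Q16, Q20, Q30), the standard Phred definition is Q = -10 * log10(P), so the per-base error probability is P = 10^(-Q/10). A small sketch to tabulate it (the function name is just for illustration):

```python
def phred_to_error_prob(q):
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


for q in (5, 10, 16, 20, 30):
    print(f"Q{q}: {phred_to_error_prob(q):.4f}")
```

So a Q5 cutoff tolerates bases with roughly a 1-in-3 chance of being wrong, Q10 a 1-in-10 chance, and Q20 a 1-in-100 chance; which of those is acceptable depends on the factors listed above.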



                          • #14
                            Originally posted by Brian Bushnell View Post
                            I would not trust that number. Perhaps molecules in the library have that kind of size, but they would not be expected to amplify and cluster correctly on the sequencing machine; I am told that 800bp is kind of the limit. It's best to make a draft assembly and map to it to determine the real insert size distribution. You can alternately get a quick idea of the insert distribution by merging, which is really fast -

                            bbmerge.sh in=reads.fq ihist=ihist.txt reads=100000

                            ...will take about a second. The fraction joined and the position of the peak in the graph will make it clear what the real distribution is like. If the graph is still rising then abruptly drops to zero just before 2x(read length) then the insert sizes are generally too long for merging.

                            Oh, and I agree with GenoMax that your trimming is too aggressive. Aggressive trimming can cause bias, since some bases and motifs yield lower quality scores, which can cause a worse overall assembly. I don't typically trim to over Q16, and normally stay at Q10 or below.

                            Also, bear in mind that too much coverage can cause poor assembly (depending on the assembler), so subsampling / normalization to 100x or lower will often improve your results.
                            Hi Brian,
                            Thanks for the useful information. I quality trimmed the reads with a lower score limit (Q10) and did not specify a minimum read length.

                            I merged the reads with your program, and have attached a few files for you to take a look at. It looks like most of the libraries have a mean insert size of 250-330 bp. So I guess I should try to merge the reads before assembly. Do you have any suggestions on using bbmerge?
                            Attached Files
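A quick sanity check on what those insert sizes mean for 2x300 bp reads, as a sketch that assumes full-length, untrimmed reads (function names are illustrative): the two reads overlap by 2 x read length minus the insert size, capped at the insert size, and any insert shorter than the read length means each read runs off the end of the fragment into adapter.

```python
def expected_overlap(read_len, insert_size):
    """Bases covered by both R1 and R2; 0 if the reads do not meet."""
    return max(0, min(insert_size, 2 * read_len - insert_size))


def read_through(read_len, insert_size):
    """Bases of each read that extend past the insert into adapter."""
    return max(0, read_len - insert_size)


for insert in (250, 330, 1527):
    print(insert, expected_overlap(300, insert), read_through(300, insert))
```

Under these assumptions, inserts of 250-330 bp overlap almost completely, while a true 1,527 bp insert would not overlap at all, which is consistent with the pairs merging readily here.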



                            • #15
                              Originally posted by Marisa_Miller View Post
                              Do you have any suggestions on using bbmerge?
                              Check out the BBMerge thread for nuggets of knowledge: http://seqanswers.com/forums/showthread.php?t=43906

