Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • FastX Trimmer takes too long and how to do multiple files?

    Hi everyone,

    So I am doing 100 bp PE reads, and I want to trim off the first 15 bp's. I want to do this because when I check the distribution of GCTA, they vary quite a bit between the first 15 bp's, but then are all 25% after that. Although the Phred Score is still quite high (above 30), a Tech Support person at Illumina mentioned that I would probably get better results by trimming it.

    I noticed that with tophat, you cannot use an option telling it to start at the 15th base, so the only option appears to be trimming. I tried to use fastx_trimmer, but it appears to be taking a LONG time, and I can only get it to do one file at a time. Is there another software that would be faster, or another way of going about this? Also, can I do more than one at a time?

    Thanks so much for any advice!

  • #2
    You could find the call to bowtie that tophat makes and edit it to pass additional parameters taken from an environment variable, say, BowtieExtraTophatOptions

    then set BowtieExtraTophatOptions to `--trim5 15` before calling Tophat

    Just remember to edit TopHat again the next time you get an updated version (`patch` is your friend).

    Comment


    • #3
      Trimmomatic is much faster than Fastx. You should give it a try.
      However, I am not sure your mapping will improve that much. If you used random hexamers in your reverse transcription, the bias in the first 10 bp or so is expected.

      Comment


      • #4
        The fastx_toolkit fastq parser is 10X slower than what it should be. You may try my program: https://github.com/lh3/seqtk

        seqtk trimfq -b15

        Sorry for the self promotion.

        Comment


        • #5
          Thank you for the advice and I apprecaite the links!

          I will try both of these programs, but before I do, I ran it without my trimming and I obtained 90% read alignment, according to samtools flagstat. Should I still trim and re-run it? Is it probably that I would obtain better results?

          Comment


          • #6
            You would get better and more reliable results by trimming the first 10 to 15 bases because as mentioned above there is an intrinsic bias introduced by the RNA library prep, likely to cause some misalignments depending on the length of the seed used by the aligner though these might be filtered out anyway by limiting the number of mismatches allowed. Still, no harm in trimming.

            If the program can't run on several files at once, perhaps you could run several instances of the program in separate terminals provided you have enough cpu and ram.
            Last edited by Kennels; 04-01-2012, 06:41 PM.

            Comment


            • #7
              Originally posted by Kennels View Post
              You would get better and more reliable results by trimming the first 10 to 15 bases because as mentioned above there is an intrinsic bias introduced by the RNA library prep, likely to cause some misalignments depending on the length of the seed used by the aligner though these might be filtered out anyway by limiting the number of mismatches allowed. Still, no harm in trimming.
              I disagree. I see little difference in my number of mapped reads whether I trim or not. There is some harm in trimming since you lose a good portion of your read depending on length. I can see increases in multimapped reads and in RNA-seq, a reduction in the number of reads split across an exon-exon junction.

              Comment


              • #8
                Originally posted by pbluescript View Post
                I disagree. I see little difference in my number of mapped reads whether I trim or not. There is some harm in trimming since you lose a good portion of your read depending on length. I can see increases in multimapped reads and in RNA-seq, a reduction in the number of reads split across an exon-exon junction.
                I stand corrected - I mistook bias as bad calls. Thanks for the correction @pbluescript.

                You will still get correct alignments - just that the interpretation should take into account this bias. I guess it depends on what you want to do downstream.

                Comment


                • #9
                  Originally posted by Kennels View Post
                  I stand corrected - I mistook bias as bad calls. Thanks for the correction @pbluescript.

                  You will still get correct alignments - just that the interpretation should take into account this bias. I guess it depends on what you want to do downstream.
                  Thanks very much! How should I take this bias into account? I want to do gene expression profiling with Cufflinks.

                  Also, is there a quick and dirty way to check the differences in gene expression between two conditions with the output of Tophat? Cufflinks will take me a couple days with my computing power, and I only have til tomorrow to tell the Sequencing department that I want a lane for my duplicates or I will have to wait til May. I was thinking of sorting and indexing accepted_hits.bam and then loading it into IGV and taking a peak, although I'm not sure of how to quantify any differences.

                  Comment


                  • #10
                    Originally posted by billstevens View Post

                    Also, can I do more than one at a time?
                    To apply trimmomatic over multiple files you can try using a bash script with recursion and substring replacement; this example below can be used with paired end data where filenames for reads 1 and reads 2 have "R1" or "R2" in them, respectively. Substitute your own trimmomatic parameters as needed!

                    #!/bin/csh
                    for f in <path_to_your_data>*R1*
                    do
                    java -classpath <path_to_.jar>trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE -threads 12 -phred33 -trimlog trimmomatic.log $f ${f/R1/R2} $f.PairedTrim.fastq $f.UnpairedTrim.fastq ${f/R1/R2}.PairedTrim.fastq ${f/R1/R2}.UnpairedTrim.fastq LEADING:30 TRAILING:28 SLIDINGWINDOW:4:28 MINLEN:40
                    done
                    Last edited by safay; 08-18-2012, 12:16 AM.

                    Comment


                    • #11
                      I know this is a very old thread, but thought I'd add an answer that helps answer part of the OP's original question.

                      Here's a for loop (in bash) to iterate through FASTQ files in a directory, run fastx_trimmer on them and rename the outptut.

                      Code:
                      for file in `ls /path/to_your/files/*.fastq`
                      do
                          newname=`basename $file | sed -e "s/.fastq/_trimmed.fastq/"`
                          /path_to/fastx_trimmer -Q33 -f 15 -z -i $file > /path/to_your/output/"$newname.gz"
                          echo $file
                          echo $newname
                      done
                      Here's the brief explanation, in hopes of making this whole thing more clear:

                      For each file (i) that matches "*.fastq" (this command will match any files that end with .fastq):

                      1. Use the basename command (to obtain the filename) and, using sed, remove the .fastq suffix and replace with "_trimmed.fastq"

                      2. Run fastx_trimmer with PHRED score +33 (-Q 33; required for Illumina FASTQ files), trim the first 15 bases (-f 15) and gzip the output (-z) for the input file (-i $file) specified. Then, redirect fastx_trimmer stdout to your desired directory and rename the file using the "newname" variable and add ".gz" extension to that name.

                      4. Print the original file name variable ($file) to the screen for verification that this is working as intended.

                      5. Print the new file name variable ($newname) to the screen for verification that this is working as intended..

                      This will process each file, one at a time. The output on the screen after each loop should be the original file name and then the trimmed file name, (without the addition of the ".gz" extension).

                      To get even fancier (I haven't tested this), I think you can use gzipped fastq files (.fastq.gz) as an input if you change the fourth line of the for loop above to:

                      gunzip -c $file | /path_to/fastx_trimmer -Q33 -f 15 -z -i $file > /path/to_your/output/"$newname.gz"

                      What that should do is send the output (stdout) of the gunzip command to the input (stdin) of fastx_trimmer. From what I've read, fastx_trimmer accepts input from stdin, so...
                      Last edited by kubu4; 02-25-2015, 04:46 PM. Reason: add code tags

                      Comment


                      • #12
                        Step 0. 'bash' for those using a different shell. My boss, for example, likes tcsh. :-)

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        11 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        51 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X