Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cutadapt trimming for multiple files

    Hi,

    First, I'm sorry if I missed this but I couldn't find another thread about this issue:

    I would like to run cutadapt on over 200 different .fastq files, so I clearly need to automate this process but I am not sure how to do this.

    What is the best way to tell cutadapt to remove the adaptors on all files within a directory and create an output for each of them? The manual only mentions running two files at one (if paired end sequencing was done)

    In short, I need to find all "*.fastq", trim sequences and output as trim.fastq

    Thanks!

  • #2
    You can use one of the shell script ideas here to iterate over the set of files you have: http://stackoverflow.com/questions/1...-files-in-unix
    and http://stackoverflow.com/questions/1...s-in-directory

    If you are submitting to a job scheduler on a cluster a similar loop can be used.

    Comment


    • #3
      I bet you could do something snazzy with gnu parallel.

      This tutorial follows Ole Tange’s GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ) but I tried to use some bioinformatics-related  examples (align with BWA, Samtools, etc.. ).  

      Comment


      • #4
        Hi - for starters, I am a completely newbie at this, so bare with me if it's basic.

        I understand how to do a for loop with one object, but if you have paired-end reads, trim-galore takes two files as input, and I just can't figure out how to make a for loop including a list of pairs.... Would be easy if trim-galore accepted wildcards in the two filenames, but I am not sure of it does? Or is it possible to specify two objects to use as input? i.e.
        for read1 in /*/xxxxR1.fastq
        for read2 in /*/xxxxR2.fastq
        do trim-galore etc etc "$read1$ "$read2"
        done

        Any thoughts?

        Comment


        • #5
          You can deduce the read2 variable from read1:

          Code:
          for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done

          Comment


          • #6
            Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.

            Comment


            • #7
              Originally posted by Michael.Ante View Post
              You can deduce the read2 variable from read1:

              Code:
              for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done
              That makes sense - and even better, it seems to be working. Thank you so much!

              Comment


              • #8
                Originally posted by mgogol View Post
                Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.
                In case you want to speed up a bit, you can use such a file as input for GNU parallel:

                Code:
                parallel -j 8 --colsep '\t'  trim-galore etc etc {1} {2} :::: read-pairs-location.tsv
                Just adjust the number of threads (-j) according to your CPUs.

                Comment


                • #9
                  Hello Michael,

                  I am trying to do exactly as you suggested just need a bit of help with that code.

                  what is --colsep and when you say '\t' means that you read-pairs-location.tsv is tab separated if it was a CSV you would write --colsep ',' please correct me if I am wrong. Also it means that it will use 8 threads for each of the trim_galore command?

                  I am currently testing on a Quad code i7 on my mac that has hyper threading so in total I have 8 cores 4 physical and 4 virtual so I can use -j8 although I need to test for what value of thread I get the best performance as using maximum threads generally gives lower performance.

                  Thanks,
                  Yaseen

                  Comment


                  • #10
                    That --colsep option should work. Give it a try Otherwise convert your file into tsv format and then use the original \t option.

                    Having 8 cores does not mean that you will be able to use them efficiently. They are connected to your storage subsystem through a common bus which can only allow a certain amount of data to be read/written.

                    Again experiment with a small subset of data (test file) starting with 4 cores and go up and down in number to see what number works best for your setup.

                    Comment


                    • #11
                      Thank you GenoMax.

                      I ran the following command:

                      parallel -j 3 --colsep '\t' trim_galore --output_dir tg/3 --paired --fastqc -a CTGTCTCTTATA -a2 CTGTCTCTTATA {1} {2} :::: tg.tsv

                      Is there a way that I can give a name to this command? and how would i check the number of threads being used. I am using a mac and when I check the activity monitor i cannot see 3 i.e. number of threads. Had i named this command it would have shown up in the activity monitor and or using top? Indeed this command is running as it is generating output. The tg.tsv has 4 files names with their complete paths i.e. 2 paired end samples.

                      Can someone please advise? thanks

                      Comment


                      • #12
                        In my activity monitor when i run the above command its shows gzip command with 1 thread as i think it uncompressed the fastq.gz file initially? Whats the best to see the number of threads run by this command on a mac or a linux. I have a mac.
                        Can I name this command to something so i can quickly find it in top etc?

                        Comment


                        • #13
                          When you run "top" processes that are actively consuming CPU cycles should show up at the top of the list. You can also use the activity monitor: https://support.apple.com/en-us/HT201464

                          Your file contains the two file name pairs separated by a tab on each line, right?

                          Comment


                          • #14
                            When I do a top or a activity monitor it shows gzip most of the time and the number of threads as 1 even when i run with 3 threads.

                            Yes tsv in a text editor, with two reads each separated by a tab. I did not put two samples as my mac gives me out of space error so for testing I can only use 1 sample with paired end read. I then labelled the extension of the file with .tsv The below is the section how my tab separated file looks like, the gap between the two files is a tab.

                            /Users/Yaseen/S1R1.fastq.gz /Users/Yaseen/S1R2.fastq.gz

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Advancing Precision Medicine for Rare Diseases in Children
                              by seqadmin




                              Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                              12-16-2024, 07:57 AM
                            • seqadmin
                              Recent Advances in Sequencing Technologies
                              by seqadmin



                              Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                              Long-Read Sequencing
                              Long-read sequencing has seen remarkable advancements,...
                              12-02-2024, 01:49 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 12-17-2024, 10:28 AM
                            0 responses
                            26 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-13-2024, 08:24 AM
                            0 responses
                            43 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-12-2024, 07:41 AM
                            0 responses
                            29 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-11-2024, 07:45 AM
                            0 responses
                            42 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X