  • Trimming reads messes up PE fastqs? any ideas please?

    Hi,

    I am in a bit of a pickle after trimming some adapters from a heavily contaminated PE library.
I think the problem is caused by the trimming (Cutadapt) removing some reads entirely from the R1 and R2 fastqs, meaning that there are now some unpaired reads and the two read files are out of sync...
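Roughly, the kind of per-file clipping that causes this looks like the sketch below (placeholder adapter and file names, not the exact commands); trimming R1 and R2 independently with a minimum-length cut-off is what lets the two files drift apart:

Code:
# Sketch only: placeholder adapters and file names.
# -m drops short reads from each file independently, so pairs can fall out of sync.
cutadapt -a ADAPTER_READ1 -m 35 -o R1.trimmed.fq R1.fastq
cutadapt -a ADAPTER_READ2 -m 35 -o R2.trimmed.fq R2.fastq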

The first problem this led to was bwa sampe inferring massive insert sizes, but I can turn off that behaviour by using the -A parameter.

This then leads to some weird validation warnings when using Picard to convert SAM to BAM (ignored using VALIDATION_STRINGENCY=LENIENT).
I then get the same lines erroring at each Picard step, but the lenient approach finally falls down at GATK IndelRealigner (even using -S SILENT), where I get the following message and cannot progress:

    Error caching SAM record (null), which is usually caused by malformed SAM/BAM files in which multiple identical copies of a read are present.
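For context, the downstream steps look roughly like this (a sketch only; the reference, interval and file names are placeholders):

Code:
# Sketch only: placeholder reference and paths; flags as described above.
bwa aln ref.fa R1.trimmed.fq > R1.sai
bwa aln ref.fa R2.trimmed.fq > R2.sai
bwa sampe -A ref.fa R1.sai R2.sai R1.trimmed.fq R2.trimmed.fq > aln.sam   # -A disables insert-size estimation

# Picard SAM -> sorted BAM, ignoring validation errors
java -jar SortSam.jar INPUT=aln.sam OUTPUT=aln.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT

# GATK IndelRealigner with silent validation (this is where it fails)
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I aln.sorted.bam \
     -targetIntervals targets.intervals -o aln.realigned.bam -S SILENT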

Without trimming, these reads go through the pipeline nicely.
With trimming, for those samples that don't throw up validation errors, I get a much better alignment, utilising more reads than the untrimmed data.

Can anyone suggest how I can get around this?
I want to keep the trimmed reads, and hopefully also use the reads that have lost their mates, but don't know how to go about that... does anyone have any ideas?

    Thanks
    Last edited by swNGS; 05-08-2012, 07:54 AM.

  • #2
If you use Trimmomatic for read trimming, your paired fastq files will stay in the right order and you will get separate files for those reads that lost their mates in the trimming.
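For example, a minimal paired-end sketch (the jar name, adapter file and step settings are placeholders; older Trimmomatic versions use the java -classpath ... org.usadellab.trimmomatic.TrimmomaticPE form rather than -jar ... PE):

Code:
# Sketch only: placeholder jar, adapter file and settings.
# Writes paired output plus separate files for reads whose mate was dropped.
java -jar trimmomatic.jar PE -phred33 \
     R1.fastq R2.fastq \
     R1.paired.fq R1.unpaired.fq R2.paired.fq R2.unpaired.fq \
     ILLUMINACLIP:adapters.fa:2:30:10 MINLEN:35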

    Comment


    • #3
      Trim Galore!, a wrapper around Cutadapt, can also take care of the out-of-sync problem of paired-end trimming in --paired mode, very similar to Trimmomatic.
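A minimal sketch of a paired run (file names are placeholders); adding --retain_unpaired also writes reads whose mate was removed to separate files:

Code:
# Sketch only: placeholder file names.
trim_galore --paired --retain_unpaired R1.fastq R2.fastq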

      Comment


      • #4
Trim Galore looks very promising... although I am always a little sceptical of tools with an exclamation mark in their name... but I am happy to let that prejudice go if it does the job!

I did have a go at Trimmomatic, but could not figure out how to run it (after following the instructions from the website). Any suggestions on that front would be helpful.

        Comment


        • #5
          So Trim Galore looks to be doing exactly what I want with one caveat...

I want to be able to trim a different adapter from each read, i.e. adapter1 from read 1 and adapter2 from read 2. I could do these two steps separately using Cutadapt on R1 then R2, but with Trim Galore I can't (as far as I can see) supply a different sequence to trim for each read.

A workaround would be to run Cutadapt separately for each read, then hack the 'trimmed' files into the trim_galore script, which is a bit beyond my meagre Perl skills.

          Does anyone have other ideas?

          Thanks

          Comment


          • #6
I can adapt it first thing tomorrow morning so that you can supply two different adapters, if that's not too late?

            By the way, did you use the standard Illumina adapters? In this case running the default option should work just fine (as the sequence on both ends starts off the same).

            Comment


            • #7
That would be truly fantastic!

I didn't use the standard Illumina adapters, since the library prep is from HaloPlex and there is some additional 'stuff' to trim off that is different at each end.

Also, I might have missed it, but how do I get it to generate a log file?
I tried piping the output to a file but only got this:

              Length cut-off for read 1: 35 bp (default)
              Length cut-off for read 2: 35 bp (default)
              Writing final adapter and quality trimmed output to /home/ngsuser/NGS/DATA/raw_data/3_demultiplexed/s_G1_L001_R1.fastq_split_2_trimmed.fq

              Length cut-off for read 1: 35 bp (default)
              Length cut-off for read 2: 35 bp (default)
              Writing final adapter and quality trimmed output to /home/ngsuser/NGS/DATA/raw_data/3_demultiplexed/s_G1_L001_R2.fastq_split_2_trimmed.fq


              I'm running trim_galore as follows:

              trim_galore --paired --retain_unpaired --quality 15 --fastqc_args "--outdir $current_directory/logs/CutAdapt/" -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT ${input_fastq_path}/${input_fastq_R1} ${input_fastq_path}/${input_fastq_R2} > $current_directory/logs/CutAdapt/$sample_name'_R1_log'

              Many thanks

              Comment


              • #8
I have now added an option '-a2/--adapter2' which only affects read 2 of paired-end pairs. The new version of Trim Galore (v0.2.2) is available here; you might have to force a cache refresh of the website if it still shows the older version.
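For example (placeholder adapter sequences and file names):

Code:
# Sketch only: -a trims read 1, -a2 trims read 2.
trim_galore --paired -a ADAPTER_READ1 -a2 ADAPTER_READ2 R1.fastq R2.fastq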

                Could you try it out and let me know if something isn't working as expected?

                Comment


                • #9
Forgot to mention that Trim Galore prints a report for each file automatically, so you don't have to redirect the output anywhere. The read 2 report also contains information about the validation step right at the end, i.e. the number of reads that failed to validate, etc.

                  And sorry about the exclamation mark, it wasn't planned originally. Do you think a question mark would be more appropriate? :P

                  Comment


                  • #10
                    This seems to work with no issues, thanks very much.

With regard to the log file, trim_galore's default behaviour is to create it in the same folder as the original fastq. Is there any way to output it anywhere else? I could fairly easily get my script to copy the trimming_report to a different location, but wondered if your script had an 'output-to' option for the report.

                    Thanks,

                    Chris

                    Comment


                    • #11
                      Hi Chris,

It doesn't have that option right now (maybe in the next release). For the moment you could just locate the following line:

                      Code:
                      open (REPORT,'>',$report) ...
                      and change it to

                      Code:
                      open (REPORT,'>',"/your/path/$report")

                      Comment


                      • #12
                        Originally posted by swNGS View Post
I did have a go at Trimmomatic, but could not figure out how to run it (after following the instructions from the website). Any suggestions on that front would be helpful.
If you're still interested in Trimmomatic, please contact me (email on the Trimmomatic site) and I will try to help get you going.

                        Comment


                        • #13
                          TrimGalore allowing mismatches

                          Hi,

                          How would one change the cutadapt parameter for allowing mismatches through the TrimGalore command-line, or is this not possible currently?

I think the cutadapt parameter is -e, which specifies the ERROR_RATE allowed in an adapter match; I would like to change that when running Trim Galore.


                          Thanks!

                          Comment


                          • #14
                            Originally posted by vkpilla View Post
                            Hi,

                            How would one change the cutadapt parameter for allowing mismatches through the TrimGalore command-line, or is this not possible currently?

I think the cutadapt parameter is -e, which specifies the ERROR_RATE allowed in an adapter match; I would like to change that when running Trim Galore.


                            Thanks!
                            Hi vkpilla,

                            I have just added an option '-e ERROR RATE' to Trim Galore which should now work in the way you need it (you can download it from here: http://www.bioinformatics.babraham.a...s/trim_galore/)
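For example, to relax the error rate from cutadapt's default of 0.1 (the value 0.2 is just a placeholder):

Code:
# Sketch only: placeholder error rate and file names.
trim_galore --paired -e 0.2 R1.fastq R2.fastq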

                            Cheers,
                            Felix
                            Last edited by fkrueger; 07-31-2012, 07:21 AM. Reason: fixed broken link

                            Comment


                            • #15
As a fairly general solution to the "out of sync" fastq issue, the following commands work okay for me. However, this requires cdbtools (cdbfasta and cdbyank). Also, these presume you do not have the /1 and /2 suffixes after the read names to denote read number, and further that each of the fields of a fastq record is on a single line.

                              Code:
# Collect read names that occur in both trimmed files (uniq -d keeps duplicated names only)
grep -hn '^@' R1_in.clipped.fq R2_in.clipped.fq                              | \
   perl -F: -ane 'print if ($F[0]-1)%4==0;'                                  | \
   perl -ane 'my ($read_name)=m{^\d+:@([^/ ]+)[/ ]}; print $read_name,"\n";' | \
   sort                                                                      | \
   uniq -d > paired.list

# Index each fastq and extract only the reads listed in paired.list
cdbfasta -Q R1_in.clipped.fq
cdbyank  R1_in.clipped.cidx < paired.list > R1_in.synced.fq

cdbfasta -Q R2_in.clipped.fq
cdbyank  R2_in.clipped.cidx < paired.list > R2_in.synced.fq
If your fastq files use /1 and /2 suffixes on the read names to denote read number, this should work:

                              Code:
grep -hn '^@' R1_in.clipped.fq R2_in.clipped.fq                              | \
   perl -F: -ane 'print if ($F[0]-1)%4==0;'                                  | \
   perl -ane 'my ($read_name)=m{^\d+:@([^/ ]+)[/ ]}; print $read_name,"\n";' | \
   sort                                                                      | \
   uniq -d > paired.list

sed -e 's%.*%&/1%' paired.list > R1_sync.list
sed -e 's%.*%&/2%' paired.list > R2_sync.list

cdbfasta -Q R1_in.clipped.fq
cdbyank  R1_in.clipped.cidx < R1_sync.list > R1_in.synced.fq

cdbfasta -Q R2_in.clipped.fq
cdbyank  R2_in.clipped.cidx < R2_sync.list > R2_in.synced.fq
I actually like fastx_clipper. It seems less convoluted than Trimmomatic, but it does require re-syncing of the PE reads after running.
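For reference, the clipping step that would produce the *_in.clipped.fq files above might look something like this (the adapter sequence and length cut-off are placeholders; -Q33 assumes Sanger/Phred+33 qualities):

Code:
# Sketch only: placeholder adapter and length cut-off; -l discards reads shorter than 35 bp.
fastx_clipper -Q33 -a AGATCGGAAGAGC -l 35 -i R1_in.fq -o R1_in.clipped.fq
fastx_clipper -Q33 -a AGATCGGAAGAGC -l 35 -i R2_in.fq -o R2_in.clipped.fq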

                              --
                              Phillip

                              Comment
