  • Protaeus
    Member
    • Aug 2010
    • 21

    paired end fastq format in bwa

    In some examples that I've read for using bwa to analyze paired end data, a fastq for each member of the pair is included (in other words, R1.fastq and R2.fastq). Will bwa handle paired end data that is in a single fastq? The reads are denoted with \1 and \2.
  • dawe
    Senior Member
    • Apr 2009
    • 258

    #2
    Originally posted by Protaeus View Post
    In some examples that I've read for using bwa to analyze paired end data, a fastq for each member of the pair is included (in other words, R1.fastq and R2.fastq). Will bwa handle paired end data that is in a single fastq? The reads are denoted with \1 and \2.
    AFAIK no, it won't. You may separate reads into two different files, I guess with

    Code:
    $ grep -A2 ^@*1 filein.fq > reads_1.fq
    $ grep -A2 ^@*2 filein.fq > reads_2.fq
    d


    • kmcarr
      Senior Member
      • May 2008
      • 1181

      #3
      Originally posted by dawe View Post
      AFAIK no, it won't. You may separate reads into two different files, I guess with

      Code:
      $ grep -A2 ^@*1 filein.fq > reads_1.fq
      $ grep -A2 ^@*2 filein.fq > reads_2.fq
      d
      Not quite. First, FASTQ records are four lines long, so you have to collect the matched line and the 3 following (-A3). Second, your regular expression means "match 0 or more '@' at the beginning of a line, followed by a 1 (or 2)". You need to specify an "@" followed by 0 or more of any character (.*). You are also not anchoring the 1 or 2 to the end of the line. Finally, you need to enclose the regular expression in quotes. To get what you intended it should be:

      Code:
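      # GNU grep prints "--" between non-adjacent match groups; --no-group-separator suppresses that
      $ grep -A3 --no-group-separator '^@.*1$' filein.fq > reads_1.fq
      $ grep -A3 --no-group-separator '^@.*2$' filein.fq > reads_2.fq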
      There is however a hidden gotcha in this method. @, 1 and 2 are valid characters for the quality string if the FASTQ is Sanger (or Illumina prior to 1.5). This means that your grep could match a quality string and then write it and the next three lines as a FASTQ block. This will cause whatever program was trying to parse this to puke (from personal experience).

      In a random FASTQ file of ~20M reads I found 511 quality strings which were matched by these grep patterns. An incredibly small fraction to be sure, but you only need one to screw up your FASTQ file.
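
The behaviour of the two patterns discussed above can be checked on a toy example. This is a minimal sketch using Python's `re` module rather than grep itself; for these particular patterns the two regex dialects should agree, but treat that as an assumption. The eight FASTQ lines and the helper `matching_line_numbers` are made up for illustration, and the last quality string was chosen to start with "@" and end with "1".

```python
import re

# Hypothetical interleaved FASTQ: two records of four lines each.
# The quality string of the second record starts with "@" and ends with "1".
fastq_lines = [
    "@read1/1",
    "ACGTACGT",
    "+",
    "IIIIIIII",
    "@read1/2",
    "TTGGCCAA",
    "+",
    "@IIIIII1",
]

loose = re.compile(r"^@*1")     # pattern from the first attempt
strict = re.compile(r"^@.*1$")  # corrected pattern, anchored at both ends


def matching_line_numbers(pattern, lines):
    """Return the 1-based numbers of the lines that the pattern matches."""
    return [number for number, line in enumerate(lines, start=1)
            if pattern.search(line)]


print("loose :", matching_line_numbers(loose, fastq_lines))
print("strict:", matching_line_numbers(strict, fastq_lines))
```

On this input the loose pattern matches no line at all, while the strict pattern matches line 1 (a real header) and line 8 (a quality string), which is the false positive described above.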


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        For the reasons kmcarr gives (and other issues like this), personally I'd use a simple script using Biopython, BioPerl or similar rather than grep.
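
A record-aware script of the kind suggested here could look roughly like the following. It is a sketch using only the Python standard library, not Biopython or BioPerl, and it rests on two assumptions: every record is exactly four lines (no wrapped sequences), and each header line ends in 1 or 2 to mark the mate. The function names are hypothetical. Because it reads four lines at a time, a quality string that happens to start with "@" is never mistaken for a header.

```python
from itertools import islice


def read_fastq_records(handle):
    """Yield (header, sequence, separator, quality) tuples, four lines at a time."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record starting at {record[0]!r}")
        yield tuple(record)


def split_interleaved(in_path, out1_path, out2_path):
    """Write mate 1 and mate 2 records to separate files; return both counts."""
    counts = {"1": 0, "2": 0}
    with open(in_path) as src, \
            open(out1_path, "w") as out1, \
            open(out2_path, "w") as out2:
        outputs = {"1": out1, "2": out2}
        for record in read_fastq_records(src):
            mate = record[0][-1:]  # assumption: the header line ends in 1 or 2
            if mate not in outputs:
                raise ValueError(f"cannot tell mate from header {record[0]!r}")
            outputs[mate].write("\n".join(record) + "\n")
            counts[mate] += 1
    return counts["1"], counts["2"]
```

Calling `split_interleaved("filein.fq", "reads_1.fq", "reads_2.fq")` would then produce the two files the grep commands were aiming for. Headers in other styles (for example, a mate number after a space) would need a different rule for picking the mate.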


        • dawe
          Senior Member
          • Apr 2009
          • 258

          #5
          Originally posted by maubp View Post
          For the reasons kmcarr gives (and other issues like this), personally I'd use a simple script using Biopython, BioPerl or similar rather than grep.
          I wrote the wrong grep expression, my bad. Indeed, I usually grep for @XXXX, where XXXX is my machine ID, for most operations... Also, bwa doesn't use quality for alignment (so it will work with -A1 or -A3).
          Nevertheless, I believe grep is much faster than any BioPerl/Biopython script.

          d


          • barak
            Junior Member
            • Jun 2010
            • 9

            #6
            Hi. Just found this post in the GATK forum: http://gatkforums.broadinstitute.org...o-fastq-format
            Essentially, you can use BWA with interleaved BAM files containing info from both pairs. I know that was not exactly the question, but it is related, and hopefully will save time for some (as with my case).
