
  • Duplicate read names - BWA mem - paired reads have different names

    Hi,

    When running BWA mem on paired-end Illumina data, I get the following error (IDs replaced):



    [mem_sam_pe] paired reads have different names: "XXX:5:YYY:1:11102:4257:13510", "XXX:5:YYY:1:11102:15792:1058"

    I checked the FASTQ file and found that each read name occurs 7 times in the file (exact same name). However, the positions of the copies do not match between R1 and R2: in the example below, the third and fifth hits are at lines 962773 and 1164149 in R1 but at lines 1028309 and 1229685 in R2.

    Example:

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R1.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    962773:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1164149:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R2.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1028309:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1229685:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA


    Is it ok for a fastq file to have multiple reads with the same read name?
    If not, could this be a problem of BCL conversion?
    How can I fix it?


    Thanks for your help,
    Stephan


    PS: bwa mem command:

    bwa mem -t 40 -v 1 hg19.fa R1.fastq R2.fastq > aln.sam
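
For anyone hitting the same error, here is a minimal Python sketch of a pre-check that walks both files in parallel and reports the first record whose read IDs differ. It assumes uncompressed, well-formed 4-line FASTQ records and Illumina-style headers in which the read ID is everything before the first space. The function names `read_names` and `first_mismatch` are made up for illustration, and this only approximates the check BWA performs, it is not BWA's own code.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read ID of each 4-line FASTQ record.

    The ID is the header up to the first whitespace, without the '@'.
    """
    for i, line in enumerate(handle):
        if i % 4 == 0:  # header lines sit at positions 0, 4, 8, ...
            yield line[1:].split()[0]


def first_mismatch(r1_handle, r2_handle):
    """Return (record_index, name1, name2) for the first differing pair.

    Returns None if all names match. A missing record in the shorter
    file shows up as None in place of the name.
    """
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for idx, (n1, n2) in enumerate(pairs):
        if n1 != n2:
            return idx, n1, n2
    return None


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open("R1.fastq") etc.
    r1 = io.StringIO("@a 1:N:0:X\nAC\n+\nII\n@b 1:N:0:X\nGT\n+\nII\n")
    r2 = io.StringIO("@a 2:N:0:X\nAC\n+\nII\n@c 2:N:0:X\nGT\n+\nII\n")
    print(first_mismatch(r1, r2))
```

The record index is zero-based; multiplying it by 4 and adding 1 gives the header's line number as printed by `grep -n`.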

  • #2
    FASTQ headers should always start with an "@", so what you have is not following the standard. Have you asked the folks who gave you this data whether it has been post-processed in some way? And there should be no duplicates (let alone multiple copies) in raw sequence files, as far as the FASTQ header IDs are concerned.
    Last edited by GenoMax; 02-02-2016, 06:44 AM.



    • #3
      Hi,

      that's not the problem: the numbers in front of the "@" are the line numbers printed by grep -n. See the "head" output below (sequence and quality strings truncated) and the grep output I posted.

      > head R1.fastq
      @XXX:5:YYY:1:11101:12923:1051 1:N:0:AGGCAGAA+NCGATCTA
      CTT...TTC
      +
      AAA...</<
      @XXX:5:YYY:1:11101:4797:1055 1:N:0:AGGCAGAA+NCGATCTA
      ACC...CTA
      +
      AAA...<A/


      Thanks,
      Stephan



      • #4
        My apologies.

        If the order of the reads in your files is messed up, then you can "re-pair" the reads using the repair tool from the BBMap suite as follows:

        Code:
        $ repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
        That said, each FASTQ sequence header should be unique within every sequence file. If that is not the case, then there is something wrong with this data.
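
The uniqueness requirement above can be checked directly. The following is a minimal Python sketch, assuming an uncompressed, well-formed 4-line FASTQ file and treating the text before the first space in the header as the read ID. The function name `duplicated_names` is hypothetical, and all read IDs are held in memory, so very large files may need a different approach.

```python
import io
from collections import Counter


def duplicated_names(handle):
    """Return {read_id: count} for every read ID seen more than once.

    Only header lines (every 4th line, starting with the first) are
    inspected; the leading '@' and anything after the first whitespace
    are dropped.
    """
    counts = Counter(
        line[1:].split()[0]
        for i, line in enumerate(handle)
        if i % 4 == 0
    )
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open("R1.fastq").
    demo = io.StringIO(
        "@a 1:N:0:X\nAC\n+\nII\n"
        "@b 1:N:0:X\nGT\n+\nII\n"
        "@a 1:N:0:X\nAC\n+\nII\n"
    )
    print(duplicated_names(demo))
```

An empty result means every header in the file is unique.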



        • #5
          Thanks for your reply.

          I was also suspecting that the raw file is not ok.

          Best regards,
          Stephan



          • #6
            If the sequences and Q-scores are identical across those 7 copies, then you could potentially keep just one and throw away the other six.
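
A minimal Python sketch of that idea: keep the first copy of each read, and drop later copies only when their sequence and quality strings are identical to it. It assumes uncompressed, well-formed 4-line FASTQ records. The function name `dedupe_identical` is hypothetical, and the sketch keeps every sequence in memory, so it is only meant to illustrate the logic. Copies that differ from the first one are not written but are reported, so they can be inspected by hand.

```python
import io


def dedupe_identical(in_handle, out_handle):
    """Write the first copy of each read; skip later copies of the same ID.

    Returns the list of read IDs for which a later copy had a different
    sequence or quality string than the first copy.
    """
    seen = {}       # read ID -> (sequence line, quality line) of first copy
    conflicts = []  # read IDs whose copies are NOT identical
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        name = record[0][1:].split()[0]
        payload = (record[1], record[3])
        if name not in seen:
            seen[name] = payload
            out_handle.writelines(record)
        elif seen[name] != payload:
            conflicts.append(name)
    return conflicts


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass opened input/output files.
    src = io.StringIO("@a 1\nAC\n+\nII\n@a 1\nAC\n+\nII\n@b 1\nGG\n+\nII\n")
    dst = io.StringIO()
    print(dedupe_identical(src, dst))
    print(dst.getvalue(), end="")
```

Because the copies sit at different positions in R1 and R2, the two deduplicated files may still be out of sync and may need re-pairing afterwards.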

            I am puzzled by how this could have happened though. No logical explanation comes to mind.



            • #7
              It happened to me twice, and re-running the demultiplexing fixed the problem. I suspect it has something to do with the number of threads used to write the FASTQ data. Best, Daniele
