Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Improperly paired mates

    I have been attempting to map paired end data from 2x76nt sequencing runs of Solexa. I trim reads using scripts written by myself and others to remove any reads with poor quality and trim away adaptor sequences, then discard reads that are <18 nt after trimming. After running alignments with tophat and uploading to the UCSC genome browser I find that mated pairs align very close to each other but are not paired. When I click on one of the mates the description of the read says "

    Read name: HWUSI-EAS1712:4:1:827:1321#0
    Position: chr2:25159536-25159568
    Band: 2qA3
    Genomic Size: 33
    Alignment Quality: 255
    CIGAR string: 33M (33 (mis)Match)
    Tags: NM:0 NH:1 XS:-
    Flags: 0x49:
    (0x40) Read 1 of pair | (0x08) Mate is unmapped | (0x01) Not properly paired

    "


    but aligned above was also another read with the same ID

    "
    Read name: HWUSI-EAS1712:4:1:827:1321#0
    Position: chr2:25159535-25159563
    Band: 2qA3
    Genomic Size: 29
    Alignment Quality: 255
    CIGAR string: 29M (29 (mis)Match)
    Tags: NM:0 NH:1 XS:-
    Flags: 0x99:
    (0x80) Read 2 of pair | (0x10) Read is on '-' strand | (0x08) Mate is unmapped | (0x01) Not properly paired

    "

    Why are my mates not paired? Is this due to the fact that the reads in my -1 and -2 fastq files are not in the same order? If that is the problem, can anyone suggest a way to put my reads in the same order, since now after trimming there are differing numbers of reads in the -1 and -2 files??

    Thanks for any help you'd be able to offer.
    -Mike

  • #2
    Hi Mike,

    I had this problem when mapping some SOLiD data with TopHat. The unmatched reads were all reverse reads, so it was relatively straightforward to solve by sorting with the matched _2 reads first, then appending the unmatched reverse reads in the _2 file.

    To get the reads sorted in the same order, I did the following:
    1. Convert fasta format files to tab-delimited files (fasta2tab.py)
    2. Sort files
    3. Print matched _1 and _2 reads
    4. Print unmatched _2 reads
    5. Join matched_reads and unmatched_2 reads

    Code:
    ## Print matched _1 and _2 reads
    awk 'NR==FNR {a[$1]=$1;next} $1 in a {print $0; next}' 2_reads 1_reads > matched_reads_1
    awk 'NR==FNR {a[$1]=$1;next} $1 in a {print $0; next}' 1_reads 2_reads > matched_reads_2
    
    ## Print unmatched _2 reads
    awk 'NR==FNR {a[$1]=$1;next} !($1 in a) {print $0; next}' matched_reads 2_reads > unmatched_reads_2
    
    ## Join matched_reads and unmatched_2 reads
    cat matched_reads_2 unmatched_reads_2 > all_reads_2
    If there are also unmatched forward reads, I'm not sure how you could have these all in TopHat.

    Comment


    • #3
      AdamB, I couldn't get your awk commands to work but I was able to write something that sorts two files of paired fastq reads. Hopefully this helps someone else. Testing tophat now... I will report if it doesn't work still. This is very memory-intensive. sorry about that.



      #!/usr/bin/perl
      use warnings;
      use strict;

      my $file1 = $ARGV[0];
      my $file2 = $ARGV[1];

      my %SEQS;

      open (FI1, "<".$file1) ||die;
      open (FI2, "<".$file2) ||die;

      my $header = "";
      my $n =0;
      my @inone = ();
      my $pairn = "";
      while (<FI1>)
      {
      chomp;
      if ($n==0)
      {
      my $headeri = $_;
      $headeri =~ s/^\@//g;
      ($header, $pairn) = split (/\#/, $headeri);
      push @inone, $header;
      }elsif ($n==1)
      {
      $SEQS{$header}{1}{seq} = $_;
      }elsif ($n==2) {}#do nothing
      elsif ($n==3)
      {
      $SEQS{$header}{1}{qual} = $_;
      $n=0;
      next;
      }
      $n++;
      }
      my %both;
      my @intwo = ();
      while (<FI2>)
      {
      chomp;
      if ($n==0)
      {
      my $headeri = $_;
      $headeri =~ s/^\@//g;
      ($header, $pairn) = split (/\#/, $headeri);
      push @intwo, $header;
      if ($SEQS{$header})
      {
      $both{$header}=1;
      }
      }elsif ($n==1)
      {
      $SEQS{$header}{2}{seq} = $_;
      }elsif ($n==2) {}#do nothing

      elsif ($n==3)
      {
      $SEQS{$header}{2}{qual} = $_;
      $n=0;
      next;
      }
      $n++;
      }
      my $nsame = scalar (keys %both);
      print STDERR "there are $nsame reads that appear in both fastq files\n";
      open (OUT1, ">".$file1.".sorted");
      open (OUT2, ">".$file2.".sorted");
      foreach my $name (keys %both) #reads that appear in both
      {
      print OUT1 "@".$name."#0/1\n".$SEQS{$name}{1}{seq}."\n+".$name."#0/1\n".$SEQS{$name}{1}{qual}."\n";
      print OUT2 "@".$name."#0/2\n".$SEQS{$name}{2}{seq}."\n+".$name."#0/2\n".$SEQS{$name}{2}{qual}."\n";
      }

      foreach my $name (@inone)
      {
      unless ($both{$name})
      {
      print OUT1 "@".$name."#0/1\n".$SEQS{$name}{1}{seq}."\n+".$name."#0/1\n".$SEQS{$name}{1}{qual}."\n";
      }
      }

      foreach my $name (@intwo)
      {
      unless ($both{$name})
      {
      print OUT2 "@".$name."#0/2\n".$SEQS{$name}{2}{seq}."\n+".$name."#0/2\n".$SEQS{$name}{2}{qual}."\n";
      }
      }

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      30 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      32 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      28 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      52 views
      0 likes
      Last Post seqadmin  
      Working...
      X