**AdamB** · 02-07-2011, 03:11 AM

Hi Mike,

I had this problem when mapping some SOLiD data with TopHat. The unmatched reads were all reverse reads, so it was relatively straightforward to solve: put the matched _2 reads first, then append the unmatched reverse reads to the end of the _2 file.

To get the reads sorted in the same order, I did the following:
1. Convert fasta format files to tab-delimited files (fasta2tab.py)
2. Sort files
3. Print matched _1 and _2 reads
4. Print unmatched _2 reads
5. Join matched_reads and unmatched_2 reads
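
Step 1 relies on fasta2tab.py, which is not shown in this thread. As a rough illustration only, a conversion of that kind might look like the sketch below. The function name `fasta_to_tab`, the handling of `#` comment lines, and the sample reads are all assumptions made for the example, not taken from the actual script.

```python
import io


def fasta_to_tab(lines):
    """Yield 'name<TAB>sequence' rows from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comment lines carry no read data
        if line.startswith(">"):
            if name is not None:
                yield name + "\t" + "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # a sequence may wrap over several lines
    if name is not None:
        yield name + "\t" + "".join(chunks)


# Made-up reads, deliberately out of order, to show steps 1 and 2 together.
example = io.StringIO(">read_2\nT0123\n>read_1\nT32\n10\n")
rows = sorted(fasta_to_tab(example), key=lambda row: row.split("\t")[0])
print("\n".join(rows))
```

Sorting on the first column puts both files in the same read order, which is what the later steps depend on.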

Code:

## Print matched _1 and _2 reads
awk 'NR==FNR {a[$1]=$1;next} $1 in a {print $0; next}' 2_reads 1_reads > matched_reads_1
awk 'NR==FNR {a[$1]=$1;next} $1 in a {print $0; next}' 1_reads 2_reads > matched_reads_2

## Print unmatched _2 reads
awk 'NR==FNR {a[$1]=$1;next} !($1 in a) {print $0; next}' matched_reads_2 2_reads > unmatched_reads_2

## Join matched_reads and unmatched_2 reads
cat matched_reads_2 unmatched_reads_2 > all_reads_2
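
For anyone who finds the awk one-liners hard to follow, the same matched/unmatched split can be sketched in Python. This is only an illustration under the same assumption the awk commands make: the first tab-delimited column holds a read ID that is identical for both mates. The names `read_id` and `split_by_mate` and the sample rows are hypothetical.

```python
def read_id(row):
    """Return the first tab-delimited column, assumed to be the shared read ID."""
    return row.split("\t", 1)[0]


def split_by_mate(rows_1, rows_2):
    """Split rows into matched _1, matched _2 and unmatched _2 reads."""
    ids_1 = {read_id(r) for r in rows_1}
    ids_2 = {read_id(r) for r in rows_2}
    shared = ids_1 & ids_2
    matched_1 = [r for r in rows_1 if read_id(r) in shared]
    matched_2 = [r for r in rows_2 if read_id(r) in shared]
    unmatched_2 = [r for r in rows_2 if read_id(r) not in shared]
    return matched_1, matched_2, unmatched_2


# Made-up, already sorted rows; r2 has no forward mate.
rows_1 = ["r1\tAAA", "r3\tCCC"]
rows_2 = ["r1\tTTT", "r2\tGGG", "r3\tAAA"]

matched_1, matched_2, unmatched_2 = split_by_mate(rows_1, rows_2)
all_reads_2 = matched_2 + unmatched_2  # same effect as the final cat
print(all_reads_2)
```

Because both inputs are sorted the same way, the matched rows come out in the same order in both lists, and the unmatched reverse reads land at the end of the _2 output.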

If there are also unmatched forward reads, I'm not sure how you could include them all in a TopHat run.

**spikesd17** · 02-07-2011, 12:17 PM

AdamB, I couldn't get your awk commands to work, but I was able to write something that sorts two files of paired FASTQ reads. Hopefully this helps someone else. I'm testing TopHat now and will report back if it still doesn't work. The script is very memory-intensive (it keeps every read from both files in a hash), sorry about that.

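#!/usr/bin/perl
# Sort two paired FASTQ files so that reads present in both files come first,
# in the same order, followed by the reads found in only one file.
use warnings;
use strict;

die "Usage: $0 reads_1.fastq reads_2.fastq\n" unless @ARGV == 2;
my ($file1, $file2) = @ARGV;

my %SEQS;    # $SEQS{read name}{1 or 2}{seq or qual}

# Format one read as a four-line FASTQ record with a "#0/<mate>" suffix.
sub fastq_record
{
    my ($name, $mate) = @_;
    return '@' . $name . "#0/$mate\n" . $SEQS{$name}{$mate}{seq} . "\n"
         . '+' . $name . "#0/$mate\n" . $SEQS{$name}{$mate}{qual} . "\n";
}

open(my $fi1, "<", $file1) || die "Cannot open $file1: $!";
open(my $fi2, "<", $file2) || die "Cannot open $file2: $!";

my $header = "";
my $n      = 0;     # position within the current four-line FASTQ record
my @inone  = ();    # read names from file 1, in input order
my $pairn  = "";
while (<$fi1>)
{
    chomp;
    if ($n == 0)
    {
        # Header line: drop the leading "@" and split off the "#0/1" suffix.
        my $headeri = $_;
        $headeri =~ s/^\@//;
        ($header, $pairn) = split(/\#/, $headeri);
        push @inone, $header;
    }
    elsif ($n == 1)
    {
        $SEQS{$header}{1}{seq} = $_;
    }
    elsif ($n == 2) {}    # "+" separator line: do nothing
    elsif ($n == 3)
    {
        $SEQS{$header}{1}{qual} = $_;
        $n = 0;
        next;
    }
    $n++;
}
close($fi1);

my %both;           # read names seen in both files
my @intwo = ();     # read names from file 2, in input order
$n = 0;             # restart the record position for the second file
while (<$fi2>)
{
    chomp;
    if ($n == 0)
    {
        my $headeri = $_;
        $headeri =~ s/^\@//;
        ($header, $pairn) = split(/\#/, $headeri);
        push @intwo, $header;
        if (exists $SEQS{$header})
        {
            $both{$header} = 1;
        }
    }
    elsif ($n == 1)
    {
        $SEQS{$header}{2}{seq} = $_;
    }
    elsif ($n == 2) {}    # "+" separator line: do nothing
    elsif ($n == 3)
    {
        $SEQS{$header}{2}{qual} = $_;
        $n = 0;
        next;
    }
    $n++;
}
close($fi2);

my $nsame = scalar(keys %both);
print STDERR "there are $nsame reads that appear in both fastq files\n";

open(my $out1, ">", "$file1.sorted") || die "Cannot write $file1.sorted: $!";
open(my $out2, ">", "$file2.sorted") || die "Cannot write $file2.sorted: $!";

# Reads that appear in both files: written in the same order to both outputs.
foreach my $name (keys %both)
{
    print {$out1} fastq_record($name, 1);
    print {$out2} fastq_record($name, 2);
}

# Reads found only in file 1, in their original order.
foreach my $name (@inone)
{
    print {$out1} fastq_record($name, 1) unless $both{$name};
}

# Reads found only in file 2, in their original order.
foreach my $name (@intwo)
{
    print {$out2} fastq_record($name, 2) unless $both{$name};
}

close($out1) || die "Cannot close $file1.sorted: $!";
close($out2) || die "Cannot close $file2.sorted: $!";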
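
Since the script above keeps every sequence and quality string in memory, one possible way to cut memory use is to keep only read names and byte offsets, and to re-read the records from disk when writing. The Python sketch below illustrates that idea and is not a tested replacement. It assumes well-formed four-line FASTQ records and the same `@name#0/1` header layout the Perl script expects. The names `read_record` and `pair_fastq` and the demo reads are hypothetical. Unlike the Perl script, it copies the original header lines instead of rebuilding them.

```python
import os
import tempfile


def read_record(handle):
    """Read one four-line FASTQ record; return (read_id, lines) or None at EOF."""
    lines = [handle.readline() for _ in range(4)]
    if not lines[0]:
        return None
    lines = [line.rstrip(b"\r\n") + b"\n" for line in lines]
    read_id = lines[0][1:].rstrip(b"\n").split(b"#", 1)[0]  # drop "@" and "#0/1"
    return read_id, lines


def pair_fastq(path_1, path_2):
    """Write <path>.sorted files: shared reads first (same order), then the rest."""
    offsets_2 = {}  # read ID -> byte offset of its record in file 2
    with open(path_2, "rb") as fh2:
        while True:
            pos = fh2.tell()
            rec = read_record(fh2)
            if rec is None:
                break
            offsets_2[rec[0]] = pos

    ids_1 = set()
    with open(path_1, "rb") as fh1, open(path_2, "rb") as fh2, \
            open(path_1 + ".sorted", "wb") as out1, \
            open(path_2 + ".sorted", "wb") as out2:
        # Shared reads, in file 1 order, fetched from file 2 by offset.
        for read_id, lines in iter(lambda: read_record(fh1), None):
            ids_1.add(read_id)
            if read_id in offsets_2:
                out1.writelines(lines)
                fh2.seek(offsets_2[read_id])
                out2.writelines(read_record(fh2)[1])
        # Reads found in only one file, in their original order.
        fh1.seek(0)
        for read_id, lines in iter(lambda: read_record(fh1), None):
            if read_id not in offsets_2:
                out1.writelines(lines)
        fh2.seek(0)
        for read_id, lines in iter(lambda: read_record(fh2), None):
            if read_id not in ids_1:
                out2.writelines(lines)
    return len(ids_1.intersection(offsets_2))


# Small demo with made-up reads: "a" and "c" are paired, "b" and "d" are not.
tmp = tempfile.mkdtemp()
p1, p2 = os.path.join(tmp, "r1.fastq"), os.path.join(tmp, "r2.fastq")
with open(p1, "w") as f:
    f.write("@a#0/1\nACGT\n+\nIIII\n@b#0/1\nCCCC\n+\nJJJJ\n@c#0/1\nGGGG\n+\nKKKK\n")
with open(p2, "w") as f:
    f.write("@c#0/2\nTTTT\n+\nLLLL\n@d#0/2\nAAAA\n+\nMMMM\n@a#0/2\nCGCG\n+\nNNNN\n")
n_shared = pair_fastq(p1, p2)
print("reads in both files:", n_shared)
```

The trade-off is extra disk seeking on the second file in exchange for a much smaller memory footprint, so it may be slower on very large inputs.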

Improperly paired mates