**simonandrews** · 11-13-2010, 01:24 AM

We do this routinely, but not from SAM files so our code isn't directly applicable. Basically all we do is read through the aligned file recording all of the sequence IDs we see. Then we go back through the original fastq file printing out any entries for which we didn't see the ID in the aligned file.
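The two passes described above can be sketched in a few lines of Python. This is only an illustration under some assumptions, not the code referred to in the post: the function names are made up, the aligned file is assumed to be tab-delimited with the read name in the first column (as in SAM), and FastQ records are assumed to be exactly four lines each with an optional trailing `/1` or `/2` on the ID.

```python
import re
import sys


def read_aligned_ids(sam_lines):
    """Pass 1: collect read names (column 1), skipping '@' header lines."""
    ids = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        ids.add(line.split("\t", 1)[0].rstrip("\n"))
    return ids


def unaligned_records(fastq_lines, aligned_ids):
    """Pass 2: yield four-line FastQ records whose ID was never seen."""
    it = iter(fastq_lines)
    while True:
        record = [next(it, None) for _ in range(4)]
        if record[3] is None:  # end of file or truncated final record
            break
        # Drop the leading '@' and any trailing /1 or /2 (assumed ID format)
        read_id = re.sub(r"/\d$", "", record[0].rstrip("\n")[1:])
        if read_id not in aligned_ids:
            yield record


def main(sam_path, fastq_path):
    """Write the unaligned FastQ records to stdout."""
    with open(sam_path) as sam:
        seen = read_aligned_ids(sam)
    with open(fastq_path) as fastq:
        for record in unaligned_records(fastq, seen):
            sys.stdout.writelines(record)
```

Calling `main("accepted_hits.sam", "s_1_sequence.txt")` would print the unaligned records. Holding the IDs in a set keeps each lookup cheap, at the cost of keeping every aligned read name in memory.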

**KevinLam** · 11-13-2010, 06:22 AM

Check out http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow
I believe you can filter on the unmapped flag (I think it's 4)
and pipe the result through bamtofastq
to get what you need.
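For reference, 4 (0x4) is the bit in the SAM FLAG field that marks a read as unmapped, so the flag has to be tested as a bit rather than compared for equality. A minimal Python sketch of that test follows; the function names are illustrative, and it only helps when the aligner actually writes unmapped reads to its output.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(sam_line):
    """Return True if the FLAG (column 2) of an alignment line has bit 0x4 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & UNMAPPED)


def unmapped_lines(sam_lines):
    """Yield the alignment lines flagged as unmapped, skipping '@' headers."""
    for line in sam_lines:
        if not line.startswith("@") and is_unmapped(line):
            yield line
```

A paired read with flag 77 (1 + 4 + 8 + 64) is also caught, which a plain comparison against 4 would miss.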

**fhb** · 11-13-2010, 05:51 PM

Hi Simon,
Thanks very much. I was able to isolate all the sequence IDs, but I could not work out how to match the IDs from the alignment output against those in the fastq file. As a beginner in bioinformatics, I tried to do it with simple bash commands, e.g. grep and cut. Could you please share the part of your code that does the matching and prints what is not matched?

I do appreciate your first reply.
Thanks in advance.
Fernando

**fhb** · 11-13-2010, 05:52 PM

Hi Kevin,
As I mentioned before, TopHat does not print non-aligned reads.
Thanks for the link.
Fernando

**simonandrews** · 11-14-2010, 01:44 AM

Fernando,

I've just tried this code on a TopHat file I had lying around and it seemed to pull out the unmapped reads OK. It takes the names of a SAM file and a fastq file on the command line and writes the unmapped reads to stdout.

e.g.

script.pl accepted_hits.sam s_1_sequence.txt > unmapped.txt

Code:

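#!perl
use warnings;
use strict;

# Usage: script.pl accepted_hits.sam s_1_sequence.txt > unmapped.txt
my ($sam_file,$fastq_file) = @ARGV;
die "Usage: $0 <sam_file> <fastq_file>\n" unless (defined $fastq_file);

my $ids = read_sam($sam_file);

filter_fastq($ids,$fastq_file);

# Print every FastQ record whose ID was not seen in the SAM file
sub filter_fastq {

    warn "Filtering FastQ file\n";

    my ($ids,$fastq) = @_;

    open (my $in,'<',$fastq) or die "Can't read $fastq: $!";

    while (1) {
        # A FastQ record is four lines: ID, sequence, separator, qualities
        my $id1 = <$in>;
        my $seq = <$in>;
        my $id2 = <$in>;
        my $qual = <$in>;

        last unless (defined $qual);

        # Drop the leading '@' and any trailing /1 or /2 so the ID
        # matches the read name stored in the SAM file
        my $match_id = substr($id1,1);
        chomp $match_id;
        $match_id =~ s/\/\d$//;

        print $id1,$seq,$id2,$qual unless (exists $ids->{$match_id});
    }

    close $in;
}


# Collect the read name (first column) of every alignment in the SAM file
sub read_sam {

    warn "Reading found ids\n";
    my ($sam) = @_;

    my %ids;

    open (my $in,'<',$sam) or die "Can't read $sam: $!";
    while (<$in>) {
        next if (/^\@/);    # Skip SAM header lines
        my $id = (split/\t/)[0];
        $ids{$id} = 1;
    }

    close $in;
    return \%ids;
}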

**fhb** · 11-14-2010, 07:06 AM

Simon,
Thanks very much for once again sharing one of your helpful scripts. I appreciate it.
Best,
Fernando

**M&M** · 11-18-2011, 07:54 AM

Thanks Simon! I found this script handy as well.

**swbarnes2** · 11-18-2011, 09:24 AM

Picard is another program that will take a .bam and make a fastq from it.

help to create fastq file with non-aligned reads