Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • help to create fastq file with non-aligned reads

    Hi everyone,
    So far I have been reading posts only, and it has been really helpful.

    I am using TopHat to align Illumina sequences (SE 100bases) from a polyA+mRNA library. I would like to have a fastq file with the reads that were not aligned. TopHat does not have an option to output that (to my knowledge).

    Can anyone, please, indicate a way of creating that file? May be pointing directions or sharing a code.

    Thanks in advance.

  • #2
    We do this routinely, but not from SAM files so our code isn't directly applicable. Basically all we do is read through the aligned file recording all of the sequence IDs we see. Then we go back through the original fastq file printing out any entries for which we didn't see the ID in the aligned file.

    Comment


    • #3
      Check out http://code.google.com/p/hydra-sv/wiki/TypicalWorkflow
      I believe you can use unmapped as a flag (i think it's 4) filter
      and pipe it thru bamtofastq
      for what you need.
      http://kevin-gattaca.blogspot.com/

      Comment


      • #4
        Originally posted by simonandrews View Post
        We do this routinely, but not from SAM files so our code isn't directly applicable. Basically all we do is read through the aligned file recording all of the sequence IDs we see. Then we go back through the original fastq file printing out any entries for which we didn't see the ID in the aligned file.
        Hi Simon,
        Thanks very much. I was able to isolate all the sequence IDs, but when it came to matching the IDs between the file from the alignment output and the fastq, I could not do it. As a beginner in the bioinformatics field, I have tried to do it using simple bash commands. ie: grep, cut, etc. Could you please share the part of your code that does the matching and prints what is not matched?

        I do appreciate your first reply .
        Thanks in advance.
        Fernando

        Comment


        • #5
          Hi Kevin,
          as I mentioned before, TopHat does not print non aligned reads.
          Thanks for the link.
          Fernando

          Comment


          • #6
            Originally posted by fhb View Post
            Hi Simon,
            Thanks very much. I was able to isolate all the sequence IDs, but when it came to matching the IDs between the file from the alignment output and the fastq, I could not do it. As a beginner in the bioinformatics field, I have tried to do it using simple bash commands. ie: grep, cut, etc. Could you please share the part of your code that does the matching and prints what is not matched?

            I do appreciate your first reply .
            Thanks in advance.
            Fernando
            Fernando,

            I've just tried this code on a tophat file I had lying around and it seemed to filter out the unmapped reads OK. It takes the name of a SAM and fastq file on the command line and returns the unmapped reads on stdout.

            eg

            script.pl accepted_hits.sam s_1_sequence.txt > unmapped.txt

            Code:
            #!perl
            use warnings;
            use strict;
            
            my ($sam_file,$fastq_file) = @ARGV;
            
            my $ids = read_sam($sam_file);
            
            filter_fastq($ids,$fastq_file);
            
            sub filter_fastq {
            
                warn "Filtering FastQ file\n";
            
                my ($ids,$fastq) = @_;
            
                open (IN,$fastq) or die $!;
            
                while (1) {
            	my $id1 = <IN>;
            	my $seq = <IN>;
            	my $id2 = <IN>;
            	my $qual = <IN>;
            
            	last unless ($qual);
            
            	my $match_id = substr($id1,1);
            	chomp $match_id;
            	$match_id =~ s/\/\d$//;
            
            	print $id1,$seq,$id2,$qual unless (exists $ids->{$match_id});
                }
            }
            
            
            sub read_sam {
            
                warn "Reading found ids\n";
                my ($sam) = @_;
            
                my %ids;
            
                open (IN,$sam) or die $!;
                while (<IN>) {
            	next if (/^\@/);
            	my $id = (split/\t/)[0];
            	$ids{$id} = 1;
                }
            
                close IN;
                return \%ids;
            }

            Comment


            • #7
              Simon,
              thanks very much for one more time sharing one of your helpful scripts. I appreciate it.
              Best,
              Fernando

              Comment


              • #8
                Thanks Simon! I found this script handy as well.

                Comment


                • #9
                  Picard is another program that will take a .bam and make a fastq from it.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  30 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  32 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X