Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Un-align reads in BWA

    Hi All,
    I have RNA-seq data from human tissue samples. I want to align to RefSeq and then to UCSC and then to human genome in sequential order using BWA. But how can I store unalign sequences? For examples, unalign sequences from RefSeq reference alignment (which will be input sequences for UCSC)?
    Thank you so much for your help,
    Rakesh

  • #2
    Normally unaligned reads are stored in FASTQ, although you can also store unaligned reads in SAM/BAM which is supported by some alignment tools (including BWA).

    See also my blog post which might be of interest:
    I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format . I will try to explain, as thi...


    Are you asking in general how to go from aligned reads in SAM/BAM to unaligned reads?

    Comment


    • #3
      Hi Peter,

      Thanks for your detailed response and very interesting blog.

      Specifically, I want to align my RNA-seq data (in fastq format) using BWA with RefSeq reference database and at same time I want to store un-align reads (in fastq or SAM format) in separate file. As BWA does not have any option to store un-align reads. Can you please guide me?

      Thanks again for your kind help,

      Rakesh

      Comment


      • #4
        This is a perl script that will take your bam file and your original fastq file and generate an fastq file containing only your unaligned reads.

        One caveat, you can't have any unaligned reads in your bam file.

        I should also note that I did not write this script. If I remember correctly it was written by Simon Anders, the author of DESeq, DEXseq, and other useful things. I think he posted it on here sometime ago, but I have no idea which thread anymore.


        Code:
        #!/usr/bin/perl
        use warnings;
        use strict;
        
        my ($fastq,$sam,$outfile) = @ARGV;
        
        unless ($outfile) {
          die "Usage is filter_unmapped_reads.pl [FastQ file] [SAM File] [File for unmapped reads]\n";
        }
        
        if (-e $outfile) {
          die "Won't overwrite an existing file, delete it first!";
        }
        
        open (FASTQ,$fastq) or die "Can't open fastq file: $!";
        open (SAM,$sam) or die "Can't open SAM file: $!";
        open (OUT,'>',$outfile) or die "Can't write to $outfile: $!";
        
        my $ids = read_ids();
        
        filter_fastq($ids);
        
        close OUT or die "Can't write to $outfile: $!";
        
        
        sub filter_fastq {
        
          warn "Filtering FastQ file\n";
        
          my ($ids) = @_;
        
          while (<FASTQ>) {
        
            if (/^@(\S+)/) {
              my $id = $1;
        
              # Remove the end designator from paired end reads
              $id =~ s/\/\d+$//;
        
              my $seq = <FASTQ>;
              my $id2 = <FASTQ>;
              my $qual = <FASTQ>;
        
        
              unless (exists $ids->{$id}) {
        	print OUT $_,$seq,$id2,$qual;
              }
            }
            else {
              warn "Line '$_' should have been an id line, but wasn't\n";
            }
        
          }
        
        }
        
        
        sub read_ids {
        
          warn "Collecting mapped ids\n";
        
          my $ids;
        
          while (<SAM>) {
        
            next if (/^@/);
            my ($id) = split(/\t/);
            $ids->{$id} = 1;
          }
        
          return $ids;
        }
        To use:

        Code:
        $ perl unaligned_reads.pl <in.fastq> <in.bam> <out.fastq>

        Comment


        • #5
          Ah, here is the thread where Simon originally posted it:

          Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

          Comment


          • #6
            Hi Chadn,

            Thank you so much for your detailed response.

            Best wishes,

            Rakesh

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin




              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
              04-22-2024, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 08:47 AM
            0 responses
            16 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            60 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            54 views
            0 likes
            Last Post seqadmin  
            Working...
            X