Getting sequences into a bam file - psl2sam.pl

Adamo

Member

Join Date: Jun 2010
Posts: 28

Getting sequences into a bam file - psl2sam.pl

07-16-2010, 04:06 AM

Hello,

I have a psl file, and convert it into sam format using psl2sam.pl (samtools).
I'd like to know if there is a way of getting the target sequence in the sam file in place of the "*" placeholder?
pslx format doesn't work either.

I have found the code of some script that should do the job, but it doesn't work. Here is the script, if someone wants to see it:

Code:

#!/usr/bin/perl -w

# Contact: nlw
#this script combines a sam file with the sequences from a fasta file.  originally designed as a second step when converting blat (psl) to sam.


$| = 1;

## SJC: Generic debug variable and print.
my $DEBUG = 1;
sub kvetch{
    my $foo = shift || '';
    print STDERR $foo . "\n" if $DEBUG;
}


use strict;
use warnings;
use Getopt::Std;
use File::Copy;
use Switch;

my $lines_read = 0;
my $lines_written = 0;
my @filearray;
my $lines_in_file = 0;
my $start = time();
my @buffer;

($lines_read, $lines_written, $lines_in_file) = &bowtie2sam;
my $stop = time();
my $time = $stop-$start;
print STDERR "Done. Processed in " . $time . " sec \n";

exit;

sub bowtie2sam {
  my %opts = ();
  die("Usage: tab_merge_sam.pl <seq> <sam>\n") if (@ARGV == 0 && -t STDIN);
  my $seq_file = $ARGV[0];
  my $sam_file = $ARGV[1];

  my $tmp_seq = $seq_file . ".tmp";
  my $tmp_sam = $sam_file . ".tmp";
  my $merged_sam = $sam_file . ".seq_merged";

  print STDERR "Sorting sequence file \"$seq_file\"... ";
  `grep -v sequence $seq_file | sort > $tmp_seq`;
 
  print STDERR "Done.\n";

  print STDERR "Sorting sam file \"$sam_file\"... ";
  `sort $sam_file > $tmp_sam`;

  print STDERR "Done.\n";

  open(SAM, $tmp_sam);
  open(SEQ, $tmp_seq);


  my $wc_seq = `wc -l $tmp_seq`;
  my $lines_in_seq = (split(" ", $wc_seq))[0];
  my $lines_read_sam = 0;
  my $lines_written = 0;
  my $read_not_found = 0;
  my $reads_matched = 0;
  my $lines_read_seq = 0;

  print STDERR "there are " . $lines_in_seq . " lines in the file \"" . $seq_file . "\"\n";
  print STDERR "Merging...\n";
  my $sam_id = "";

  #there may be multiple alignments reported
  #for each of the reads in the tab file, find them in the sam file, add the hit counter

  while (<SEQ>) {
      $lines_read_seq++;
      my $seq_line = $_;
      chomp($seq_line);
      $seq_line =~ s/\s+/\t/g;
      my ($seq_id, $read_sequence) = split(/\t/,$seq_line);

      my $seq_found = 0;
      my $too_big = 0;
      my $seq_hits = 0; #counter for number of alignments for each sequence
      #print STDERR "line $lines_read_seq seq is $seq_id\n";
      my $flag = 0;
      while (!$flag) {
          last if eof(SAM);
          my $sam_line = <SAM>;  
          next if ($sam_line =~ /^\s*$/);
          my $line_length = length($sam_line);
          chomp($sam_line);
          $sam_line =~ s/\s+/\t/g;
          my @sam = split(/\t/, $sam_line);
          my $sam_id = $sam[0];
          my $strand = ($sam[1] == 0);
          #if ($sam_line =~ m/\Q$seq_id/) {  
          $lines_read_sam++;
          #print STDERR "line $lines_read_sam sam is $sam_id\n";
          if ($sam_id eq $seq_id) {
              #add sequence to sam array
              #$sam[9] = $strand ? $read_sequence : reverse_complement($read_sequence) ;
              if ($strand) {
                  $sam[9] = $read_sequence;
              } else {
                  #print STDERR "rc for read $sam_id at line $lines_read_sam\n";
                  $sam[9] = reverse_complement($read_sequence);
              }
              $seq_hits++;  
              $sam[11] .= " HI:i:" . $seq_hits;
              $reads_matched++;
              #print STDERR "matched $seq_id at line $lines_read_sam of sam file\n";
              print STDOUT join("\t", @sam) . "\n";      
          } elsif ($sam_id gt $seq_id) {
              $lines_read_sam--;
              #backup
              seek(SAM,-$line_length,1);
              #print STDERR "backing up to line $lines_read_sam of sam file\n";
              $flag = 1;
          } else {
              print STDOUT join("\t", @sam) . "\n";      
              die;
          }  
      }
  }
  print STDERR "++++++++++++++++++++++++++++++++++\n";
  print STDERR "SAM read: $lines_read_sam\n";
  print STDERR "reads matched: $reads_matched\n";
  print STDERR "reads not found: $read_not_found\n";
  print STDERR "++++++++++++++++++++++++++++++++++\n";
  close(SAM);
  close(SEQ);
  unlink ($tmp_sam);
  unlink ($tmp_seq);
}

sub reverse_complement {
    my ($sequence) = @_;
    my $rc_sequence = "";
    my @array = split(//, $sequence);
    foreach my $c (@array) {
        switch ($c) {
            case "A" {$c = "T"}
            case "T" {$c = "A"}
            case "G" {$c = "C"}
            case "C" {$c = "G"}
        }
        $rc_sequence = $c . $rc_sequence
    }
    return $rc_sequence;
}

Thanks in advance!

Tags: None

Previous template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 13 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Getting sequences into a bam file - psl2sam.pl

Latest Articles

ad_right_rmr

News