Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Thanks ECO.. it was original. (=

    In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
    The more you know, the more you know you don't know. —Aristotle

    Comment


    • #17
      Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

      Here's some sample usage:

      rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
      Processing file s_8_1_export.txt...
      Total collapsed unique aligned reads 656
      Completed processing file s_8_1_export.txt


      Hope it helps!

      Code:
      #!/usr/bin/perl
      
      #
       # Created by Richard de Borja on 2011-06-10.
       # Copyright (C) 2011 The Ontario Institute for Cancer Research
       #
       # This program is free software; you can redistribute it and/or modify
       # it under the terms of the GNU General Public License as published by
       # the Free Software Foundation; either version 3 of the License, or (at
       # your option) any later version.
       #
       # This program is distributed in the hope that it will be useful, but
       # WITHOUT ANY WARRANTY; without even the implied warranty of
       # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       # General Public License for more details.
       #
       # You should have received a copy of the GNU General Public License
       # along with this program; if not, write to the Free Software
       # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
       # MA 02110-1301, USA.
      #
      # $Id export2bed.pl rdeborja$
      
      use strict;
      use Carp;
      use Getopt::Long;
      use Pod::Usage;
      use Data::Dumper;
      
      # command line default arguments
      our %opts = (
      	# list of arguments and default values
      	"bed"               => undef,
      	"length"            => 101,
      
      );
      
      our %_exportfield = (
          # list of columns for export file
          # See CASAVA 1.7 User Guide page 61
          "machine"           => 0,
          "run_number"        => 1,
          "lane"              => 2,
          "tile"              => 3,
          "x_coord_cluster"   => 4,
          "y_coord_cluster"   => 5,
          "index_seq"         => 6,
          "read_number"       => 7,
          "sequence"          => 8,
          "quality"           => 9,
          "match_chr"         => 10,
          "match_contig"      => 11,
          "match_position"    => 12,
          "match_strand"      => 13,
          "match_desc"        => 14,
          "single_read_score" => 15,
          "paired_read_score" => 16,
          "partner_chr"       => 17,
          "partner_contig"    => 18,
          "partner_offset"    => 19,
          "partner_strand"    => 20,
          "filtering"         => 21,
      );
      
      my $result = main();
      exit $result;
      
      ##############################
      # Main
      ##############################
      sub main {
      	# get command line arguments
      	GetOptions(
      		\%opts,
      		"help|?",
      		"man",
      		"bed=s",
      		"length:i",
      	) or pod2usage(64);		# 64=>EX_USAGE
      
      	pod2usage(1) if $opts{'help'};
      	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};
      
      	while(my ($arg, $value) = each(%opts)) {
      		if(!defined $value) {
      			print "ERROR: Missing argument $arg\n";
      			pod2usage(128);
      		}
      	}
          
          my $read_length = $opts{'length'};
          my %unique_read_hash;
          my $output_file = $opts{'bed'};
          
          foreach my $file (@ARGV) {
              print "Processing file $file...\n";
              my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
              print "Completed processing file $file\n";
          }
          
          my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);
      
      	return 0;
      }
      
      ##############################
      # Functions
      ##############################
      
      sub process_export_file {
          my $export_file = shift;
          my $read_length = shift;
          my $unique_read_hash = shift;
          my $output_file;
          
          my $output_file = $export_file;
          $output_file =~ s/txt/bed/;
          
          open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
          
          while(my $line = <$ifh>) {
              $line =~ s/^\s+//;
              $line =~ s/\s+$//;
              
              my @input_line = split(/\t/, $line);
              
              my $chr = $input_line[$_exportfield{'match_chr'}];
              my $start = $input_line[$_exportfield{'match_position'}];
              my $end = $start + $read_length - 1;
              my $pf = $input_line[$_exportfield{'filtering'}];
              
              # Skip any reads that do not align, do not uniquely align, or pass 
              # the Illumina filtering
              if (($chr =~ /chr/) && ($pf eq "Y")) {
                  $chr =~ s/\.fa//g;
                  my $bed_line = join("_", $chr, $start, $end);
                  $unique_read_hash->{$bed_line} = 1;
              } else {
                  next;
              }
          }
          close($ifh);
          
          #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);
      
          my $count = scalar(keys(%$unique_read_hash));
          print "Total collapsed unique aligned reads $count\n";
          return 0;
      }
      
      
      sub print_bed_file {
          my $bed_file = shift;
          my $read_hash = shift;
          
          open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
          foreach my $bed_entry (sort(keys(%$read_hash))) {
              my ($chr, $start, $end) = split("_", $bed_entry);
              print {$ofh} "$chr\t$start\t$end\n";
          }
          close($ofh);
      }
      
      __END__
      
      =head1 NAME
      
      export2bed.pl
      
      =head1 SYNOPSIS
      
      B<export2bed.pl> [options] [file ...]
      
      	Options:
      	--help			brief help message
      	--man			full documentation
      	--bed           name of output BED file
      	--length        read length (optional: default 101bp)
      
      =head1 OPTIONS
      
      =over 8
      
      =item B<--help>
      
      Print a brief help message and exit.
      
      =item B<--man>
      
      Print the manual page.
      
      =item B<--bed>
      
      Name of output BED file
      
      =item B<--length>
      
      Read length (optional, default is 101bp)
      
      =back
      
      =head1 DESCRIPTION
      
      B<export2bed.pl> convert an Illumina export.txt file to a BED file
      
      =head1 EXAMPLES
      
      export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt
      
      =head1 AUTHORS
      
      The Ontario Institute for Cancer Research
      
      =head1 SEE ALSO
      
      =cut
      Originally posted by joseph View Post
      Can anybody share an algorithm that takes eland files and make bed files out of them?
      Thanks
      Joseph

      Comment


      • #18
        Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

        Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

        perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

        Code:
        #!/usr/bin/perl
        # Program to convert eland export format to BED format
        # for running MACS
        # Chris Seidel, June 2009
        #
        # Requires tab delim file of chromosome names in the format:
        # chr_name chr_length
        # corrects for alignments that go off the ends of the chrs
        # negative bases are trimmed to 1, 
        # bases > chr_length are set to chr_length
        # (I know the former exist, I don't know if the latter exist)
        # results are not sorted, but can be sorted in linux by:
        # sort -o infile.bed -k1,1 -k2,2n infile.bed
        # (sort in place, first column, then by second column numeric)
        
        use File::Basename;
        
        die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
        
        # get info on chromosomes
        open(cmap, $ARGV[0]) || die("no chromosome length file!");
        %chrmap = {};
        while($line = <cmap>){
            chomp($line);
            ($chrom, $size) = split(/\t/, $line);
            $chrmap{$chrom} = $chrom;
            $chrsize{$chrom} = $size;
        }
        
        # open input file
        open(fp, $ARGV[1]) || die("can't open eland file");
        
        $lines = 0;
        while(<fp>){
            chop;
            ++$lines;
            @bits = split(/\t/);
            # skip reads that didn't pass filtering
            next if($bits[21] eq "N");
            # get match name
            $seqname = $bits[10];
            # skip No Matches or QC failures
            # next if($seqname =~ /NM|QC/);
            # skip repeat matches
            # next if($seqname =~ /\d+:\d+:\d+/);
            # we're only interested in sequences that match our chrs
        
            $chrom = $bits[10];
            $seqlen = length($bits[8]);
            next unless(exists($chrmap{$chrom}));
            $start = $bits[12];
            $end = $start + $seqlen - 1;
            $strand = $bits[13];
            #print "start:$start end:$end strand:$strand\n";
        
            # parse match descriptor
            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
            # skip reads beyond a certain threshold
            next if($n > 2);
            $read_code = "U".$n;
        
            # correct for alignments off the chromosome ends
            if( $start <= 0 ){
           print STDERR "start less than or equal to 0:   ", $start, "\n";
           print STDERR join("\t", @bits), "\n";
           $start = 1;
            }
        
            if($end > $chrsize{$seqname}){
           print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
           print STDERR join("\t", @bits), "\n";
           $end = $chrsize{$seqname};
            }
        
            if($strand eq "F"){
           $strand = "+";
           $color = "0,0,255";
            }
            else{
           $strand = "-";
           $color = "255,0,0";
            }
        
            $score = 1;
            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
            my $chr = $chrmap{$seqname};
            $chr =~ s/\.fa$//g;
            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
        
            # give some feedback
            print STDERR "$lines processed\n" if(!($lines % 100000));
        }

        Comment


        • #19
          Xenopus table

          Originally posted by mgogol View Post
          Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

          Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

          perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

          .........

          [/CODE]
          Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

          Comment


          • #20
            Originally posted by apfejes View Post

            The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
            Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

            Comment


            • #21
              The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

              Comment


              • #22
                Originally posted by mgogol View Post
                The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.
                Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

                Comment


                • #23
                  Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

                  Comment


                  • #24
                    Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look
                    Attached Files

                    Comment


                    • #25
                      Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

                      Code:
                      #!/usr/bin/perl
                      # Program to convert eland export format to BED format
                      # for running MACS
                      # Chris Seidel, June 2009
                      #
                      # Requires tab delim file of chromosome names in the format:
                      # chr_name chr_length
                      # corrects for alignments that go off the ends of the chrs
                      # negative bases are trimmed to 1, 
                      # bases > chr_length are set to chr_length
                      # (I know the former exist, I don't know if the latter exist)
                      # results are not sorted, but can be sorted in linux by:
                      # sort -o infile.bed -k1,1 -k2,2n infile.bed
                      # (sort in place, first column, then by second column numeric)
                      
                      use File::Basename;
                      
                      die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
                      
                      # get info on chromosomes
                      open(cmap, $ARGV[0]) || die("no chromosome length file!");
                      %chrmap = {};
                      while($line = <cmap>){
                          chomp($line);
                          ($chrom, $size) = split(/\t/, $line);
                          $chrmap{$chrom} = $chrom;
                          $chrsize{$chrom} = $size;
                      }
                      
                      # open input file
                      open(fp, $ARGV[1]) || die("can't open eland file");
                      
                      $lines = 0;
                      while(<fp>){
                          chop;
                          ++$lines;
                          @bits = split(/\t/);
                          # skip reads that didn't pass filtering
                          next if($bits[21] eq "N");
                          # get match name
                          $seqname = $bits[11];
                          # skip No Matches or QC failures
                          # next if($seqname =~ /NM|QC/);
                          # skip repeat matches
                          # next if($seqname =~ /\d+:\d+:\d+/);
                          # we're only interested in sequences that match our chrs
                      
                          $chrom = $bits[11];
                          $seqlen = length($bits[8]);
                          next unless(exists($chrmap{$chrom}));
                          $start = $bits[12];
                          $end = $start + $seqlen - 1;
                          $strand = $bits[13];
                          #print "start:$start end:$end strand:$strand\n";
                      
                          # parse match descriptor
                          $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                          # skip reads beyond a certain threshold
                          next if($n > 2);
                          $read_code = "U".$n;
                      
                          # correct for alignments off the chromosome ends
                          if( $start <= 0 ){
                         print STDERR "start less than or equal to 0:   ", $start, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $start = 1;
                          }
                      
                          if($end > $chrsize{$seqname}){
                         print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $end = $chrsize{$seqname};
                          }
                      
                          if($strand eq "F"){
                         $strand = "+";
                         $color = "0,0,255";
                          }
                          else{
                         $strand = "-";
                         $color = "255,0,0";
                          }
                      
                          $score = 1;
                          $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                          my $chr = $chrmap{$seqname};
                          $chr =~ s/\.fa$//g;
                          print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                      
                          # give some feedback
                          print STDERR "$lines processed\n" if(!($lines % 100000));
                      }

                      Comment


                      • #26
                        It works now! Thank you. I should have figured it out myself, of course....

                        Comment


                        • #27
                          Yay! That's okay. Glad it works now.

                          Comment


                          • #28
                            #!/usr/bin/perl
                            #modified by zzz2010, no need chromosome length file
                            # Program to convert eland export format to BED format
                            # for running MACS
                            # Chris Seidel, June 2009
                            #
                            # Requires tab delim file of chromosome names in the format:
                            # chr_name chr_length
                            # corrects for alignments that go off the ends of the chrs
                            # negative bases are trimmed to 1,
                            # bases > chr_length are set to chr_length
                            # (I know the former exist, I don't know if the latter exist)
                            # results are not sorted, but can be sorted in linux by:
                            # sort -o infile.bed -k1,1 -k2,2n infile.bed
                            # (sort in place, first column, then by second column numeric)

                            use File::Basename;

                            die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);



                            # open input file
                            open(fp, $ARGV[0]) || die("can't open eland file");

                            $lines = 0;
                            while(<fp>){
                            chop;
                            ++$lines;
                            @bits = split(/\t/);
                            # skip reads that didn't pass filtering
                            next if($bits[21] eq "N");
                            # get match name
                            $seqname = $bits[10];
                            # skip No Matches or QC failures
                            # next if($seqname =~ /NM|QC/);
                            # skip repeat matches
                            # next if($seqname =~ /\d+:\d+:\d+/);
                            # we're only interested in sequences that match our chrs

                            $chrom = $bits[10];
                            $seqlen = length($bits[8]);
                            $start = $bits[12];
                            $end = $start + $seqlen - 1;
                            $strand = $bits[13];
                            #print "start:$start end:$end strand:$strand\n";

                            # parse match descriptor
                            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                            # skip reads beyond a certain threshold
                            next if($n > 2);
                            $read_code = "U".$n;

                            # correct for alignments off the chromosome ends
                            if( $start <= 0 ){
                            next;
                            }



                            if($strand eq "F"){
                            $strand = "+";
                            $color = "0,0,255";
                            }
                            else{
                            $strand = "-";
                            $color = "255,0,0";
                            }

                            $score = 1;
                            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                            my $chr = $seqname;
                            $chr =~ s/\.fa$//g;
                            $chr =~ s/[\S]+[Cc]hr/chr/g;
                            if (index($chr, "chr") != -1)
                            {
                            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                            }
                            # give some feedback
                            print STDERR "$lines processed\n" if(!($lines % 100000));
                            }

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Advancing Precision Medicine for Rare Diseases in Children
                              by seqadmin




                              Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                              12-16-2024, 07:57 AM
                            • seqadmin
                              Recent Advances in Sequencing Technologies
                              by seqadmin



                              Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                              Long-Read Sequencing
                              Long-read sequencing has seen remarkable advancements,...
                              12-02-2024, 01:49 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 12-17-2024, 10:28 AM
                            0 responses
                            26 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-13-2024, 08:24 AM
                            0 responses
                            42 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-12-2024, 07:41 AM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-11-2024, 07:45 AM
                            0 responses
                            42 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X