Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • apfejes
    Senior Member
    • Feb 2008
    • 236

    #16
    Thanks ECO.. it was original. (=

    In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
    The more you know, the more you know you don't know. —Aristotle

    Comment

    • rdeborja
      Member
      • Aug 2008
      • 42

      #17
      Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

      Here's some sample usage:

      rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
      Processing file s_8_1_export.txt...
      Total collapsed unique aligned reads 656
      Completed processing file s_8_1_export.txt


      Hope it helps!

      Code:
      #!/usr/bin/perl
      
      #
       # Created by Richard de Borja on 2011-06-10.
       # Copyright (C) 2011 The Ontario Institute for Cancer Research
       #
       # This program is free software; you can redistribute it and/or modify
       # it under the terms of the GNU General Public License as published by
       # the Free Software Foundation; either version 3 of the License, or (at
       # your option) any later version.
       #
       # This program is distributed in the hope that it will be useful, but
       # WITHOUT ANY WARRANTY; without even the implied warranty of
       # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       # General Public License for more details.
       #
       # You should have received a copy of the GNU General Public License
       # along with this program; if not, write to the Free Software
       # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
       # MA 02110-1301, USA.
      #
      # $Id export2bed.pl rdeborja$
      
      use strict;
      use Carp;
      use Getopt::Long;
      use Pod::Usage;
      use Data::Dumper;
      
      # command line default arguments
      our %opts = (
      	# list of arguments and default values
      	"bed"               => undef,
      	"length"            => 101,
      
      );
      
      our %_exportfield = (
          # list of columns for export file
          # See CASAVA 1.7 User Guide page 61
          "machine"           => 0,
          "run_number"        => 1,
          "lane"              => 2,
          "tile"              => 3,
          "x_coord_cluster"   => 4,
          "y_coord_cluster"   => 5,
          "index_seq"         => 6,
          "read_number"       => 7,
          "sequence"          => 8,
          "quality"           => 9,
          "match_chr"         => 10,
          "match_contig"      => 11,
          "match_position"    => 12,
          "match_strand"      => 13,
          "match_desc"        => 14,
          "single_read_score" => 15,
          "paired_read_score" => 16,
          "partner_chr"       => 17,
          "partner_contig"    => 18,
          "partner_offset"    => 19,
          "partner_strand"    => 20,
          "filtering"         => 21,
      );
      
      my $result = main();
      exit $result;
      
      ##############################
      # Main
      ##############################
      sub main {
      	# get command line arguments
      	GetOptions(
      		\%opts,
      		"help|?",
      		"man",
      		"bed=s",
      		"length:i",
      	) or pod2usage(64);		# 64=>EX_USAGE
      
      	pod2usage(1) if $opts{'help'};
      	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};
      
      	while(my ($arg, $value) = each(%opts)) {
      		if(!defined $value) {
      			print "ERROR: Missing argument $arg\n";
      			pod2usage(128);
      		}
      	}
          
          my $read_length = $opts{'length'};
          my %unique_read_hash;
          my $output_file = $opts{'bed'};
          
          foreach my $file (@ARGV) {
              print "Processing file $file...\n";
              my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
              print "Completed processing file $file\n";
          }
          
          my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);
      
      	return 0;
      }
      
      ##############################
      # Functions
      ##############################
      
      sub process_export_file {
          my $export_file = shift;
          my $read_length = shift;
          my $unique_read_hash = shift;
          my $output_file;
          
          my $output_file = $export_file;
          $output_file =~ s/txt/bed/;
          
          open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
          
          while(my $line = <$ifh>) {
              $line =~ s/^\s+//;
              $line =~ s/\s+$//;
              
              my @input_line = split(/\t/, $line);
              
              my $chr = $input_line[$_exportfield{'match_chr'}];
              my $start = $input_line[$_exportfield{'match_position'}];
              my $end = $start + $read_length - 1;
              my $pf = $input_line[$_exportfield{'filtering'}];
              
              # Skip any reads that do not align, do not uniquely align, or pass 
              # the Illumina filtering
              if (($chr =~ /chr/) && ($pf eq "Y")) {
                  $chr =~ s/\.fa//g;
                  my $bed_line = join("_", $chr, $start, $end);
                  $unique_read_hash->{$bed_line} = 1;
              } else {
                  next;
              }
          }
          close($ifh);
          
          #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);
      
          my $count = scalar(keys(%$unique_read_hash));
          print "Total collapsed unique aligned reads $count\n";
          return 0;
      }
      
      
      sub print_bed_file {
          my $bed_file = shift;
          my $read_hash = shift;
          
          open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
          foreach my $bed_entry (sort(keys(%$read_hash))) {
              my ($chr, $start, $end) = split("_", $bed_entry);
              print {$ofh} "$chr\t$start\t$end\n";
          }
          close($ofh);
      }
      
      __END__
      
      =head1 NAME
      
      export2bed.pl
      
      =head1 SYNOPSIS
      
      B<export2bed.pl> [options] [file ...]
      
      	Options:
      	--help			brief help message
      	--man			full documentation
      	--bed           name of output BED file
      	--length        read length (optional: default 101bp)
      
      =head1 OPTIONS
      
      =over 8
      
      =item B<--help>
      
      Print a brief help message and exit.
      
      =item B<--man>
      
      Print the manual page.
      
      =item B<--bed>
      
      Name of output BED file
      
      =item B<--length>
      
      Read length (optional, default is 101bp)
      
      =back
      
      =head1 DESCRIPTION
      
      B<export2bed.pl> convert an Illumina export.txt file to a BED file
      
      =head1 EXAMPLES
      
      export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt
      
      =head1 AUTHORS
      
      The Ontario Institute for Cancer Research
      
      =head1 SEE ALSO
      
      =cut
      Originally posted by joseph View Post
      Can anybody share an algorithm that takes eland files and make bed files out of them?
      Thanks
      Joseph

      Comment

      • mgogol
        Senior Member
        • Mar 2008
        • 197

        #18
        Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

        Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

        perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

        Code:
        #!/usr/bin/perl
        # Program to convert eland export format to BED format
        # for running MACS
        # Chris Seidel, June 2009
        #
        # Requires tab delim file of chromosome names in the format:
        # chr_name chr_length
        # corrects for alignments that go off the ends of the chrs
        # negative bases are trimmed to 1, 
        # bases > chr_length are set to chr_length
        # (I know the former exist, I don't know if the latter exist)
        # results are not sorted, but can be sorted in linux by:
        # sort -o infile.bed -k1,1 -k2,2n infile.bed
        # (sort in place, first column, then by second column numeric)
        
        use File::Basename;
        
        die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
        
        # get info on chromosomes
        open(cmap, $ARGV[0]) || die("no chromosome length file!");
        %chrmap = {};
        while($line = <cmap>){
            chomp($line);
            ($chrom, $size) = split(/\t/, $line);
            $chrmap{$chrom} = $chrom;
            $chrsize{$chrom} = $size;
        }
        
        # open input file
        open(fp, $ARGV[1]) || die("can't open eland file");
        
        $lines = 0;
        while(<fp>){
            chop;
            ++$lines;
            @bits = split(/\t/);
            # skip reads that didn't pass filtering
            next if($bits[21] eq "N");
            # get match name
            $seqname = $bits[10];
            # skip No Matches or QC failures
            # next if($seqname =~ /NM|QC/);
            # skip repeat matches
            # next if($seqname =~ /\d+:\d+:\d+/);
            # we're only interested in sequences that match our chrs
        
            $chrom = $bits[10];
            $seqlen = length($bits[8]);
            next unless(exists($chrmap{$chrom}));
            $start = $bits[12];
            $end = $start + $seqlen - 1;
            $strand = $bits[13];
            #print "start:$start end:$end strand:$strand\n";
        
            # parse match descriptor
            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
            # skip reads beyond a certain threshold
            next if($n > 2);
            $read_code = "U".$n;
        
            # correct for alignments off the chromosome ends
            if( $start <= 0 ){
           print STDERR "start less than or equal to 0:   ", $start, "\n";
           print STDERR join("\t", @bits), "\n";
           $start = 1;
            }
        
            if($end > $chrsize{$seqname}){
           print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
           print STDERR join("\t", @bits), "\n";
           $end = $chrsize{$seqname};
            }
        
            if($strand eq "F"){
           $strand = "+";
           $color = "0,0,255";
            }
            else{
           $strand = "-";
           $color = "255,0,0";
            }
        
            $score = 1;
            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
            my $chr = $chrmap{$seqname};
            $chr =~ s/\.fa$//g;
            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
        
            # give some feedback
            print STDERR "$lines processed\n" if(!($lines % 100000));
        }

        Comment

        • diablito
          Junior Member
          • Apr 2012
          • 5

          #19
          Xenopus table

          Originally posted by mgogol View Post
          Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

          Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

          perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

          .........

          [/CODE]
          Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

          Comment

          • diablito
            Junior Member
            • Apr 2012
            • 5

            #20
            Originally posted by apfejes View Post

            The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
            Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

            Comment

            • mgogol
              Senior Member
              • Mar 2008
              • 197

              #21
              The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

              Comment

              • diablito
                Junior Member
                • Apr 2012
                • 5

                #22
                Originally posted by mgogol View Post
                The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.
                Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

                Comment

                • mgogol
                  Senior Member
                  • Mar 2008
                  • 197

                  #23
                  Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

                  Comment

                  • diablito
                    Junior Member
                    • Apr 2012
                    • 5

                    #24
                    Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look
                    Attached Files

                    Comment

                    • mgogol
                      Senior Member
                      • Mar 2008
                      • 197

                      #25
                      Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

                      Code:
                      #!/usr/bin/perl
                      # Program to convert eland export format to BED format
                      # for running MACS
                      # Chris Seidel, June 2009
                      #
                      # Requires tab delim file of chromosome names in the format:
                      # chr_name chr_length
                      # corrects for alignments that go off the ends of the chrs
                      # negative bases are trimmed to 1, 
                      # bases > chr_length are set to chr_length
                      # (I know the former exist, I don't know if the latter exist)
                      # results are not sorted, but can be sorted in linux by:
                      # sort -o infile.bed -k1,1 -k2,2n infile.bed
                      # (sort in place, first column, then by second column numeric)
                      
                      use File::Basename;
                      
                      die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
                      
                      # get info on chromosomes
                      open(cmap, $ARGV[0]) || die("no chromosome length file!");
                      %chrmap = {};
                      while($line = <cmap>){
                          chomp($line);
                          ($chrom, $size) = split(/\t/, $line);
                          $chrmap{$chrom} = $chrom;
                          $chrsize{$chrom} = $size;
                      }
                      
                      # open input file
                      open(fp, $ARGV[1]) || die("can't open eland file");
                      
                      $lines = 0;
                      while(<fp>){
                          chop;
                          ++$lines;
                          @bits = split(/\t/);
                          # skip reads that didn't pass filtering
                          next if($bits[21] eq "N");
                          # get match name
                          $seqname = $bits[11];
                          # skip No Matches or QC failures
                          # next if($seqname =~ /NM|QC/);
                          # skip repeat matches
                          # next if($seqname =~ /\d+:\d+:\d+/);
                          # we're only interested in sequences that match our chrs
                      
                          $chrom = $bits[11];
                          $seqlen = length($bits[8]);
                          next unless(exists($chrmap{$chrom}));
                          $start = $bits[12];
                          $end = $start + $seqlen - 1;
                          $strand = $bits[13];
                          #print "start:$start end:$end strand:$strand\n";
                      
                          # parse match descriptor
                          $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                          # skip reads beyond a certain threshold
                          next if($n > 2);
                          $read_code = "U".$n;
                      
                          # correct for alignments off the chromosome ends
                          if( $start <= 0 ){
                         print STDERR "start less than or equal to 0:   ", $start, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $start = 1;
                          }
                      
                          if($end > $chrsize{$seqname}){
                         print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $end = $chrsize{$seqname};
                          }
                      
                          if($strand eq "F"){
                         $strand = "+";
                         $color = "0,0,255";
                          }
                          else{
                         $strand = "-";
                         $color = "255,0,0";
                          }
                      
                          $score = 1;
                          $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                          my $chr = $chrmap{$seqname};
                          $chr =~ s/\.fa$//g;
                          print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                      
                          # give some feedback
                          print STDERR "$lines processed\n" if(!($lines % 100000));
                      }

                      Comment

                      • diablito
                        Junior Member
                        • Apr 2012
                        • 5

                        #26
                        It works now! Thank you. I should have figured it out myself, of course....

                        Comment

                        • mgogol
                          Senior Member
                          • Mar 2008
                          • 197

                          #27
                          Yay! That's okay. Glad it works now.

                          Comment

                          • zzz2010
                            Junior Member
                            • Mar 2010
                            • 1

                            #28
                            #!/usr/bin/perl
                            #modified by zzz2010, no need chromosome length file
                            # Program to convert eland export format to BED format
                            # for running MACS
                            # Chris Seidel, June 2009
                            #
                            # Requires tab delim file of chromosome names in the format:
                            # chr_name chr_length
                            # corrects for alignments that go off the ends of the chrs
                            # negative bases are trimmed to 1,
                            # bases > chr_length are set to chr_length
                            # (I know the former exist, I don't know if the latter exist)
                            # results are not sorted, but can be sorted in linux by:
                            # sort -o infile.bed -k1,1 -k2,2n infile.bed
                            # (sort in place, first column, then by second column numeric)

                            use File::Basename;

                            die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);



                            # open input file
                            open(fp, $ARGV[0]) || die("can't open eland file");

                            $lines = 0;
                            while(<fp>){
                            chop;
                            ++$lines;
                            @bits = split(/\t/);
                            # skip reads that didn't pass filtering
                            next if($bits[21] eq "N");
                            # get match name
                            $seqname = $bits[10];
                            # skip No Matches or QC failures
                            # next if($seqname =~ /NM|QC/);
                            # skip repeat matches
                            # next if($seqname =~ /\d+:\d+:\d+/);
                            # we're only interested in sequences that match our chrs

                            $chrom = $bits[10];
                            $seqlen = length($bits[8]);
                            $start = $bits[12];
                            $end = $start + $seqlen - 1;
                            $strand = $bits[13];
                            #print "start:$start end:$end strand:$strand\n";

                            # parse match descriptor
                            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                            # skip reads beyond a certain threshold
                            next if($n > 2);
                            $read_code = "U".$n;

                            # correct for alignments off the chromosome ends
                            if( $start <= 0 ){
                            next;
                            }



                            if($strand eq "F"){
                            $strand = "+";
                            $color = "0,0,255";
                            }
                            else{
                            $strand = "-";
                            $color = "255,0,0";
                            }

                            $score = 1;
                            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                            my $chr = $seqname;
                            $chr =~ s/\.fa$//g;
                            $chr =~ s/[\S]+[Cc]hr/chr/g;
                            if (index($chr, "chr") != -1)
                            {
                            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                            }
                            # give some feedback
                            print STDERR "$lines processed\n" if(!($lines % 100000));
                            }

                            Comment

                            Latest Articles

                            Collapse

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Yesterday, 10:09 AM
                            0 responses
                            10 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-04-2026, 08:59 AM
                            0 responses
                            21 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 12:03 PM
                            0 responses
                            27 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 11:40 AM
                            0 responses
                            22 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...