Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Thanks ECO.. it was original. (=

    In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
    The more you know, the more you know you don't know. —Aristotle

    Comment


    • #17
      Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

      Here's some sample usage:

      rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
      Processing file s_8_1_export.txt...
      Total collapsed unique aligned reads 656
      Completed processing file s_8_1_export.txt


      Hope it helps!

      Code:
      #!/usr/bin/perl
      
      #
       # Created by Richard de Borja on 2011-06-10.
       # Copyright (C) 2011 The Ontario Institute for Cancer Research
       #
       # This program is free software; you can redistribute it and/or modify
       # it under the terms of the GNU General Public License as published by
       # the Free Software Foundation; either version 3 of the License, or (at
       # your option) any later version.
       #
       # This program is distributed in the hope that it will be useful, but
       # WITHOUT ANY WARRANTY; without even the implied warranty of
       # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       # General Public License for more details.
       #
       # You should have received a copy of the GNU General Public License
       # along with this program; if not, write to the Free Software
       # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
       # MA 02110-1301, USA.
      #
      # $Id export2bed.pl rdeborja$
      
      use strict;
      use Carp;
      use Getopt::Long;
      use Pod::Usage;
      use Data::Dumper;
      
      # command line default arguments
      our %opts = (
      	# list of arguments and default values
      	"bed"               => undef,
      	"length"            => 101,
      
      );
      
      our %_exportfield = (
          # list of columns for export file
          # See CASAVA 1.7 User Guide page 61
          "machine"           => 0,
          "run_number"        => 1,
          "lane"              => 2,
          "tile"              => 3,
          "x_coord_cluster"   => 4,
          "y_coord_cluster"   => 5,
          "index_seq"         => 6,
          "read_number"       => 7,
          "sequence"          => 8,
          "quality"           => 9,
          "match_chr"         => 10,
          "match_contig"      => 11,
          "match_position"    => 12,
          "match_strand"      => 13,
          "match_desc"        => 14,
          "single_read_score" => 15,
          "paired_read_score" => 16,
          "partner_chr"       => 17,
          "partner_contig"    => 18,
          "partner_offset"    => 19,
          "partner_strand"    => 20,
          "filtering"         => 21,
      );
      
      my $result = main();
      exit $result;
      
      ##############################
      # Main
      ##############################
      sub main {
      	# get command line arguments
      	GetOptions(
      		\%opts,
      		"help|?",
      		"man",
      		"bed=s",
      		"length:i",
      	) or pod2usage(64);		# 64=>EX_USAGE
      
      	pod2usage(1) if $opts{'help'};
      	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};
      
      	while(my ($arg, $value) = each(%opts)) {
      		if(!defined $value) {
      			print "ERROR: Missing argument $arg\n";
      			pod2usage(128);
      		}
      	}
          
          my $read_length = $opts{'length'};
          my %unique_read_hash;
          my $output_file = $opts{'bed'};
          
          foreach my $file (@ARGV) {
              print "Processing file $file...\n";
              my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
              print "Completed processing file $file\n";
          }
          
          my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);
      
      	return 0;
      }
      
      ##############################
      # Functions
      ##############################
      
      sub process_export_file {
          my $export_file = shift;
          my $read_length = shift;
          my $unique_read_hash = shift;
          my $output_file;
          
          my $output_file = $export_file;
          $output_file =~ s/txt/bed/;
          
          open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
          
          while(my $line = <$ifh>) {
              $line =~ s/^\s+//;
              $line =~ s/\s+$//;
              
              my @input_line = split(/\t/, $line);
              
              my $chr = $input_line[$_exportfield{'match_chr'}];
              my $start = $input_line[$_exportfield{'match_position'}];
              my $end = $start + $read_length - 1;
              my $pf = $input_line[$_exportfield{'filtering'}];
              
              # Skip any reads that do not align, do not uniquely align, or pass 
              # the Illumina filtering
              if (($chr =~ /chr/) && ($pf eq "Y")) {
                  $chr =~ s/\.fa//g;
                  my $bed_line = join("_", $chr, $start, $end);
                  $unique_read_hash->{$bed_line} = 1;
              } else {
                  next;
              }
          }
          close($ifh);
          
          #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);
      
          my $count = scalar(keys(%$unique_read_hash));
          print "Total collapsed unique aligned reads $count\n";
          return 0;
      }
      
      
      sub print_bed_file {
          my $bed_file = shift;
          my $read_hash = shift;
          
          open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
          foreach my $bed_entry (sort(keys(%$read_hash))) {
              my ($chr, $start, $end) = split("_", $bed_entry);
              print {$ofh} "$chr\t$start\t$end\n";
          }
          close($ofh);
      }
      
      __END__
      
      =head1 NAME
      
      export2bed.pl
      
      =head1 SYNOPSIS
      
      B<export2bed.pl> [options] [file ...]
      
      	Options:
      	--help			brief help message
      	--man			full documentation
      	--bed           name of output BED file
      	--length        read length (optional: default 101bp)
      
      =head1 OPTIONS
      
      =over 8
      
      =item B<--help>
      
      Print a brief help message and exit.
      
      =item B<--man>
      
      Print the manual page.
      
      =item B<--bed>
      
      Name of output BED file
      
      =item B<--length>
      
      Read length (optional, default is 101bp)
      
      =back
      
      =head1 DESCRIPTION
      
      B<export2bed.pl> convert an Illumina export.txt file to a BED file
      
      =head1 EXAMPLES
      
      export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt
      
      =head1 AUTHORS
      
      The Ontario Institute for Cancer Research
      
      =head1 SEE ALSO
      
      =cut
      Originally posted by joseph View Post
      Can anybody share an algorithm that takes eland files and make bed files out of them?
      Thanks
      Joseph

      Comment


      • #18
        Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

        Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

        perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

        Code:
        #!/usr/bin/perl
        # Program to convert eland export format to BED format
        # for running MACS
        # Chris Seidel, June 2009
        #
        # Requires tab delim file of chromosome names in the format:
        # chr_name chr_length
        # corrects for alignments that go off the ends of the chrs
        # negative bases are trimmed to 1, 
        # bases > chr_length are set to chr_length
        # (I know the former exist, I don't know if the latter exist)
        # results are not sorted, but can be sorted in linux by:
        # sort -o infile.bed -k1,1 -k2,2n infile.bed
        # (sort in place, first column, then by second column numeric)
        
        use File::Basename;
        
        die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
        
        # get info on chromosomes
        open(cmap, $ARGV[0]) || die("no chromosome length file!");
        %chrmap = {};
        while($line = <cmap>){
            chomp($line);
            ($chrom, $size) = split(/\t/, $line);
            $chrmap{$chrom} = $chrom;
            $chrsize{$chrom} = $size;
        }
        
        # open input file
        open(fp, $ARGV[1]) || die("can't open eland file");
        
        $lines = 0;
        while(<fp>){
            chop;
            ++$lines;
            @bits = split(/\t/);
            # skip reads that didn't pass filtering
            next if($bits[21] eq "N");
            # get match name
            $seqname = $bits[10];
            # skip No Matches or QC failures
            # next if($seqname =~ /NM|QC/);
            # skip repeat matches
            # next if($seqname =~ /\d+:\d+:\d+/);
            # we're only interested in sequences that match our chrs
        
            $chrom = $bits[10];
            $seqlen = length($bits[8]);
            next unless(exists($chrmap{$chrom}));
            $start = $bits[12];
            $end = $start + $seqlen - 1;
            $strand = $bits[13];
            #print "start:$start end:$end strand:$strand\n";
        
            # parse match descriptor
            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
            # skip reads beyond a certain threshold
            next if($n > 2);
            $read_code = "U".$n;
        
            # correct for alignments off the chromosome ends
            if( $start <= 0 ){
           print STDERR "start less than or equal to 0:   ", $start, "\n";
           print STDERR join("\t", @bits), "\n";
           $start = 1;
            }
        
            if($end > $chrsize{$seqname}){
           print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
           print STDERR join("\t", @bits), "\n";
           $end = $chrsize{$seqname};
            }
        
            if($strand eq "F"){
           $strand = "+";
           $color = "0,0,255";
            }
            else{
           $strand = "-";
           $color = "255,0,0";
            }
        
            $score = 1;
            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
            my $chr = $chrmap{$seqname};
            $chr =~ s/\.fa$//g;
            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
        
            # give some feedback
            print STDERR "$lines processed\n" if(!($lines % 100000));
        }

        Comment


        • #19
          Xenopus table

          Originally posted by mgogol View Post
          Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

          Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

          perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

          .........

          [/CODE]
          Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line

          Comment


          • #20
            Originally posted by apfejes View Post

            The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
            Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((

            Comment


            • #21
              The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.

              Comment


              • #22
                Originally posted by mgogol View Post
                The .fai file can be generated with samtools faidx on the genome fasta file. Sorry I didn't see this earlier, my thread subscription email was set up wrong.
                Thank you. I have no Samtools installed, but I found the Samtools:Fastaindex utility at the Genepattern site. So, I am running your script....and getting empty outputs, after quite intensive memory use and normal exit of the script. Any suggestions? Please?

                Comment


                • #23
                  Hm, I don't know, it's quite likely the eland export format has changed. If you post part of a file somewhere I could try to figure it out.

                  Comment


                  • #24
                    Oh, thanks. On the contrary, it is a pretty old file, circa 2009. I attached the index and a chunk of the export. I'd greatly appreciate, if you can take a look
                    Attached Files

                    Comment


                    • #25
                      Looks like I'm looking in a different column than your chromosome name seems to be in... Try this:

                      Code:
                      #!/usr/bin/perl
                      # Program to convert eland export format to BED format
                      # for running MACS
                      # Chris Seidel, June 2009
                      #
                      # Requires tab delim file of chromosome names in the format:
                      # chr_name chr_length
                      # corrects for alignments that go off the ends of the chrs
                      # negative bases are trimmed to 1, 
                      # bases > chr_length are set to chr_length
                      # (I know the former exist, I don't know if the latter exist)
                      # results are not sorted, but can be sorted in linux by:
                      # sort -o infile.bed -k1,1 -k2,2n infile.bed
                      # (sort in place, first column, then by second column numeric)
                      
                      use File::Basename;
                      
                      die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);
                      
                      # get info on chromosomes
                      open(cmap, $ARGV[0]) || die("no chromosome length file!");
                      %chrmap = {};
                      while($line = <cmap>){
                          chomp($line);
                          ($chrom, $size) = split(/\t/, $line);
                          $chrmap{$chrom} = $chrom;
                          $chrsize{$chrom} = $size;
                      }
                      
                      # open input file
                      open(fp, $ARGV[1]) || die("can't open eland file");
                      
                      $lines = 0;
                      while(<fp>){
                          chop;
                          ++$lines;
                          @bits = split(/\t/);
                          # skip reads that didn't pass filtering
                          next if($bits[21] eq "N");
                          # get match name
                          $seqname = $bits[11];
                          # skip No Matches or QC failures
                          # next if($seqname =~ /NM|QC/);
                          # skip repeat matches
                          # next if($seqname =~ /\d+:\d+:\d+/);
                          # we're only interested in sequences that match our chrs
                      
                          $chrom = $bits[11];
                          $seqlen = length($bits[8]);
                          next unless(exists($chrmap{$chrom}));
                          $start = $bits[12];
                          $end = $start + $seqlen - 1;
                          $strand = $bits[13];
                          #print "start:$start end:$end strand:$strand\n";
                      
                          # parse match descriptor
                          $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                          # skip reads beyond a certain threshold
                          next if($n > 2);
                          $read_code = "U".$n;
                      
                          # correct for alignments off the chromosome ends
                          if( $start <= 0 ){
                         print STDERR "start less than or equal to 0:   ", $start, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $start = 1;
                          }
                      
                          if($end > $chrsize{$seqname}){
                         print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
                         print STDERR join("\t", @bits), "\n";
                         $end = $chrsize{$seqname};
                          }
                      
                          if($strand eq "F"){
                         $strand = "+";
                         $color = "0,0,255";
                          }
                          else{
                         $strand = "-";
                         $color = "255,0,0";
                          }
                      
                          $score = 1;
                          $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                          my $chr = $chrmap{$seqname};
                          $chr =~ s/\.fa$//g;
                          print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                      
                          # give some feedback
                          print STDERR "$lines processed\n" if(!($lines % 100000));
                      }

                      Comment


                      • #26
                        It works now! Thank you. I should have figured it out myself, of course....

                        Comment


                        • #27
                          Yay! That's okay. Glad it works now.

                          Comment


                          • #28
                            #!/usr/bin/perl
                            #modified by zzz2010, no need chromosome length file
                            # Program to convert eland export format to BED format
                            # for running MACS
                            # Chris Seidel, June 2009
                            #
                            # Requires tab delim file of chromosome names in the format:
                            # chr_name chr_length
                            # corrects for alignments that go off the ends of the chrs
                            # negative bases are trimmed to 1,
                            # bases > chr_length are set to chr_length
                            # (I know the former exist, I don't know if the latter exist)
                            # results are not sorted, but can be sorted in linux by:
                            # sort -o infile.bed -k1,1 -k2,2n infile.bed
                            # (sort in place, first column, then by second column numeric)

                            use File::Basename;

                            die("usage: $0 eland_export.txt") unless(scalar(@ARGV) == 1);



                            # open input file
                            open(fp, $ARGV[0]) || die("can't open eland file");

                            $lines = 0;
                            while(<fp>){
                            chop;
                            ++$lines;
                            @bits = split(/\t/);
                            # skip reads that didn't pass filtering
                            next if($bits[21] eq "N");
                            # get match name
                            $seqname = $bits[10];
                            # skip No Matches or QC failures
                            # next if($seqname =~ /NM|QC/);
                            # skip repeat matches
                            # next if($seqname =~ /\d+:\d+:\d+/);
                            # we're only interested in sequences that match our chrs

                            $chrom = $bits[10];
                            $seqlen = length($bits[8]);
                            $start = $bits[12];
                            $end = $start + $seqlen - 1;
                            $strand = $bits[13];
                            #print "start:$start end:$end strand:$strand\n";

                            # parse match descriptor
                            $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
                            # skip reads beyond a certain threshold
                            next if($n > 2);
                            $read_code = "U".$n;

                            # correct for alignments off the chromosome ends
                            if( $start <= 0 ){
                            next;
                            }



                            if($strand eq "F"){
                            $strand = "+";
                            $color = "0,0,255";
                            }
                            else{
                            $strand = "-";
                            $color = "255,0,0";
                            }

                            $score = 1;
                            $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
                            my $chr = $seqname;
                            $chr =~ s/\.fa$//g;
                            $chr =~ s/[\S]+[Cc]hr/chr/g;
                            if (index($chr, "chr") != -1)
                            {
                            print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";
                            }
                            # give some feedback
                            print STDERR "$lines processed\n" if(!($lines % 100000));
                            }

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-27-2024, 06:37 PM
                            0 responses
                            13 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-27-2024, 06:07 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            69 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X