SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Is there a BED file format validator? Does a BED file have to be sorted position? LauraSmith Bioinformatics 3 05-21-2013 11:54 AM
Eland to Bed __sequence Illumina/Solexa 0 06-08-2011 04:07 AM
How to convert eland file to BED format huo Bioinformatics 1 01-19-2011 10:43 AM
Are Solexa GAPipeline v.1.0 ELAND results realignable with new v.1.4 ELAND module? marlei Bioinformatics 1 10-15-2009 05:51 AM
BED to ELAND/bowtie/MAQ ? ewilbanks Bioinformatics 1 08-21-2009 12:25 AM

Reply
 
Thread Tools
Old 10-04-2008, 11:25 AM   #1
joseph
Member
 
Location: ca

Join Date: Feb 2008
Posts: 38
Default Eland-to-Bed algorithm

Can anybody share an algorithm that takes eland files and make bed files out of them?
Thanks
Joseph
joseph is offline   Reply With Quote
Old 10-05-2008, 07:26 PM   #2
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,295
Default

If you don't get any help...post a few lines of the eland file and i can try to slap something together.
ECO is offline   Reply With Quote
Old 10-05-2008, 07:55 PM   #3
jlli
Member
 
Location: FL

Join Date: Jun 2008
Posts: 19
Default

ElandtoBed.jar at FindPeak package (http://www.bcgsc.ca/platform/bioinfo/software/findpeaks)
jlli is offline   Reply With Quote
Old 10-05-2008, 07:58 PM   #4
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,295
Default

java...*shudders*
ECO is offline   Reply With Quote
Old 10-05-2008, 08:23 PM   #5
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

Hey - Java's not THAT bad. I like that I can get 600%+ CPU usage with it, without any explicit multi-threading.

Anyhow, I have much better translators, now, in the Vancouver Short Read Analysis Package.... but they're still java. :P

The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
__________________
The more you know, the more you know you don't know. —Aristotle
apfejes is offline   Reply With Quote
Old 06-08-2011, 04:14 AM   #6
__sequence
Junior Member
 
Location: EU

Join Date: Jun 2011
Posts: 7
Default

apfejes, could you please tell, what is the last downloadable version of your program, and what is an example of a command line to use it? I need to convert a single file (whole genome, not divided into chromosomes) as provided by the Eland export, convert it to .Bed
__sequence is offline   Reply With Quote
Old 06-08-2011, 06:46 AM   #7
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 190
Default perl script export2bed.pl

Here's a script from a colleague that I've used before.

Code:
#!/usr/bin/perl
# Program to convert eland export format to BED format
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome or contig names 
# (eland fa match files) in the format:
# UCSC_chr_name chr_length eland_name
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k 1,1 -k 2,2n infile.bed
# (sort in place, first column, then by second column numeric)

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# create output filename
$outfile = $ARGV[1];
$outfile =~ s/\.txt$/\.bed/;
open(FOUT, ">$outfile") || die("can't open output file: $outfile");

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome name mapping file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($newval, $size, $oldval) = split(/\t/, $line);
    $chrmap{$oldval} = $newval;
    $chrsize{$oldval} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[10];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs
    next unless(exists($chrmap{$seqname}));

    $seqlen = length($bits[8]);
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
        print STDERR "start less than or equal to 0:   ", $start, "\n";
        print STDERR join("\t", @bits), "\n";
        $start = 1;
    }

    if($end > $chrsize{$seqname}){
        print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
        print STDERR join("\t", @bits), "\n";
        $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
        $strand = "+";
        $color = "0,0,255";
    }
    else{
        $strand = "-";
        $color = "255,0,0";
    }

    $score = 0;
    print FOUT join("\t", $chrmap{$seqname}, $start, $end, $read_code, $score, $strand, $start, $end, $color), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}

close(FOUT);
print STDERR "output file: $outfile\n";
mgogol is offline   Reply With Quote
Old 06-08-2011, 07:00 AM   #8
__sequence
Junior Member
 
Location: EU

Join Date: Jun 2011
Posts: 7
Default

mgogol, Thank you for posting. I have tried this script, and it returns an empty file. May be the problem is that it requires an additional file as input?

Quote:
Requires tab delim file of chromosome or contig names
# (eland fa match files) in the format:
# UCSC_chr_name chr_length eland_name
I have all chromosomes in one input file. So I still need to create this additional file with chromosome names? Not clear how.
__sequence is offline   Reply With Quote
Old 06-08-2011, 10:07 PM   #9
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

Sorry for the slow reply - I'm currently away at a conference.

The latest versions can be found as part of the Vancouver Short Read Analysis Package: http://vancouvershortr.sourceforge.net/

There should be several work flows here, depending on the starting format.

http://sourceforge.net/apps/mediawik...e=ConvertToBed

If that doesn't work for you, please let me know, and I'll provide more explicit information when I return to the office.
Cheers
__________________
The more you know, the more you know you don't know. —Aristotle
apfejes is offline   Reply With Quote
Old 06-09-2011, 03:03 AM   #10
__sequence
Junior Member
 
Location: EU

Join Date: Jun 2011
Posts: 7
Default

Quote:
Originally Posted by apfejes View Post
The latest versions can be found as part of the Vancouver Short Read Analysis Package: http://vancouvershortr.sourceforge.net/

There should be several work flows here, depending on the starting format.

http://sourceforge.net/apps/mediawik...e=ConvertToBed

If that doesn't work for you, please let me know, and I'll provide more explicit information when I return to the office.
Hi apfejes,

Thank you for your reply. I tried the following format: java -jar conversion_util/ConvertToBed.jar -aligner eland -input "input_dir/name" -output "output_dir" -name "name" -noprepend

As a result I got the following error:

Version: Initializing class ElandIterator $Revision: 2933 $
Error: Line 1 has an invalid read:
Error: Mismatches is less than 0

Last edited by __sequence; 06-09-2011 at 03:22 AM.
__sequence is offline   Reply With Quote
Old 06-09-2011, 03:30 AM   #11
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

You need to put the log file in a directory that exists and for which you have write permissions. If the directory you've given above does not exits, then it will not be able to create the log file.
__________________
The more you know, the more you know you don't know. —Aristotle
apfejes is offline   Reply With Quote
Old 06-09-2011, 03:39 AM   #12
__sequence
Junior Member
 
Location: EU

Join Date: Jun 2011
Posts: 7
Default

Oups, I just edited my post above. I figured out about the directory, but there is still an error:
Error: Line 1 has an invalid read:
Error: Mismatches is less than 0
__sequence is offline   Reply With Quote
Old 06-09-2011, 05:00 AM   #13
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

That is some serious spam above - and worse, copied from my own blog!

Anyhow, can you paste the first line of the file? There's probably something simple that's going wrong, eg, you haven't followed the work flow to remove unmapped reads.

http://sourceforge.net/apps/mediawik...itle=WorkFlows
__________________
The more you know, the more you know you don't know. —Aristotle
apfejes is offline   Reply With Quote
Old 06-09-2011, 06:02 AM   #14
__sequence
Junior Member
 
Location: EU

Join Date: Jun 2011
Posts: 7
Default

Quote:
Originally Posted by apfejes View Post
That is some serious spam above - and worse, copied from my own blog!

Anyhow, can you paste the first line of the file? There's probably something simple that's going wrong, eg, you haven't followed the work flow to remove unmapped reads.

http://sourceforge.net/apps/mediawik...itle=WorkFlows
I tried using
Quote:
grep “U[012]” Input.eland > Input.um.eland
as suggested at your manual, but it returns an empty file. May be my file is not in a standard Eland export format? I have been told that this is the output from Eland.

Here is how the first line looks like:

Quote:
XX-XXXX01 21 1 1 1065 918 0 1 NTCAAAAACCCAGCGAACATCATTCTTTGGCTAGGG BMMMNVWTTVb____b_bb__b_b__bQQ______Y chr7.fa 128316527 R A35 111 Y
Is it Eland Export format?

Last edited by __sequence; 06-09-2011 at 06:19 AM.
__sequence is offline   Reply With Quote
Old 06-09-2011, 06:10 AM   #15
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,295
Default

Sorry about the spam...the guy used an interesting strategy of spamming with Anthony's content to get past the filters. Grrrrrr.
ECO is offline   Reply With Quote
Old 06-10-2011, 01:02 AM   #16
apfejes
Senior Member
 
Location: Vancouver, Canada

Join Date: Feb 2008
Posts: 236
Default

Thanks ECO.. it was original. (=

In any case, I'll have a look at this on monday. I don't have time to look at it until I'm back in Canada. If you haven't heard from me by then, please send me a reminder.
__________________
The more you know, the more you know you don't know. —Aristotle
apfejes is offline   Reply With Quote
Old 06-10-2011, 05:12 PM   #17
rdeborja
Member
 
Location: Toronto

Join Date: Aug 2008
Posts: 41
Default

Talk about coincidence, I was working on a piece of code that would take Illumina TruSeq export files and merge them into a single bed file just this morning. It outputs only those reads that align uniquely and pass filtering and collapses potential PCR biases (via start point).

Here's some sample usage:

rdeborja@hn1:~/tmp$ export2bed.pl --bed s_8_1_export.bed s_8_1_export.txt
Processing file s_8_1_export.txt...
Total collapsed unique aligned reads 656
Completed processing file s_8_1_export.txt


Hope it helps!

Code:
#!/usr/bin/perl

#
 # Created by Richard de Borja on 2011-06-10.
 # Copyright (C) 2011 The Ontario Institute for Cancer Research
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation; either version 3 of the License, or (at
 # your option) any later version.
 #
 # This program is distributed in the hope that it will be useful, but
 # WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 # General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
 # MA 02110-1301, USA.
#
# $Id export2bed.pl rdeborja$

use strict;
use Carp;
use Getopt::Long;
use Pod::Usage;
use Data::Dumper;

# command line default arguments
our %opts = (
	# list of arguments and default values
	"bed"               => undef,
	"length"            => 101,

);

our %_exportfield = (
    # list of columns for export file
    # See CASAVA 1.7 User Guide page 61
    "machine"           => 0,
    "run_number"        => 1,
    "lane"              => 2,
    "tile"              => 3,
    "x_coord_cluster"   => 4,
    "y_coord_cluster"   => 5,
    "index_seq"         => 6,
    "read_number"       => 7,
    "sequence"          => 8,
    "quality"           => 9,
    "match_chr"         => 10,
    "match_contig"      => 11,
    "match_position"    => 12,
    "match_strand"      => 13,
    "match_desc"        => 14,
    "single_read_score" => 15,
    "paired_read_score" => 16,
    "partner_chr"       => 17,
    "partner_contig"    => 18,
    "partner_offset"    => 19,
    "partner_strand"    => 20,
    "filtering"         => 21,
);

my $result = main();
exit $result;

##############################
# Main
##############################
sub main {
	# get command line arguments
	GetOptions(
		\%opts,
		"help|?",
		"man",
		"bed=s",
		"length:i",
	) or pod2usage(64);		# 64=>EX_USAGE

	pod2usage(1) if $opts{'help'};
	pod2usage(-exitstatus => 0, -verbose => 2) if $opts{'man'};

	while(my ($arg, $value) = each(%opts)) {
		if(!defined $value) {
			print "ERROR: Missing argument $arg\n";
			pod2usage(128);
		}
	}
    
    my $read_length = $opts{'length'};
    my %unique_read_hash;
    my $output_file = $opts{'bed'};
    
    foreach my $file (@ARGV) {
        print "Processing file $file...\n";
        my $processed_export = process_export_file($file, $read_length, \%unique_read_hash, $output_file);
        print "Completed processing file $file\n";
    }
    
    my $create_bed_file = print_bed_file($output_file, \%unique_read_hash);

	return 0;
}

##############################
# Functions
##############################

sub process_export_file {
    my $export_file = shift;
    my $read_length = shift;
    my $unique_read_hash = shift;
    my $output_file;
    
    my $output_file = $export_file;
    $output_file =~ s/txt/bed/;
    
    open(my $ifh, '<', $export_file) or croak "Cannot open file $export_file: $!";
    
    while(my $line = <$ifh>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        
        my @input_line = split(/\t/, $line);
        
        my $chr = $input_line[$_exportfield{'match_chr'}];
        my $start = $input_line[$_exportfield{'match_position'}];
        my $end = $start + $read_length - 1;
        my $pf = $input_line[$_exportfield{'filtering'}];
        
        # Skip any reads that do not align, do not uniquely align, or pass 
        # the Illumina filtering
        if (($chr =~ /chr/) && ($pf eq "Y")) {
            $chr =~ s/\.fa//g;
            my $bed_line = join("_", $chr, $start, $end);
            $unique_read_hash->{$bed_line} = 1;
        } else {
            next;
        }
    }
    close($ifh);
    
    #my $create_bed_file = print_bed_file($output_file, \$unique_read_hash);

    my $count = scalar(keys(%$unique_read_hash));
    print "Total collapsed unique aligned reads $count\n";
    return 0;
}


sub print_bed_file {
    my $bed_file = shift;
    my $read_hash = shift;
    
    open(my $ofh, '>', $bed_file) or croak "Cannot open file $bed_file";
    foreach my $bed_entry (sort(keys(%$read_hash))) {
        my ($chr, $start, $end) = split("_", $bed_entry);
        print {$ofh} "$chr\t$start\t$end\n";
    }
    close($ofh);
}

__END__

=head1 NAME

export2bed.pl

=head1 SYNOPSIS

B<export2bed.pl> [options] [file ...]

	Options:
	--help			brief help message
	--man			full documentation
	--bed           name of output BED file
	--length        read length (optional: default 101bp)

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief help message and exit.

=item B<--man>

Print the manual page.

=item B<--bed>

Name of output BED file

=item B<--length>

Read length (optional, default is 101bp)

=back

=head1 DESCRIPTION

B<export2bed.pl> convert an Illumina export.txt file to a BED file

=head1 EXAMPLES

export2bed.pl --bed out.bed --length 101 s_1_1_export.txt s_1_2_export.txt

=head1 AUTHORS

The Ontario Institute for Cancer Research

=head1 SEE ALSO

=cut
Quote:
Originally Posted by joseph View Post
Can anybody share an algorithm that takes eland files and make bed files out of them?
Thanks
Joseph
rdeborja is offline   Reply With Quote
Old 07-01-2011, 07:03 AM   #18
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 190
Default

Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

Code:
#!/usr/bin/perl
# Program to convert eland export format to BED format
# for running MACS
# Chris Seidel, June 2009
#
# Requires tab delim file of chromosome names in the format:
# chr_name chr_length
# corrects for alignments that go off the ends of the chrs
# negative bases are trimmed to 1, 
# bases > chr_length are set to chr_length
# (I know the former exist, I don't know if the latter exist)
# results are not sorted, but can be sorted in linux by:
# sort -o infile.bed -k1,1 -k2,2n infile.bed
# (sort in place, first column, then by second column numeric)

use File::Basename;

die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2);

# get info on chromosomes
open(cmap, $ARGV[0]) || die("no chromosome length file!");
%chrmap = {};
while($line = <cmap>){
    chomp($line);
    ($chrom, $size) = split(/\t/, $line);
    $chrmap{$chrom} = $chrom;
    $chrsize{$chrom} = $size;
}

# open input file
open(fp, $ARGV[1]) || die("can't open eland file");

$lines = 0;
while(<fp>){
    chop;
    ++$lines;
    @bits = split(/\t/);
    # skip reads that didn't pass filtering
    next if($bits[21] eq "N");
    # get match name
    $seqname = $bits[10];
    # skip No Matches or QC failures
    # next if($seqname =~ /NM|QC/);
    # skip repeat matches
    # next if($seqname =~ /\d+:\d+:\d+/);
    # we're only interested in sequences that match our chrs

    $chrom = $bits[10];
    $seqlen = length($bits[8]);
    next unless(exists($chrmap{$chrom}));
    $start = $bits[12];
    $end = $start + $seqlen - 1;
    $strand = $bits[13];
    #print "start:$start end:$end strand:$strand\n";

    # parse match descriptor
    $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/);
    # skip reads beyond a certain threshold
    next if($n > 2);
    $read_code = "U".$n;

    # correct for alignments off the chromosome ends
    if( $start <= 0 ){
   print STDERR "start less than or equal to 0:   ", $start, "\n";
   print STDERR join("\t", @bits), "\n";
   $start = 1;
    }

    if($end > $chrsize{$seqname}){
   print STDERR "end greater than chr end $chrsize{$seqname}:   $end, diff: ", $end - $chrsize{$seqname}, "\n";
   print STDERR join("\t", @bits), "\n";
   $end = $chrsize{$seqname};
    }

    if($strand eq "F"){
   $strand = "+";
   $color = "0,0,255";
    }
    else{
   $strand = "-";
   $color = "255,0,0";
    }

    $score = 1;
    $name = join('',("@",$bits[0],"_",$bits[1],":",$bits[2],":",$bits[3],":",$bits[4],":",$bits[5],"#",$bits[6],"/",$bits[7]));
    my $chr = $chrmap{$seqname};
    $chr =~ s/\.fa$//g;
    print join("\t", $chr, $start, $end, $name, $score, $strand), "\n";

    # give some feedback
    print STDERR "$lines processed\n" if(!($lines % 100000));
}
mgogol is offline   Reply With Quote
Old 04-18-2012, 02:04 AM   #19
diablito
Junior Member
 
Location: China

Join Date: Apr 2012
Posts: 5
Default Xenopus table

Quote:
Originally Posted by mgogol View Post
Okay, I rewrote it a little. It now takes a fasta index file (produced by samtools faidx genome.fa)

Of course, now you're getting a script hacked away at by two people.... You could try other solutions suggested above.

perl export2bed.pl mm9_full.fa.fai s_1_1_export.txt > s_1_1_export.bed

.........

[/CODE]
Hello. Thank you for this script. I used to convert export to bed using a simple cut&grep regexp and then adding 50 to coordinates. This, of course, created end of chromosome problem, which was easy to correct in human exports. But now I have a Xenopus export file, and it is a 1000x problem. So, could you, please, specify the format for chromosome name/size file or give an example? I could not figure it out from the description line
diablito is offline   Reply With Quote
Old 04-18-2012, 02:09 AM   #20
diablito
Junior Member
 
Location: China

Join Date: Apr 2012
Posts: 5
Default

Quote:
Originally Posted by apfejes View Post

The manuals are a work in progress. If anyone would like to give them a try, I'll update that part of the manual.
Could you, please, give en example of chromosome file? I need to make one for Xenopus...BLEH(((
diablito is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:42 PM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.