SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Beginner Question: Generate WIG file (http://seqanswers.com/forums/showthread.php?t=23010)

el_Davido 09-02-2012 12:46 PM

Beginner Question: Generate WIG file
 
I found a really interesting published ChIP-Seq file I'd like to load into UCSCs genome browser. The file has the following content

(just showing the first few lines)
chr11 41481370 41481395 U0 0 +
chr15 93881170 93881195 U0 0 -
chr17 70890479 70890504 U0 0 +
chr2 170407262 170407287 U0 0 -
chr17 33602056 33602081 U0 0 +
chr11 92554663 92554688 U0 0 -
chr14 22768281 22768306 U0 0 +
chr14 28429583 28429608 U0 0 -
chr14 98610846 98610871 U0 0 +
chr7 100735751 100735776 U0 0 +
chr9 46391869 46391894 U0 0 +

So apparently a list of single 25 base reads.

All my other files I loaded into UCSC had the following structure:
track type=wiggle_0 name="DHS WT naive" color=0,0,0
chr1 4486200 4486399 0.3
chr1 4486400 4486599 0.8
chr1 4486600 4486799 0.8
chr1 4486800 4486999 0.4
chr1 4758400 4758599 1.2
chr1 4758600 4758799 1.2
chr1 4758800 4758999 0.2
chr1 4759000 4759199 0.2
chr1 4759600 4759799 0.2

Apparently, some sort of sum was created from the single reads.

What tools do I need to create the former file into the latter? Windows-based would be nice, but could also be on Linux.
My aplogies if this was already mentioned, but I didn't find anything during my search for the problem.

SES 09-03-2012 06:38 AM

You will need to figure out which field in the first file is the value you want in your wiggle file (I just assumed it was the 5th field below). If you have a file called "Chip-Seq.txt" you could do something like:

Code:

grep "chr11" Chip-Seq.txt | cut -f1-3,5 > Chip-Seq_chr11.wig
That will not create the header, but I would just manually type that in instead of writing a program for this one small job. You'll surely want to the create specific files for each chromosome and the first file is oddly not sorted that way, so I'm unsure what format that is exactly.

el_Davido 09-03-2012 11:40 AM

Thanks for your answer, SES.

The thing is, it's not just a matter of re-formatting. When I sort the file of interest in terms of chromosome and read-position, it looks like this (Sorry for not mentioning this before):

chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3053811 3053836 U0 0 +
chr1 3053930 3053955 U0 0 -
chr1 3053930 3053955 U0 0 -
chr1 3053930 3053955 U0 0 -


So the task is: Realize, that chr1 3001356 to 3001381 is measured 4 times and give an according score etc.

Any advice how to continue?

PS: I got the original file from http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE27158

SES 09-03-2012 11:56 AM

Thanks for the information. What are you wanting to do with the 4 scores for each position? From this snippet it looks like the scores are the same and you could just remove the duplicate entries in the original file. Is this what you want? If so, you could just add one command to the original (assuming you are on a *nix machine).

Original file:
Code:

$ cat Chip-Seq.txt
chr1        3001356        3001381        U0        0        -
chr1        3001356        3001381        U0        0        -
chr1        3001356        3001381        U0        0        -
chr1        3001356        3001381        U0        0        -
chr1        3007329        3007354        U0        0        +
chr1        3013242        3013267        U0        0        -
chr1        3016493        3016518        U0        0        -
chr1        3016493        3016518        U0        0        -
chr1        3016493        3016518        U0        0        -
chr1        3053811        3053836        U0        0        +
chr1        3053930        3053955        U0        0        -
chr1        3053930        3053955        U0        0        -
chr1        3053930        3053955        U0        0        -

Command to create wiggle file:
Code:

$ uniq Chip-Seq.txt | grep "chr1" | cut -f1-3,5 > Chip-Seq_chr1_uniq.wig
chr1        3001356        3001381        0
chr1        3007329        3007354        0
chr1        3013242        3013267        0
chr1        3016493        3016518        0
chr1        3053811        3053836        0
chr1        3053930        3053955        0

Note that I'm not sure this is what you want, so let us know if there is a different output you had in mind.

Chipper 09-03-2012 12:53 PM

The data is in BED format with one line per alignment. You can upload it to UCSC as a BED file but to get it in wiggle format you need to use some type of peak finder. There are online tools that can use the SRA URL, I know cistrome.org accepts BED file and can produce wiggle files using the MACS peak finder.

el_Davido 09-03-2012 01:00 PM

For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

Is it a valid approach to count the number of individual reads?

Code:

chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -

Would become (in my view)

Code:

chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3


Thanks for your support so far, SES!

SES 09-03-2012 01:04 PM

Quote:

Originally Posted by el_Davido (Post 83072)
Is it a valid approach to count the number of individual reads?

Code:

chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -

Would become (in my view)

Code:

chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3

Thanks for your support so far, SES!

You are welcome, but unfortunately, I am not sure this is the right thing to do with this data. You might want to follow up with the suggestion from @Chipper above. Good luck.

Chipper 09-03-2012 01:21 PM

You want to get the count of all reads surrounding a bindingsite, in your example you will just count all reads with the same alignment position. Email the author and ask him to send the processed data, if it is not available as supplementary to the article.

SES 09-03-2012 05:57 PM

Quote:

Originally Posted by el_Davido (Post 83072)
For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

Is it a valid approach to count the number of individual reads?

Code:

chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -

Would become (in my view)

Code:

chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3


Based on the comment by @Chipper, it sounds like you were spot on with your approach. Here is a small script that can produce this output.

Call this script "bed2wig.pl" (or whatever you want):
Code:

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;

my $usage = "$0 -i BEDfile -o wiggle_file\n";
my $infile;
my $outfile;

GetOptions(
          'i|infile=s'  => \$infile,
          'o|outfile=s' => \$outfile,
          );

die $usage if !$infile or !$outfile;

open(my $in, '<', $infile) or die "\nERROR: Could not open file: $infile\n";
open(my $out, '>', $outfile) or die "\nERROR: Could not open file: $outfile\n";

my %mapped;

print $out "track type=wiggle_0 name=\"DHS WT naive\" color=0,0,0\n"; # change this line to whatever you need

while (my $line = <$in>) {
    chomp $line;
    my @line = split('\s+', $line);
    my $read = join("|",@line);
    $mapped{$read}++;
}

for my $key (sort keys %mapped) {
    my @pos = split(/\|/, $key);
    print $out join("\t",($pos[0],$pos[1],$pos[2],$mapped{$key})),"\n";
}

close($in);
close($out);

Run the script like this:

Code:

perl bed2wig.pl -i Chip-Seq.txt -o Chip-Seq.wig
Original file (Chip-Seq.txt):
Code:

$ cat Chip-Seq.txt
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -

Output file (Chip-Seq.wig):
Code:

$ cat Chip-Seq.wig
track type=wiggle_0 name="DHS WT naive" color=0,0,0
chr1        3001356        3001381        4
chr1        3007329        3007354        1
chr1        3013242        3013267        1
chr1        3016493        3016518        3

You'll certainly have to modify the script based on the exact format of your input and the output you need, but it should be a start for you.

el_Davido 09-04-2012 01:02 PM

Thanks for your ideas.
I'll start with cistrome.org and see how far I get!

Jim Robinson 09-05-2012 07:15 AM

You can use igvtools to compute coverage from a bed file. I think bedtools can also do this.

el_Davido 09-12-2012 08:08 AM

My problem is solved, many thanks for your advice.

www.cistrome.org offers a GUI and is just the right thing, if one just wants to convert a couple of BED files.


All times are GMT -8. The time now is 07:43 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.