SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Beginner question about Picard AddOrReplaceReadGroups yeppomonkey Bioinformatics 1 09-03-2012 11:59 PM
Bowtie beginner question... milesgr General 6 03-14-2012 07:21 AM
samtools sort bam file error: generate non-existent file mediator Bioinformatics 0 03-05-2012 09:42 PM
Beginner Sequencing Analysis question mgibson General 0 06-17-2011 10:30 AM
1000 genomes - a beginner question EHC Bioinformatics 3 05-26-2011 01:51 AM

Reply
 
Thread Tools
Old 09-02-2012, 01:46 PM   #1
el_Davido
Junior Member
 
Location: Germany

Join Date: Sep 2012
Posts: 5
Default Beginner Question: Generate WIG file

I found a really interesting published ChIP-Seq file I'd like to load into UCSCs genome browser. The file has the following content

(just showing the first few lines)
chr11 41481370 41481395 U0 0 +
chr15 93881170 93881195 U0 0 -
chr17 70890479 70890504 U0 0 +
chr2 170407262 170407287 U0 0 -
chr17 33602056 33602081 U0 0 +
chr11 92554663 92554688 U0 0 -
chr14 22768281 22768306 U0 0 +
chr14 28429583 28429608 U0 0 -
chr14 98610846 98610871 U0 0 +
chr7 100735751 100735776 U0 0 +
chr9 46391869 46391894 U0 0 +

So apparently a list of single 25 base reads.

All my other files I loaded into UCSC had the following structure:
track type=wiggle_0 name="DHS WT naive" color=0,0,0
chr1 4486200 4486399 0.3
chr1 4486400 4486599 0.8
chr1 4486600 4486799 0.8
chr1 4486800 4486999 0.4
chr1 4758400 4758599 1.2
chr1 4758600 4758799 1.2
chr1 4758800 4758999 0.2
chr1 4759000 4759199 0.2
chr1 4759600 4759799 0.2

Apparently, some sort of sum was created from the single reads.

What tools do I need to create the former file into the latter? Windows-based would be nice, but could also be on Linux.
My aplogies if this was already mentioned, but I didn't find anything during my search for the problem.
el_Davido is offline   Reply With Quote
Old 09-03-2012, 07:38 AM   #2
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

You will need to figure out which field in the first file is the value you want in your wiggle file (I just assumed it was the 5th field below). If you have a file called "Chip-Seq.txt" you could do something like:

Code:
grep "chr11" Chip-Seq.txt | cut -f1-3,5 > Chip-Seq_chr11.wig
That will not create the header, but I would just manually type that in instead of writing a program for this one small job. You'll surely want to the create specific files for each chromosome and the first file is oddly not sorted that way, so I'm unsure what format that is exactly.
SES is offline   Reply With Quote
Old 09-03-2012, 12:40 PM   #3
el_Davido
Junior Member
 
Location: Germany

Join Date: Sep 2012
Posts: 5
Default

Thanks for your answer, SES.

The thing is, it's not just a matter of re-formatting. When I sort the file of interest in terms of chromosome and read-position, it looks like this (Sorry for not mentioning this before):

chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3053811 3053836 U0 0 +
chr1 3053930 3053955 U0 0 -
chr1 3053930 3053955 U0 0 -
chr1 3053930 3053955 U0 0 -


So the task is: Realize, that chr1 3001356 to 3001381 is measured 4 times and give an according score etc.

Any advice how to continue?

PS: I got the original file from http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE27158
el_Davido is offline   Reply With Quote
Old 09-03-2012, 12:56 PM   #4
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Thanks for the information. What are you wanting to do with the 4 scores for each position? From this snippet it looks like the scores are the same and you could just remove the duplicate entries in the original file. Is this what you want? If so, you could just add one command to the original (assuming you are on a *nix machine).

Original file:
Code:
$ cat Chip-Seq.txt
chr1	3001356	3001381	U0	0	-
chr1	3001356	3001381	U0	0	-
chr1	3001356	3001381	U0	0	-
chr1	3001356	3001381	U0	0	-
chr1	3007329	3007354	U0	0	+
chr1	3013242	3013267	U0	0	-
chr1	3016493	3016518	U0	0	-
chr1	3016493	3016518	U0	0	-
chr1	3016493	3016518	U0	0	-
chr1	3053811	3053836	U0	0	+
chr1	3053930	3053955	U0	0	-
chr1	3053930	3053955	U0	0	-
chr1	3053930	3053955	U0	0	-
Command to create wiggle file:
Code:
$ uniq Chip-Seq.txt | grep "chr1" | cut -f1-3,5 > Chip-Seq_chr1_uniq.wig
chr1	3001356	3001381	0
chr1	3007329	3007354	0
chr1	3013242	3013267	0
chr1	3016493	3016518	0
chr1	3053811	3053836	0
chr1	3053930	3053955	0
Note that I'm not sure this is what you want, so let us know if there is a different output you had in mind.
SES is offline   Reply With Quote
Old 09-03-2012, 01:53 PM   #5
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

The data is in BED format with one line per alignment. You can upload it to UCSC as a BED file but to get it in wiggle format you need to use some type of peak finder. There are online tools that can use the SRA URL, I know cistrome.org accepts BED file and can produce wiggle files using the MACS peak finder.
Chipper is offline   Reply With Quote
Old 09-03-2012, 02:00 PM   #6
el_Davido
Junior Member
 
Location: Germany

Join Date: Sep 2012
Posts: 5
Default

For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

Is it a valid approach to count the number of individual reads?

Code:
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
Would become (in my view)

Code:
chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3

Thanks for your support so far, SES!
el_Davido is offline   Reply With Quote
Old 09-03-2012, 02:04 PM   #7
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by el_Davido View Post
Is it a valid approach to count the number of individual reads?

Code:
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
Would become (in my view)

Code:
chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3
Thanks for your support so far, SES!
You are welcome, but unfortunately, I am not sure this is the right thing to do with this data. You might want to follow up with the suggestion from @Chipper above. Good luck.
SES is offline   Reply With Quote
Old 09-03-2012, 02:21 PM   #8
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

You want to get the count of all reads surrounding a bindingsite, in your example you will just count all reads with the same alignment position. Email the author and ask him to send the processed data, if it is not available as supplementary to the article.
Chipper is offline   Reply With Quote
Old 09-03-2012, 06:57 PM   #9
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by el_Davido View Post
For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

Is it a valid approach to count the number of individual reads?

Code:
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
Would become (in my view)

Code:
chr1 3001356 3001381 4
chr1 3007329 3007354 1
chr1 3013242 3013267 1
chr1 3016493 3016518 3
Based on the comment by @Chipper, it sounds like you were spot on with your approach. Here is a small script that can produce this output.

Call this script "bed2wig.pl" (or whatever you want):
Code:
#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;

my $usage = "$0 -i BEDfile -o wiggle_file\n";
my $infile;
my $outfile;

GetOptions(
           'i|infile=s'  => \$infile,
           'o|outfile=s' => \$outfile,
           );

die $usage if !$infile or !$outfile;

open(my $in, '<', $infile) or die "\nERROR: Could not open file: $infile\n";
open(my $out, '>', $outfile) or die "\nERROR: Could not open file: $outfile\n";

my %mapped;

print $out "track type=wiggle_0 name=\"DHS WT naive\" color=0,0,0\n"; # change this line to whatever you need

while (my $line = <$in>) {
    chomp $line;
    my @line = split('\s+', $line);
    my $read = join("|",@line);
    $mapped{$read}++;
}

for my $key (sort keys %mapped) {
    my @pos = split(/\|/, $key);
    print $out join("\t",($pos[0],$pos[1],$pos[2],$mapped{$key})),"\n";
}

close($in);
close($out);
Run the script like this:

Code:
perl bed2wig.pl -i Chip-Seq.txt -o Chip-Seq.wig
Original file (Chip-Seq.txt):
Code:
$ cat Chip-Seq.txt
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3001356 3001381 U0 0 -
chr1 3007329 3007354 U0 0 +
chr1 3013242 3013267 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
chr1 3016493 3016518 U0 0 -
Output file (Chip-Seq.wig):
Code:
$ cat Chip-Seq.wig 
track type=wiggle_0 name="DHS WT naive" color=0,0,0
chr1	3001356	3001381	4
chr1	3007329	3007354	1
chr1	3013242	3013267	1
chr1	3016493	3016518	3
You'll certainly have to modify the script based on the exact format of your input and the output you need, but it should be a start for you.
SES is offline   Reply With Quote
Old 09-04-2012, 02:02 PM   #10
el_Davido
Junior Member
 
Location: Germany

Join Date: Sep 2012
Posts: 5
Default

Thanks for your ideas.
I'll start with cistrome.org and see how far I get!
el_Davido is offline   Reply With Quote
Old 09-05-2012, 08:15 AM   #11
Jim Robinson
Member
 
Location: Boston, MA

Join Date: May 2009
Posts: 75
Default

You can use igvtools to compute coverage from a bed file. I think bedtools can also do this.
Jim Robinson is offline   Reply With Quote
Old 09-12-2012, 09:08 AM   #12
el_Davido
Junior Member
 
Location: Germany

Join Date: Sep 2012
Posts: 5
Default

My problem is solved, many thanks for your advice.

www.cistrome.org offers a GUI and is just the right thing, if one just wants to convert a couple of BED files.
el_Davido is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO