Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Beginner Question: Generate WIG file

    I found a really interesting published ChIP-Seq file I'd like to load into UCSCs genome browser. The file has the following content

    (just showing the first few lines)
    chr11 41481370 41481395 U0 0 +
    chr15 93881170 93881195 U0 0 -
    chr17 70890479 70890504 U0 0 +
    chr2 170407262 170407287 U0 0 -
    chr17 33602056 33602081 U0 0 +
    chr11 92554663 92554688 U0 0 -
    chr14 22768281 22768306 U0 0 +
    chr14 28429583 28429608 U0 0 -
    chr14 98610846 98610871 U0 0 +
    chr7 100735751 100735776 U0 0 +
    chr9 46391869 46391894 U0 0 +

    So apparently a list of single 25 base reads.

    All my other files I loaded into UCSC had the following structure:
    track type=wiggle_0 name="DHS WT naive" color=0,0,0
    chr1 4486200 4486399 0.3
    chr1 4486400 4486599 0.8
    chr1 4486600 4486799 0.8
    chr1 4486800 4486999 0.4
    chr1 4758400 4758599 1.2
    chr1 4758600 4758799 1.2
    chr1 4758800 4758999 0.2
    chr1 4759000 4759199 0.2
    chr1 4759600 4759799 0.2

    Apparently, some sort of sum was created from the single reads.

    What tools do I need to create the former file into the latter? Windows-based would be nice, but could also be on Linux.
    My aplogies if this was already mentioned, but I didn't find anything during my search for the problem.

  • #2
    You will need to figure out which field in the first file is the value you want in your wiggle file (I just assumed it was the 5th field below). If you have a file called "Chip-Seq.txt" you could do something like:

    Code:
    grep "chr11" Chip-Seq.txt | cut -f1-3,5 > Chip-Seq_chr11.wig
    That will not create the header, but I would just manually type that in instead of writing a program for this one small job. You'll surely want to the create specific files for each chromosome and the first file is oddly not sorted that way, so I'm unsure what format that is exactly.

    Comment


    • #3
      Thanks for your answer, SES.

      The thing is, it's not just a matter of re-formatting. When I sort the file of interest in terms of chromosome and read-position, it looks like this (Sorry for not mentioning this before):

      chr1 3001356 3001381 U0 0 -
      chr1 3001356 3001381 U0 0 -
      chr1 3001356 3001381 U0 0 -
      chr1 3001356 3001381 U0 0 -
      chr1 3007329 3007354 U0 0 +
      chr1 3013242 3013267 U0 0 -
      chr1 3016493 3016518 U0 0 -
      chr1 3016493 3016518 U0 0 -
      chr1 3016493 3016518 U0 0 -
      chr1 3053811 3053836 U0 0 +
      chr1 3053930 3053955 U0 0 -
      chr1 3053930 3053955 U0 0 -
      chr1 3053930 3053955 U0 0 -


      So the task is: Realize, that chr1 3001356 to 3001381 is measured 4 times and give an according score etc.

      Any advice how to continue?

      PS: I got the original file from http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE27158

      Comment


      • #4
        Thanks for the information. What are you wanting to do with the 4 scores for each position? From this snippet it looks like the scores are the same and you could just remove the duplicate entries in the original file. Is this what you want? If so, you could just add one command to the original (assuming you are on a *nix machine).

        Original file:
        Code:
        $ cat Chip-Seq.txt
        chr1	3001356	3001381	U0	0	-
        chr1	3001356	3001381	U0	0	-
        chr1	3001356	3001381	U0	0	-
        chr1	3001356	3001381	U0	0	-
        chr1	3007329	3007354	U0	0	+
        chr1	3013242	3013267	U0	0	-
        chr1	3016493	3016518	U0	0	-
        chr1	3016493	3016518	U0	0	-
        chr1	3016493	3016518	U0	0	-
        chr1	3053811	3053836	U0	0	+
        chr1	3053930	3053955	U0	0	-
        chr1	3053930	3053955	U0	0	-
        chr1	3053930	3053955	U0	0	-
        Command to create wiggle file:
        Code:
        $ uniq Chip-Seq.txt | grep "chr1" | cut -f1-3,5 > Chip-Seq_chr1_uniq.wig
        chr1	3001356	3001381	0
        chr1	3007329	3007354	0
        chr1	3013242	3013267	0
        chr1	3016493	3016518	0
        chr1	3053811	3053836	0
        chr1	3053930	3053955	0
        Note that I'm not sure this is what you want, so let us know if there is a different output you had in mind.

        Comment


        • #5
          The data is in BED format with one line per alignment. You can upload it to UCSC as a BED file but to get it in wiggle format you need to use some type of peak finder. There are online tools that can use the SRA URL, I know cistrome.org accepts BED file and can produce wiggle files using the MACS peak finder.

          Comment


          • #6
            For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

            The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

            Is it a valid approach to count the number of individual reads?

            Code:
            chr1 3001356 3001381 U0 0 -
            chr1 3001356 3001381 U0 0 -
            chr1 3001356 3001381 U0 0 -
            chr1 3001356 3001381 U0 0 -
            chr1 3007329 3007354 U0 0 +
            chr1 3013242 3013267 U0 0 -
            chr1 3016493 3016518 U0 0 -
            chr1 3016493 3016518 U0 0 -
            chr1 3016493 3016518 U0 0 -
            Would become (in my view)

            Code:
            chr1 3001356 3001381 4
            chr1 3007329 3007354 1
            chr1 3013242 3013267 1
            chr1 3016493 3016518 3

            Thanks for your support so far, SES!

            Comment


            • #7
              Originally posted by el_Davido View Post
              Is it a valid approach to count the number of individual reads?

              Code:
              chr1 3001356 3001381 U0 0 -
              chr1 3001356 3001381 U0 0 -
              chr1 3001356 3001381 U0 0 -
              chr1 3001356 3001381 U0 0 -
              chr1 3007329 3007354 U0 0 +
              chr1 3013242 3013267 U0 0 -
              chr1 3016493 3016518 U0 0 -
              chr1 3016493 3016518 U0 0 -
              chr1 3016493 3016518 U0 0 -
              Would become (in my view)

              Code:
              chr1 3001356 3001381 4
              chr1 3007329 3007354 1
              chr1 3013242 3013267 1
              chr1 3016493 3016518 3
              Thanks for your support so far, SES!
              You are welcome, but unfortunately, I am not sure this is the right thing to do with this data. You might want to follow up with the suggestion from @Chipper above. Good luck.

              Comment


              • #8
                You want to get the count of all reads surrounding a bindingsite, in your example you will just count all reads with the same alignment position. Email the author and ask him to send the processed data, if it is not available as supplementary to the article.

                Comment


                • #9
                  Originally posted by el_Davido View Post
                  For me as a layperson it seems that the number of reads does correspond to the signal intensity. Is that correct?

                  The file comes from a paper where they performed chromatin-immunoprecipitation for STAT5, followed by sequencing.

                  Is it a valid approach to count the number of individual reads?

                  Code:
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3007329 3007354 U0 0 +
                  chr1 3013242 3013267 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  Would become (in my view)

                  Code:
                  chr1 3001356 3001381 4
                  chr1 3007329 3007354 1
                  chr1 3013242 3013267 1
                  chr1 3016493 3016518 3
                  Based on the comment by @Chipper, it sounds like you were spot on with your approach. Here is a small script that can produce this output.

                  Call this script "bed2wig.pl" (or whatever you want):
                  Code:
                  #!/usr/bin/env perl
                  
                  use strict;
                  use warnings;
                  use Getopt::Long;
                  
                  my $usage = "$0 -i BEDfile -o wiggle_file\n";
                  my $infile;
                  my $outfile;
                  
                  GetOptions(
                             'i|infile=s'  => \$infile,
                             'o|outfile=s' => \$outfile,
                             );
                  
                  die $usage if !$infile or !$outfile;
                  
                  open(my $in, '<', $infile) or die "\nERROR: Could not open file: $infile\n";
                  open(my $out, '>', $outfile) or die "\nERROR: Could not open file: $outfile\n";
                  
                  my %mapped;
                  
                  print $out "track type=wiggle_0 name=\"DHS WT naive\" color=0,0,0\n"; # change this line to whatever you need
                  
                  while (my $line = <$in>) {
                      chomp $line;
                      my @line = split('\s+', $line);
                      my $read = join("|",@line);
                      $mapped{$read}++;
                  }
                  
                  for my $key (sort keys %mapped) {
                      my @pos = split(/\|/, $key);
                      print $out join("\t",($pos[0],$pos[1],$pos[2],$mapped{$key})),"\n";
                  }
                  
                  close($in);
                  close($out);
                  Run the script like this:

                  Code:
                  perl bed2wig.pl -i Chip-Seq.txt -o Chip-Seq.wig
                  Original file (Chip-Seq.txt):
                  Code:
                  $ cat Chip-Seq.txt
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3001356 3001381 U0 0 -
                  chr1 3007329 3007354 U0 0 +
                  chr1 3013242 3013267 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  chr1 3016493 3016518 U0 0 -
                  Output file (Chip-Seq.wig):
                  Code:
                  $ cat Chip-Seq.wig 
                  track type=wiggle_0 name="DHS WT naive" color=0,0,0
                  chr1	3001356	3001381	4
                  chr1	3007329	3007354	1
                  chr1	3013242	3013267	1
                  chr1	3016493	3016518	3
                  You'll certainly have to modify the script based on the exact format of your input and the output you need, but it should be a start for you.

                  Comment


                  • #10
                    Thanks for your ideas.
                    I'll start with cistrome.org and see how far I get!

                    Comment


                    • #11
                      You can use igvtools to compute coverage from a bed file. I think bedtools can also do this.

                      Comment


                      • #12
                        My problem is solved, many thanks for your advice.

                        www.cistrome.org offers a GUI and is just the right thing, if one just wants to convert a couple of BED files.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 03-27-2024, 06:37 PM
                        0 responses
                        13 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-27-2024, 06:07 PM
                        0 responses
                        11 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        53 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        69 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X