
  • Extracting random tags?

    I am about to start playing with the GLITR peak finding tool and comparing its results with those from my current favorite, MACS. However, one of the requirements of GLITR is that there be at least 3-4 times as many control sequence tags as treatment tags. I have about equal numbers.

    Therefore, I was wondering (to avoid reinventing the wheel) whether anyone knows of or has a Perl script or algorithm for extracting N tags from a given tag set. My thought is that if I randomly select ~10 million tags from my treatment set, I will still have a representative treatment set that is 1/4 the size of my control set.

    Any thoughts?

    Many thanks!
    Ian

  • #2
    nudge

    nudge for second viewing, thanks!



    • #3
      There are several ways to do this. A quick way to pick random entries from an array is to shuffle it and then read off the first 10 million entries:

      perldoc -q shuffle

      Since you probably don't want to read 40 million sequences into memory, you could instead shuffle an array of integers and then read through the list of sequences, printing out only those whose indices were selected.

      Another way to go would be to use a random function to decide whether to print out each individual sequence. If you give your function a probability of 0.25, you'll get approximately 1/4 of your data printed, but this won't be exactly 1/4 of the data, and it will be different each time you run it.
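
      The index idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the input has one tag per line, the helper name sample_by_index and the file names in the usage note are made up for the example, and the chosen indices are marked in a bit string (via vec) to keep the bookkeeping small. Shuffling 40 million integers still costs a fair amount of memory in Perl, just less than holding the sequences themselves.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'shuffle';

# Hypothetical helper: copy $n randomly chosen lines from $in_file to
# $out_file, keeping them in their original order.
sub sample_by_index {
    my ($in_file, $out_file, $n) = @_;

    # First pass: count the lines so we know the range of indices
    open(my $in, '<', $in_file) or die "Cannot read $in_file: $!\n";
    my $total = 0;
    $total++ while <$in>;
    close($in);

    die "Asked for $n lines but the file only has $total\n" if $n > $total;

    # Shuffle the indices, keep the first $n, and mark them in a bit string
    my $mask = '';
    vec($mask, $_, 1) = 1 for (shuffle(0 .. $total - 1))[0 .. $n - 1];

    # Second pass: print only the lines whose index was marked
    open($in, '<', $in_file) or die "Cannot read $in_file: $!\n";
    open(my $out, '>', $out_file) or die "Cannot write $out_file: $!\n";
    my $index = 0;
    while (my $line = <$in>) {
        print {$out} $line if vec($mask, $index, 1);
        $index++;
    }
    close($in);
    close($out) or die "Error closing $out_file: $!\n";
    return;
}

# Example call (file names are placeholders):
# sample_by_index('treatment.bed', 'treatment_sample.bed', 10_000_000);
```

      The output keeps the input order, which may or may not matter for the downstream tool.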

      eg:

      print $sequence if (rand() < 0.25);
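
      If an exact count matters, one variation on the coin-flip approach is selection sampling, where the probability is adjusted as you go. This is a different technique from the fixed 0.25 above, offered only as a sketch: it assumes one tag per line, needs an extra pass to count the lines, and the helper name sample_exact and the file names are invented for the example. Each line is kept with probability (lines still needed) / (lines still to read), which yields exactly N lines without holding the file in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write exactly $n randomly chosen lines from
# $in_file to $out_file using selection sampling.
sub sample_exact {
    my ($in_file, $out_file, $n) = @_;

    # First pass: count the lines
    open(my $in, '<', $in_file) or die "Cannot read $in_file: $!\n";
    my $total = 0;
    $total++ while <$in>;
    close($in);

    die "Asked for $n lines but the file only has $total\n" if $n > $total;

    # Second pass: keep each line with probability needed/remaining
    open($in, '<', $in_file) or die "Cannot read $in_file: $!\n";
    open(my $out, '>', $out_file) or die "Cannot write $out_file: $!\n";
    my $needed    = $n;
    my $remaining = $total;
    while (my $line = <$in>) {
        # rand($remaining) is uniform on [0, $remaining)
        if (rand($remaining) < $needed) {
            print {$out} $line;
            $needed--;
        }
        $remaining--;
    }
    close($in);
    close($out) or die "Error closing $out_file: $!\n";
    return;
}

# Example call (file names are placeholders):
# sample_exact('treatment.bed', 'treatment_sample.bed', 10_000_000);
```

      As with the plain rand() test, the selection will differ on each run unless you seed the generator with srand().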



      • #4
        Thanks for your suggestion, Simon!

        Here is my offering of Perl code for anyone who finds it useful:

        #!/usr/bin/perl

        use strict;
        use warnings;
        use List::Util 'shuffle';

        # Randomise the order of lines in a file
        # Ian Donaldson. Nov. 2009

        # Usage
        unless (@ARGV == 2) {
            die("Usage: $0 <input file> <output file>\n");
        }

        # Open files, stopping with an error if either cannot be opened
        open(my $input,  '<', $ARGV[0]) or die("Cannot read $ARGV[0]: $!\n");
        open(my $output, '>', $ARGV[1]) or die("Cannot write $ARGV[1]: $!\n");

        # Put whole file into memory (OK unless very big)
        my @list = <$input>;

        # Shuffle array
        my @shuffled = shuffle(@list);

        # Print shuffled array to output
        print {$output} @shuffled;

        # Close files
        close($input);
        close($output) or die("Error closing $ARGV[1]: $!\n");

        exit;



        • #5
          Better random line extractor script

          Here is another version of a script that will extract N random lines from a BED file and produce a ChIP alignment formatted file. This is all designed for use with GLITR, but could be adapted for other formats. An input file of ~40 million tags/lines still required in excess of 2 GB of memory!
          Attached Files
