  • Downsampling a bam file for a specific number of reads

    Is there an easy way to downsample a bam file for a specific number of reads? I know that picard and samtools can downsample to a proportion of reads (like 10% of your total reads) but I have three replicates for which I would like to extract an equal number of random reads.
    Thanks in advance for any help!

  • #2
    I have a tool, ReformatReads, that will extract an exact number of randomly distributed reads from a file. It's designed for fasta, fastq, and scarf, but it will also process sam files (and bam files if samtools is installed) if the cigar strings are in 1.4 format (with = and X instead of M) and the reads are not paired. If the reads are paired or in an older format, you would need to sample the fastq files first and remap them. It can convert sam/bam to fastq regardless of the cigar string version, but sam/bam make no guarantees about read order, so pairing information will be lost (so don't do that!)

    reformat.sh in=reads.fq out=sampled.fq samplereads=1000

    ...would sample 1000 reads from a fastq file, or 1000 pairs if the file was interleaved.

    reformat.sh in1=r1.fq in2=r2.fq out1=s1.fq out2=s2.fq samplereads=1000

    ...would sample 1000 pairs from paired read files.

    reformat.sh in=mapped.bam out=sampled.bam samplereads=1000

    ...should sample 1000 reads from a bam file if all of the conditions are met (samtools installed, single-ended, sam 1.4 format reads).

    Alternatively, if you have variable-length reads, you can sample a specific number of bases' worth of reads with the "samplebases" flag.
    Last edited by Brian Bushnell; 03-27-2014, 05:08 PM.


    • #3
      Here are a few ideas:


      Also look into bamtools random


      • #4
        StreamSampler

        I uploaded a command line utility I created to uniformly sample lines from a stream of input. Using samtools in a bash command shell, you can use this command to uniformly downsample a bam file to a specific number of reads:

        (
        samtools view -H [bamfile];
        samtools view -F 0x004 [bamfile] |
        java -jar StreamSampler.jar [# of reads to sample] [total # reads]
        ) |
        samtools view -bS - > [sampled bam file]

        It's important to keep in mind that this only does the downsampling, which, as Brian mentions above, would result in a bam file with inconsistent flags if the data is paired. If this is important for your application, you would need another step to fix the flags on reads whose mates were discarded... Alternatively, you could avoid fixing flags by sampling the read names and then keeping only the reads whose names match one of the sampled names, so that mates stay together. If you want to try this out, you can download the jar from

        shenkers/sampling (command line utility for uniformly sampling from a stream with a known number of elements)
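
For readers who want to see the idea behind sampling an exact number of lines from a stream whose length is known in advance, here is a minimal Python sketch of selection sampling. It is not StreamSampler's actual code (I have not checked how the jar is implemented); the function name `sample_stream` and its parameters are made up for illustration.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_stream(items: Iterable[T], k: int, total: int,
                  seed: Optional[int] = None) -> Iterator[T]:
    """Yield k items chosen uniformly at random from a stream of `total` items.

    Hypothetical helper for illustration. Input order is preserved, and the
    stream is read once without being held in memory.
    """
    if not 0 <= k <= total:
        raise ValueError("k must be between 0 and total")
    rng = random.Random(seed)
    still_needed = k
    for seen, item in enumerate(items):
        if still_needed == 0:
            return
        remaining = total - seen  # items not yet decided, including this one
        # Keep this item with probability still_needed / remaining.
        if rng.randrange(remaining) < still_needed:
            still_needed -= 1
            yield item
```

If `total` matches the real number of lines, exactly `k` lines come out; if the stream turns out to be shorter than `total`, fewer may be returned. Reading alignment lines from `sys.stdin` and writing the kept ones to `sys.stdout` would let it sit in the same place in the pipeline as the jar. Running it over unique read names instead of alignment lines is one way to get the name list for the pair-preserving approach described above.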

        Comment


        • #5
          samtools sort -n bam namesorted
          samtools view namesorted.bam | head -n xxxxxx
          or just calculate the percentage and use picard?
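
The "calculate the percentage" route can be sketched as below. The replicate names and read counts are invented placeholders, and `downsample_fraction` is a hypothetical helper; in practice the totals would come from counting the reads in each bam file first.

```python
def downsample_fraction(target_reads: int, total_reads: int) -> float:
    """Fraction of reads to keep so that roughly target_reads remain."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if target_reads < 0:
        raise ValueError("target_reads must not be negative")
    # A file that already has fewer reads than the target is kept whole.
    return min(1.0, target_reads / total_reads)


# Placeholder read counts for three replicates (not real data).
replicates = {"rep1": 12_000_000, "rep2": 9_500_000, "rep3": 15_000_000}

# Downsample every replicate to the size of the smallest one.
target = min(replicates.values())
for name, total in replicates.items():
    fraction = downsample_fraction(target, total)
    print(f"{name}: keep fraction {fraction:.4f}")
```

The resulting fraction is what a proportion-based downsampler expects. Because those tools keep each read with that probability, the final counts will be close to the target but usually not exactly equal to it.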
