  • Introduction and request for BWA information

    Hello,

    I am a new student of bioinformatics from Seattle, and so far it's a fascinating field. I am starting to work on a school project and am still a little lost, since this is my first contact with the field and the tools.

    As part of the project, I would like to test BWA with a genome (it does not have to be as long as the human one; something smaller and easier to work with would be great) and reads of different lengths and error rates. The goal of the test would be to see how accurate BWA is when aligning reads of different lengths and with different error rates, and how its performance degrades as the length of the reads grows.

    I have the Windows versions of BWA and SAMtools from Codeplex, as recommended in a different thread.

    My question is, where can I find data to test BWA as mentioned above? How could I test different lengths/error rates? Any quick, general instructions on how to start would be greatly appreciated.

    Thanks again, it's a pleasure to be here.

  • #2
    The best way - in fact, I would say, the only way - to test how accurate an aligner is, would be with synthetic data. The E. coli reference would be nice for this; just download that file and rename it to ecoli.fa (alternatively, you could just use human chromosome 21).

    If you download BBMap, you can generate random reads like this:

    randomreads.sh -Xmx1g ref=ecoli.fa build=1 out=reads.fq maxq=10 minq=10 len=100 reads=100000

    That will generate 100000 reads of 100bp length, all of quality 10 (meaning a 10% chance of error per base; quality 20 is 1%, quality 30 is 0.1%, etc.). They will be randomly distributed around the E. coli genome, and every read will have a header indicating its genomic origin. You can also add insertions and deletions with other flags, like "delrate=0.5 maxdellen=20 maxdels=3", which would put deletions in 50% of the reads, of length 1 to 20, and up to 3 deletions per read - specifically, a 50% chance of 1+ deletions, a 25% chance of 2+ deletions, and a 12.5% chance of 3 deletions.
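
    To make the quality-to-error-rate relationship concrete, here is a minimal sketch that uses the standard Phred definition p = 10^(-Q/10). It is not part of BBMap; the function names phred_to_error and expected_errors are made up for illustration, and it assumes a POSIX shell with awk available.

```shell
# Convert a Phred quality score Q to a per-base error probability,
# using the standard definition p = 10^(-Q/10).
# (Hypothetical helper, not a BBMap or BWA command.)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Expected number of erroneous bases in a read of the given length when
# every base has the same quality (as with minq=maxq in the command above).
# Arguments: read length, Phred quality.
expected_errors() {
    awk -v len="$1" -v q="$2" 'BEGIN { printf "%g\n", len * 10 ^ (-q / 10) }'
}
```

    For example, "phred_to_error 20" prints 0.01, and "expected_errors 100 10" prints 10, so the 100bp quality-10 reads above should carry about 10 substitution errors each on average.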

    After you map with an aligner, you will get a sam file. You can evaluate it like this:

    gradesam.sh in=mapped.sam reads=100000

    This will give you the true positive, false positive, and false negative mapping rates, both strict (requiring both read ends to map back to the exact origin) and loose (requiring at least 1 end to map back to within 20bp of the origin), as well as rate of ambiguous mapping.
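
    To cover several read lengths and error rates in one go, a rough sketch like the following could print the commands for each combination. The lengths (50, 100, 150), the qualities (10, 20, 30), the file-naming scheme, and the function name print_sweep are illustrative assumptions. The BBMap flags are only the ones shown above, and "bwa index" / "bwa mem" assume a BWA version that includes the mem command, which an older Windows build may not. Nothing is executed here; the function only prints.

```shell
# Dry run: print (do not execute) one generate/map/grade command set per
# combination of read length and base quality.
# Assumes that, when the printed commands are actually run, the reference
# file exists and randomreads.sh, gradesam.sh and bwa are on the PATH.
print_sweep() {
    ref=$1      # reference FASTA, e.g. ecoli.fa
    reads=$2    # number of reads per data set, e.g. 100000

    # The BWA index only needs to be built once per reference.
    echo "bwa index $ref"

    for len in 50 100 150; do          # illustrative read lengths
        for q in 10 20 30; do          # illustrative Phred qualities
            tag="len${len}_q${q}"      # hypothetical file-naming scheme
            echo "randomreads.sh -Xmx1g ref=$ref build=1 out=$tag.fq maxq=$q minq=$q len=$len reads=$reads"
            echo "bwa mem $ref $tag.fq > $tag.sam"
            echo "gradesam.sh in=$tag.sam reads=$reads"
        done
    done
}
```

    Running "print_sweep ecoli.fa 100000" lists the commands for review; piping that output to sh would run them in order. Comparing the gradesam.sh results across the nine data sets then shows how accuracy changes with length and error rate.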

    P.S. If you want to do everything in Windows, the shell scripts won't work. You need to have Java installed, and run the programs like this:

    java -Xmx1g -cp path/to/bbmap/current align2.RandomReads3 ref=ecoli.fa build=1 out=reads.fq maxq=10 minq=10 len=100 reads=100000

    and

    java -Xmx1g -cp path/to/bbmap/current align2.GradeSamFile in=mapped.sam reads=100000

    BBMap also runs in Windows. You can run it like this:

    java -Xmx1g -cp path/to/bbmap/current align2.BBMap ref=ecoli.fa in=reads.fq out=mapped.sam
    Last edited by Brian Bushnell; 03-19-2014, 10:21 PM.


    • #3
      Originally posted by Brian Bushnell
      The best way - in fact, I would say, the only way - to test how accurate an aligner is, would be with synthetic data. The E. coli reference would be nice for this; just download that file and rename it to ecoli.fa (alternatively, you could just use human chromosome 21).
      Brian,

      Thank you so much for such a thorough and clear response. This is exactly the kind of direction I needed (and much more than I expected). I will try it out right away. Thanks again!


      • #4
        Originally posted by calatian
        Brian,

        Thank you so much for such a thorough and clear response. This is exactly the kind of direction I needed (and much more than I expected). I will try it out right away. Thanks again!
        You're welcome