Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Low frequency variant caller for any ploidy level

    Hi all,

    we've developed a generic, sensitive and fast method (quite a mouthful but it's true, see for yourself ) for prediction of very rare variants called LoFreq and I'd be happy if other people would try it out. LoFreq can be used to predict variants in any type of data, i.e. viral or bacterial populations, but also pooled data, human genome data etc. It's neither restricted to haploid nor diploid data. It's main area of application is high coverage Illumina data (makes use of base call quality scores), but as long as you have a SAM/BAM file (best with recalibrated qualities) you can simply use it.

    LoFreq takes a samtools [m]pileup (via file or pipe) as input and produces a simple list of variants including associated frequencies, p-values (use for filtering!), base-counts, strand-bias info etc. The format is self-explanatory, plain CSV at the moment, but I will add VCF-support soon. For now, if your data consists of several chromosomes, you have to run it once per chromosome.

    An example call would look like this:
    Code:
    samtools mpileup -f your-ref.fa your-bam.bam | lofreq_snpcaller.py -b 1 -o snp-out.txt -v
    Note, that you might want to play with Samtools' options, most importantly the depth cap and BAQ. For example, if you have high coverage data, it's probably best to switch samtools' coverage cap off (e.g. -d 100000) to make use of all the data available. Make sure to use a reference fasta for the pileup, as no calls can be made on columns that have an N as reference.

    The code is accessible from https://github.com/andreaswilm/LoFreq
    (requires Python 2.6 or Python 2.7).

    I'd be happy to receive any type of feedback!

    Cheers,
    Andreas

  • #2
    Hi Andreas

    Does this tool handle indels as well as SNPs?

    Thanks

    Mark

    Comment


    • #3
      Hi Mark,

      I'm afraid it doesn't at the moment. I'm not sure, but I think FreeBayes might be able to handel indels.

      Andreas

      Comment


      • #4
        Hi Andreas,

        Do I understand the following correctly. If I have a reference fasta with two segments (from a virus), do I need to split those two segments into separate fasta files and run them each separately?

        Comment


        • #5
          Hi,

          No need to split the fasta file. Just run the Samtools/LoFreq combo once for each fragment, each time providing the corresponding fragment name to samtools, so that the [m]pileup is created for one fragment at a time.

          The reason is the following: The current version of LoFreq only prints SNV coordinates, without a chromosome/sequence name. In order to avoid a mixup you better run the analysis separately per chromosome. We'll release a version that can handle this properly and produces standard vcf output in the coming weeks.

          Let me know if I can help further,
          Andreas

          Comment


          • #6
            Hi all,

            please note: we've moved the project over to Sourceforge
            and also updated to a new version (much faster, lots of bug-fixes etc)

            Andreas
            Last edited by me_myself_andI; 07-12-2012, 12:31 AM.

            Comment


            • #7
              Hi all,

              we've released version 0.3.1, which is much easier to use. The most visible changes are: Samtools is now called internally, support for regions (bed), chromosome awareness (overdue) and a "somatic" (SNVs unique to one sample) SNV calling pipeline script.

              The paper is now accessible as well, see http://nar.oxfordjournals.org/cgi/co...St&keytype=ref

              Andreas

              Comment


              • #8
                Hi Andreas,

                I have installed LoFreq but have some issues with samtools. Our system-wide copy of samtools is outdated and so I have 0.1.18 installed in one of my user folders. I use an alias as well as exported it's path to my .bashrc, so when I run samtools from the command line, it uses the 0.1.18 version. However, when I run lofreq_snpcaller.py, it appears to be reverting to the outdated version as it can't find the 'depth' command. Is there a way to tell the script the path to my updated samtools copy?

                Cheers,
                WK

                Comment


                • #9
                  Hi WK,

                  an alias won't work, because as LoFreq will use a system call to execute samtools. It therefore will use samtools found in the first path mentioned in your PATH variable.
                  Can you make sure that the directory for 0.1.18 comes before any other installation? In other words, make sure that after removing the alias, a simple samtools call will execute the right version.

                  The next version has samtools build-in, so quirks like this can't happen.

                  Andreas

                  Comment


                  • #10
                    Hi Andreas,

                    Thanks for that. I managed to get it to work by inserting my samtools path in front of the PATH variable so that it gets looked at first.

                    WK

                    Comment


                    • #11
                      Hi Andreas,

                      Another question.. I am running the lofreq_uniq_pipeline.py script on a tumour-normal whole exome pair that has been realigned and re calibrated by GATK.

                      I get numerous warnings about mismatches between base count and coverage value
                      WARNING [2013-01-22 15:12:50,484]: Mismatch between number of bases (= 749) and samtools coverage value (= 750). Ins/del events: 0/0. Cleaned base_str is....

                      Is this expected?

                      WK

                      Comment


                      • #12
                        Hi Wengkhong,

                        it's not expected, but you can safely ignore those messages. It's actually a bug in one of the routines checking the integrity of the data. The data is fine, so don't worry. This is fixed in the latest version 0.5.0 which I uploaded last week. As a bonus: this version also integrates mapping quality.

                        Andreas

                        Comment


                        • #13
                          Hi Andreas,

                          I've been getting result which is puzzling when doing a viral quasispecies analysis with LoFreq . When I ran lofreq_snpcaller.py with the -Q filter, I get less variants with the default base quality of 3 than when I ran with 20. Perhaps I'm missing something here, I thought by filtering out lower quality bases, I should get more reliable and less variants? In addition, I'm also getting different number of variants using versions 0.4 and 0.5. This leads me to the next question of which parameters to use for filtering the raw variants calls. Should I ignore all those with AF values below 0.05? Thank you.

                          ZL

                          Comment


                          • #14
                            Hi ZL,

                            when you change the Q parameter, you filter all bases, i.e. reference and non-reference bases. This is not necessarily what you want. Qualities are built into LoFreq's model. By filtering too harshly you introduce unnecessary biases. Low quality bases are not a problem per se for LoFreq; low quality means higher error probability and therefore higher chance of seeing a random error (i.e. a SNV becomes less likely). All this is part of the model. I wouldn't play with the default parameters unless you have good reason to do so.

                            Regarding the changes between 0.4 and 0.5: By default the newer version builds mapping quality into the model as well by combining base and mapping qualities. You can switch this off with --dont-join-mapq-and-baseq. Again, I usually wouldn't mess with the defaults unless you know your mapping qualities are completely off. In addition, the automatic Bonferroni settings (for p-value filtering) have been slightly changed in the new version.

                            For filtering recommendations have a look at the Wiki. If the qualities in your BAM file were calibrated with GATK, then we the only recommended filtering steps would be a strand-bias and coverage filter. No need to filter based on frequency (we've in-vitro validated SNVs down to 0.5%(!) and in silico the resolution can go much lower depending on the coverage etc), unless you have prior knowledge. Without GATK recalibration you might see a few spurious SNVs at the lower frequency range.

                            Andreas

                            Comment


                            • #15
                              Variant calling

                              Hi Andreas,

                              I used SRMA to perform local realignment so indells can be detected. However, Lofreq only generates null bam file after SRMA treatment. Is there a fix to this situation? I can see my INDEL clearly from looking at my bam file on IGV.

                              Thanks,
                              Joe

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X