SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
Variant Frequency Calculation (non-SNP)? dmtruong Bioinformatics 0 12-27-2011 06:53 PM
PubMed: Resequencing of 200 human exomes identifies an excess of low-frequency non-sy Newsbot! Literature Watch 1 12-02-2010 08:04 AM
Software for variant frequency sasignor Bioinformatics 4 04-30-2010 01:31 PM
samtools pileup for low frequency alleles? greigite Bioinformatics 0 03-02-2010 04:43 PM

Reply
 
Thread Tools
Old 03-20-2012, 02:03 AM   #1
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default Low frequency variant caller for any ploidy level

Hi all,

we've developed a generic, sensitive and fast method (quite a mouthful but it's true, see for yourself ) for prediction of very rare variants called LoFreq and I'd be happy if other people would try it out. LoFreq can be used to predict variants in any type of data, i.e. viral or bacterial populations, but also pooled data, human genome data etc. It's neither restricted to haploid nor diploid data. It's main area of application is high coverage Illumina data (makes use of base call quality scores), but as long as you have a SAM/BAM file (best with recalibrated qualities) you can simply use it.

LoFreq takes a samtools [m]pileup (via file or pipe) as input and produces a simple list of variants including associated frequencies, p-values (use for filtering!), base-counts, strand-bias info etc. The format is self-explanatory, plain CSV at the moment, but I will add VCF-support soon. For now, if your data consists of several chromosomes, you have to run it once per chromosome.

An example call would look like this:
Code:
samtools mpileup -f your-ref.fa your-bam.bam | lofreq_snpcaller.py -b 1 -o snp-out.txt -v
Note, that you might want to play with Samtools' options, most importantly the depth cap and BAQ. For example, if you have high coverage data, it's probably best to switch samtools' coverage cap off (e.g. -d 100000) to make use of all the data available. Make sure to use a reference fasta for the pileup, as no calls can be made on columns that have an N as reference.

The code is accessible from https://github.com/andreaswilm/LoFreq
(requires Python 2.6 or Python 2.7).

I'd be happy to receive any type of feedback!

Cheers,
Andreas
me_myself_andI is offline   Reply With Quote
Old 06-13-2012, 02:32 AM   #2
Mark
Member
 
Location: Raleigh, NC

Join Date: Nov 2008
Posts: 51
Default

Hi Andreas

Does this tool handle indels as well as SNPs?

Thanks

Mark
Mark is offline   Reply With Quote
Old 06-13-2012, 04:22 AM   #3
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi Mark,

I'm afraid it doesn't at the moment. I'm not sure, but I think FreeBayes might be able to handel indels.

Andreas
me_myself_andI is offline   Reply With Quote
Old 06-13-2012, 05:20 AM   #4
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi Andreas,

Do I understand the following correctly. If I have a reference fasta with two segments (from a virus), do I need to split those two segments into separate fasta files and run them each separately?
kga1978 is offline   Reply With Quote
Old 06-13-2012, 07:48 AM   #5
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi,

No need to split the fasta file. Just run the Samtools/LoFreq combo once for each fragment, each time providing the corresponding fragment name to samtools, so that the [m]pileup is created for one fragment at a time.

The reason is the following: The current version of LoFreq only prints SNV coordinates, without a chromosome/sequence name. In order to avoid a mixup you better run the analysis separately per chromosome. We'll release a version that can handle this properly and produces standard vcf output in the coming weeks.

Let me know if I can help further,
Andreas
me_myself_andI is offline   Reply With Quote
Old 07-11-2012, 11:31 PM   #6
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi all,

please note: we've moved the project over to Sourceforge
and also updated to a new version (much faster, lots of bug-fixes etc)

Andreas

Last edited by me_myself_andI; 07-12-2012 at 01:31 AM.
me_myself_andI is offline   Reply With Quote
Old 10-14-2012, 07:00 PM   #7
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi all,

we've released version 0.3.1, which is much easier to use. The most visible changes are: Samtools is now called internally, support for regions (bed), chromosome awareness (overdue) and a "somatic" (SNVs unique to one sample) SNV calling pipeline script.

The paper is now accessible as well, see http://nar.oxfordjournals.org/cgi/co...St&keytype=ref

Andreas
me_myself_andI is offline   Reply With Quote
Old 01-21-2013, 09:03 PM   #8
wengkhong
Junior Member
 
Location: Singapore

Join Date: Jan 2013
Posts: 3
Default

Hi Andreas,

I have installed LoFreq but have some issues with samtools. Our system-wide copy of samtools is outdated and so I have 0.1.18 installed in one of my user folders. I use an alias as well as exported it's path to my .bashrc, so when I run samtools from the command line, it uses the 0.1.18 version. However, when I run lofreq_snpcaller.py, it appears to be reverting to the outdated version as it can't find the 'depth' command. Is there a way to tell the script the path to my updated samtools copy?

Cheers,
WK
wengkhong is offline   Reply With Quote
Old 01-21-2013, 09:17 PM   #9
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi WK,

an alias won't work, because as LoFreq will use a system call to execute samtools. It therefore will use samtools found in the first path mentioned in your PATH variable.
Can you make sure that the directory for 0.1.18 comes before any other installation? In other words, make sure that after removing the alias, a simple samtools call will execute the right version.

The next version has samtools build-in, so quirks like this can't happen.

Andreas
me_myself_andI is offline   Reply With Quote
Old 01-21-2013, 09:33 PM   #10
wengkhong
Junior Member
 
Location: Singapore

Join Date: Jan 2013
Posts: 3
Default

Hi Andreas,

Thanks for that. I managed to get it to work by inserting my samtools path in front of the PATH variable so that it gets looked at first.

WK
wengkhong is offline   Reply With Quote
Old 01-21-2013, 11:18 PM   #11
wengkhong
Junior Member
 
Location: Singapore

Join Date: Jan 2013
Posts: 3
Default

Hi Andreas,

Another question.. I am running the lofreq_uniq_pipeline.py script on a tumour-normal whole exome pair that has been realigned and re calibrated by GATK.

I get numerous warnings about mismatches between base count and coverage value
WARNING [2013-01-22 15:12:50,484]: Mismatch between number of bases (= 749) and samtools coverage value (= 750). Ins/del events: 0/0. Cleaned base_str is....

Is this expected?

WK
wengkhong is offline   Reply With Quote
Old 01-28-2013, 12:36 AM   #12
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi Wengkhong,

it's not expected, but you can safely ignore those messages. It's actually a bug in one of the routines checking the integrity of the data. The data is fine, so don't worry. This is fixed in the latest version 0.5.0 which I uploaded last week. As a bonus: this version also integrates mapping quality.

Andreas
me_myself_andI is offline   Reply With Quote
Old 01-29-2013, 11:27 AM   #13
zlu
Member
 
Location: Southeast Asia

Join Date: Nov 2008
Posts: 34
Default

Hi Andreas,

I've been getting result which is puzzling when doing a viral quasispecies analysis with LoFreq . When I ran lofreq_snpcaller.py with the -Q filter, I get less variants with the default base quality of 3 than when I ran with 20. Perhaps I'm missing something here, I thought by filtering out lower quality bases, I should get more reliable and less variants? In addition, I'm also getting different number of variants using versions 0.4 and 0.5. This leads me to the next question of which parameters to use for filtering the raw variants calls. Should I ignore all those with AF values below 0.05? Thank you.

ZL
zlu is offline   Reply With Quote
Old 01-29-2013, 06:29 PM   #14
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi ZL,

when you change the Q parameter, you filter all bases, i.e. reference and non-reference bases. This is not necessarily what you want. Qualities are built into LoFreq's model. By filtering too harshly you introduce unnecessary biases. Low quality bases are not a problem per se for LoFreq; low quality means higher error probability and therefore higher chance of seeing a random error (i.e. a SNV becomes less likely). All this is part of the model. I wouldn't play with the default parameters unless you have good reason to do so.

Regarding the changes between 0.4 and 0.5: By default the newer version builds mapping quality into the model as well by combining base and mapping qualities. You can switch this off with --dont-join-mapq-and-baseq. Again, I usually wouldn't mess with the defaults unless you know your mapping qualities are completely off. In addition, the automatic Bonferroni settings (for p-value filtering) have been slightly changed in the new version.

For filtering recommendations have a look at the Wiki. If the qualities in your BAM file were calibrated with GATK, then we the only recommended filtering steps would be a strand-bias and coverage filter. No need to filter based on frequency (we've in-vitro validated SNVs down to 0.5%(!) and in silico the resolution can go much lower depending on the coverage etc), unless you have prior knowledge. Without GATK recalibration you might see a few spurious SNVs at the lower frequency range.

Andreas
me_myself_andI is offline   Reply With Quote
Old 04-17-2014, 03:19 PM   #15
madonjoe
Junior Member
 
Location: San Diego

Join Date: Dec 2013
Posts: 6
Default Variant calling

Hi Andreas,

I used SRMA to perform local realignment so indells can be detected. However, Lofreq only generates null bam file after SRMA treatment. Is there a fix to this situation? I can see my INDEL clearly from looking at my bam file on IGV.

Thanks,
Joe
madonjoe is offline   Reply With Quote
Old 04-20-2014, 08:16 PM   #16
me_myself_andI
Member
 
Location: Singapore

Join Date: Nov 2010
Posts: 30
Default

Hi Joe,

I'm sorry, but currently available versions of LoFreq can only predict substitutions, but not indels (see earlier post in this thread). Version 2.1 of LoFreq will be able to predict indels and should be available within the next two months.

Andreas
me_myself_andI is offline   Reply With Quote
Old 04-21-2014, 08:39 AM   #17
madonjoe
Junior Member
 
Location: San Diego

Join Date: Dec 2013
Posts: 6
Default Thanks

Thanks, Andrea! LoFreq rocks.
madonjoe is offline   Reply With Quote
Reply

Tags
diploid, haploid, samtools, snp calling, variant calling

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:10 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO