**HESmith** · 02-29-2012, 06:29 AM

We've used a pipeline of BFAST -> Samtools -> Annovar with success for S. cerevisiae. However, be aware that the SNP density is very high, and you'll need high read coverage (at least 100X) to obtain accurate results.

**Noa** · 03-12-2012, 11:34 PM

Thanks. Can you please elaborate on that pipeline?
Also, I am working on data previously generated by the new lab I joined. The data was collected on Illumina, not from a single clone but from a mix of ~150 yeast clones (pooled into a single Illumina lane without barcoding). The goal is to find the genes that cause a specific phenotype. Is this feasible, or should I redo the experiment and sequence single clones?
Thanks

**matan8** · 03-20-2012, 02:24 AM

I am working with bwa -> samtools -> GATK. I haven't verified much of my work so far, but it looks good.

**swbarnes2** · 03-20-2012, 08:27 AM

150 clones together? So that a true mutation would be seen in < 1% of the reads? You'll need huge coverage to distinguish true rare mutations from background error, and I'm not sure off the top of my head what software will reliably call SNPs like that.

If you redid, say, 10 clones, found their mutations, then sanger sequenced candidate genes in the rest of the clones, that might work better.
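The coverage concern above can be made concrete with a rough binomial calculation. This is only a sketch: the per-base error rate, the minimum read count, and the assumption that the clones are haploid and equally represented are all illustrative choices, not values from the thread.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed from the lower tail."""
    lower_tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower_tail


N_CLONES = 150      # pool size mentioned in the thread
ERROR_RATE = 0.01   # ASSUMPTION: per-base sequencing error rate
MIN_READS = 5       # ASSUMPTION: reads required to believe a variant

# ASSUMPTION: haploid clones, equally represented, mutation in a single clone.
variant_freq = 1 / N_CLONES
# ASSUMPTION: errors are spread evenly over the three wrong bases.
error_freq = ERROR_RATE / 3

for depth in (100, 1000, 10000):
    p_true = prob_at_least(MIN_READS, depth, variant_freq)
    p_error = prob_at_least(MIN_READS, depth, error_freq)
    print(
        f"depth={depth:>6}  expected variant reads={depth * variant_freq:7.1f}  "
        f"P(>={MIN_READS} reads | real variant)={p_true:.3f}  "
        f"P(>={MIN_READS} reads | errors only)={p_error:.3f}"
    )
```

Under these assumptions the variant frequency (about 0.67%) and the per-allele error frequency (about 0.33%) are of similar magnitude, which is why depth alone does not cleanly separate the two.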

**HESmith** · 03-20-2012, 09:07 AM

What elaboration would you like? I'm happy to answer specific questions.

Regarding the 150 pooled clones: are these merely independent segregants from the same diploid genotype, or isolates from 150 different mutant strains? If the latter, the data will be useless for identifying mutations. If the former, then you should be fine. See previous comment re: coverage.

-Harold

**Noa** · 03-20-2012, 11:02 AM

Thanks for all your help on this. OK, so the way I understand it (and please don't ask why the experiment was done this way... I was not involved then): we have ~200-fold coverage of each of the parent lines, and ~200x coverage of a bulk sample from the 5th generation after various backcrosses to one of the parents (made by just taking DNA from all the yeast, not from a set number of individuals, so I don't even know whether a few of the yeasts are more highly represented than others). Then we have about 600x coverage of the 10% of the yeast that showed the phenotype of interest. This was done by taking ~100 individual yeast clones, extracting DNA, and using identical quantities of their DNA to build an Illumina library (so each of these 100 clones is roughly equally represented). I think the idea was something like extreme QTL analysis. Is it possible/likely that many of these 100 clones harbor the same few mutations (as they came from the same parents and presumably got the phenotype from one of the parents via introgression before the backcrossing), and that the coverage would therefore be enough to identify something?

**HESmith** · 03-20-2012, 11:26 AM

The coverage should be sufficient for mutation identification using the following criteria.
1) The causative mutation should be homozygous.
2) If the parental strains used for sequencing are pre-mutagenesis, then the causative mutation should be unique (i.e., absent in the parents).
3) Variants that were preexisting in the mutagenized strain and tightly linked to the causative mutation should also be homozygous (and, conversely, unique variants from the backcross strain should be absent in this interval).
4) Variants that are unique to either parent should be heterozygous at most loci.

Good luck,
Harold

**Noa** · 03-20-2012, 11:44 AM

1) How can the causative mutation be homozygous if my sequencing data is from 100 strains? Can I just use allele frequency and assume that the frequency should be much higher than in the sample sequenced from the entire generation (i.e., not restricted to clones with a particular phenotype)?
2) There was no mutagenesis, so I can't know whether a SNP occurred randomly and was selected because it gives the particular phenotype, or whether the cause is one or a few genes contributed by the donor parent at the beginning of the introgression.
3) Same problem as in 1: how can I be sure it is homozygous if we are looking at a population? Can I use allele frequency?
4) I wasn't sure what you meant by #4. Why heterozygous?

**HESmith** · 03-20-2012, 12:24 PM

From the way you described the experiment, I assumed that you have a single variant locus that produces your phenotype of interest. The criteria I outlined are based on the parental and pooled data sets only. I also assumed that the pooled sample came from segregants of parent A crossed to parent B.

1) You said that you picked and pooled only those isolates that had the phenotype; each of those isolates should contain the causative mutation, which will appear as a homozygous variant in that sample (i.e., allele frequency should be 1).
2) Okay, so you can't use uniqueness as a criterion.
3 & 4) You have data from each of the parent strains. Identify all of the variants present in parent A and in parent B. Each variant will be unique to A, unique to B, or present in both. Ignore the last. Unlinked variants in your pooled sample will segregate randomly and be present in ~50% of the isolates; those will be reported as heterozygotes. Linked variants should be present or absent from all isolates for the same reason as in #1.
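The logic in points 1 to 4 can be sketched in a few lines. Everything here is hypothetical scaffolding: the function names, the 0.9 and 0.1 cutoffs, and the example sites are assumptions chosen for illustration, and real variant calls would come from whatever pipeline you use.

```python
def parental_origin(in_parent_a, in_parent_b):
    """Label a variant by which parent(s) carry it."""
    if in_parent_a and in_parent_b:
        return "shared"   # present in both parents: ignore
    if in_parent_a:
        return "A"
    if in_parent_b:
        return "B"
    return "novel"        # absent from both parents


def pool_state(alt_reads, depth, fixed_cutoff=0.9, absent_cutoff=0.1):
    """Classify a site in the pooled sample by its alternate-allele frequency.

    The cutoffs are ASSUMPTIONS; tune them to your coverage and error rate.
    """
    if depth == 0:
        return "no_coverage"
    freq = alt_reads / depth
    if freq >= fixed_cutoff:
        return "fixed"        # near 1.0: reported as homozygous
    if freq <= absent_cutoff:
        return "absent"       # near 0.0: allele missing from every isolate
    return "segregating"      # near 0.5 for unlinked loci: reported as heterozygous


# Hypothetical example sites: (name, in parent A, in parent B, alt reads, depth)
sites = [
    ("chrIV:1000", True, False, 590, 600),
    ("chrIV:1200", False, True, 8, 600),
    ("chrVII:500", True, False, 310, 600),
    ("chrII:42", True, True, 600, 600),
]

for name, in_a, in_b, alt, depth in sites:
    origin = parental_origin(in_a, in_b)
    if origin == "shared":
        continue  # uninformative, as described above
    state = pool_state(alt, depth)
    linked = state in ("fixed", "absent")
    print(f"{name}: unique to {origin}, pool state = {state}, candidate linked = {linked}")
```

Runs of adjacent sites that are all "fixed" for one parent and "absent" for the other would mark the interval linked to the phenotype.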

If the assumptions that I made were incorrect, then the analysis becomes more complicated. For example, if the phenotype results from two loci, then you'll have to look for two homozygous alleles in your pooled sample. Or, if the pooled sample was generated after five backcrosses to parent B, then you'll have to filter out the homozygous parent B variants from your pooled sample since they're a consequence of the backcrossing rather than the phenotype.
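For the backcross case, a small sketch of the filtering step follows. The field names and the 0.9 cutoff are assumptions, and the expected-frequency formula assumes an F1 followed by repeated backcrosses to the same parent with no selection at the locus.

```python
def expected_donor_fraction(n_backcrosses):
    """Expected donor-allele frequency at a locus unlinked to the phenotype.

    The F1 carries 1/2 donor alleles and each backcross to the recurrent
    parent halves that, giving (1/2) ** (n_backcrosses + 1).
    """
    return 0.5 ** (n_backcrosses + 1)


def drop_recurrent_parent_sites(sites, fixed_cutoff=0.9):
    """Remove sites that are fixed in the pool for a recurrent-parent variant.

    Each site is a dict with the hypothetical keys 'name', 'pool_freq'
    (alternate-allele frequency in the pool) and 'in_recurrent_parent'.
    Such sites are expected from the backcrossing itself, so they carry no
    information about the phenotype.
    """
    return [
        site
        for site in sites
        if not (site["in_recurrent_parent"] and site["pool_freq"] >= fixed_cutoff)
    ]


# Hypothetical example data
sites = [
    {"name": "chrIV:1000", "pool_freq": 0.98, "in_recurrent_parent": False},
    {"name": "chrV:300", "pool_freq": 0.99, "in_recurrent_parent": True},
    {"name": "chrX:77", "pool_freq": 0.03, "in_recurrent_parent": False},
]

print("expected donor fraction after 5 backcrosses:", expected_donor_fraction(5))
for site in drop_recurrent_parent_sites(sites):
    print("kept:", site["name"], site["pool_freq"])
```

After the filter, a donor-derived variant that is still near frequency 1 in the phenotype pool stands out sharply against the low background expected from the formula.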

One more complication: since your mutation may be spontaneous, it may be a transposon insertion. Standard SNP pipelines will almost certainly not detect this type of lesion, so you'll need to screen your data by a different approach.

**Noa** · 03-21-2012, 12:48 AM

Thanks for all your help.
One more question: you mentioned a transposon insertion. I was planning on looking for indels as well, and I assume I need something different for this. Are there any tools you know of?

And finally, one additional worry: which genome should I map back to? I have been mapping SNPs so far using the reference S288C yeast genome, which is more or less identical to one of the parents we used. Our other parent is an S. cerevisiae isolate from nature. My worry is that if a gene (or genes) is present only in the natural isolate, we could miss it entirely among the "unmapped" reads. Is it common for large regions or whole genes to go unmapped when mapping a natural isolate to the reference genome? Should I assemble the entire parental genome, or should I BLAST contigs made by de novo assembly of the unmapped reads?
Thanks again...

**HESmith** · 03-21-2012, 05:04 AM

Check the wiki for recommended software for indel/structural variant analysis. You can also use split-end reads (found here) for both transposon and indel mapping. De novo assembly of the unmapped reads might be useful in identifying novel segments of the natural isolate.
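As a starting point for the de novo assembly idea, here is a minimal sketch that pulls unmapped reads out of SAM-format text and writes them as FASTA. It relies only on the SAM convention that the second column is the FLAG field and that bit 0x4 marks an unmapped read; the function names and the example records are made up for illustration, and a real run would stream the aligner's output rather than a hard-coded list.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: the read itself is unmapped


def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for every unmapped record in SAM text."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, sequence = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED_FLAG:
            yield name, sequence


def to_fasta(reads):
    """Format (name, sequence) pairs as FASTA text for an assembler or BLAST."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in reads)


# Hypothetical SAM records: one mapped read and one unmapped read
example_sam = [
    "@SQ\tSN:chrI\tLN:230218",
    "read1\t0\tchrI\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCAA\tIIIIIIII",
]

print(to_fasta(unmapped_reads(example_sam)), end="")
```

Contigs assembled from these reads can then be compared against the reference to see whether they represent novel segments of the natural isolate or simply low-quality reads.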

Recommendations for yeast mutation identification