SEQanswers

Old 02-12-2013, 07:16 AM   #1
mmanhart
Member
 
Location: United States

Join Date: Jan 2013
Posts: 11
When the reference sequence isn't perfect

What does one do for alignment and variant discovery when the reference sequence doesn't exactly provide the baseline expectation that you want? Specifically, I have sequencing data at several time points from an experimentally-evolved population of yeast. The yeast strain is YPH500, which has no published reference genome, so I've been using the standard S288C reference. Although this is very close in most places, there are numerous loci where the strains differ. So when I align the reads to S288C, of course there are many mismatches, but some are due to evolution occurring during our experiment (which are the main interest) and some are just the differences between YPH500 and S288C (which are not the main interest). Are there any standard strategies for dealing with this situation? Currently I'm thinking of just filtering out any variants in loci that appear to have major strain differences. This seems like a decent conservative approach, but I could lose interesting variants in the process.

Thanks in advance for any suggestions!
Old 02-12-2013, 07:29 AM   #2
yzzhang
Member
 
Location: florida

Join Date: Jan 2013
Posts: 66

Why not try to assemble the YPH500 genome using your reads? There is a pipeline, PAGIT, published in Nature Protocols that may be able to do this job.
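
For instance, here is a rough sketch of producing a draft assembly to feed into PAGIT (Velvet is used only as an example assembler; the k-mer length and file names are placeholders you would need to adapt):

Code:
# Draft de novo assembly of the YPH500 reads with Velvet
# (k=31 is just an example; tune it for your read length and coverage;
#  paired reads are assumed interleaved in one FASTQ file)
velveth yph500_asm 31 -shortPaired -fastq yph500_interleaved.fq
velvetg yph500_asm -exp_cov auto -cov_cutoff auto

# yph500_asm/contigs.fa is the draft that PAGIT's improvement steps
# would then polish before you call variants against it.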
Old 02-12-2013, 07:52 AM   #3
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 502

Use your first time point as your baseline, and filter out the variants that were present in that sample.
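
For example, one way to implement that subtraction with bwa + bcftools (a minimal sketch; the caller and file names are just placeholders for whatever pipeline you already use, assuming paired-end Illumina reads):

Code:
# Align the baseline (t0) and a later time point (t5) to S288C
bwa index S288C.fa
bwa mem S288C.fa t0_R1.fq t0_R2.fq | samtools sort -o t0.bam -
bwa mem S288C.fa t5_R1.fq t5_R2.fq | samtools sort -o t5.bam -
samtools index t0.bam
samtools index t5.bam

# Call variants in each sample against S288C
bcftools mpileup -f S288C.fa t0.bam | bcftools call -mv -Oz -o t0.vcf.gz
bcftools mpileup -f S288C.fa t5.bam | bcftools call -mv -Oz -o t5.vcf.gz
bcftools index t0.vcf.gz
bcftools index t5.vcf.gz

# Keep only variants present at t5 but absent at t0, i.e. subtract the
# YPH500-vs-S288C baseline differences that are there from the start
# (the README in the output directory says which file holds the t5-only records)
bcftools isec -C -p t5_not_in_t0 t5.vcf.gz t0.vcf.gz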
Old 02-12-2013, 08:58 AM   #4
Zam
Member
 
Location: Oxford

Join Date: Apr 2010
Posts: 51

Or use de novo assembly to call variants directly between your samples, ignoring the reference, as was done here in S. aureus, also a longitudinal study:

Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. B. Young, T. Golubchik et al., Proc. Natl. Acad. Sci. USA (2012). doi:10.1073/pnas.1113219109

The pipeline is published here:
High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2012;
http://bioinformatics.oxfordjournals...ntent/29/2/275

and the basics first published here

De novo assembly and genotyping of variants using colored de Bruijn graphs. Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, Nature Genetics (2012) doi:10.1038/ng.1028


You can still use the S288C reference to provide coordinates (if you choose to), but the variant discovery can completely ignore the reference. I've used it on yeast by the way, so I know it works there.

Feel free to contact me directly (zam AT well.ox.ac.uk) for more details.

best wishes

Zam
Old 07-23-2013, 07:51 PM   #5
tracecakes
Member
 
Location: Sydney

Join Date: Jun 2013
Posts: 14

Hi everyone,

I am in a similar situation and was wondering if anyone could give me some advice too.

We want to align tiger reads to the cat (felCat5) reference genome; however, colleagues have told me that (1) felCat5 is horrible and I might as well align to the dog reference genome (CanFam3), and (2) we are too poor for deep sequencing and cannot take a de novo assembly approach...

One idea that has come up to improve the alignment would be to choose a different sequencing approach, i.e. 100 bp PE reads vs. 150 bp PE reads vs. 150 bp single-end reads (Illumina), except I am not sure which one would be best. (mmanhart, what did you guys end up doing?)

Does anyone have any idea about advantages/disadvantages between these?
Old 07-23-2013, 08:43 PM   #6
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 465

tracecakes, what is the goal of your project? Many genotyping-by-sequencing projects don't have a good reference available, and one strategy we've used is to run a small set of the samples as overlapping PE reads to make pseudo-read contigs of ~180 bp. In my anecdotal experience, the longer length does help with mapping. If done on a MiSeq, this could give quite long mappable reads.

But in these cases, the longer pseudoreads are just used to help map to a close genome to identify synteny and therefore likely nearby genes. The short reads, piled to high depth, are used for the SNP calling, since methods like RAD or nextRAD will focus the reads on discrete loci across a genome and don't require a reference genome to identify variation between samples.

But if finding SNPs isn't your goal, the short take-away is that longer reads do seem to help with alignment to a not-so-great reference. We (in my academic lab) have also developed RAD PE contigs to get 500 bp - 5 kb pseudo-reads (see http://www.plosone.org/article/info:...l.pone.0018561), but that is an even more involved approach for situations needing the longest contigs. The alignment software is crucial, though. You probably want to go with one that allows high levels of mismatches (novoalign, for example).
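
To make the overlapping-PE idea concrete, here is a minimal sketch that merges the pairs into pseudo-reads with FLASH and then maps them (FLASH and bwa are just one possible tool combination; felCat5.fa and the read file names are placeholders, and novoalign with a relaxed mismatch threshold would be the alternative mapper mentioned above):

Code:
# Merge overlapping paired-end reads into longer pseudo-reads
# (with 2x100 bp reads over ~180 bp fragments, most pairs should merge)
flash -M 100 -o merged reads_R1.fq reads_R2.fq
# merged.extendedFrags.fastq now contains the joined ~180 bp pseudo-reads

# Map the merged pseudo-reads to the imperfect reference
bwa index felCat5.fa
bwa mem felCat5.fa merged.extendedFrags.fastq | samtools sort -o merged.bam -
samtools index merged.bam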
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 07-24-2013, 07:37 PM   #7
mmanhart
Member
 
Location: United States

Join Date: Jan 2013
Posts: 11

In my case, the strategy so far has been to simply filter out regions where the strains have major differences (which can be determined easily using our initial time point data, or just by looking at what differences are present in all samples). The risk here is that we lose data on any real mutations in those regions. Since my data are just from a different strain of yeast, there aren't too many of these regions, and they mainly involve transposons and other low-interest features, so I believe this isn't a major sacrifice. In more divergent cases (e.g., tiger vs. cat or dog) this might be a big loss.
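
In case it is useful to anyone reading later, a minimal sketch of that masking step (assuming bcftools; strain_diffs.bed and calls.vcf.gz are placeholder names):

Code:
# strain_diffs.bed lists regions where YPH500 and S288C differ
# (compiled from the initial time point, or from variants shared by all samples)
bcftools view -T ^strain_diffs.bed calls.vcf.gz -Oz -o calls.masked.vcf.gz
bcftools index calls.masked.vcf.gz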

De novo assembly is still on my radar, though. Perhaps had I started on that track from the beginning it would have been preferable, but at this point I'm trying to get by without it.

Michael
Old 07-25-2013, 03:08 PM   #8
tracecakes
Member
 
Location: Sydney

Join Date: Jun 2013
Posts: 14

Thanks for the advice, guys. SNPsaurus, we do want to call SNPs and genotype, and we will probably use the MiSeq. I think I will try the pseudo-read contig approach with Velvet... I've never done it or heard of it before, so thanks for the enlightenment!
Old 07-25-2013, 07:06 PM   #9
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 465

If you are doing a MiSeq run, I'd aim for a paired-end run with a little bit of overlap. PE 250 will give you 400-450 bp overlapped reads, which is what you'd expect to get with the RAD PE contig approach of local assembly. It is hard to get high numbers of reads on long fragments (amplification of the library is biased toward small fragments, and there may be additional bias on the flow cell), so getting contigs of 1 kb is rare.

We found the overlap type of library much easier to make, and the informatics simpler.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com