**yzzhang** · 02-12-2013, 08:29 AM

Why not try to assemble the YPH500 genome using your reads? The PAGIT pipeline, published in Nature Protocols, may be able to do this job.

Originally posted by mmanhart

What does one do for alignment and variant discovery when the reference sequence doesn't exactly provide the baseline expectation that you want? Specifically, I have sequencing data at several time points from an experimentally-evolved population of yeast. The yeast strain is YPH500, which has no published reference genome, so I've been using the standard S288C reference. Although this is very close in most places, there are numerous loci where the strains differ. So when I align the reads to S288C, of course there are many mismatches, but some are due to evolution occurring during our experiment (which are the main interest) and some are just the differences between YPH500 and S288C (which are not the main interest). Are there any standard strategies for dealing with this situation? Currently I'm thinking of just filtering out any variants in loci that appear to have major strain differences. This seems like a decent conservative approach, but I could lose interesting variants in the process.

Thanks in advance for any suggestions!

**HESmith** · 02-12-2013, 08:52 AM

Use your first time point as your baseline, and filter out the variants that were present in that sample.
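A minimal sketch of this baseline subtraction, assuming each time point has already been called against S288C and the calls are available as VCF-style tab-separated lines. The names `Variant`, `parse_vcf_lines`, and `subtract_baseline` and the example records are hypothetical, for illustration only:

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-style lines, skipping headers."""
    variants = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several alternate alleles separated by commas.
        for allele in alt.split(","):
            variants.add(Variant(chrom, int(pos), ref, allele))
    return variants


def subtract_baseline(sample, baseline):
    """Keep only the variants that were not already present at the first time point."""
    return sample - baseline


# Made-up records: the chrI site is a YPH500 vs. S288C difference seen from the start.
baseline_lines = ["#CHROM\tPOS\tID\tREF\tALT", "chrI\t100\t.\tA\tG"]
later_lines = ["#CHROM\tPOS\tID\tREF\tALT", "chrI\t100\t.\tA\tG", "chrII\t250\t.\tC\tT"]

baseline = parse_vcf_lines(baseline_lines)
later = parse_vcf_lines(later_lines)
novel = subtract_baseline(later, baseline)
print(sorted(novel))
```

Matching on exact position and alleles is the simplest choice. For a population sample it ignores allele frequency, so a variant that was already segregating at low frequency in the first time point would also be removed.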

**Zam** · 02-12-2013, 09:58 AM

Or use de novo assembly to directly call variants between your samples, ignoring the reference, as was done here in S. aureus, also in a longitudinal study:

Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. B Young, T Golubchik et al, Proc. Nat. Acad. Sci (2012) (doi:10.1073/pnas.1113219109)

The pipeline is published here:
High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2012

http://bioinformatics.oxfordjournals.org/content/29/2/275

and the basics first published here

De novo assembly and genotyping of variants using colored de Bruijn graphs. Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, Nature Genetics (2012) doi:10.1038/ng.1028

You can still use the S288C reference to provide coordinates (if you choose to), but the variant discovery can completely ignore the reference. I've used it on yeast by the way, so I know it works there.

Feel free to contact me directly (zam AT well.ox.ac.uk) for more details.

best wishes

Zam

**tracecakes** · 07-23-2013, 07:51 PM

Hi everyone,

I am in a similar situation and was wondering if anyone could give me some advice too.

We want to align tiger reads to the cat (felCat5) reference genome. However, colleagues have told me that (1) felCat5 is horrible and I might as well align to the dog reference genome (CanFam3), and (2) we are too poor for deep sequencing and cannot take a de novo assembly approach...

One idea that has come up for improving the alignment is to choose a different sequencing approach, i.e. 100 bp PE reads vs. 150 bp PE reads vs. 150 bp single reads (Illumina), except I am not sure which one would be best. (mmanhart, what did you guys end up doing?)

Does anyone have any idea about advantages/disadvantages between these?

**SNPsaurus** · 07-23-2013, 08:43 PM

tracecakes, what is the goal of your project? Many genotyping-by-sequencing projects don't have a good reference available, and one strategy we've used is to run a small set of the samples as overlapping PE reads to make pseudo-read contigs that are ~180 bp. The longer length does help with mapping, in my anecdotal experience. If done on a MiSeq this could give quite long mappable reads.

But in these cases, the longer pseudoreads are just used to help map to a close genome to identify synteny and therefore likely nearby genes. The short reads, piled to high depth, are used for the SNP calling, since methods like RAD or nextRAD will focus the reads on discrete loci across a genome and don't require a reference genome to identify variation between samples.

But if finding SNPs isn't your goal, the shorter take-away is that longer reads do seem to help with alignment to a not-so-great reference. We (in my academic lab) have also developed RAD PE contigs to get 500 bp to 5 kb pseudoreads (see http://www.plosone.org/article/info:...l.pone.0018561), but that is an even more involved approach for situations needing the longest contigs. The alignment software is crucial, though. You probably want to go with one that allows high levels of mismatches (novoalign, for example).

**mmanhart** · 07-24-2013, 07:37 PM

In my case, the strategy so far has been to simply filter out regions where the strains have major differences (which can be easily determined using our initial time point data, or just looking at what differences are present in all samples). The risk here is that we lose data on any real mutations in those regions. Since in my case the data is just from a different strain of yeast, there aren't too many of these regions and they mainly involve transposons and other low-interest stuff, so I believe this isn't a major sacrifice. In more divergent cases (e.g., tiger vs. cat or dog) this might be a big loss.
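A rough sketch of this kind of region filter (not the code actually used here), assuming variants are reduced to `(chrom, pos)` pairs and that differences seen in every sample are treated as strain differences. The window size, the count threshold, the function names, and the example positions are all hypothetical:

```python
from collections import Counter


def divergent_windows(strain_diffs, window=1000, min_count=3):
    """Return the (chrom, window_index) bins where strain differences cluster."""
    counts = Counter((chrom, pos // window) for chrom, pos in strain_diffs)
    return {key for key, n in counts.items() if n >= min_count}


def drop_masked(candidates, masked, window=1000):
    """Discard candidate variants that fall inside a masked window."""
    return [(chrom, pos) for chrom, pos in candidates
            if (chrom, pos // window) not in masked]


# Made-up call sets for two time points.
samples = [
    {("chrI", 1010), ("chrI", 1200), ("chrI", 1750), ("chrII", 500)},
    {("chrI", 1010), ("chrI", 1200), ("chrI", 1750), ("chrII", 500),
     ("chrI", 1500), ("chrIII", 42)},
]

shared = set.intersection(*samples)        # present in all samples
masked = divergent_windows(shared)         # regions with major strain differences
candidates = sorted(samples[-1] - shared)  # calls that are new at the last time point
kept = drop_masked(candidates, masked)
print(kept)
```

In this toy data the candidate at chrI:1500 is dropped because it sits in a masked window, which is exactly the kind of real mutation such a filter can lose.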

De novo assembly is still on my radar, though. Perhaps had I started on that track from the beginning it would have been preferable, but at this point I'm trying to get by without it.

Michael

**tracecakes** · 07-25-2013, 03:08 PM

Thanks for the advice, guys. SNPsaurus, we do want to call SNPs and genotype, and we will probably use the MiSeq. I think I will try the pseudo-read contig approach with velvet... I've never done it or heard of it before, so thanks for the enlightenment!

**SNPsaurus** · 07-25-2013, 07:06 PM

If you are doing a MiSeq run I'd aim for a paired-end run with a little bit of overlap. PE 250 will give you 400-450 bp overlapped reads, which is what you'd expect to get with the RAD PE contig approach of local assembly. It is hard to get high numbers of reads on long fragments (amplification of the library is biased toward small fragments, and there may be additional bias on the flow cell), so getting contigs of 1 kb is rare.

We found making the overlap type of library much easier, and the informatics simpler.
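To illustrate the overlap idea: two reads of length L that overlap by k bases merge into a pseudo-read of 2L - k bases, so PE 250 reads overlapping by 50 to 100 bp give the 400-450 bp mentioned above. Below is a toy sketch of the merging step, assuming error-tolerant suffix/prefix matching is good enough. The function names, the thresholds, and the example fragment are hypothetical, and this is not how any particular merging tool works internally:

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read and its mate into one pseudo-read, or return None if no overlap."""
    mate = reverse_complement(read2)  # bring the mate onto the same strand as read1
    # Try the longest possible overlap first: suffix of read1 vs. prefix of mate.
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-olap:], mate[:olap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olap:
            return read1 + mate[olap:]
    return None


# Made-up 30 bp fragment sequenced as two 20 bp reads that overlap by 10 bp.
fragment = "ACGTTGCAAGGCTTACCGATGGCATTCAGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])

merged = merge_pair(read1, read2)
print(merged, len(merged))
```

Real data would also need base qualities to resolve disagreements inside the overlap, which this sketch ignores by simply keeping the bases from the first read.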

When the reference sequence isn't perfect