  • When the reference sequence isn't perfect

    What does one do for alignment and variant discovery when the reference sequence doesn't exactly provide the baseline expectation that you want? Specifically, I have sequencing data at several time points from an experimentally-evolved population of yeast. The yeast strain is YPH500, which has no published reference genome, so I've been using the standard S288C reference. Although this is very close in most places, there are numerous loci where the strains differ. So when I align the reads to S288C, of course there are many mismatches, but some are due to evolution occurring during our experiment (which are the main interest) and some are just the differences between YPH500 and S288C (which are not the main interest). Are there any standard strategies for dealing with this situation? Currently I'm thinking of just filtering out any variants in loci that appear to have major strain differences. This seems like a decent conservative approach, but I could lose interesting variants in the process.

    Thanks in advance for any suggestions!

  • #2
    Why not try to assemble the YPH500 genome using your reads? The PAGIT pipeline, published in Nature Protocols, may be able to do this job.


    • #3
      Use your first time point as your baseline, and filter out the variants that were present in that sample.
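A minimal sketch of this baseline subtraction in Python, assuming each sample's calls are available as plain-text VCF lines. The function names (`variant_keys`, `novel_variants`) and the example records are hypothetical, and the sketch ignores allele frequencies, which matter in a population sample where a variant may already be present at low frequency in the first time point.

```python
from typing import Iterable, List, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect one key per alternate allele from tab-separated VCF records."""
    keys: Set[VariantKey] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def novel_variants(baseline_lines: Iterable[str],
                   later_lines: Iterable[str]) -> List[VariantKey]:
    """Return variants seen at a later time point but not in the baseline."""
    baseline = variant_keys(baseline_lines)
    return sorted(variant_keys(later_lines) - baseline)


# Made-up records for illustration only.
baseline = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chrI\t100\t.\tA\tG",
    "chrII\t2500\t.\tC\tT",
]
later = [
    "chrI\t100\t.\tA\tG",
    "chrII\t2500\t.\tC\tT",
    "chrIV\t777\t.\tG\tA",
]
print(novel_variants(baseline, later))  # [('chrIV', 777, 'G', 'A')]
```

Matching on the full (chromosome, position, ref, alt) key rather than position alone keeps a new allele at a site that already differed between the strains.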



      • #4
        Or use de novo assembly to call variants directly between your samples, ignoring the reference, as was done here in S. aureus, also in a longitudinal study:

        Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. B. Young, T Golubchik et al, Proc. Nat. Acad. Sci (2012) (doi:10.1073/pnas.1113219109)

        The pipeline is published here:
        High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics (2012)

        and the basics were first published here:

        De novo assembly and genotyping of variants using colored de Bruijn graphs. Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, Nature Genetics (2012) doi:10.1038/ng.1028


        You can still use the S288C reference to provide coordinates (if you choose to), but the variant discovery can completely ignore the reference. I've used it on yeast by the way, so I know it works there.

        Feel free to contact me directly (zam AT well.ox.ac.uk) for more details.

        best wishes

        Zam



        • #5
          Hi everyone,

          I am in a similar situation and was wondering if anyone could give me some advice too.

          We want to align tiger reads to the cat (felCat5) reference genome; however, colleagues have told me that 1. felCat5 is horrible and I might as well align to the dog reference genome (CanFam3), and 2. we are too poor for deep sequencing and cannot do a de novo assembly approach...

          One idea that has popped up to improve the alignment would be to choose a different sequencing approach, i.e. 100 bp PE reads vs. 150 bp PE reads vs. 150 bp single reads (Illumina), except I am not sure which one would be best. (mmanhart, what did you guys end up doing?)

          Does anyone have any idea about advantages/disadvantages between these?



          • #6
            tracecakes, what is the goal of your project? Many genotyping-by-sequencing projects don't have a good reference available, and one strategy we've used is to run a small set of the samples as overlapping PE reads to make pseudo-read contigs that are ~180 bp. In my anecdotal experience, the longer length does help with mapping. If done on a MiSeq this could give quite long mappable reads.
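To illustrate how overlapping PE reads become a single longer pseudo-read, here is a toy Python sketch. It assumes error-free reads and an exact overlap, which real merging tools do not require; the function names, the `min_overlap` default, and the example fragment are all hypothetical.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair into one pseudo-read using an exact overlap.

    read2 is reverse-complemented onto the strand of read1, then the longest
    suffix of read1 that equals a prefix of read2 is used as the overlap.
    Returns None when no overlap of at least min_overlap bases is found.
    """
    r2 = reverse_complement(read2)
    longest = min(len(read1), len(r2))
    for k in range(longest, min_overlap - 1, -1):
        if read1[-k:] == r2[:k]:
            return read1 + r2[k:]
    return None


# Made-up 30 bp fragment sequenced as 2 x 20 bp, so the reads overlap by 10 bp.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
merged = merge_pair(read1, read2)
print(merged == fragment, len(merged))  # True 30
```

In practice a dedicated read-merging program that tolerates mismatches and uses base qualities would be the safer choice; the sketch only shows why the merged read is longer than either input read.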

            But in these cases, the longer pseudoreads are just used to help map to a close genome to identify synteny and therefore likely nearby genes. The short reads, piled to high depth, are used for the SNP calling, since methods like RAD or nextRAD will focus the reads on discrete loci across a genome and don't require a reference genome to identify variation between samples.

            But if finding SNPs isn't your goal, the shorter take-away is that longer reads do seem to help with alignment to a not-so-great reference. We (in my academic lab) have also developed RAD PE contigs to get 500 bp - 5kb pseudoreads (see http://www.plosone.org/article/info:...l.pone.0018561), but that is an even more involved approach for situations needing the longest contigs. The alignment software is crucial, though. You probably want to go with one that allows high levels of mismatches (novoalign, for example).
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



            • #7
              In my case, the strategy so far has been to simply filter out regions where the strains have major differences (which can be easily determined using our initial time point data, or just looking at what differences are present in all samples). The risk here is that we lose data on any real mutations in those regions. Since in my case the data is just from a different strain of yeast, there aren't too many of these regions and they mainly involve transposons and other low-interest stuff, so I believe this isn't a major sacrifice. In more divergent cases (e.g., tiger vs. cat or dog) this might be a big loss.
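A rough Python sketch of this kind of region filter, assuming the positions where the initial time point differs from S288C have already been extracted per chromosome. The clustering rule and its thresholds (`max_gap`, `min_variants`) are hypothetical choices for illustration, not the values actually used here.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates


def build_mask(baseline_positions: Dict[str, List[int]],
               max_gap: int = 100,
               min_variants: int = 3) -> Dict[str, List[Interval]]:
    """Turn dense clusters of baseline differences into masked regions.

    Positions closer than max_gap are chained into one cluster; clusters with
    fewer than min_variants positions are treated as isolated differences and
    are not masked as a region.
    """
    mask: Dict[str, List[Interval]] = {}
    for chrom, positions in baseline_positions.items():
        clusters: List[List[int]] = []
        current: List[int] = []
        for pos in sorted(positions):
            if current and pos - current[-1] > max_gap:
                clusters.append(current)
                current = []
            current.append(pos)
        if current:
            clusters.append(current)
        mask[chrom] = [(c[0], c[-1]) for c in clusters if len(c) >= min_variants]
    return mask


def is_masked(mask: Dict[str, List[Interval]], chrom: str, pos: int) -> bool:
    """True if the position falls inside any masked region on that chromosome."""
    return any(start <= pos <= end for start, end in mask.get(chrom, []))


# Made-up baseline differences for illustration only.
baseline_diffs = {"chrI": [1000, 1040, 1090, 5000], "chrII": [200]}
mask = build_mask(baseline_diffs)
print(mask)  # {'chrI': [(1000, 1090)], 'chrII': []}
print(is_masked(mask, "chrI", 1050), is_masked(mask, "chrI", 5000))  # True False
```

Variants from later time points that fall inside a masked interval would be dropped, which is exactly where real mutations in those regions can be lost.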

              De novo assembly is still on my radar, though. Perhaps had I started on that track from the beginning it would have been preferable, but at this point I'm trying to get by without it.

              Michael



              • #8
                Thanks for the advice, guys. SNPsaurus, we do want to call SNPs and genotype, and we will probably use the MiSeq. I think I will try the pseudo-read contig approach with Velvet... I've never done it or heard of it before, so thanks for the enlightenment!



                • #9
                  If you are doing a MiSeq run I'd aim for a paired-end run with a little bit of overlap. PE 250 will give you 400-450 bp overlapped reads, which is what you'd expect to get with the RAD PE contig approach of local assembly. It is hard to get high numbers of reads on long fragments (amplification of the library is biased toward small fragments, and there may be additional bias on the flow cell), so getting contigs of 1 kb is rare.

                  We found making the overlap type of library much easier, and the informatics simpler.
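The length arithmetic behind those numbers can be checked with a small Python helper (the function name is hypothetical). It assumes the merged read is simply the two reads minus the shared bases, so 400-450 bp from PE 250 implies roughly 50-100 bp of overlap.

```python
def merged_length(read_length: int, overlap: int) -> int:
    """Length of a pseudo-read built from two equal-length overlapping reads."""
    if not 0 < overlap <= read_length:
        raise ValueError("overlap must be between 1 and read_length")
    return 2 * read_length - overlap


print(merged_length(250, 100))  # 400
print(merged_length(250, 50))   # 450
```

By the same arithmetic, a ~180 bp pseudo-read would correspond to, for example, 2 x 100 bp reads overlapping by 20 bp, although the read length behind that figure is an assumption here.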
                  Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
