  • Illumina for resequencing E. coli

    Hi, I just started a project where we need to resequence many E. coli mutants using an Illumina GAII system. Does anyone have thoughts on paired-end versus single-end runs, potentially with multiplexing to save $$? I have estimated that we can run ~5 multiplexed samples to get >10X coverage for each (4.6 Mbp genome), but I have no experience with what the actual data will look like (errors, other things I haven't thought of, etc.). For anyone who has done resequencing: is 10X coverage realistic for identifying SNPs or other mutations?

  • #2
    For SNP and mutation discovery, people usually aim for 20 to 30X.


    • #3
      If I were faced with this question, I'd probably start with thinking about the following:

      Paired-end vs. single-end. What is the real cost per base pair for each approach? The read length would figure in here. Given the read lengths you are using and your E. coli strains (are these lab strains? wild strains?), what fraction of reads will not be unambiguously mappable with only single-end coverage? There are probably some E. coli datasets in the Short Read Archive to play with. Alternatively, you could simulate some (there are various tools out there; getting the error profile right could be a bugaboo), or do a pilot run at very long read length with paired ends, from which you can then simulate any lesser version of the data (single end, shorter reads).

      For the depth question, you really need to think about the risk of false positive errors (a miscall being classified as a SNP because you do not get reads to contradict it) versus false negative errors (insufficient depth to find the SNP). How worried are you about each?

      One more set of considerations: how will the results be used? What will the next experiment look like? For example, if next you retest mutations by another sequencing strategy (PCR-based Sanger, for example), how many positives do you want to screen? Or is the next step really expensive, such as generating the same mutation in another background or expressing mutant protein or such?
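
One rough way to put numbers on the depth question is sketched below. It assumes per-position depth follows a Poisson distribution around the mean coverage (real libraries are usually more uneven than that), and the threshold of 4 supporting reads is a hypothetical choice for illustration, not a recommendation from this thread.

```python
import math


def prob_depth_at_least(mean_coverage: float, min_reads: int) -> float:
    """P(depth >= min_reads) at one position, assuming Poisson-distributed depth."""
    # Sum P(depth = k) for every k below the threshold, then take the complement.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - below


def expected_undercovered_positions(
    genome_size: int, mean_coverage: float, min_reads: int
) -> float:
    """Expected number of positions with fewer than min_reads reads."""
    return genome_size * (1.0 - prob_depth_at_least(mean_coverage, min_reads))


if __name__ == "__main__":
    GENOME_SIZE = 4_600_000  # E. coli genome size quoted in the first post
    MIN_READS = 4            # hypothetical minimum depth needed to call a SNP
    for coverage in (10, 20, 30):
        missed = expected_undercovered_positions(GENOME_SIZE, coverage, MIN_READS)
        print(f"{coverage}X mean coverage: ~{missed:.0f} positions below {MIN_READS} reads")
```

Under this simplified model, the number of under-covered positions falls off very quickly as mean coverage rises, which is one way to read the usual advice to aim above 10X.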


      • #4
        I am still waiting to find out pricing for single end versus paired end runs with multiplexing, but I am trying to figure out if I should be using one versus the other for any specific reason other than price. Our ultimate goal is to sequence mutants of E. coli K12 to better understand pathway evolution. We will be confirming mutations by Sanger sequencing regions of interest so my biggest concern would probably be false negative errors.

        As I am very new to this area, where would I find these E. coli datasets? Also, are there any programs I can run on a laptop with 1 GB RAM to look at this (running Windows XP)? We are going to be purchasing a new computer for the actual data analysis, but we haven't decided what to get quite yet.

        Thanks for the help!


        • #5
          Paired-end reads don't offer any advantage for identifying SNPs, although they can be useful for identifying indels. Unless you have a large number of strains to analyze, 30X coverage is reasonable. A single lane of Illumina short-read sequence (30 million reads x 36bp) is sufficient for 30X coverage for seven barcoded strains. Reagent costs for the clustering and sequencing are ~$600, and costs per sample library can be >$100 with a home-made protocol. So you're looking at ~$200 per strain; your labor costs for the time spent on data analysis will be MUCH more than that!

          -Harold
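
The arithmetic behind these figures can be sketched as follows. The sketch assumes reads are split evenly across the barcoded strains and that every base is usable; the `barcode_length` parameter and the flat $100 library cost are illustrative assumptions, not figures confirmed in the thread.

```python
def mean_coverage(
    reads_per_lane: int,
    read_length: int,
    genome_size: int,
    samples_per_lane: int = 1,
    barcode_length: int = 0,
) -> float:
    """Average depth per sample when one lane is shared evenly between samples."""
    # If the barcode is read in-line, those cycles do not cover the genome.
    usable_bases_per_read = read_length - barcode_length
    total_bases = reads_per_lane * usable_bases_per_read
    return total_bases / samples_per_lane / genome_size


def cost_per_strain(lane_reagents: float, samples_per_lane: int, library_cost: float) -> float:
    """Share of the lane reagents plus one library prep per strain."""
    return lane_reagents / samples_per_lane + library_cost


if __name__ == "__main__":
    # Numbers from the post: 30 million reads x 36 bp, seven strains, 4.6 Mbp genome.
    depth = mean_coverage(30_000_000, 36, 4_600_000, samples_per_lane=7)
    print(f"Mean coverage per strain: {depth:.1f}X")

    # Hypothetical 6-base in-line barcode, to see how much depth it costs.
    depth_bc = mean_coverage(30_000_000, 36, 4_600_000, samples_per_lane=7, barcode_length=6)
    print(f"With a 6-base barcode: {depth_bc:.1f}X")

    print(f"Cost per strain: ${cost_per_strain(600, 7, 100):.0f}")
```

Changing `samples_per_lane` shows how quickly depth drops as more strains share a lane.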


          • #6
            SNPs or transposons moving around. That is the question.

            I had much better luck with SOLiD mate pairs in resolving unknown bacterial mutations, while only after the fact were we able to confirm with frag reads.


            • #7
              Thanks for all of the information, it is hard to find everything in one place!

              Our plan is to run paired-end with 36 cycles to catch genomic rearrangements as well as SNPs. I am thinking of starting by multiplexing 6 strains per lane and seeing what the coverage ends up looking like. Does anyone have thoughts on the size of barcodes? We don't want to lose too much sequence, but we need to maintain fidelity. Also, with paired-end runs do you have to put the barcode on both ends? I was thinking you wouldn't, but after reading some of the threads I am now confused on this issue again.

              Thanks again!


              • #8
                Please ignore the barcoding question relating to paired-end reads; I just realized I was thinking about it incorrectly!


                • #9
                  Originally posted by ECO:
                  SNPs or transposons moving around. That is the question.

                  I had much better luck with SOLiD mate pairs in resolving unknown bacterial mutations, while only after the fact were we able to confirm with frag reads.
                  I agree that SOLiD would be better for this, but we only have an Illumina system available to us. We think the most likely mutations will be SNPs, but it is likely we will also find a variety of insertions/deletions. I am not sure if we will really be able to see transposons, but it would be nice to see those as well. I am hoping a paired-end Illumina run with 36 cycles will pick up most of those, but since I haven't had any experience with this, I am not sure which type of Illumina run will be best.

                  Thanks!


                  • #10
                    I've just finished analyzing our dataset of twelve barcoded E. coli samples (one WT and 11 mutants). Three lanes of 36bp single-end reads at less than maximum density produced enough sequence for 30X average depth of coverage. Ours were spontaneous mutations, so we anticipated that most would be transposon hops. I resorted to an inelegant but successful strategy to ID the insertion sites. I generated a second 'chromosome' file that contained the ends of all known insertion elements, and removed those sequences from the E. coli reference genome. I then aligned the 5' and 3' ends of my reads independently, then IDed the reads that mapped to different chromosomes (ECO was right; it would have been easier with paired-end data). Nine mutant strains contained unique transposon insertions that were absent in the WT. I used BFAST (shout-out to Nils) to screen for SNPs; one of the two remaining mutants contained a single unique SNP encoding a missense mutation. (The last mutant might contain a transposon in a repeated sequence; I'm still parsing the data).

                    Paired-end 36bp sequencing would be the way to go. In addition to making the transposon strategy work more smoothly, you could identify large rearrangements in the same manner (i.e., the ends would map to different loci in the reference genome).

                    -Harold
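
A minimal sketch of that paired-end idea is given below. It assumes the mates have already been aligned by some aligner and that each pair has been reduced to a reference name and position per mate. The `PairAlignment` record, the reference names, and the `max_span` cut-off of 500 bp are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class PairAlignment(NamedTuple):
    """Hypothetical summary of one aligned read pair."""
    pair_id: str
    ref1: str   # reference the first mate aligned to
    pos1: int
    ref2: str   # reference the second mate aligned to
    pos2: int


def classify_pair(pair: PairAlignment, max_span: int = 500) -> str:
    """Label a pair as concordant or as a candidate junction/rearrangement."""
    if pair.ref1 != pair.ref2:
        # e.g. one mate in the genome, the other in the insertion-element file
        return "different_reference"
    if abs(pair.pos1 - pair.pos2) > max_span:
        # Mates far apart on the same reference. Note that on a circular
        # genome, pairs spanning the origin would also land here.
        return "distant_loci"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        PairAlignment("p1", "K12", 1_000, "K12", 1_200),
        PairAlignment("p2", "K12", 1_000, "IS_ends", 5),
        PairAlignment("p3", "K12", 1_000, "K12", 900_000),
    ]
    for pair in pairs:
        print(pair.pair_id, classify_pair(pair))
```

In practice one would look for several independent pairs supporting the same locus before treating it as a candidate for follow-up.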


                    • #11
                      Cool!

                      If you want to have fun & have a sufficient machine, you could also feed each read set into Velvet & then take the contigs & BLAST them against the reference E. coli genome & your transposons.


                      • #12
                        I would also recommend doing de novo assembly, even though it requires deeper coverage and therefore costs more.


                        • #13
                          Ours was a one-off experiment, so I didn't invest the time in mastering a de novo assembler (I'm a molecular biologist, so it takes me longer). But if we were working with a larger collection of mutations, I could see the advantage of building a reference genome from the starting strain. And coverage wouldn't be a problem; even one lane of 36bp single end would yield >90X (and surely that's enough for an accurate assembly).

                          -Harold


                          • #14
                            Hi Harold,
                            We recently got data from several multiplexed 36bp single-read E. coli samples and are starting to analyze them (we couldn't find anyone to run paired-end lanes with us at the time). I have been using Bowtie to align the data to the E. coli K-12 MG1655 genome, but am having problems with transposons and other large insertions and was curious about the method you described to find them. I am still pretty new at this and couldn't quite understand what you did with the second chromosome file? I think we may try some de novo sequencing down the line, but I don't know when the software will be available on our server and would really like to get started identifying regions of interest for Sanger sequencing.

                            Thanks!
                            Jamie


                            • #15
                              Hi Jamie,

                              By default, most aligners return the reads that have a single best match to the reference genome. Transposons are present in multiple copies, so reads derived from transposons will produce multiple best matches. To use the approach I took, you want to remove the transposon sequences from the K-12 genome and build a second reference file that contains a single copy of each transposon.

                              Specifically:

                              1) Build a new reference file that contains 20 bases from the 5' and 3' end of each transposon/insertion element.
                              2) Replace those sequences with N's in the K-12 genome.
                              3) Use your favorite aligner to map the first 16 bases of your reads to the two reference files.
                              4) Repeat the alignment with the last 16 bases of your reads.
                              5) Identify the set of reads where the first 16 bases align to the K-12 reference and the last 16 bases align to the transposon reference, and vice versa. These are the junction reads (genome/transposon).
                              6) Use the genome alignment position to map the location of your reads to known or novel insertion sites.
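
The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code actually used: the function names are hypothetical, the transposon coordinates for masking are assumed to come from an annotation of the reference, and the alignment itself (steps 3 and 4) is left to whatever aligner you prefer, with its unique hits summarized as a `read_id -> (reference_name, position)` dictionary.

```python
def transposon_end_tags(name: str, seq: str, tag_len: int = 20) -> dict:
    """Step 1: the first and last tag_len bases of one insertion element."""
    return {f"{name}_5p": seq[:tag_len], f"{name}_3p": seq[-tag_len:]}


def mask_intervals(genome: str, intervals: list) -> str:
    """Step 2: replace each 0-based, half-open [start, end) interval with N's."""
    chars = list(genome)
    for start, end in intervals:
        end = min(end, len(chars))  # never extend past the end of the genome
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


def split_read_ends(read: str, end_len: int = 16) -> tuple:
    """Steps 3-4: the first and last end_len bases, to be aligned separately."""
    return read[:end_len], read[-end_len:]


def junction_reads(head_hits: dict, tail_hits: dict, genome_name: str = "K12") -> list:
    """Steps 5-6: reads with one end in the genome and the other in an element.

    Returns (read_id, genome_position, element_name) tuples.
    """
    junctions = []
    for read_id, (head_ref, head_pos) in head_hits.items():
        if read_id not in tail_hits:
            continue  # the other end did not align uniquely
        tail_ref, tail_pos = tail_hits[read_id]
        head_in_genome = head_ref == genome_name
        tail_in_genome = tail_ref == genome_name
        if head_in_genome and not tail_in_genome:
            junctions.append((read_id, head_pos, tail_ref))
        elif tail_in_genome and not head_in_genome:
            junctions.append((read_id, tail_pos, head_ref))
    return junctions


if __name__ == "__main__":
    # Toy alignment results; real ones would be parsed from aligner output.
    heads = {"r1": ("K12", 100), "r2": ("K12", 200), "r3": ("IS1_3p", 0)}
    tails = {"r1": ("IS1_5p", 0), "r2": ("K12", 220), "r3": ("K12", 500)}
    for hit in junction_reads(heads, tails):
        print(hit)
```

Grouping the returned genome positions by strain and comparing them against the wild type would then separate pre-existing insertions from new ones.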

                              Note that there are more sophisticated tools for identifying indels, but this approach worked fine for my purposes.

                              Email (smithhe2ATniddk.nih.gov) if you have further questions.

                              -Harold
