  • Extend contigs with a reference sequence

    Hi,

    I'm trying to extend my contigs to cover entire viral genomes. When I align my reads (1.5M) to a viral genome (2.6 kb), I can get full coverage of it and export a consensus sequence. When I assemble these reads, I get 2 contigs that cover about half of the viral reference sequence. I can scaffold these contigs (scaffold_builder: http://edwards.sdsu.edu/scaffold_builder/), but I am missing two large sequences at the beginning and middle of the genome (brackets denote missing sequence):
    Code:
    reference: -----------------------------------
    contigs:   [      ]--------------[    ]-------
    I know that my reads cover this sequence, but I think that they didn't make it into any contigs so the scaffold program can't use them to fill in the gaps.

    My plan is to do the following: combine the reads that map to the reference with the contigs, and run the scaffold builder on the contigs and reads together to fill in the gaps. This pipeline is running now, but I wanted to see if anyone has any other ideas about what to do.
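
    Roughly, the combining step I have in mind looks something like this (just a sketch - file names are placeholders, and the scaffold_builder call is the same one as before):
    Code:
    # map the 50 bp single-end reads to the viral reference
    bwa index ref.fasta
    bwa aln ref.fasta reads.fastq > reads.sai
    bwa samse ref.fasta reads.sai reads.fastq > aln.sam

    # keep only mapped reads (-F 4) and convert them to FASTA
    # (assumes a samtools version with the fasta subcommand)
    samtools view -bS -F 4 aln.sam > mapped.bam
    samtools fasta mapped.bam > mapped_reads.fasta

    # pool the mapped reads with the existing contigs, then scaffold as before
    cat contigs.fasta mapped_reads.fasta > combined.fasta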

    Thanks!

  • #2
    I would guess that the genome has repeats which do not assemble well. Scaffolding and gap-filling with paired reads depend on the gaps being small enough and the read insert sizes being long enough.

    If you can't solve the problem computationally, you might need a new library with longer reads or longer insert sizes. But before doing more sequencing, I would suggest trying different assemblers - many perform better in specific scenarios.

    Also, you might try a gap-filling algorithm. The only one I'm familiar with is PBJelly, which is designed for long reads (like PacBio), but there are many others.



    • #3
      Hi Brian, thanks for the reply. The virus I'm trying to assemble is this one: http://www.ncbi.nlm.nih.gov/nuccore/...report=GenBank and it doesn't seem to have too many repeats. Our read size is about 50 bp and we have single-end reads. I've used idba_ud and Velvet to assemble this genome and got slightly longer contigs from idba_ud. Most other assemblers I've used have had poorer results.



      • #4
        50bp SE reads are not very good for assembly; I would suggest going to longer reads, paired ends, and a greater kmer length if you want a quality assembly. It's probably less expensive to resequence and assemble from longer reads than to spend a lot of time generating an inferior assembly from very short reads.



        • #5
          For most assemblies, unpaired 50bp reads aren't that great. But in your case, the genome is tiny and the maximum size of any repeat in it is only 12 bp, so 50bp reads should be fine. I created around 2000 simulated 50bp reads from your genome with a 10% error rate, and Geneious assembles them into a single contig without trouble.

          Maybe the problem is that you have too much coverage. Try using a small subset of your data of just a few thousand reads.
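
          For example, something like this (seqtk is just one option here; the read count and seed are arbitrary) will pull out a random subset:
          Code:
          # sample 5000 random reads; the -s seed makes it reproducible
          seqtk sample -s 42 reads.fastq 5000 > subset_5000.fastq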

          If that doesn't help, then maybe the error rate in your data is too high. In that case you could try Geneious since it works well with high error rates.



          • #6
            Thank you for the in-depth help! I don't have Geneious, but if I do get it I will try that. One thing I've been playing around with is just generating a consensus sequence from my reads instead of assembling them. I used samtools and bwa to look at my distribution of reads, and I definitely have the whole genome covered by the reads. Then I did this:
            Code:
            samtools mpileup -uf $refseq $alignmentmappedsortedbam | bcftools view -cg - | vcfutils.pl vcf2fq > $consensus
            and got a consensus fastq file.

            My question is: is there any reason not to do this rather than de novo assembly? I realize that I need a reference genome for this to work, but I'm able to get the full sequence this way and not with de novo assembly (I haven't tried using a subset of the reads yet).

            I'm thinking of making a pipeline where I do de novo assembly of the reads -> blast contigs -> get reference sequences -> map reads to reference -> obtain consensus. Does that workflow make sense?
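
            As a rough sketch (tool choices and file names here are just placeholders for illustration), I mean something like:
            Code:
            # 1. de novo assembly (idba_ud wants FASTA; fq2fa from the idba package converts)
            fq2fa reads.fastq reads.fasta
            idba_ud -r reads.fasta -o assembly

            # 2. blast the contigs to find the closest reference (assumes a local nt database)
            blastn -query assembly/contig.fa -db nt -outfmt 6 -max_target_seqs 5 > hits.tsv

            # 3. fetch the best-hit reference into ref.fasta (manually, or with NCBI efetch)

            # 4. map the original reads back to that reference
            bwa index ref.fasta
            bwa aln ref.fasta reads.fastq > reads.sai
            bwa samse ref.fasta reads.sai reads.fastq > aln.sam
            samtools view -bS aln.sam > aln.bam
            samtools sort aln.bam aln.sorted   # older samtools syntax; newer: samtools sort -o aln.sorted.bam aln.bam
            samtools index aln.sorted.bam

            # 5. consensus, same as above
            samtools mpileup -uf ref.fasta aln.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fastq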

            The reason for doing this rather than resequencing with longer reads is that I already have a lot of 50 bp reads. Eventually (hopefully soon) we will be moving to longer, paired-end reads though.
            Last edited by arundurvasula; 07-10-2014, 11:22 AM.



            • #7
              Most likely the problem is the excessive level of coverage you've got. Most de novo assemblers (Geneious included) won't work well with that level of coverage. The reason is that the same random sequencing error ends up occurring so many times that the assembler treats it as a real variant. A few thousand reads are more than enough for your genome.

              If your reference sequence is close enough to your sequenced organism, then mapping is OK. But with a genome as small as yours, which has no significant repeats, and an appropriate level of coverage, I'd have more confidence in the de novo assembly results than in the mapping results. If you apply both methods and get the same consensus sequence, that should give you even more confidence you've got the correct result.



              • #8
                Cool, I'm working on using fewer reads now. One problem is that my reads are not evenly distributed:

                [coverage plot: read counts per base position along the genome]

                where the y axis is position along the genome and x is the number of reads at that base position.

                I'm examining the results of subsampling my reads here: https://github.com/arundurvasula/cov...depth-assembly

                I think I need a way to sample the reads while still maintaining full coverage. If I drop down to 50x coverage, I get something like this:

                [coverage plot after subsampling to ~50x]

                which is expected, but I don't think I will be able to get 1 contig from my reads.



                • #9
                  With highly irregular coverage distributions, you should probably look to normalization rather than subsampling. There's an example of how to do that here:

                  [link to a SEQanswers thread with an example]
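
                  In short (a minimal sketch - the target value and file names are just placeholders), the normalization step is a single bbnorm.sh run:
                  Code:
                  # downsample high-coverage regions toward a uniform kmer depth of ~50
                  bash bbnorm.sh in=reads.fq out=normalized.fq target=50 -Xmx2g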



                  • #10
                    That looks like exactly what I was looking for, thank you Brian!



                    • #11
                      You're welcome; please let me know if it solves the problem.



                      • #12
                        So I ran it on my data and looked at the coverage using the same method (left is subsampling, right is normalization):

                        10x:
                        [coverage plots: subsampled vs. normalized at 10x]

                        50x:
                        [coverage plots: subsampled vs. normalized at 50x]

                        So normalization is a lot better, but there's still highly variable coverage.

                        The script I used to run these analyses is here: https://github.com/arundurvasula/cov...rmalization.sh

                        Notably, I'm using
                        Code:
                        bash bbnorm.sh in=$reads out=../temp/$1/normalized_reads.fasta target=$2 -Xmx2g
                        to run bbnorm. What options can I play with to make the coverage more even?

                        Also, am I calculating coverage correctly? I saw that target refers to kmer depth and there's a formula to convert kmer depth to read depth, so I used Dr = Dk * (50 / (50 - 31 + 1)) and calculated Dk for 5x, 10x, 20x, 30x, 40x, and 50x.
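
                        To spell out my numbers: with 50 bp reads and k=31, each read contributes 50 - 31 + 1 = 20 kmers, so Dr = 2.5 * Dk; working backwards, a 50x read depth corresponds to target=20, 20x to target=8, 10x to target=4, and so on.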
                        Last edited by arundurvasula; 07-11-2014, 11:02 AM.



                        • #13
                          Because your reads are so short, you can try reducing the kmer length to perhaps 21 with the "k=21" flag. The coverage profile after normalization is currently way too spiky, and too high - it should be almost perfectly flat and very close to the target level. So normalization does not seem to be working well, which probably indicates a high level of errors or reads from slightly different genomes.

                          You can also try error-correction with the "ecc" flag, and throwing away low-quality reads, since you have excess coverage. In summary:

                          Code:
                          bash reformat.sh in=reads.fq out=highQuality.fq maq=16

                          (that will throw away all reads with average quality below 16)

                          Code:
                          bash bbnorm.sh in=$reads out=../temp/$1/normalized_reads.fasta target=$2 -Xmx2g ecc k=21

                          And while this may bias the assembly, you could also try throwing away all reads that don't map to the reference within some edit distance, prior to normalization.
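
                          For example, something along these lines with BBMap (minid here is a rough stand-in for an edit-distance cutoff, and 0.9 is an arbitrary threshold):
                          Code:
                          # keep only reads that map to the reference at roughly >=90% identity
                          bash bbmap.sh ref=ref.fasta in=reads.fq outm=mapped_only.fq minid=0.9 nodisk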
