Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Velvet tests using ILM DH10B data

    I'm working on learning more about velvet's flags, using Illumina's DH10B paired end data.



    After running it through the velvet workflow and then using mummerplot to compare it to NC_010473 (from ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_010473.fna), I get poor alignment.

    velveth EC_MiSeq 89 -create_binary -shortPaired -fastq -separate MiSeq_Ecoli_DH10B_110721_PF_R1.fastq MiSeq_Ecoli_DH10B_110721_PF_R2.fastq
    After pushing through velvetg, I get:

    Final graph has 96 nodes and n50 of 107680, max 326350, total 4459289, using 12781988/14405136 reads

    But conversely, my mummerplot is kind of strange. I was wondering if anybody else had recapitulated the ILM data for themselves so I could compare my mummerplot data to it.
    Last edited by winsettz; 01-09-2013, 11:07 AM.

  • #2
    Can you share your mummerplot results?

    Comment


    • #3
      Sure.

      Before I do, I should probably post my mummer settings.
      mummer -mum -b -c $REFERENCE $QUERY > mummer.mums
      mummerplot -p mummer mummer.mums
      To generate non-binary Sequences file for use in mummer, I use:
      velveth EC_MiSeq 89 -shortPaired -fastq -separate MiSeq_Ecoli_DH10B_110721_PF_R1.fastq MiSeq_Ecoli_DH10B_110721_PF_R2.fastq
      (Alternatively, one could convert to fasta and then cat the two files together)
      Using the Sequences file from velveth as query and DH10B reference, I generate a map that shows where the paired ends are mapping relative to reference.



      Strangely; there's a chunk of reference not represented by the ILM reads at all. (Or the reference is wrong)

      Using velvetg and no additional flags and mapping resultant contigs.fa to DH10B reference, I generate



      Using a mixture of velvetg flags and DH10B reference, I generate



      To see in a quantitative fashion how my reads map to DH10B, I ran a little bowtie2 run (without the -a, I wish I'd seen that thread in Bioinformatics!)

      bowtie2 -p 4 --end-to-end -q -x NC_010473 -1 MiSeq_Ecoli_DH10B_110721_PF_R1.fastq -2 MiSeq_Ecoli_DH10B_110721_PF_R2.fastq -S seq_DH10B_e2e_PE.sam &> map_DH10B_e2e_PE.map
      7202568 reads; of these:
      7202568 (100.00%) were paired; of these:
      241028 (3.35%) aligned concordantly 0 times
      6437903 (89.38%) aligned concordantly exactly 1 time
      523637 (7.27%) aligned concordantly >1 times
      ----
      241028 pairs aligned concordantly 0 times; of these:
      61619 (25.57%) aligned discordantly 1 time
      ----
      179409 pairs aligned 0 times concordantly or discordantly; of these:
      358818 mates make up the pairs; of these:
      255154 (71.11%) aligned 0 times
      85770 (23.90%) aligned exactly 1 time
      17894 (4.99%) aligned >1 times
      98.23% overall alignment rate
      The mummerplot of PE reads on the NCBI genome and the bowtie statistics suggest strong overlap between the read information and the reference genome (asides from that odd gap of reference that isn't represented at all in the reads), and theoretically possessing the reads that correspond to parts of the reference suggest that I have all the puzzle pieces, in practice it doesn't appear to assemble very well.
      Last edited by winsettz; 01-14-2013, 11:36 AM.

      Comment


      • #4
        You know what, using the same data and velvet, but mauve contig mover, i've seen the same thing. I never really followed up on ths. One thing to test: if you run genomeCoverageBed from bedTools on the bam file, with the -d flag (genomeCoverageBed -ibam aligned_reads.bam -d) you'll get the per-base coverage of your mapped reads. Do you see the same gap there as well?

        Comment


        • #5
          My bedtools output from
          genomeCoverageBed -ibam MiSeq_Ecoli_DH10B_110721_PF.bam -d > output


          Excel isn't the best tool to plot with, I'll redo with gnuplot. My gap does show up though.


          Using mummer -maxmatch the <1Mb gap disappears.

          Reading my *.mum file, it's interesting that I get matches in window-size 20, when in theory starting contigs should all be matched at window-size that is the size of the Illumina read (149-151).

          Edit:

          Feeding output through gnuplot instead, which doesn't cap out at 1 million entries.

          plot 'output' using 2:3
          Last edited by winsettz; 01-15-2013, 06:51 AM.

          Comment


          • #6
            DH10B is known to have a large segmental duplication around 113kb. This will be assembled into a single consensus contig, hence the reason for the large gap I suspect.

            Comment


            • #7
              Originally posted by nickloman View Post
              DH10B is known to have a large segmental duplication around 113kb. This will be assembled into a single consensus contig, hence the reason for the large gap I suspect.


              There are two major insertions in DH10B relative to
              MG1655. First, the 80dlacZM15 insertion is a mosaic element described in detail below. Second, a 113-kb region of
              genome (genomic coordinates 514341 to 627601) is precisely
              duplicated in tandem. IS5 elements immediately flanking this
              segment presumably provided the homology necessary for creating the duplication. The 106 duplicated genes encode functions ranging from membrane components and transporters to
              transcriptional regulators.
              Yup. When I map on bwa and check on IGV the reads are correctly assigned to both regions. But during de novo they are condensed. I'm thinking I should lower my max-coverage some more and hope that it flattens things out and allows me to traverse the repeat.

              I guess this is a moment where Moleculo or some other long read data would do the job, or mate-pair. Being an self-development exercise, I will work with what I have.

              Comment


              • #8
                This is basically genomic 'dark matter' that is impossible to assemble de novo right now from shotgun reads. You'd need to resort to a BAC-to-BAC method or have some super long reads, or mate pairs with huge insert sizes of over 100kb, neither are available yet .. but maybe one day, soon?

                Comment


                • #9
                  I'm surprised no secret operatives from OpGen or BioNano have chimed in here. Seems like a perfect use-case for their tech. Not much in the way of 3rd party assembly tools for these data types though. And they are pretty durn pricey relative to an Illumina assembly.

                  Comment


                  • #10
                    Or Pacbio+Illumina to traverse the repeats.

                    I wish ILM had put up some mate-pair and tried to suggest PE+MP would do the trick. Or moleculo.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 03-27-2024, 06:37 PM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-27-2024, 06:07 PM
                    0 responses
                    13 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    56 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    70 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X