Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Who's running Quake?

    Hello,

    I am trying to do some error correction on illumina data using Quake. Unfortunately, I encounter the following problems when running
    quake.py -f fastq.list -k 17 -p 8

    terminate called after throwing an instance of 'jellyfish::fastq_hash::ErrorRead
    ing'
    what(): Bad file type '
    /bin/sh: line 1: 11089 Aborted /usit/titan/u1/chrishah/programme
    s/Quake/bin/jellyfish qdump -c fastq.dbm > fastq.list.qcts
    Optimization of distribution likelihood function to choose k-mer cutoff failed.
    Very likely you have set the value of k too high or not provided adequate covera
    ge (>15x). Inspect the k-mer counts for a clear separation of the error and true
    k-mer distributions.
    Guessing quality values are on ascii 64 scale

    Anyone seen that before? Any ideas???

    Much obliged!
    Christoph

  • #2
    I don't recall the exact error we were seeing with quake but, like yours, it had to do with jellyfish failures. We were able to avoid it by supplying the --no_jelly command-line option (causing quake to use its own k-mer counter).
    David Dooling

    Comment


    • #3
      My error is:
      terminate called after throwing an instance of 'mapped_file::ErrorMMap'

      what(): Can't open file 'file_list.dbm':

      /bin/sh: line 1: 21058 Aborted /disk2/Quake/bin/jellyfish qdump -c file_list.dbm > file_list.txt.qcts

      Optimization of distribution likelihood function to choose k-mer cutoff failed. Very likely you have set the value of k too high or not provided adequate coverage (>15x). Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.

      I will try “ --no_jelly ”.

      My genome is very big, about 3G, maybe “ --no_jelly ” will take a long time.

      Comment


      • #4
        Hi,

        Thanks for the suggestion. Tried --no_jelly and got the following:

        Processing sequences...
        ....................................................................................................................................................................................................................................................................260205970 sequences processed, -1699182760 bp scanned
        WARNING: Input had 11499462 non-DNA (ACGT) characters whose kmers were not counted
        358433106 total distinct mers
        358433106 mers occur at least 0 times
        Optimization of distribution likelihood function to choose k-mer cutoff failed. Very likely you have set the value of k too high or not provided adequate coverage (>15x). Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.
        Guessing quality values are on ascii 64 scale

        tried to run cov_model.py on the kmer.txt file I got from that:

        -bash-3.2$ python ~/programmes/Quake/bin/cov_model.py kmers.txt
        Optimization of distribution likelihood function to choose k-mer cutoff failed. Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.

        So, it looks to me as if I get accross the first step, when using the --no_jelly option, but encounter problems in the next step (the cov_model.py script).

        The r.log file says:
        .
        .
        Read 30000 items
        Error in est.cov(cov) :
        Cannot identify reliable trusted k-mer coverage peak. Please verify the integrity of your dataset (e.g. by running kmer_hist.r). If there is no clear trusted k-mer distribution, the coverage is too low to correct these reads. If there is, please e-mail [email protected] and choose the cutoff by hand in the meantime
        Execution halted

        As suggested I run the kmer_hist.r script and got the attached kmer_hist.pdf. Not quite sure how to read it, though..

        Guess I should try to set the cutoff by hand as suggested. Any ideas?

        Much obliged!
        Christoph
        Attached Files

        Comment


        • #5
          Would you tell me the genome size of your sample and the sequencing coverage?

          Comment


          • #6
            Hi,

            The genome size is estimated to be about 100MB and I have about 250Million Illumnina reads (~70bp), so in theory I sould have very high coverage (>150x). What makes everything a bit difficult is that the reads contain probably quite a high percentage of contamination from the host organism. How many percent is difficult to guess.
            From first assembly attempts I would say at least 25% of the reads are contamination. For the host contamination I would expect the coverage to be very low and I guess thats what causes problems for Quake..??

            cheers,
            Christoph

            Comment


            • #7
              I see. I think you should separate the real reads from the host contamination reads firstly. Do you have a reference genome?

              Comment


              • #8
                No, no reference genome. There is a draft genome of the host on the web, but we are reluctant to just fish out reads that map onto that draft because in that way we might loose conserved regions, if you see what I mean.

                Comment


                • #9
                  Hi Christoph!
                  At the coverage you have, I don't think that graph is that useful. If you change the kmer_hist.r script to show more of your data, I think you could make a better educated guess.

                  On line 51 in that script, change
                  Code:
                  hist(cov, breaks=seq(0,ceiling(cov.max)), xlim=c(0, round(1.7*cov.mean)), ylim=c(0, round(1.2*hc[cov.mean])), xlab="Coverage", main="k-mer counts")
                  to something like this:
                  Code:
                  hist(cov, breaks=seq(0,ceiling(cov.max)), xlim=c(0, round(8*cov.mean)), ylim=c(0, round(5*hc[cov.mean])), xlab="Coverage", main="k-mer counts")
                  or you could just set xlim and ylim to suitable values by hand. For example, xlim should be at least set to xlim=c(0, 200) since the coverage of your parasite is 150x.

                  When you've done this, post again or send the author of the program an email and I guess it will be easier to get some help.

                  You might end up with a cutoff at 20 or something.

                  Ole

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  24 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  25 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  21 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X