Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Who's running Quake?

    Hello,

    I am trying to do some error correction on illumina data using Quake. Unfortunately, I encounter the following problems when running
    quake.py -f fastq.list -k 17 -p 8

    terminate called after throwing an instance of 'jellyfish::fastq_hash::ErrorRead
    ing'
    what(): Bad file type '
    /bin/sh: line 1: 11089 Aborted /usit/titan/u1/chrishah/programme
    s/Quake/bin/jellyfish qdump -c fastq.dbm > fastq.list.qcts
    Optimization of distribution likelihood function to choose k-mer cutoff failed.
    Very likely you have set the value of k too high or not provided adequate covera
    ge (>15x). Inspect the k-mer counts for a clear separation of the error and true
    k-mer distributions.
    Guessing quality values are on ascii 64 scale

    Anyone seen that before? Any ideas???

    Much obliged!
    Christoph

  • #2
    I don't recall the exact error we were seeing with quake but, like yours, it had to do with jellyfish failures. We were able to avoid it by supplying the --no_jelly command-line option (causing quake to use its own k-mer counter).
    David Dooling

    Comment


    • #3
      My error is:
      terminate called after throwing an instance of 'mapped_file::ErrorMMap'

      what(): Can't open file 'file_list.dbm':

      /bin/sh: line 1: 21058 Aborted /disk2/Quake/bin/jellyfish qdump -c file_list.dbm > file_list.txt.qcts

      Optimization of distribution likelihood function to choose k-mer cutoff failed. Very likely you have set the value of k too high or not provided adequate coverage (>15x). Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.

      I will try “ --no_jelly ”.

      My genome is very big, about 3G, maybe “ --no_jelly ” will take a long time.

      Comment


      • #4
        Hi,

        Thanks for the suggestion. Tried --no_jelly and got the following:

        Processing sequences...
        ....................................................................................................................................................................................................................................................................260205970 sequences processed, -1699182760 bp scanned
        WARNING: Input had 11499462 non-DNA (ACGT) characters whose kmers were not counted
        358433106 total distinct mers
        358433106 mers occur at least 0 times
        Optimization of distribution likelihood function to choose k-mer cutoff failed. Very likely you have set the value of k too high or not provided adequate coverage (>15x). Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.
        Guessing quality values are on ascii 64 scale

        tried to run cov_model.py on the kmer.txt file I got from that:

        -bash-3.2$ python ~/programmes/Quake/bin/cov_model.py kmers.txt
        Optimization of distribution likelihood function to choose k-mer cutoff failed. Inspect the k-mer counts for a clear separation of the error and true k-mer distributions.

        So, it looks to me as if I get accross the first step, when using the --no_jelly option, but encounter problems in the next step (the cov_model.py script).

        The r.log file says:
        .
        .
        Read 30000 items
        Error in est.cov(cov) :
        Cannot identify reliable trusted k-mer coverage peak. Please verify the integrity of your dataset (e.g. by running kmer_hist.r). If there is no clear trusted k-mer distribution, the coverage is too low to correct these reads. If there is, please e-mail [email protected] and choose the cutoff by hand in the meantime
        Execution halted

        As suggested I run the kmer_hist.r script and got the attached kmer_hist.pdf. Not quite sure how to read it, though..

        Guess I should try to set the cutoff by hand as suggested. Any ideas?

        Much obliged!
        Christoph
        Attached Files

        Comment


        • #5
          Would you tell me the genome size of your sample and the sequencing coverage?

          Comment


          • #6
            Hi,

            The genome size is estimated to be about 100MB and I have about 250Million Illumnina reads (~70bp), so in theory I sould have very high coverage (>150x). What makes everything a bit difficult is that the reads contain probably quite a high percentage of contamination from the host organism. How many percent is difficult to guess.
            From first assembly attempts I would say at least 25% of the reads are contamination. For the host contamination I would expect the coverage to be very low and I guess thats what causes problems for Quake..??

            cheers,
            Christoph

            Comment


            • #7
              I see. I think you should separate the real reads from the host contamination reads firstly. Do you have a reference genome?

              Comment


              • #8
                No, no reference genome. There is a draft genome of the host on the web, but we are reluctant to just fish out reads that map onto that draft because in that way we might loose conserved regions, if you see what I mean.

                Comment


                • #9
                  Hi Christoph!
                  At the coverage you have, I don't think that graph is that useful. If you change the kmer_hist.r script to show more of your data, I think you could make a better educated guess.

                  On line 51 in that script, change
                  Code:
                  hist(cov, breaks=seq(0,ceiling(cov.max)), xlim=c(0, round(1.7*cov.mean)), ylim=c(0, round(1.2*hc[cov.mean])), xlab="Coverage", main="k-mer counts")
                  to something like this:
                  Code:
                  hist(cov, breaks=seq(0,ceiling(cov.max)), xlim=c(0, round(8*cov.mean)), ylim=c(0, round(5*hc[cov.mean])), xlab="Coverage", main="k-mer counts")
                  or you could just set xlim and ylim to suitable values by hand. For example, xlim should be at least set to xlim=c(0, 200) since the coverage of your parasite is 150x.

                  When you've done this, post again or send the author of the program an email and I guess it will be easier to get some help.

                  You might end up with a cutoff at 20 or something.

                  Ole

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Essential Discoveries and Tools in Epitranscriptomics
                    by seqadmin




                    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                    04-22-2024, 07:01 AM
                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Today, 08:47 AM
                  0 responses
                  11 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  60 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  59 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  54 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X