  • Creating .fastq file from simulated reads

    Hi,

    I was simulating a .fastq read set from a genome by reading random locations of the genome. My objective here was to test some read error correction software like Quake. The .fastq file I created looks something like this:

    @SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    TACGTGACTGGATCAAAACTCACAAGGACTTTAATGGCCGCCGCTATACACTGCATCATTGCGTAGTCAGCTAATGCCGGGCGACTGGTTGGCTATTGTA
    +SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    TGAAACATGGGTATTTCGTGACTCTGGTCTAAAGAGGGACGTGAGAGGGCAGCGCTACCTATTGACCTGTTGTGAATTTGCGATTGTCAGGCATGATAAA
    +SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

    However, when jellyfish runs (as part of SEECER), it prints warnings like the following:
    Warn: Bad character in sequence: :
    Warn: Bad character in sequence: 1
    Warn: Bad character in sequence: :
    Warn: Bad character in sequence: 2
    Warn: Bad character in sequence: 2
    Warn: Bad character in sequence: 9

    It looks like it is treating the header lines in the .fastq file as reads. I would appreciate any help figuring out what I am missing here.
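One way to rule out a malformed file before blaming the downstream tool is a quick structural check of the four-line records. The following is only a sketch: it assumes strictly four lines per record (no wrapped sequences), and the function name `check_fastq_records` is made up for illustration.

```python
def check_fastq_records(lines):
    """Return (record_number, message) pairs for structurally bad records.

    Assumes exactly four lines per record; wrapped sequence lines
    would be reported as errors.
    """
    # Strip only "\n" so that a stray "\r" (Windows line endings)
    # shows up as an unexpected character in the sequence.
    lines = [ln.rstrip("\n") for ln in lines]
    problems = []
    leftover = len(lines) % 4
    if leftover:
        problems.append((len(lines) // 4 + 1,
                         "file does not end on a 4-line record boundary"))
    for i in range(0, len(lines) - leftover, 4):
        header, seq, plus, qual = lines[i:i + 4]
        rec = i // 4 + 1
        if not header.startswith("@"):
            problems.append((rec, "header line does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((rec, "separator line does not start with '+'"))
        if set(seq.upper()) - set("ACGTN"):
            problems.append((rec, "sequence has characters other than A, C, G, T, N"))
        if len(seq) != len(qual):
            problems.append((rec, "sequence and quality lengths differ"))
    return problems
```

An empty list means the file at least has the expected shape; it says nothing about whether a particular tool accepts long headers.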

  • #2
    There's nothing wrong with those reads; they are valid. Maybe it doesn't like the header text repeated on the "+" lines, where it is optional, or maybe it doesn't like the text after the first space in the header. You could try removing both.
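A minimal sketch of that cleanup, assuming strictly four-line records: it keeps only the read ID before the first whitespace and reduces each separator line to a bare "+". The function name `simplify_fastq` and the file names in the usage comment are hypothetical.

```python
def simplify_fastq(lines):
    """Yield FASTQ lines with shortened headers and bare '+' separators.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\r\n")
        position = i % 4
        if position == 0:
            # Header: keep only the token before the first whitespace.
            fields = line.split()
            yield fields[0] if fields else line
        elif position == 2:
            # Separator: drop the optional repeated header.
            yield "+"
        else:
            # Sequence and quality lines pass through unchanged.
            yield line


# Example usage (hypothetical file names):
# with open("reads.fastq") as src, open("reads.simple.fastq", "w") as dst:
#     for out_line in simplify_fastq(src):
#         dst.write(out_line + "\n")
```

Note that the two example records in the question share the ID SRR566546.970, so trimming leaves duplicate read names; whether a given tool minds that is not established in this thread.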

    But if you want to test error-correction software... I'd be happy if you included BBNorm! It's extremely fast, and designed very conservatively to avoid any false corrections, even in difficult cases like polyploid organisms, amplified single-cell data, highly repetitive organisms, contaminated libraries, and indels (it does not correct indels, but it doesn't make them worse, either). Handles fasta, fastq, paired, gzipped, ... pretty much anything.

    Command (on a Linux machine with bash):

    ecc.sh in=reads.fastq out=corrected.fastq -Xmx30g

    (...where the value passed to -Xmx should be roughly 85% of the physical memory on the computer).
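As a rough illustration of that rule of thumb, the sketch below turns a physical memory size into an -Xmx value. The helper names are made up, the `os.sysconf` lookup works on Linux only, and the 85% default is just the figure quoted above; a smaller fraction may be needed if other programs are using memory.

```python
import os


def suggest_xmx(total_bytes, fraction=0.85):
    """Return a Java heap flag using `fraction` of total memory, in whole GiB."""
    gib = int(total_bytes * fraction / 1024**3)  # round down to be safe
    return f"-Xmx{max(gib, 1)}g"


def physical_memory_bytes():
    """Total physical memory as reported by the OS (Linux only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


# Example usage on a Linux machine:
# print(suggest_xmx(physical_memory_bytes()))
```

Rounding down to whole gibibytes is a deliberate choice here: overshooting risks an out-of-memory error at startup, while undershooting only costs a little accuracy.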

    P.S. The same package contains BBMap, which outputs useful statistics for evaluating error-correctors even if you don't produce a sam file. For example:


    mapped: 46.9273% 98545 reads
    unambiguous: 46.6302% 97921 reads

    perfect best site: 42.8053%
    semiperfect site: 46.2454%
    ambiguousMapping: 0.2971% (Kept)
    low-Q discards: 0.0000%

    Match Rate: 99.3339% 48944396
    Error Rate: 0.0312% 15393
    Sub Rate: 0.0179% 8830
    Del Rate: 0.0090% 4455
    Ins Rate: 0.0043% 2108
    N Rate: 0.6349% 312833


    (The formatting looks better on an actual console)
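The percentages in the second block appear to be each count divided by the sum of matched, erroneous, and N bases. That reading is inferred from the numbers above, not taken from the BBMap documentation, so treat this as a sanity-check sketch; the function name `alignment_rates` is hypothetical.

```python
# Base counts copied from the example output above.
counts = {"match": 48944396, "sub": 8830, "del": 4455, "ins": 2108, "n": 312833}


def alignment_rates(counts):
    """Return (error_count, rates_in_percent) under the assumed definition."""
    # Assumption: errors = substitutions + deletions + insertions.
    errors = counts["sub"] + counts["del"] + counts["ins"]
    # Assumption: the denominator is matches + errors + N bases.
    total = counts["match"] + errors + counts["n"]
    values = dict(counts, error=errors)
    rates = {name: 100.0 * value / total for name, value in values.items()}
    return errors, rates


errors, rates = alignment_rates(counts)
for name in ("match", "error", "sub", "del", "ins", "n"):
    print(f"{name:>6}: {rates[name]:.4f}%")
```

When comparing error correctors, the useful comparison is the error rate before and after correction on the same read set.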
    Last edited by Brian Bushnell; 04-12-2014, 08:53 PM.


    • #3
      Hi Brian,

      Thanks for the reply. I was looking for a good error correction tool to pre-process data before I use my tool. I'll certainly try out BBNorm as well.

      Thanks,
      Govinda.


      • #4
        Hi Brian,

        I tried ecc.sh, the BBMap-based utility. It fails with the following error:

        govinda@govinda-ThinkPad-T430s:~/Documents/Research_code$ ./bbmap/ecc.sh in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
        java -da -Xmx30g -cp /home/govinda/Documents/Research_code/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
        Executing jgi.KmerNormalize [bits=16, ecc=t, passes=1, keepall, dr=f, prefilter, in=test1nn.fastq, out=corrected_bbnorm.fastq, outt=thrown_bbnorm.fastq, -Xmx30g]


        Settings:
        threads: 4
        k: 31
        deterministic: false
        toss error reads: false
        passes: 1
        bits per cell: 16
        cells: 6770.27M
        hashes: 3
        prefilter bits: 2
        prefilter cells: 29.16B
        prefilter hashes: 2
        base min quality: 5
        kmer min prob: 0.5

        target depth: 40
        min depth: 6
        max depth: 40
        min good kmers: 15
        depth percentile: 54.0
        ignore dupe kmers: true
        fix spikes: false

        Made prefilter: hashes = 2 mem = 6.79 GB cells = 29.16B used = 0.001%
        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.atomic.AtomicIntegerArray.<init>(AtomicIntegerArray.java:69)
        at kmer.KCountArray7MTA.<init>(KCountArray7MTA.java:79)
        at kmer.KCountArray.makeNew(KCountArray.java:55)
        at kmer.KmerCount7MTA.makeKca(KmerCount7MTA.java:182)
        at jgi.KmerNormalize.runPass(KmerNormalize.java:970)
        at jgi.KmerNormalize.main(KmerNormalize.java:707)

        Do I have to install any java libraries?


        • #5
          Govinda,

          That's not a library problem, just a memory problem. How much memory does the machine have? You need to reduce -Xmx30g somewhat (other programs running in the background may be using some of the memory). If you have 32GB RAM, try -Xmx25g. If you have 16GB RAM, try -Xmx12g.

          The next revision I release will be better able to calculate how much memory is free automatically.

          -Brian

          P.S. In this case, I'm guessing you're using only a handful of reads, as the prefilter was only 0.001% full. Bear in mind that error correction relies on high coverage: with the default settings, BBNorm won't correct anything unless you have roughly 20x coverage or more, because below that there is not enough data to make confident corrections. Also, unlike some other programs, BBNorm uses only a "count-min sketch" data structure. This does NOT grow in size with the input data; rather, you allocate the memory you want to use at the beginning, and no matter how much data arrives, it will never overflow; the accuracy just decreases. So don't be concerned if it uses many gigabytes of memory even when processing a single read: the memory usage is fixed regardless of input.

          So, the more memory you give it, the more accurate it will be, which is why it's best to give it as much memory as you have free on the machine. It will still run with "-Xmx2g"; the accuracy will just be lower on very large datasets.
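To make the fixed-memory behavior concrete, here is a toy count-min sketch. It is not BBNorm's implementation (which is Java and packs counts into a few bits per cell); the class name, the use of `hashlib.blake2b`, and the tiny table sizes are all choices made for this illustration.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: a fixed table of counters, several hashes per item."""

    def __init__(self, cells_per_row, hashes=3):
        self.width = cells_per_row
        self.hashes = hashes
        # All memory is allocated here and never grows afterwards.
        self.rows = [[0] * cells_per_row for _ in range(hashes)]

    def _indexes(self, item):
        # One independent hash per row, obtained by salting with the row number.
        for row in range(self.hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.rows[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is an upper bound on the true count that is never too low.
        return min(self.rows[row][col] for row, col in self._indexes(item))


sketch = CountMinSketch(cells_per_row=64, hashes=3)
for kmer in ["ACGT", "ACGT", "TTTT", "ACGT"]:
    sketch.add(kmer)
print(sketch.count("ACGT"), sketch.count("TTTT"))
```

With fewer cells, more k-mers collide and counts are overestimated more often, which is the sense in which accuracy drops as the table fills while memory stays constant.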
          Last edited by Brian Bushnell; 04-12-2014, 09:49 PM.
