  • Creating .fastq file from simulated reads

    Hi,

    I am simulating a .fastq read set by sampling reads from random locations of a genome. My objective is to test read error correction software such as Quake. The .fastq file I created looks something like this:

    @SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    TACGTGACTGGATCAAAACTCACAAGGACTTTAATGGCCGCCGCTATACACTGCATCATTGCGTAGTCAGCTAATGCCGGGCGACTGGTTGGCTATTGTA
    +SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    @SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    TGAAACATGGGTATTTCGTGACTCTGGTCTAAAGAGGGACGTGAGAGGGCAGCGCTACCTATTGACCTGTTGTGAATTTGCGATTGTCAGGCATGATAAA
    +SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

    However, when I run jellyfish (while running SEECER), it returns warnings like the following:
    Warn: Bad character in sequence: :
    Warn: Bad character in sequence: 1
    Warn: Bad character in sequence: :
    Warn: Bad character in sequence: 2
    Warn: Bad character in sequence: 2
    Warn: Bad character in sequence: 9

    It looks like it is treating the header lines in the .fastq file as reads. I would appreciate any help figuring out what I am missing here.
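
    For readers who want to reproduce this kind of test set, the sampling described at the top of this post could be sketched roughly as follows. This is not the poster's actual script: the function name `simulate_fastq`, the `sim_read_N` IDs, and the toy genome are all hypothetical. It writes error-free reads with a constant quality of `I` (Q40 under Phred+33 encoding), a unique ID per read, and a bare "+" separator line.

```python
import random


def simulate_fastq(genome, n_reads, read_len, seed=0):
    """Return FASTQ text for n_reads error-free reads sampled uniformly from genome.

    Hypothetical helper for illustration only; it samples the forward strand
    and does not model sequencing errors or paired reads.
    """
    if read_len > len(genome):
        raise ValueError("read_len exceeds genome length")
    rng = random.Random(seed)
    lines = []
    for i in range(1, n_reads + 1):
        # randint is inclusive on both ends, so the read always fits.
        start = rng.randint(0, len(genome) - read_len)
        lines.append(f"@sim_read_{i}")        # unique ID, no spaces
        lines.append(genome[start:start + read_len])
        lines.append("+")                     # bare separator, no repeated header
        lines.append("I" * read_len)          # constant quality string
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_rng = random.Random(1)
    toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))
    print(simulate_fastq(toy_genome, n_reads=2, read_len=50), end="")
```

    Fixing the seed keeps the simulated set reproducible, which helps when comparing several error-correction tools on identical input.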

  • #2
    There's nothing wrong with those reads; they are valid. Maybe jellyfish doesn't like the header text repeated on the "+" lines, where it is optional, or maybe it doesn't like the text after the space in the header; you could try removing both of those.
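
    A minimal sketch of that clean-up, assuming strict four-line FASTQ records (no wrapped sequence lines). The function name `simplify_fastq_headers` is hypothetical, and whether this silences the jellyfish warnings is untested here:

```python
def simplify_fastq_headers(lines):
    """Yield FASTQ lines with each '@' header cut at the first whitespace
    and each '+' line reduced to a bare '+'.

    Assumes every record occupies exactly four lines.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        pos = i % 4
        if pos == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1}: expected '@' header")
            yield line.split()[0]     # keep only the ID before the first space
        elif pos == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {i + 1}: expected '+' separator")
            yield "+"                 # drop the optional repeated header
        else:
            yield line                # sequence and quality lines are unchanged


if __name__ == "__main__":
    record = [
        "@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100\n",
        "TACGTGACTG\n",
        "+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=100\n",
        "IIIIIIIIII\n",
    ]
    for cleaned in simplify_fastq_headers(record):
        print(cleaned)
```

    Note that both records in the example above carry the same ID, so trimming the headers leaves them identical; some tools expect read names to be unique.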

    But if you want to test error-correction software... I'd be happy if you included BBNorm! It's extremely fast, and designed very conservatively to avoid any false corrections, even in difficult cases like polyploid organisms, amplified single-cell data, highly repetitive organisms, contaminated libraries, and indels (it does not correct indels, but it doesn't make them worse, either). It handles fasta, fastq, paired, gzipped, ... pretty much anything.

    Command (on a Linux machine with bash):

    ecc.sh in=reads.fastq out=corrected.fastq -Xmx30g

    (...where the -Xmx flag should be set to roughly 85% of the physical memory on the computer).
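
    As a rough illustration of that arithmetic, the sketch below derives a `-Xmx` value from the machine's physical memory. The helper names `xmx_flag` and `physical_memory_bytes` are hypothetical and not part of the BBMap package; the memory lookup assumes a Linux system where `os.sysconf` exposes `SC_PHYS_PAGES`.

```python
import os


def xmx_flag(total_bytes, fraction=0.85):
    """Return a JVM -Xmx flag covering `fraction` of total_bytes.

    Rounds down to whole gigabytes (the JVM's 'g' suffix means 2**30 bytes),
    falling back to megabytes on very small machines.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    usable = int(total_bytes * fraction)
    gib = usable // 2**30
    if gib >= 1:
        return f"-Xmx{gib}g"
    return f"-Xmx{max(usable // 2**20, 1)}m"


def physical_memory_bytes():
    """Total physical RAM in bytes (Linux; an assumption for this sketch)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    flag = xmx_flag(physical_memory_bytes())
    print(f"ecc.sh in=reads.fastq out=corrected.fastq {flag}")
```

    If other programs are already using memory, a smaller `fraction` leaves more headroom.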

    P.S. The same package contains BBMap, which outputs useful statistics for evaluating error-correctors even if you don't produce a sam file. For example:


    mapped:            46.9273%   98545 reads
    unambiguous:       46.6302%   97921 reads

    perfect best site: 42.8053%
    semiperfect site:  46.2454%
    ambiguousMapping:   0.2971%   (Kept)
    low-Q discards:     0.0000%

    Match Rate:        99.3339%   48944396
    Error Rate:         0.0312%      15393
    Sub Rate:           0.0179%       8830
    Del Rate:           0.0090%       4455
    Ins Rate:           0.0043%       2108
    N Rate:             0.6349%     312833


    (The formatting looks better on an actual console)
    Last edited by Brian Bushnell; 04-12-2014, 08:53 PM.



    • #3
      Hi Brian,

      Thanks for the reply. I was looking for a good error correction tool to pre-process data before I use my tool. I'll certainly try out BBNorm as well.

      Thanks,
      Govinda.



      • #4
        Hi Brian,

        I tried ecc.sh, the BBMap-based utility. It produces the following error message:

        govinda@govinda-ThinkPad-T430s:~/Documents/Research_code$ ./bbmap/ecc.sh in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
        java -da -Xmx30g -cp /home/govinda/Documents/Research_code/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in=test1nn.fastq out=corrected_bbnorm.fastq outt=thrown_bbnorm.fastq -Xmx30g
        Executing jgi.KmerNormalize [bits=16, ecc=t, passes=1, keepall, dr=f, prefilter, in=test1nn.fastq, out=corrected_bbnorm.fastq, outt=thrown_bbnorm.fastq, -Xmx30g]


        Settings:
        threads: 4
        k: 31
        deterministic: false
        toss error reads: false
        passes: 1
        bits per cell: 16
        cells: 6770.27M
        hashes: 3
        prefilter bits: 2
        prefilter cells: 29.16B
        prefilter hashes: 2
        base min quality: 5
        kmer min prob: 0.5

        target depth: 40
        min depth: 6
        max depth: 40
        min good kmers: 15
        depth percentile: 54.0
        ignore dupe kmers: true
        fix spikes: false

        Made prefilter: hashes = 2 mem = 6.79 GB cells = 29.16B used = 0.001%
        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.atomic.AtomicIntegerArray.<init>(AtomicIntegerArray.java:69)
        at kmer.KCountArray7MTA.<init>(KCountArray7MTA.java:79)
        at kmer.KCountArray.makeNew(KCountArray.java:55)
        at kmer.KmerCount7MTA.makeKca(KmerCount7MTA.java:182)
        at jgi.KmerNormalize.runPass(KmerNormalize.java:970)
        at jgi.KmerNormalize.main(KmerNormalize.java:707)

        Do I have to install any java libraries?



        • #5
          Govinda,

          That's not a library problem, just a memory problem. How much memory does the machine have? You need to reduce -Xmx30g slightly (possibly because other programs are running in the background). If you have 32GB RAM, try -Xmx25g. If you have 16GB RAM, try -Xmx12g.

          The next revision I release will be better able to calculate how much memory is free automatically.

          -Brian

          P.S. In this case, I'm guessing you're just using a handful of reads, as the prefilter was only 0.001% full. Bear in mind that error correction relies on high coverage: BBNorm with the default settings won't correct anything unless you have at least roughly 20x coverage, because there's not enough data to make confident corrections. Also, unlike some other programs, BBNorm purely uses a "count-min sketch" data structure. This does NOT grow in size with input data; rather, you allocate the memory you want to use at the beginning, and then no matter how much data you get, it will never overflow; only the accuracy decreases. As a result, don't be concerned if it uses many gigabytes of memory even when you only process 1 read: the memory usage is fixed regardless of input.

          So, the more memory you give it, the more accurate it will be, which is why it's best to give it as much memory as you have free on the machine. But it will still run if you say "-Xmx2g"; the accuracy will just be reduced on very large datasets.
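
          To make the fixed-memory behaviour concrete, here is a toy count-min sketch. It is only an illustration of the general data structure, not BBNorm's implementation (which is written in Java); the class name, the use of `hashlib.blake2b`, and the tiny table sizes are assumptions for this example.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` hash rows of `width` counters each.

    All memory is allocated up front and never grows. Collisions can only
    inflate a count, so estimates are never below the true count.
    """

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.tables[row][col] += 1

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._cells(item))


if __name__ == "__main__":
    seq = "ACGTACGTACGTTTGACA"
    k = 4
    sketch = CountMinSketch(width=1024, depth=3)
    for i in range(len(seq) - k + 1):
        sketch.add(seq[i:i + k])
    print("estimated count of ACGT:", sketch.estimate("ACGT"))
```

          Shrinking `width` makes collisions more frequent and the estimates less accurate, but the structure never runs out of room; that is the trade-off between memory and accuracy described above.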
          Last edited by Brian Bushnell; 04-12-2014, 09:49 PM.
