Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DigiNorm on Paired-end samples

    Hi,

    how does DigiNorm handle paired-end fastq files? I read in the article that it does not take it into account. But then how to work with the output for de-novo assembly of transcriptome if pairing is lost? is it then worth doing digital normalization?

    Cheers,

    Dave

  • #2
    KHMER does not handle paired reads currently, I believe they are working on that.

    You can still assemble your reads with one of the following strategies:

    1. Diginorm the pairs, but assemble as SE reads
    2. Diginorm only Read1 from each pair, and assemble as SE reads
    3. Don't diginorm, just assemble as PE

    Comment


    • #3
      Hi dnusol,

      If I understand your question correctly, it is possible to digitally normalize paired-end transcriptome data and assemble it using an assembler that takes advantage of the paired end info (e.g., Trinity) if your data are formatted correctly. If you're working with data from Casava 1.8+ (I think all HiSeq and MiSeq data) then you will have to modify your headers to have a /1 or /2 at the end to replicate the older way read pair member ID was encoded (see http://en.wikipedia.org/wiki/FASTQ_format).

      These are the sed commands I use on my MiSeq data to fix this:
      sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
      sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq


      Then you will need to run commands along the lines of the following:
      shuffleSequences_fastq.pl example_R1.fastq example_R2.fastq example_shuffled.fastq
      normalize-by-median.py -C 30 -k 20 -N 4 -x 2.5e9 example_shuffled.fastq
      strip-and-split-for-assembly.py example_shuffled.fastq.keep
      split_pe.py example_shuffled.fastq.keep.pe
      mv example_shuffled.fastq.keep.pe.1 example_norm_R1.fasta
      mv example_shuffled.fastq.keep.pe.2 example_norm_R2.fasta
      Trinity.pl --seqType fa --left example_norm_R1.fasta --right example_norm_R2.fasta --JM 20G --inchworm_cpu 8 --CPU 8 --output assembly


      Please let me know if this does/n't work for you.

      See this page for more info:


      Thanks,
      Kevin

      Comment


      • #4
        Trinity's version of in silico read normalization handles paired end normalization - give it a look, for sure.
        ==========
        Alex Harkess
        Leebens-Mack Lab
        Plant Biology Department
        University of Georgia, Athens GA

        Comment


        • #5
          khmer vs trinity

          Originally posted by aharkess View Post
          Trinity's version of in silico read normalization handles paired end normalization - give it a look, for sure.
          Titus Brown has written a blog post comparing KHMER to Trinity's digital normalization:
          This post can be referenced and cited at the following DOI: http://dx.doi.org/10.6084/m9.figshare.98198. For a few months, the Trinity list was...

          Comment


          • #6
            FASTQ files can have @ in the quality scores

            FYI, correct me if I'm wrong, but it's possible that FASTQ files have a @ symbol in the quality scores as well as at the start of the header line.



            Which is a dumb thing about the FASTQ format, but whatever, we're stuck with it.

            So, any approach that uses sed to search for the @ at the start of the header line is potentially unsafe.

            Originally posted by kmkocot View Post
            These are the sed commands I use on my MiSeq data to fix this:
            sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
            sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq
            So, this might cause problems in the event that you have the text @M00 in the quality scores, which is possible given the squillions of reads we deal with these days.

            My totally non-guaranteed approach:

            sed '1~4 s/$/ \/1/g' your_fastq_file.fastq > your_new_fastq_file.fastq (for left reads) , or
            sed '1~4 s/$/ \/2/g' your_fastq_file.fastq > your_new_fastq_file.fastq (for right reads).

            This simply adds ' /1' ( i.e. a space, a slash and a 1) to the end of every 4th line starting with the first line. If your file is FASTQ format this should work (works for me anyway). You can use the sed -i option to replace rather than redirecting to a new file if you want.
            Last edited by danwiththeplan; 01-15-2014, 07:29 PM. Reason: fix

            Comment


            • #7
              Hi everbody ...
              change the discussion focus, how is the real mean of -C (Cutoff) option in normalize-by-median.py script. The Digital Normalization authors suggested -C 20. But I don't sure if only reads of lower coverage 20 are trimmed.

              Thanks,
              Arthur

              Comment


              • #8
                Originally posted by arthurmelo View Post
                Hi everbody ...
                change the discussion focus, how is the real mean of -C (Cutoff) option in normalize-by-median.py script. The Digital Normalization authors suggested -C 20. But I don't sure if only reads of lower coverage 20 are trimmed.

                Thanks,
                Arthur
                I am not sure if I understand your question correctly but -C (Cutoff) is to indicate the number which is used as a cutoff for the median kmer coverage of a read.

                So basically diginorm is going through the file one read at a time and counting kmers from each read then storing the number of occurences for each kmer as it goes through the reads. The number of times a kmer has been found is considered the kmer coverage. If a read has a median kmer coverage over the cutoff, that entire read is discarded (not the same thing as trimming).

                In case there is some confusion on what median kmer coverage is, it works like this:
                If your reads are 100bp and your kmer size is 20bp (a 20-mer), then you will have 80 20-mers in each read.
                Count how many times each 20-mer has occured and then rank the counts from most to least.
                Take the median 20-mer from this rank list.
                If this number is greater than the cutoff number, the read will be discarded. If it is less than the cutoff number, the read is kept.

                This is a way to reduce the number of redundant reads while accomodating for some sequence error.

                If you need more details, this article explains what it is doing exactly:
                Last edited by rkizen; 04-09-2014, 04:25 AM.

                Comment


                • #9
                  Thank you so much rkizen.

                  Comment


                  • #10
                    Originally posted by danwiththeplan View Post
                    FYI, correct me if I'm wrong, but it's possible that FASTQ files have a @ symbol in the quality scores as well as at the start of the header line.



                    Which is a dumb thing about the FASTQ format, but whatever, we're stuck with it.

                    So, any approach that uses sed to search for the @ at the start of the header line is potentially unsafe.

                    Originally Posted by kmkocot
                    These are the sed commands I use on my MiSeq data to fix this:
                    sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
                    sed -i '/^@M00/ s/\ .\+/\/2/g' *_R2.fastq
                    So, this might cause problems in the event that you have the text @M00 in the quality scores, which is possible given the squillions of reads we deal with these days.
                    I know this post is a little old but I just wanted to point out that using /^@M00/ is a safe way to identify header lines (at least for this particular run). The M00 (that is M<zero><zero>) represents the beginning of the machine name that produced these reads. None of the standard FASTQ quality formats will have 'M' and '0' (zero) together. Phred+33 may contain '0' but not 'M'; Phred+64 may contain 'M' but not '0'.

                    Our Illumina sequencers are configured to write instrument names in the FastQ files as "HWI-xxxxxxx". I always use /^@HWI/ to grep header lines in these FastQ files. This is safe for all newer (Phred+33 output) Illumina data as 'W' will never appear in a quality line.

                    Comment


                    • #11
                      Just to update this thread:

                      NOW, Khmer DOES preserve paired ends. However, the user must recover the paired reads after digital normalization since some of the pairs will be orphaned during the process. This page gives a tutorial and, while it is based on an older version of khmer, I used it this week with success:

                      Comment


                      • #12
                        FYI, BBNorm never orphans pairs during normalization, and is substantially faster and easier to use than Khmer, which I found rather confusing.

                        The command to normalize to 50x coverage would be like this:

                        bbnorm.sh -Xmx29g in=read#.fq out=norm#.fq target=50

                        ...and you can add the flag "ecc=t" if you want error-correction as well. The -Xmx flag should be set to about 85% of the system's memory, or you can leave it off and see how well the shellscript autodetects memory. The command assumes that both files have the same name except for where the "#" symbol is, where one has a 1 and the other has a 2.
                        Last edited by Brian Bushnell; 04-30-2014, 04:05 PM.

                        Comment


                        • #13
                          Trinity’s In silico Read Normalization can handle paired reads.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          30 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          32 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X