  • Filtering Illumina data to reduce file size

    Hello,
    I have paired-end data from an Illumina HiSeq for a bacterial genome that was sequenced using three different insert sizes: 160, 305 and 505. My task is to perform de novo assembly of the genome, but every single file contains more than 60 million reads, and it's not possible to run an assembly on files this large. Is there any way I can reduce the size of the files, either by removing some reads or by performing some kind of filtering?

  • #2
    Look for Titus Brown's 'diginorm' program. It does an intelligent reduction of data. Seems to work well for genomic data. Perhaps not so well for transcriptome data.
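
For readers unfamiliar with the idea, here is a minimal sketch of median-based digital normalization: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a cutoff. This is a simplified illustration, not the diginorm program itself. It uses an exact in-memory `Counter` and ignores read pairing and reverse complements, and the function names and the `k` and `cutoff` values are assumptions chosen for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below the cutoff.

    Counts come from reads that were kept earlier, so redundant
    high-coverage reads are discarded once coverage reaches the cutoff.
    """
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k; skipped in this sketch
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept
```

With 60 million reads an exact counter like this would need a lot of memory, so treat it as a way to understand the filtering rule rather than something to run on the full data set.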



    • #3
      Thanks for the suggestion. I will try that and get back to you if I run into further problems.



      • #4
        Subsample reads from your files using Heng Li's seqtk program (https://github.com/lh3/seqtk) and the "sample" command.
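
With paired-end data the two files have to stay in sync, so the same reads must be picked from both. The seqtk README handles this by running "sample" on each file with the same `-s` seed. The sketch below shows the same idea in Python using reservoir sampling over read pairs. The function names, the default seed, and the file arguments are hypothetical, and it assumes plain 4-line FASTQ records with both files listing reads in the same order.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples of lines from an iterable of lines."""
    handle = iter(handle)
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def sample_pairs(pairs, n, seed=11):
    """Reservoir-sample n items (read pairs) in one pass, in constant memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < n:
            reservoir.append(pair)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = pair
    return reservoir


def subsample_files(r1_path, r2_path, out1_path, out2_path, n, seed=11):
    """Write n randomly chosen read pairs from two synchronized FASTQ files."""
    with open(r1_path) as f1, open(r2_path) as f2:
        chosen = sample_pairs(zip(read_fastq(f1), read_fastq(f2)), n, seed)
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in chosen:
            o1.writelines(rec1)
            o2.writelines(rec2)
```

Sampling pairs rather than individual reads means a mate is never kept without its partner, which matters for assemblers that use insert-size information.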



        • #5
          Originally posted by nickloman:
          Subsample reads from your files using Heng Li's seqtk program (https://github.com/lh3/seqtk) and the "sample" command.
          If you are just going to throw away reads at random, you might as well go the cheap route and do less sequencing in the first place. No disrespect to Li's program, but since diginorm reduces the reads intelligently, I suggest using it instead of random selection.



          • #6
            At the Boston Illumina User's Group meeting today, Illumina mentioned that BaseSpace will have an option for "quality-binning" -- by reducing quality scores to a small number of bins, the data compresses quite a bit (they claimed 50% reduction in compressed FASTQ size). An underlying assumption is that quality scores offer more gradation than programs really find useful.

            Pretty trivial to implement in Perl, though I leave that as an exercise for the student :-)
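
For anyone who would rather not do the exercise in Perl, here is a rough Python sketch of quality binning. It assumes Phred+33 encoded quality strings, and the bin boundaries and representative values are illustrative assumptions, not necessarily the scheme Illumina uses. The names `BIN_STARTS`, `BIN_VALUES` and `bin_quality` are made up for this example.

```python
import bisect

# Lower bound of each bin, and the single score every member of the bin
# is replaced with. Both lists are illustrative assumptions.
BIN_STARTS = [0, 2, 10, 20, 25, 30, 35, 40]
BIN_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_quality(qual, offset=33):
    """Map each quality character onto its bin's representative score."""
    out = []
    for ch in qual:
        q = ord(ch) - offset
        # Index of the last bin whose lower bound is <= q.
        idx = max(bisect.bisect_right(BIN_STARTS, q) - 1, 0)
        out.append(chr(BIN_VALUES[idx] + offset))
    return "".join(out)
```

The binned string has the same length as the original, so the FASTQ stays valid. The saving comes from the compressor, which sees far fewer distinct symbols on the quality lines.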
