  • How to randomly remove portions of the raw reads from the FASTQ file

    Hi everyone

    I'm a graduate student who has just started doing some NGS for my thesis project.

    Most of the problems I've had I could search for and find answered here on SEQanswers, but I think I have a situation where I might need some help.

    I have done a 2x150 PE HiSeq sequencing run by pooling 3 different populations of Drosophila. For a reference-genome-based reassembly I used bwa and, yada yada, in the end I got pretty good coverage: at least for chromosome 2L, there was about 70X coverage on average.

    This is really good, but I think it's a little overkill for me: running the FASTQ files through FastQC indicated that the duplication level for the library was around 25%, and I'm now inclined to think that I'm not really "learning" anything new and that much of the sequencing is being wasted.

    I'm on a very limited budget and I'm facing a dilemma over whether I can pool more samples (maybe 4 or even 5) during my sequencing reaction so I can sequence more populations.

    With this in mind, I was trying to mimic a situation where I had initially pooled 4 or 5 populations by decreasing the number of reads in my current FASTQ file.
    So that was a long way of asking: how can I randomly delete a significant proportion of the paired reads from my initial FASTQ file?

    Thanks again for reading this far!

  • #2
    nvm, I've found useful links that solve my problem here and here

    However, does my approach make sense? Is decreasing the library size a valid way to see whether more pooling would be beneficial?

    sorry if I'm derailing the post...
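
For reference, the kind of random subsampling asked about in the first post could look roughly like the sketch below. It is not taken from the linked pages. It assumes the two mate files list the pairs in the same order and that every record is exactly four lines (no wrapped sequence lines). The names `open_fastq` and `subsample_pairs` are made up for illustration.

```python
import gzip
import random


def open_fastq(path, mode):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    Assumes 4-line FASTQ records and identical pair order in both files.
    Returns (pairs_kept, pairs_seen).
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = total = 0
    with open_fastq(r1_in, "r") as f1, open_fastq(r2_in, "r") as f2, \
            open_fastq(r1_out, "w") as o1, open_fastq(r2_out, "w") as o2:
        while True:
            # Read one 4-line record from each mate file in lockstep.
            rec1 = [f1.readline() for _ in range(4)]
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0] and not rec2[0]:
                break  # both files ended together
            if not rec1[0] or not rec2[0]:
                raise ValueError("R1 and R2 have different numbers of records")
            total += 1
            # One random draw per pair, so both mates are kept or dropped together.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total


# Example call (file names are hypothetical):
# subsample_pairs("pop_R1.fastq", "pop_R2.fastq",
#                 "sub_R1.fastq", "sub_R2.fastq", fraction=3 / 5, seed=42)
```

With 3 samples in the current run, keeping a fraction of 3/4 or 3/5 of the pairs mimics the per-sample yield of a 4- or 5-sample run, assuming the total yield is split evenly. Because each pair is kept independently, the number of pairs retained is only approximately `fraction` times the input.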

    • #3
      FYI, I assume you mean "multiplexing" rather than "pooling". While there is pooling in both cases, the former is probably a more exact description of what you're doing (I assume you're looking for sequence differences between strains or something like that, so being able to separate reads by strain would be useful).

      Regarding your strategy, it's often termed "saturation analysis" or "making a saturation/rarefaction curve" or various permutations thereof. It's a very good thing to do and I've seen a few papers (mostly RNAseq) specifically doing that to estimate maximal statistical power. 70x is overkill for a lot of common things, so I wouldn't be surprised if you can get away with throwing more samples on there.

      • #4
        Yes, I should have said multiplexing instead of pooling. I'm conducting a population-genomics-type project and trying to sequence as many populations as possible without sacrificing coverage too much.

        Thanks dpryan!

        • #5
          Hi choijae3,

          70X coverage is very high and overkill for most applications, so in general you are better off sequencing more samples as opposed to sequencing the same thing over and over again.

          With respect to the duplication rate, I would recommend you do not trust FastQC. FastQC estimates the duplication rate by looking at the first and second reads independently. Given your high coverage, it is very likely that you will get first reads starting at the exact same spot. I would recommend you use Picard MarkDuplicates.jar to estimate the duplication rate after alignment, as this takes into account both the first and second reads of each pair.
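
To make the point above concrete, here is a toy illustration of why counting duplicates from one read alone can overstate the rate at high coverage. It does not reproduce FastQC's or Picard's actual algorithms. The coordinates are hypothetical, invented only for the example, and `duplicate_fraction` is an illustrative helper name.

```python
from collections import Counter


def duplicate_fraction(keys):
    """Fraction of items that are extra copies of an already-seen key."""
    keys = list(keys)
    if not keys:
        return 0.0
    counts = Counter(keys)
    extra = sum(n - 1 for n in counts.values())  # copies beyond the first
    return extra / len(keys)


# Hypothetical toy alignments: (chromosome, read-1 start, read-2 end).
pairs = [
    ("2L", 100, 450),
    ("2L", 100, 480),  # same read-1 start, different fragment end
    ("2L", 100, 450),  # same at both ends: a likely true duplicate
    ("2L", 200, 520),
    ("2L", 200, 610),  # same read-1 start, different fragment end
]

# Judging by read 1 alone: any shared start position counts as a duplicate.
by_read1 = duplicate_fraction((chrom, start) for chrom, start, _ in pairs)
# Judging by both ends of the pair: only identical fragments count.
by_pair = duplicate_fraction(pairs)

print(f"read 1 only: {by_read1:.1f}")  # prints "read 1 only: 0.6"
print(f"both ends:   {by_pair:.1f}")   # prints "both ends:   0.2"
```

In this toy set, three of the five pairs look like duplicates when only the first read is considered, but only one does once both ends of the fragment are taken into account.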

          • #6
            Hi barkasn

            Thanks for the reply! I've been following the Broad Institute's best practices and have done the mark-duplicates step. I hadn't paid much attention to its output (I really should have), and I found it to be more helpful. Thanks again for the advice!
