  • How to randomly remove portions of the raw reads from the FASTQ file

    Hi everyone

    I'm a graduate student who has just started doing some NGS for my thesis project.

    Most of the problems I've had so far I could search for and find answered here on SEQanswers, but I think I now have a situation where I might need some help.

    I have done a 2x150 PE HiSeq sequencing run by pooling 3 different populations of Drosophila. For the reference-genome-based reassembly I used bwa and so on, and in the end I got pretty good coverage: at least for chromosome 2L, the average was about 70X.

    This is really good, but I think it's a little overkill for me. Running the FASTQ files through FastQC indicated that the level of duplication for the library was around 25%, and I'm now tending to think that I'm not really "learning" anything new and that much of the sequencing is being wasted.

    I'm on a very limited budget, and I'm facing a dilemma over whether I can pool more samples (maybe 4 or even 5) in a sequencing run so that I can sequence more populations.

    With this in mind, I am trying to mimic a situation where I had initially pooled 4 or 5 populations by decreasing the number of reads in my current FASTQ files.
    So, after that long explanation: how can I randomly delete a significant proportion of the paired reads from my initial FASTQ files?

    Thanks again for reading this far!
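
One way to approach the question above is to make a single random keep/drop decision per read pair, so that both mates are always kept or dropped together. The following is only a minimal sketch, not the method from the links mentioned later in the thread. It assumes uncompressed FASTQ files with exactly four lines per record and with the R1 and R2 files listing reads in the same order; the function names and the `fraction` and `seed` parameters are made up for illustration.

```python
import random
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield FASTQ records as 4-tuples of lines (assumes unwrapped, 4-line records)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_lines, r2_lines, fraction, seed=0):
    """Yield (read1, read2) record pairs, keeping each pair with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    pairs = zip_longest(read_fastq(r1_lines), read_fastq(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        # One random draw per pair, so mates are always kept or dropped together.
        if rng.random() < fraction:
            yield rec1, rec2


def subsample_files(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Write a random subsample of the pairs in two FASTQ files; return pairs kept."""
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in subsample_pairs(f1, f2, fraction, seed):
            o1.writelines(rec1)
            o2.writelines(rec2)
            kept += 1
    return kept
```

Because each pair is kept independently, the number of pairs retained is only approximately `fraction` times the input, which is usually fine for a downsampling experiment. For gzipped input, the `open` calls could be swapped for `gzip.open(path, "rt")`.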

  • #2
    Never mind, I've found useful links that solve my problem here and here.

    However, does my approach make sense? Is decreasing the library size a valid way to see whether more pooling would be beneficial?

    Sorry if I'm derailing the post...

    • #3
      FYI, I assume you mean "multiplexing" rather than "pooling". While there is pooling in both cases, the former is probably a more exact description of what you're doing (I assume you're looking for sequence differences between strains or something like that, so being able to separate reads by strain would be useful).

      Regarding your strategy, it's often termed "saturation analysis" or "making a saturation/rarefaction curve" or various permutations thereof. It's a very good thing to do and I've seen a few papers (mostly RNAseq) specifically doing that to estimate maximal statistical power. 70x is overkill for a lot of common things, so I wouldn't be surprised if you can get away with throwing more samples on there.
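
As a rough worked example of the arithmetic behind such a downsampling experiment: if a run is currently split across 3 samples at about 70X each, then mimicking a run split across n samples means keeping roughly 3/n of the reads. The sketch below assumes the total yield of the run stays the same and that reads are divided evenly among samples, neither of which is guaranteed in practice; `downsample_plan` is a hypothetical helper name.

```python
def downsample_plan(current_samples, current_coverage, candidate_samples):
    """Map each candidate sample count to (fraction of reads to keep, expected coverage).

    Assumes constant total yield and an even split of reads among samples.
    """
    plan = {}
    for n in candidate_samples:
        fraction = current_samples / n
        plan[n] = (fraction, current_coverage * fraction)
    return plan


# Numbers from this thread: 3 samples at ~70X, considering 4 or 5 samples.
for n, (fraction, coverage) in downsample_plan(3, 70.0, [4, 5]).items():
    print(f"{n} samples: keep {fraction:.2f} of reads, ~{coverage:.1f}X expected")
```

Under these assumptions, 4 samples corresponds to keeping 0.75 of the reads (about 52.5X) and 5 samples to keeping 0.60 (about 42X). Repeating the downstream analysis at several such fractions gives the points of the saturation curve.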

      • #4
        Yes, I should have said multiplexing instead of pooling. I'm conducting a population-genomics type of project and trying to sequence as many populations as possible without sacrificing coverage too much.

        Thanks dpryan!

        • #5
          Hi choijae3,

          70X coverage is very high and overkill for most applications, so in general you are better off sequencing more samples than sequencing the same thing over and over again.

          With respect to the duplication rate, I would recommend that you do not trust FastQC. FastQC estimates the duplication rate by looking at the first and second reads independently. Given your high coverage, it is very likely that you will get first reads starting at exactly the same spot by chance. I would recommend using Picard MarkDuplicates.jar to estimate the duplication rate after alignment, as this takes into account both the first and second reads of each pair.
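
The difference between the two estimates can be illustrated with a toy example. This is a deliberately simplified sketch, not the actual algorithm of FastQC or Picard: each fragment is reduced to a hypothetical (read 1 start, read 2 start) position pair, and a "duplicate" is any fragment whose key has been seen before.

```python
def duplication_rate(keys):
    """Fraction of items that repeat an earlier key: 1 - unique / total."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1 - len(set(keys)) / len(keys)


# Made-up fragments as (read 1 start, read 2 start) positions.
fragments = [(100, 400), (100, 450), (100, 400), (200, 500)]

# Looking at read 1 only: three fragments share start 100.
single_end_rate = duplication_rate(start1 for start1, _ in fragments)

# Looking at both mates: only the two (100, 400) fragments coincide.
paired_end_rate = duplication_rate(fragments)

print(single_end_rate, paired_end_rate)
```

In this toy data the read-1-only view reports a higher duplication rate than the paired view, because fragments that share a read 1 start but differ at read 2 are distinct molecules. At high coverage such chance coincidences at one end become more common.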

          • #6
            Hi barkasn

            Thanks for the reply! I've been following the best practices from the Broad Institute and have done the mark-duplicates step. I hadn't paid much attention to its output (I really should have), and I found it to be more helpful. Thanks again for the advice!
