
  • WXS: High amount of duplications! Enormous file size difference in normal-tumor pair!

    I wanna preface this by saying I'm relatively new to NGS analysis.

    I recently received raw WXS data (paired-end 100 bp reads, 100X coverage, 12 Gb of data, captured with the Agilent SureSelect All Human Exon V5 kit). I noticed something was really off from the get-go.

    The file size difference between the normal and tumor pair is enormous. The read 1 and read 2 FASTQ files of the normal sample are about 7 GB each, while those of the tumor sample are about 40 GB each. I've worked with exome data before, and the files were usually close in size.

    Anyway, I assumed everything was okay and put the raw data through the "pipeline." You know, the usual: alignment, sorting, duplicate marking, indel realignment, mate-info fixing, base recalibration, etc. Almost 45% of the reads failed the DuplicateReadFilter (GATK BaseRecalibrator), and another 45% failed the MappingQualityZeroFilter (GATK BaseRecalibrator)! I have to filter out >90% of my sequence reads!

    In my previous runs, I've filtered out at most ~10%.
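One way to cross-check those filter percentages outside of GATK is to tally the same two properties straight from the alignments. The following is only a rough sketch: it assumes SAM-formatted text (for example piped from `samtools view`), and the function name `tally_filtered` is made up for illustration. It relies on two facts from the SAM format: the duplicate flag is bit 0x400, and mapping quality is the fifth column. It will not reproduce GATK's exact counts, since GATK applies its read filters in sequence and unmapped reads also carry MAPQ 0.

```python
from collections import Counter

DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def tally_filtered(sam_lines):
    """Count duplicate-flagged and MAPQ-0 records in SAM-formatted lines.

    Hypothetical helper for illustration; not part of GATK or samtools.
    """
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        mapq = int(fields[4])
        is_dup = bool(flag & DUPLICATE_FLAG)
        tally["total"] += 1
        if is_dup:
            tally["duplicate"] += 1
        if mapq == 0:
            tally["mapq0"] += 1
        if is_dup or mapq == 0:
            tally["either"] += 1
    return tally


if __name__ == "__main__":
    import sys

    counts = tally_filtered(sys.stdin)
    total = counts["total"] or 1  # avoid division by zero on empty input
    for key in ("duplicate", "mapq0", "either"):
        print(f"{key}: {counts[key]} ({100.0 * counts[key] / total:.1f}%)")
```

Saved as a script, it could be fed with something like `samtools view sample.bam | python tally.py`, where both file names are placeholders.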

    This made me run FastQC on the reads, which I should have done at the beginning. In the sequence duplication levels section, FastQC reports that only 25% of the sequences will remain if deduplicated! I also see double peaks in the "per sequence GC content" plot!
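For intuition about what a "percent remaining after deduplication" figure measures, here is a minimal sketch that counts distinct sequences in a sample of reads. It is not FastQC's own algorithm; the `prefix_len` and `max_reads` defaults and all function names are assumptions chosen for illustration, and it compares exact sequence prefixes only.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, gzip-compressed or plain."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def fraction_remaining_after_dedup(seqs, prefix_len=50, max_reads=200_000):
    """Fraction of sampled reads left if exact-duplicate prefixes are collapsed."""
    counts = Counter(s[:prefix_len] for s in islice(seqs, max_reads))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0
```

Usage would look like `fraction_remaining_after_dedup(read_sequences(open_fastq("tumor_R1.fastq.gz")))`, with the file name being a placeholder. A value near 0.25 would be in the same ballpark as the figure reported above, although the two numbers are not computed the same way.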

    All this is new to me. I'm used to seeing yellow bars in the sequence quality section, but I don't see them here for some reason.

    Would someone show me the ropes?

  • #2
    File sizes fluctuate depending on how the samples were pooled (were they run together, or pooled with others in different lanes?). Both files are big enough in the first instance not to be too concerned, but I would mention it to whoever is doing the lab work.

    You should be trimming adapters and trimming for quality prior to alignment; this is pretty standard now. See BBDuk from BBMap, which I prefer (very fast). The fastp package seems good too.

    A lot of duplicates: points to either low diversity or poor capture. What was the cellularity of the tumour as assessed by the pathologist? Fewer cells going into the library means lower complexity and more duplicates. Is it FFPE? What was the starting DNA input? If this was too low, that could also reduce complexity. How was the DNA quantified? It has to be dsDNA being quantified; NanoDrop is simply not good enough. Possibly reducing PCR cycles at library prep could help if all else is OK. Who did the exome capture? Did it work previously, or is this the first time using this particular kit?

    Double peaks in GC-content plot: uh-oh. Looks like contamination. That would also explain your 45% MappingQualityZero reads. You could try something like MGA to identify the contaminant.
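To look at the GC distribution yourself rather than relying only on the plot, a rough sketch like the one below bins per-read GC percentages. The function names and the 5% bin width are assumptions for illustration; ambiguous bases such as N are left out of the denominator. Two well-separated clusters of bins would be consistent with a mixture of sources, but on its own this does not identify a contaminant.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one read, ignoring non-ACGT bases; None if no bases called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(seqs, bin_width=5):
    """Map the lower edge of each GC% bin to the number of reads falling in it."""
    hist = Counter()
    top_bin = 100 - bin_width  # reads at exactly 100% GC go in the last bin
    for seq in seqs:
        gc = gc_percent(seq)
        if gc is not None:
            hist[min(int(gc // bin_width) * bin_width, top_bin)] += 1
    return dict(sorted(hist.items()))
```

The input can be any iterable of read sequences, for example the second line of every four-line FASTQ record.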

    Basically, go and talk to the lab person who ran the sample, ask what happened. I am willing to bet this is not a routine thing for them.

    More importantly for you, all this raises the question: who are you reporting to, and where is your support on the ground? There really should be someone you can go and speak with who can identify what these issues are. Asking on the internet is fine for refining your skills, but you are, by your own admission and on the evidence, new to this.
