  • Problem with FASTQC on Trinity Mouse DC reads example dataset

    Dear SEQanswers,


    My name is Alex SMITH and I've recently started an RNA-seq bioinformatics post-doc at the Malaghan Institute of Medical Research. In order to practice with the tools I'll be using, I decided to try to map the RNA-seq reads from the Mouse Dendritic Cell dataset (GSE29209; GSM722533), which was used as an example in the original Trinity paper "Full-length transcriptome assembly from RNA-Seq data without a reference genome" (2011), to the latest mouse genome. I downloaded the paired-end read file (52.6 M reads) and split it into 2 separate files, one per end. However, before attempting the mapping, given that the reads were generated using Illumina technology, I decided to run them through FASTQC to get a feel for them. I was very surprised when FASTQC reported very high levels of read duplication: for each end file, the 4 most duplicated reads accounted for almost 3% of all reads (each representing more than 35k reads), and the total sequence duplication level reported is >=70% in both cases.

    I realise that FASTQC is not the best software for getting an idea of sequence duplication, given that it does not take paired ends into account, limits itself to unique sequences from the first 50 nucleotides of the first 200 000 reads, and is known to give such results. But as I am not very experienced with RNA-seq, these results worried me. I tried to find out what these 4 most duplicated reads corresponded to by blasting them against the mouse genome, the whole of the nr database (temporary report link expires on 10-25 05:00 am), the 92 common ERCC RNA-seq spike-in control sequences, and whatever Illumina adaptors, primers, barcodes etc. I could find. However, I have come up completely blank! Looking at the sequences of these heavily replicated 50 nt read parts, I also noticed that there were very few "double nucleotides" (the same base twice in a row), which one might expect in any given sequence. I've attached the two tables of over-represented sequences, and copied these sequences below (they are different, and not mirrors, for ends 1 and 2):

    Ends1:
    TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC
    CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT
    AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA
    GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG

    Ends2:
    TGTCGATTATCGCACTGGTGCGAATGGATACGCGACATCTATCTGATGAC
    CACTATAGCGATAGACAAGCATGCGCTGCGTCGACTCAGATGAGTGCACG
    GCGATCGCTCTATCTGCTCATCTGCACTGCATATGAGCACGCTACTGCTA
    ATAGCGCAGAGCGTGATCATGACTATACATGATCTGTGTGCAGCACATGT
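
The two observations above (few "double nucleotides", and the end-1 and end-2 sequences not being mirrors of each other) can be checked with a short script. The following is only a rough sketch: the function names are made up for illustration, and the comparison point is an assumption, namely that in a sequence of independent, uniformly distributed bases about 25% of adjacent positions would hold the same nucleotide.

```python
# Sketch: measure how often the same base occurs twice in a row, and test
# whether any end-1 sequence is the reverse complement of an end-2 sequence.
# Function names are illustrative, not taken from any library.

ENDS1 = [
    "TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC",
    "CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT",
    "AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA",
    "GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG",
]
ENDS2 = [
    "TGTCGATTATCGCACTGGTGCGAATGGATACGCGACATCTATCTGATGAC",
    "CACTATAGCGATAGACAAGCATGCGCTGCGTCGACTCAGATGAGTGCACG",
    "GCGATCGCTCTATCTGCTCATCTGCACTGCATATGAGCACGCTACTGCTA",
    "ATAGCGCAGAGCGTGATCATGACTATACATGATCTGTGTGCAGCACATGT",
]

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def adjacent_repeat_fraction(seq: str) -> float:
    """Fraction of adjacent position pairs that hold the same nucleotide."""
    if len(seq) < 2:
        return 0.0
    same = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return same / (len(seq) - 1)


def shared_with_reverse_complements(first: list, second: list) -> list:
    """Sequences in `first` that equal the reverse complement of one in `second`."""
    rc_second = {reverse_complement(s) for s in second}
    return [s for s in first if s in rc_second]


for name, seqs in (("Ends1", ENDS1), ("Ends2", ENDS2)):
    for s in seqs:
        print(name, s, f"{adjacent_repeat_fraction(s):.3f}")
print("Reverse-complement matches:", shared_with_reverse_complements(ENDS1, ENDS2))
```

A fraction far below 0.25 for every sequence would support the impression that these reads are not ordinary transcript fragments, though real transcripts are not uniform random sequence, so the comparison is only indicative.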


    I have attached the FASTQC duplicate sequence graphs and the per-base sequence quality box plots (for end 1) as well. Please note that the first 4 over-represented sequences did not seem to correspond to any particular quality distribution (i.e. they were not all low-quality). Obviously the 5th for each was!


    Searching on SEQanswers, I found these interesting threads, but was not able to find a consensus interpretation:
    http://seqanswers.com/forums/showthread.php?t=24094
    http://seqanswers.com/forums/showthread.php?t=28607
    http://seqanswers.com/forums/showthread.php?t=30397
    http://seqanswers.com/forums/showthread.php?t=24040


    A blog post on FASTQC duplicate sequences (pointed to by one of the threads) was interesting as well:
    http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/


    I have no idea how to interpret the strong presence of these peculiar sequences other than as some problem in the library preparation, which I would find surprising given that this dataset was used as an example in a paper. Bottom line, I don't know what the best way to deal with them would be: keep them (as the mapping results should not be impacted), or remove them (only losing about 3% of total reads)? Should I just go ahead and do the mapping, then use Picard tools to look at library diversity (but then these reads shouldn't map anyway)? Maybe in practice it doesn't change anything, but I would like to be sure I understand what I'm doing (or not doing)!
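
To see how many reads removing these sequences would actually cost across the whole file, rather than only in the subset FASTQC samples, one could tally the 50 nt read prefixes directly. This is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; `read_sequences` and `top_prefixes` are hypothetical helper names, and this is not how FASTQC itself computes duplication.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def top_prefixes(
    lines: Iterable[str],
    prefix_len: int = 50,
    max_reads: Optional[int] = None,
    top_n: int = 4,
) -> Tuple[List[Tuple[str, int]], int]:
    """Count read prefixes and return the most common ones plus the read total.

    max_reads=None scans every read; a number limits the scan to the first
    max_reads reads.
    """
    counts: Counter = Counter()
    total = 0
    for seq in islice(read_sequences(lines), max_reads):
        counts[seq[:prefix_len]] += 1
        total += 1
    return counts.most_common(top_n), total


# Example usage (the file name is a placeholder):
# with open("reads_end1.fastq") as handle:
#     top, total = top_prefixes(handle)
#     for prefix, count in top:
#         print(prefix, count, f"{100 * count / total:.2f}%")
```

Holding one counter entry per distinct prefix can use a lot of memory on a file with tens of millions of reads, so limiting `max_reads` is a reasonable first pass.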


    Thank you in advance for any help or enlightenment you can bring and, as always, thanks for reading!


    -- Alex

  • #2
    Doesn't anybody have any ideas? I'm sorry for the long post, but I wanted to make sure I had "done my homework" before posting for help... The question boils down to:

    "What could these Illumina reads with very high FASTQC duplication levels be, after eliminating all the most obvious answers?"

    Thanks,

    -- Alex



    • #3
      In my experience, RNA sequencing reads do have a relatively high duplication rate by their nature. Most of the time I don't read much into the duplication rate from the FastQC report and focus mainly on the sequence quality scores.



      • #4
        Hi choishingwan,

        Thank you for your answer! I guess maybe I'm looking too much into this... Sometimes you just get problems in the data, in the tools, or both, and you just have to work with them anyway. In this case I would like to point out that some of these replicates had very high read quality scores, while others didn't, so I didn't find any pattern there.

        Either way, I'm going to find another practice dataset!

        Best regards,

        -- Alex



        • #5
          Try to see whether those reads all come from the same lane, or whether they are the second read of the read pair. Usually a lane fails as a whole, and in general the second read of a pair has a relatively lower quality score. If I remember correctly, you should be aiming for Q30 > 80%; you can check Illumina's specifications. Another thing to look for is a high amount of over-represented sequence at the beginning of your reads, which might be adapters that require trimming, though I haven't yet had data that required it.
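
The Q30 figure mentioned above can be estimated directly from the quality lines of a FASTQ file. This is a rough sketch that assumes Phred+33 quality encoding (older Illumina pipelines used an offset of 64, so the `offset` parameter may need changing for this dataset); `q30_fraction` is a hypothetical helper name.

```python
from typing import Iterable


def q30_fraction(quality_lines: Iterable[str], offset: int = 33, threshold: int = 30) -> float:
    """Fraction of bases whose Phred quality is at least `threshold`.

    quality_lines: the fourth line of each FASTQ record.
    offset: ASCII offset of the quality encoding (33 assumed here).
    """
    total = 0
    passing = 0
    for qual in quality_lines:
        for ch in qual.strip():
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Example usage (the file name is a placeholder):
# with open("reads_end2.fastq") as handle:
#     quals = (line for i, line in enumerate(handle) if i % 4 == 3)
#     print(f"Q30: {100 * q30_fraction(quals):.1f}%")
```

Running it separately on the end-1 and end-2 files would show whether the second read of the pair is noticeably worse.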
