  • high sequence duplication levels for Illumina RNA-Seq meta-transcriptomics

    Hi there,

    I recently received RNA-Seq data from a number of gut meta-transcriptomics samples, about 100 million 2x100 bp Illumina HiSeq reads per sample. As the biopsy samples are mainly human tissue, we used MICROBEnrich to deplete human RNA, in addition to MICROBExpress to deplete rRNA. Because of the small cDNA quantities, we had to amplify the cDNA with GenomiPhi before sending it off to the Illumina sequencing provider. Having checked the sequences, I notice surprisingly low proportions of rRNA (5%, according to riboPicker) and manageable proportions of human mRNA (28%, according to bowtie2 against the human cDNA/ncRNA reference); the exact commands are below.
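    Code:
    # riboPicker (ribopicker-standalone-0.4.3) call used for the rRNA estimate:
    ribopicker.pl -c 50 -i 70 -l 30 -f sample1.fastq -dbs rrnadb

    # bowtie2 call used for the human mRNA estimate:
    bowtie2 -t --qc-filter --very-fast-local \
        -x Homo_sapiens.GRCh37.67.cDNA_ncRNA \
        -1 sample1.1.fq -2 sample1.2.fq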

    However, running FastQC for quality control reveals very high Sequence Duplication Levels: a majority of samples fail, with a Total Duplicate Percentage above 60%. There are two typical duplication scenarios. The first:

    >>Sequence Duplication Levels fail
    #Total Duplicate Percentage 82.04422080939267
    #Duplication Level Relative count
    1 100.0
    2 16.117142135824288
    3 8.134309517798537
    4 5.544054531683918
    5 4.385256248422116
    6 3.500378692249432
    7 2.910881090633678
    8 2.5334511486998235
    9 2.3024488765463267
    10++ 107.03357737944964
    >>END_MODULE
    >>Overrepresented sequences warn
    #Sequence Count Percentage Possible Source
    CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG 269367 0.24784156137009358 No Hit
    ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG 192996 0.17757345917719164 No Hit
    ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT 174958 0.16097689729695483 No Hit
    TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT 164622 0.15146685939950902 No Hit
    TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC 163800 0.15071054640108597 No Hit

    These sequences mainly hit Parabacteroides distasonis ATCC 8503, Bacteroides vulgatus ATCC 8482 and Homo sapiens BAC clone RP11-327N17. These strains are very likely inhabitants of this environment.

    The second duplication pattern is:

    >>Sequence Duplication Levels fail
    #Total Duplicate Percentage 83.36543150690594
    #Duplication Level Relative count
    1 100.0
    2 31.920496894409936
    3 25.20248447204969
    4 22.56645962732919
    5 21.696894409937887
    6 19.985093167701862
    7 17.77639751552795
    8 15.91055900621118
    9 14.596273291925465
    10++ 227.23975155279504
    >>END_MODULE
    >>Overrepresented sequences pass
    >>END_MODULE
    >>Kmer Content warn
    #Sequence Count Obs/Exp Overall Obs/Exp Max Max Obs/Exp Position
    TTTTT 30440525 3.3208988 4.2889795 95-97
    AAAAA 31526075 3.3173122 3.9691818 95-97

    Note there are no overrepresented sequences here, just a generally high duplication rate, which could possibly be related to the high k-mer content (the poly-T/poly-A runs peaking at positions 95-97).

    Could this be due to some sequencing or library prep artefact, for instance the GenomiPhi cDNA amplification? Could the Parabacteroides strain simply be very abundant and highly expressed in these samples, or are the duplicated sequences TOO freakishly identical to be biologically plausible? Contamination?

    What would you recommend I do before de novo assembling the meta-transcriptomes and carrying out DESeq differential gene expression analysis on these samples?
    a) leave the data as they are and proceed as if nothing is wrong
    b) remove exact duplicates, e.g. with fastx_collapser from the FASTX-Toolkit (see the sketch below), and carry on from there. If so, is it safe to use the duplication counts in the subsequent analysis?
    c) something else?
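    A minimal sketch of option (b), untested, using the file names from above (fastx_collapser records each collapsed read's copy number in the FASTA header):

    Code:
    # Collapse exact duplicates in each mate file; output headers look
    # like ">1-269367", i.e. rank 1 with 269367 copies.
    fastx_collapser -Q33 -i sample1.1.fq -o sample1.1.collapsed.fa
    fastx_collapser -Q33 -i sample1.2.fq -o sample1.2.collapsed.fa
    # Caveat: collapsing the mate files separately breaks the read
    # pairing that bowtie2 -1/-2 and Velvet/Oases rely on.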

    Many thanks for any advice on this!

    Regards,
    Marcus

  • #2
    Obviously it is possible to start with such a small quantity of material that duplicates start appearing. I have had that problem occasionally. Your best bet is (c) -- something else -- that "something" being re-sequencing with a larger amount of starting material so that you get good, valid counts. Failing that, you could just throw away the samples with high duplication (I presume you have replicates for each condition), though this will make your statistics shakier. Lacking either of those options, (b) would be your next step. The statistics will again become shaky as you start dealing with smaller numbers, but at least you will have something. I would not go for (a), because getting good statistics out of that does not seem feasible.

    Good luck with the analysis.

    • #3
      Thanks for your reply Rick! Just to clarify, would you say I have simply been over-sequencing the bacterial diversity as a result of too little starting material, and that this is a case of "A low level of duplication is seen across a whole library" as described in http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/ ?

      Do you think over-amplification from the GenomiPhi random-primed amplification could also play a part?

      Unfortunately I can't just throw away the samples where this occurs, since it happens in a majority of them. Nor can I go back and re-sequence, since there is no more biopsy material. I will probably carry on both with and without removing the exact duplicates and see where that leaves me. As this project is a pilot for a larger one using more samples, and hopefully better library prep methods, I hope to make use of these experiences. Would you then recommend avoiding cDNA amplification by using e.g. the Nextera protocol, which needs less starting material? Or something else?

      Many thanks,
      Marcus

      • #4
        I have been thinking about your problem and really need more information. It isn't the percent duplication that bothers me -- RNA-seq projects can show a high percentage due to a low "sequence space" (i.e., the total number of nucleotides in a transcriptome is relatively small compared to how many reads we can produce these days) -- as much as the detection of over-represented sequences. That usually indicates to me that a bottleneck in the laboratory procedure left only a few molecules available for sequencing pre-amplification. Your first sequence -- CCA...GAG -- has over 260K copies in the final pool, so why was that one molecule amplified so much? The usual answer is that there was a small pool of molecules to begin with. On the other hand, the overall percentages are rather similar (0.15% to 0.24%). I am not sure what FastQC's cutoff is, but 0.15% would not be surprising.

        So perhaps the better explanation is this one:

        "Every sequence in a library occurs a large number of times ... If a library has very limited diversity is then subjected to high throughput sequencing then it’s possible that every sequence seen in the library is likely to be seen many times. ... The root cause of this type of result is low diversity in the original library. "

        It would be interesting to see how highly duplicated the rest of your sequences are. In other words, are those 5 sequences the only over-represented ones? Or is there a bunch of sequences below 0.15% but above, oh say, 0.05%? And if so, how many different sequences in total make up 50% of your sequenced reads?


        To pound away at the point: if there are 5 sequences at 0.15%+ of the reads and 1,000,000 sequences at 0.001%, then this is troublesome but no worse than many transcriptome projects. However, if there are 50 sequences at 0.10%+ and those 50 sequences comprise 50%+ of the total reads, then that would indicate a bottleneck occurred.


        I do not think that GenomiPhi is at fault here. Rather, you had a limited number of original molecules to start with.

        But I could be completely wrong.

        • #5
          Note:

          Taking the sequences

          #Sequence Count Percentage Possible Source
          CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG 269367 0.24784156137009358 No Hit
          ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG 192996 0.17757345917719164 No Hit
          ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT 174958 0.16097689729695483 No Hit
          TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT 164622 0.15146685939950902 No Hit
          TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC 163800 0.15071054640108597 No Hit

          converting to fasta ...
          >x1
          CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG
          >x2
          ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG
          >x3
          ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT
          >x4
          TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT
          >x5
          TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC

          running on-line blat against hg19 ( http://genome.ucsc.edu/cgi-bin/hgBlat ) gets these results ...


          Code:
          BLAT Search Results
          
             ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
          ---------------------------------------------------------------------------------------------------
          browser details x1                23    16    38    50 100.0%    11   -   77378493  77378515     23
          browser details x3                50     1    50    50 100.0%     2   +  154270955 154271004     50
          browser details x3                27    23    50    50 100.0%    12   -   71040725  71040822     98
          browser details x3                20    27    50    50  91.7%     2   +  115802770 115802793     24
          browser details x4                31     9    40    50 100.0%     4   +   62444206  62745451 301246
          browser details x4                22    20    41    50 100.0%     5   +  120809005 120809026     22
          browser details x4                21    12    32    50 100.0%     3   +  145927464 145927484     21
          browser details x4                20    11    30    50 100.0%     5   -  111939399 111939418     20
          browser details x5                50     1    50    50 100.0%     2   +   20472928  20472977     50
          2 of the 5 reads (x3 and x5) hit the human genome exactly, full-length at 100% identity.

          I'm stumped.
          Last edited by Richard Finney; 06-22-2012, 01:21 PM.

          • #6
            For my sake, try the following on the sample you gave. Use your FastA data (which I will call 'file.fa'):

            grep -v '>' file.fa | sort | uniq -c | sort -rn > file.count

            The above takes your FastA file and sorts and counts the unique reads -- it assumes each read's sequence is on a single line.

            perl -nale '$count++; $total+=$F[0]; print "Total reads: $total\nDifferent reads: $count"' file.count

            The above counts the reads. It prints running totals line by line, so the last two lines give the final numbers.

            Take the final 'Total reads' number from the above and put it in as XXXXXX:

            perl -nale '$count++; $n50+=$F[0]; exit if $n50>(XXXXXX/2); END{print "N50 count: $count"}' file.count

            The above gives us an N50-style number: how many of the most-duplicated distinct reads account for half of the total reads.

            There are many ways to do the above (one consolidated sketch follows the list below), but what I hope to get out of this is:

            Total reads
            Count of different reads
            How many of the different reads represent 50% of the total reads
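            For instance, a one-pass awk sketch over file.count that prints all three (it relies on file.count being sorted in descending order, which the pipeline above guarantees):

            Code:
            awk '{ total += $1; count[NR] = $1 }
                 END {
                   print "Total reads:     " total
                   print "Different reads: " NR
                   for (i = 1; i <= NR; i++) {
                     run += count[i]
                     if (run > total / 2) { print "N50 count:       " i; exit }
                   }
                 }' file.count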


            BTW: FastQC does stop reporting overrepresented sequences at 0.1% of the total reads -- or at least close to that cutoff; I did not test it exactly.

            • #7
              Ok Rick - this is what I did and what I got; let me know if I misunderstood anything:

              grep -v '>' file.fa | sort | uniq -c | sort -rn > file.count
              $ head file.count
              115796 CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAGGAGTTAAACCAAGCAAGTTATGACGGCATGCCGTCATAACTTGCTTGGTTT
              98240 AGGAGAATCAATTTAGTCTGATACATTTCAAATAAATGAATCAGACAGAGGAACCCCCTAAAACCTTTATAGAAGAGGAGGCCGAAAAAATATCTATTAAA
              91530 TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTCAA
              83481 ATTGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTC
              82107 ATGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGACTATGCAAAGTATAACATCTGGCTACCAGGCATTTAATGTTAAATGCCTGG
              75899 CCAGGAGAATCAATTTAGTCTGATACATTTCAAATAAATGAATCAGACAGAGGAACCCCCTAAAACCTTTATAGAAGAGGAGGCCGAAAAAATATCTATTA
              74966 ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAGCGATGATGAGCTAGCTAAAAAGGCTATCGAGGAGTTAAACCAAGCAAGTTA
              74472 TTGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTCA
              74225 GTATGCGGATGGATTGATATTTATGGAATCACAATTCCAAACTACTCAGAAATATCAATCGGTAGATCCGGATATCATGGATGAAGCTATTGCGGTAGGCT
              73870 GGTTTAGGTCATGGGCTGATCACCCTAGAGAAAATATTAAAGAGAATGATGGATCCATGTTCAGTTGGAGTCCAGCTTCGTACTACAAATGAGTGCCATAA


              $ perl -nale '$count++; $total+=$F[0]; print "Total reads: $total\nDifferent reads: $count"' file.count >t.count
              $ head t.count
              Total reads: 115796
              Different reads: 1
              Total reads: 214036
              Different reads: 2
              Total reads: 305566
              Different reads: 3
              Total reads: 389047
              Different reads: 4
              Total reads: 471154
              Different reads: 5

              $ tail t.count
              Total reads: 83991911
              Different reads: 20859301
              Total reads: 83991912
              Different reads: 20859302
              Total reads: 83991913
              Different reads: 20859303
              Total reads: 83991914
              Different reads: 20859304
              Total reads: 83991915
              Different reads: 20859305

              $ perl -nale '$count++; $n50+=$F[0]; exit if $n50>(83991915/2); END{print "N50 count: $count"}' t.count
              N50 count: 41718610

              In case you're interested in the other fastqc reports you can get them here:

              The sample above is the quality-filtered version (fastq_quality_filter -i 1_1.fastq -q 20 -p 80 -Q 33 | fastx_artifacts_filter -Q33 | fastq_quality_trimmer -Q33 -l 50 -t 20) of the sample from my first post.

              Hope this helps to clarify things.
              Thanks
              Marcus

              • #8
                Hi Richard,

                These meta-transcriptomics samples are from an environment where a majority of cells actually are human. I guess this explains the many hg19 hits.

                Marcus

                • #9
                  Hi Richard,

                  Please let me know if you need any more info to figure this one out - I really appreciate your help on this!

                  Also, if I were to remove all exact duplicates, how would I go about it while retaining the paired-end info? If I, for example, use fastx_collapser, that info is lost. I need it to be able to a) align properly against reference genomes using bowtie2, and b) assemble the meta-transcriptomes using e.g. Velvet/Oases.

                  Thanks for any help on this,
                  Marcus

                  • #10
                    Well ... I'm not a wet lab kinda guy .. more of a computer guy.

                    Is this bacterially infected human tissue (as in colitis or Crohn's)?

                    It's RNA-seq, so reads should hit assembled exons, human and otherwise, not arbitrary human DNA. And it's not necessarily surprising that bacterial or viral sequence matches human DNA (BLAT a bacterial or viral genome against human and you'll get lots of good matches; some pathogens work by manipulating host gene expression).

                    You can run these against NCBI BLAST ( http://blast.ncbi.nlm.nih.gov , using the (wgs) Whole-genome shotgun contigs database ) by hand, and they hit bacteria; a scripted version is sketched below.
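                    The same search can be scripted with BLAST+ (a sketch, not tested here; it queries the nt database remotely rather than the wgs set mentioned above, and overrep.fa is a placeholder for a FASTA of the overrepresented sequences):

                    Code:
                    # Send the sequences to NCBI's servers; tabular output (-outfmt 6)
                    # lists the top hits for each query.
                    blastn -remote -db nt -query overrep.fa -outfmt 6 -max_target_seqs 5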

                    Having a lot of recurrent sequences is not necessarily a crime. It could be real (i.e. the bacteria are rampantly expressing that sequence). It could be that you're getting "amplification" from your wet-lab process, which is often dealt with: you can throw the dupes away, particularly if you're looking for mutations. If you're measuring expression, you've got more of a problem.
                    _____

                    The rest of this post is pure tangential speculation and may provoke an insight ... or could be nonsense ...

                    I grabbed the "head file.count" data from your 6-23-2012 post and counted the 6-, 7- and 8-mers (a sketch for reproducing the count follows the list below).

                    TTAATTAA (reverse compliment is palindromic) shows up 6 times on 3 reads.

                    What is this? Is it a restriction enzyme site? I'm clueless. I don't know this stuff. Google leads me here: http://www.neb.com/nebecomm/products/productR0547.asp

                    The highly recurring 6-, 7- and 8-mers are:
                    GAATCA 5 <--- whoo ooooh... the movie ... (well it's close)
                    AATTAA 6
                    AATTAAT 6
                    AATTAATT 6
                    ATTAATTA 6
                    TAATTA 6
                    TAATTAA 6
                    TAATTAAT 6
                    TGTTAT 6
                    TTAATTA 6
                    TTAATTAA 6
                    ATTAAT 9
                    ATTAATT 9
                    TTAATT 9
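                    If anyone wants to reproduce the count, a rough awk sketch (reads.txt is a placeholder file with one sequence per line, e.g. the ten reads from the "head file.count" output):

                    Code:
                    # Count every 8-mer across the reads and print the most frequent.
                    awk '{ for (i = 1; i <= length($0) - 7; i++)
                             kmer[substr($0, i, 8)]++ }
                         END { for (k in kmer) print kmer[k], k }' reads.txt \
                      | sort -rn | head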

                    • #11
                      Sorry for the delayed response. I have been off of SeqAnswers for about a week. Had tons of messages to wade through when I came back.

                      I am still wondering whether your sample is strange or not. RNA-seq projects by definition have a lot of repetition; however, one would hope that any given transcript is represented by multiple overlapping reads instead of a single read amplified many times, as seems to be the case here, at least for a handful of reads.

                      Sort of thinking out loud here ...

                      You have 84M reads, of which 21M are unique. Some of those unique reads may differ only by sequencing error; in other words, there may be reads sharing the same starting base ("start site") that should be considered non-unique. So 21M is an upper limit.

                      A human transcriptome -- I know, you are mainly bacterial -- is, well, it depends on the organ and environment, but say somewhere around 30M bases minimum, and thus roughly 30M possible read start sites. If you were just sequencing human and each transcript were represented once, then we would expect the 84M reads to be duplicated at around 3X, i.e. 84M divided by 30M. But obviously not every transcript is represented the same number of times, so the distribution will range from, oh say, 3/100 to 3*100 -- a 10,000-fold range. In other words, I would expect the maximum duplication level of any read to be around 300. All very rough numbers (made explicit below), but given that you have duplicates in the upper tens of thousands, my suspicion is that something is not correct.
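                      To make the back-of-the-envelope arithmetic explicit (every input is one of the rough assumptions above, not a measurement):

                      Code:
                      reads=84000000   # total reads in the library
                      sites=30000000   # assumed distinct start sites
                      echo "scale=1; $reads / $sites" | bc   # ~2.8x mean duplication
                      echo "100 * $reads / $sites" | bc      # ~280 plausible maximum
                      # Duplication levels in the tens of thousands, as seen here, are
                      # several hundred times higher than this crude expectation.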

                      Going back to your post: removing duplicates is tricky, since you would also be removing the count information.

                      I still do not have a good feel for just how many reads are duplicated at a high level. Is there an obvious drop-off point when you look at just the counts, e.g. with awk '{print $1}' file.count (uniq -c pads with spaces rather than tabs, so cut -f 1 will not split it)? Doing that should produce:

                      115796
                      98240
                      91530
                      83481
                      82107
                      75899
                      74966
                      74472
                      74225
                      73870
                      ...

                      All of the numbers above are close enough to each other to be considered the same. The question is whether the slope continues slowly downward or whether there is an abrupt drop-off. If there is, then we may have evidence of a PCR duplication event. Perhaps you can toss the numbers into a spreadsheet (or plot them as sketched below) and give us a graph?
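                      Or, instead of a spreadsheet, a quick gnuplot sketch (counts.txt and dupes.png are placeholder names):

                      Code:
                      # Plot rank vs. copy number on log-log axes; a bottleneck should
                      # show up as an abrupt cliff rather than a smooth slope.
                      awk '{print $1}' file.count > counts.txt
                      gnuplot -e 'set terminal png; set output "dupes.png";
                                  set logscale xy; set xlabel "rank"; set ylabel "copies";
                                  plot "counts.txt" using ($0+1):1 with lines notitle'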

                      In the meantime I am going back through the transcriptome projects I have handy in order to run the same type of analysis I am asking of you, so that we have something to compare against.

                      • #12
                        Thanks again, both Richards!
                        Here are those slopes at different scales; there doesn't seem to be any drop-off... Do you see the same?
                        Attached Files

                        • #13
                          If I understand the question and what has been done thus far, it seems that no read-pair information has been used to determine the duplication levels. In contrast to genomic DNA sequencing, you expect overrepresented sequences in transcriptome sequencing. For instance, if a particular transcript is sequenced to a depth over 100 and you are using 100 bp reads, you will see duplicate sequences that are not the result of PCR amplification or optical duplicates from cluster generation but are instead the true signal you are trying to capture. When I look at duplication levels in my RNA-Seq data, I always consider both ends of a pair, because it is much less likely (though there is still some chance, especially for really abundant transcripts) that read pairs with the same start and end coordinates at both ends arise from true signal rather than from PCR/optical duplicates.

                          I have done this in two ways, and am not sure which approach you might want to try (a sketch of each follows the list):
                          1) Concatenate both reads of a pair (in this case 100 bp + 100 bp) into one 200 bp sequence (I modified a script from Galaxy for this purpose) and run FastQC or a similar tool on the concatenated reads to determine the % duplication before alignment
                          2) Align the reads first, then use PicardTools to mark and/or remove duplicates in the resulting BAM file using the paired-end information
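                          Rough sketches of both approaches (all file and index names are placeholders, and the shell one-liner stands in for the Galaxy script mentioned in 1):

                          Code:
                          # Approach 1, shell version: flatten each FASTQ to one line per
                          # read, join the mates side by side, keep one record per unique
                          # sequence *pair*, then split back into two in-sync files.
                          # Note that the per-pair copy numbers are discarded.
                          paste <(paste - - - - < sample1.1.fq) <(paste - - - - < sample1.2.fq) \
                            | sort -t$'\t' -u -k2,2 -k6,6 \
                            | awk -F'\t' '{ print $1"\n"$2"\n"$3"\n"$4 > "dedup.1.fq";
                                            print $5"\n"$6"\n"$7"\n"$8 > "dedup.2.fq" }'

                          # Approach 2: align the pairs, coordinate-sort, then let Picard
                          # MarkDuplicates use both mates' positions to flag duplicates.
                          bowtie2 -x ref_index -1 sample1.1.fq -2 sample1.2.fq \
                            | samtools sort -o sample1.sorted.bam -
                          java -jar picard.jar MarkDuplicates I=sample1.sorted.bam \
                            O=sample1.dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true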
