  • high sequence duplication levels for Illumina RNA-Seq meta-transcriptomics

    Hi there,

    I recently received RNA-Seq data from a number of gut meta-transcriptomics samples, about 100 million 2x100 bp Illumina HiSeq reads per sample. As the biopsy samples are mainly human tissue, we used MICROBEnrich to deplete human RNA, in addition to MICROBExpress to deplete rRNA. Because of the small cDNA quantities, we had to amplify the cDNA with GenomiPhi before sending it off to the Illumina sequencing provider. Having checked the sequences, I notice surprisingly low proportions of rRNA (5%, according to riboPicker) and manageable proportions of human mRNA (28%, according to bowtie2 against the human cDNA/ncRNA reference); the exact commands are below.
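    Code:
    # riboPicker (ribopicker-standalone-0.4.3) call used for the rRNA estimate:
    ribopicker.pl -c 50 -i 70 -l 30 -f sample1.fastq -dbs rrnadb

    # bowtie2 call used for the human mRNA estimate:
    bowtie2 -t --qc-filter --very-fast-local \
        -x Homo_sapiens.GRCh37.67.cDNA_ncRNA \
        -1 sample1.1.fq -2 sample1.2.fq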

    However, running FastQC for quality control reveals very high Sequence Duplication Levels: a majority of samples fail, with a Total Duplicate Percentage above 60%. There are two typical duplication scenarios. The first:

    >>Sequence Duplication Levels fail
    #Total Duplicate Percentage 82.04422080939267
    #Duplication Level Relative count
    1 100.0
    2 16.117142135824288
    3 8.134309517798537
    4 5.544054531683918
    5 4.385256248422116
    6 3.500378692249432
    7 2.910881090633678
    8 2.5334511486998235
    9 2.3024488765463267
    10++ 107.03357737944964
    >>END_MODULE
    >>Overrepresented sequences warn
    #Sequence Count Percentage Possible Source
    CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG 269367 0.24784156137009358 No Hit
    ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG 192996 0.17757345917719164 No Hit
    ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT 174958 0.16097689729695483 No Hit
    TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT 164622 0.15146685939950902 No Hit
    TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC 163800 0.15071054640108597 No Hit

    These sequences mainly hit Parabacteroides distasonis ATCC 8503, Bacteroides vulgatus ATCC 8482 and Homo sapiens BAC clone RP11-327N17. These strains are very likely inhabitants of this environment.

    The second duplication pattern is:

    >>Sequence Duplication Levels fail
    #Total Duplicate Percentage 83.36543150690594
    #Duplication Level Relative count
    1 100.0
    2 31.920496894409936
    3 25.20248447204969
    4 22.56645962732919
    5 21.696894409937887
    6 19.985093167701862
    7 17.77639751552795
    8 15.91055900621118
    9 14.596273291925465
    10++ 227.23975155279504
    >>END_MODULE
    >>Overrepresented sequences pass
    >>END_MODULE
    >>Kmer Content warn
    #Sequence Count Obs/Exp Overall Obs/Exp Max Max Obs/Exp Position
    TTTTT 30440525 3.3208988 4.2889795 95-97
    AAAAA 31526075 3.3173122 3.9691818 95-97

    Note there are no overrepresented sequences here, just a generally high duplication rate, which could possibly be related to the high k-mer content (the poly-T/poly-A runs peaking at positions 95-97).

    Could this be due to some sequencing or library prep artefact, for instance the GenomiPhi cDNA amplification? Could the Parabacteroides strain simply be very abundant and highly expressed in these samples, or are the duplicated sequences TOO freakishly identical to be biologically plausible? Contamination?

    What would you recommend I do before de novo assembling the meta-transcriptomes and carrying out DESeq differential gene expression analysis on these samples?
    a) leave the data as they are and proceed as if nothing is wrong
    b) remove exact duplicates, e.g. with fastx_collapser from the FASTX-Toolkit (see the sketch below), and carry on from there. If so, is it safe to use the duplication counts in the subsequent analysis?
    c) something else?
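    A minimal sketch of option (b), untested, using the file names from above (fastx_collapser records each collapsed read's copy number in the FASTA header):

    Code:
    # Collapse exact duplicates in each mate file; output headers look
    # like ">1-269367", i.e. rank 1 with 269367 copies.
    fastx_collapser -Q33 -i sample1.1.fq -o sample1.1.collapsed.fa
    fastx_collapser -Q33 -i sample1.2.fq -o sample1.2.collapsed.fa
    # Caveat: collapsing the mate files separately breaks the read
    # pairing that bowtie2 -1/-2 and Velvet/Oases rely on.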

    Many thanks for any advice on this!

    Regards,
    Marcus

  • #2
    Obviously it is possible to start with such a small quantity of material that duplicates start appearing. I have had that problem occasionally. Your best bet is (c) -- something else -- that "something" being re-sequencing with a larger amount of starting material so that you get good, valid counts. Failing that, you could just throw away the samples with high duplication (I presume you have replicates for each condition), though this will make your statistics shakier. Lacking either of those options, (b) would be your next step. The statistics will again become shaky as you start dealing with smaller numbers, but at least you will have something. I would not go for (a), because getting good statistics out of that does not seem feasible.

    Good luck with the analysis.

    • #3
      Thanks for your reply Rick! Just to clarify, would you say I have simply been over-sequencing the bacterial diversity as a result of too little starting material, and that this is a case of "A low level of duplication is seen across a whole library" as described in http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/ ?

      Do you think over-amplification from the GenomiPhi random-primed amplification could also play a part?

      Unfortunately I can't just throw away the samples where this occurs, since it happens in a majority of them. Nor can I go back and re-sequence, since there is no more biopsy material. I will probably carry on both with and without removing the exact duplicates and see where that leaves me. As this project is a pilot for a larger one using more samples, and hopefully better library prep methods, I hope to make use of these experiences. Would you then recommend avoiding cDNA amplification by using e.g. the Nextera protocol, which needs less starting material? Or something else?

      Many thanks,
      Marcus

      • #4
        I have been thinking about your problem and really need more information. It isn't the percent duplication that bothers me -- RNA-seq projects can show a high percentage due to a low "sequence space" (i.e., the total number of nucleotides in a transcriptome is relatively small compared to how many reads we can produce these days) -- as much as the detection of over-represented sequences. That usually indicates to me that a bottleneck in the laboratory procedure left only a few molecules available for sequencing pre-amplification. Your first sequence -- CCA...GAG -- has over 260K copies in the final pool, so why was that one molecule amplified so much? The usual answer is that there was a small pool of molecules to begin with. On the other hand, the overall percentages are rather similar (0.15% to 0.24%). I am not sure what FastQC's cutoff is, but 0.15% would not be surprising.

        So perhaps the better explanation is this one:

        "Every sequence in a library occurs a large number of times ... If a library has very limited diversity is then subjected to high throughput sequencing then it’s possible that every sequence seen in the library is likely to be seen many times. ... The root cause of this type of result is low diversity in the original library. "

        It would be interesting to see how highly duplicated the rest of your sequences are. In other words, are those 5 sequences the only over-represented ones? Or is there a bunch of sequences below 0.15% but above, oh say, 0.05%? And if so, how many different sequences in total make up 50% of your sequenced reads?


        To pound away at the point: if there are 5 sequences at 0.15%+ of the reads and 1,000,000 sequences at 0.001%, then this is troublesome but no worse than many transcriptome projects. However, if there are 50 sequences at 0.10%+ and those 50 sequences comprise 50%+ of the total reads, then that would indicate a bottleneck occurred.


        I do not think that GenomiPhi is at fault here. Rather, you had a limited number of original molecules to start with.

        But I could be completely wrong.

        • #5
          Note:

          Taking the sequences

          #Sequence Count Percentage Possible Source
          CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG 269367 0.24784156137009358 No Hit
          ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG 192996 0.17757345917719164 No Hit
          ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT 174958 0.16097689729695483 No Hit
          TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT 164622 0.15146685939950902 No Hit
          TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC 163800 0.15071054640108597 No Hit

          converting to fasta ...
          >x1
          CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG
          >x2
          ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG
          >x3
          ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT
          >x4
          TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT
          >x5
          TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC

          running on-line blat against hg19 ( http://genome.ucsc.edu/cgi-bin/hgBlat ) gets these results ...


          Code:
          BLAT Search Results
          
             ACTIONS      QUERY           SCORE START  END QSIZE IDENTITY CHRO STRAND  START    END      SPAN
          ---------------------------------------------------------------------------------------------------
          browser details x1                23    16    38    50 100.0%    11   -   77378493  77378515     23
          browser details x3                50     1    50    50 100.0%     2   +  154270955 154271004     50
          browser details x3                27    23    50    50 100.0%    12   -   71040725  71040822     98
          browser details x3                20    27    50    50  91.7%     2   +  115802770 115802793     24
          browser details x4                31     9    40    50 100.0%     4   +   62444206  62745451 301246
          browser details x4                22    20    41    50 100.0%     5   +  120809005 120809026     22
          browser details x4                21    12    32    50 100.0%     3   +  145927464 145927484     21
          browser details x4                20    11    30    50 100.0%     5   -  111939399 111939418     20
          browser details x5                50     1    50    50 100.0%     2   +   20472928  20472977     50
          2 of the 5 reads (x3 and x5) hit the human genome exactly, full-length at 100% identity.

          I'm stumped.
          Last edited by Richard Finney; 06-22-2012, 01:21 PM.

          • #6
            For my sake, try the following on the sample you gave. Use your FastA data (which I will call 'file.fa'):

            grep -v '>' file.fa | sort | uniq -c | sort -rn > file.count

            The above takes your FastA file and sorts and counts the unique reads -- it assumes each read's sequence is on a single line.

            perl -nale '$count++; $total+=$F[0]; print "Total reads: $total\nDifferent reads: $count"' file.count

            The above counts the reads. It prints running totals line by line, so the last two lines give the final numbers.

            Take the final 'Total reads' number from the above and put it in as XXXXXX:

            perl -nale '$count++; $n50+=$F[0]; exit if $n50>(XXXXXX/2); END{print "N50 count: $count"}' file.count

            The above gives us an N50-style number: how many of the most-duplicated distinct reads account for half of the total reads.

            There are many ways to do the above (one consolidated sketch follows the list below), but what I hope to get out of this is:

            Total reads
            Count of different reads
            How many of the different reads represent 50% of the total reads
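            For instance, a one-pass awk sketch over file.count that prints all three (it relies on file.count being sorted in descending order, which the pipeline above guarantees):

            Code:
            awk '{ total += $1; count[NR] = $1 }
                 END {
                   print "Total reads:     " total
                   print "Different reads: " NR
                   for (i = 1; i <= NR; i++) {
                     run += count[i]
                     if (run > total / 2) { print "N50 count:       " i; exit }
                   }
                 }' file.count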


            BTW: FastQC does stop reporting overrepresented sequences at 0.1% of the total reads -- or at least close to that cutoff; I did not test it exactly.

            • #7
              Ok Rick - this is what I did and what I got; let me know if I misunderstood anything:

              grep -v '>' file.fa | sort | uniq -c | sort -rn > file.count
              $ head file.count
              115796 CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAGGAGTTAAACCAAGCAAGTTATGACGGCATGCCGTCATAACTTGCTTGGTTT
              98240 AGGAGAATCAATTTAGTCTGATACATTTCAAATAAATGAATCAGACAGAGGAACCCCCTAAAACCTTTATAGAAGAGGAGGCCGAAAAAATATCTATTAAA
              91530 TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTCAA
              83481 ATTGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTC
              82107 ATGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGACTATGCAAAGTATAACATCTGGCTACCAGGCATTTAATGTTAAATGCCTGG
              75899 CCAGGAGAATCAATTTAGTCTGATACATTTCAAATAAATGAATCAGACAGAGGAACCCCCTAAAACCTTTATAGAAGAGGAGGCCGAAAAAATATCTATTA
              74966 ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAGCGATGATGAGCTAGCTAAAAAGGCTATCGAGGAGTTAAACCAAGCAAGTTA
              74472 TTGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTTGCAGTGATGGGAGTCCTTACTTTAGGATCAGTTCAGCCTGTTATGGCTCA
              74225 GTATGCGGATGGATTGATATTTATGGAATCACAATTCCAAACTACTCAGAAATATCAATCGGTAGATCCGGATATCATGGATGAAGCTATTGCGGTAGGCT
              73870 GGTTTAGGTCATGGGCTGATCACCCTAGAGAAAATATTAAAGAGAATGATGGATCCATGTTCAGTTGGAGTCCAGCTTCGTACTACAAATGAGTGCCATAA


              $ perl -nale '$count++; $total+=$F[0]; print "Total reads: $total\nDifferent reads: $count"' file.count >t.count
              $ head t.count
              Total reads: 115796
              Different reads: 1
              Total reads: 214036
              Different reads: 2
              Total reads: 305566
              Different reads: 3
              Total reads: 389047
              Different reads: 4
              Total reads: 471154
              Different reads: 5

              $ tail t.count
              Total reads: 83991911
              Different reads: 20859301
              Total reads: 83991912
              Different reads: 20859302
              Total reads: 83991913
              Different reads: 20859303
              Total reads: 83991914
              Different reads: 20859304
              Total reads: 83991915
              Different reads: 20859305

              $ perl -nale '$count++; $n50+=$F[0]; exit if $n50>(83991915/2); END{print "N50 count: $count"}' t.count
              N50 count: 41718610

              In case you're interested in the other fastqc reports you can get them here:

              The sample above is the quality-filtered version (fastq_quality_filter -i 1_1.fastq -q 20 -p 80 -Q 33 | fastx_artifacts_filter -Q33 | fastq_quality_trimmer -Q33 -l 50 -t 20) of the sample from my first post.

              Hope this helps to clarify things.
              Thanks
              Marcus

              • #8
                Hi Richard,

                These meta-transcriptomics samples are from an environment where a majority of cells actually are human. I guess this explains the many hg19 hits.

                Marcus

                • #9
                  Hi Richard,

                  Please let me know if you need any more info to figure this one out - I really appreciate your help on this!

                  Also, if I were to remove all exact duplicates, how would I go about it while retaining the paired-end info? If I, for example, use fastx_collapser, that info is lost. I need it to be able to a) align properly against reference genomes using bowtie2, and b) assemble the meta-transcriptomes using e.g. Velvet/Oases.

                  Thanks for any help on this,
                  Marcus

                  • #10
                    Well ... I'm not a wet lab kinda guy .. more of a computer guy.

                    Is this bacterially infected human tissue (as in colitis or Crohn's)?

                    It's RNA-seq, so reads should hit assembled exons, human and otherwise, not arbitrary human DNA. And it's not necessarily surprising that bacterial or viral sequence matches human DNA (BLAT a bacterial or viral genome against human and you'll get lots of good matches; some pathogens work by manipulating host gene expression).

                    You can run these against NCBI BLAST ( http://blast.ncbi.nlm.nih.gov , using the (wgs) Whole-genome shotgun contigs database ) by hand, and they hit bacteria; a scripted version is sketched below.
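                    The same search can be scripted with BLAST+ (a sketch, not tested here; it queries the nt database remotely rather than the wgs set mentioned above, and overrep.fa is a placeholder for a FASTA of the overrepresented sequences):

                    Code:
                    # Send the sequences to NCBI's servers; tabular output (-outfmt 6)
                    # lists the top hits for each query.
                    blastn -remote -db nt -query overrep.fa -outfmt 6 -max_target_seqs 5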

                    Having a lot of recurrent sequences is not necessarily a crime. It could be real (i.e. the bacteria are rampantly expressing that sequence). It could be that you're getting "amplification" from your wet-lab process, which is often dealt with: you can throw the dupes away, particularly if you're looking for mutations. If you're measuring expression, you've got more of a problem.
                    _____

                    The rest of this post is pure tangential speculation and may provoke an insight ... or could be nonsense ...

                    I grabbed the "head file.count" data from your 6-23-2012 post and counted the 6-, 7- and 8-mers (a sketch for reproducing the count follows the list below).

                    TTAATTAA (reverse compliment is palindromic) shows up 6 times on 3 reads.

                    What is this? Is it a restriction enzyme site? I'm clueless. I don't know this stuff. Google leads me here: http://www.neb.com/nebecomm/products/productR0547.asp

                    The highly recurring 6-, 7- and 8-mers are:
                    GAATCA 5 <--- whoo ooooh... the movie ... (well it's close)
                    AATTAA 6
                    AATTAAT 6
                    AATTAATT 6
                    ATTAATTA 6
                    TAATTA 6
                    TAATTAA 6
                    TAATTAAT 6
                    TGTTAT 6
                    TTAATTA 6
                    TTAATTAA 6
                    ATTAAT 9
                    ATTAATT 9
                    TTAATT 9
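                    If anyone wants to reproduce the count, a rough awk sketch (reads.txt is a placeholder file with one sequence per line, e.g. the ten reads from the "head file.count" output):

                    Code:
                    # Count every 8-mer across the reads and print the most frequent.
                    awk '{ for (i = 1; i <= length($0) - 7; i++)
                             kmer[substr($0, i, 8)]++ }
                         END { for (k in kmer) print kmer[k], k }' reads.txt \
                      | sort -rn | head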

                    • #11
                      Sorry for the delayed response. I have been off of SeqAnswers for about a week. Had tons of messages to wade through when I came back.

                      I am still wondering whether your sample is strange or not. RNA-seq projects by definition have a lot of repetition; however, one would hope that any given transcript is represented by multiple overlapping reads instead of a single read amplified many times, as seems to be the case here, at least for a handful of reads.

                      Sort of thinking out loud here ...

                      You have 84M reads, of which 21M are unique. Some of those unique reads may differ only by sequencing error; in other words, there may be reads sharing the same starting base ("start site") that should be considered non-unique. So 21M is an upper limit.

                      A human transcriptome -- I know, you are mainly bacterial -- is, well, it depends on the organ and environment, but say somewhere around 30M bases minimum, and thus roughly 30M possible read start sites. If you were just sequencing human and each transcript were represented once, then we would expect the 84M reads to be duplicated at around 3X, i.e. 84M divided by 30M. But obviously not every transcript is represented the same number of times, so the distribution will range from, oh say, 3/100 to 3*100 -- a 10,000-fold range. In other words, I would expect the maximum duplication level of any read to be around 300. All very rough numbers (made explicit below), but given that you have duplicates in the upper tens of thousands, my suspicion is that something is not correct.
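                      To make the back-of-the-envelope arithmetic explicit (every input is one of the rough assumptions above, not a measurement):

                      Code:
                      reads=84000000   # total reads in the library
                      sites=30000000   # assumed distinct start sites
                      echo "scale=1; $reads / $sites" | bc   # ~2.8x mean duplication
                      echo "100 * $reads / $sites" | bc      # ~280 plausible maximum
                      # Duplication levels in the tens of thousands, as seen here, are
                      # several hundred times higher than this crude expectation.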

                      Going back to your post: removing duplicates is tricky, since you would also be removing the count information.

                      I still do not have a good feel for just how many reads are duplicated at a high level. Is there an obvious drop-off point when you look at just the counts, e.g. with awk '{print $1}' file.count (uniq -c pads with spaces rather than tabs, so cut -f 1 will not split it)? Doing that should produce:

                      115796
                      98240
                      91530
                      83481
                      82107
                      75899
                      74966
                      74472
                      74225
                      73870
                      ...

                      All of the numbers above are close enough to each other to be considered the same. The question is whether the slope continues slowly downward or whether there is an abrupt drop-off. If there is, then we may have evidence of a PCR duplication event. Perhaps you can toss the numbers into a spreadsheet (or plot them as sketched below) and give us a graph?
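                      Or, instead of a spreadsheet, a quick gnuplot sketch (counts.txt and dupes.png are placeholder names):

                      Code:
                      # Plot rank vs. copy number on log-log axes; a bottleneck should
                      # show up as an abrupt cliff rather than a smooth slope.
                      awk '{print $1}' file.count > counts.txt
                      gnuplot -e 'set terminal png; set output "dupes.png";
                                  set logscale xy; set xlabel "rank"; set ylabel "copies";
                                  plot "counts.txt" using ($0+1):1 with lines notitle'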

                      In the meantime I am going back through the transcriptome projects I have handy in order to run the same type of analysis I am asking of you, so that we have something to compare against.

                      • #12
                        Thanks again, both Richards!
                        Here are those slopes at different scales; there doesn't seem to be any drop-off... Do you see the same?
                        Attached Files

                        • #13
                          If I understand the question and what has been done thus far, it seems that no read-pair information has been used to determine the duplication levels. In contrast to genomic DNA sequencing, you expect overrepresented sequences in transcriptome sequencing. For instance, if a particular transcript is sequenced to a depth over 100 and you are using 100 bp reads, you will see duplicate sequences that are not the result of PCR amplification or optical duplicates from cluster generation but are instead the true signal you are trying to capture. When I look at duplication levels in my RNA-Seq data, I always consider both ends of a pair, because it is much less likely (though there is still some chance, especially for really abundant transcripts) that read pairs with the same start and end coordinates at both ends arise from true signal rather than from PCR/optical duplicates.

                          I have done this in two ways, and am not sure which approach you might want to try (a sketch of each follows the list):
                          1) Concatenate both reads of a pair (in this case 100 bp + 100 bp) into one 200 bp sequence (I modified a script from Galaxy for this purpose) and run FastQC or a similar tool on the concatenated reads to determine the % duplication before alignment
                          2) Align the reads first, then use PicardTools to mark and/or remove duplicates in the resulting BAM file using the paired-end information
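                          Rough sketches of both approaches (all file and index names are placeholders, and the shell one-liner stands in for the Galaxy script mentioned in 1):

                          Code:
                          # Approach 1, shell version: flatten each FASTQ to one line per
                          # read, join the mates side by side, keep one record per unique
                          # sequence *pair*, then split back into two in-sync files.
                          # Note that the per-pair copy numbers are discarded.
                          paste <(paste - - - - < sample1.1.fq) <(paste - - - - < sample1.2.fq) \
                            | sort -t$'\t' -u -k2,2 -k6,6 \
                            | awk -F'\t' '{ print $1"\n"$2"\n"$3"\n"$4 > "dedup.1.fq";
                                            print $5"\n"$6"\n"$7"\n"$8 > "dedup.2.fq" }'

                          # Approach 2: align the pairs, coordinate-sort, then let Picard
                          # MarkDuplicates use both mates' positions to flag duplicates.
                          bowtie2 -x ref_index -1 sample1.1.fq -2 sample1.2.fq \
                            | samtools sort -o sample1.sorted.bam -
                          java -jar picard.jar MarkDuplicates I=sample1.sorted.bam \
                            O=sample1.dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true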
