
**westerman** · 09-06-2012, 06:15 AM

Originally posted by JQL

It seems to me that the duplication will affect the accurate counts of the transcripts.

To some extent you would expect duplication in a transcriptome (or even small genome) project. It depends on your sequencing coverage and the size of the transcriptome/genome.

As a thought experiment, let's say that the size of your transcriptome is 100,000,000 bases. That means that at best you can have 100M unique sequences. If you sequence 200M bases (cheap to do!) then you would expect a 2x duplication level.

All sorts of caveats plus 'and-also's in the above but the general idea is that with modern sequencing it is quite easy to overwhelm the uniqueness of reads and start picking up duplicates.
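A rough sketch of the thought experiment above, under assumptions that are not in the original post: the 200M figure is read as 200M reads, each read is identified by its start position, and start positions are sampled uniformly at random. The function names are illustrative only.

```python
import math

def expected_unique(positions: int, reads: int) -> float:
    """Expected number of distinct start positions hit when `reads` reads
    are drawn uniformly at random from `positions` possible starts.
    Uses the approximation N * (1 - exp(-R / N))."""
    return positions * -math.expm1(-reads / positions)

def duplication_level(positions: int, reads: int) -> float:
    """Average number of reads per distinct start position that was hit."""
    return reads / expected_unique(positions, reads)

positions = 100_000_000  # assumed transcriptome size in bases
for reads in (50_000_000, 100_000_000, 200_000_000):
    level = duplication_level(positions, reads)
    print(f"{reads:>11,} reads -> duplication level {level:.2f}x")
```

Under these assumptions, 200M reads give a duplication level of about 2.3x rather than exactly 2x, because random sampling leaves some positions unsampled. Real transcriptomes are far from uniform (a few transcripts dominate), so observed duplication is usually higher than this model suggests.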

**GenoMax** · 09-06-2012, 06:27 AM

Originally posted by JQL

What software can do this duplicate removal? I checked out FASTX, but it doesn't seem to have that functionality. Suggestions?

thanks!

PRINSEQ (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) can do removal of duplicate (or n-plicate) sequences.
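To make concrete what exact-duplicate removal means, here is a minimal sketch. It is not PRINSEQ's implementation; it assumes a plain four-line-per-record FASTQ layout, keeps the first occurrence of each sequence, and the helper names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)

def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual

def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    for rec in records:
        if rec[1] not in seen:
            seen.add(rec[1])
            yield rec

demo = ["@r1", "ACGT", "+", "IIII",
        "@r2", "ACGT", "+", "IIII",
        "@r3", "TTGA", "+", "IIII"]
kept = list(drop_exact_duplicates(parse_fastq(demo)))
print([header for header, _, _ in kept])
```

Holding every distinct sequence in a set costs memory proportional to the number of unique reads, which is worth keeping in mind for tens of millions of reads.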

**JQL** · 09-06-2012, 07:03 AM

Thanks for your thoughts. I think I would agree with you. I would probably leave the duplicates alone then.

**JQL** · 09-06-2012, 07:11 AM

Thanks, GenoMax, for the link.
I may experiment a bit: remove the duplicates, rerun FastQC, and see what happens.

**NRP** · 09-06-2012, 07:28 AM

I just went through this myself with some recent transcriptome data that FastQC showed to be highly redundant.
Like westerman said it depends on what you are trying to do, but if you are going to use the data for an assembly I'd suggest looking into the digital normalization procedure. This will reduce the amount of redundant data you feed into the assembler and make assembly much more efficient. Of course if you are trying to analyze for differential expression you will ultimately need to retain all of the duplicates.
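The idea behind digital normalization can be sketched roughly as follows: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. This is a simplified illustration, not the actual tool; it uses an exact in-memory counter, and the values of `k` and `cutoff` are arbitrary assumptions.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List

def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads: Iterable[str], k: int = 20, cutoff: int = 20) -> List[str]:
    """Keep a read only while its median k-mer count is below `cutoff`."""
    counts: Counter = Counter()
    kept: List[str] = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k, cannot be evaluated
        if median(counts[km] for km in ks) < cutoff:
            counts.update(ks)  # only kept reads contribute to the counts
            kept.append(read)
    return kept

reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
normalized = digital_normalize(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(normalized), "reads kept")
```

Because highly covered regions are capped while rare ones are retained, the output no longer reflects abundance, which is why it suits assembly but not differential expression.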

**JQL** · 09-06-2012, 11:26 AM

I am currently only interested in differential expression.

thanks for sharing your thoughts.

**JQL** · 09-06-2012, 12:58 PM

Another related question:

While I agree it is probably better to leave the duplicated sequences alone for a differential expression study, there are also some over-represented sequences (ORS) in my samples. In the FastQC report, some of the top ORS are shown to be adapter sequences; others are shown to have no hits. They probably don't account for a large percentage of the duplicated sequences (5% maybe?). Do you guys remove those adapter sequences?

**NRP** · 09-07-2012, 05:18 AM

Yes, I had that issue as well. I think it is best to trim those. I used Trim Galore for that and it worked quite well.

**DZhang** · 09-07-2012, 05:40 AM

Hi,

I just want to add that we also need to consider the potential sources of the duplication. Is it due to high coverage or to PCR amplification during library prep? It is never clear-cut, but you need to assess which one is dominant, as they have different impacts on certain quantitation studies.

Best regards,
Douglas

https://www.contigexpress.com

**JQL** · 09-07-2012, 08:42 AM

I have looked into fastx clipper which is supposed to trim the adapter sequence. But I have also read some earlier posts here that suggested that fastx clipper didn't work well. http://seqanswers.com/forums/showthr...=fastx+clipper

In my case, FastQC suggests that 4.7% of reads (out of 4M sampled) contain the adapter sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC". But after running fastx_clipper with option -C to remove the above 51-base adapter sequence, I lost 4,752,644 reads. I have a total of ~23M reads, so that's about 20% of them. It seems either I have done something wrong or the program still has bugs. Any suggestions?

I haven't tried trim galore yet.

**NRP** · 09-07-2012, 09:09 AM

I've never tried fastx clipper, but in trim galore you can specify the sequence to trim & adjust the match stringency so that might help.
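As an illustration of what a match-stringency setting controls, the sketch below trims a 3' adapter and requires a minimum overlap between the end of the read and the start of the adapter. It is a simplified stand-in, not Trim Galore's algorithm: matching is exact (no mismatches allowed), and `trim_adapter` and `min_overlap` are hypothetical names.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from `read` using exact matching only.

    A full adapter anywhere in the read is cut off together with everything
    after it. Otherwise, a partial adapter at the very end of the read is
    removed only if at least `min_overlap` bases match.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Try the longest possible partial match first.
    longest = min(len(read), len(adapter) - 1)
    for n in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read

adapter = "GATCGGAAGAGC"       # first 12 bases of the adapter quoted in the thread
read = "ACGTACGTTTGATCG"       # ends with a 5-base adapter prefix
print(trim_adapter(read, adapter, min_overlap=3))
print(trim_adapter(read, adapter, min_overlap=6))
print(trim_adapter("ACGTTG", adapter, min_overlap=1))
```

With a minimum overlap of 1, any read that happens to end in `G` loses a base, so a very permissive setting trims many reads that carry no adapter at all; a stricter setting trades that for missing short adapter remnants.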

**JQL** · 09-07-2012, 12:56 PM

grep -c ADAPTER found about 1M reads containing the adapter, which is about 4.4%, consistent with the FastQC report. I'm not sure how fastx clipper found and removed 4.7M adapter sequences.

I guess Trim Galore is the better option.
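One possible contributor to the gap, which is not confirmed anywhere in this thread, is that an exact search for the full 51-base adapter only counts reads holding the whole adapter, while a clipper may also act on reads containing just part of it. A minimal way to check this on a set of reads is sketched below; `count_adapter_hits` and `prefix_len` are hypothetical names and the 12-base prefix length is an arbitrary choice.

```python
from typing import Iterable, Tuple

ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC"

def count_adapter_hits(seqs: Iterable[str], adapter: str,
                       prefix_len: int = 12) -> Tuple[int, int]:
    """Return (reads with the full adapter, reads with its first
    `prefix_len` bases). The second count always includes the first."""
    seqs = list(seqs)
    full = sum(adapter in s for s in seqs)
    partial = sum(adapter[:prefix_len] in s for s in seqs)
    return full, partial

demo_reads = [
    "AAAA" + ADAPTER,        # whole adapter present
    "CCCC" + ADAPTER[:20],   # read ends partway through the adapter
    "ACGTACGT",              # no adapter
]
full, partial = count_adapter_hits(demo_reads, ADAPTER)
print(f"full adapter: {full}, adapter prefix: {partial}")
```

If the prefix count on the real data is much larger than the full-adapter count, partial adapters at read ends would explain at least some of the extra reads; whether it accounts for the whole difference would still need checking against the clipper's own report.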


duplicated reads in fastQC
