SEQanswers

09-15-2015, 08:22 PM   #1
Saeideh
Member
 
Location: Iran

Join Date: Aug 2015
Posts: 25
Sequence Duplication Levels failure

Hiii

Good [morning | afternoon | evening | night]

I used FastQC to check the quality of my data. At first, four modules failed: Per base sequence content, Per base GC content, Per sequence GC content, and Sequence duplication levels. I noticed that most of the errors were due to the first 9 bases, so I trimmed them with Trimmomatic. After that, two modules still fail: Per sequence GC content and Sequence duplication levels.
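
As an illustration of what trimming the first 9 bases does to each read, here is a minimal Python sketch. It is comparable in effect to Trimmomatic's HEADCROP step, although the post does not say which Trimmomatic step was actually used; the function name `head_crop` and the example read are hypothetical.

```python
def head_crop(fastq_lines, n=9):
    """Yield FASTQ lines with the first n bases removed from every record.

    Assumes the standard 4-line FASTQ layout (header, sequence, '+', qualities).
    The quality string is cropped by the same amount so both stay the same length.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield from (header, seq[n:], plus, qual[n:])
            record = []


# Hypothetical 12-base read, cropped by 9 bases.
example = ["@read1", "ACGTACGTACGT", "+", "IIIIIIIIIIII"]
for out_line in head_crop(example, n=9):
    print(out_line)
```

Every base removed this way shortens the read, so the crop length is a trade-off between removing biased positions and keeping enough sequence to map.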

For Per sequence GC content, the GC content is higher than normal.

For Sequence duplication levels, the graph rises after 9.
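
To make the duplication plot concrete: the idea is to count how many copies of each distinct sequence exist, then report how the distinct sequences are spread over duplication levels, pooling the high levels into a final bin. The sketch below is a simplified illustration and not FastQC's own calculation, which may differ in details such as subsampling and binning; `duplication_levels` and its `cap` parameter are hypothetical names.

```python
from collections import Counter


def duplication_levels(sequences, cap=10):
    """Return {duplication level: percent of distinct sequences}.

    Levels at or above `cap` are pooled into the `cap` bin, similar in spirit
    to a final "10+" bin on a duplication plot.
    """
    copies = Counter(sequences)  # distinct sequence -> number of copies
    levels = Counter(min(count, cap) for count in copies.values())
    distinct = len(copies)
    return {level: 100.0 * n / distinct for level, n in sorted(levels.items())}


# Hypothetical toy data: one sequence seen 12 times, one twice, two once.
toy_reads = ["AAAA"] * 12 + ["CCCC"] * 2 + ["GGGG", "TTTT"]
print(duplication_levels(toy_reads))
```

A rise in the last bin means that some sequences occur many times, which is what a few highly abundant transcripts would produce.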

(1) What should I do about these failures? Are they due to contamination?

By the way, my "Sequence duplication levels" plot has only one red line and no blue line. (2) Why is that? Is it related to the version? My FastQC is version v0.10.1.

I attached both results in a pdf file.

(3) I know Trimmomatic removes noise, but how much can I trim my sequences without affecting my downstream analysis? (Of course I could cut a 90 bp read down to 20 bp, but the result would not be reliable for further analysis, for example for measuring differential gene expression with Cufflinks.) So what is the limit for trimming?

I am sorry for asking so many questions.

Thank you in advance for your help.
Attached Files
File Type: pdf Duplication&GC.pdf (333.7 KB, 29 views)
09-15-2015, 09:16 PM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 824

FastQC frequently worries people when there's no need to worry, and doesn't always point out the things that are most important. I've got a few questions:
  • Are these RNA reads?
  • What is the expected GC fraction of your target genome?
  • How much DNA was present in the sample?
  • Have spike-ins (e.g. ERCC, lambda) been used?
  • What are the overrepresented sequences?

In a best-case scenario, the double peak in the GC graph and the over-represented sequences could be explained by a spike-in taking up a large proportion of the reads, which would happen if the DNA hadn't been accurately quantified. Alternatively, a targeted sequencing of multiple genes might produce a similar effect.
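
The GC graph discussed here is a histogram of per-read GC percentage; a mixture of two sources with different GC content tends to show up as two peaks. A minimal sketch of that histogram, assuming the reads are already available as plain strings (the function names are hypothetical, and FastQC's own binning may differ):

```python
from collections import Counter


def gc_percent(seq):
    """Percent of G and C among the called bases (A, C, G, T) of one read.

    Returns None when the read has no called bases (e.g. all N).
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences):
    """Count reads per rounded GC percentage, sorted by percentage."""
    hist = Counter()
    for seq in sequences:
        pct = gc_percent(seq)
        if pct is not None:
            hist[round(pct)] += 1
    return dict(sorted(hist.items()))


# Hypothetical toy reads, for illustration only.
print(gc_histogram(["ATGC", "GGCC", "AATT", "ATGC"]))
```

Comparing the position of each peak with the expected GC fraction of the target genome is one way to judge which peak belongs to the organism of interest.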
09-15-2015, 09:32 PM   #3
Saeideh

These are cDNA reads (made from RNA).
I don't know the expected GC fraction of the target genome (the data belong to someone else, and my job is to analyze and improve them).
No spike-ins were used.
There are three overrepresented sequences:
  1. CGCTCGCCGCTACTACGGGAATCGCTTTTGCTTTCTTTTCCTCTGGCTAC
  2. GATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAATGC
  3. TGGATACCTAGGTACCCAGAGACGAGGAAGGGCGTAGCAAGCGACGAAAT
09-15-2015, 10:09 PM   #4
gringer

Well, a BLAST of all those sequences returns 100% identity matches to chloroplast genomes (probably rice).

My guess is that what you're seeing here is cDNA reads that haven't been properly depleted for high-abundance transcripts, so there is a large amount of contaminant sequences in the data. My ball-park assumption from looking at the GC graph would be that there is about 30% chloroplast sequence in there.

If at all possible, I'd recommend that your collaborator re-sequences these samples including a RiboZero preparation:

http://www.illumina.com/products/rib...val-plant.html

Otherwise, run a mapping only to the chloroplast sequence of the target (e.g. Oryza sativa) and exclude those sequences (e.g. HISAT2 has "--un-conc" and "--un" options for doing precisely that), then re-run FastQC to see if it changes things. Even with that 30% contamination (assuming it's expected), you still should get reasonable results.
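
A rough sketch of that filtering step, written as a Python helper that only builds the HISAT2 command line. It assumes a HISAT2 index of the chloroplast genome has already been built (for example with hisat2-build) and that the alignments themselves can be discarded; the function name, file names, and index name are hypothetical.

```python
import shlex


def hisat2_filter_command(index_base, reads1, reads2=None,
                          unaligned_out="non_chloroplast.fastq", threads=1):
    """Build a HISAT2 command that keeps the reads NOT aligning to the index.

    Single-end input uses -U with --un; paired-end input uses -1/-2 with
    --un-conc (pairs that fail to align concordantly).
    """
    cmd = ["hisat2", "-p", str(threads), "-x", index_base]
    if reads2 is None:
        cmd += ["-U", reads1, "--un", unaligned_out]
    else:
        cmd += ["-1", reads1, "-2", reads2, "--un-conc", unaligned_out]
    # Assumption: the SAM alignments are not needed, so they are discarded
    # (Unix-only path).
    cmd += ["-S", "/dev/null"]
    return cmd


# Hypothetical file and index names, for illustration only.
cmd = hisat2_filter_command("rice_chloroplast", "sample_R1.fastq", "sample_R2.fastq")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True). The reads written by --un or --un-conc are the ones that did not align to the chloroplast index, so those are the files to feed back into FastQC; the HISAT2 manual describes how the paired output files are named.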
09-16-2015, 02:04 AM   #5
Saeideh

Your answer surprised me. Yes, the sample is rice, Oryza sativa. The way you found the source of the contamination got me excited. Smart answer!

So now I should find the rice chloroplast sequence and exclude it from the reads, but I don't know how to do that with HISAT as you mentioned. I have to learn it first.

Thank you~Thank you~Thank you
09-16-2015, 02:30 AM   #6
gringer

Quote:
Originally Posted by Saeideh
And the way you found the source of contamination made me excited.
Yes, BLAST is very useful. I'm glad that NCBI still provides a service for "where is this sequence from", despite all the newer locally-faster search tools that are available.

Quote:
I don't know how to do it with HISAT as you mentioned. I have to learn it first.
Learning HISAT2 would be a good idea, as it's the latest in a new generation of ultra-fast mappers, and has almost identical command-line parameters to Bowtie2. Another option would be STAR, which has a really great manual and might be easier to pick up and use as a naive high-throughput sequencing bioinformatician.

All times are GMT -8.