SEQanswers

Old 06-05-2011, 11:51 AM   #1
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default Trimming left end (5') of reads??

Can anyone explain why there is a sequence bias in the first 15bp of Illumina reads? I am pretty sure this is not adapter leftover. The researchers who sequenced the lettuce transcriptome identified the same issue, with results at:
http://atgc-illumina.googlecode.com/...AKozik_V09.pdf
We saw the same bias in the first 15bp of our reads. I think I read somewhere that it's caused by GC content. Even after removing low- and medium-quality reads, we still see the bias in the first 10-15nt. Can anyone explain?
Old 06-06-2011, 07:36 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147
Default

Short answer: the random hexamer priming is "not so random". Illumina has acknowledged this in one of their FAQs:

Quote:
Q482. Why is GC high in the first few bases?
It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.
There was also a publication which investigated this:

Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 2010 Apr.;
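If you want to see the effect in your own data rather than in the IVC plots, a per-cycle base composition table is easy to compute. The following is only a minimal sketch, assuming plain-text FASTQ with four lines per record; the function names and the 15-cycle default are my own choices, not part of any Illumina tool.

```python
from collections import Counter


def base_composition(reads, n_positions=15):
    """Fraction of A/C/G/T at each of the first n_positions cycles.

    Other symbols (e.g. N) count towards the total, so the four
    fractions at a position may sum to slightly less than 1.
    """
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    table = []
    for position_counts in counts:
        total = sum(position_counts.values())
        table.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return table


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (assumed unwrapped)."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()
```

With random priming bias you would expect the fractions at the first 10 to 12 positions to swing noticeably before settling to roughly constant values, e.g. `base_composition(read_fastq_sequences("reads.fastq"))`, where the file name is a placeholder.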

Last edited by kmcarr; 04-12-2013 at 12:35 PM. Reason: Hyperlink reference
Old 06-06-2011, 10:57 AM   #3
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though there is a preference for certain reads from the "random" priming. The researchers who sequenced the lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor quality regions in the 5' end??
Old 06-06-2011, 11:29 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147
Default

Quote:
Originally Posted by blindtiger454 View Post
Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though there is a preference for certain reads from the "random" priming. The researchers who sequenced the lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor quality regions in the 5' end??
I have carefully studied the UC Davis poster in the past, and what strikes me is that the effect of trimming the 5' end appears nearly identical to that of trimming the 3' end, so I'm not convinced by their conclusion that it is important to trim the initial 15nt. However, I have heard from other researchers that these bases do present a particular problem for de novo assembly with de Bruijn graph assemblers (which covers just about all of the most popular short read assemblers, including Velvet). The thinking is that the k-mer diversity of the first 15nt is significantly lower than that of the remainder of the read, which seems to cause problems for the assembler.

If you are doing a de novo assembly why not give it a try both ways and see what your results are?

On the other hand if I am mapping the reads to a genome (vs de novo) I never trim the 5' ends of RNA-Seq reads and I find they map perfectly well.
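One rough way to check the k-mer diversity argument on your own reads is to compare the share of distinct k-mers in the first 15nt with that of an equally long window further into the read. This is only a sketch: the window positions, the default k of 6, and the function name are illustrative assumptions, and it is not how any assembler measures diversity.

```python
def window_kmer_diversity(reads, start, length, k=6):
    """Distinct k-mers divided by total k-mers inside read[start:start+length]."""
    seen = set()
    total = 0
    for read in reads:
        window = read[start:start + length].upper()
        for i in range(len(window) - k + 1):
            seen.add(window[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0


def compare_windows(reads, prefix_len=15, interior_start=20, k=6):
    """Return (diversity of the 5' prefix, diversity of an interior window)."""
    reads = list(reads)  # allow generators, since we iterate twice
    return (
        window_kmer_diversity(reads, 0, prefix_len, k),
        window_kmer_diversity(reads, interior_start, prefix_len, k),
    )
```

Using windows of equal length keeps the two ratios comparable, since the ratio also depends on how many k-mers are sampled. A clearly lower value for the prefix would be consistent with the explanation above, but it does not by itself show that trimming helps the assembly.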
Old 06-06-2011, 08:15 PM   #5
blindtiger454
Member
 
Location: Omaha, NE

Join Date: Oct 2010
Posts: 30
Default

Thanks for the information. Our reads are 55bp, and they are from a tetraploid plant. Given the large number of paralogues and the allelic diversity in plants, I want to do minimal trimming for the assembly. It's bad enough having 55bp. The UC Davis folks had 80bp reads. If I trimmed my reads down to 40bp, I'm afraid the assembler would incorrectly assemble paralogues. Sometimes 15 nucleotides is all the difference between 2 closely related transcripts/genes.
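To put a number on that cost: a read of length L contributes L - k + 1 k-mers to a de Bruijn graph assembler. A small sketch of the arithmetic, where the k values are assumptions chosen only for illustration:

```python
def kmers_per_read(read_length, k):
    """Number of k-mers a single read of the given length contributes."""
    return max(read_length - k + 1, 0)


# 55bp reads versus the same reads with 15nt trimmed from the 5' end.
# k = 21 and k = 31 are illustrative values, not a recommendation.
for k in (21, 31):
    print(k, kmers_per_read(55, k), kmers_per_read(40, k))
```

With k = 31, for example, the trimmed 40bp read yields 10 k-mers instead of 25.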
Old 07-06-2012, 02:42 AM   #6
IBseq
Member
 
Location: uk

Join Date: Jul 2012
Posts: 56
Default FASTQ Trimmer tool

hi guys,
I'm new to this forum. Can anyone tell me how to work out how many bases I should trim with FASTQ Trimmer? What is the ideal score, and which values do I have to look at (Q1, median or Q3)?

Thanks!
Old 10-10-2012, 09:50 AM   #7
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

bump
Old 10-10-2012, 10:29 AM   #8
IBseq
Member
 
Location: uk

Join Date: Jul 2012
Posts: 56
Default

I sorted that out...if anyone needs info glad to help
Old 10-18-2012, 04:23 AM   #9
blanco
Member
 
Location: Iceland

Join Date: Apr 2012
Posts: 28
Default

Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

When I view the 'per base sequence content' with fastQC I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached pdf.

Now this is all fine and dandy but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the read is too short so at the end of the read the sequencer starts to sequence the adapter.

So why does the adapter appear at the beginning of the read and not at the end?

Am I misunderstanding something? I would love to have a clarification of this.

Thanks,
blanco
Attached Files
File Type: pdf adapter_contaminations.pdf (84.0 KB, 573 views)
Old 10-18-2012, 04:54 AM   #10
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298
Default

Quote:
Originally Posted by blanco View Post
Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

When I view the 'per base sequence content' with fastQC I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached pdf.

Now this is all fine and dandy but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the read is too short so at the end of the read the sequencer starts to sequence the adapter.

So why does the adapter appear at the beginning of the read and not at the end?


Am I misunderstanding something? I would love to have a clarification of this.

Thanks,
blanco
You can get adapter-dimer (where the DNA insert size is effectively 0), meaning that you only sequence adapter (hence it appears at the 5' end). If this is the case, I believe using cutadapt will just remove those reads from your fastq file (maybe someone can confirm).
Those peaks don't look like dimer to me, more like the random priming issue. When you get bad adapter contamination, you can actually read the adapter sequence in your %base graph (see attached plot of a run that had 10% adapter dimer).
Attached Images
File Type: png adpater-dimer.png (79.9 KB, 388 views)
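A quick way to tell the two cases apart is to count how many reads actually begin with the adapter. The sketch below is only an illustration: the function name and the one-mismatch tolerance are my own choices, and the adapter prefix in the usage note is an assumption that should be checked against the documentation for your kit.

```python
def fraction_starting_with(reads, adapter_prefix, max_mismatches=1):
    """Fraction of reads whose 5' end matches adapter_prefix.

    Reads shorter than the prefix are counted in the total but never as hits.
    """
    adapter_prefix = adapter_prefix.upper()
    n = len(adapter_prefix)
    hits = 0
    total = 0
    for read in reads:
        total += 1
        head = read[:n].upper()
        if len(head) < n:
            continue
        mismatches = sum(1 for a, b in zip(head, adapter_prefix) if a != b)
        if mismatches <= max_mismatches:
            hits += 1
    return hits / total if total else 0.0
```

For example, `fraction_starting_with(reads, "AGATCGGAAGAGC")` uses a sequence commonly given as the start of the TruSeq adapter (an assumption here). A value of several percent would point towards adapter-dimer; a value near zero alongside a skewed first 10 to 15 cycles is more consistent with the priming bias.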
Old 03-27-2013, 05:34 PM   #11
rmred
Junior Member
 
Location: Malaysia

Join Date: Mar 2013
Posts: 1
Default

I got the same problem too, with exactly the same ACGT bias over the first 15bp/cycles. I asked the Illumina representative, and they said that this is due to the random hexamer priming, as mentioned above.
Old 06-25-2013, 09:28 AM   #12
isett
Junior Member
 
Location: GA

Join Date: Nov 2012
Posts: 1
Default

What if it's WGS and not RNA-Seq? I see the same thing with the NexteraXT kit on the MiSeq. Is it a non-random recognition site for the Tagmentation enzyme?
Old 08-05-2013, 10:05 AM   #13
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Default Hi IBseq

Quote:
Originally Posted by IBseq View Post
I sorted that out...if anyone needs info glad to help
I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.
Old 09-23-2013, 05:12 AM   #14
Tengfei Liu
Junior Member
 
Location: China

Join Date: Aug 2013
Posts: 1
Default

Quote:
Originally Posted by nareshvasani View Post
I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.

You can use cutadapt to trim both the 5' and 3' ends. The fastx_clipper can only trim the 3' end. With cutadapt, you must first run cutadapt -g, then run cutadapt -a on the processed sequences. If you use -g and -a at the same time, it will only cut one end.
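As a sketch of the two-pass approach, the snippet below only builds the two cutadapt command lines and prints them. The adapter sequences and file names are placeholders I made up for illustration; `-g`, `-a` and `-o` are the cutadapt options for a 5' adapter, a 3' adapter and the output file.

```python
import shlex


def two_pass_cutadapt(five_prime, three_prime, infile, midfile, outfile):
    """Build the commands for a 5' pass (-g) followed by a 3' pass (-a)."""
    first = ["cutadapt", "-g", five_prime, "-o", midfile, infile]
    second = ["cutadapt", "-a", three_prime, "-o", outfile, midfile]
    return first, second


# Placeholder adapters and file names; substitute your own.
commands = two_pass_cutadapt(
    "ADAPTER5", "ADAPTER3", "reads.fastq", "tmp.fastq", "trimmed.fastq"
)
for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```

To actually run them, each list can be passed to `subprocess.run(command, check=True)`, assuming cutadapt is installed and on the PATH.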
Old 09-25-2013, 07:04 AM   #15
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 121
Default

Quote:
Originally Posted by nareshvasani View Post
I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.
I always use the fastx_trimmer; you can use the -f and -l options to set the first and the last base to be kept.
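For anyone unsure what those two options keep, here is a small Python sketch of the same idea: positions are 1-based and inclusive, and the quality string has to be cut identically. The function name is mine, and this only mimics the described behaviour; it is not fastx_trimmer's code.

```python
def keep_range(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities.

    last=None keeps everything up to the end of the read.
    """
    if first < 1:
        raise ValueError("first must be >= 1")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


# Drop the first 2 and the last 2 bases of a 10bp read: keep bases 3..8.
print(keep_range("ACGTACGTAC", "IIIIIHHHHH", first=3, last=8))
```

So to drop the first 15 bases of a 100bp read and keep the rest, the first base to keep would be 16.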

Last edited by Michael.Ante; 09-25-2013 at 07:23 AM. Reason: typo
Old 09-25-2013, 07:59 AM   #16
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Default

Thanks a bunch.


Quote:
Originally Posted by Tengfei Liu View Post
You can use cutadapt to trim both the 5' and 3' ends. The fastx_clipper can only trim the 3' end. With cutadapt, you must first run cutadapt -g, then run cutadapt -a on the processed sequences. If you use -g and -a at the same time, it will only cut one end.
Old 09-25-2013, 07:59 AM   #17
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Smile

I did the same way.

Thanks for feedback.

Quote:
Originally Posted by Michael.Ante View Post
I always use the fastx_trimmer; you can use the -f and -l options to set the first and the last base to be kept.
Old 06-24-2014, 06:23 PM   #18
Oyster_lab
Junior Member
 
Location: Australia

Join Date: Jan 2014
Posts: 8
Default

So, just following up on this topic, which has been incredibly helpful: we shouldn't trim the first bases at the 5' end, and should try to perform the de novo assembly that way, correct?

Thanks!
Old 06-24-2014, 06:41 PM   #19
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

It depends on the library prep. Illumina fragment libraries typically have adapters on the right (3') end, so if you trimmed to the left from the adapter you'd lose all of your genomic sequence. For long mate pair libraries, the answer depends on the protocol.
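To illustrate the direction of trimming for a fragment library, here is a minimal sketch that keeps everything to the left of a 3' adapter. It uses exact matching only, and the function name and `min_overlap` default are assumptions for illustration; real trimmers also tolerate mismatches.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter and everything after it, keeping the 5' side."""
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # The read may end partway through the adapter, so also check whether
    # its tail matches the start of the adapter (longest overlap first).
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read
```

Note that a read which starts with the adapter (the adapter-dimer case mentioned earlier in the thread) comes back empty, which is why such reads effectively disappear after trimming.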
Old 06-24-2014, 06:47 PM   #20
Oyster_lab
Junior Member
 
Location: Australia

Join Date: Jan 2014
Posts: 8
Default

Thanks for your reply, Brian.
I have mRNA Illumina 100bp paired end reads. I have already removed the adapters, but still see the same high variation in GC% at the 5' end. TruSeq mRNA prep was used for the library prep, which is why I am guessing my dataset has the same 5' end bias described earlier. Any thoughts?