Trimming left end (5') of reads??


blindtiger454
06-05-2011, 11:51 AM
Can anyone explain why there is a sequence bias in the first 15bp of Illumina reads? I am pretty sure this is not an adapter leftover. The researchers who did the lettuce transcriptome identified the same issue, with results at:
http://atgc-illumina.googlecode.com/files/PAG_2010_AKozik_V09.pdf
And we saw the same bias in the first 15bp of our reads also. I think I read somewhere that it's caused by GC content. Even after removing low & medium quality reads, we still see the bias in the first 10-15nt. Can anyone explain?

kmcarr
06-06-2011, 07:36 AM
Short answer, the random hexamer priming is "not so random". Illumina has acknowledged this in one of their FAQs:

Q482. Why is GC high in the first few bases?
It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.

There was also a publication which investigated this:

Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 2010 Apr (http://nar.oxfordjournals.org/content/38/12/e131).
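
To see the bias described in the FAQ in your own data, you can tabulate the base composition at each of the first cycles. Below is a minimal Python sketch, assuming uncompressed 4-line FASTQ records; the function names and the toy reads are made up for illustration and are not part of any Illumina tool.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def base_composition(seqs, n_positions=20):
    """Return one dict per cycle with the fraction of A, C, G and T.

    Only the first n_positions cycles are tabulated. Other symbols
    (such as N) count towards the total but are not reported.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return fractions


if __name__ == "__main__":
    # Toy records for illustration only; replace with open("reads.fastq").
    toy = ["@r1", "GCGTACGT", "+", "IIIIIIII",
           "@r2", "GCATACGA", "+", "IIIIIIII"]
    comp = base_composition(read_fastq_sequences(toy), n_positions=8)
    for cycle, frac in enumerate(comp, start=1):
        print(cycle, frac)
```

With real mRNA-Seq reads, the fractions for roughly the first 12 cycles would be expected to swing away from the flat profile seen further into the read.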

blindtiger454
06-06-2011, 10:57 AM
Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though there is a preference for certain reads from the "random" priming. The researchers who did the lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor quality regions in the 5' end??

kmcarr
06-06-2011, 11:29 AM
I have carefully studied the UC Davis poster in the past, and what strikes me is that the effect of trimming the 5' end appears nearly identical to that of trimming the 3' end, so I'm not convinced of their conclusion that it is important to trim the initial 15nt. However, I have heard from other researchers that these bases do present a particular problem for de novo assembly with de Bruijn graph assemblers (which is just about all of the most popular short read assemblers, including Velvet). The thinking is that the k-mer diversity of the first 15nt is significantly lower than that of the remainder of the read, which seems to cause problems for the assembler.

If you are doing a de novo assembly why not give it a try both ways and see what your results are?

On the other hand if I am mapping the reads to a genome (vs de novo) I never trim the 5' ends of RNA-Seq reads and I find they map perfectly well.
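
One rough way to check the k-mer diversity argument on your own reads is to compare how many distinct k-mers occur at the start of the reads versus at an interior offset. This is only a sketch: the function name and toy reads are invented for illustration, and the distinct fraction is a crude measure that also depends on coverage and on k, so compare offsets within the same data set rather than reading the absolute value.

```python
def distinct_kmer_fraction(seqs, start, k):
    """Fraction of distinct k-mers among the k-mers taken at offset `start`.

    Reads too short to contain a full k-mer at that offset are skipped.
    """
    kmers = [s[start:start + k] for s in seqs if len(s) >= start + k]
    if not kmers:
        return 0.0
    return len(set(kmers)) / len(kmers)


if __name__ == "__main__":
    # Toy reads for illustration only: identical starts, varied interiors.
    reads = ["AAAACGT", "AAAAGGT", "AAAATTT"]
    print("5' end  :", distinct_kmer_fraction(reads, start=0, k=4))
    print("interior:", distinct_kmer_fraction(reads, start=3, k=4))
```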

blindtiger454
06-06-2011, 08:15 PM
Thanks for the information. Our reads are 55bp, and they are from a tetraploid plant. Given the large number of paralogues and the allelic diversity in plants, I want to do minimal trimming for the assembly. It's bad enough having 55bp. The UC Davis folks had 80bp reads. If I trimmed my reads down to 40bp, I'm afraid the assembler will incorrectly assemble paralogues. Sometimes 15 nucleotides is all the difference between 2 closely related transcripts/genes.

IBseq
07-06-2012, 02:42 AM
hi guys,
I'm new to this forum... can anyone tell me how I know how many bases I should trim with FASTQ Trimmer? What is the ideal score, and which values do I have to look at (Q1, median or Q3)?

Thanks!

carmeyeii
10-10-2012, 09:50 AM
bump :) :):)

IBseq
10-10-2012, 10:29 AM
I sorted that out...if anyone needs info glad to help

blanco
10-18-2012, 04:23 AM
Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

When I view the 'per base sequence content' with fastQC I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached pdf.

Now this is all fine and dandy but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the read is too short so at the end of the read the sequencer starts to sequence the adapter.

So why does the adapter appear at the beginning of the read and not at the end?

Am I misunderstanding something? I would love to have a clarification of this.

Thanks,
blanco

TonyBrooks
10-18-2012, 04:54 AM
You can get adapter-dimer (where the DNA insert size is effectively 0), meaning that you only sequence adapter (hence it appears at the 5' end). If this is the case, I believe using cutadapt will just remove those reads from your fastq file (maybe someone can confirm).
Those peaks don't look like dimer to me, more the random priming issue. When you get bad adapter, you can actually read the adapter sequence in your %base graph (see attached plot of a run that had 10% adapter dimer).
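
A quick way to tell the two cases apart is to count how many reads begin with the adapter itself. The Python sketch below assumes exact prefix matching and takes the adapter as a parameter; the function name is made up, and the sequence in the example is the commonly cited start of the TruSeq adapter, which you should check against the documentation for your own kit.

```python
def adapter_start_fraction(seqs, adapter, min_len=12):
    """Fraction of reads whose first bases exactly match the adapter start.

    Only the first min_len bases of the adapter are compared, so reads
    that run from the adapter into other sequence still count.
    """
    seqs = list(seqs)
    if not seqs:
        return 0.0
    prefix = adapter[:min_len]
    hits = sum(1 for s in seqs if s.startswith(prefix))
    return hits / len(seqs)


if __name__ == "__main__":
    # Assumed adapter start; verify against your library prep kit.
    adapter = "AGATCGGAAGAGC"
    # Toy reads for illustration only.
    reads = [
        "AGATCGGAAGAGCACACGTCTGAACTCC",  # looks like adapter dimer
        "GCGTACGTTAGCATCGATCGATCGATCG",
        "TTGACCGATAGCTAGCTAGGATCCATGC",
        "CCATGGATCGATTAGCATCGATCGGATC",
    ]
    print(adapter_start_fraction(reads, adapter))
```

A fraction near zero would point towards the random priming pattern rather than adapter dimer.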

rmred
03-27-2013, 05:34 PM
I got the same problem too and see exactly the same ACGT bias for the first 15 bp/cycles. I asked the Illumina representative, and they said this is due to the random hexamer priming, as mentioned above.

isett
06-25-2013, 09:28 AM
What if it's WGS and not RNA-Seq? I see the same thing with the Nextera XT kit on the MiSeq. Is it a non-random recognition site for the tagmentation enzyme?

nareshvasani
08-05-2013, 10:05 AM
IBseq wrote: "I sorted that out...if anyone needs info glad to help"

I need help. Can you please help me to trim both ends 5' and 3'?

Thanks in advance.

Tengfei Liu
09-23-2013, 05:12 AM
You can use cutadapt to trim both the 5' and 3' ends. The fastx_clipper can only trim the 3' end. When you use cutadapt, you must run cutadapt -g first, and then run cutadapt -a on the processed sequences. If you use -g and -a at the same time, it will only cut one end.
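
To make the order of the two passes concrete, here is a simplified Python illustration of the idea: first drop the 5' adapter and everything before it, then drop the 3' adapter and everything after it. This is not cutadapt's algorithm (cutadapt tolerates mismatches and partial matches; this sketch only does exact matching), and the adapter sequences and function names are hypothetical.

```python
def trim_5prime(seq, qual, adapter):
    """Remove an exact 5' adapter match and everything before it."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    cut = idx + len(adapter)
    return seq[cut:], qual[cut:]


def trim_3prime(seq, qual, adapter):
    """Remove an exact 3' adapter match and everything after it."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


if __name__ == "__main__":
    # Hypothetical adapters and read, for illustration only.
    five_prime, three_prime = "GGGTTT", "AAACCC"
    seq = "GGGTTTACGTACGTAAACCC"
    qual = "I" * len(seq)

    # Pass 1: 5' adapter. Pass 2: 3' adapter, applied to the output of pass 1.
    seq, qual = trim_5prime(seq, qual, five_prime)
    seq, qual = trim_3prime(seq, qual, three_prime)
    print(seq, qual)
```

The sequence and quality strings are always cut at the same positions so the FASTQ record stays consistent.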

Michael.Ante
09-25-2013, 07:04 AM
I always use the fastx_trimmer; you can use the -f and -l options to set the first and the last base to be kept.
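
For anyone unsure what "first and last base to be kept" means in practice, the following Python sketch mirrors that behaviour with 1-based, inclusive positions. It is an illustration of the coordinate convention only, not the fastx_trimmer source, and the function name is made up.

```python
def trim_fixed(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive), like -f and -l.

    With last=None the read is kept through to its end.
    """
    if first < 1:
        raise ValueError("first must be >= 1")
    start = first - 1  # convert the 1-based position to a 0-based slice start
    return seq[start:last], qual[start:last]


if __name__ == "__main__":
    # Toy read for illustration only: drop the first 2 and last 2 bases.
    seq, qual = "ACGTACGTAC", "IIIIIIIIII"
    print(trim_fixed(seq, qual, first=3, last=8))
```

For the 5' bias discussed earlier in the thread, dropping the first 15 bases would correspond to first=16 with last left unset.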

nareshvasani
09-25-2013, 07:59 AM
Thanks a bunch.
:)


nareshvasani
09-25-2013, 07:59 AM
I did it the same way.

Thanks for feedback.
