SEQanswers

Old 11-29-2012, 10:01 AM   #1
SEQnovice
Junior Member
 
Location: Toronto

Join Date: Nov 2012
Posts: 6
Confusion regarding Illumina Adapter Trimming!

Dear Experts,
Please accept my apologies if this has been posted elsewhere. I am new to the analysis of RNA-seq data, and I am confused regarding trimming of my adapters from the FASTQ files using cutadapt. I have read through some of the posts but they have gotten me more confused!
The details of my RNA-seq data are as follows:

- The platform is Illumina, TruSeq
- The FASTQ files are paired-end (so I have an R1.fastq and R2.fastq for each of my samples). It is unknown which of R1 and R2 represents the 'forward' or 'reverse' reads.
- The files have been demultiplexed, so I have a barcode per sample which matches a specific barcode in a corresponding indexed adapter.
- I have been provided with a Universal adapter and 5'-3' indexed adapters. I have checked the indexed adapters and they are all exactly identical except at the 6bp barcode in the middle of the sequence.

Please kindly help me with the following:

1. I am still trying to understand how Illumina TruSeq works but, in principle, should the trimming be done at the 3' end only, or also at the 5' end of the read? Or is it that only the Universal Adapter should be trimmed at the 5' end, and the indexed adapters at the 3' end?

NB1: Read length is 101 bp as observed in FastQC. This was expected from the experimental setup, but it makes me wonder if I have any adapters to begin with.
NB2: I have used FastQC to look at a sample of my data (around 198,000 seqs). I didn't find any overrepresented sequences, but I did find increased 5-mer representation in the first 10 base pairs of my reads (which I am assuming to be the 5' end?). There are also more GC fluctuations in those first 10 bp.

2. What is the minimum overlap that is effective to constitute a 'match' between the adapter and the read? Cutadapt has a default value of 3... but wouldn't that necessarily promote 'false matching' as well and lead to culling of sequences that don't have the adapter? I am considering a higher cutoff for the overlap, say 5 bp, given the k-mer overrepresentations observed in FastQC.

3. When providing the adapter sequences, seeing that the indexed adapters only differ at the barcode, is it still prudent to provide the entire sequence of the indexed adapters, in addition to entire sequence of the universal adapter? What is the bare minimum sequence people have provided for their adapters, both indexed and universal? Does it make a difference?

4. I am assuming that the same indexed 5'-3' adapter is provided when trimming from both the R1 and R2 reads. I have not attempted to trim the reverse complement or the reversed sequence from either R1 or R2. If I am mistaken in this approach please correct me!

My apologies for the multiple questions. Thank you in advance for your help with this!
Much obliged!
SEQNovice

Last edited by SEQnovice; 11-29-2012 at 11:02 AM.
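
Editor's note on question 2: the worry about false matches can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not cutadapt's actual algorithm. It assumes uniformly random bases (each with frequency 0.25), an exact match with no mismatches, and an overlap of exactly the given length; the function names are made up for illustration. Note also that a chance match normally costs a read only its last few bases rather than removing the whole read, unless a minimum-length filter is applied afterwards.

```python
def chance_match_probability(overlap: int, base_freq: float = 0.25) -> float:
    """Probability that the last `overlap` bases of an adapter-free read
    happen to equal the first `overlap` bases of the adapter exactly.
    Assumes independent, uniformly distributed bases (a simplification)."""
    return base_freq ** overlap


def expected_false_trims(n_reads: int, overlap: int) -> float:
    """Expected number of adapter-free reads that lose `overlap` bases by chance."""
    return n_reads * chance_match_probability(overlap)


# 198,000 is the FastQC sample size mentioned in NB2 of the post above.
for k in (3, 5, 10):
    p = chance_match_probability(k)
    n = expected_false_trims(198_000, k)
    print(f"overlap {k}: about 1 read in {round(1 / p):,} "
          f"(~{n:,.1f} of 198,000 reads) trimmed by chance")
```

Under these assumptions, an overlap of 3 shortens roughly 1 read in 64 by three bases, while an overlap of 5 brings that down to about 1 in 1,024. Real reads are not uniformly random, so the true rates will differ.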
Old 11-30-2012, 07:51 AM   #2
SEQnovice
Junior Member
 
Location: Toronto

Join Date: Nov 2012
Posts: 6
Default

My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
Thank you,
SEQNovice
Old 11-30-2012, 08:27 AM   #3
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Quote:
Originally Posted by SEQnovice View Post
My questions have not been answered. Could someone kindly reply to some of them or at least direct me to the proper threads where this may have been discussed? I am new to this field and any feedback would be much appreciated!
Thank you,
SEQNovice
Patience, and searching. Please give your question more than 20 hours before bumping it.
Old 11-30-2012, 09:06 AM   #4
SEQnovice
Junior Member
 
Location: Toronto

Join Date: Nov 2012
Posts: 6
Default

My apologies, this is my first post here! Thanks for the tip, and if you do have any feedback I would appreciate it though!
Old 01-17-2013, 03:17 AM   #5
blanco
Member
 
Location: Iceland

Join Date: Apr 2012
Posts: 28
Default

I am also interested to know the answer to some of these questions.

Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?
Old 01-17-2013, 03:52 AM   #6
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Quote:
Originally Posted by blanco View Post
I am also interested to know the answer to some of these questions.

Perhaps to put it more simply: When trimming paired end reads, should the cutadapt command be exactly the same for both forward and reverse reads?
Using the same command on both reads independently will most likely cause your paired-end files to go out of sync. We have written a small wrapper that calls Cutadapt with (what we think are) sensible parameters (Trim Galore, available here); in its default setting, e.g. trim_galore --paired file1.fq file2.fq, it will trim Illumina adapters from both reads, quality trim reads to a Phred score of 20 and handle paired-end files as you would expect.
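
Editor's note: the "out of sync" problem can be shown with a toy example. The sketch below is not how Cutadapt or Trim Galore are implemented. It uses a deliberately simplified exact-match trimmer, made-up 16 bp reads, a made-up minimum length of 5, and hypothetical function names. The adapter prefix AGATCGGAAGAGC is the 13 bp stretch shared by the standard TruSeq read 1 and read 2 adapters. The point is only that filtering each file on its own can drop a read from one file but keep its mate in the other, whereas filtering pairs together keeps the files aligned.

```python
ADAPTER = "AGATCGGAAGAGC"  # first 13 bases common to both TruSeq 3' adapters


def trim_3prime(seq: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove an exact adapter match: a full copy anywhere in the read,
    or a partial copy (at least `min_overlap` bases) at the 3' end."""
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


def filter_independently(reads, min_len):
    """Trim one file on its own and drop reads that end up too short."""
    trimmed = (trim_3prime(r) for r in reads)
    return [t for t in trimmed if len(t) >= min_len]


def filter_as_pairs(r1_reads, r2_reads, min_len):
    """Trim both mates, then keep or drop each pair as a unit."""
    kept = []
    for r1, r2 in zip(r1_reads, r2_reads):
        t1, t2 = trim_3prime(r1), trim_3prime(r2)
        if len(t1) >= min_len and len(t2) >= min_len:
            kept.append((t1, t2))
    return kept


# Toy data: the second pair carries adapter sequence in both mates.
r1 = ["ACGTACGTACGTACGT", "ACG" + ADAPTER, "TTGCATTGCATTGCAA"]
r2 = ["GGCATGCATGCATGCA", "TTGACCAGT" + ADAPTER[:7], "CCATGGCCATGGCCAT"]

r1_alone = filter_independently(r1, min_len=5)
r2_alone = filter_independently(r2, min_len=5)
print(len(r1_alone), len(r2_alone))  # record counts differ: files out of sync

pairs = filter_as_pairs(r1, r2, min_len=5)
print(len(pairs))  # both mates of the short pair are dropped together
```

After independent filtering, the second record in the R1 output is the mate of the third record in the R2 output, which is exactly what downstream paired-end aligners cannot tolerate.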
Old 01-17-2013, 04:57 AM   #7
blanco
Member
 
Location: Iceland

Join Date: Apr 2012
Posts: 28
Default

Thanks for your quick reply fkrueger - this looks to be something really useful. I have already asked one question in the appropriate thread: http://seqanswers.com/forums/showthr...ht=trim+galore
Old 01-29-2014, 08:03 AM   #8
Fernando Seixas
Junior Member
 
Location: Portugal

Join Date: Oct 2013
Posts: 8
Default

Hi all,

I saw the first post of this thread and realized that I see exactly the same patterns described in point 1, NB2: increased 5-mer representation in the first 10 base pairs, and GC fluctuations in those first 10 bp as well (although very slight; the same happens in the per-base sequence content). Even after adapter trimming with cutadapt at both the 5' and 3' ends and quality trimming (with Trimmomatic), these 'problems' persist. Any ideas of what might be causing this?

Also, and I don't know if this relates to the previous question, the per-sequence GC content is not exactly normally distributed: there is a slight bump on the right side of the distribution.

Thanks!
__________________
Fernando
Old 01-29-2014, 08:11 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

There are several posts here that cover Illumina sequencing and FastQC. Search for "fastqc duplication".

If one of the posts does not answer your question then can you post example plots?

Last edited by GenoMax; 01-30-2014 at 04:36 PM.
Old 01-30-2014, 10:45 AM   #10
Fernando Seixas
Junior Member
 
Location: Portugal

Join Date: Oct 2013
Posts: 8
Default

Thanks for the reply. One thing I forgot to mention is the kind of data I have. It's whole-genome sequencing data from a HiSeq 2000 machine using TruSeq library prep. And if I'm not wrong (my eyes are tired of so much reading xD), all the explanations I found for the behaviours I mentioned above refer to RNA-seq data, at least for the base-content instability in the first 10 bp.

FastQC images of the problematic parameters are attached.

For the k-mer analysis I attached both the 7-mer and 10-mer analyses. I can see a repetitive 7 bp pattern if I align the k-mers (CCTGGCTCCTGGCT), so I looked for all possible 7 bp sequences inside this pattern but still couldn't associate any of them with adapters/primers.
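
Editor's note: the search described above can be scripted. The following is a minimal sketch, assuming the repeat is exactly CCTGGCTCCTGGCT and that the sequences of interest are the two standard TruSeq 3' adapter sequences (the dictionary keys are made-up labels). It lists every distinct 7-mer inside the pattern, adds the reverse complements, and checks each for an exact occurrence in the reference sequences. Other adapter or primer sets could be swapped in.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def windows(seq: str, k: int) -> list:
    """Distinct k-mers of `seq`, in order of first appearance."""
    return list(dict.fromkeys(seq[i:i + k] for i in range(len(seq) - k + 1)))


def kmer_hits(pattern: str, k: int, references: dict) -> list:
    """Exact matches of any k-mer of `pattern`, or its reverse complement,
    in any of the reference sequences. Returns (probe, reference_name) pairs."""
    hits = []
    for kmer in windows(pattern, k):
        for probe in (kmer, reverse_complement(kmer)):
            for name, ref in references.items():
                if probe in ref:
                    hits.append((probe, name))
    return hits


PATTERN = "CCTGGCTCCTGGCT"  # repeat reported in the post above
ADAPTERS = {  # labels are illustrative; sequences are the usual TruSeq 3' adapters
    "truseq_read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "truseq_read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}

print(windows(PATTERN, 7))
print(kmer_hits(PATTERN, 7, ADAPTERS))
```

For these two adapter sequences the hit list comes back empty, which is consistent with the observation that the repeat does not look adapter-derived. An exact-match search like this would miss matches with mismatches or matches to other oligos used in the library prep.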

Thanks!
Attached Images
File Type: png per_base_sequence_content.png (13.8 KB, 66 views)
File Type: png per_base_gc_content.png (9.4 KB, 43 views)
File Type: png per_sequence_gc_content.png (30.1 KB, 33 views)
File Type: png kmer_profiles.png (58.2 KB, 65 views)
File Type: png kmer-10.png (97.4 KB, 48 views)
__________________
Fernando
Old 01-30-2014, 05:22 PM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

The first two plots look ok. Is this a "GC" rich organism? Looks like there is some kind of duplication of sequences. Are the qualities acceptable across the entire read?

Last edited by GenoMax; 01-30-2014 at 05:29 PM.
Old 01-31-2014, 04:09 AM   #12
Fernando Seixas
Junior Member
 
Location: Portugal

Join Date: Oct 2013
Posts: 8
Default

But is it really normal to have those slight fluctuations in the first 10 bp? Regarding the GC content, this data is from a mammalian genome. But even when I removed PCR duplicates these problems persisted.
And yes, the quality scores are good across the entire reads.

Another thing I forgot to mention is that this is PE data.
__________________
Fernando
Old 01-31-2014, 04:26 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Quote:
Originally Posted by Fernando Seixas View Post
But is it really normal to have those slight fluctuations in the first 10 bp?
Yes. Here is a "good" sample example report posted on the FastQC site. http://www.bioinformatics.babraham.a...qc_report.html

Quote:
But even when I removed PCR duplicates these problems persisted.
Another thing I forgot to mention is that this is PE data.
What is the aim of your experiment? Are you trying to do de novo assemblies or is there a closely related genome you can use as a reference?

As Simon (the author of FastQC) has mentioned in past posts here, it is difficult for him to set "limits" for the various FastQC tests that are universally applicable. So a dataset getting a "fail" in one or more FastQC categories does not automatically mean that there is a problem with the sample.

Have you tried doing analysis with the QC'ed data? How do those results look?

Last edited by GenoMax; 01-31-2014 at 04:28 AM.
Old 01-31-2014, 05:08 AM   #14
Fernando Seixas
Junior Member
 
Location: Portugal

Join Date: Oct 2013
Posts: 8
Default

It is for de novo assembly. I understand what you said about the FastQC limits not being universally applicable, but even so, shouldn't I worry about the GC content and k-mer plots?

And no, I'm still stuck at this step because I don't feel confident enough to go on to the next ones.

Thanks!
__________________
Fernando
Old 01-31-2014, 06:01 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Look at it this way. If there is a problem with the sample/library itself (at this point, if the qualities are good, then there is likely no technical issue with the sequencing), you would not be able to do much short of redoing the experiment.

Why not press ahead and give the de novo assembly a try? It may fail, and you would be out some compute cycles/time. Since it is a mammalian genome it is probably large(ish), so you are going to have to deal with a number of other computational challenges. Do you have enough sequence (theoretically) with adequate depth (10-15x or more) for the assembly tests?
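
Editor's note: the theoretical depth asked about here is simply total sequenced bases divided by genome size. The sketch below shows the arithmetic. The read count, read length and genome size are placeholder values chosen for illustration, not figures from this thread, and the function names are made up.

```python
import math


def theoretical_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size.
    For paired-end data, count both mates in `n_reads`."""
    return n_reads * read_length / genome_size


def reads_needed(depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(depth * genome_size / read_length)


# Placeholder numbers: 200 million read pairs of 100 bp, 2.5 Gb genome.
genome_size = 2_500_000_000
n_reads = 2 * 200_000_000
print(f"depth: {theoretical_depth(n_reads, 100, genome_size):.1f}x")
print(f"reads for 15x: {reads_needed(15, 100, genome_size):,}")
```

This is an upper bound on usable depth: adapter and quality trimming, duplicate removal and contaminant reads all reduce the number of bases that actually contribute to the assembly.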

Tags
adapter trimming, barcode sequencing, cutadapt, illumina fastq, rna-seq

All times are GMT -8.