SEQanswers

Old 09-20-2010, 10:14 AM   #1
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default cutadapt: A tool that removes adapter sequences

I'm pleased to announce the tool 'cutadapt', which we have been using in our research group for adapter removal in high-throughput sequencing data. Removing adapter sequences from reads is necessary when the read length of the sequencing machine is longer than the molecule that is sequenced, for example when sequencing small RNAs.

Since special code is included to handle color space data correctly, the tool may be especially useful for people who do not use Applied Biosystems' Corona pipeline.

cutadapt is under the MIT license.

Please see the web page for a feature list and a link to a downloadable package:
http://cutadapt.googlecode.com/
Old 09-21-2010, 12:40 AM   #2
Torst
Senior Member
 
Location: The University of Melbourne, AUSTRALIA

Join Date: Apr 2008
Posts: 275
Default

Quote:
Originally Posted by mmartin
I'm pleased to announce the tool 'cutadapt
http://cutadapt.googlecode.com/

It seems your code only runs under Python 2.6?

For CentOS 5.x, which is a bit behind, I had to install the "python26" packages and change the #!/usr/bin/python to #!/usr/bin/python26.
Old 09-21-2010, 12:54 AM   #3
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Yes, Python 2.6 is needed, thanks for the pointer. It wouldn't be hard to support Python 2.5, but some 2.6 features make the transition to the Python 3 syntax easier, so I would like to stick to it. I have updated the homepage to reflect the requirement of Python 2.6.
Old 11-16-2010, 08:16 PM   #4
HiroMishima
Member
 
Location: Nagasaki, Japan

Join Date: Aug 2009
Posts: 15
Default 3'-end partial match of adapters

Hi,

I have a question about Cutadapt version 0.3.

Does Cutadapt cut partial sequences of adapters?

According to the "Statistics for adapter" messages, Cutadapt seems to recognize partial matches of adapters at the 3' end. However, only fully matched adapter sequences are removed in the output files.
Old 11-17-2010, 03:35 AM   #5
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Yes, cutadapt recognizes partial adapters. That is, if your adapter is ADAPTER and your read is MYSEQUENCEADAP, then the resulting sequence is MYSEQUENCE. In fact, these are some examples of input sequences that will result in MYSEQUENCE:
MYSEQUENCEADAPTER
MYSEQUENCEADAP
MYSEQUENCEADAPTERSOMETHINGELSE

Could you give an example of the problematic read you encounter and the output of cutadapt for that read?
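
The partial-match behavior described above can be illustrated with a minimal Python sketch. This is not cutadapt's actual algorithm: it assumes exact matching only (no mismatches or indels) and no minimum overlap, and the function name `trim_3prime_adapter` is made up for illustration.

```python
def trim_3prime_adapter(read, adapter):
    """Remove an exactly matching 3' adapter (full or partial) from a read.

    Simplified illustration only: no error tolerance, no minimum overlap.
    """
    # Case 1: the full adapter occurs in the read. Everything from its
    # first occurrence onward is removed, including trailing sequence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a proper prefix of the adapter.
    # Try the longest prefix first.
    for k in range(len(adapter) - 1, 0, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


for read in ("MYSEQUENCEADAPTER",
             "MYSEQUENCEADAP",
             "MYSEQUENCEADAPTERSOMETHINGELSE"):
    print(read, "->", trim_3prime_adapter(read, "ADAPTER"))
```

Because this sketch has no minimum overlap, a read that merely ends in "A" would lose its last base; a real trimmer has to guard against such chance matches.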
Old 11-17-2010, 04:32 AM   #6
HiroMishima
Member
 
Location: Nagasaki, Japan

Join Date: Aug 2009
Posts: 15
Default

Quote:
Originally Posted by mmartin
Could you give an example of the problematic read you encounter and the output of cutadapt for that read?

I found that I had used two -a options, and that the adapter sequences I used were almost reverse complements of each other. Perhaps I do not need two -a options in this case. Hopefully these examples clarify the situation.

sample.fastq:
Code:
@read1
GATCCTCCTGGAGCTGGCTGATACCAGTATACCAGTGCTGATTGTTGAATTTCAGGAATTTCTCAAGCTCGGTAGC
+
hhhhhhhhhhahhhhhehhffhghhehdgghhheddggfhfhhgffhddhhfffhhffhfgggffddfdfffcdfb
@read2
CTCGAGAATTCTGGATCCTCTCTTCTGCTACCTTTGGGATTTGCTTGCTCTTGGTTCTCTAGTTCTTGTAGTGGTG
+
hhhhhhhhhhhhhhhhhhhhhhhhhhgghghhhhhhhhgaddeeadaa^dadaa_aaaaababca_aa__^[T^[Z
The next result is OK:
Code:
$ python cutadapt -a CTCGAGAATTCTGGATCCTC sample.fastq

@read1
CTGGAGCTGGCTGATACCAGTATACCAGTGCTGATTGTTGAATTTCAGGAATTTCTCAAGCTCGGTAGC
+
hhhahhhhhehhffhghhehdgghhheddggfhfhhgffhddhhfffhhffhfgggffddfdfffcdfb
@read2
TCTTCTGCTACCTTTGGGATTTGCTTGCTCTTGGTTCTCTAGTTCTTGTAGTGGTG
+
hhhhhhgghghhhhhhhhgaddeeadaa^dadaa_aaaaababca_aa__^[T^[Z
However, in the next result, read1 still contains "GATCCTC" at the 5' end:
Code:
$ python cutadapt -a CTCGAGAATTCTGGATCCTC -a GAGGATCCAGAATTCTCGAGTT sample.fastq

@read1
GATCCTCCTGGAGCTGGCTGATACCAGTATACCAGTGCTGATTGTTGAATTTCAGGAATTTCTCAAGCTCGGTAGC
+
hhhhhhhhhhahhhhhehhffhghhehdgghhheddggfhfhhgffhddhhfffhhffhfgggffddfdfffcdfb
@read2
TCTTCTGCTACCTTTGGGATTTGCTTGCTCTTGGTTCTCTAGTTCTTGTAGTGGTG
+
hhhhhhgghghhhhhhhhgaddeeadaa^dadaa_aaaaababca_aa__^[T^[Z
Old 11-17-2010, 08:51 AM   #7
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Hi, actually, you do have to use two -a options since currently reverse complements are not automatically searched for.

I managed to reproduce the problem you encountered and I have prepared a new release that hopefully fixes it. You can download v0.4 from the homepage and see whether the bug is actually fixed. Thanks for reporting this!
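
Since reverse complements are not searched for automatically, the second adapter has to be derived by hand. A small Python sketch of how that could be done (the helper name `reverse_complement` is an illustration, not part of cutadapt):

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


adapter = "CTCGAGAATTCTGGATCCTC"
print(reverse_complement(adapter))
```

For the first adapter from the example above this gives GAGGATCCAGAATTCTCGAG, which is the second adapter without its trailing "TT". The result can then be passed as a second -a option.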
Old 11-17-2010, 09:09 AM   #8
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Default

I haven't looked into the details of the program, but I wonder how straightforward it would be to use it to filter out and discard entire reads that match an adapter, rather than just removing that part and re-using the trimmed read?
Old 11-17-2010, 10:12 AM   #9
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Since this isn't too hard, I just added that feature. cutadapt now has the option "--discard", which does exactly that: If an adapter is found in the read, then the read is discarded and not trimmed.
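
The difference between trimming and discarding can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (exact, full-length adapter matches; the function `filter_reads` and its `discard` parameter are made up here), not the code behind "--discard":

```python
def filter_reads(reads, adapter, discard=False):
    """Yield processed reads.

    With discard=False, reads containing the adapter are trimmed.
    With discard=True, reads containing the adapter are dropped entirely.
    Reads without the adapter are always passed through unchanged.
    """
    for read in reads:
        pos = read.find(adapter)
        if pos == -1:
            yield read          # no adapter: keep the read as it is
        elif not discard:
            yield read[:pos]    # adapter found: keep the trimmed read
        # adapter found and discard=True: the read is skipped


reads = ["MYSEQUENCEADAPTER", "OTHERSEQUENCE"]
print(list(filter_reads(reads, "ADAPTER")))
print(list(filter_reads(reads, "ADAPTER", discard=True)))
```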
Old 11-17-2010, 05:30 PM   #10
HiroMishima
Member
 
Location: Nagasaki, Japan

Join Date: Aug 2009
Posts: 15
Default

Quote:
Originally Posted by mmartin
Hi, actually, you do have to use two -a options since currently reverse complements are not automatically searched for.

I managed to reproduce the problem you encountered and I have prepared a new release that hopefully fixes it. You can download v0.4 from the homepage and see whether the bug is actually fixed. Thanks for reporting this!

Everything's perfect! cutadapt 0.5.1 worked well with two -a options.

I believe that cutadapt is one of the best adapter sequence trimmers, especially in terms of simplicity and speed.

Thanks again for the prompt update.
Old 11-18-2010, 05:24 AM   #11
sdavis
Member
 
Location: Maryland

Join Date: Jan 2010
Posts: 14
Default

This looks like a very useful tool. Could I suggest that you accept gzipped FASTQ files as an alternative input format, as a simple convenience?
Old 11-18-2010, 08:57 AM   #12
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Good idea. Since this was on my to do list as well, I have just implemented this feature and released cutadapt 0.6.
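
For readers who want to do something similar in their own scripts, here is a rough Python 3 sketch of reading plain or gzipped FASTQ transparently. It is not how cutadapt does it; the function names are invented, gzip is detected via its two magic bytes, and the parser assumes the simple four-lines-per-record layout used in the examples in this thread.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file for text reading, whether gzipped or not."""
    # gzip files start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(path):
    """Yield (name, sequence, qualities) tuples from a FASTQ file.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    with open_fastq(path) as f:
        while True:
            header = f.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = f.readline().rstrip("\n")
            f.readline()  # skip the '+' separator line
            qual = f.readline().rstrip("\n")
            yield header[1:], seq, qual  # drop the leading '@'
```

Calling `read_fastq("sample.fastq")` or `read_fastq("sample.fastq.gz")` would then yield the same records either way.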
Old 11-18-2010, 11:35 AM   #13
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

cool, that was fast!
Old 11-18-2010, 11:51 AM   #14
thinkRNA
Member
 
Location: Carlsbad,CA

Join Date: Jan 2010
Posts: 94
Default

Can you please add an option to remove all N's or C's, etc.? I think this would be helpful. Also, can you describe in detail how the error rate is calculated?
Old 11-18-2010, 12:21 PM   #15
gaffa
Member
 
Location: Gothenburg/Uppsala, Sweden

Join Date: Oct 2010
Posts: 82
Default

Quote:
Originally Posted by mmartin
Since this isn't too hard, I just added that feature. cutadapt now has the option "--discard", which does exactly that: If an adapter is found in the read, then the read is discarded and not trimmed.

Fantastic!
Old 12-02-2010, 02:38 AM   #16
hash
Junior Member
 
Location: London

Join Date: May 2010
Posts: 3
Default Great tool...

Hi,

This is a great tool, by the way. I was hoping someone had already implemented exactly this, as I was scratching my head as to how to do it myself. Thanks a lot.

May I make a couple of suggestions in terms of functionality? Would it be possible to add a feature that keeps trimmed reads only if they are above a certain length? For example, if this parameter were set to 20, the original sequence length were 36 and a 17 bp adapter were trimmed, then the sequence would not be included in the output because only 19 bp would remain. The default could be 0.

Also, it would be useful to track which sequence was trimmed as adapter and where in the original sequence it was found. An optional dump .fastq file might help here: it would contain the trimmed adapter sequence plus information on where in the original sequence it was found and how many mismatches were allowed. For example, if a sequence is 36 bp and the adapter is found at positions 1 to 15 with 0 mismatches, you could append this information to the '+' line of the FASTQ entry as 1_15_0; the other fields of the entry, such as the '@' line, would stay the same. With the adapter .fastq output it would then be possible to parse the adapter sequence as required.

These are just suggestions by the way. I can see this tool becoming very useful to me and I have already introduced it to all of the bioinformaticians in my lab.

Look forward to reading your response.
Old 12-02-2010, 02:46 AM   #17
hash
Junior Member
 
Location: London

Join Date: May 2010
Posts: 3
Default Error rate...

Please correct me if I am wrong, but I have played around with cutadapt and this is what I understood about the error rate. The number of permitted mismatches can be calculated by multiplying the error rate by the length of the adapter found and rounding down. For example, if you set the error rate to 0.1 (the default) and you find an adapter sequence of length 10 bp, then 1 mismatch is permitted. However, if the adapter sequence found is 9 bp or shorter, then 0 mismatches are permitted at the same error rate, since 0.9 rounds down to 0. I am not sure how this applies to insertions and deletions, but I found it to be the case with mismatches.

Hope that helps.
Old 12-02-2010, 01:11 PM   #18
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

Hello and thanks for the feedback! Your suggestions are very reasonable. In fact, discarding reads that are too short after trimming is already on my to do list. I'll hopefully be able to implement that very soon (it's simple but I need to find the time). The annotated FASTQ file has a bit lower priority for me, but I have added your suggestion to the bugtracker so I won't forget it.

Your observation regarding the error rate is correct. The error rate is calculated over the part of the adapter that is actually found in the sequence. Also, when there are insertions or deletions, these count as one error, but currently any gap within the adapter increases its length by one. I'm thinking about changing that behavior. I'll also add a better explanation to the README file.
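
As a worked example of the rule as discussed in this thread (errors allowed = error rate times the length of the matched adapter part, rounded down), here is a small Python sketch. The function name `allowed_errors` is made up, and the sketch ignores the special handling of gaps mentioned above.

```python
def allowed_errors(match_length, error_rate=0.1):
    """Number of errors tolerated for a match of the given length.

    Follows the rule discussed in the thread: multiply the error rate
    by the length of the adapter part that was found and round down.
    """
    return int(match_length * error_rate)


# With the default error rate of 0.1, the first mismatch is tolerated
# only once at least 10 bases of the adapter are matched.
for length in (5, 9, 10, 19, 20):
    print(length, allowed_errors(length))
```

So a short partial adapter at the end of a read must match exactly, while a full-length 20 bp adapter may contain up to two errors.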
Old 12-03-2010, 04:57 AM   #19
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

I have added the option "--minimum-length" (or simply -m) to cutadapt. Download version 0.7 to get the feature, or retrieve the source from Subversion.
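
The effect of such a length filter can be sketched in Python using the numbers from the suggestion earlier in the thread (36 bp read, 17 bp adapter, threshold 20). The function name is hypothetical, and treating the threshold as inclusive (reads of exactly the minimum length are kept) is an assumption; see the README for the exact behavior.

```python
def passes_minimum_length(trimmed_read, minimum_length=0):
    """Return True if the trimmed read is long enough to be kept.

    Assumption: the threshold is inclusive, so a read whose length
    equals minimum_length is kept.
    """
    return len(trimmed_read) >= minimum_length


# A 36 bp read made of a 19 bp insert followed by a 17 bp adapter.
read = "A" * 19 + "C" * 17
trimmed = read[:-17]  # remove the 17 bp adapter

print(len(trimmed))                          # 19
print(passes_minimum_length(trimmed, 20))    # False: read is discarded
print(passes_minimum_length(trimmed))        # True: default of 0 keeps all
```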
Old 12-09-2010, 02:46 AM   #20
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 75
Default

I have released cutadapt version 0.8. Important changes are:
  • The default behavior now is to assume that an adapter has been ligated to the 3' end. This should be the correct behavior for at least the SOLiD small RNA protocol (SREK) and also for the Illumina protocol. See the README for details.
  • A different scoring function improves trimming: Some reads that should have been trimmed weren't.
  • 20% faster on my test data set

Tags
adapter trimming, color space, microrna, mirna, solid
