SEQanswers

Old 11-25-2010, 05:42 AM   #1
Mark
Member
 
Location: Raleigh, NC

Join Date: Nov 2008
Posts: 48
Default FASTXtoolkit adapter trimming

Hi All

I recently downloaded the FASTX toolkit and tried to use it to trim adapter sequences from fastq reads. This did not work: the tool simply discarded any reads containing adapter sequences, though according to the documentation that is not its function. I wrote to the help contact for the tool but received no response (see below for details). Has anyone used this tool successfully for this purpose?

Thanks for your help

Mark

#############################################
Hello

I recently downloaded the FASTX toolkit (fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2) and attempted to use the fastx_clipper tool. I created a test fastq file (3 of the four sequences contain the default adapter CCTTAAGG):

@test1
CCTTAAGGAAAAAAAAAAGGGGGGGGGG
+test1
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test2
CCTTAAGGAAAAAAAAAGGGGGGGGGGG
+test2
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test4
CCTTAAGGTTGACGTGATCGACACCTGG
+test4
[[[[[[[[[[[[[[[[[[[[[[[[[[[[

And then executed the command (as shown on FASTX toolkit website)

-bash-3.2$ fastx_clipper -v -i test.fastq -a CCTTAAGG
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
Clipping Adapter: CCTTAAGG
Min. Length: 5
Input: 4 reads.
Output: 1 reads.
discarded 0 too-short reads.
discarded 3 adapter-only reads.
discarded 0 N reads.

As you can see, the three reads that contain the adapter are discarded as “adapter-only reads”, which (in my way of looking at things) they are not, nor would they be too short (default <=5) after any trimming. What is going on here? Does this tool actually trim reads, or does it only discard reads in which the adapter is found? If the former, would you please tell me what I am doing incorrectly, and whether it is possible to supply the tool with multiple adapters to trim?

Thanks for your help

Mark
Old 11-25-2010, 06:17 AM   #2
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

I can't help you with the FASTX toolkit, but here is how to do it with Biopieces (www.biopieces.org).


Code:
read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAAGGGGGGGGGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test1
---
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAGGGGGGGGGGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test2
---
SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
ADAPTOR_POS: -1
SEQ_LEN: 28
SEQ_NAME: test3
---
SCORES: [[[[[[[[[[[[[[[[[[[[
SEQ: TTGACGTGATCGACACCTGG
ADAPTOR_POS: 0
SEQ_LEN: 20
SEQ_NAME: test4
---

Use grab to get the entries that were trimmed and finally use write_fastq to create a new file:

Code:
read_fastq -i test.fastq | remove_adaptor -a CCTTAAGG -r before | grab -e 'ADAPTOR_POS>=0' | write_fastq -o test_trimmed.fastq -x

Cheers,


Martin

Last edited by maasha; 11-25-2010 at 06:24 AM.
Old 11-25-2010, 06:23 AM   #3
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Oh, and if you want to trim multiple adaptors either process the fastq file several times or just use remove_adaptor multiple times:

Code:
read_fastq -i test.fastq |
remove_adaptor -a CCTTAAGG -r before |
remove_adaptor -a GACACCTGG -r after

SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAAGGGGGGGGGG
SEQ_NAME: test1
SEQ_LEN: 20
ADAPTOR_POS: -1
---
SCORES: HHHHHHHHHHHHHHHHHHHH
SEQ: AAAAAAAAAGGGGGGGGGGG
SEQ_NAME: test2
SEQ_LEN: 20
ADAPTOR_POS: -1
---
SCORES: HHHHHHHHHHHHHHHHHHHHHHHHHHHH
SEQ: AGAGAGAGAGAGAGAGAGAGAGAGAGAG
SEQ_NAME: test3
SEQ_LEN: 28
ADAPTOR_POS: -1
---
SCORES: [[[[[[[[[[[
SEQ: TTGACGTGATC
SEQ_NAME: test4
SEQ_LEN: 11
ADAPTOR_POS: 11
---


M

Last edited by maasha; 11-25-2010 at 07:09 AM.
Old 11-25-2010, 07:22 AM   #4
gghl
Junior Member
 
Location: NY

Join Date: Aug 2010
Posts: 1
Default

Hi Mark,

Based on my understanding, fastx_clipper first finds the adaptor sequence you give and then trims off the adaptor and any nucleotide sequence after it. I think fastx_clipper is designed for removing an adaptor that comes after the insert sequence. This is why, in your test fastq file, reads test1, test2 and test4 were considered adaptor-only reads.

I think if what you want is to remove a 5' end adaptor in front of the insert sequences, fastx_trimmer might be able to help.
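
To make the difference concrete, here is a minimal Python sketch of the two behaviours, using reads from the test file in the first post. It is an illustration only, not fastx_clipper's or fastx_trimmer's actual code: the function names clip_3prime and clip_5prime are hypothetical, and exact matching with no mismatches is an assumption.

```python
ADAPTER = "CCTTAAGG"

def clip_3prime(read, adapter):
    """Drop the adapter and everything after it (3'-adapter behaviour)."""
    pos = read.find(adapter)
    return read if pos < 0 else read[:pos]

def clip_5prime(read, adapter):
    """Drop the adapter only when it sits at the very start of the read."""
    return read[len(adapter):] if read.startswith(adapter) else read

# Reads copied from the test file in the first post.
reads = {
    "test1": "CCTTAAGGAAAAAAAAAAGGGGGGGGGG",
    "test3": "AGAGAGAGAGAGAGAGAGAGAGAGAGAG",
    "test4": "CCTTAAGGTTGACGTGATCGACACCTGG",
}

for name, seq in reads.items():
    print(name, repr(clip_3prime(seq, ADAPTER)), repr(clip_5prime(seq, ADAPTER)))
```

With these inputs the 3' version leaves nothing of test1 and test4, which corresponds to what fastx_clipper reported as adapter-only reads, while the 5' version keeps the 20 bases that follow the adapter.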

Best wishes,
gghl
Old 04-15-2011, 01:27 PM   #5
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

We rewrote a lot of fastx's toolkit stuff, and posted it here: https://code.google.com/p/ea-utils/. It attempts to do things like adapter removal, trimming, etc... without as much configuration by detecting presence of adapters located in a common file.
Old 04-16-2011, 08:40 AM   #6
Mark
Member
 
Location: Raleigh, NC

Join Date: Nov 2008
Posts: 48
Default

Thanks I'll give it a try

I noticed at your site a tool for stitching pe reads called fastq-join. It doesn't appear to be available yet. When will it be?
Old 04-16-2011, 12:01 PM   #7
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

You can just grab the code... it's POSIX C++ and should compile easily:

http://code.google.com/p/ea-utils/so...r/fastq-join.c

g++ -O3 fastq-join.c -o fastq-join
Old 04-21-2011, 01:20 PM   #8
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

Note: I made a change recently to properly use the "better quality base" in the overlapping region... there was a bug in it that someone pointed out. If you're using it, you'll want the newer version.
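
For readers unfamiliar with the idea, here is a minimal Python sketch of "keep the better quality base" in an overlap. It is not fastq-join's code: the function name merge_overlap is hypothetical, and it assumes the second read has already been reverse-complemented and aligned so that both overlap segments have the same length and the same quality offset.

```python
def merge_overlap(bases1, quals1, bases2, quals2):
    """Merge two aligned overlap segments, keeping the higher-quality base.

    Quality characters are compared directly; within one offset
    (Phred+33 or Phred+64) a higher character means a higher quality.
    Ties keep the base from the first read.
    """
    if not (len(bases1) == len(quals1) == len(bases2) == len(quals2)):
        raise ValueError("overlap segments must all have the same length")
    merged_bases, merged_quals = [], []
    for b1, q1, b2, q2 in zip(bases1, quals1, bases2, quals2):
        if q2 > q1:
            merged_bases.append(b2)
            merged_quals.append(q2)
        else:
            merged_bases.append(b1)
            merged_quals.append(q1)
    return "".join(merged_bases), "".join(merged_quals)

# The reads disagree at the third base; the low-quality call ('#') loses.
print(merge_overlap("ACGT", "II#I", "ACCT", "IIII"))
```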
Old 06-13-2011, 04:25 PM   #9
angelawu
Member
 
Location: Stanford, CA

Join Date: Feb 2010
Posts: 12
Default Question about fastq-mcf

Hi,

I encountered an issue when using fastq-mcf on my GA2-generated 1x36 reads, and was wondering if you could shed some light.

I made a fasta file containing all the TruSeq adapter sequences and ran fastq-mcf with that file, with the -P Phred scale set to 33 because my files are in Sanger fastq format. All other parameters were left at their defaults.

After trimming was completed, the output reported removing about 10 million reads out of 24 million.

I ran the trimmed file through FastQC, and under the "over-represented sequences" tab I see that partial adapter sequences (e.g. starting from bp #2) are still over-represented in my file, which suggests that they were not trimmed.

My question is, does fastq-mcf remove partial matches to adapter sequences provided, as well as full? If so, am I doing something wrong with the way I am using the tool?

I am pretty new to bioinformatics, so sorry if this is a stupid question...

Thank you!
Old 06-14-2011, 07:41 AM   #10
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

1. It does remove partial matches. It searches only from one "end" of the file. The default settings are very conservative, so if it's removing 10 million reads, that's an enormous number - you may want to change the settings to be more aggressive for that data.

2. Can you post the summary output? It should say why sequences were removed or clipped.

3. Until very recently, GAIIs output Phred+64 quality scores by default, not Phred+33, so you may want to double-check that.
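
One quick way to double-check the offset is to look at the lowest quality character in the file. The sketch below is a common heuristic, not part of fastq-mcf; the function name guess_phred_offset is hypothetical. It relies on the fact that characters below ';' (ASCII 59) can only occur in Phred+33 files, while Illumina Phred+64 files start at '@' (ASCII 64). A result of 64 only means "probably", since very high quality Phred+33 reads could also lack low characters.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset from the lowest character seen.

    Returns 33, 64, or None when the lowest character falls in the
    ambiguous range 59..63 (Phred+33 or old Solexa+64).
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None

# Quality strings from the test file in the first post: 'H' and '['.
print(guess_phred_offset(["HHHHHHHH", "[[[[[[[["]))
```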

EXAMPLE OUTPUT:

Code:
Scale used: 2.2
Threshold used: 101 out of 40000
Adapter ILMN RT_primer_rc (TCGTATGCCGTCTTCTGCTTG): counted 193 at the 'end' of 'example.fastq', clip set to 6
Adapter FLUIDIGM Index-SP (AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG): counted 1063 at the 'end' of 'example.fastq', clip set to 4
Files: 1
Total reads: 250000
Too short after clip: 53
Clipped 'end' reads: Count: 16612, Mean: 18.12, Sd: 17.44
Trimmed 24474 reads by an average of 10.81 bases on quality < 10

Last edited by earonesty; 06-14-2011 at 07:43 AM.
Old 06-14-2011, 10:47 AM   #11
angelawu
Member
 
Location: Stanford, CA

Join Date: Feb 2010
Posts: 12
Default

Hi earonesty,

Thanks for getting back to me!
Here is an example of the output I received:

Scale used: 2.2
Threshold used: 101 out of 40000
Adapter TruSeq-Adapter1 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter2 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter3 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter4 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG): counted 10159 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter5 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter6 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter7 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter8 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter9 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter10 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter11 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Adapter TruSeq-Adapter12 (GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG): counted 10158 at the 'start' of './sanger-fastq/s_2.fastq', clip set to 1
Files: 1
Total reads: 21964185
Too short after clip: 35672
Clipped 'start' reads: Count: 13283064, Mean: 1.58, Sd: 1.15
Trimmed 394023 reads by an average of 7.15 bases on quality < 10


So far, what I understand is that my samples probably have a lot of adapter-pair ligations in there without any genomic insert. As a result, the entirety of my 36bp read is a portion of the index/adapter. And since the adapter is much longer than 36bp, I think those reads are not being removed, e.g.:

My sample DNA: ADAPTER1adapter2
read:                   dapte

I say this because when I put the cleaned up reads through FastQC again, I see that all the "Over-represented Sequences" that are TruSeq Indexes are still present in my file.

I've managed to resolve this particular issue by basically copying in the sequence given by FastQC as the overrepresented sequence, and using those in a fasta file as the adapter sequences. It works well for my case, so maybe there is nothing wrong with the toolkit, and it's just my particular sample?
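
The situation described above, a read that lies wholly inside a longer adapter, can be checked with a minimal Python sketch like the following. It uses exact substring matching with no mismatches, which is an assumption made for illustration and not how fastq-mcf searches; the function name adapters_containing is hypothetical. The adapter sequence is TruSeq-Adapter1 as printed in the output above.

```python
def adapters_containing(read, adapters):
    """Return the names of adapters that contain the whole read."""
    return [name for name, seq in adapters.items() if read in seq]

adapters = {
    "TruSeq-Adapter1": (
        "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
        "ATCACGATCTCGTATGCCGTCTTCTGCTTG"
    ),
}

# A 36 bp read that starts at the second base of the adapter.
read = adapters["TruSeq-Adapter1"][1:37]

print(len(read), adapters_containing(read, adapters))
```

Reads flagged this way are entirely adapter, so there is no insert left to keep once the adapter is removed.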

Yes, I understand that Illumina reads use Phred64, but I always convert directly to Sanger Phred33 as soon as I get my files, which is why I put the -P 33 option in there.

Thanks,

Angela
Old 06-14-2011, 11:17 AM   #12
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

- Your adapter file seems to have the same sequence over and over? I'm not sure how that will affect things. TruSeq-Adapter2 is the same as TruSeq-Adapter1.... etc. Try just using 1 per unique sequence. This probably won't help.

- Out of 40000 reads, 10000 had an exact match for 15 base pairs of adapter sequence. That's a lot. So when it says "clip set to 1" it will clip any matching subsequence.

- It only discarded 35672 reads and only a few bases. That's surprising to me considering the number of sequences it found in the subsample with exact matches. I would expect a higher rate of discards, and a higher number of mean bases clipped.

- This is a situation where I wish I could see about 100K reads from your sample and run it a few times to see why it did that. It should be walking the adapter along the sequence looking for the best match. It seems to be stopping early on, or perhaps the sequences that match the adapter are somewhere else (at the end?) and it guessed wrong (you can force -e).

- There's also an undocumented "-d" option that spits out lots of debug info that I find useful.
Old 06-14-2011, 11:22 AM   #13
angelawu
Member
 
Location: Stanford, CA

Join Date: Feb 2010
Posts: 12
Default

Oh, the adapter sequences are not identical. If you look closely at the middle portion of the sequences, there is a barcode in the middle that is different for each sequence. But I also do not think this would make any difference...

In any case, I think I have a solution to my particular application, so I don't know how much time I want to spend debugging this, but thanks for reminding me of the -d option, which will surely come in handy later on as well. The -e option may be the trick, since the barcode only begins in the middle of the adapter sequence?

Thanks once again!
Old 06-15-2011, 06:20 AM   #14
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

I think the barcode in the middle was making it odd. Also, I think your solution is great.
Old 06-19-2011, 03:19 PM   #15
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default

I have tried your ea-utils, but it seems to behave the same as the FASTX toolkit's adapter trimming: ea-utils also removes the whole read that contains the adapter.

Old 06-20-2011, 06:00 AM   #16
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

It only removes the whole read if the remaining length is less than the minimum length threshold, which, by default is 15:

-l N Minimum remaining sequence length (15)

15's pretty low, but you can lower it.
Old 06-20-2011, 06:11 AM   #17
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default

But when I run it on the test data, fastq-mcf seems to remove the whole read. Did I set something wrong?

fastq-mcf adapters.fa test.fastq -q 0 -l 1 -o test.trim
Scale used: 2.2
Threshold used: 1 out of 4
Adapter date_day (CCTTAAGG): counted 3 at the 'end' of 'test.fastq', clip set to 1
Files: 1
Total reads: 4
Too short after clip: 3

----------------adapters.fa----------------
>date_day
CCTTAAGG
>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

------------------test.fastq------------------------
@test1
CCTTAAGGAAAAAAAAAAGGGGGGGGGG
+test1
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test2
CCTTAAGGAAAAAAAAAGGGGGGGGGGG
+test2
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test3
AGAGAGAGAGAGAGAGAGAGAGAGAGAG
+test3
HHHHHHHHHHHHHHHHHHHHHHHHHHHH
@test4
CCTTAAGGTTGACGTGATCGACACCTGG
+test4
[[[[[[[[[[[[[[[[[[[[[[[[[[[[
Old 06-20-2011, 07:34 AM   #18
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

You're right, I'd introduced a bug in detecting starts. I'm sending a release and adding it to my tests. The problem was it should have been saying at 'start' of sequence... not 'end'. By thinking it was an 'end' adapter, it was trimming from the right... hence removing the whole sequence.

(REVISION: The problem was very short adapter sequences in the input file. fastq-mcf can't be used that way. It uses subsampling to get the adapter thresholds, and requires a minimum of 15 character matches during the subsampling pass)

Last edited by earonesty; 06-20-2011 at 10:40 AM.
Old 06-20-2011, 07:52 AM   #19
fabrice
Member
 
Location: paris

Join Date: Oct 2009
Posts: 86
Default

In my opinion, most adaptor sequence is at the start of reads. Is that right?

In my RNA-seq data, I only see a pattern at the start, not the end.

In fact, I cannot understand the meaning of these options.

-s N.N Log scale for clip pct to threshold (2.5)
-t N % occurance threshold before clipping (0.25)
-m N Minimum clip length, overrides scaled auto (1)
-p N Maximum adapter difference percentage (20) This is for mismatch?
-k N sKew percentage causing trimming (2)
Old 06-20-2011, 07:58 AM   #20
earonesty
Member
 
Location: United States of America

Join Date: Mar 2011
Posts: 52
Default

They do deserve some elaboration in the help:

# -s N.N Log scale for clip pct to threshold (2.5)

Scaling factor which causes more frequent occurrences to clip heavily. The negative log base "scale" of the ratio of adapters found becomes the minimum match-length for the adapter.

# -t N % occurance threshold before clipping (0.25)

Minimum number of times an adapter needs to be found before clipping is considered necessary


# -m N Minimum clip length, overrides scaled auto (1)

Replaces scaling algorithm (above) with hardcoded limit

# -p N Maximum adapter difference percentage (20) This is for mismatch?

Yes, this is the maximum mismatch percentage.

# -k N sKew percentage causing trimming (2)

If one of the bases is only 2% of the reads, then that cycle is considered "skewed". If it's at the edge of a read, it's trimmed off.
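
As a worked example of the -s description, here is a minimal Python sketch that turns an adapter count from the subsample into a minimum match length. It is a reading of the description above checked against the logs posted in this thread, not fastq-mcf's source: rounding down, the lower bound of 1 (taken from the -m default), and the function name scaled_clip_length are all assumptions. The scale of 2.2 is the "Scale used" value printed in the logs.

```python
import math

def scaled_clip_length(count, sampled, scale=2.2, min_clip=1):
    """Negative log, base `scale`, of the fraction of sampled reads
    in which the adapter was counted, rounded down (assumed) and
    never below `min_clip` (assumed)."""
    if count <= 0 or sampled <= 0:
        raise ValueError("count and sampled must be positive")
    ratio = count / sampled
    scaled = math.floor(-math.log(ratio, scale))
    return max(min_clip, scaled)

# (count, sampled) pairs taken from the logs posted in this thread.
for count, sampled in [(193, 40000), (1063, 40000), (10158, 40000), (3, 4)]:
    print(count, sampled, scaled_clip_length(count, sampled))
```

Under these assumptions the four cases give 6, 4, 1 and 1, which agree with the "clip set to" values in the corresponding log lines: the more often an adapter is seen, the shorter the match needed to clip it.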