**dawe** · 05-21-2010, 09:32 PM

You may try the fastx toolkit or play with the good old EMBOSS suite :-)

**yifangt** · 03-03-2013, 01:20 AM

Same question

I have the same question, but I could not find a direct answer so far. The FASTX tools are not suitable: fastx_trimmer needs the position of the adaptor, fastx_clipper only clips off the sequence after the adaptor, and after several tries I am not sure Biopieces did the right thing. The tricky part is that the insert can be in either orientation, so there are four sets of border sequences that serve as markers to be clipped off. Say:

Code:

5-TGGCCAATTnnnnnnnnnnTGCTAGCACTAG-3
3-ACCGGTTAAnnnnnnnnnnACGATCGTGATC-5

The n's are the insert sequence.
So that

Code:

TGCTAGCTAG--->vector--seq---
AATTGGCCA--->vector--seq---

should be clipped off
and

Code:

--vector--seq---<---TGGCCAATT
--vector--seq---<---CTAGTGCTAGCA

should be clipped off too.

I am not sure all the available tools take these cases into consideration. I hope one of the authors can address this question. Thanks in advance!

YT
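
A minimal sketch of where the four border patterns come from, assuming the two top-strand borders given in the example above. The helper name `reverse_complement` and the `border_pairs` dictionary are made up for illustration and are not part of any tool mentioned in this thread.

```python
# Sketch: derive the border patterns seen on reads from either strand.
# Assumes plain A/C/G/T borders; names here are illustrative only.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


LEFT_BORDER = "TGGCCAATT"      # 5' border from the example above
RIGHT_BORDER = "TGCTAGCACTAG"  # 3' border from the example above

# A top-strand read carries LEFT_BORDER ... RIGHT_BORDER.
# A bottom-strand read carries rc(RIGHT_BORDER) ... rc(LEFT_BORDER),
# so the left/right roles swap when the read is reverse complemented.
border_pairs = {
    "top": (LEFT_BORDER, RIGHT_BORDER),
    "bottom": (reverse_complement(RIGHT_BORDER), reverse_complement(LEFT_BORDER)),
}

for strand, (left, right) in border_pairs.items():
    print(strand, left, right)
```

The bottom-strand pair comes out as CTAGTGCTAGCA and AATTGGCCA, which are the two reverse-complemented markers listed in the post.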

**mastal** · 03-03-2013, 08:09 AM

How to trim Vector and Contamination from Illumina reads?

Hi guys,

If you are working with Illumina data, try Trimmomatic:

USADELLAB.org - Trimmomatic: A flexible read trimming tool for Illumina NGS data

http://www.usadellab.org/cms/index.php?page=trimmomatic

Best wishes,
Maria

**swbarnes2** · 03-03-2013, 12:22 PM

Did you try aligning to the E. coli and vector sequences, and then filtering the .bam?

**yifangt** · 03-03-2013, 04:35 PM

Thanks swbarnes2!
I did align them to the vectors, but my point is NOT to discard the mapped reads, as they contain the border sequence of my BAC insert. There seem to be tools for this in Biopieces, but I have a problem with the installation. The FASTX tools only handle part of my problem; at least I could not figure out a way to do the job with them.

mastal, I have looked into your suite, but I could not figure out how to clip off the border sequences of each read based not on quality but on the insert border sequences, which vary among reads. This is different from adaptor removal in RNA-seq etc.

I appreciate any expertise though. Thanks again!

**maasha** · 03-04-2013, 01:11 AM

Biopieces should be able to do this. Why don't you make a couple of small tests to see? You may need to reverse complement sequences or adaptors, but that is what a test will show you. Here is my little test (note that I use x instead of N, since N is the IUPAC code for A, T, C or G and would match anything):

Code:

maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG 
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
---

Note that x is included in the matched pattern because, by default, we allow 10% mismatches.
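
To illustrate the idea of a mismatch budget, here is a rough Python sketch that slides the adaptor along the read and accepts the first window within 10% substitutions. This is not how Biopieces implements find_adaptor: the function name `find_with_mismatches` is hypothetical, and the sketch counts substitutions only. The 13-character match reported above for a 12-character adaptor suggests the real tool also tolerates insertions or deletions, which this sketch does not model.

```python
from typing import Optional, Tuple


def find_with_mismatches(
    seq: str, adaptor: str, max_mismatch_frac: float = 0.10
) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) for the first window of `seq` that
    matches `adaptor` with at most max_mismatch_frac substitutions,
    or None if no window qualifies. Substitutions only (an assumption)."""
    budget = int(len(adaptor) * max_mismatch_frac)
    for pos in range(len(seq) - len(adaptor) + 1):
        window = seq[pos:pos + len(adaptor)]
        mismatches = sum(a != b for a, b in zip(window, adaptor))
        if mismatches <= budget:
            return pos, mismatches
    return None


test1 = "TGGCCAATT" + "x" * 10 + "TGCTAGCACTAG"
test2 = "ACCGGTTAA" + "x" * 10 + "ACGATCGTGATC"

print(find_with_mismatches(test1, "TGCTAGCACTAG"))  # exact hit at the 3' end
print(find_with_mismatches(test2, "TGCTAGCACTAG"))  # no hit on the other strand
```

With a 12-base adaptor the budget works out to one substitution, so the second record gets no hit, just as in the output above.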

Now, to get the adaptors trimmed from the second entry, you simply need to supply the appropriate adaptors and run another round of find_adaptor:

Code:

maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---

And finally clip_adaptor:

Code:

maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC | clip_adaptor
SEQ_NAME: test1
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---
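
For readers who want to see what the clipping step amounts to, here is a small Python sketch that reproduces the test1 result from the reported ADAPTOR_POS/ADAPTOR_LEN fields. It is an assumption about the slicing logic based only on the output shown in this post, not the actual clip_adaptor code; the function name `clip` and its parameters are invented for the example.

```python
from typing import Optional


def clip(
    seq: str,
    left_pos: Optional[int] = None,
    left_len: Optional[int] = None,
    right_pos: Optional[int] = None,
) -> str:
    """Keep what lies after the left adaptor match and before the right
    adaptor match. A missing adaptor leaves that side untouched."""
    start = 0 if left_pos is None else left_pos + (left_len or 0)
    end = len(seq) if right_pos is None else right_pos
    return seq[start:end]


read = "TGGCCAATT" + "x" * 10 + "TGCTAGCACTAG"

# Field values copied from the find_adaptor output for test1.
clipped = clip(read, left_pos=0, left_len=9, right_pos=18)
print(clipped, len(clipped))
```

Because the right match starts at position 18 (one x is swallowed by the mismatch tolerance), nine of the ten x characters survive, which agrees with SEQ_LEN: 9 above.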

**yifangt** · 03-04-2013, 07:16 AM

clip off vector border sequence

Thanks Martin!
That's what I was trying. Unfortunately, I ran into a problem with the Biopieces installation related to Ruby. I have not yet sorted it out on my Ubuntu system, and I have posted it in the Google group. I would appreciate it if you could have a look and give some suggestions.
Thanks a lot again!

YT

**yifangt** · 03-04-2013, 12:34 PM

Hi Martin!

An update on removing vector sequences. Two things I realized need attention:
1) The -f and -r arguments for the adaptor sequences of the other strand should be swapped relative to your last reply, because the sequences are reverse complemented, i.e. the second find_adaptor command should be:

Code:

read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -r ACCGGTTAA -f ACGATCGTGATC

2) There seem to be bugs with some adaptor combinations, e.g. seq14, which combines seq01 and seq04 and should have its adaptors trimmed off. Adaptors were detected but not clipped when the adaptor sequence was right at the end of the read; see >seq03_head_last.
An example of what I did is:

Code:

>seq01
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxxxxxxxx
>seq02
XXXXX222XXXXXXXXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq03
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXXXX
>seq04
XXX4444XXXXXXXXXXXXXXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq12
AGTCGACCTGCAGGCATGCAAGCTTxxx111XX222XXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq34
GTGACACTATAGAATACTCAAGCTTXXX333XXX4444XXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq13
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxXXX333XXXXXXXXXXGTGACACTATAGAATACTCAAGCTTxxxxx333
>seq14
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq32
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXX222XXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq20
xxxxxxxxxxxCTATAGTGTCACCTAAATAGCTTGGXXXXXXX222XXXXXXXXXXXXX
>seq03_head_last
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
>seq03_head_last_n_tail
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX

Code:

read_fasta -i demo_seq.fa | find_adaptor -f AGTCGACCTGCAGGCATGCAAGCTT -r CTATAGTGTCACCTAAATAGCTTGG | find_adaptor -f GTGACACTATAGAATACTCAAGCTT -r GCATGCCTGCAGGTCGACTCTAGAG  | clip_adaptor

The output is:

Code:

SEQ_NAME: seq01
SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
SEQ_LEN: 29
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
---
SEQ_NAME: seq02
SEQ: XXXXX222XXXXXXXXXXXXXXXXXXX
SEQ_LEN: 27
ADAPTOR_POS_RIGHT: 27
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03
SEQ: XXX333XXXXXXXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq04
SEQ: XXX4444XXXXXXXXXXXXXXXXXXXXXXXXX
SEQ_LEN: 32
ADAPTOR_POS_RIGHT: 32
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq12
SEQ: xxx111XX222XXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 49
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq34
SEQ: XXX333XXX4444XXXXXXXXXXXXX
SEQ_LEN: 26
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
ADAPTOR_POS_RIGHT: 51
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq13
SEQ: xxxxx333
SEQ_LEN: 8
ADAPTOR_POS_LEFT: 62
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq14
SEQ: xxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXX
SEQ_LEN: 37
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 62
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq32
SEQ: XXX333XXXXXXXXXXXXXXXX222XXXXXXXX
SEQ_LEN: 33
ADAPTOR_POS_RIGHT: 58
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq20
SEQ: xxxxxxxxxx
SEQ_LEN: 10
ADAPTOR_POS_RIGHT: 10
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: xCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03_head_last
SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
SEQ_LEN: 36
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq03_head_last_n_tail
SEQ: XXxxxx3xxtailXXXXXXXXX
SEQ_LEN: 22
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---

You can see that the sequence

Code:

>seq03_head_last

should have been clipped to an empty sequence, since the adaptor is at the very end of the read. However, the clipping is correct when extra sequence follows the adaptor, cf.

Code:

seq03_head_last_n_tail

Did I miss anything here? Thanks!
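
As a quick check of the arithmetic behind this report, the following Python sketch slices both records at ADAPTOR_POS_LEFT + ADAPTOR_LEN_LEFT, using the values from the output above. It only shows what plain slicing would give. Whether clip_adaptor deliberately leaves a record untouched when the result would be empty cannot be determined from this thread.

```python
ADAPTOR = "GTGACACTATAGAATACTCAAGCTT"

records = {
    "seq03_head_last": "XXXXXXXXXXX" + ADAPTOR,
    "seq03_head_last_n_tail": "XXXXXXXXXXX" + ADAPTOR + "XXxxxx3xxtailXXXXXXXXX",
}

# ADAPTOR_POS_LEFT and ADAPTOR_LEN_LEFT as reported for both records.
LEFT_POS, LEFT_LEN = 10, 26

# Everything after the end of the left adaptor match is what should remain.
remainders = {name: seq[LEFT_POS + LEFT_LEN:] for name, seq in records.items()}

for name, rest in remainders.items():
    print(name, len(records[name]), repr(rest))
```

For seq03_head_last the match ends exactly at the read length of 36, so plain slicing leaves nothing, while the second record keeps its 22-character tail as in the output.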

**maasha** · 03-06-2013, 12:50 AM

Thanks yifangt, I will post this to the Biopieces Google Group and answer there.
