SEQanswers

Old 05-21-2010, 09:25 AM   #1
wangchy
Junior Member
 
Location: Auburn

Join Date: Mar 2010
Posts: 3
Default How to trim Vector and Contamination from Illumina reads?

We sequenced a few pooled BAC clones on Illumina. The reads contain vector and E. coli genome contamination, and we need to get rid of these sequences.

We have CLC bio Genomics Workbench, but it did not remove the vector sequences efficiently. Is there any alternative software for this kind of sequence trimming?
Old 05-21-2010, 10:32 PM   #2
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258
Default

You may try the fastx toolkit or play with the good old EMBOSS suite :-)
Old 03-03-2013, 01:20 AM   #3
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default Same question

I have the same question, but so far I could not find a direct answer to it. The FASTX tools are not suitable: fastx_trimmer needs the position of the adaptor, and fastx_clipper only clips off the sequence after the adaptor. I am also not quite sure Biopieces did the right thing after several tries. The tricky part is that the insert can be in either orientation, so there are four border sequences that act as markers to be clipped off. Say:
Code:
5-TGGCCAATTnnnnnnnnnnTGCTAGCACTAG-3
3-ACCGGTTAAnnnnnnnnnnACGATCGTGATC-5
where nnnnnn is the insert sequence.
So that
Code:
TGCTAGCACTAG--->vector--seq---
AATTGGCCA--->vector--seq---
should be clipped off
and
Code:
--vector--seq---<---TGGCCAATT
--vector--seq---<---CTAGTGCTAGCA
should be clipped off too.

I am not sure that all the available tools take these cases into consideration. I hope one of the authors can address this question. Thanks in advance!
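
The four border sequences described above are the two vector ends plus their reverse complements. A minimal Python sketch, assuming the two example borders from this post; the helper names `reverse_complement` and `border_set` are hypothetical and do not come from any of the toolkits mentioned:

```python
# Complement table for unambiguous DNA bases plus N (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def border_set(left: str, right: str) -> dict:
    """List the four sequences that can flank an insert read in either orientation."""
    return {
        "left": left,
        "right": right,
        "left_rc": reverse_complement(left),
        "right_rc": reverse_complement(right),
    }


if __name__ == "__main__":
    # Example borders taken from the post; real vector ends would go here.
    for name, seq in border_set("TGGCCAATT", "TGCTAGCACTAG").items():
        print(f"{name}\t{seq}")
```

The `left_rc` and `right_rc` values are the AATTGGCCA and CTAGTGCTAGCA borders listed above.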

YT
Old 03-03-2013, 08:09 AM   #4
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default How to trim Vector and Contamination from Illumina reads?

Hi guys,

If you are working with Illumina data, try Trimmomatic:

http://www.usadellab.org/cms/index.php?page=trimmomatic

Best wishes,
Maria
Old 03-03-2013, 12:22 PM   #5
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Did you try aligning to the E.coli and vector sequences, and then filtering the .bam?
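
One way to read this suggestion: align the reads against the E. coli and vector references, then keep only the reads that did not map. A rough Python sketch, assuming the alignment is available as SAM text (for example converted from the .bam with `samtools view -h`) and relying only on the unmapped bit (0x4) of the SAM FLAG field. The function name `keep_unmapped` is made up for illustration:

```python
from typing import Iterable, Iterator

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def keep_unmapped(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and records whose FLAG has the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):       # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:           # SAM records have 11 mandatory fields
            continue
        if int(fields[1]) & UNMAPPED:  # field 2 is the bitwise FLAG
            yield line


if __name__ == "__main__":
    # Tiny made-up SAM snippet: read1 maps to the vector, read2 does not map.
    demo = [
        "@HD\tVN:1.6\tSO:unsorted",
        "read1\t0\tvector\t1\t60\t9M\t*\t0\t0\tTGGCCAATT\tIIIIIIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTA\tIIIIIIIII",
    ]
    for kept in keep_unmapped(demo):
        print(kept)
```

Note that this drops every read that maps at all, including reads that span a vector/insert border, so it suits whole-read contamination better than border clipping.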
Old 03-03-2013, 04:35 PM   #6
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default

Thanks swbarnes2!
I did align them to the vectors, but my point is NOT to discard those mapped reads, as they contain the border sequences of my BAC insert. There seem to be tools for this in Biopieces, but I have a problem with the installation. The FASTX tools only handle part of my problem, or at least I did not figure out how to do the job with them.

mastal, I have looked into your suite, but I could not figure out how to clip off the border sequences of each read, based not on quality but on the insert border sequences, which vary among reads. This is different from adaptor removal in RNA-seq etc.

I appreciate any expertise though. Thanks again!

Last edited by yifangt; 03-03-2013 at 04:39 PM.
Old 03-04-2013, 01:11 AM   #7
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Biopieces should be able to do this. Why don't you run a couple of small tests to see? You may need to reverse complement the sequences or the adaptors, but that is what a test will show you. Here is my little test (note that I use x instead of N, since N is the IUPAC code for A, T, C or G and will therefore match anything):

Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG 
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
---
Note that the x is included in the matched pattern because, by default, we allow 10% mismatches.
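
To illustrate what a mismatch allowance means in general, here is a small Python sketch of a sliding-window search that tolerates a fraction of mismatching positions. It is not the Biopieces implementation (Biopieces reports the right-hand match one base wider in the output above, which this sketch does not reproduce). `find_approx` and its `max_mismatch_frac` parameter are hypothetical names, and rounding the allowance down is an assumption:

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approx(read: str, adaptor: str, max_mismatch_frac: float = 0.1) -> Optional[int]:
    """Return the first 0-based position where adaptor matches read with at
    most max_mismatch_frac * len(adaptor) mismatches, or None if there is none."""
    allowed = int(len(adaptor) * max_mismatch_frac)  # assumption: round down
    for pos in range(len(read) - len(adaptor) + 1):
        window = read[pos:pos + len(adaptor)]
        if hamming(window, adaptor) <= allowed:
            return pos
    return None


if __name__ == "__main__":
    read = "TGGCCAATTxxxxxxxxxxTGCTAGCACTAG"
    print(find_approx(read, "TGCTAGCACTAG"))  # adaptor present exactly
    print(find_approx(read, "TGCTAGCACTAC"))  # last base differs, within 10%
    print(find_approx(read, "AAAAAAAAAAAA"))  # no acceptable match
```

With a 12-base adaptor, a 10% allowance rounds down to one tolerated mismatch, so a single sequencing error inside the adaptor would still be found.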

Now to get the adaptors trimmed from the second entry you simply need to supply the appropriate adaptors - and run through another round of find_adaptor:

Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---
And finally clip_adaptor:

Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC | clip_adaptor
SEQ_NAME: test1
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---
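
The clipped sequences above are consistent with a simple slicing rule: drop everything up to the end of the left adaptor and everything from the start of the right adaptor. A Python sketch of that rule, parsing the `KEY: value` records as printed in this post. The rule is inferred from the output shown here, not taken from the Biopieces source, and `parse_records` and `clip` are hypothetical helper names:

```python
from typing import Dict, Iterator


def parse_records(text: str) -> Iterator[Dict[str, str]]:
    """Split a 'KEY: value' stream into dicts; each record ends with a '---' line."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if record:
                yield record
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key] = value
    if record:  # tolerate a final record without a trailing '---'
        yield record


def clip(record: Dict[str, str]) -> str:
    """Return SEQ with the reported left and right adaptors sliced off."""
    seq = record["SEQ"]
    start, end = 0, len(seq)
    if "ADAPTOR_POS_LEFT" in record:
        start = int(record["ADAPTOR_POS_LEFT"]) + int(record["ADAPTOR_LEN_LEFT"])
    if "ADAPTOR_POS_RIGHT" in record:
        end = int(record["ADAPTOR_POS_RIGHT"])
    return seq[start:end]


if __name__ == "__main__":
    stream = """SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
---
"""
    for rec in parse_records(stream):
        print(rec["SEQ_NAME"], clip(rec))
```

For test1 this slices positions 9 to 17, which gives the 9-character sequence reported by clip_adaptor.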

Last edited by maasha; 03-06-2013 at 12:56 AM.
Old 03-04-2013, 07:16 AM   #8
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default clip off vector border sequence

Thanks Martin!
That's what I was trying. Unfortunately I ran into a problem with the Biopieces installation, related to Ruby. I have not yet sorted it out on my Ubuntu system, and I have posted it in the Google group. I would appreciate it if you could have a look at it and give some suggestions.
Thanks a lot again!

YT

Last edited by yifangt; 03-04-2013 at 07:20 AM.
Old 03-04-2013, 12:34 PM   #9
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default

Hi Martin!

An update on removing vector sequences. There are two things I realized I need to pay attention to:
1) The -f and -r arguments for the adaptor sequences of the other strand should be swapped relative to your last reply, as the sequences are reverse complemented, i.e. the second find_adaptor command should be:
Code:
read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -r ACCGGTTAA -f ACGATCGTGATC
2) There seem to be bugs with adaptor combinations, e.g. seq14 as the combination of seq1 and seq4, for which the adaptors should be trimmed off. The adaptors were detected, but not clipped, if the adaptor sequence was right at the end of the read; see >seq03_head_last.
An example of what I did is:
Code:
>seq01
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxxxxxxxx
>seq02
XXXXX222XXXXXXXXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq03
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXXXX
>seq04
XXX4444XXXXXXXXXXXXXXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq12
AGTCGACCTGCAGGCATGCAAGCTTxxx111XX222XXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq34
GTGACACTATAGAATACTCAAGCTTXXX333XXX4444XXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq13
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxXXX333XXXXXXXXXXGTGACACTATAGAATACTCAAGCTTxxxxx333
>seq14
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
>seq32
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXX222XXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq20
xxxxxxxxxxxCTATAGTGTCACCTAAATAGCTTGGXXXXXXX222XXXXXXXXXXXXX
>seq03_head_last
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
>seq03_head_last_n_tail
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX
Code:
read_fasta -i demo_seq.fa | find_adaptor -f AGTCGACCTGCAGGCATGCAAGCTT -r CTATAGTGTCACCTAAATAGCTTGG | find_adaptor -f GTGACACTATAGAATACTCAAGCTT -r GCATGCCTGCAGGTCGACTCTAGAG  | clip_adaptor
The output is:
Code:
SEQ_NAME: seq01
SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
SEQ_LEN: 29
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
---
SEQ_NAME: seq02
SEQ: XXXXX222XXXXXXXXXXXXXXXXXXX
SEQ_LEN: 27
ADAPTOR_POS_RIGHT: 27
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03
SEQ: XXX333XXXXXXXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq04
SEQ: XXX4444XXXXXXXXXXXXXXXXXXXXXXXXX
SEQ_LEN: 32
ADAPTOR_POS_RIGHT: 32
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq12
SEQ: xxx111XX222XXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 49
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq34
SEQ: XXX333XXX4444XXXXXXXXXXXXX
SEQ_LEN: 26
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
ADAPTOR_POS_RIGHT: 51
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq13
SEQ: xxxxx333
SEQ_LEN: 8
ADAPTOR_POS_LEFT: 62
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq14
SEQ: xxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXX
SEQ_LEN: 37
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 62
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq32
SEQ: XXX333XXXXXXXXXXXXXXXX222XXXXXXXX
SEQ_LEN: 33
ADAPTOR_POS_RIGHT: 58
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq20
SEQ: xxxxxxxxxx
SEQ_LEN: 10
ADAPTOR_POS_RIGHT: 10
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: xCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03_head_last
SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
SEQ_LEN: 36
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq03_head_last_n_tail
SEQ: XXxxxx3xxtailXXXXXXXXX
SEQ_LEN: 22
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
You can see that the sequence
Code:
>seq03_head_last
should have been clipped to an empty sequence, as the adaptor is at the very end of the read. However, the clipping is correct if there is extra sequence attached after the adaptor, cf.
Code:
seq03_head_last_n_tail
Did I miss anything with that? Thanks!
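
For comparison, the behaviour expected in this post can be written down in a few lines of Python. This sketch uses exact string matching only and treats the border as a left-hand marker, so everything up to and including it is removed. `clip_left_border` is a hypothetical name, and the code makes no claim about how find_adaptor or clip_adaptor work internally:

```python
def clip_left_border(read: str, border: str) -> str:
    """Remove everything up to and including the first exact occurrence of border.

    Returns the read unchanged when the border is absent, and an empty
    string when the border sits at the very end of the read.
    """
    pos = read.find(border)
    if pos == -1:
        return read
    return read[pos + len(border):]


if __name__ == "__main__":
    border = "GTGACACTATAGAATACTCAAGCTT"
    # The two test reads from the post above.
    reads = {
        "seq03_head_last": "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT",
        "seq03_head_last_n_tail": "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX",
    }
    for name, read in reads.items():
        print(f"{name}\t{clip_left_border(read, border)!r}")
```

Under this rule seq03_head_last becomes an empty sequence, while seq03_head_last_n_tail keeps only the part after the border, as in the clip_adaptor output above.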

Last edited by yifangt; 03-04-2013 at 12:41 PM.
Old 03-06-2013, 12:50 AM   #10
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Thanks yifangt, I will post this to the Biopieces Google Group and answer there.
All times are GMT -8.