#1 |
Junior Member
Location: Auburn Join Date: Mar 2010
Posts: 3
We sequenced a few pooled BAC clones on Illumina. Because the BAC reads carry vector and E. coli genome contamination, we need to get rid of these sequences.
We have CLC bio Genomics Workbench, but it did not remove the vector sequences efficiently. Is there any alternative software for this kind of sequence trimming?
#2 |
Senior Member
Location: 45°30'25.22"N / 9°15'53.00"E Join Date: Apr 2009
Posts: 258
You may try the fastx toolkit or play with the good old EMBOSS suite :-)
#3 |
Member
Location: Canada Join Date: Feb 2011
Posts: 61
I have the same question, but so far I could not find a direct answer to it. The FASTX-Toolkit is not suitable: fastx_trimmer needs the position of the adaptor, and fastx_clipper only clips off the sequence after the adaptor. I am also not sure that Biopieces did the right thing after several tries. The tricky part is that the insert can be in either orientation, so there are four sets of border sequences that serve as markers to be clipped off. Say:
Code:
5-TGGCCAATTnnnnnnnnnnTGCTAGCACTAG-3
3-ACCGGTTAAnnnnnnnnnnACGATCGTGATC-5
So that
Code:
TGCTAGCTAG--->vector--seq---
AATTGGCCA--->vector--seq---
and
Code:
--vector--seq---<---TGGCCAATT
--vector--seq---<---CTAGTGCTAGCA
I am not sure that all the available tools take this into consideration. I hope one of the authors can address this question. Thanks in advance!
YT
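The four border patterns described in this post can be derived from the two border sequences by reverse-complementing them. Here is a minimal Python sketch, assuming plain A/C/G/T borders; the function name `reverse_complement` and the dictionary keys are illustrative and do not belong to any tool mentioned in this thread:

```python
# Complement table for plain DNA letters; any other character
# (for example n or x placeholders) is left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Border sequences copied from the example in the post.
left_border = "TGGCCAATT"
right_border = "TGCTAGCACTAG"

# A read from the opposite strand carries the reverse complements,
# so there are four patterns to look for in total.
border_patterns = {
    "left": left_border,
    "right": right_border,
    "left_rc": reverse_complement(left_border),
    "right_rc": reverse_complement(right_border),
}

for name, pattern in border_patterns.items():
    print(name, pattern)
```

Any clipping tool that only accepts literal patterns could then be given all four strings, rather than relying on the tool to handle both orientations itself.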
#4 |
Senior Member
Location: uk Join Date: Mar 2009
Posts: 667
Hi guys,
If you are working with Illumina data, try Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
Best wishes,
Maria
#5 |
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
Did you try aligning to the E.coli and vector sequences, and then filtering the .bam?
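One way to read "filtering the .bam" is to keep only the reads that did not align to the E. coli and vector references. The sketch below is an assumption about what that filter could look like, not a description of any particular tool: it works on SAM text (the text form of a BAM file) and relies only on the SAM convention that column 2 is the FLAG field and bit 0x4 marks an unmapped read. The function name `keep_unmapped` and the example records are hypothetical.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def keep_unmapped(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and the records whose FLAG has the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        if int(fields[1]) & UNMAPPED_FLAG:
            yield line


# Hypothetical records: read1 aligned to the vector, read2 did not align.
example_sam = [
    "@SQ\tSN:vector\tLN:100",
    "read1\t0\tvector\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGGCC\tIIIIIIIIII",
]

for record in keep_unmapped(example_sam):
    print(record)
```

Note that this drops every read the aligner reported as mapped, including reads that only partly consist of vector sequence, and for paired-end data the mate's status would also need to be considered.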
#6 |
Member
Location: Canada Join Date: Feb 2011
Posts: 61
Thanks swbarnes2!
I did align them to the vectors, but my point is NOT to discard those mapped reads, as they contain the border sequence of my BAC insert. There seem to be tools for this in Biopieces, but I have a problem with the installation. The FASTX-Toolkit for sure only handles part of my problem; at least I could not figure out a way to do the job with it.
mastal, I have looked into your suite, but I could not figure out how to clip off the border sequences of each read based not on quality but on the insert border sequences, which vary among reads. This is different from removing adaptors in RNA-seq etc. I appreciate any expertise though. Thanks again!
Last edited by yifangt; 03-03-2013 at 04:39 PM.
#7 |
Senior Member
Location: Denmark Join Date: Apr 2009
Posts: 153
Biopieces should be able to do this. Why don't you make a couple of small tests to see? You may need to reverse complement sequences or adaptors, but that is what a test will show you. Here is my little test (note that I use x instead of N, since N is the IUPAC code for A, T, C or G and will match anything):
Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
---
Now to get the adaptors trimmed from the second entry you simply need to supply the appropriate adaptors and run through another round of find_adaptor:
Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC
SEQ_NAME: test1
SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
SEQ_LEN: 31
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---
Code:
maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC | clip_adaptor
SEQ_NAME: test1
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: TGGCCAATT
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
---
SEQ_NAME: test2
SEQ: xxxxxxxxx
SEQ_LEN: 9
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 9
ADAPTOR_PAT_LEFT: ACCGGTTAA
ADAPTOR_POS_RIGHT: 18
ADAPTOR_LEN_RIGHT: 13
ADAPTOR_PAT_RIGHT: xACGATCGTGATC
---
Last edited by maasha; 03-06-2013 at 12:56 AM.
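For readers without a working Biopieces installation, the find-then-clip idea in this post can be imitated with exact string matching. The following Python sketch is not the Biopieces algorithm: the function name `clip_borders` is hypothetical, and it assumes exact matches only, that a left border is removed together with everything before it, and that a right border is removed together with everything after it.

```python
from typing import Optional


def clip_borders(read: str,
                 left: Optional[str] = None,
                 right: Optional[str] = None) -> str:
    """Return the part of `read` between a left and a right border.

    Exact matching only. If a border is not found, that end of the
    read is left untouched.
    """
    start = 0
    end = len(read)
    if left:
        pos = read.find(left)
        if pos != -1:
            start = pos + len(left)  # drop the border and anything before it
    if right:
        pos = read.find(right, start)  # only search after the left border
        if pos != -1:
            end = pos  # drop the border and anything after it
    return read[start:end]


# The two test sequences from the post, each with its own border pair.
print(clip_borders("TGGCCAATTxxxxxxxxxxTGCTAGCACTAG", "TGGCCAATT", "TGCTAGCACTAG"))
print(clip_borders("ACCGGTTAAxxxxxxxxxxACGATCGTGATC", "ACCGGTTAA", "ACGATCGTGATC"))
```

Because the match is exact, the clipped length can differ from the find_adaptor result shown in the post, where the reported right pattern includes one extra leading base.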
#8 |
Member
Location: Canada Join Date: Feb 2011
Posts: 61
Thanks Martin!
That's what I was trying. Unfortunately I ran into a problem installing Biopieces, related to Ruby. I have not yet sorted it out on my Ubuntu system, and I have posted it in the Google group. I would appreciate it if you could have a look and give some suggestions. Thanks a lot again!
YT
Last edited by yifangt; 03-04-2013 at 07:20 AM.
#9 |
Member
Location: Canada Join Date: Feb 2011
Posts: 61
Hi Martin!
An update on removing vector sequences. Two things I realized need attention:
1) The -f and -r arguments for the adaptor sequences of the other strand should be the opposite of those in your last reply, as the sequences are reverse complemented, i.e. the second find_adaptor command should be:
Code:
read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -r ACCGGTTAA -f ACGATCGTGATC
if the adaptor sequence was right at the end of the read, see >seq03_head_last.
An example of what I did is:
Code:
>seq01
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxxxxxxxx
>seq02
XXXXX222XXXXXXXXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq03
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXXXX
>seq04
XXX4444XXXXXXXXXXXXXXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG
>seq12
AGTCGACCTGCAGGCATGCAAGCTTxxx111XX222XXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq34
GTGACACTATAGAATACTCAAGCTTXXX333XXX4444XXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG
>seq13
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxXXX333XXXXXXXXXXGTGACACTATAGAATACTCAAGCTTxxxxx333
>seq14
AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG
>seq32
GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXX222XXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
>seq20
xxxxxxxxxxxCTATAGTGTCACCTAAATAGCTTGGXXXXXXX222XXXXXXXXXXXXX
>seq03_head_last
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
>seq03_head_last_n_tail
XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX
Code:
read_fasta -i demo_seq.fa | find_adaptor -f AGTCGACCTGCAGGCATGCAAGCTT -r CTATAGTGTCACCTAAATAGCTTGG | find_adaptor -f GTGACACTATAGAATACTCAAGCTT -r GCATGCCTGCAGGTCGACTCTAGAG | clip_adaptor
Code:
SEQ_NAME: seq01
SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
SEQ_LEN: 29
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
---
SEQ_NAME: seq02
SEQ: XXXXX222XXXXXXXXXXXXXXXXXXX
SEQ_LEN: 27
ADAPTOR_POS_RIGHT: 27
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03
SEQ: XXX333XXXXXXXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq04
SEQ: XXX4444XXXXXXXXXXXXXXXXXXXXXXXXX
SEQ_LEN: 32
ADAPTOR_POS_RIGHT: 32
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq12
SEQ: xxx111XX222XXXXXXXXXXXXX
SEQ_LEN: 24
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 49
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq34
SEQ: XXX333XXX4444XXXXXXXXXXXXX
SEQ_LEN: 26
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
ADAPTOR_POS_RIGHT: 51
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq13
SEQ: xxxxx333
SEQ_LEN: 8
ADAPTOR_POS_LEFT: 62
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq14
SEQ: xxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXX
SEQ_LEN: 37
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
ADAPTOR_POS_RIGHT: 62
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
---
SEQ_NAME: seq32
SEQ: XXX333XXXXXXXXXXXXXXXX222XXXXXXXX
SEQ_LEN: 33
ADAPTOR_POS_RIGHT: 58
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq20
SEQ: xxxxxxxxxx
SEQ_LEN: 10
ADAPTOR_POS_RIGHT: 10
ADAPTOR_LEN_RIGHT: 26
ADAPTOR_PAT_RIGHT: xCTATAGTGTCACCTAAATAGCTTGG
---
SEQ_NAME: seq03_head_last
SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
SEQ_LEN: 36
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
SEQ_NAME: seq03_head_last_n_tail
SEQ: XXxxxx3xxtailXXXXXXXXX
SEQ_LEN: 22
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
Code:
>seq03_head_last
Code:
seq03_head_last_n_tail
Last edited by yifangt; 03-04-2013 at 12:41 PM.
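The two points in this post can be checked on plain strings. The Python sketch below makes no claim about how find_adaptor interprets -f and -r. It only shows that reverse-complementing a read moves the reverse complement of the right border to the left end and vice versa, and it flags the case where a left border ends exactly at the end of the read, so that nothing would remain after clipping. The insert sequence and the helper names are hypothetical.

```python
# Complement table for plain DNA letters; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def left_border_ends_read(read: str, border: str) -> bool:
    """True if the border sits at the very end of the read."""
    return read.endswith(border)


# Point 1: the borders swap sides on the opposite strand.
left, right = "TGGCCAATT", "TGCTAGCACTAG"
insert = "GATTACAGATTACA"  # hypothetical insert sequence
read = left + insert + right
rc_read = reverse_complement(read)

print(rc_read.startswith(reverse_complement(right)))  # right border, now on the left
print(rc_read.endswith(reverse_complement(left)))     # left border, now on the right

# Point 2: the edge case from the demo file in the post.
border = "GTGACACTATAGAATACTCAAGCTT"
seq03_head_last = "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT"
seq03_head_last_n_tail = "XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX"

print(left_border_ends_read(seq03_head_last, border))
print(left_border_ends_read(seq03_head_last_n_tail, border))
```

In the output listed in the post, seq03_head_last keeps its full 36-base sequence even though a left adaptor was found, so reads flagged by a check like this may need to be handled separately.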
#10 |
Senior Member
Location: Denmark Join Date: Apr 2009
Posts: 153
Thanks yifangt, I will post this to the Biopieces Google Group and answer there.