  • How to trim Vector and Contamination from Illumina reads?

    We ran Illumina sequencing on a few pooled BAC clones. The BAC reads contain vector and E. coli genome contamination, and we need to get rid of these sequences.

    We have CLC Bio Genomics Workbench, but it did not remove the vector sequences efficiently. Is there any alternative software for this kind of sequence trimming?

  • #2
    You may try the fastx toolkit or play with the good old EMBOSS suite :-)



    • #3
      Same question

      I have the same question, but so far I could not find a direct answer. The FASTX tools are not suitable: fastx_trimmer needs the position of the adaptor, fastx_clipper only clips off the sequence after the adaptor, and after several tries I am not sure Biopieces did the right thing. The tricky part is that the insert can be in either orientation, so there are four sets of border sequences to be clipped off. Say:
      Code:
      5-TGGCCAATTnnnnnnnnnnTGCTAGCACTAG-3
      3-ACCGGTTAAnnnnnnnnnnACGATCGTGATC-5
      nnnnnn are the insert sequence.
      So that
      Code:
      TGCTAGCTAG--->vector--seq---
      AATTGGCCA--->vector--seq---
      should be clipped off
      and
      Code:
      --vector--seq---<---TGGCCAATT
      --vector--seq---<---CTAGTGCTAGCA
      should be clipped off too.
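The four border sequences can be derived from the two vector-side sequences by reverse complementing them. The following is only a minimal sketch of that bookkeeping; the helper names `reverse_complement` and `border_variants` are made up for illustration and do not come from any of the tools mentioned in this thread.

```python
# Complement table for unambiguous DNA bases plus N (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def border_variants(left, right):
    """Return the four border sequences to look for when the insert
    can be read in either orientation (hypothetical helper)."""
    return {
        "left": left,
        "right": right,
        "left_rc": reverse_complement(left),
        "right_rc": reverse_complement(right),
    }


if __name__ == "__main__":
    # Border sequences taken from the example above.
    for name, seq in border_variants("TGGCCAATT", "TGCTAGCACTAG").items():
        print(name, seq)
```

For the example borders this gives AATTGGCCA and CTAGTGCTAGCA as the reverse-complemented forms, which are the sequences shown in the diagrams above.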

      I am not sure all the available tools take this into consideration. I hope one of the authors can address this question. Thanks in advance!

      YT



      • #4
        How to trim Vector and Contamination from Illumina reads?

        Hi guys,

        If you are working with Illumina data, try trimmomatic,



        Best wishes,
        Maria



        • #5
          Did you try aligning to the E. coli and vector sequences, and then filtering the .bam?
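As a rough illustration of the filtering step, here is a minimal sketch that works on SAM text (for example the output of `samtools view -h` on the .bam) and keeps only reads that did not align to the contaminant reference. It relies on the SAM FLAG bit 0x4 (read unmapped); the function name and the toy records are assumptions made for this example.

```python
UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def unmapped_read_names(sam_lines):
    """Yield names of reads that did NOT align to the contaminant reference."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            yield name


if __name__ == "__main__":
    # Toy records (hypothetical): read1 maps to the contaminant, read2 does not.
    toy = [
        "@SQ\tSN:ecoli\tLN:1000",
        "\t".join(["read1", "0", "ecoli", "100", "60", "10M", "*", "0", "0",
                   "ACGTACGTAC", "IIIIIIIIII"]),
        "\t".join(["read2", "4", "*", "0", "0", "*", "*", "0", "0",
                   "TTTTGGGGCC", "IIIIIIIIII"]),
    ]
    print(list(unmapped_read_names(toy)))
```

Note that this discards whole reads, which suits the E. coli contamination. Reads that span a vector/insert junction would need clipping rather than removal.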



          • #6
            Thanks swbarnes2!
            I did align them to the vectors, but my point is NOT to discard those mapped reads, as they cover the borders of my BAC insert. There seem to be tools for this in Biopieces, but I have a problem with the installation. The FASTX tools only handle part of my problem; at least I did not figure out a way to do the job with them.

            mastal, I have looked into your suite, but I could not figure out how to clip off the border sequences of each read based not on quality but on the insert border sequences, which vary among reads. This is different from adaptor removal in RNA-seq etc.

            I appreciate any expertise though. Thanks again!
            Last edited by yifangt; 03-03-2013, 04:39 PM.



            • #7
              Biopieces should be able to do this. Why don't you make a couple of small tests to see? You may need to reverse complement sequences or adaptors, but that is what a test will show you. Here is my little test (note that I use x instead of N, since N is the IUPAC code for A, T, C or G, which will match anything):

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG 
              SEQ_NAME: test1
              SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
              SEQ_LEN: 31
              ---
              Note that the reason x is included in the matched pattern is that by default we allow 10% mismatches.
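To give a feel for what a mismatch budget means, here is a minimal sketch of a substitution-only (Hamming distance) adaptor search. It is not the Biopieces implementation; the function name and the way the 10% budget is rounded are assumptions. Because it models substitutions only, it will not reproduce the extra leading x reported above.

```python
def find_adaptor_hamming(seq, adaptor, max_mismatch_frac=0.10):
    """Return the first index where `adaptor` matches `seq` with at most
    floor(len(adaptor) * max_mismatch_frac) substitutions, or -1 if none."""
    budget = int(len(adaptor) * max_mismatch_frac)
    seq_u, ad_u = seq.upper(), adaptor.upper()
    for start in range(len(seq_u) - len(ad_u) + 1):
        window = seq_u[start:start + len(ad_u)]
        mismatches = 0
        for a, b in zip(ad_u, window):
            if a != b:
                mismatches += 1
                if mismatches > budget:
                    break  # too many substitutions at this offset
        else:
            return start  # inner loop finished within the budget
    return -1


if __name__ == "__main__":
    read = "TGGCCAATTxxxxxxxxxxTGCTAGCACTAG"
    print(find_adaptor_hamming(read, "TGGCCAATT"))
    print(find_adaptor_hamming(read, "TGCTAGCACTAG"))
```

With a 12-base adaptor the budget here is one substitution, while a 9-base adaptor must match exactly.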

              Now to get the adaptors trimmed from the second entry you simply need to supply the appropriate adaptors - and run through another round of find_adaptor:

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC
              SEQ_NAME: test1
              SEQ: TGGCCAATTxxxxxxxxxxTGCTAGCACTAG
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: ACCGGTTAAxxxxxxxxxxACGATCGTGATC
              SEQ_LEN: 31
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: ACCGGTTAA
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xACGATCGTGATC
              ---
              And finally clip_adaptor:

              Code:
              maasha@mel:~$ read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -f ACCGGTTAA -r ACGATCGTGATC | clip_adaptor
              SEQ_NAME: test1
              SEQ: xxxxxxxxx
              SEQ_LEN: 9
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: TGGCCAATT
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xTGCTAGCACTAG
              ---
              SEQ_NAME: test2
              SEQ: xxxxxxxxx
              SEQ_LEN: 9
              ADAPTOR_POS_LEFT: 0
              ADAPTOR_LEN_LEFT: 9
              ADAPTOR_PAT_LEFT: ACCGGTTAA
              ADAPTOR_POS_RIGHT: 18
              ADAPTOR_LEN_RIGHT: 13
              ADAPTOR_PAT_RIGHT: xACGATCGTGATC
              ---
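The clipping step itself amounts to slicing between the reported adaptor positions. Below is a minimal sketch, assuming 0-based positions as in the ADAPTOR_POS_* and ADAPTOR_LEN_* fields shown above; `clip` is a hypothetical helper, not the Biopieces code.

```python
def clip(seq, left_pos=None, left_len=None, right_pos=None):
    """Keep only the part of `seq` between a left adaptor (ending at
    left_pos + left_len) and a right adaptor (starting at right_pos).
    Either side may be missing (None)."""
    start = 0 if left_pos is None else left_pos + left_len
    end = len(seq) if right_pos is None else right_pos
    return seq[start:end]  # an empty string if nothing lies between them


if __name__ == "__main__":
    # Values taken from the test1 record above.
    insert = clip("TGGCCAATTxxxxxxxxxxTGCTAGCACTAG",
                  left_pos=0, left_len=9, right_pos=18)
    print(insert, len(insert))
```

For test1 this leaves the 9 x characters between position 9 and position 18, in line with the SEQ_LEN of 9 in the clipped record.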
              Last edited by maasha; 03-06-2013, 12:56 AM.



              • #8
                clip off vector border sequence

                Thanks Martin!
                That's what I was trying. Unfortunately I ran into a problem with the Biopieces installation related to Ruby. I have not yet sorted it out on my Ubuntu system, and I have posted it in the Google group. I would appreciate it if you could have a look at it and give some suggestions.
                Thanks a lot again!

                YT
                Last edited by yifangt; 03-04-2013, 07:20 AM.



                • #9
                  Hi Martin!

                  An update on removing vector sequences. Two things I realized need attention:
                  1) the -f and -r arguments for the adaptor sequences of the other strand should be swapped relative to your last reply, as the sequences are reverse complemented, i.e. the second find_adaptor command should be:
                  Code:
                  read_fasta -i test.fna | find_adaptor -f TGGCCAATT -r TGCTAGCACTAG | find_adaptor -r ACCGGTTAA -f ACGATCGTGATC
                  2) there seem to be bugs with adaptor combinations, e.g. seq14, the combination of seq1 and seq4, for which the adaptors should be trimmed off. They were detected, but not clipped.
                  The same seems to happen if the adaptor sequence is right at the end of the read, see >seq03_head_last.
                  An example of what I did is:
                  Code:
                  >seq01
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxxxxxxxx
                  >seq02
                  XXXXX222XXXXXXXXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq03
                  GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXXXX
                  >seq04
                  XXX4444XXXXXXXXXXXXXXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq12
                  AGTCGACCTGCAGGCATGCAAGCTTxxx111XX222XXXXXXXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq34
                  GTGACACTATAGAATACTCAAGCTTXXX333XXX4444XXXXXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq13
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxxxXXX333XXXXXXXXXXGTGACACTATAGAATACTCAAGCTTxxxxx333
                  >seq14
                  AGTCGACCTGCAGGCATGCAAGCTTxxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXXXGCATGCCTGCAGGTCGACTCTAGAG 
                  >seq32
                  GTGACACTATAGAATACTCAAGCTTXXX333XXXXXXXXXXXXXXXX222XXXXXXXXXCTATAGTGTCACCTAAATAGCTTGG
                  >seq20
                  xxxxxxxxxxxCTATAGTGTCACCTAAATAGCTTGGXXXXXXX222XXXXXXXXXXXXX
                  >seq03_head_last
                  XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
                  >seq03_head_last_n_tail
                  XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTTXXxxxx3xxtailXXXXXXXXX
                  Code:
                  read_fasta -i demo_seq.fa | find_adaptor -f AGTCGACCTGCAGGCATGCAAGCTT -r CTATAGTGTCACCTAAATAGCTTGG | find_adaptor -f GTGACACTATAGAATACTCAAGCTT -r GCATGCCTGCAGGTCGACTCTAGAG  | clip_adaptor
                  The output is:
                  Code:
                  SEQ_NAME: seq01
                  SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
                  SEQ_LEN: 29
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ---
                  SEQ_NAME: seq02
                  SEQ: XXXXX222XXXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 27
                  ADAPTOR_POS_RIGHT: 27
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq03
                  SEQ: XXX333XXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 24
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq04
                  SEQ: XXX4444XXXXXXXXXXXXXXXXXXXXXXXXX
                  SEQ_LEN: 32
                  ADAPTOR_POS_RIGHT: 32
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq12
                  SEQ: xxx111XX222XXXXXXXXXXXXX
                  SEQ_LEN: 24
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ADAPTOR_POS_RIGHT: 49
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq34
                  SEQ: XXX333XXX4444XXXXXXXXXXXXX
                  SEQ_LEN: 26
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ADAPTOR_POS_RIGHT: 51
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq13
                  SEQ: xxxxx333
                  SEQ_LEN: 8
                  ADAPTOR_POS_LEFT: 62
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq14
                  SEQ: xxxxxxx111xxxxxxxxxxXXX4444XXXXXXXXXX
                  SEQ_LEN: 37
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
                  ADAPTOR_POS_RIGHT: 62
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XGCATGCCTGCAGGTCGACTCTAGAG
                  ---
                  SEQ_NAME: seq32
                  SEQ: XXX333XXXXXXXXXXXXXXXX222XXXXXXXX
                  SEQ_LEN: 33
                  ADAPTOR_POS_RIGHT: 58
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: XCTATAGTGTCACCTAAATAGCTTGG
                  ADAPTOR_POS_LEFT: 0
                  ADAPTOR_LEN_LEFT: 25
                  ADAPTOR_PAT_LEFT: GTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq20
                  SEQ: xxxxxxxxxx
                  SEQ_LEN: 10
                  ADAPTOR_POS_RIGHT: 10
                  ADAPTOR_LEN_RIGHT: 26
                  ADAPTOR_PAT_RIGHT: xCTATAGTGTCACCTAAATAGCTTGG
                  ---
                  SEQ_NAME: seq03_head_last
                  SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
                  SEQ_LEN: 36
                  ADAPTOR_POS_LEFT: 10
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  SEQ_NAME: seq03_head_last_n_tail
                  SEQ: XXxxxx3xxtailXXXXXXXXX
                  SEQ_LEN: 22
                  ADAPTOR_POS_LEFT: 10
                  ADAPTOR_LEN_LEFT: 26
                  ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
                  ---
                  You can see that the sequence
                  Code:
                  >seq03_head_last
                  should have been clipped to an empty sequence, as the adaptor is at the end. However, the result is correct if extra sequence is attached to the end, cf.
                  Code:
                  seq03_head_last_n_tail
                  Did I miss anything with that? Thanks!
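One way to spot such cases automatically is to parse the record output and flag entries where a detected adaptor pattern is still present in SEQ after clip_adaptor. This is only a minimal sketch of that check; it assumes the "KEY: value" lines and "---" separators shown above, and both function names are made up for illustration.

```python
def parse_records(text):
    """Parse 'KEY: value' lines into dicts, one per '---'-terminated record."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "---":
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    if current:
        records.append(current)
    return records


def still_contains_adaptor(records):
    """Return (SEQ_NAME, field) pairs where the matched pattern remains in SEQ."""
    flagged = []
    for rec in records:
        seq = rec.get("SEQ", "")
        for key in ("ADAPTOR_PAT_LEFT", "ADAPTOR_PAT_RIGHT"):
            pattern = rec.get(key)
            if pattern and pattern in seq:
                flagged.append((rec.get("SEQ_NAME"), key))
    return flagged


if __name__ == "__main__":
    # Two records copied from the output above.
    sample = """SEQ_NAME: seq01
SEQ: xxxxxxx111xxxxxxxxxxxxxxxxxxx
SEQ_LEN: 29
ADAPTOR_POS_LEFT: 0
ADAPTOR_LEN_LEFT: 25
ADAPTOR_PAT_LEFT: AGTCGACCTGCAGGCATGCAAGCTT
---
SEQ_NAME: seq03_head_last
SEQ: XXXXXXXXXXXGTGACACTATAGAATACTCAAGCTT
SEQ_LEN: 36
ADAPTOR_POS_LEFT: 10
ADAPTOR_LEN_LEFT: 26
ADAPTOR_PAT_LEFT: XGTGACACTATAGAATACTCAAGCTT
---
"""
    print(still_contains_adaptor(parse_records(sample)))
```

On these two records only seq03_head_last is flagged, since its matched pattern is still part of SEQ.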
                  Last edited by yifangt; 03-04-2013, 12:41 PM.



                  • #10
                    Thanks yifangt, I will post this to the Biopieces Google Group and answer there.

