
  • Trim barcode off

    Hi,
    I have a sample ID file, split out by barcode, as shown below, and I would like to trim the barcode off the reads in this file. The sequences are from Illumina, produced with the ddRADseq method.

    @FCC1LPDACXX:1:1101:1478:2239#GTNNNNTT/1
    TGACGCCATGCAGGCGATGAATGTGGAATATGATGAATCTTTCCTGGAGTGGCTTGAAATAATATTGCAGAATGCCTCTGAATACTGGCCTGCTCTTATTCATACGCGCGGTTTTTCCCGTACAACCCTATGGCAGTGCAACCAGCAGTGCAATCATGTCATTAGCTCATCAGTTTAGAATAGATGTCCAAAAAGGATAT
    +
    bbbeeeeegggegiiifiihiiiggihdffhidfihhh[cgffhfghfheghhhhYG__\bdd_\db`ggd_^_VZ]_bYZ``]_Z`caaY[TY]KYTZ`^a_eeeeegOO[bfhhhgihefhiighhihihiiihgihiiiiiggfgeeeeeeddddcdddddcccccccccccddccbbcbcdcc`bcccaccccbcb


    Can anyone suggest how I can trim the barcode off? Thanks

  • #3
    It seems you used an external barcode; it is therefore not part of your sequence but part of the read ID (GTNNNNTT).
    AFAIK, having the barcode in the read ID does not cause any trouble.
    If you still want to get rid of it, use awk to trim the ID line, i.e. the first line of every four-line record.
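
For anyone who prefers Python to awk, here is a minimal sketch of the same idea. It assumes the headers look like the example above ("@...#BARCODE/1"); the function names are made up for illustration, and other header layouts would need a different pattern.

```python
import re

# Assumption: the barcode follows a '#' in the header and is optionally
# followed by a '/1' or '/2' read-pair suffix, as in the example read.
BARCODE_RE = re.compile(r"#[ACGTN]+(?=/[12]$|$)")


def strip_barcode(header):
    """Remove the '#BARCODE' part of one FASTQ header, keeping any '/1' or '/2'."""
    return BARCODE_RE.sub("", header.rstrip("\n"))


def strip_barcodes_from_fastq(lines):
    """Yield FASTQ lines with the barcode removed from each header line.

    Only the first line of every four-line record is touched, so a '#'
    in a quality string is left alone.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield strip_barcode(line) if i % 4 == 0 else line
```

To process a file, something like `strip_barcodes_from_fastq(open("sample.fastq"))` would work, writing each yielded line to a new file; the file name is only a placeholder.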



    • #4
      Yes, Michael.Ante is right. My answer is wrong.
      Trimmomatic is generally used to remove the adapter sequences within the read sequence.
      In this case, the barcode was sequenced separately, and appears in the ID.
      I can't think of a good reason to want to remove it from the ID, but awk could be used to remove it, as Michael.Ante suggested.
      Last edited by blancha; 04-30-2014, 09:00 AM.



      • #5
        I hope that barcode (GTNNNNTT) represents some form of masking because if it truly has 4 N's then the sequence must look pretty ugly.



        • #6
          I was thinking the same thing.
          It is very odd. Half the bases in the barcode are Ns, yet there are no Ns in the sequence read below.
          It could be a form of masking, as you said, but I don't know what would be the point of the masking.



          • #7
            I assume 'N' in the barcode indicates a wildcard; in other words, all barcodes that start with GT and end with TT would be grouped together.



            • #8
              Originally posted by Michael.Ante
              It seems you used an external barcode; it is therefore not part of your sequence but part of the read ID (GTNNNNTT).
              AFAIK, having the barcode in the read ID does not cause any trouble.
              If you still want to get rid of it, use awk to trim the ID line, i.e. the first line of every four-line record.
              Thanks Michael.Ante. But, how do I know that the barcode is not present in the sequence?



              • #9
                Originally posted by shis
                Thanks Michael.Ante. But, how do I know that the barcode is not present in the sequence?
                I suppose you are referring to adapters (and not barcodes)? In Illumina technology, barcode/tag reads are read independently and are never part of the actual sequence read.



                • #10
                  If it's RAD-Seq data, it could well have both adapters and barcodes in the reads.

                  Has the data already been demultiplexed to separate the reads into different files by barcode?

                  A barcode with several Ns in it suggests that the Illumina index read did not go very well, and you can't really assign that read to a particular barcode.

                  I'm not sure about ddRAD-Seq, but in RAD-Seq data you expect to see an MID (multiplex identifier) and the restriction enzyme site at the start of the read.

                  At the 5' end of the read, bases 9-14, 'TGCAGG', could be the restriction site for SbfI, one of the enzymes often used in RAD-Seq.
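
A quick way to test this across a whole file is to check each read for the expected site at the expected position. This is only a sketch: the function names are hypothetical, and the defaults (site 'TGCAGG' starting at base 9, i.e. after an 8-base MID) are assumptions taken from the single example read in this thread.

```python
def has_site_at(seq, site="TGCAGG", start=9):
    """Return True if `site` begins at 1-based position `start` of `seq`."""
    begin = start - 1  # convert to 0-based indexing
    return seq[begin:begin + len(site)] == site


def fraction_with_site(reads, site="TGCAGG", start=9):
    """Fraction of reads carrying `site` at position `start` (0.0 if no reads)."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(has_site_at(r, site, start) for r in reads)
    return hits / len(reads)
```

If most reads in a demultiplexed file show the site at the same offset, the bases in front of it are probably an inline MID that still needs trimming.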
                  Last edited by mastal; 04-30-2014, 12:22 PM.



                  • #11
                    Originally posted by mastal
                    If it's RAD-Seq data, it could well have both adapters and barcodes in the reads.

                    Has the data already been demultiplexed to separate the reads into different files by barcode?

                    A barcode with several Ns in it suggests that the Illumina index read did not go very well, and you can't really assign that read to a particular barcode.

                    I'm not sure about ddRAD-Seq, but in RAD-Seq data you expect to see an MID (multiplex identifier) and the restriction enzyme site at the start of the read.

                    At the 5' end of the read, bases 9-14, 'TGCAGG', could be the restriction site for SbfI, one of the enzymes often used in RAD-Seq.
                    Yes, the data has already been demultiplexed into sample ID files based on barcode.



                    • #12
                      Yes, the data has already been demultiplexed into sample ID files based on barcode.
                      In that case, the barcode almost never appears in the read sequence.

                      Thanks Michael.Ante. But, how do I know that the barcode is not present in the sequence?
                      Just make a FastQC report from the demultiplexed libraries. There you can check the "per base sequence content". If a barcode (e.g. GTNNNNTT) were still present, you would observe this sequence at the start of the reads:
                      a 'G' at position 1 and a 'T' at positions 2, 7 and 8.
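
If FastQC is not at hand, the same check can be roughed out in a few lines of Python. This is a sketch of the idea behind the "per base sequence content" plot, not FastQC's own implementation; the function name and the choice of 8 positions are assumptions for illustration.

```python
from collections import Counter


def per_base_content(reads, n_positions=8):
    """Return, for each of the first `n_positions`, a dict base -> fraction of reads."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        # Reads shorter than n_positions simply contribute fewer positions.
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: n / total for b, n in c.items()} if total else {})
    return fractions
```

A leftover GTNNNNTT barcode would show up as a fraction close to 1.0 for 'G' at the first position and for 'T' at the second, seventh and eighth, with mixed bases in between.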
