  • HeidiLee
    Member
    • Jul 2011
    • 20

    Split SFF file by Adaptors

    Hi All,

    I was assigned the task of splitting an SFF file into a number of adapter-specific SFF files.
    If I read the SFF file into R as an SFFContainer, the reads slot looks like:

    ----------------------------------------------------------------
    A QualityScaledDNAStringSet instance containing:

    A DNAStringSet instance of length 377894
    width seq names
    [1] 211 TCAGAAGAGGATTCGATCTCG...GCCAAGCACACAGGGATAGG G2FU2:4:10
    [2] 80 TCAGAAGAGGATTCGATTATA...TTCTCTCTCACAAGTTACAC G2FU2:4:47
    [3] 46 TCAGAAGAGGATTCGTCTGCT...GTTGTCTTCTCTAAAATGCT G2FU2:4:49
    [4] 180 TCAGTAAGGAGAACGATAGGC...GCCAAGGCAGACAGGGATAG G2FU2:5:15
    [5] 133 TCAGCTAAGGTAACGATCTGA...TGTGTACATATCATGAGAGT G2FU2:5:16
    [6] 65 TCAGCTAAGGTAACGATATTT...GTCATTCAAATGTCAAGTGA G2FU2:5:48
    [7] 72 TCAGCTAAGGTAACGATGATC...TTAAGAAGTAAAATATAATA G2FU2:7:47
    [8] 36 TCAGTAAGGAGAACGATTAGGTAACTTAATAAAAAT G2FU2:8:47
    [9] 50 TCAGCAAGGTAACGTTGATAT...ACTGAGATACTTATCTTATT G2FU2:8:49
    ... ... ...
    [377886] 296 TCAGTAAGGAGAACGATCTTT...GCACAGACGGGAAGGTAGAG G2FU2:1146:1271
    [377887] 292 TCAGTAAGGAGAACGATGACT...CAGCAGCACAGAGGCGAGAG G2FU2:1146:1272
    [377888] 191 TCAGTAAGGAGAACGATACTC...CAAGGCACACAGGGGATAGG G2FU2:1147:1252
    [377889] 287 TCAGCAGAAGGAACGATGATC...AGAGCGAGCAAGCAGACAGG G2FU2:1147:1254
    [377890] 292 TCAGTAAGGAGAACGATATCG...CTACTCGAGGAGACAGGTAG G2FU2:1147:1258
    [377891] 281 TCAGCAGAAGGAACGATCGTC...GCGAAGGCAGCACAGGAGTA G2FU2:1147:1262
    [377892] 274 TCAGCTAAGGTAACGATCAAA...CCGATGCCCATAGAGTGCAG G2FU2:1147:1269
    [377893] 283 TCAGCTAAGGTAACGATGACT...CAAGGCACACAGGGAGTAGG G2FU2:1147:1271
    [377894] 301 TCAGCTAAGGTAACGATATTC...AGACACGGAGGTAGAGTGTA G2FU2:1147:1274

    A PhredQuality instance of length 377894
    width seq names
    [1] 211 AAAAA:>;>382(16549@00...+4.4&+++*11,,0%**33. G2FU2:4:10
    [2] 80 @7==B=@@@>?>37<7714:8...-(*-***-**--(*-(*--/ G2FU2:4:47
    [3] 46 A3225.000/13-21/00---...**&-**-&--***--&**-1 G2FU2:4:49
    [4] 180 BBCCCC>B>BCC>BBBBBC>C....-&++/+0...1235,33// G2FU2:5:15
    [5] 133 >3300,0+1(--(01110000...*********-*1**-**-*- G2FU2:5:16
    [6] 65 >59::585:28<2;9456:<....-*-**(*--%***-.(*-*2 G2FU2:5:48
    [7] 72 ;222313/3-00(01/0*--*...*%-(*-(***--%--*-(-- G2FU2:7:47
    [8] 36 @7==>A:>9>>>7<757.21,0/-//%-(+/224)2 G2FU2:8:47
    [9] 50 ;000-*&-&--(,-*&**--0...0***-----*-&-*-*&**. G2FU2:8:49
    ... ... ...
    [377886] 296 B@@>::/929552<@::188)...+1..,,,**-%++/+***** G2FU2:1146:1271
    [377887] 292 EEEEDD?C?CDD>CCCCCCCC...***,/1****,,.&****** G2FU2:1146:1272
    [377888] 191 DDDCBC>C>[email protected];2:28?A::;BEE.?;84, G2FU2:1147:1252
    [377889] 287 @668@CCC9C?C?DDDECCCE...0//,0****++,,*****1- G2FU2:1147:1254
    [377890] 292 DDDEDD@D>[email protected],,,012,1,,,,4&+++ G2FU2:1147:1258
    [377891] 281 @?AAA@@@:A:><>@@>>>;:...3++.&++/41++++3.1/++ G2FU2:1147:1262
    [377892] 274 CCCCCCC?C:>>7A;=@<<;,...8&++7758,+**+++****1 G2FU2:1147:1269
    [377893] 283 AAAAA>A9A57<14;66.24=...-,5;46,,,+,8;/+++34, G2FU2:1147:1271
    [377894] 301 @@@>=9<6<*+02657631+2...*+*++11,+*%*.******1 G2FU2:1147:1274
    ------------------------------------------------------------------

    The adapters look like this (96 adapters in total):
    AdaptName AdaptSeq
    1 IonXpress_001 CTAAGGTAAC
    2 IonXpress_002 TAAGGAGAAC
    3 IonXpress_003 AAGAGGATTC
    4 IonXpress_004 TACCAAGATC
    5 IonXpress_005 CAGAAGGAAC
    6 IonXpress_006 CTGCAAGTTC
    7 IonXpress_007 TTCGTGATTC
    8 IonXpress_008 TTCCGATAAC
    9 IonXpress_009 TGAGCGGAAC
    10 IonXpress_010 CTGACCGAAC
    ............................

    Could anyone please tell me what "adapter-specific SFF files" means?
    To classify each read by adapter, should I align all of the adapters against each sequence, something like the following?


    TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG - sequence
    TCGTATGCCG (scan all positions until the end) - (m=2, i=1, d=1)
    TCGTATGCC - (m=2, i=1, d=0)
    TCGTATGC - (m=2, i=1, d=0)
    TCGTATG - (m=1, i=1, d=0)
    TCGTAT - (m=1, i=1, d=0)
    TCGTA - (m=1, i=1, d=0)
    TCGT - (m=1, i=0, d=0)
    TCG - (m=1, i=0, d=0) -> match!

    Or should I find the specific adapter for each read using the functions on page 15?


    Should I trim each sequence to the span from the clipAdapterLeft position to the clipAdapterRight position before doing any alignment or other work?

    Thank you very much in advance.

    Best,

    Heidi
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    Heidi,

    Looking at your read data, you can see that all of the reads start with the four bases "TCAG", which is known as the 'keytag'; this tells the software that the read came from a library fragment (as opposed to a control fragment, none of which are in your data set). Following the keytag is the Multiplex ID (MID) sequence, which corresponds to one of your 96 adapter sequences. "Adapter specific SFF files" means parsing the reads in your input file, identifying their MID tags, and sorting them into new output SFF files according to their MID.
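To make the keytag-plus-MID layout concrete, here is a minimal sketch in plain Python (not R, and not the Roche tool) that classifies a read by exact match of the barcode immediately after the key. The function name `assign_barcode` is made up for illustration, only the first five of the 96 barcodes from the post are included, and exact matching with no tolerance for sequencing errors is an assumption.

```python
KEY = "TCAG"  # key sequence seen at the start of every read in the post

# First five of the 96 barcodes listed in the original post.
BARCODES = {
    "IonXpress_001": "CTAAGGTAAC",
    "IonXpress_002": "TAAGGAGAAC",
    "IonXpress_003": "AAGAGGATTC",
    "IonXpress_004": "TACCAAGATC",
    "IonXpress_005": "CAGAAGGAAC",
}


def assign_barcode(read, barcodes=BARCODES, key=KEY):
    """Return the barcode name whose sequence directly follows the key, or None."""
    if not read.startswith(key):
        return None  # no key: not a library read under this assumption
    after_key = read[len(key):]
    for name, seq in barcodes.items():
        if after_key.startswith(seq):
            return name
    return None  # key present, but no exact barcode match


if __name__ == "__main__":
    # Read prefixes copied from the post (reads [1], [4] and [9]).
    for prefix in ("TCAGAAGAGGATTCGATCTCG",
                   "TCAGTAAGGAGAACGATAGGC",
                   "TCAGCAAGGTAACGTTGATAT"):
        print(prefix, "->", assign_barcode(prefix))
```

Read [9] from the post comes back as `None` here because the ten bases after its key do not exactly equal any of the five barcodes; whether such reads should be rescued with an error-tolerant match is a separate decision.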

    There are a handful of tools available to split SFF files by barcode, but I recommend getting the Roche/454 software. It is available for free, but you have to submit a request through their website. Specifically, the program you want to use is called 'sfffile'. This tool can (among other things) read an SFF file and a MID configuration file and output a set of MID (adapter) specific SFF files. Judging by the names of your MID tags (IonXpress_nnn), it would appear that this data was generated on a Life Technologies Ion Torrent instrument, not a Roche/454, but no matter: the SFF format is the same and Roche's software should be able to split the reads. However, since the MID tag set is not the default Roche/454 tag set, you will need to create a custom MIDConfig.parse file for use with the sfffile program. There are instructions for doing this in the documentation which accompanies the software, and you can use the default MIDConfig.parse file as a template.

    Good luck.


    • HeidiLee
      Member
      • Jul 2011
      • 20

      #3
      Thanks, kmcarr, for your explanation. I have a better understanding of SFF files now.
      I am more familiar with R and Bioconductor, so I am trying to do this with a Bioconductor package.
      I was able to read the big SFF file into a single SFFContainer with the function readSFF from the Bioconductor package R453Plus1Toolbox. I also found the indexes needed to split the big SFFContainer. For example, the indexes for the first small SFFContainer are (5,11,17,22,25,29,31,33,37).
      What I couldn't figure out is how to extract reads (5,11,17,22,25,29,31,33,37) from the big SFFContainer, construct a small SFFContainer from them, and save it as an SFF file. Could anyone please help me with this part of the work?
      Thank you very much in advance.

      Heidi
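The extraction step asked about here is, at its core, a subset by 1-based position that keeps each read's name, bases and qualities together. A minimal Python sketch of that logic follows; it is not the R453Plus1Toolbox API, the `Read` record and `subset_reads` helper are hypothetical names, and the records carry only the name and width shown in the post because the full sequences are truncated there. The index vector (5, 6, 7) is chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Stand-in for one SFF record; a real one would also hold bases, qualities, flows."""
    name: str
    width: int


# Names and widths of the first nine reads as printed in the post.
READS = [
    Read("G2FU2:4:10", 211), Read("G2FU2:4:47", 80), Read("G2FU2:4:49", 46),
    Read("G2FU2:5:15", 180), Read("G2FU2:5:16", 133), Read("G2FU2:5:48", 65),
    Read("G2FU2:7:47", 72), Read("G2FU2:8:47", 36), Read("G2FU2:8:49", 50),
]


def subset_reads(reads, one_based_indexes):
    """Return the reads at the given 1-based positions, in the order requested."""
    n = len(reads)
    picked = []
    for i in one_based_indexes:
        if not 1 <= i <= n:
            raise IndexError(f"read index {i} is outside 1..{n}")
        picked.append(reads[i - 1])  # convert R-style 1-based index to 0-based
    return picked


if __name__ == "__main__":
    for read in subset_reads(READS, [5, 6, 7]):
        print(read.name, read.width)
```

Keeping whole records together, rather than subsetting the sequence and quality sets separately, avoids the two drifting out of step; writing the subset back out as SFF is a separate step not covered by this sketch.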


      • HeidiLee
        Member
        • Jul 2011
        • 20

        #4
        Dear kmcarr,

        I think the reason I didn't pay much attention to your post last time is that it seemed to mean I wouldn't need to write any programs. The project I am working on is a kind of exam to assess my programming skills.
        I used R to identify the reads that can be assigned to a specific adapter, so for each adapter I have a list of read names. I have 377894 reads in total, but for most adapters there are only tens of reads, sometimes even fewer.
        Is this the usual case, or do you think I probably made a mistake?

        Thank you very much.

        Heidi
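One way to sanity-check counts like these is to tally barcode assignments on a small sample whose answer can be verified by eye. The Python sketch below does that for the 18 read prefixes printed in the first post, assuming the barcode sits immediately after the "TCAG" key and must match exactly; the `classify` helper is a made-up name and only the first five barcodes are included.

```python
from collections import Counter

KEY = "TCAG"

# First five of the 96 barcodes listed in the first post.
BARCODES = {
    "IonXpress_001": "CTAAGGTAAC",
    "IonXpress_002": "TAAGGAGAAC",
    "IonXpress_003": "AAGAGGATTC",
    "IonXpress_004": "TACCAAGATC",
    "IonXpress_005": "CAGAAGGAAC",
}

# Leading bases of the 18 reads shown in the first post, in display order.
READ_PREFIXES = [
    "TCAGAAGAGGATTCGATCTCG", "TCAGAAGAGGATTCGATTATA", "TCAGAAGAGGATTCGTCTGCT",
    "TCAGTAAGGAGAACGATAGGC", "TCAGCTAAGGTAACGATCTGA", "TCAGCTAAGGTAACGATATTT",
    "TCAGCTAAGGTAACGATGATC", "TCAGTAAGGAGAACGATTAGG", "TCAGCAAGGTAACGTTGATAT",
    "TCAGTAAGGAGAACGATCTTT", "TCAGTAAGGAGAACGATGACT", "TCAGTAAGGAGAACGATACTC",
    "TCAGCAGAAGGAACGATGATC", "TCAGTAAGGAGAACGATATCG", "TCAGCAGAAGGAACGATCGTC",
    "TCAGCTAAGGTAACGATCAAA", "TCAGCTAAGGTAACGATGACT", "TCAGCTAAGGTAACGATATTC",
]


def classify(prefix):
    """Name of the barcode that directly follows the key, else 'unassigned'."""
    if not prefix.startswith(KEY):
        return "unassigned"
    after_key = prefix[len(KEY):]
    for name, seq in BARCODES.items():
        if after_key.startswith(seq):
            return name
    return "unassigned"


counts = Counter(classify(p) for p in READ_PREFIXES)

if __name__ == "__main__":
    for name, n in sorted(counts.items()):
        print(f"{name}\t{n}")
```

On this sample, 17 of the 18 reads are assigned, all to IonXpress_001, 002, 003 or 005. The sample is tiny and not random, so it says nothing certain about the full file, but if a run over all 377894 reads assigns only tens of reads to these same barcodes, the matching step (for example, where in the read the barcode is searched for) is worth rechecking.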
