  • Retrieve MiSeq Data still containing index primers etc

    Hi all

    We have developed a pipeline that currently works on both 454 and Ion Torrent data.

    The pipeline is always run on multiplexed data. The sequence input is currently a single fastq file that contains all the information needed to demultiplex the data, and all subsequent analysis is run separately on the data from each MID.

    Collaborators have now generated similar data using an Illumina MiSeq. However, when they sent us the data, we saw that it was already demultiplexed, with the tags etc. stripped.

    What I want to know is: is there any way to create a single fastq/sff/etc. file from the output data of a MiSeq run (or during data generation) that still has the MIDs etc. on the data and holds all the data together in one file?

    I've done extensive reading on this, and it seems that the best way to do this is to convert the multiple .bcl files to fastq?

    Is there a better/easier way to do this?

    Thanks!

  • #2
    One can potentially create a single file from a multiplexed run by running the CASAVA pipeline with a single barcode like (NNNNNN-NNNNN). That way all the data ends up in the "undetermined" file, with all tag information intact in the header. You would need to write a script to demultiplex this data or reformat it the way your pipeline expects. This could potentially work on the MiSeq itself (for analysis), though we have not tried it.
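    The kind of script mentioned above could look roughly like the following. This is only a minimal sketch: it assumes 4-line FASTQ records and that the index sequence is the last colon-separated field of the header comment (e.g. `1:N:0:ACGTAC`). The function names are made up for illustration and are not part of CASAVA or any other tool.

```python
from collections import defaultdict


def index_from_header(header):
    """Return the last colon-separated field of the header comment.

    Assumes a header of the form '@<read id> <read>:<filtered>:<control>:<index>'.
    """
    parts = header.strip().split(" ", 1)
    if len(parts) != 2:
        raise ValueError("no comment field in header: %r" % header)
    return parts[1].split(":")[-1]


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        yield header, seq, plus, qual


def group_by_index(lines):
    """Group FASTQ records by the index field found in each header."""
    groups = defaultdict(list)
    for record in read_fastq(lines):
        groups[index_from_header(record[0])].append(record)
    return dict(groups)
```

    Passing an open file handle (or any iterable of lines) to `group_by_index` gives a dictionary keyed by index, which can then be written out per sample or reformatted into whatever a downstream pipeline expects. For large runs it would be better to write records out as they are read rather than hold them all in memory.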

    If you are looking to get the tag reads in a separate file (e.g. for Qiime), you can reanalyze the data using the MiSeq Reporter after making a change to the "MiSeqReporterConfig" file. Unless you own the MiSeq (or have direct access to it), this may not be an option for you.
    Last edited by GenoMax; 06-05-2014, 05:21 AM.


    • #3
      Thanks for your reply, very helpful!

      I hadn't realised that it was possible to output the index information in the header. In the data I've just received, the last element of the descriptor is a number rather than the index sequence. This may be related to the fact that both the i5 index and i7 index were used?

      The number appears to correspond to the order in which the sequences are listed in the SampleSheet.csv file:

      e.g.
      1st @M02143:21:000000000-A8YDD:1:1101:16271:1876 2:N:0:1 used N701 and S501
      7th @M02143:21:000000000-A8YDD:1:1101:18009:1813 1:N:0:7 used N701 and S502

      So that would mean that we could use the SampleSheet.csv file to 'demultiplex' the data if people are uploading all the data in a single fastq file (or two in the case of paired-end sequencing).

      Thanks again, you helped me find the missing clue to solve the puzzle!!


      • #4
        I always forget that the actual sequence tag information is included in the header ONLY if the data was processed offline by CASAVA (i.e. not on the MiSeq). So for the first option I mentioned, the analysis has to be done offline (it would not work on the MiSeq) to get the tag information in the read IDs.
        Last edited by GenoMax; 06-05-2014, 05:24 AM.
