  • SRF metadata

    What information is in the SRF meta-data other than read lengths and pairing info? Broad question I know but does srf_info print it all out? Is the information in the trace name in a certain format?
    e.g. in
    trace_name: IL20_1065:1:1: + 22:920 ... 499:175 x17333

    Do those numbers always mean something, or is it just a string that only means something for the study and file I am looking at?

  • #2
    The trace name doesn't have any explicit meaning defined in SRF, but names are typically generated automatically to ensure they are unique. Your Illumina example consists of machine name and run number, followed by lane, tile and x/y coordinates, in that order.

    SRF does in theory also have an XML section for run meta-data. The intention was that this would be the SRF equivalent of the TraceInfo.xml that went alongside the old tar-balls in capillary trace submissions to NCBI et al. However, NCBI's new SRA finally ended up with something like 5 separate XML schemas, with no overall hierarchy that would allow them to be embedded as a single XML object in the SRF file. I think for other practical reasons too people wanted to submit their metadata separately (e.g. before the bulk of the data gets uploaded).
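
To make that naming convention concrete, here is a minimal Python sketch that splits a name of the form `IL22_4100:4:1:0:193` (as printed by srf_list later in this thread) into its parts. SRF itself does not define this layout, so the field order is an assumption that only holds for names generated this way; `IlluminaReadName` and `parse_trace_name` are hypothetical names, not part of io_lib.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    """Fields assumed to be encoded in an auto-generated Illumina trace name."""
    machine: str
    run: int
    lane: int
    tile: int
    x: int
    y: int


def parse_trace_name(name: str) -> IlluminaReadName:
    """Split 'MACHINE_RUN:LANE:TILE:X:Y' into typed fields.

    This is a convention, not an SRF guarantee: a ValueError is raised
    if the name does not have exactly five colon-separated parts or if
    a numeric field is not an integer.
    """
    machine_run, lane, tile, x, y = name.split(":")
    # The run number follows the last underscore of the first field.
    machine, run = machine_run.rsplit("_", 1)
    return IlluminaReadName(machine, int(run), int(lane), int(tile), int(x), int(y))
```

Names from other pipelines or other submitters may follow a different scheme, so treat a parse failure as "unknown format" rather than as a corrupt file.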

    James



    • #3
      One thing I forgot to mention - there are also machine/run specific data files that get added to the SRF file; often many times. (We really needed a 3-layer system rather than 2-layer so we could add data common to an entire run.)

      SRF is really just a container for ZTR trace files, in much the same way that tar and zip are containers for various formats. The ZTR format allows for various types of data, called chunks. These can be things like sequence, base qualities, trace peaks, as well as more nebulous things like "TEXT".

      The illumina2srf program embeds various XML config files for the instrument run in the TEXT chunks. There's no direct SRF tool that dumps these (except I guess for the srf2illumina reverse conversion). You can however extract a single sequence in ZTR format and then dump that. Using io_lib commands:

      jkb$ srf_list /fuse/mpsafs/runs/4100/4100_4.srf|head -4
      IL22_4100:4:1:0:193
      IL22_4100:4:1:0:467
      IL22_4100:4:1:0:585
      IL22_4100:4:1:0:612

      jkb$ srf_extract_linear /fuse/mpsafs/runs/4100/4100_4.srf IL22_4100:4:1:0:193 | get_comment

      (Edited for brevity)
      PROGRAM_ID=illumina2srf v2.0.0r72
      I2S_CMDLINE=/software/solexa/bin/illumina2srf -I -b -filter-bad-reads -bustard-dir ...
      ILLUMINA_GA_IPAR_NCLUSTERS=232975
      ILLUMINA_GA_MATRIX_FWD=# Auto-generated frequency response matrix
      > A
      > C
      > G
      > T
      1.41 0.05 -0.00 -0.00
      0.79 0.73 0.01 0.01
      -0.00 0.00 1.17 0.00
      -0.00 -0.00 0.65 0.87

      ILLUMINA_GA_MATRIX_FWD_FILENAME=Matrix/s_4_02_matrix.txt
      ILLUMINA_GA_MATRIX_REV=# Auto-generated frequency response matrix
      > A
      > C
      > G
      > T
      1.35 0.01 0.00 0.00
      0.66 0.62 0.01 0.01
      -0.00 0.00 1.21 0.00
      -0.00 -0.00 0.72 0.99

      ILLUMINA_GA_MATRIX_REV_FILENAME=Matrix/s_4_78_matrix.txt
      ILLUMINA_GA_PHASING_FWD=<Parameters>
      <Phasing>0.006000</Phasing>
      <Prephasing>0.002800</Prephasing>
      </Parameters>

      ILLUMINA_GA_PHASING_FWD_FILENAME=Phasing/s_4_01_phasing.xml
      ILLUMINA_GA_PHASING_REV=<Parameters>
      <Phasing>0.005900</Phasing>
      <Prephasing>0.002100</Prephasing>
      </Parameters>

      ILLUMINA_GA_PHASING_REV_FILENAME=Phasing/s_4_77_phasing.xml
      ILLUMINA_GA_BUSTARD_CONFIG=<?xml version="1.0"?>
      <BaseCallAnalysis>
      <Run Name="Bustard1.5.1_01-12-2009_RTA">
      <BaseCallParameters>
      <ChastityThreshold>0.600000</ChastityThreshold>
      <Matrix Path="">
      ...

      ILLUMINA_GA_BUSTARD_SUMMARY=<?xml version="1.0" ?>
      <?xml-stylesheet type="text/xsl"
      href="BustardSummary.xsl" ?>

      <BustardSummary>
      ...

      ILLUMINA_GA_PIPELINE_VERSION=1.5.1
      ILLUMINA_GA_RAW_DATA_COMPRESSION=none
      ILLUMINA_GA_REBASECALL=1
      ILLUMINA_GA_RUN_FOLDER=091123_IL22_4100
      ILLUMINA_GA_FIRECREST_FOLDER=Intensities
      ILLUMINA_GA_BUSTARD_FOLDER=Bustard1.5.1_01-12-2009_RTA
      ILLUMINA_GA_FIRECREST_CONFIG=<?xml version="1.0" encoding="utf-8"?>
      <ImageAnalysis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Run Name="Intensities">
      <Cycles First="1" Last="152" Number="152" />
      <ImageParameters>
      ...

      etc
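
If you want to work with that get_comment output programmatically, a rough Python sketch along these lines may help. It assumes the layout seen above: each entry starts with an upper-case `KEY=` at the beginning of a line, and any following lines belong to the same value until the next key. That is a heuristic based on this one dump, not a documented format, and `parse_comments` is a hypothetical helper rather than an io_lib function.

```python
import re
from typing import Dict, Optional

# A new entry is assumed to start with an upper-case identifier followed by '='.
KEY_LINE = re.compile(r"^([A-Z][A-Z0-9_]*)=(.*)$")


def parse_comments(text: str) -> Dict[str, str]:
    """Group get_comment-style output into a {key: value} dictionary.

    Lines that do not look like 'KEY=...' are treated as continuation
    lines of the most recent key (e.g. embedded XML or matrix rows).
    """
    comments: Dict[str, str] = {}
    key: Optional[str] = None
    for line in text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            key = match.group(1)
            comments[key] = match.group(2)
        elif key is not None:
            comments[key] += "\n" + line
    # Drop the blank separator lines that trail multi-line values.
    return {k: v.strip() for k, v in comments.items()}


sample = """PROGRAM_ID=illumina2srf v2.0.0r72
ILLUMINA_GA_PHASING_FWD=<Parameters>
<Phasing>0.006000</Phasing>
<Prephasing>0.002800</Prephasing>
</Parameters>

ILLUMINA_GA_PIPELINE_VERSION=1.5.1
"""

parsed = parse_comments(sample)
print(parsed["ILLUMINA_GA_PIPELINE_VERSION"])
```

The heuristic would mis-split a value that itself contains a line starting with an upper-case identifier and `=`, so check the result against the raw dump before relying on it.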

      You could also save the output of srf_extract_linear or srf_extract_hash to a file and run trace_dump on it to get the full data, including bases, qualities, etc.

      I believe most (all?) of the TEXT segments of the ZTRs are lost on import to SRA, though. It's certainly arguable how useful all the XML config files for the instrument runs are (although they're *tiny* compared to the actual data), but I think the matrix files are perhaps of use to researchers, as they explain a lot of the manipulation that took place on the data.
