  • SRF metadata

    What information is in the SRF metadata other than read lengths and pairing info? It's a broad question, I know, but does srf_info print it all out? Does the information in the trace name follow a fixed format?
    e.g. in
    trace_name: IL20_1065:1:1: + 22:920 ... 499:175 x17333

    Do those numbers always mean something, or is it just a string whose parts only mean something for the study and file I am looking at?

  • #2
    The trace name doesn't have any explicit meaning defined in SRF, but typically the names are automatically generated to ensure they are unique. Your Illumina example consists of machine name and run number, followed by lane, tile and x/y coordinates, in that order.
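To make that convention concrete, here is a minimal Python sketch that splits an Illumina-style name into its fields. It assumes the layout machine_run:lane:tile:x:y (as in a name such as IL22_4100:4:1:0:193); SRF does not guarantee this, and the class and function names are purely illustrative.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    machine: str
    run: int
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> IlluminaReadName:
    """Split a name like 'IL22_4100:4:1:0:193' into its fields.

    Assumes the convention machine_run:lane:tile:x:y. SRF itself does not
    define this layout, so anything else raises ValueError.
    """
    parts = name.split(":")
    if len(parts) != 5:
        raise ValueError(f"unexpected read name layout: {name!r}")
    # The first field holds machine name and run number joined by '_'.
    machine, sep, run = parts[0].rpartition("_")
    if not sep:
        raise ValueError(f"no machine_run prefix in: {name!r}")
    lane, tile, x, y = (int(p) for p in parts[1:])
    return IlluminaReadName(machine, int(run), lane, tile, x, y)
```

Names generated by other pipelines may not fit this pattern, so treat a ValueError as "unknown naming scheme" rather than as a corrupt file.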

    SRF does, in theory, also have an XML section for run metadata. The intention was that this would be the SRF equivalent of the TraceInfo.xml that went alongside the old tarballs in capillary trace submissions to NCBI et al. However, NCBI's new SRA finally ended up with something like 5 separate XML schemas and no overall hierarchy that could embed them as a single XML object in the SRF file. I think that, for other practical reasons too, people wanted to submit their metadata separately (e.g. before the bulk of the data gets uploaded).

    James

    • #3
      One thing I forgot to mention: there are also machine- and run-specific data files that get added to the SRF file, often many times over. (We really needed a 3-layer system rather than a 2-layer one, so that we could add data common to an entire run just once.)

      SRF is really just a container for ZTR trace files, in much the same way that tar and zip are containers for various formats. The ZTR format allows for various types of data, called chunks. These can be things like sequence, base qualities, trace peaks, as well as more nebulous things like "TEXT".
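As a rough illustration of the chunk idea, the following Python sketch walks the chunks of a single ZTR byte stream and reports each chunk's type. It assumes the layout from the ZTR specification as I understand it: an 8-byte magic number, two version bytes, then repeated chunks made of a 4-byte type, a big-endian 4-byte metadata length, the metadata, a big-endian 4-byte data length, and the data. The function name is hypothetical, and the sketch does not decode or decompress chunk payloads.

```python
import struct
from typing import Iterator, Tuple

ZTR_MAGIC = b"\xaeZTR\r\n\x1a\n"  # 8-byte magic number at the start of a ZTR stream


def iter_ztr_chunks(blob: bytes) -> Iterator[Tuple[str, bytes, bytes]]:
    """Yield (chunk_type, metadata, data) for each chunk in a ZTR byte string.

    Payloads are returned as stored; they may still be compressed.
    """
    if blob[:8] != ZTR_MAGIC:
        raise ValueError("not a ZTR stream (bad magic number)")
    pos = 10  # skip magic (8 bytes) plus major/minor version (1 byte each)
    while pos < len(blob):
        chunk_type = blob[pos:pos + 4].decode("ascii")
        (md_len,) = struct.unpack_from(">I", blob, pos + 4)
        md_start = pos + 8
        metadata = blob[md_start:md_start + md_len]
        (d_len,) = struct.unpack_from(">I", blob, md_start + md_len)
        d_start = md_start + md_len + 4
        data = blob[d_start:d_start + d_len]
        yield chunk_type, metadata, data
        pos = d_start + d_len
```

Running this over a ZTR file extracted from an SRF archive should list types such as BASE or TEXT, which gives a quick view of what a given read actually carries.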

      The illumina2srf program embeds various XML config files for the instrument run in the TEXT chunks. There's no SRF tool that dumps these directly (except, I guess, the srf2illumina reverse conversion). You can, however, extract a single sequence in ZTR format and then dump that. Using io_lib commands:

      jkb$ srf_list /fuse/mpsafs/runs/4100/4100_4.srf|head -4
      IL22_4100:4:1:0:193
      IL22_4100:4:1:0:467
      IL22_4100:4:1:0:585
      IL22_4100:4:1:0:612

      jkb$ srf_extract_linear /fuse/mpsafs/runs/4100/4100_4.srf IL22_4100:4:1:0:193 | get_comment

      (Edited for brevity)
      PROGRAM_ID=illumina2srf v2.0.0r72
      I2S_CMDLINE=/software/solexa/bin/illumina2srf -I -b -filter-bad-reads -bustard-dir ...
      ILLUMINA_GA_IPAR_NCLUSTERS=232975
      ILLUMINA_GA_MATRIX_FWD=# Auto-generated frequency response matrix
      > A
      > C
      > G
      > T
      1.41 0.05 -0.00 -0.00
      0.79 0.73 0.01 0.01
      -0.00 0.00 1.17 0.00
      -0.00 -0.00 0.65 0.87

      ILLUMINA_GA_MATRIX_FWD_FILENAME=Matrix/s_4_02_matrix.txt
      ILLUMINA_GA_MATRIX_REV=# Auto-generated frequency response matrix
      > A
      > C
      > G
      > T
      1.35 0.01 0.00 0.00
      0.66 0.62 0.01 0.01
      -0.00 0.00 1.21 0.00
      -0.00 -0.00 0.72 0.99

      ILLUMINA_GA_MATRIX_REV_FILENAME=Matrix/s_4_78_matrix.txt
      ILLUMINA_GA_PHASING_FWD=<Parameters>
      <Phasing>0.006000</Phasing>
      <Prephasing>0.002800</Prephasing>
      </Parameters>

      ILLUMINA_GA_PHASING_FWD_FILENAME=Phasing/s_4_01_phasing.xml
      ILLUMINA_GA_PHASING_REV=<Parameters>
      <Phasing>0.005900</Phasing>
      <Prephasing>0.002100</Prephasing>
      </Parameters>

      ILLUMINA_GA_PHASING_REV_FILENAME=Phasing/s_4_77_phasing.xml
      ILLUMINA_GA_BUSTARD_CONFIG=<?xml version="1.0"?>
      <BaseCallAnalysis>
      <Run Name="Bustard1.5.1_01-12-2009_RTA">
      <BaseCallParameters>
      <ChastityThreshold>0.600000</ChastityThreshold>
      <Matrix Path="">
      ...

      ILLUMINA_GA_BUSTARD_SUMMARY=<?xml version="1.0" ?>
      <?xml-stylesheet type="text/xsl"
      href="BustardSummary.xsl" ?>

      <BustardSummary>
      ...

      ILLUMINA_GA_PIPELINE_VERSION=1.5.1
      ILLUMINA_GA_RAW_DATA_COMPRESSION=none
      ILLUMINA_GA_REBASECALL=1
      ILLUMINA_GA_RUN_FOLDER=091123_IL22_4100
      ILLUMINA_GA_FIRECREST_FOLDER=Intensities
      ILLUMINA_GA_BUSTARD_FOLDER=Bustard1.5.1_01-12-2009_RTA
      ILLUMINA_GA_FIRECREST_CONFIG=<?xml version="1.0" encoding="utf-8"?>
      <ImageAnalysis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <Run Name="Intensities">
      <Cycles First="1" Last="152" Number="152" />
      <ImageParameters>
      ...

      etc
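If you want to work with these values programmatically, a small Python sketch along the following lines can turn the dumped text into a dictionary. It assumes that each entry starts with an upper-case KEY= at the beginning of a line and that any following lines belong to that entry; that pattern is inferred from the listing above rather than from a documented format, and the function name is illustrative.

```python
import re
from typing import Dict, Optional

# A new entry starts with an upper-case key followed by '='. This is an
# assumption based on the listing, not a documented get_comment format.
KEY_LINE = re.compile(r"^([A-Z][A-Z0-9_]*)=(.*)$")


def parse_comments(text: str) -> Dict[str, str]:
    """Group 'KEY=value' entries, keeping multi-line values together."""
    entries: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            current = match.group(1)
            entries[current] = match.group(2)
        elif current is not None:
            # Continuation line (matrix row, XML, blank separator, ...).
            entries[current] += "\n" + line
    # Drop the blank separator lines left at the end of multi-line values.
    return {key: value.strip() for key, value in entries.items()}
```

With this, something like parse_comments(dump)["ILLUMINA_GA_PIPELINE_VERSION"] gives the pipeline version, and the matrix or phasing entries come back as multi-line strings ready for further parsing.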

      You could also save the output of srf_extract_linear or srf_extract_hash to a file and run trace_dump on it to get the full data, including bases, qualities, etc.

      I believe most (all?) of the TEXT segment of the ZTRs is lost when the data is imported into SRA, though. It's certainly arguable how useful all the XML config files for the instrument runs are (although they're *tiny* compared to the actual data), but I think the matrix files are perhaps of use to researchers, as they explain a lot of the manipulation that took place on the data.
