  • Output file formats

    Hello,

    Could you please help me sort out the output file formats?
    What does each column mean, and which program (under Windows) can I use to read these files?

    Thank you very much!

    There are three output files.
    The first file has a name like s_2_0001_seq.txt and looks like this:

    2 1 912 885 TGGCAAGGAAAATAAAATCAAAAA
    2 1 901 884 TGGTACATATACACCATGAAATAT
    2 1 897 115 TGAAGGACCAGAGTGCCTGGACTT
    2 1 933 879 AAGGCAACAAAAAGAGACTCCATA
    2 1 888 104 TGGGACACATTTAAAGCAACGAGA
    2 1 920 116 AATCCAGAAGTGGGGGCCTGTGCA
    2 1 920 894 TCAAAACTGAAACACTTCCCATCA
    2 1 900 896 TGTCATCCTGAAGTGCAGTGGATA
    2 1 896 921 TTAGGAAAAAACAAAAAACAAAAA
    2 1 886 105 AGGGAAAATGGAAAAATAACAAAC
    2 1 876 955 TACCAAACATTTGAGGCAGAAATG


    The second file has a name like s_2_0001_sig2.txt and looks like this:


    2 1 912 885 2925.5 5978.4 913.5 6583.8 1032.0 1638.7 4854.6 221.5 109.3 1888.1 2405.1 2398.7 1271.8 4134.4 737.4 -222.2 2190.1 1825.4 679.2 -2.1 3953.1 277.9 391.4 334.6 1393.2 972.6 3596.5 1032.0 391.4 53.5 2777.0 -167.5 2737.7 277.9 913.5 109.3 2584.9 2398.7 334.6 109.3 2628.7 972.6 448.5 -57.4 2157.6 -705.0 165.3 1700.7 277.9 1091.6 448.5 1819.4 3357.7 -2.1 -2.1 1762.9 2325.5 277.9 1271.8 -222.2 1105.7 1032.0 109.3 109.3 2815.2 -112.5 391.4 1211.5 277.9 334.6 1151.4 1263.3 913.5 1464.4 505.9 165.3 2341.2 505.9 165.3 165.3 2221.7 -492.5 221.5 -112.5 1853.9 505.9 109.3 1393.2 492.8 165.3 448.5 165.3 1906.0 53.5 563.4 563.4

    The third file has a name like s_2_0001_prb.txt and looks like this:

    -40 -5 -40 5 -40 -40 40 -40 -40 -40 1 -1 -40 40 -40 -40 13 -13 -40 -40 40 -40 -40 -40 -40 -40 40 -40 -40 -40 40 -40 40 -40 -40 -40 6 -6 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 40 -40 -40 -40 2 -2 -40 -40 40 -40 -40 -40 -40 -40 -19 19 -19 19 -40 -40 40 -40 -40 -40 40 -40 -40 -40 18 -31 -40 -18 0 -10 -2 -22 40 -40 -40 -40
    -40 -40 -40 40 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 -40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40 -40 40 40 -40 -40 -40 -40 -40 -40 40
    -40 -40 -40 40 -40 -40 40 -40 40 -40 -40 -40 40
    Last edited by rebrendi; 05-14-2008, 04:42 PM.

  • #2
    Hey Rebrendi, I fully expect someone more knowledgeable than myself will chime in soon, but I do know that the PRB file is the per base quality file. There is more info here:

    For the latter one, four numbers per base are listed to present the negative log-transform of the probabilities of four nucleotides (A, C, G, T) to be sequenced at this base position.
    ...from http://rulai.cshl.edu/rmap/

    So it's possible to recover the sequence from the PRB file as well, by taking the highest-scoring nucleotide at each position.

    Hopefully that will tide you over until said smarter person appears!
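
To illustrate the point about the PRB file, here is a minimal Python sketch that calls one base per cycle from a single prb line. It assumes the four numbers per cycle are in A, C, G, T order, as the quoted description says; the function name `call_bases` is made up for this example and is not part of any pipeline.

```python
BASES = "ACGT"  # assumed column order within each group of four scores


def call_bases(prb_line):
    """Return a list of (base, score) pairs, one per cycle, from one prb line."""
    values = [int(v) for v in prb_line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected four scores per cycle")
    calls = []
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        # The called base is the one with the highest score in the group.
        best = max(range(4), key=lambda k: group[k])
        calls.append((BASES[best], group[best]))
    return calls


# First five cycles of the first prb line posted above.
line = "-40 -5 -40 5 -40 -40 40 -40 -40 -40 1 -1 -40 40 -40 -40 13 -13 -40 -40"
print("".join(base for base, _ in call_bases(line)))  # TGGCA
```

The result agrees with the start of the first read in the seq file (TGGCAAGG...), which suggests that line n of the prb file describes line n of the seq file.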



    • #3
      Thank you very much, ECO!

      So I still need answers about the other two file formats, and a program to read them under Windows (if one exists). I wonder, is it possible to extract the nucleotide positions in the genome from these short raw reads? There are also other tags encoded somewhere in these files.



      • #4
        The sig2 files are processed "traces"; you can draw a bar chart from them for each sequence. The seq files are the final data. It's trivial to convert the seq and prb files into a fastq file, and there are tools floating around to do this.

        Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.



        • #5
          Originally posted by cgb
          Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.
          Well, so what do these columns mean?
          Last edited by rebrendi; 05-14-2008, 11:52 PM.



          • #6
            Lane = 1-8 (which channel of the flowcell)

            X,Y = physical location of the cluster on the flowcell...



            • #7
              Originally posted by ECO
              Lane = 1-8 (which channel of the flowcell)

              X,Y = physical location of the cluster on the flowcell...
              thanks....



              • #8
                Not quite....

                The flowcell has 8 lanes, and the lane number identifies the lane. Each lane has up to 330 'tiles', numbered in a snaking pattern. The X,Y is the cluster coordinate on the given tile.
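
As a rough illustration of that key, here is a small Python sketch that splits one line of a seq file into lane, tile, x, y and the read. The `SeqRecord` type and `parse_seq_line` function are names invented for this example, and the sketch assumes the columns are whitespace-separated as in the lines posted at the top of the thread.

```python
from typing import NamedTuple


class SeqRecord(NamedTuple):
    lane: int   # flowcell lane (1-8)
    tile: int   # tile within the lane
    x: int      # cluster x coordinate on the tile
    y: int      # cluster y coordinate on the tile
    read: str   # called bases


def parse_seq_line(line):
    """Split one seq-file line into its four key columns and the read."""
    lane, tile, x, y, read = line.split()
    return SeqRecord(int(lane), int(tile), int(x), int(y), read)


rec = parse_seq_line("2 1 912 885 TGGCAAGGAAAATAAAATCAAAAA")
print(rec.lane, rec.tile, rec.x, rec.y, len(rec.read))  # 2 1 912 885 24
```

For the file s_2_0001_seq.txt this gives lane 2 and tile 1, which is consistent with the lane and tile numbers in the file name.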



                • #9
                  ... on the sig2 files: each row (= cluster) has the same key in the first 4 columns. Then you have 4 values for A,C,G,T <Tab> A,C,G,T, etc., up to the cycle number.

                  Note: your quality values are raw Q scores emitted by Bustard and will not be well calibrated.
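
A minimal Python sketch of reading one sig2 row along those lines: the first four columns become the key and the remaining values are grouped four per cycle. The function name `parse_sig2_line` is hypothetical, and the A, C, G, T order inside each group is taken from the description above rather than from any format specification.

```python
def parse_sig2_line(line):
    """Return ((lane, tile, x, y), [(A, C, G, T), ...]) for one sig2 row."""
    fields = line.split()  # splits on tabs and spaces alike
    lane, tile, x, y = (int(f) for f in fields[:4])
    values = [float(f) for f in fields[4:]]
    if len(values) % 4 != 0:
        raise ValueError("expected four intensities per cycle")
    cycles = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return (lane, tile, x, y), cycles


# Key plus the first two cycles of the sig2 row posted at the top of the thread.
key, cycles = parse_sig2_line(
    "2 1 912 885 2925.5 5978.4 913.5 6583.8 1032.0 1638.7 4854.6 221.5"
)
# Strongest channel per cycle, assuming A, C, G, T order.
print(key, "".join("ACGT"[c.index(max(c))] for c in cycles))  # (2, 1, 912, 885) TG
```

The strongest channel in the first two cycles is T and then G, matching the first two bases of the corresponding read in the seq file.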



                  • #10
                    Originally posted by cgb
                    The sig2 files are processed "traces"; you can draw a bar chart from them for each sequence. The seq files are the final data. It's trivial to convert the seq and prb files into a fastq file, and there are tools floating around to do this.

                    Generally the key is the first 4 columns: lane, tile, x, y for the given cluster that gave the sequence.
                    cgb,
                    can you say more about the programs that convert prb + seq into fastq format?
                    There is also a _sequence.txt output per lane, which contains the reads in the seq files minus the reads that fail the chastity filter. This can then be converted to fastq using one of the MAQ utilities.

                    Is there any advantage to using seq + prb instead of the filtered _sequence file? I have heard from the MAQ, SSAHA and other authors that using the filtered file is preferred for getting better alignment results with their tools.

                    sm
                    --
                    bioinfosm



                    • #11
                      Have a look on the Sanger site; if not, mail [email protected] or [email protected]



                      • #12
                        Originally posted by bioinfosm
                        cgb,
                        There is also a _sequence.txt output per lane, which contains the reads in the seq files minus the reads that fail the chastity filter. This can then be converted to fastq using one of the MAQ utilities.
                        On our pipeline, the _sequence.txt file only has 32 bases of sequence. If you are using SOAP or Maq, or you are doing more than 36 bases, you don't want to lose all those bases. Maybe you can fool around with the pipeline to get it to output more, but I don't know how. It also uses a non-standard quality scoring format, but that's not a deal-breaker.

                        I made a <50-line Perl thingie that takes the .prb and .seq files and makes a fastq. If I can do it, it can't be that hard.
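
The Perl script itself is not shown in the thread, so the following is only a Python sketch of the same idea, not that script. It makes several assumptions: line n of the prb file belongs to line n of the seq file, the quality of each called base is the highest of its four prb scores, those scores are on the Solexa scale (Q = 10 log10(p / (1 - p))), and the output should be Sanger-style FASTQ (Phred scores, ASCII offset 33). The read name format and the function names are invented for the example.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the nearest integer Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def fastq_record(seq_line, prb_line):
    """Build one FASTQ record from matching seq and prb lines."""
    lane, tile, x, y, read = seq_line.split()
    scores = [int(v) for v in prb_line.split()]
    if len(scores) != 4 * len(read):
        raise ValueError("prb line does not have four scores per base")
    quals = []
    for i in range(len(read)):
        # Assumption: the called base's quality is the best of its four scores.
        best = max(scores[4 * i:4 * i + 4])
        quals.append(chr(solexa_to_phred(best) + 33))
    name = f"{lane}_{tile}_{x}_{y}"  # hypothetical read name built from the key
    return f"@{name}\n{read}\n+\n{''.join(quals)}\n"


# First five bases of the first read and their prb scores from the top of the thread.
print(fastq_record(
    "2 1 912 885 TGGCA",
    "-40 -5 -40 5 -40 -40 40 -40 -40 -40 1 -1 -40 40 -40 -40 13 -13 -40 -40",
), end="")
```

Running it over whole files would just mean reading the seq and prb files in parallel, one line from each at a time. If the aligner expects Solexa-scaled qualities instead, the conversion step would have to change accordingly.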



                        • #13
                          Originally posted by cgb
                          Note - your quality values are raw Qscores emitted by Bustard and will not be well calibrated.

                          Hi cgb,

                          Can you expand on this a bit more please?

                          Cheers,

                          Scott.



                          • #14
                            The scores are supposed to reflect the chance of a basecall being in error: 20 = 1 in 100, etc. If they do this accurately they are "calibrated". Raw Bustard scores are not well calibrated: Bustard tends to both over-score and under-score bases, and shoves a lot into a Q40 bin (wrongly). The scores can be adjusted after the fact using several well-known methods. The newer (0.4) / 1.0 release of the GAPipeline allows for some degree of recalibration using control lane data.
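
The "20 = 1 in 100" relationship is the usual Phred-style definition, Q = -10 log10(p). A tiny Python sketch of it, with `error_probability` as a made-up helper name:

```python
def error_probability(q):
    """Error probability implied by a Phred-style quality score q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    # Print the score and the implied "1 in N" error rate.
    print(q, f"1 in {round(1 / error_probability(q))}")
```

So a score of 40 claims one error in 10,000 calls, which is why wrongly piling bases into the Q40 bin matters.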



                            • #15
                              To amplify a bit on cgb's posting: If you align your reads to a known, error-free reference (e.g., PhiX), you can then count the true errors and establish a true error rate. Compare this to the estimated error rate embodied in the Q scores. They should match: Out of all the Q30 bases in all the reads, there should be 1 error in 1000, and so on for each Q value.

                              An easy place to find this information is in the s_<lane>_qreport.txt file produced by Gerald when you do an alignment on the lane (ANALYSIS default or Eland). What you'll see there is that what's called Q40 really has 0.5% errors = Q23.
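
The comparison described above can be sketched in a few lines of Python: count the mismatches among bases reported at a given Q, turn that into an observed error rate, and convert back to a Q value. The function name `empirical_q` and the counts in the usage line are illustrative only; they are not taken from a real qreport file.

```python
import math


def empirical_q(errors, total):
    """Phred-style quality implied by an observed error count."""
    if errors <= 0 or total <= 0:
        raise ValueError("need positive counts to compute a finite Q")
    return -10 * math.log10(errors / total)


# Illustrative counts: 5 mismatches in 1000 bases is a 0.5% error rate.
print(round(empirical_q(5, 1000)))  # 23
```

If bases labelled Q40 come out near 23 when measured this way, the labels overstate the quality by a wide margin, which is the miscalibration being discussed.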
