  • Anybody recognize this Illumina data format?

    Dear community,
    I just obtained some Illumina sequencing data, but it is in a format that I am unfamiliar with. The files were labeled "probs.txt". Does anybody recognize this format, and can you suggest software to parse it or convert it to FASTQ? From the header info, it appears to be paired-end Illumina data with base calls, but I am not sure whether the base calls are followed by quality values or intensities. The numbers following the base calls come in sets of 4, and the order of the numbers corresponds to ACGT: if the first of the four numbers is the highest, the base is an A; if the second is the highest, the base is a C; and so on.

    3 1 10 1097#0/1 ATCTA........CCTGGCCACC............. 37 -37 -40 -40 -40 -9 -40 9 -22 17 -25 -20 -0 -40 -34 0 -2 -2 -13 -6 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -13 6 -14 -10 -24 12 -21 -13 -39 -21 -19 17 -40 -4 1 -7 -15 -11 2 -3 -13 8 -16 -12 -11 8 -23 -12 12 -14 -18 -20 -13 6 -12 -10 -15 10 -16 -13 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
    3 1 10 1097#0/2 CAGACATCGCGATCGGGTTCGCGATCCGC.CCGAAG -18 16 -33 -21 39 -39 -40 -40 -39 -8 6 -12 34 -38 -37 -40 -30 25 -27 -33 31 -31 -40 -40 -28 -8 -20 8 -17 13 -19 -20 -40 -26 17 -18 -8 5 -16 -10 -40 -10 8 -13 21 -23 -40 -28 -26 -13 -21 12 -9 3 -19 -6 -40 -22 16 -17 -40 -4 -3 -3 -40 -17 16 -26 -30 -13 -16 11 -40 -18 -36 18 -4 -4 -5 -6 -40 -24 15 -16 -24 19 -24 -21 -27 -11 10 -16 21 -24 -32 -24 -24 -15 -19 13 -19 12 -19 -14 -9 6 -13 -14 -16 -8 3 -7 -18 0 -8 -2 -5 -5 -5 -5 -9 1 -4 -12 -14 3 -11 -6 -16 -11 7 -10 14 -14 -34 -22 11 -15 -14 -28 -14 -6 -1 -4

  • #2
    Originally posted by sbberes
    Dear community,
    I just obtained some Illumina sequencing data, but it is in a format that I am unfamiliar with. The files were labeled "probs.txt". Does anybody recognize this format, and can you suggest software to parse it or convert it to FASTQ? From the header info, it appears to be paired-end Illumina data with base calls, but I am not sure whether the base calls are followed by quality values or intensities. The numbers following the base calls come in sets of 4, and the order of the numbers corresponds to ACGT: if the first of the four numbers is the highest, the base is an A; if the second is the highest, the base is a C; and so on.
    Code:
    3	1	10	1097#0/1	ATCTA........CCTGGCCACC.............	  37  -37  -40  -40	 -40   -9  -40    9	 -22   17  -25  -20	  -0  -40  -34    0	  -2   -2  -13   -6	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	 -13    6  -14  -10	 -24   12  -21  -13	 -39  -21  -19   17	 -40   -4    1   -7	 -15  -11    2   -3	 -13    8  -16  -12	 -11    8  -23  -12	  12  -14  -18  -20	 -13    6  -12  -10	 -15   10  -16  -13	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5	  -5   -5   -5   -5
    3	1	10	1097#0/2	CAGACATCGCGATCGGGTTCGCGATCCGC.CCGAAG	 -18   16  -33  -21	  39  -39  -40  -40	 -39   -8    6  -12	  34  -38  -37  -40	 -30   25  -27  -33	  31  -31  -40  -40	 -28   -8  -20    8	 -17   13  -19  -20	 -40  -26   17  -18	  -8    5  -16  -10	 -40  -10    8  -13	  21  -23  -40  -28	 -26  -13  -21   12	  -9    3  -19   -6	 -40  -22   16  -17	 -40   -4   -3   -3	 -40  -17   16  -26	 -30  -13  -16   11	 -40  -18  -36   18	  -4   -4   -5   -6	 -40  -24   15  -16	 -24   19  -24  -21	 -27  -11   10  -16	  21  -24  -32  -24	 -24  -15  -19   13	 -19   12  -19  -14	  -9    6  -13  -14	 -16   -8    3   -7	 -18    0   -8   -2	  -5   -5   -5   -5	  -9    1   -4  -12	 -14    3  -11   -6	 -16  -11    7  -10	  14  -14  -34  -22	  11  -15  -14  -28	 -14   -6   -1   -4
    It appears that someone has shmushed together the old-style Illumina _seq.txt and _prb.txt files and also merged the read1 and read2 files from a paired-end run. Now that your sample is wrapped in CODE tags, the layout of the file is much easier to see.

    - Column 1 identifies the lane, column 2 the tile, column 3 the X-coordinate and column 4 (up to the #) the Y-coordinate. Together they make up a unique identifier for the cluster.

    - The "#0" is used to identify multiplex IDs (meaningless in this case).

    - The number after the / (1 or 2) indicates whether this is read1 or read2 for this particular cluster.

    - Column 5 is the sequence.

    - The remaining numbers are quality scores, arranged in groups of 4, one group per base position. The four values in a group reflect the probability of each of the four bases at that position and, as you correctly surmised, they are in the order ACGT. Groups are separated by tabs, and within each group the numbers are separated by one or more spaces.
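The column layout described above can be sketched as a small Python parser. This is only an illustration: the names `ProbsRecord` and `parse_probs_line` are made up for this example, and the code assumes every line has the five leading columns followed by a multiple of four integers. It splits on any whitespace, so it does not depend on tabs surviving a copy and paste.

```python
from typing import List, NamedTuple, Tuple


class ProbsRecord(NamedTuple):
    """One line of the merged file (hypothetical container for this sketch)."""
    lane: int
    tile: int
    x: int
    y: int
    index: str       # text between '#' and '/', e.g. "0"
    read: int        # 1 or 2
    sequence: str    # base calls, '.' where no call was made
    scores: List[Tuple[int, int, int, int]]  # one (A, C, G, T) group per base


def parse_probs_line(line: str) -> ProbsRecord:
    tokens = line.split()
    if len(tokens) < 5:
        raise ValueError("expected at least 5 columns")
    lane, tile, x, ident, sequence = tokens[:5]

    # Column 4 looks like "1097#0/1": Y-coordinate, multiplex index, read number.
    y, _, rest = ident.partition("#")
    index, _, read = rest.partition("/")

    values = [int(v) for v in tokens[5:]]   # int("-0") is simply 0
    if len(values) % 4 != 0:
        raise ValueError("score count is not a multiple of 4")
    scores = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]

    return ProbsRecord(int(lane), int(tile), int(x), int(y),
                       index, int(read), sequence, scores)
```

Comparing `len(record.sequence)` with `len(record.scores)` on a few lines is a quick way to confirm that the file really has one group of four per base.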

    I would recommend splitting this file back into its original form and also separating the records by read. Columns 1-5 (identifier and sequence) should go in a file named readN_seq.txt (where N is 1 or 2, depending on the read). All of the probability scores should go in a file named readN_prb.txt. Each file should contain one line per sequence, and you must keep the lines in the same order: the _prb.txt files contain no identifying information, and the following step assumes they match the _seq.txt files line by line.
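One possible way to do the split is sketched below in Python. The function name `split_probs` is invented for this example, and the code assumes the file is tab-delimited exactly as in the sample above, with the read number after the final "/" in column 4. Column 4 is written out unchanged (including the "#0/1" suffix); whether the converter accepts that suffix has not been tested here.

```python
import os


def split_probs(probs_path, out_dir="."):
    """Split a merged probs.txt into readN_seq.txt / readN_prb.txt pairs.

    Returns a dict mapping read number (as a string) to lines written.
    """
    handles = {}   # read number -> (seq file handle, prb file handle)
    counts = {}
    try:
        with open(probs_path) as src:
            for line in src:
                line = line.rstrip("\n")
                if not line.strip():
                    continue  # skip blank lines
                # Columns 1-5, then everything else as one score field.
                fields = line.split("\t", 5)
                if len(fields) < 6:
                    raise ValueError("unexpected line: %r" % line)
                read = fields[3].rsplit("/", 1)[-1]
                if read not in handles:
                    seq_path = os.path.join(out_dir, "read%s_seq.txt" % read)
                    prb_path = os.path.join(out_dir, "read%s_prb.txt" % read)
                    handles[read] = (open(seq_path, "w"), open(prb_path, "w"))
                    counts[read] = 0
                seq_out, prb_out = handles[read]
                # Writing both lines together keeps the two files in step.
                seq_out.write("\t".join(fields[:5]) + "\n")
                prb_out.write(fields[5].strip() + "\n")
                counts[read] += 1
    finally:
        for seq_out, prb_out in handles.values():
            seq_out.close()
            prb_out.close()
    return counts
```

Because each input line is written to its _seq.txt and _prb.txt file in the same step, the line-by-line pairing between the two files is preserved automatically.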

    Use the fq_all2std.pl script, which is provided as part of the MAQ package, to convert each pair of files to FASTQ. The command to use for this input is seqprb2std. Sample command:

    Code:
    > fq_all2std.pl seqprb2std read1_seq.txt read1_prb.txt > read1.fastq
    Do likewise for read2 and you should have two properly formatted FASTQ files.
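To spot-check a few records of the resulting FASTQ by hand, something like the following Python sketch may help. It rests on an assumption the thread does not confirm: that the values are Solexa-scaled scores (the -40 to 40 range suggests this), which can be converted to Phred with Q_phred = 10 * log10(1 + 10^(Q_solexa / 10)). The function names are invented for this example, and this is not claimed to reproduce the exact output of fq_all2std.pl.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Phred scale (assumed encoding)."""
    return 10.0 * math.log10(1.0 + 10.0 ** (q_solexa / 10.0))


def to_fastq(name, sequence, score_groups):
    """Build one FASTQ record from a sequence and its (A, C, G, T) groups.

    The quality for each position is taken from the highest of the four
    scores and written with a Phred+33 offset.
    """
    if len(sequence) != len(score_groups):
        raise ValueError("need exactly one score group per base")
    bases = sequence.replace(".", "N")   # '.' marks a position with no call
    quals = []
    for group in score_groups:
        phred = int(round(solexa_to_phred(max(group))))
        quals.append(chr(33 + phred))
    return "@%s\n%s\n+\n%s\n" % (name, bases, "".join(quals))
```

If the qualities in the converted file look very different from what this produces for the same cluster, it is worth checking whether the _seq.txt and _prb.txt lines drifted out of order.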
