
  • query on file format conversion

    Dear all,
    I am new to sequencing projects and the SRA.
    I am trying to find homologs in an organism whose data is in the SRA.
    The project is SRS005501. When I download the experiment sample, I get a FASTQ-formatted file with 12 GB of data.

    It looks like this:
    @SRR026631.1 octect_20090730_fc2_THAPS_CCMP212187_13_53_F3 length=50
    T2.3.032....0.3.3..1...00...2.23.2.3..20...0102022.
    +SRR026631.1 octect_20090730_fc2_THAPS_CCMP212187_13_53_F3 length=50
    !4!(!#$'!!!!$!#!&!!#!!!&$!!!#!##!#!'!!##!!!##$####!
    @SRR026631.2 octect_20090730_fc2_THAPS_CCMP212187_13_65_F3 length=50
    T0.1.020....3.2.0..2...21...2.13.3.0..23...3213103.
    +SRR026631.2 octect_20090730_fc2_THAPS_CCMP212187_13_65_F3 length=50
    !:!9!9?;!!!!:!:!;!!:!!!9?!!!:!%0!<!6!!:8!!!88319&7!
    @SRR026631.3 octect_20090730_fc2_THAPS_CCMP212187_13_78_F3 length=50
    T1.3.132....1.1.0..0...11...1.12.2.1..31...2203332.
    +SRR026631.3 octect_20090730_fc2_THAPS_CCMP212187_13_78_F3 length=50
    !:!2!99=!!!!8!)!4!!)!!!8+!!!3!3&!9!+!!4-!!!1,#,.7,!

    My questions are:
    1. How do I get FASTA sequences from this file, or for this organism?

    2. The FASTQ format definition says that the second line of each record is the sequence, but here I find only digits, and the digits are not contiguous. Why does the second line of every record have the same pattern of dots, i.e. the same layout of TX.X.XXX.... etc.?

    3. Many suggestions say that it is in color space format, but beyond that I can't find any answer on how to convert this file. I tried encodeFasta.py in corona and solid2fasta with bowtie, and neither worked.

    4. Is there a specific name for this format? Why does the SRA provide this file for download?

    5. I don't want to assemble the sequences; I just want to find reads that are homologous to my protein. I hope I don't need an assembly for that.
    Please clarify, and sorry if I am asking a silly question.

    Thanking you,
    Alaguraj.V

  • #2
    This is a FASTQ file containing sequencing reads in colorspace.
    Colors are encoded as digits, and although conversion into sequence space is possible, it isn't recommended. Instead, align the reads to your reference sequence in colorspace. There are many posts in this forum describing this.
    The dots in your sequence mean that the color at that position could not be determined. I hope not all of your reads look like that.
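To illustrate why conversion is discouraged, here is a minimal Python sketch of naive colorspace decoding. It assumes the standard SOLiD two-base encoding (with A=0, C=1, G=2, T=3, each color is the XOR of the codes of two adjacent bases) and that the leading letter of the read is the primer base. The function name `decode_colorspace` is made up for this example and is not part of any tool mentioned in the thread.

```python
BASES = "ACGT"  # index = 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Naively translate a colorspace read (primer base + colors) to bases.

    Each base depends on the previous base, so a wrong or missing color
    corrupts every base after it. Undetermined positions become 'N'.
    """
    prev = BASES.index(read[0])  # primer base, not part of the output
    out = []
    for color in read[1:]:
        if prev is None or color not in "0123":
            # Once one color is unknown, the rest cannot be decoded.
            prev = None
            out.append("N")
            continue
        prev ^= int(color)  # next base code = previous code XOR color
        out.append(BASES[prev])
    return "".join(out)


print(decode_colorspace("T0123"))  # TGAT
print(decode_colorspace("T0023"))  # TTCG (one wrong color changes all later bases)
print(decode_colorspace("T0.23"))  # TNNN (one missing color loses all later bases)
```

Applied to the first read in the question, this yields a single base followed only by N, because the second color is already a dot. Aligners that work in colorspace avoid this problem by comparing colors directly.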



    • #3
      If you grab reads from the beginning of a .csfasta file, you tend to pull some of the worst reads, because reads are processed in spatial order. That nearly guarantees that the early reads in the file come from near the edge of the flowcell.

      Also, I strongly agree that converting raw colorspace reads to sequence space is almost never the best solution.

      --
      Phillip
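Rather than judging the run from the first few records, one could scan the whole file and count how many reads contain undetermined colors. The following is a rough Python sketch, assuming plain four-line FASTQ records like the ones shown in the question; the function names and the `max_missing` threshold are illustrative choices, not an established convention.

```python
import io


def missing_call_fractions(handle):
    """Yield (read_name, fraction of '.' colors) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored
        colors = seq[1:]  # drop the leading primer base
        name = header[1:].split()[0]
        frac = colors.count(".") / len(colors) if colors else 0.0
        yield name, frac


def summarize(handle, max_missing=0.0):
    """Return (usable, total): reads whose missing fraction is <= max_missing."""
    total = usable = 0
    for _, frac in missing_call_fractions(handle):
        total += 1
        if frac <= max_missing:
            usable += 1
    return usable, total


# Tiny in-memory example; for a real file, pass open("reads.fastq") instead.
sample = io.StringIO(
    "@r1 demo length=5\nT0.1.2\n+\n!!!!!\n"
    "@r2 demo length=5\nT01230\n+\n!!!!!\n"
)
print(summarize(sample))  # (1, 2)
```

Because the file is read as a stream, this works on a 12 GB download without loading it into memory.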



      • #4
        I did a test where I took 10,000 reads that mapped in colour space, converted them to base space, and then tried to map them again. Only 1/3 were mappable, so, as the two previous posters recommend, don't convert your reads to base space.

        I assume that you have the DNA sequence of your protein of interest. If you do, map your reads against it with bfast/bwa/bowtie/bioscope/etc. If you don't have the DNA sequence, I cannot see how you can do this.
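Conceptually, mapping in colorspace means translating the reference into colors and comparing colors, instead of translating the reads into bases. The toy Python sketch below shows the idea, assuming the standard SOLiD two-base encoding (A=0, C=1, G=2, T=3, color = XOR of adjacent base codes). The helper names are hypothetical, it only does exact forward-strand matching, and it is not a replacement for the aligners named above, which also handle mismatches, color errors, and the reverse strand.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colorspace(dna):
    """Return one color per pair of adjacent bases in a base-space sequence.

    Raises KeyError on characters other than A, C, G, T.
    """
    dna = dna.upper()
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(dna, dna[1:]))


def find_read(reference, read):
    """Return the 0-based reference position of the read's first base, or -1.

    The primer base and the first color are dropped, because the first
    color describes the primer-to-read transition, which is not part of
    the reference.
    """
    ref_colors = encode_colorspace(reference)
    return ref_colors.find(read[2:])


print(encode_colorspace("ACGTTGCA"))   # 1310131
print(find_read("ACGTTGCA", "T1101"))  # 2  (read covers GTTG)
```

With this representation, a single miscalled color is a single mismatch in the comparison, instead of corrupting the rest of the read.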
