
  • Converting GEO database TXT format to fasta

    Hello!

    I'm new to bioinformatics, but I need to perform an analysis of some sort.

    I've downloaded data from the GEO database. It's a large TXT file consisting of many lines that look like this: SCS_0004:2:1:1053:18066#0/1 AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC [YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W` 5 29 29 chr17:68048647-68172163_36129 3979 + 1 1

    I need to align those short sequences to a mouse chromosome, and I'm using Bowtie under Windows.

    The problem is that Bowtie doesn't work with this format. Can you recommend an easy-to-use tool for Windows to convert this format into FASTA or just raw sequences?

  • #2
    This could be done with the unix command line but it would be helpful if you can post a few lines enclosed within code brackets like
    Code:
    paste here
    to get a precise idea of the file format.



    • #3
      Originally posted by vivek_
      This could be done with the unix command line but it would be helpful if you can post a few lines enclosed within code brackets like
      Code:
      paste here
      to get a precise idea of the file format.
      Code:
      SCS_0004:2:1:1053:18066#0/1	AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC	[YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W`	5	29	29	chr17:68048647-68172163_36129	3979	+	1	1
      SCS_0004:2:1:1053:18066#0/1	AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC	[YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W`	5	29	29	chr17:68048647-68172163_36130	3979	+	1	1
      SCS_0004:2:1:1053:18066#0/1	AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC	[YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W`	5	29	29	uc008dkh.1	4033	+	1	1
      SCS_0004:2:1:1053:18066#0/1	AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC	[YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W`	5	29	29	chr17:68046720-68172163_36128	3943	+	1	1
      SCS_0004:2:1:1053:18066#0/1	AGCAATATTGACTACANCCTCATCAAAGCCTGTAGGCACC	[YITQR]MST\WN\\TEQU[`]WU]]WPYXXXOXU]`\W`	5	29	29	chr17:68046720-68172163_36127	3943	+	1	1
      SCS_0004:2:1:1054:5070#0/1	TTTCTCTGTCTTGTCCNCCTAGTTTCCCTCCTGTAGGCAC	aaaaaaaaaaaaaaa]EaaaW]]]Yaa\a`[aa]Pa^]VT	2	30	30	chr2:40378133-40378584_42654	395	-	1	1
      SCS_0004:2:1:1054:5070#0/1	TTTCTCTGTCTTGTCCNCCTAGTTTCCCTCCTGTAGGCAC	aaaaaaaaaaaaaaa]EaaaW]]]Yaa\a`[aa]Pa^]VT	2	30	30	chr1:8926487-8927380_99	16	+	1	1
      Something like this. By the way, I don't have Unix installed, only Mac OS and Windows, but as far as I understand Mac OS is a Unix-based system, right?
      Last edited by Etherella; 08-30-2012, 03:03 AM.



      • #4
        The easiest way would be to write a small script (in Python, Perl, whatever) to read that in and spit out the same data (sans alignment information) in FASTQ format. Column 1 is the read ID, column 2 is the sequence, and column 3 is the quality string. If you have Python installed on your Mac, then the following would probably work (changing INPUT_FILENAME to the name of the file you got from GEO and SOME_OUTPUT_FILE to whatever you want the output to be):
        Code:
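        #!/usr/bin/env python
        import csv
        
        # Columns: 0 = read id, 1 = sequence, 2 = quality string; the rest is alignment info.
        # QUOTE_NONE keeps any quote characters in the quality string as literal text.
        f = csv.reader(open("INPUT_FILENAME", "r"), dialect="excel-tab", quoting=csv.QUOTE_NONE)
        output = open("SOME_OUTPUT_FILE", "w")
        
        last = ""
        for line in f:
            # A read with several alignments appears on consecutive lines; write it only once.
            if line[0] != last:
                output.write("@%s\n" % line[0])  # FASTQ headers start with "@" (">" is FASTA)
                output.write("%s\n" % line[1])
                output.write("+\n")
                output.write("%s\n" % line[2])
                last = line[0]
        output.close()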
        Something like that would probably work.
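
        Since the original question asked for FASTA (or raw sequences) rather than FASTQ, here is a minimal sketch of the same idea that drops the quality column. It assumes Python 3 and the tab-separated layout shown in #3 (read ID in column 1, sequence in column 2, repeated alignments of a read on consecutive lines). The function name geo_txt_to_fasta and the toy rows are made up for illustration; they are not from the GEO file.

```python
import csv
import io
import sys


def geo_txt_to_fasta(src, dst):
    """Copy read id and sequence (columns 1 and 2) to FASTA; return records written."""
    # QUOTE_NONE: treat quote characters as plain text, not as CSV quoting.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    last_id = None
    written = 0
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        read_id, seq = row[0], row[1]
        # Multiple alignments of one read sit on consecutive lines; emit it once.
        if read_id != last_id:
            dst.write(">%s\n%s\n" % (read_id, seq))
            last_id = read_id
            written += 1
    return written


# Toy input (made-up rows in the same column layout), not real GEO data.
toy = io.StringIO(
    "read1\tACGT\thhhh\t5\n"
    "read1\tACGT\thhhh\t5\n"
    "read2\tTTGA\thhhh\t2\n"
)
geo_txt_to_fasta(toy, sys.stdout)
# prints:
# >read1
# ACGT
# >read2
# TTGA
```

        For a real file, you would pass open("INPUT_FILENAME", newline="") as src and an output file opened for writing as dst, then close both when done.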



        • #5
          Thanks for the reply. I managed to get it working through Galaxy.
