Dear all,
I am new to sequencing projects and the SRA.
I am trying to find homologs in an organism whose data is in the SRA.
The project is SRS005501. When I download the experiment sample, I get a FASTQ-formatted file with 12 GB of data.
It looks like this:
@SRR026631.1 octect_20090730_fc2_THAPS_CCMP212187_13_53_F3 length=50
T2.3.032....0.3.3..1...00...2.23.2.3..20...0102022.
+SRR026631.1 octect_20090730_fc2_THAPS_CCMP212187_13_53_F3 length=50
!4!(!#$'!!!!$!#!&!!#!!!&$!!!#!##!#!'!!##!!!##$####!
@SRR026631.2 octect_20090730_fc2_THAPS_CCMP212187_13_65_F3 length=50
T0.1.020....3.2.0..2...21...2.13.3.0..23...3213103.
+SRR026631.2 octect_20090730_fc2_THAPS_CCMP212187_13_65_F3 length=50
!:!9!9?;!!!!:!:!;!!:!!!9?!!!:!%0!<!6!!:8!!!88319&7!
@SRR026631.3 octect_20090730_fc2_THAPS_CCMP212187_13_78_F3 length=50
T1.3.132....1.1.0..0...11...1.12.2.1..31...2203332.
+SRR026631.3 octect_20090730_fc2_THAPS_CCMP212187_13_78_F3 length=50
!:!2!99=!!!!8!)!4!!)!!!8+!!!3!3&!9!+!!4-!!!1,#,.7,!
My questions are:
1. How can I get a FASTA sequence from this file, or for this organism?
2. The FASTQ format definition says that the second line of each record is the sequence, but here I find only digits, and they are interrupted by dots. Why does the second line of every record have the same pattern of dots, in the form TX.X.XXX.... etc.?
3. Many people suggest that it is in color space format, but beyond that I cannot find any answer on how to convert this file. I tried encodeFasta.py from corona and solid2fasta with bowtie, and neither worked.
4. Is there a specific name for this format? Why does the SRA provide this file for download?
5. I don't want to assemble the sequences; I just want to find reads that are homologous to my protein. I hope I don't need an assembly for that.
Please clarify, and sorry if I am asking a silly question.
Thanking you,
Alaguraj.V