**nilshomer** · 03-17-2011, 09:43 AM

The F3 is usually the first end, while the R3 is usually the second end (see the read name). Also, the read is in color space, hence the first T and the digits. Please review literature and this forum surrounding color space (what it is and what it looks like) and color space mapping (see BFAST/BWA/Mosaik/NovoalignCS/SHRiMP).

**glocke** · 03-17-2011, 10:12 AM

i grep'd the fastq files and there are no 'R3' strings in there. Why would the reads be 52 bases long when the paper says the run was 25 bases x 2? It seems likely that I failed to extract the reads from the sra-lite files correctly.

(it's color space, clearly. i'm just confused why the first nucleotide is always T -- inspecting the fastq files i haven't found any reads starting with any other letter. I've read that ABI ligates some sort of primer or something in some cases, which could explain this.)

**nilshomer** · 03-17-2011, 10:19 AM

The only way to decode a color space read is using the adapter (did you read up on color space?). If there are 2x25 and each has an adapter added then coincidentally 2x(25+1) = 52, which is the number of bases you observed. I am wondering if they have done some concatenation magic here.
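To illustrate why the leading base matters, here is a minimal sketch of color space decoding. It assumes the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function name `decode_colorspace` is made up for this example. It is for illustration only: a single color error corrupts every base after it, which is why color space reads are normally mapped in color space rather than decoded first.

```python
# Sketch only: decode a SOLiD color space read into nucleotides.
# Assumes the standard two-base encoding (color = XOR of adjacent 2-bit base codes).
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def decode_colorspace(read: str) -> str:
    """Decode a read such as 'T32' (known leading base + colors) into bases.

    The leading base itself is not returned; it only anchors the decoding.
    """
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            # An unknown color makes every following base undetermined.
            decoded.append("N" * (len(read) - 1 - len(decoded)))
            break
        prev ^= int(color)  # next base code = previous base code XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


if __name__ == "__main__":
    # The same colors decode differently depending on the leading base.
    print(decode_colorspace("A0123"))
    print(decode_colorspace("G0123"))
```

Without the leading letter there are four equally valid decodings of the same color string, so the letter cannot simply be dropped when converting to nucleotides.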

Has anyone else seen this problem when downloading from SRA? You could try emailing their support, since the above data is confusing.

**glocke** · 03-17-2011, 10:23 AM

am contacting SRA. thanks for your help.

**SeqAA** · 03-17-2011, 02:42 PM

i'm pretty sure this paper is mate pair reads and not paired end.
I know it says paired end, but they are from circularized libraries, which means mate pair. Plus it is from 2008, and the reverse reads were not even around then.

also, i've never seen data in that format before.

**nilshomer** · 03-17-2011, 03:32 PM

Originally posted by SeqAA:

also, i've never seen data in that format before.

You will not see it again after the SRA shuts down.

The second end of mate pair data carries the R3 tag, while F5 is the paired end tag. The first end should be F3, so it looks like either only the first end is present or you've downloaded fragments?

**glocke** · 03-18-2011, 08:39 AM

i contacted the SRA, who told me that they only received forward reads of length 50. It looks like i'll have to contact the authors of the paper. I'll post here when I solve the problem.

thanks for your help, ppl!

**sinaian** · 03-18-2011, 12:29 PM

For each read, the first 26 characters (a 'T' followed by 25 colors) are F3, and the last 26 characters (a 'G' followed by 25 colors) are R3.

**glocke** · 03-21-2011, 07:55 AM

sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into dna-space (from colorspace), then separate the forward and back reads.

That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! can anyone think of a better way?

also, sinaian, is there any documentation for this format of reads?

**Richard Finney** · 03-21-2011, 09:04 AM

BWA does colorspace fastq like this BUT (IMPORTANT) the target genome must have been indexed (created) using the "-c" parameter. Example: "bwa index -c hg1xORwhatever.fasta". If bwa complains, add the parameter "-a bwtsw".

Then run bwa using the -c parameter to align. THIS "-c/colorspace" MUST BE DONE ON BOTH THE INDEXING AND ALIGNMENT COMMANDS!!!!!

**epigen** · 03-21-2011, 09:45 AM

For BWA, you'd have to strip the T, convert 0 to A, 1 to C, 2 to G, 3 to T and anything else to N. That's pseudo-nucleotide representation of color space, which was introduced as a workaround. It has nothing to do with the actual decoding into nucleotide space!
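A minimal sketch of the conversion just described, written as a plain function. The name `to_pseudo_nucleotides` is hypothetical, and the mapping is taken directly from the sentence above. BWA's own solid2fastq.pl may trim the read differently (for example, it may also drop the first color), so compare its output before relying on this.

```python
# Sketch only: "pseudo-nucleotide" (double-encoded) form of a color space read.
# This is NOT decoding into real bases; it only relabels the colors as letters.
_COLOR_TO_PSEUDO = {"0": "A", "1": "C", "2": "G", "3": "T"}


def to_pseudo_nucleotides(cs_read: str) -> str:
    """Strip the leading primer base and map 0/1/2/3 to A/C/G/T, anything else to N."""
    colors = cs_read[1:]  # drop the leading 'T' (or 'G') primer base
    return "".join(_COLOR_TO_PSEUDO.get(c, "N") for c in colors)


if __name__ == "__main__":
    print(to_pseudo_nucleotides("T3232213330000233003100102"))
```

Because the primer base is removed, the 25-character result lines up with the 25-character quality string.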

But BWA would map these reads as if they were 50 bp reads, not mate pairs.
You'll just have to write a script that separates the colorspace data and writes them to 2 files: one containing the F3

Code:

@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
T3232213330000233003100102
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
21(()+%'+%40*.%%**)&%&*&%
...

the other containing the R3

Code:

@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
G2230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
%%&%%%%%%%%%%%%%%%(+%%%%'
...

It's probably also necessary to remove the whitespace from the lines containing SRR015241.1.
The T / G primer has no quality value, so the quality line is 25 characters instead of 26.
You can then map these files separately with either BFAST or Novoalign (which should work directly on this format) or BWA (after conversion into pseudo-nucleotides).
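The splitting script described above could look roughly like the sketch below. It rests on assumptions that are not confirmed anywhere in this thread: that each combined record has a 52-character sequence line (F3 half then R3 half) and a 50-character quality line (25 values per half), and that the combined header ends in `_F3`. All function names are made up for this example. Check a few records by eye before trusting the output.

```python
# Sketch only: split combined F3+R3 color space FASTQ records into two files.
# Assumed layout (unverified): sequence = F3 half + R3 half, quality = F3 half + R3 half.
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def split_record(name: str, seq: str, qual: str) -> Tuple[Record, Record]:
    """Cut one combined record in half and return (F3 record, R3 record)."""
    if len(seq) % 2 or len(qual) % 2:
        raise ValueError(f"odd-length record, cannot halve: {name}")
    half_s, half_q = len(seq) // 2, len(qual) // 2
    stem = "_".join(name.split())  # remove whitespace from the read name
    if stem.endswith("_F3"):
        stem = stem[:-3]  # assumption: combined header carries the F3 tag
    f3 = (stem + "_F3", seq[:half_s], qual[:half_q])
    r3 = (stem + "_R3", seq[half_s:], qual[half_q:])
    return f3, r3


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ file."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def write_record(handle: TextIO, record: Record) -> None:
    name, seq, qual = record
    handle.write(f"@{name}\n{seq}\n+\n{qual}\n")


def split_fastq(in_path: str, f3_path: str, r3_path: str) -> None:
    """Write the F3 halves to f3_path and the R3 halves to r3_path."""
    with open(in_path) as src, open(f3_path, "w") as f3_out, open(r3_path, "w") as r3_out:
        for name, seq, qual in read_fastq(src):
            f3, r3 = split_record(name, seq, qual)
            write_record(f3_out, f3)
            write_record(r3_out, r3)
```

Usage would be along the lines of `split_fastq("SRR015241.fastq", "SRR015241_F3.fastq", "SRR015241_R3.fastq")`, where the file names are placeholders. Both output files keep the reads in color space, with the primer base still attached, so they can go straight to a color space aware mapper.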

**Richard Finney** · 03-21-2011, 10:25 AM

ALSO ... bwa provides solid2fastq.pl and solid2fastq2.pl programs for converting to a fastq suitable for input to "bwa aln"

**sinaian** · 03-28-2011, 07:00 PM

Originally posted by glocke:

sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into dna-space (from colorspace), then separate the forward and back reads.

That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! can anyone think of a better way?

also, sinaian, is there any documentation for this format of reads?

Sorry for the late reply. You'll have to separate the two parts in color space before doing anything else to them. I am not aware of any tool or documentation, and my suggestion is only based on my limited trial and error. Honestly I am STUNNED by the way SRA handles SOLiD paired reads.

paired-end reads apparently unseparated