SEQanswers

Old 03-17-2011, 08:12 AM   #1
glocke
Junior Member
 
Location: New Jersey

Join Date: Mar 2011
Posts: 5
paired-end reads apparently unseparated

Hi, this is my first post!

I've downloaded some ChIP-Seq data from the SRA ( http://www.ncbi.nlm.nih.gov/sra/SRX000425?report=full# ). The originating paper ( http://www.ncbi.nlm.nih.gov/pubmed/18477713 ) says that the reads are SOLiD paired-end, with 25 bp from each end, but the reads themselves are 52 bases long! It seems that the two ends have been ligated and sequenced together.

I am wondering where the extra two bases come from, but the larger problem is that all the alignment programs out there seem to expect paired-end data to come as TWO lists of reads, one for each half of the pair. (There are two archives attached to this SRA account, but I'm fairly certain that they're not paired: when you add their read counts you get the number of reads reported in the paper, and the reported number of matches is more than half that number. Besides, the read IDs don't appear to match, and the reads are too long.)

What tools can I use to map these reads to a reference genome?

For reference, here are some sample reads.
First file:
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
!,*+*()+*(%'+)%%%&%+&%%'%%%%%%%%%%%%%%%%%%%%'+%%%%%
@SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
T32023002222000323202022222323322200222200220003032
+SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
!(*)%%%+'%%%*%%%%&%%%%%%%%%%%%%%%%%%%%%%%%%%%+%%%%%
@SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
T32021200310022332200020032222332303202203222030030
+SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
!9)'+*)4')%&&%)%%('&%%'%'%%%%%%%)%%%%%%%%%%%%+%%%%(
Second file:
Code:
@SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
T03231223000321133333031113002130221200322111211011
+SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
!:9<3:99<*8;8<)0<;<%-8;2%%3*5%*.8<,1;6;*%&..'%%-*,%
@SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
T01032120003210102101003202002003021300310100313323
+SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
!<*7-;3291:/*0306/';'6<8&/;13'/,6%5&,''*+3--/+*4&%&
@SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
T30000002320022232001023330002000220231323302003320
+SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
!,%&'''---+5%%%*-(-2%37''%-&%%+(3-&%*%%'%*&2''%3.%%
@SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
T01301202310020020101002322020221212212112020001111
+SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
!52/601)1,&3:%5691*-':74),'%%%&%&+(*)&%)'&&,'&)*)*%
@SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
T30202333031100120210330331030310222032111001231300
+SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
!;<<<;;<<<<<1<<1;<56/<<:9:1;31;/<;%%/89/'99<'08)<%0
(all the reads start with T?)
Old 03-17-2011, 10:43 AM   #2
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

The F3 is usually the first end, while the R3 is usually the second end (see the read name). Also, the read is in color space, hence the leading T and the digits. Please review the literature and this forum on color space (what it is and what it looks like) and color space mapping (see BFAST/BWA/Mosaik/NovoalignCS/SHRiMP).
Old 03-17-2011, 11:12 AM   #3
glocke
Junior Member
 
Location: New Jersey

Join Date: Mar 2011
Posts: 5
Default

I grep'd the fastq files and there are no 'R3' strings in there. Why would the reads be 52 bases long when the paper says the run was 25 bases x 2? It seems likely that I failed to extract the reads from the sra-lite files correctly.

(It's color space, clearly. I'm just confused why the first nucleotide is always T: inspecting the fastq files, I haven't found any reads starting with any other letter. I've read that ABI ligates some sort of primer or something in some cases, which could explain this.)
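A check like this can also be done by tallying read lengths, leading characters, and name tags directly instead of grepping. The following is only a minimal Python sketch, not an SRA tool: `summarize_fastq` is a made-up helper name, and it assumes plain four-line FASTQ records with the tag (F3, R3, ...) as the last underscore-separated field of the read name, as in the samples above.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_fastq(lines: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Tally sequence lengths, first characters, and name tags.

    Records are located by position (4 lines each), not by looking for '@',
    because quality lines may themselves begin with '@'.
    """
    lengths, first_chars, tags = Counter(), Counter(), Counter()
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Header: drop the trailing ' length=NN' field, keep the last '_' field.
            name = line.split(" length=")[0]
            tags[name.rsplit("_", 1)[-1]] += 1
        elif i % 4 == 1:
            # Sequence line: primer base followed by colors.
            lengths[len(line)] += 1
            first_chars[line[:1]] += 1
    return lengths, first_chars, tags
```

Passing an open file handle works, since iterating a handle yields lines. If every read shares one length and one leading base, and only one tag shows up, the file holds a single stream of reads rather than two separated ends.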
Old 03-17-2011, 11:19 AM   #4
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

The only way to decode a color space read is using the adapter base (did you read up on color space?). If the run is 2x25 and each end has an adapter base added, then coincidentally 2x(25+1) = 52, which is the number of bases you observed. I am wondering if they have done some concatenation magic here.

Has anyone else seen this problem when downloading from the SRA? You could try to email their support, since the above data is confusing.
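To illustrate why the adapter base matters: in SOLiD two-base encoding each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), so decoding has to start from a known base. Below is a minimal Python sketch of that decoding; `decode_colorspace` is an illustrative name, not a function from any of the aligners mentioned.

```python
BASES = "ACGT"  # index in this string is the 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read: str) -> str:
    """Decode a color-space read such as 'T3232' into nucleotides.

    The leading letter is the last base of the primer/adapter. It anchors
    the decoding but is not part of the insert, so it is not returned.
    """
    prev = BASES.index(read[0])
    out = []
    for color in read[1:]:
        if color not in "0123":
            # An unknown color leaves every following base undetermined.
            out.append("N" * (len(read) - 1 - len(out)))
            break
        prev ^= int(color)  # next base code = previous base code XOR color
        out.append(BASES[prev])
    return "".join(out)
```

Because each base depends on all colors before it, a single miscalled color changes every base after it. That is the usual argument for aligning in color space and decoding only afterwards.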
Old 03-17-2011, 11:23 AM   #5
glocke
Junior Member
 
Location: New Jersey

Join Date: Mar 2011
Posts: 5
Default

I am contacting the SRA. Thanks for your help.
Old 03-17-2011, 03:42 PM   #6
SeqAA
Guest
 

Posts: n/a
Default

I'm pretty sure this paper used mate-pair reads, not paired-end reads.
I know it says paired end, but they are from circularized libraries, which means mate pair. Plus it is from 2008, when the reverse reads were not even around.

Also, I've never seen data in that format before.
Old 03-17-2011, 04:32 PM   #7
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by SeqAA View Post
Also, I've never seen data in that format before.
You will not see it again after the SRA shuts down.

The second end on mate-pair data has the R3 tag, while F5 is the paired-end tag. The first end should be F3, so it looks like just the first end is present, or you've downloaded fragments?
Old 03-18-2011, 09:39 AM   #8
glocke
Junior Member
 
Location: New Jersey

Join Date: Mar 2011
Posts: 5
Default

I contacted the SRA, who told me that they only received forward reads of length 50. It looks like I'll have to contact the authors of the paper. I'll post here when I solve the problem.

Thanks for your help, everyone!
Old 03-18-2011, 01:29 PM   #9
sinaian
Junior Member
 
Location: Boston

Join Date: Jan 2011
Posts: 4
Default

For each read, the first 26 characters are the F3 tag; the last 25 characters, prefixed with a 'G', are the R3 tag.
Old 03-21-2011, 08:55 AM   #10
glocke
Junior Member
 
Location: New Jersey

Join Date: Mar 2011
Posts: 5
Default

sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into DNA space (from colorspace), then separates the forward and back reads.

That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! Can anyone think of a better way?

Also, sinaian, is there any documentation for this format of reads?
Old 03-21-2011, 10:04 AM   #11
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

BWA does colorspace fastq like this BUT (IMPORTANT) the target genome must have been indexed (created) using the "-c" parameter, for example: "bwa index -c hg1xORwhatever.fasta". If bwa complains, use the additional parameters "-a bwtsw".

Then run bwa using the -c parameter to align. THIS "-c/colorspace" MUST BE DONE ON BOTH THE INDEXING AND ALIGNMENT COMMANDS!!!!!
Old 03-21-2011, 10:45 AM   #12
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default

For BWA, you'd have to strip the T, convert 0 to A, 1 to C, 2 to G, 3 to T and anything else to N. That's pseudo-nucleotide representation of color space, which was introduced as a workaround. It has nothing to do with the actual decoding into nucleotide space!
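As a rough illustration of that pseudo-nucleotide conversion, here is a minimal Python sketch. The function name `to_pseudo_nucleotides` is made up for this example; it only applies the character mapping described above and does no real decoding.

```python
PSEUDO = {"0": "A", "1": "C", "2": "G", "3": "T"}


def to_pseudo_nucleotides(cs_read: str) -> str:
    """Strip the leading primer base and rewrite colors as pseudo-nucleotides.

    0 -> A, 1 -> C, 2 -> G, 3 -> T, anything else (e.g. '.') -> N.
    The result is still color space, just spelled with letters.
    """
    colors = cs_read[1:]  # drop the primer base (the leading T or G)
    return "".join(PSEUDO.get(c, "N") for c in colors)
```

If the quality string carries a placeholder character for the primer base, as in the files above, it has to be trimmed by one character as well so that sequence and quality lengths still agree.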

But BWA would map these reads as if they were 50 bp reads, not mate pairs.
You'll just have to write a script that separates the colorspace data and writes them to 2 files: one containing the F3
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
T3232213330000233003100102
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
21(()+%'+%40*.%%**)&%&*&%
...
the other containing the R3
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
G2230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
%%&%%%%%%%%%%%%%%%(+%%%%'
...
It's probably also necessary to get rid of the whitespace in the lines containing SRR015241.1.
The T / G primer has no quality value, so the quality line is 25 characters instead of 26.
You can then map these files separately with either BFAST or Novoalign (should work directly on this format) or BWA (after conversion into pseudo-nucleotides).
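A separation script along those lines could look like the Python sketch below. It is untested against the real SRA files and rests on assumptions taken from this thread: four-line FASTQ records, one primer base followed by 25 F3 colors and 25 R3 colors, a placeholder quality character for the primer, and 'G' as the R3 primer base. All function names are made up for the example.

```python
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (name, sequence, quality) from plain four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line only repeats the name; skip it
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def split_record(name: str, seq: str, qual: str,
                 tag_len: int = 25, r3_primer: str = "G") -> Tuple[Record, Record]:
    """Split one concatenated read into its F3 and R3 halves, in color space."""
    if len(seq) != 1 + 2 * tag_len or len(qual) != len(seq):
        raise ValueError(f"unexpected record layout for {name!r}")
    # Drop the trailing ' length=NN' field and remove whitespace from the name.
    base = name.split(" length=")[0].replace(" ", "_")
    r3_name = base[:-2] + "R3" if base.endswith("F3") else base + "_R3"
    # Skip the primer base and its placeholder quality character.
    colors, quals = seq[1:], qual[1:]
    f3 = (base, seq[0] + colors[:tag_len], quals[:tag_len])
    r3 = (r3_name, r3_primer + colors[tag_len:], quals[tag_len:])
    return f3, r3


def write_record(handle: TextIO, record: Record) -> None:
    name, seq, qual = record
    handle.write(f"@{name}\n{seq}\n+{name}\n{qual}\n")


def split_fastq(src: TextIO, f3_out: TextIO, r3_out: TextIO,
                tag_len: int = 25) -> None:
    """Write the F3 halves to f3_out and the R3 halves to r3_out."""
    for name, seq, qual in read_fastq(src):
        f3, r3 = split_record(name, seq, qual, tag_len)
        write_record(f3_out, f3)
        write_record(r3_out, r3)
```

To use it, open the downloaded FASTQ for reading and two output files for writing, then pass the three handles to `split_fastq`. Records that do not have the expected length raise a `ValueError` rather than being split silently, which seems safer while the format itself is still in question.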
Old 03-21-2011, 11:25 AM   #13
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

ALSO ... bwa provides solid2fastq.pl and solid2fastq2.pl programs for converting to a fastq suitable for input to "bwa aln"
Old 03-28-2011, 08:00 PM   #14
sinaian
Junior Member
 
Location: Boston

Join Date: Jan 2011
Posts: 4
Angry

Quote:
Originally Posted by glocke View Post
sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to make a script that maps the reads into DNA space (from colorspace), then separates the forward and back reads.

That would be relatively easy, but it seems like a pretty bad idea to map into DNA space before mapping! Can anyone think of a better way?

Also, sinaian, is there any documentation for this format of reads?
Sorry for the late reply. You'll have to separate the two parts in color space before doing anything else to them. I am not aware of any tool or documentation, and my suggestion is only based on my limited trial and error. Honestly I am STUNNED by the way SRA handles SOLiD paired reads.
All times are GMT -8.