#1 |
Junior Member
Location: New Jersey Join Date: Mar 2011
Posts: 5
Hi, this is my first post!
I've downloaded some ChIP-Seq data from the SRA ( http://www.ncbi.nlm.nih.gov/sra/SRX000425?report=full# ). The originating paper ( http://www.ncbi.nlm.nih.gov/pubmed/18477713 ) says that the reads are SOLiD paired-end, with 25 bp from each end, but the reads themselves are 52 bases long! It seems that the two ends have been ligated and sequenced together. I am wondering where the extra two bases come from, but the larger problem is that all the alignment programs out there seem to expect paired-end reads to come as TWO lists of reads, one for each half of the pair.
(There are two archives attached to this SRA account, but I'm fairly certain that they're not paired. When you add up their read counts, you get the number of reads reported in the paper, and the reported number of matches is more than half that number. Besides, the read IDs don't appear to match, and the reads are too long.)
What tools can I use to map these reads to a reference genome?
For reference, here are some sample reads.
First file:
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
!,*+*()+*(%'+)%%%&%+&%%'%%%%%%%%%%%%%%%%%%%%'+%%%%%
@SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
T32023002222000323202022222323322200222200220003032
+SRR015241.3 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_369_F3 length=50
!(*)%%%+'%%%*%%%%&%%%%%%%%%%%%%%%%%%%%%%%%%%%+%%%%%
@SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
T32021200310022332200020032222332303202203222030030
+SRR015241.4 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_487_F3 length=50
!9)'+*)4')%&&%)%%('&%%'%'%%%%%%%)%%%%%%%%%%%%+%%%%(
Second file:
Code:
@SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
T03231223000321133333031113002130221200322111211011
+SRR015242.1 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_30_F3 length=50
!:9<3:99<*8;8<)0<;<%-8;2%%3*5%*.8<,1;6;*%&..'%%-*,%
@SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
T01032120003210102101003202002003021300310100313323
+SRR015242.2 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_68_F3 length=50
!<*7-;3291:/*0306/';'6<8&/;13'/,6%5&,''*+3--/+*4&%&
@SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
T30000002320022232001023330002000220231323302003320
+SRR015242.3 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_217_F3 length=50
!,%&'''---+5%%%*-(-2%37''%-&%%+(3-&%*%%'%*&2''%3.%%
@SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
T01301202310020020101002322020221212212112020001111
+SRR015242.4 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_312_F3 length=50
!52/601)1,&3:%5691*-':74),'%%%&%&+(*)&%)'&&,'&)*)*%
@SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
T30202333031100120210330331030310222032111001231300
+SRR015242.5 CLARA_20071207_2_CelmonAmp7797_8bit_1000_115_482_F3 length=50
!;<<<;;<<<<<1<<1;<56/<<:9:1;31;/<;%%/89/'99<'08)<%0
#2 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
The F3 is usually the first end, while the R3 is usually the second end (see the read name). Also, the read is in color space, hence the first T and the digits. Please review literature and this forum surrounding color space (what it is and what it looks like) and color space mapping (see BFAST/BWA/Mosaik/NovoalignCS/SHRiMP).
#3 |
Junior Member
Location: New Jersey Join Date: Mar 2011
Posts: 5
I grep'd the fastq files and there are no 'R3' strings in there. Why would the reads be 52 bases long when the paper says the run was 25 bases x 2? It seems likely that I failed to extract the reads from the sra-lite files correctly.
(It's color space, clearly. I'm just confused about why the first nucleotide is always T: inspecting the fastq files, I haven't found any reads starting with any other letter. I've read that ABI ligates some sort of primer in some cases, which could explain this.)
#4 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
The only way to decode a color space read is using the adapter (did you read up on color space?). If the run is 2x25 and each end has an adapter base added, then 2x(25+1) = 52, which coincidentally is the number of bases you observed. I am wondering if they have done some concatenation magic here.
Has anyone else seen this problem when downloading from the SRA? You could try emailing their support, since the above data is confusing.
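For readers unfamiliar with the decoding step mentioned in this post, here is a minimal Python sketch. It assumes the usual SOLiD two-base encoding, where each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function name `decode_colorspace` is made up for illustration and is not part of any tool named in this thread.

```python
# Assumed 2-bit base codes for SOLiD color space: A=0, C=1, G=2, T=3.
# A color is the XOR of the codes of two adjacent bases.
BASES = "ACGT"

def decode_colorspace(read: str) -> str:
    """Decode a color-space read such as 'T32' into nucleotides.

    The leading letter is the known primer (adapter) base. It anchors the
    decoding and is not included in the returned sequence.
    """
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            # A missing color call (e.g. '.') makes every later base unknown.
            decoded.append("N" * (len(read) - 1 - len(decoded)))
            break
        prev ^= int(color)          # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)

print(decode_colorspace("T32"))  # AG
```

Because every base depends on all the colors before it, a single color error corrupts the rest of the decoded read. That is the usual argument for aligning in color space instead of decoding first.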
#5 |
Junior Member
Location: New Jersey Join Date: Mar 2011
Posts: 5
am contacting SRA. thanks for your help.
#6 |
Guest
Posts: n/a
I'm pretty sure this paper is mate pair reads and not paired end.
I know it says paired end, but they are from circularized libraries, which means mate pair. Plus it is from 2008, and the reverse reads were not even around then. Also, I've never seen data in that format before.
#7 |
Nils Homer
Location: Boston, MA, USA Join Date: Nov 2008
Posts: 1,285
You will not see it again after the SRA shuts down.
The second end on mate pair data is the R3 tag (while F5 is the paired end tag). The first end should be F3, so it looks like just the first end is present, or you've downloaded fragments?
#8 |
Junior Member
Location: New Jersey Join Date: Mar 2011
Posts: 5
I contacted the SRA, who told me that they only received forward reads of length 50. It looks like I'll have to contact the authors of the paper. I'll post here when I solve the problem.
Thanks for your help, ppl!
#9 |
Junior Member
Location: Boston Join Date: Jan 2011
Posts: 4
For each read, the first 26 characters are F3; the last 25 characters, prefixed with a 'G', are R3.
#10 |
Junior Member
Location: New Jersey Join Date: Mar 2011
Posts: 5
sinaian: is there any alignment program that can read this format? If there's no native tool, it seems that I would have to write a script that converts the reads into DNA space (from color space), then separates the forward and reverse reads.
That would be relatively easy, but it seems like a pretty bad idea to convert to DNA space before mapping! Can anyone think of a better way? Also, sinaian, is there any documentation for this format of reads?
#11 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
BWA handles colorspace fastq like this, BUT (IMPORTANT) the target genome must have been indexed using the "-c" parameter, for example "bwa index -c hg1xORwhatever.fasta". If bwa complains, add the parameters "-a bwtsw".
Then run bwa using the -c parameter to align. THE "-c" (COLORSPACE) OPTION MUST BE GIVEN ON BOTH THE INDEXING AND ALIGNMENT COMMANDS!
#12 |
Senior Member
Location: Germany Join Date: May 2010
Posts: 101
For BWA, you'd have to strip the T, convert 0 to A, 1 to C, 2 to G, 3 to T and anything else to N. That's the pseudo-nucleotide representation of color space, which was introduced as a workaround. It has nothing to do with the actual decoding into nucleotide space!
But BWA would map these reads as if they were 50 bp reads, not mate pairs. You'll just have to write a script that separates the colorspace data and writes them to 2 files: one containing the F3
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
T3232213330000233003100102
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3
21(()+%'+%40*.%%**)&%&*&%
...
and one containing the R3
Code:
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
G2230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_R3
%%&%%%%%%%%%%%%%%%(+%%%%'
...
The T / G primer has no quality, so the quality line is 25 characters instead of 26. You can then map these files separately with either BFAST or Novoalign (should work directly on this format) or BWA (after conversion into pseudo-nucleotides).
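The splitting script described in this post could look roughly like the Python sketch below. It assumes the layout reported in this thread (one primer base, 25 F3 colors, then 25 R3 colors with an implied `G` primer) and that the quality string may start with a placeholder character for the primer base, as in the sample records. The names `split_record`, `to_pseudo_nucleotides` and `format_fastq` are illustrative and not part of any tool.

```python
F3_COLORS = 25  # colors belonging to the F3 tag, per the layout described in this thread

def split_record(name: str, seq: str, qual: str):
    """Split one concatenated record into (F3, R3) records.

    seq is the primer base followed by the colors; qual is its quality string.
    Returns two (name, sequence, quality) tuples.
    """
    colors = seq[1:]
    # Drop the placeholder quality of the primer base when it is present.
    quals = qual[1:] if len(qual) == len(seq) else qual
    base = name[:-3] if name.endswith("_F3") else name
    f3 = (base + "_F3", seq[0] + colors[:F3_COLORS], quals[:F3_COLORS])
    # The 'G' primer for R3 is an assumption taken from this thread.
    r3 = (base + "_R3", "G" + colors[F3_COLORS:], quals[F3_COLORS:])
    return f3, r3

def to_pseudo_nucleotides(seq: str) -> str:
    """Strip the primer base and map 0/1/2/3 to A/C/G/T, anything else to N."""
    return "".join("ACGT"[int(c)] if c in "0123" else "N" for c in seq[1:])

def format_fastq(record) -> str:
    """Render a (name, sequence, quality) tuple as a four-line FASTQ record."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+{name}\n{qual}\n"
```

Writing `format_fastq(f3)` and `format_fastq(r3)` to two separate files keeps the reads in color space for BFAST or Novoalign. `to_pseudo_nucleotides` follows the mapping given in this post; BWA's own solid2fastq.pl may treat the leading color differently, so check that script before relying on this conversion.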
#13 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
ALSO ... bwa provides solid2fastq.pl and solid2fastq2.pl programs for converting to a fastq suitable for input to "bwa aln"
#14 | |
Junior Member
Location: Boston Join Date: Jan 2011
Posts: 4
|
![]() Quote: