#1
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi everyone,
I have a question about paired-end RNA-seq datasets on SRA. For some paired-end datasets the sequence files are named like SRR0011_1 and SRR0011_2, which shows that they hold paired sequences. For other datasets I can't find that information, and the read length in each file appears to be twice the length of a single RNA-seq read mentioned in the paper. So do these datasets combine two paired sequences? Thank you.
#2
Member
Location: Stanford, CA Join Date: May 2010
Posts: 88
What do you mean by "so do these datasets combined two paired sequences ?" That doesn't quite make sense.
Are you asking how to tell whether two files come from paired-end reads when that information has been lost?
__________________
SpliceMap: De novo detection of splice junctions from RNA-seq
#3
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi john,
I have checked some of the paired-end read files. One read in the file looks like this: Quote:
Yes, I am asking how to find the paired-end information. Here is an example link: http://www.ncbi.nlm.nih.gov/sra/SRX017794?report=full Thank you
#4
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
The srf file format (which is how Illumina data is submitted to the SRA) stores all bases for a spot (cluster) as a single string. Meta information, also stored in the srf file, indicates which portions of that string represent read1 and read2 if it is a paired read (as well as which portion is the index if an MID protocol is run, etc.). When a FASTQ file is extracted from the srf, the user must indicate whether they want the read split into its parts or kept as a single string. Your example looks like the FASTQ output you would get when you don't specify splitting the output into reads.
In the example you provided there are two possibilities. The srf file may be malformed: it does not properly indicate that the data came from a paired-end method and that each spot represents two reads. Alternatively, the NCBI may not be properly splitting the data when it creates the FASTQ files. I suggest that you contact the SRA help desk with your question: sra@ncbi.nlm.nih.gov
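As a rough illustration of the spot layout described above, the following Python sketch cuts one concatenated spot string into labelled parts. The `Layout` type, the `split_spot` function, and the segment labels are hypothetical names chosen for this example; they are not the srf metadata format or the API of any SRA tool.

```python
from typing import Dict, List, Tuple

# Hypothetical layout description: (label, 0-based start, length) per segment.
# The real srf/SRA metadata encoding is not reproduced here.
Layout = List[Tuple[str, int, int]]


def split_spot(bases: str, layout: Layout) -> Dict[str, str]:
    """Cut one concatenated spot string into its labelled segments."""
    segments = {}
    for label, start, length in layout:
        end = start + length
        if end > len(bases):
            raise ValueError(f"segment {label!r} runs past the end of the spot")
        segments[label] = bases[start:end]
    return segments


# Toy data: a 152-base spot made of two 76-base mates.
spot = "A" * 76 + "C" * 76
paired_layout = [("read1", 0, 76), ("read2", 76, 76)]
parts = split_spot(spot, paired_layout)
print(len(parts["read1"]), len(parts["read2"]))
```

If the layout wrongly describes the spot as a single segment, the same function returns the whole 152-base string, which matches the unsplit FASTQ output discussed in this thread.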
#5
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi kmcarr,
I will send an email to SRA. Thanks for your help.
#6
Junior Member
Location: Germany Join Date: Mar 2010
Posts: 9
syslm01, have you received an answer from SRA? I want to analyze the same dataset...
#7
Member
Location: Germany Join Date: Apr 2010
Posts: 19
syslm01, I found the same issue in the same datasets.
I also thought that the two mates of each read might be concatenated. I ran some quality control on the reads and it confirmed this (I can send you the results if you want). I then wrote a script that divides the reads into two files, "*_1.fastq" and "*_2.fastq", so that I could use the tophat/cufflinks pipeline. However, I still have some concerns about these data: the quality of the reads presents some strange properties, and the read length that the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
#8
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi pascal and fennan,
I received a reply from SRA. Here it is:
In the case with SRX017794 and runs SRR037945 and SRR037946 we had a situation when SPOT_DESCRIPTOR was incorrect. To reload data - we need to get fixed srf files from original submitter (that may be impossible) or develop internal way to fix such data set, it will take some time as well. I recommend to split data by yourself for now.
I also separated the file into two files myself. I found that some of these reads are 75 bp and some are 76 bp, and I have no idea why this happens.
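One way to see how many reads have each length is to tally sequence lengths over a FASTQ file. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines). The name `read_length_counts` is made up for this example, and the file name in the usage comment is a placeholder.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count sequence lengths in a stream of four-line FASTQ records."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record holds the bases
            counts[len(line.rstrip("\r\n"))] += 1
    return counts


# Usage with a file (placeholder name):
# with open("reads_1.fastq") as handle:
#     for length, n in sorted(read_length_counts(handle).items()):
#         print(length, n)
```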
#9
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Quote:
If you are going to split the 152 nt reads manually, do as stated above: nt 1-75 for read 1 and nt 77-151 for read 2. Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?
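A minimal Python sketch of that manual split, assuming four-line FASTQ records that are all 152 nt long. The function names, the `/1` and `/2` name suffixes, and the file names in the usage comment are assumptions made for illustration; they are not taken from the thread or from any SRA tool.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (read id, bases, qualities)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (read id, bases, qualities) from four-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator consumes four lines per record.
    for header, bases, _plus, quals in zip(it, it, it, it):
        read_id = header[1:].split()[0]  # drop '@' and any description text
        yield read_id, bases.rstrip("\r\n"), quals.rstrip("\r\n")


def split_mates(record: Record, mate_len: int = 76, keep: int = 75) -> Tuple[Record, Record]:
    """Split a concatenated record into nt 1-75 (mate 1) and nt 77-151 (mate 2)."""
    read_id, bases, quals = record
    if len(bases) != 2 * mate_len or len(quals) != 2 * mate_len:
        raise ValueError(f"{read_id}: expected {2 * mate_len} bases and qualities")
    mate1 = (read_id + "/1", bases[:keep], quals[:keep])
    mate2 = (read_id + "/2", bases[mate_len:mate_len + keep], quals[mate_len:mate_len + keep])
    return mate1, mate2


def format_fastq(record: Record) -> str:
    """Render one record as four FASTQ lines."""
    read_id, bases, quals = record
    return f"@{read_id}\n{bases}\n+\n{quals}\n"


# Usage (placeholder file names):
# with open("reads.fastq") as src, \
#         open("reads_1.fastq", "w") as out1, \
#         open("reads_2.fastq", "w") as out2:
#     for record in parse_fastq(src):
#         mate1, mate2 = split_mates(record)
#         out1.write(format_fastq(mate1))
#         out2.write(format_fastq(mate2))
```

Writing both mates in the same loop keeps the two output files in the same read order, which paired-end aligners generally expect.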
#10
Member
Location: china Join Date: Apr 2010
Posts: 16
Quote:
Did you use the datasets to run tophat and cufflinks? Were the results the same as the sam files they provided? I tried, but my result is different.
#11
Member
Location: Germany Join Date: Apr 2010
Posts: 19
@kmcarr
Thank you very much for the information. It really is what I was looking for. The thing is that you cannot download the srf file, only the fastq, and that's why I need to split it manually. Quote:
Thanks again for your help. Last edited by fennan; 05-27-2010 at 04:06 AM.
#12
Member
Location: Germany Join Date: Apr 2010
Posts: 19
Quote:
However, in the header of the sam file you can find the command used to create the mapping. Take a look at it; maybe it will help you figure out how things should be done. Unfortunately, this is not the case for the cufflinks output. I think it would be very useful if cufflinks stored the command line used to create its outputs (maybe it already does, and I just haven't found where).
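In the SAM format, program information lives in `@PG` header lines, where the `CL` tag holds the command line when the tool records it. The sketch below collects those lines; `program_records` is a name invented for this example, and whether a given file carries a `CL` tag depends on the tool and version that wrote it.

```python
from typing import Dict, Iterable, List


def program_records(sam_lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect @PG header lines as tag -> value dictionaries."""
    records = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if line.startswith("@PG"):
            fields = line.rstrip("\r\n").split("\t")[1:]
            records.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return records


# Usage (placeholder file name):
# with open("accepted_hits.sam") as handle:
#     for record in program_records(handle):
#         print(record.get("ID"), record.get("CL", "(no command line stored)"))
```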
#13
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi fennan,
I checked their command line. They use mm9+wold_spikes as the reference and provide tophat with the junction file pooled_200bp_frags.juncs. I'm not sure what these files are, and I think they may cause the differences. Do you have any idea? Please tell me if you know how to deal with the raw data. Thank you very much.
#14
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi,
I am also not sure about another dataset: ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX019/SRX019275 SRR039999_1.fastq.gz and SRR039999_2.fastq.gz are paired reads, but I am not sure about SRR039999.fastq.gz. Does it also belong to SRR039999? I can't find any paired-end information for it. Does anyone have experience with this kind of data? Thanks
#15
Junior Member
Location: San Diego Join Date: Feb 2010
Posts: 4
Hi Folks,
I feel lucky to have found this thread because I have been struggling with the same problems. After splitting the unusual FASTQ files, my TopHat results are still quite different from what was reported in the recently published paper. Can you tell me where to find the provided SAM file? I want to try the reported command line. Thanks a lot, Yi-Shiou
#16
Member
Location: china Join Date: Apr 2010
Posts: 16
Quote:
You can find the SAM file in their supplemental material online.
#17
Junior Member
Location: San Diego Join Date: Feb 2010
Posts: 4
Hi syslm01,
I have looked at the supplementary information page many times but didn't find any SAM file. Did I miss something obvious, or am I looking in the wrong place? Thanks again, Yi-Shiou
#18
Member
Location: china Join Date: Apr 2010
Posts: 16
Hi ychen,
Here is the link: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE20846 GSE20846_RAW.tar contains the SAM and gtf files.
#19
Junior Member
Location: San Diego Join Date: Feb 2010
Posts: 4
Hi syslm01,
Thanks so much for your help, I really appreciate it. Yi-Shiou
#20
Junior Member
Location: Germany Join Date: Mar 2010
Posts: 9
Hi,
I split the fastq files with fastx into the original paired-end files (positions 1-75 and 77-151) and ran TopHat with default settings. In the paper they used: Quote:
I loaded the results into IGV and took a look. They are very different in many ways:
1. The coverage seems to be lower at almost every position. accepted_hits.sam contains about half as many entries as the result from the paper.
2. My base phred quality varies, while the quality in the paper's file is 40 everywhere.
3. Many reads could not be mapped as pairs (in IGV: "Pair is mapped = No").
I did run TopHat with different parameters, but the results should not differ that much, should they? Maybe I missed something when I separated the fastq file into two? Does anyone know how to generate such a splice junction file? Many thanks for any advice.
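To put numbers on the entry counts and pairing status instead of judging them by eye in IGV, the FLAG column of the SAM file can be tallied. The sketch below uses the flag bits defined by the SAM format (0x1 paired, 0x2 proper pair, 0x4 unmapped, 0x8 mate unmapped). `pairing_summary` and its counter keys are names invented here, and it is an assumption that the aligner version that wrote the file sets these bits consistently.

```python
from collections import Counter
from typing import Iterable

# FLAG bits as defined by the SAM format.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pairing_summary(sam_lines: Iterable[str]) -> Counter:
    """Tally alignment records by pairing status using the FLAG column."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t", 2)[1])
        counts["records"] += 1
        if flag & UNMAPPED:
            counts["unmapped"] += 1
        if flag & PAIRED:
            counts["paired"] += 1
            if not flag & (UNMAPPED | MATE_UNMAPPED):
                counts["both_mates_mapped"] += 1
            if flag & PROPER_PAIR:
                counts["proper_pair"] += 1
    return counts


# Usage (placeholder file name):
# with open("accepted_hits.sam") as handle:
#     print(dict(pairing_summary(handle)))
```

Running the same tally on both SAM files gives directly comparable counts for the "half as many entries" and "Pair is mapped = No" observations.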