SEQanswers

Old 05-21-2010, 07:43 AM   #1
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16
About SRA paired datasets

Hi everyone,

I have a question about paired-end RNA-seq datasets on SRA. Some paired-end datasets provide sequence files named like SRR0011_1 and SRR0011_2, which indicates that the reads are paired. For other datasets I cannot find this information, and the read length in each file appears to be twice the length of a single RNA-seq read reported in the paper.

So, have these datasets combined the two paired sequences into one?

Thank you.
Old 05-23-2010, 07:29 PM   #2
john_mu
Member
 
Location: Stanford, CA

Join Date: May 2010
Posts: 88

What do you mean by your last question, about the datasets combining two paired sequences? That doesn't quite make sense.

Are you asking how to tell if two files come from paired-end reads, if that information was lost?
Old 05-23-2010, 07:44 PM   #3
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi john,
I checked some of the paired-end read files; one read in the file looks like this:
Quote:
@SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
NNNANNNNNNNATCTCTTTAGATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGAAGAAACCTCTGATCCACCTCTAATACATCATTTATTTTTTTTATATTTATATATATGTAAAAAGATATAAAAACAAAGAAG
+SRR037945.1 HWUSI-EAS627_1:2:1:0:1629 length=152
!!!#!!!!!!!#############################################################################################################################################
The sequence length is 152 bp, and I know their RNA-seq reads are 75 bp, so I wonder whether the two paired-end reads were joined together.
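
A quick way to check this suspicion across a whole file is to tabulate the read lengths. The following is only a minimal sketch, assuming plain four-line FASTQ records; the function names `read_fastq` and `length_histogram` are made up for illustration and do not come from any SRA tool.

```python
import itertools
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in itertools.islice(it, 4)]
        if len(record) < 4:  # end of input; an incomplete last record is dropped
            return
        yield record[0], record[1], record[3]


def length_histogram(lines: Iterable[str]) -> Counter:
    """Count how many reads have each sequence length."""
    return Counter(len(seq) for _, seq, _ in read_fastq(lines))


# Toy record with 152 bases, mimicking the SRR037945 example above.
demo = ["@r1 length=152\n", "N" * 152 + "\n", "+\n", "!" * 152 + "\n"]
hist = length_histogram(demo)
print(hist)
```

If every read in a run reported as 75 bp turns out to be 152 bases long, that is consistent with two mates stored as one string. For a real file, pass an open file handle instead of the toy list.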

Yes, I am asking how to find the paired-end information.

Here is an example link: http://www.ncbi.nlm.nih.gov/sra/SRX017794?report=full

Thank you
Old 05-24-2010, 05:37 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

The SRF file format (which is how Illumina data is submitted to the SRA) stores all bases for a spot (cluster) as a single string. Meta information also stored in the SRF file indicates which portions of that string represent read 1 and read 2 if it is a paired read (as well as which portion is the index if an MID protocol is run, etc.). When a FASTQ file is extracted from the SRF, the user must indicate whether they want the read split into its parts or the entire read as a single string. Your example looks like the FASTQ output you would get when you don't specify splitting the output into reads.
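
To make the idea concrete, here is a toy model of a spot: one string of bases plus metadata describing which slice belongs to which read. This is purely illustrative. The `Spot` and `Segment` classes are hypothetical and do not reflect the actual SRF on-disk layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    name: str    # e.g. "read1" or "read2"
    start: int   # 0-based offset into the spot string
    length: int  # number of bases in this segment


@dataclass
class Spot:
    bases: str               # all bases of the cluster as one string
    segments: List[Segment]  # metadata describing how to cut the string

    def split(self) -> Dict[str, str]:
        """Return each segment as its own sequence (the 'split' output)."""
        return {s.name: self.bases[s.start:s.start + s.length]
                for s in self.segments}

    def joined(self) -> str:
        """Return the whole spot as a single string (the 'unsplit' output)."""
        return self.bases


# Toy spot: 4 bases of read 1 followed by 4 bases of read 2.
spot = Spot("AAAATTTT", [Segment("read1", 0, 4), Segment("read2", 4, 4)])
print(spot.split())
print(spot.joined())
```

If the segment metadata is missing or wrong, only the joined form can be produced, which matches what the example record looks like.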

In the example you provided there are two possibilities. The SRF file may be malformed: it does not properly indicate that the data came from a paired-end method and that each string represents two reads. Alternatively, the NCBI may not be properly splitting the data when it creates the FASTQ files.

I suggest that you contact the SRA help desk with your question: sra@ncbi.nlm.nih.gov
Old 05-24-2010, 05:50 AM   #5
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi kmcarr,

I will send an email to SRA.

Thanks for your help.
Old 05-26-2010, 07:36 AM   #6
pascal
Junior Member
 
Location: Germany

Join Date: Mar 2010
Posts: 9

syslm01, have you received an answer from SRA? I want to analyze the same dataset...
Old 05-26-2010, 07:54 AM   #7
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19

syslm01, I found the same issue in the same datasets.

I also thought that the two mates of each read might be concatenated. I ran some quality-control checks on the reads and they confirmed it (I can send you the results if you want). I then wrote a script that splits the reads into two files, "*_1.fastq" and "*_2.fastq", so that I can use the tophat/cufflinks pipeline.

However, I still have some concerns about these data, since the quality of the reads presents some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
Old 05-26-2010, 08:25 AM   #8
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi pascal and fennan,

I received a letter from SRA. Here is the reply:

In the case of SRX017794 and runs SRR037945 and SRR037946, we had a situation where the SPOT_DESCRIPTOR was incorrect.
To reload the data, we need to get fixed srf files from the original submitter (which may be impossible) or develop an internal way to fix such data sets, which will take some time as well.
I recommend that you split the data yourself for now.

I also separated the file into two files myself. I found that some of these reads are 75 bp and some are 76 bp; I have no idea why this happens.
Old 05-26-2010, 10:01 AM   #9
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by fennan
However, I still have some concerns about these data, since the quality of the reads presents some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
For Illumina sequencing it is normal to collect one additional cycle of data for each read; that is, if the final read length you want is 75 nt then you will collect 76 cycles of data, but the base from the last cycle is not reported. (This has to do with phasing/prephasing correction. To correct for phasing in cycle n you need data from cycle n+1; thus the last cycle can never have phasing correction applied to it, so standard procedure is to trim it off.) To collect 2 X 75 nt paired-end reads you would want 152 cycles (2 X 76). If the SRF file had been properly formed, the command line option "--use_bases Y75n,Y75n" would have been used. This would signify that within the 152 cycles of raw data, cycles 1-75 are read 1, cycle 76 is to be ignored, cycles 77-151 are read 2 and cycle 152 is ignored. When FASTQ is output from the SRF file (e.g. by the program srf2fastq), it would split the data into separate fastq files for reads 1 and 2.
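
The cycle arithmetic behind a mask like "Y75n,Y75n" can be spelled out with a few lines of Python. This is just a sketch: `parse_use_bases` is a hypothetical helper that handles only the simple letter-plus-optional-count syntax shown here (Y = keep, n = ignore), not the full syntax of any Illumina tool.

```python
import re
from typing import List, Tuple


def parse_use_bases(mask: str) -> List[List[Tuple[str, int, int]]]:
    """Expand a mask such as 'Y75n,Y75n' into 1-based inclusive cycle ranges.

    Returns one list per read; each entry is (code, first_cycle, last_cycle),
    where code 'Y' means keep and 'n' means ignore.
    """
    cycle = 1
    reads = []
    for part in mask.split(","):
        ranges = []
        for code, count in re.findall(r"([Yn])(\d*)", part):
            n = int(count) if count else 1  # a bare letter covers one cycle
            ranges.append((code, cycle, cycle + n - 1))
            cycle += n
        reads.append(ranges)
    return reads


layout = parse_use_bases("Y75n,Y75n")
for number, ranges in enumerate(layout, start=1):
    print(f"read {number}: {ranges}")
```

For the mask above this gives cycles 1-75 kept and 76 ignored for read 1, then 77-151 kept and 152 ignored for read 2.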

If you are going to split the 152 nt reads manually, do as stated above: nt 1-75 for read 1 and nt 77-151 for read 2.
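
A manual split along those lines could be sketched in Python roughly as follows. It assumes every record is exactly 152 bases long and uses four-line FASTQ. The names `split_record` and `split_fastq` are hypothetical, and the "/1" and "/2" suffixes are a common mate-naming convention rather than anything the SRA requires, so adjust them to whatever your aligner expects.

```python
import io
import itertools
from typing import Iterable, TextIO, Tuple

READ_LEN = 75  # bases kept per mate
CYCLES = 76    # cycles collected per mate; the last one is discarded

Record = Tuple[str, str, str]  # (header, sequence, quality)


def split_record(header: str, seq: str, qual: str) -> Tuple[Record, Record]:
    """Split one concatenated 152-base record into its two 75-base mates."""
    if len(seq) != 2 * CYCLES or len(qual) != len(seq):
        raise ValueError(f"unexpected record length for {header!r}")
    name = header.split()[0]                   # e.g. '@SRR037945.1'
    second = slice(CYCLES, CYCLES + READ_LEN)  # nt 77-151 (1-based)
    mate1 = (f"{name}/1", seq[:READ_LEN], qual[:READ_LEN])  # nt 1-75
    mate2 = (f"{name}/2", seq[second], qual[second])
    return mate1, mate2


def split_fastq(lines: Iterable[str], out1: TextIO, out2: TextIO) -> int:
    """Write mate 1 and mate 2 records to out1/out2; return the record count."""
    it = iter(lines)
    count = 0
    while True:
        rec = [line.rstrip("\n") for line in itertools.islice(it, 4)]
        if len(rec) < 4:  # end of input; an incomplete last record is dropped
            return count
        mates = split_record(rec[0], rec[1], rec[3])
        for (h, s, q), out in zip(mates, (out1, out2)):
            out.write(f"{h}\n{s}\n+\n{q}\n")
        count += 1


# Toy input: 75 A's, one ignored base, 75 C's, one ignored base.
toy = ["@r.1 x length=152\n", "A" * 75 + "N" + "C" * 75 + "N\n",
       "+\n", "I" * 152 + "\n"]
out1, out2 = io.StringIO(), io.StringIO()
n_split = split_fastq(toy, out1, out2)
print(n_split, "record(s) split")
```

On real data you would pass open file handles, for example the input FASTQ opened for reading and two output files opened for writing, in place of the toy list and the `StringIO` buffers. Records of any other length raise an error instead of being split silently.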

Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?
Old 05-27-2010, 03:59 AM   #10
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Quote:
Originally Posted by fennan
syslm01, I found the same issue in the same datasets.

I also thought that the two mates of each read might be concatenated. I ran some quality-control checks on the reads and they confirmed it (I can send you the results if you want). I then wrote a script that splits the reads into two files, "*_1.fastq" and "*_2.fastq", so that I can use the tophat/cufflinks pipeline.

However, I still have some concerns about these data, since the quality of the reads presents some strange properties, and I also saw that the read length the original authors report in the sam and gtf files is 75 instead of the 76 I found in the raw data... Any thoughts on that?
Hi,

Did you run tophat and cufflinks on these datasets? Were the results the same as their provided sam files? I tried, but my results are different.
Old 05-27-2010, 04:27 AM   #11
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19

@kmcarr
Thank you very much for the information. It really is what I was looking for. The thing is that you cannot download the srf file, only the fastq, which is why I need to split it manually.

Quote:
Could you provide some more details on what you mean by "the quality of the reads presents some strange properties"?
I have produced some quality-control graphs from the raw data and can provide them if you are interested. What caught my attention most was the difference in quality between the first and the second read, as well as the low quality of the base T in the second read. You can see an example in the attached image. It shows the mean base quality per position (T is the blue line) and was generated from the file "SRR037945.fastq" of the run "SRX017794" (similar graphs are obtained for most of the other fastq files). Do you have any idea why this is happening?
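
For anyone who wants to reproduce this kind of plot, the underlying numbers could be computed along these lines. This is a rough sketch, not the script used for the attached image. It assumes Phred+33 quality encoding (which the '!' and '#' characters in the example record earlier in the thread suggest), and `mean_quality_by_base` is a made-up name.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

PHRED_OFFSET = 33  # assumption: Phred+33 encoded qualities


def mean_quality_by_base(lines: Iterable[str]) -> Dict[str, List[Optional[float]]]:
    """Return {base: [mean quality at position 0, 1, ...]}.

    A position where a base never occurs is reported as None.
    """
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(lambda: defaultdict(int))
    longest = 0
    it = iter(lines)
    while True:
        rec = [line.rstrip("\n") for line in itertools.islice(it, 4)]
        if len(rec) < 4:  # end of input
            break
        seq, qual = rec[1], rec[3]
        longest = max(longest, len(seq))
        for pos, (base, q) in enumerate(zip(seq, qual)):
            totals[base][pos] += ord(q) - PHRED_OFFSET
            counts[base][pos] += 1
    return {
        base: [
            totals[base][pos] / counts[base][pos] if counts[base].get(pos) else None
            for pos in range(longest)
        ]
        for base in counts
    }


# Two toy reads: 'I' encodes quality 40 and '#' encodes quality 2.
toy_reads = ["@a\n", "AT\n", "+\n", "I#\n",
             "@b\n", "TT\n", "+\n", "##\n"]
profile = mean_quality_by_base(toy_reads)
print(profile)
```

Plotting one line per base against position, on the unsplit 152-base reads, would show whether a drop for T is confined to positions 77-151, i.e. the second mate.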

Thanks again for your help.
Attached image: basesQualities.jpg (mean base quality per position)

Last edited by fennan; 05-27-2010 at 05:06 AM.
Old 05-27-2010, 04:35 AM   #12
fennan
Member
 
Location: Germany

Join Date: Apr 2010
Posts: 19

Quote:
Originally Posted by syslm01
Hi,

Did you run tophat and cufflinks on these datasets? Were the results the same as their provided sam files? I tried, but my results are different.
That was what I wanted to do at first. I haven't done it yet since I wasn't sure how to deal with the raw data.

However, in the header of the sam file you can find the command used to create the mapping. Take a look at it; maybe it will help you figure out how things should be done. Unfortunately, this is not the case for the cufflinks output. I think it would be very useful if cufflinks stored the command line used to create its outputs (maybe it already does, and I just haven't found where).
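
If the command is recorded in a standard `@PG` header line with a `CL:` tag, as the SAM specification allows, it can be pulled out with a few lines of Python. This is only a sketch under that assumption; the toy header below is invented, and `program_commands` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def program_commands(sam_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (program ID, command line) for every @PG header line."""
    commands = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if not line.startswith("@PG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        fields = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        commands.append((fields.get("ID", ""), fields.get("CL", "")))
    return commands


# Invented toy header followed by one alignment line.
toy_sam = [
    "@HD\tVN:1.0\tSO:sorted\n",
    "@PG\tID:TopHat\tCL:tophat -p 8 -r 50 index reads_1.fq reads_2.fq\n",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
found = program_commands(toy_sam)
print(found)
```

If nothing is found, the command may be stored in a comment (`@CO`) line or not at all, so it is worth looking at the raw header as well.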
Old 05-27-2010, 05:39 AM   #13
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi fennan,

I checked their command line. They use mm9+wold_spikes as the reference and provide tophat with the junction file pooled_200bp_frags.juncs. I'm not sure what these files are, and I think they may cause the differences. Do you have any idea?

Please tell me once you are sure how to deal with the raw data.

Thank you very much.
Old 05-27-2010, 08:36 AM   #14
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi,

I am also not sure about another dataset: ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX019/SRX019275
SRR039999_1.fastq.gz and SRR039999_2.fastq.gz are paired reads, but I am not sure about SRR039999.fastq.gz. Does it also belong to SRR039999? I can't find any paired-end information for it.

Does anyone have experiences with this kind of data?

Thanks
Old 05-29-2010, 08:07 AM   #15
ychen
Junior Member
 
Location: San Diego

Join Date: Feb 2010
Posts: 4

Hi Folks,

I feel lucky to have found this thread because I have been struggling with the same problems. After splitting the unusual FASTQ files, my TopHat results are still quite different from what is reported in the recently published paper. Can you tell me where to find the provided SAM file? I want to try the reported command line.

Thanks a lot,

Yi-Shiou
Old 05-29-2010, 08:34 AM   #16
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Quote:
Originally Posted by ychen
Hi Folks,

I feel lucky to have found this thread because I have been struggling with the same problems. After splitting the unusual FASTQ files, my TopHat results are still quite different from what is reported in the recently published paper. Can you tell me where to find the provided SAM file? I want to try the reported command line.

Thanks a lot,

Yi-Shiou
Hi,

You can find the SAM file in their supplemental material online.
Old 05-29-2010, 09:17 AM   #17
ychen
Junior Member
 
Location: San Diego

Join Date: Feb 2010
Posts: 4

Hi syslm01,

I have looked at the supplementary information page many times but didn't find any SAM file. Did I miss something very obvious, or am I just looking in the wrong place?

Thanks again,

Yi-Shiou
Old 05-29-2010, 09:59 AM   #18
syslm01
Member
 
Location: china

Join Date: Apr 2010
Posts: 16

Hi ychen,

Here is the link: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE20846
GSE20846_RAW.tar contains the SAM and gtf files.
Old 05-29-2010, 10:51 AM   #19
ychen
Junior Member
 
Location: San Diego

Join Date: Feb 2010
Posts: 4

Hi syslm01,

Thanks so much for your help, I really appreciate it.


Yi-Shiou
Old 06-04-2010, 03:22 AM   #20
pascal
Junior Member
 
Location: Germany

Join Date: Mar 2010
Posts: 9

Hi,

I split the fastq files with fastx into the original paired-end files (positions 1-75 and 77-151) and ran TopHat with default settings. In the paper they used:

Quote:
tophat -p 8 -F 0.0 -r 50 -m 1 --no-novel-juncs -j ../../pooled_200bp_frags.juncs -o pooled_tophat2 -a 8 /fs/szasmg3/cole/ebwts/m_musculus/mm9/fast_mm9/mm9+wold_spikes s1_1.query75.txt,s2_1.query75.txt s1_2.query75.txt,s2_2.query75.txt
So the differences in my run are that I left out "--no-novel-juncs" and "-j ../../pooled_200bp_frags.juncs". Furthermore, I used the default bowtie mm9 reference.

I loaded the results into IGV and took a look at them. They are very different in several ways.
1. The coverage seems to be lower at almost every position. accepted_hits.sam contains about half as many entries as the result from the paper.
2. My base Phred qualities vary, while the quality in the paper's file is 40 everywhere.
3. Many reads could not be mapped as pairs (in IGV: "Pair is mapped = No").
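
Points 1 and 3 can be put into numbers by reading the FLAG column of both SAM files. The sketch below relies only on the FLAG bits defined in the SAM specification (0x1 paired, 0x2 properly paired, 0x8 mate unmapped); whether a given TopHat version sets them reliably is something to verify. `pairing_summary` is a made-up helper and the three alignment lines are toy data.

```python
from typing import Dict, Iterable

PAIRED = 0x1         # read is part of a pair
PROPER_PAIR = 0x2    # both mates aligned as a proper pair
MATE_UNMAPPED = 0x8  # the other mate is unmapped


def pairing_summary(sam_lines: Iterable[str]) -> Dict[str, int]:
    """Count alignment records and how many carry each pairing flag."""
    summary = {"records": 0, "paired": 0, "proper_pair": 0, "mate_unmapped": 0}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        summary["records"] += 1
        summary["paired"] += bool(flag & PAIRED)
        summary["proper_pair"] += bool(flag & PROPER_PAIR)
        summary["mate_unmapped"] += bool(flag & MATE_UNMAPPED)
    return summary


# Toy alignments with FLAG 99 (proper pair), 73 (mate unmapped), 0 (unpaired).
toy_alignments = [
    "@HD\tVN:1.0\n",
    "r1\t99\tchr1\t100\t255\t4M\t=\t200\t104\tACGT\tIIII\n",
    "r2\t73\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tchr1\t500\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
stats = pairing_summary(toy_alignments)
print(stats)
```

Running this on your accepted_hits.sam and on the SAM file from the paper gives directly comparable record counts and pairing fractions.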

I did run TopHat with different parameters, but should the results be that different? Maybe I missed something when I separated the fastq file into two? Does anyone know how to generate such a splice junction file?

Many thanks for any advice.