![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Can Cuffdiff treat paired-end and single-end reads at the same time? | zun | RNA Sequencing | 3 | 06-12-2012 06:37 PM |
Can paired-end mapping produce more reads than single-end ? | warrenemmett | Bioinformatics | 13 | 03-21-2012 12:10 AM |
Paired-end Bam from single-end aligned sam | ramouz87 | Bioinformatics | 4 | 08-17-2011 01:55 PM |
RNA-seq: Replicates, single-end, paired-end story | pasta | Bioinformatics | 2 | 07-05-2011 12:51 AM |
Does Cufflinks support single-end and paired end data together ? | ersenkavak | Bioinformatics | 1 | 10-22-2010 08:26 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Spain Join Date: Jul 2009
Posts: 133
|
![]()
Dear users,
I am running the same commands for a set of four paired-end lanes in tophat and cufflinks. But when running cufflinks on the tophat .bam output, some of the lanes are detected as single-end while others are correctly detected as paired-end. How could this be? I show you the commands for the first lane which is wrongly detected with cufflinks stdout $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1_QC_sequence.fq /Data/s_1_2_QC_sequence.fq $ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam [12:02:53] Inspecting reads and determining fragment length distribution. > Processed 35625 loci. [*************************] 100% > Map Properties: > Upper Quartile Mass: 444119.98 > Read Type: 105bp single-end > Fragment Length Distribution: Gaussian (default) > Estimated Mean: 217.34 > Estimated Std Dev: 65.65 [12:03:32] Assembling transcripts and estimating abundances. > Processed 35657 loci. [*************************] 100% Your comments are greatly appreciated. Thanks D. Edit: my program versions: TopHat (v1.1.4) Bowtie version: 0.12.7.0 Samtools version: 0.1.11.0 Cufflinks version: 0.9.2 Last edited by dnusol; 01-31-2011 at 04:21 AM. Reason: include version info |
![]() |
![]() |
![]() |
#2 |
Member
Location: Berkeley, CA Join Date: Jul 2010
Posts: 71
|
![]()
Are you sure that all of your paired fastq files have the same name for both reads?
|
![]() |
![]() |
![]() |
#3 | |
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279
|
![]()
From manual
Quote:
|
|
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Spain Join Date: Jul 2009
Posts: 133
|
![]()
Thanks adarob and John for your answers but that does not solve the error I am finding. I have renamed the files so that they are called s_1_1.fq and s_1_2.fq and I still get that cufflinks reports the reads as single-end. Furthermore this issue does not happen all the time because for other lanes with the same naming convention, i.e. s_N_1_QC_sequence.fq and s_N_"_QC_sequence.fq, cufflinks reports them as paired-end
So this is a new try modifying the name to follow "convention": $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1.fq /Data/s_1_2.fq $ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam [11:57:53] Inspecting reads and determining fragment length distribution. > Processed 35812 loci. [*************************] 100% > Map Properties: > Upper Quartile Mass: 445776.00 > Read Type: 105bp single-end > Fragment Length Distribution: Gaussian (default) > Estimated Mean: 217.34 > Estimated Std Dev: 65.65 [11:58:32] Assembling transcripts and estimating abundances. > Processed 35845 loci. And this is an old try with other lane that gives good results $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s3 -g 1 a_thaliana /Data/s_3_1_QC_sequence.fq /Data/s_3_2_QC_sequence.fq $cufflinks-0.9.2/cufflinks -o ./tophat_s3/cufflinks -I 11000 -p 8 -N --library-type fr-unstranded ./tophat_s3/accepted_hits.bam [12:05:41] Inspecting reads and determining fragment length distribution. > Processed 34538 loci. [*************************] 100% > Map Properties: > Upper Quartile Mass: 498434.94 > Read Type: 105bp paired-end > Fragment Length Distribution: Gaussian (default) > Estimated Mean: 217.34 > Estimated Std Dev: 65.65 [12:06:24] Assembling transcripts and estimating abundances. > Processed 34563 loci. [*************************] 100% Can you suggest anything else I can try? D. |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279
|
![]()
The only suggestions I'd have left are:
1) update to the most recent versions of tophat, cufflinks, bowtie 2) double check the integrity of the input files that are failing. - Are they the same size in bytes? - Are they in the same order, run head and tail on both files and compare - Does your core generate MD5 sums on the original .fq if so run a checksum 3) though things are not working I also be worried that the mean and STD for both runs are identical, that seems hard to believe. |
![]() |
![]() |
![]() |
#6 |
Member
Location: Berkeley, CA Join Date: Jul 2010
Posts: 71
|
![]()
Can you paste the first few reads from both ends of the fastq files that are giving you problems?
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Spain Join Date: Jul 2009
Posts: 133
|
![]()
Thanks for your suggestions. I think I see where you are aming at. So the problem could be that after doing QC and removing reads with low quality values on each file separately, the two files from a lane ended up with different number of reads and this is not correct . So initially both s_1_1_sequence.fq and s_1_2_sequence.fq had 32193587 reads and after QC, the numbers are as follow:
s_1_1_sequence.fq -- 15590194 reads s_1_2_sequence.fq -- 16385365 reads So I guess the read-pairing is lost for some reads. And these are the first and last few reads for both files $ head s_1_1_QC_sequence.fq @ILLUMINA-GA_191110:1:1:4938:1104#0/1 CGACAATCCTGTAGGTAAGGGCATGATTACCATATTTGATGAGATTTTCCTTATCAATAAGGAAATCATGGTCGCCATCGAGTTCCCAAAATCTGCAGTATATCA +ILLUMINA-GA_191110:1:1:4938:1104#0/1 hhhhhhhhhhgehhhWeeedffgchghhhchggfhhhhhehhgghhhhaghh_hhhhhhfhcghhcehbgdaceahdaedac__acfcdceaaecaac\aac_aa @ILLUMINA-GA_191110:1:1:5113:1105#0/1 TTCACTTTTCTTTGATTCCCTTCTAGACTGTTGTCAATAATGACTCTGATGAATAAATCTTCGTCTAGCTCTGATCCTATGGATTGCCATCTATGAACAAGCTTG +ILLUMINA-GA_191110:1:1:5113:1105#0/1 hhhhhhhghghehhhgghehhgddhhfhghhfcfhghfhfggdhghdh]ghghhfhh`ehhhgfgfWghdgfeWfeg_daadRbdabdd_[dadaaba[b]_BBB @ILLUMINA-GA_191110:1:1:5938:1104#0/1 GGAGAACTATGTGAATAGCCGTTCCTTGAAGCATGCTCGTGACATATATAGGCAAATTCGTGAACATGTTGAACAGATAGGCTTTAATGTGTCTTCATGCGGAAA $ head s_1_2_QC_sequence.fq @ILLUMINA-GA_191110:1:1:5233:1161#0/2 ATGCTAATGAGGTCATTGCTAATAGAGCAGCTGAGANTCTTGGTCGCAAACGTGGTGAGAAATGTGTGCACCCAAATGACCATGTGAACAGATCACAATCTTCTA +ILLUMINA-GA_191110:1:1:5233:1161#0/2 fhhhhhchhahdehhhhhhhhhhhehdfcgddfaf_F^a_\^]Z]X[YXYZ[YZ\\daaa``_`[daebade_edaa_a]W^]_Sa_a_ad``Z__b___aaaa] @ILLUMINA-GA_191110:1:1:6770:1157#0/2 TGACGTCTGGTGAAGCTGTCAAGTACAAGAGTTCTTNGGACGCCTTCAAGCAGATCCTCANCCTTCAAGAGCTCCCTCTGGTGCCTGGATCTCAACCATGTAANC +ILLUMINA-GA_191110:1:1:6770:1157#0/2 ggaggggggeggagggggeg_dfdaff[facafcfdEdd^bbbdbgg]Rcebe_W]^\^WGYa^^_[^^Zc[]dbc_aaabMdba_aa\Yaaca]baabd]X]G^ @ILLUMINA-GA_191110:1:1:8784:1163#0/2 GCTTGTTCTTGCGTAGCCACATTAATGAACCTAGTGNAGCTAGAGATATCGTGAACACCAGATTTGCATTGGCTTCAAGCCCTTGCAAAACATAATCCCAAGTGA $ tail s_1_1_QC_sequence.fq +ILLUMINA-GA_191110:1:100:18210:21435#0/1 hghhhhhchheghhhhghhhhhhhffhhfhhhgfhchhhhhhhhhhfghghhhhhhhfgehaghhghggghhgdgeghgdhgcgdfgfggeggghfffgcffggg @ILLUMINA-GA_191110:1:100:18834:21433#0/1 AGAGGATTATTGCTGAGAGGTTTGGGAAGAAGAAAGGTGGGAGTAGTGTTAAGAAGACGAGTTCGGTTAGGAAGAAGAAGAAGCGTGTTTCTGATGAGTCTGATT +ILLUMINA-GA_191110:1:100:18834:21433#0/1 hcfhhhhhghhhhfhchhgghhhhhgfffahg]hgfdVff_\e^dfbfdfhhghg]cffcf_cfchcdhecehgfd]hhedceffTb^bbaaaR[a_aZabbabd @ILLUMINA-GA_191110:1:100:19124:21432#0/1 GTAAATGATAAGGAAATCATTCATTTTTGTCATGAAGAACATAATCTAAGACTTGGTGGTGATGATGATGTTACCCGTTATGAAAAGATGCTGTGCGACGCTTGC +ILLUMINA-GA_191110:1:100:19124:21432#0/1 f`fdfdff[f^faffdffcffffcff^ffVfffdffafffaafffgfgfggggg_f_faJa_W_[K^^Z\YU\aVQKWUUW^Z^^[^WI^YYWU\decebeabac $ tail s_1_2_QC_sequence.fq +ILLUMINA-GA_191110:1:100:18069:21438#0/2 hhhhhhhhhghhhghhghhhhhfhghhghhhfhhhhhghhhghhhhhhhhfehghehhghhhhhggfhfhfgghghhhhhgehhhggghghggehgggfhhggga @ILLUMINA-GA_191110:1:100:18210:21435#0/2 TGCCAGTCTTAGCTTGCATAGATTTGATTGTTTCACCTCCTTTACCAATTATCAAACCAACCTTGTTATTCGGAATTTTCATAACAAATTGATCAGCCCCTGCTT +ILLUMINA-GA_191110:1:100:18210:21435#0/2 hghhhhhhhhaffffhhhhggehhhhdhghhhghhhhhhhhhghhfhhhhhhhhhghhhhhhhhhhhhhhdhahghhfhhhhcggghhhhfhhggghhchgghh] @ILLUMINA-GA_191110:1:100:18834:21433#0/2 CAGACTCATCAGAATCATCTTCATCTCTCTTTCTTCCTCTCCTCTCCTTTCTTCTCTTACTCCTTCCCTCTTCCTCATCCTCAGACTCACTCAAACTCCTTCTCT +ILLUMINA-GA_191110:1:100:18834:21433#0/2 hhchhahhghhhghhhgghghhgdhhggfhgfgeggegcddfd_aedcdde_ebdggdbddffbdcdgddcfffcda]_d[^ZXYa\^Z_ad_^bc\_`VZ^^^^ I understand that having different reads may affect the analysis but remember that my analysis pipeline is the same for other lanes and other lanes are detected as paired-ends! For s_3_1_sequence.fq and s_3_2_sequence.fq had 36272340 reads and after QC, the numbers are as follow: s_3_1_sequence.fq -- 15265617 reads s_3_2_sequence.fq -- 19079615 reads and this time they are detected as paired-end reads by cufflinks. As an aside, coming from this discussion, do you know of any piece of software that may do QC filtering on paired-end files maintaining the read-pairing? |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279
|
![]()
I'm guessing you found the problem. I'm not sure how much more help I can be as I just use the fastq files from our core were each file is identical. How are you generating these files? I'm a bit ignorant of the on machine steps, but my understanding was that the illumina qseq files are the unfiltered files and the "s_1_sequence.txt" files are those passing illumina filtering and the reads are paired properly. If you are filtering the "s_1_sequence.txt" files this may not be necessary and I doubt it is a common step (not that it would be incorrect to do so).
In regards to the one that is working, I still don't have a good answer but perhaps it keeps the correct order for a part of the file? I'd suggest getting the input files to be the same number of reads with matching read IDs (ie. the s_1_sequence.txt files) renaming to the expected tophat convention (s_1_sequence_1.fq and s_1_sequence_2.fq) and that should work as expected. |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Spain Join Date: Jul 2009
Posts: 133
|
![]()
Hi Jon, thanks for your input.
If I am not wrong, illumina filtering step is different from fastx QC filtering. Actually, if you have a look at the Phred scores for the raw _sequence.txt files in my experiment, supposing they are already quality filtered by Illumina, I shouldn't see the gradual drop in QC values down to below Q20 that I am actually seeing. This is why I filtered out some reads. In this link http://bioinfo-core.org/index.php/9t...8_October_2010 you could see in the first image an example of the kind of graph I am getting. They suggest trimming but I rather filter complete reads below a threshold this time. I am now trying with the complete raw _sequence.txt file from illumina without filtering to see if the error comes up again. Thanks Dave |
![]() |
![]() |
![]() |
Thread Tools | |
|
|