SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Can Cuffdiff treat paired-end and single-end reads at the same time? zun RNA Sequencing 3 06-12-2012 06:37 PM
Can paired-end mapping produce more reads than single-end ? warrenemmett Bioinformatics 13 03-21-2012 12:10 AM
Paired-end Bam from single-end aligned sam ramouz87 Bioinformatics 4 08-17-2011 01:55 PM
RNA-seq: Replicates, single-end, paired-end story pasta Bioinformatics 2 07-05-2011 12:51 AM
Does Cufflinks support single-end and paired end data together ? ersenkavak Bioinformatics 1 10-22-2010 08:26 AM

Reply
 
Thread Tools
Old 01-31-2011, 03:42 AM   #1
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default cufflinks detecting single-end when paired-end used

Dear users,

I am running the same commands for a set of four paired-end lanes in tophat and cufflinks. But when running cufflinks on the tophat .bam output, some of the lanes are detected as single-end while others are correctly detected as paired-end. How could this be? I show you the commands for the first lane which is wrongly detected with cufflinks stdout

$ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1_QC_sequence.fq /Data/s_1_2_QC_sequence.fq

$ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam
[12:02:53] Inspecting reads and determining fragment length distribution.
> Processed 35625 loci. [*************************] 100%
> Map Properties:
> Upper Quartile Mass: 444119.98
> Read Type: 105bp single-end
> Fragment Length Distribution: Gaussian (default)
> Estimated Mean: 217.34
> Estimated Std Dev: 65.65
[12:03:32] Assembling transcripts and estimating abundances.
> Processed 35657 loci. [*************************] 100%

Your comments are greatly appreciated.
Thanks

D.
Edit: my program versions:
TopHat (v1.1.4)
Bowtie version: 0.12.7.0
Samtools version: 0.1.11.0
Cufflinks version: 0.9.2

Last edited by dnusol; 01-31-2011 at 04:21 AM. Reason: include version info
dnusol is offline   Reply With Quote
Old 01-31-2011, 11:08 AM   #2
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

Are you sure that all of your paired fastq files have the same name for both reads?
adarob is offline   Reply With Quote
Old 01-31-2011, 12:06 PM   #3
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

From manual
Quote:
When running TopHat with paired ends, it is critical that the *_1 files an the *_2 files appear in separate comma separated lists, and that the order of the files in the two lists is the same.
Thus your file names must end with myfile_1.fq and myfile_2.fq
Jon_Keats is offline   Reply With Quote
Old 02-01-2011, 03:19 AM   #4
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Thanks adarob and John for your answers but that does not solve the error I am finding. I have renamed the files so that they are called s_1_1.fq and s_1_2.fq and I still get that cufflinks reports the reads as single-end. Furthermore this issue does not happen all the time because for other lanes with the same naming convention, i.e. s_N_1_QC_sequence.fq and s_N_"_QC_sequence.fq, cufflinks reports them as paired-end

So this is a new try modifying the name to follow "convention":
$ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1.fq /Data/s_1_2.fq

$ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam
[11:57:53] Inspecting reads and determining fragment length distribution.
> Processed 35812 loci. [*************************] 100%
> Map Properties:
> Upper Quartile Mass: 445776.00
> Read Type: 105bp single-end
> Fragment Length Distribution: Gaussian (default)
> Estimated Mean: 217.34
> Estimated Std Dev: 65.65
[11:58:32] Assembling transcripts and estimating abundances.
> Processed 35845 loci.

And this is an old try with other lane that gives good results

$ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s3 -g 1 a_thaliana /Data/s_3_1_QC_sequence.fq /Data/s_3_2_QC_sequence.fq

$cufflinks-0.9.2/cufflinks -o ./tophat_s3/cufflinks -I 11000 -p 8 -N --library-type fr-unstranded ./tophat_s3/accepted_hits.bam
[12:05:41] Inspecting reads and determining fragment length distribution.
> Processed 34538 loci. [*************************] 100%
> Map Properties:
> Upper Quartile Mass: 498434.94
> Read Type: 105bp paired-end
> Fragment Length Distribution: Gaussian (default)
> Estimated Mean: 217.34
> Estimated Std Dev: 65.65
[12:06:24] Assembling transcripts and estimating abundances.
> Processed 34563 loci. [*************************] 100%


Can you suggest anything else I can try?

D.
dnusol is offline   Reply With Quote
Old 02-01-2011, 10:39 AM   #5
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

The only suggestions I'd have left are:

1) update to the most recent versions of tophat, cufflinks, bowtie
2) double check the integrity of the input files that are failing.
- Are they the same size in bytes?
- Are they in the same order, run head and tail on both files and compare
- Does your core generate MD5 sums on the original .fq if so run a checksum
3) though things are not working I also be worried that the mean and STD for both runs are identical, that seems hard to believe.
Jon_Keats is offline   Reply With Quote
Old 02-01-2011, 05:38 PM   #6
adarob
Member
 
Location: Berkeley, CA

Join Date: Jul 2010
Posts: 71
Default

Can you paste the first few reads from both ends of the fastq files that are giving you problems?
adarob is offline   Reply With Quote
Old 02-01-2011, 11:37 PM   #7
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Thanks for your suggestions. I think I see where you are aming at. So the problem could be that after doing QC and removing reads with low quality values on each file separately, the two files from a lane ended up with different number of reads and this is not correct . So initially both s_1_1_sequence.fq and s_1_2_sequence.fq had 32193587 reads and after QC, the numbers are as follow:
s_1_1_sequence.fq -- 15590194 reads
s_1_2_sequence.fq -- 16385365 reads

So I guess the read-pairing is lost for some reads. And these are the first and last few reads for both files

$ head s_1_1_QC_sequence.fq
@ILLUMINA-GA_191110:1:1:4938:1104#0/1
CGACAATCCTGTAGGTAAGGGCATGATTACCATATTTGATGAGATTTTCCTTATCAATAAGGAAATCATGGTCGCCATCGAGTTCCCAAAATCTGCAGTATATCA
+ILLUMINA-GA_191110:1:1:4938:1104#0/1
hhhhhhhhhhgehhhWeeedffgchghhhchggfhhhhhehhgghhhhaghh_hhhhhhfhcghhcehbgdaceahdaedac__acfcdceaaecaac\aac_aa
@ILLUMINA-GA_191110:1:1:5113:1105#0/1
TTCACTTTTCTTTGATTCCCTTCTAGACTGTTGTCAATAATGACTCTGATGAATAAATCTTCGTCTAGCTCTGATCCTATGGATTGCCATCTATGAACAAGCTTG
+ILLUMINA-GA_191110:1:1:5113:1105#0/1
hhhhhhhghghehhhgghehhgddhhfhghhfcfhghfhfggdhghdh]ghghhfhh`ehhhgfgfWghdgfeWfeg_daadRbdabdd_[dadaaba[b]_BBB
@ILLUMINA-GA_191110:1:1:5938:1104#0/1
GGAGAACTATGTGAATAGCCGTTCCTTGAAGCATGCTCGTGACATATATAGGCAAATTCGTGAACATGTTGAACAGATAGGCTTTAATGTGTCTTCATGCGGAAA

$ head s_1_2_QC_sequence.fq
@ILLUMINA-GA_191110:1:1:5233:1161#0/2
ATGCTAATGAGGTCATTGCTAATAGAGCAGCTGAGANTCTTGGTCGCAAACGTGGTGAGAAATGTGTGCACCCAAATGACCATGTGAACAGATCACAATCTTCTA
+ILLUMINA-GA_191110:1:1:5233:1161#0/2
fhhhhhchhahdehhhhhhhhhhhehdfcgddfaf_F^a_\^]Z]X[YXYZ[YZ\\daaa``_`[daebade_edaa_a]W^]_Sa_a_ad``Z__b___aaaa]
@ILLUMINA-GA_191110:1:1:6770:1157#0/2
TGACGTCTGGTGAAGCTGTCAAGTACAAGAGTTCTTNGGACGCCTTCAAGCAGATCCTCANCCTTCAAGAGCTCCCTCTGGTGCCTGGATCTCAACCATGTAANC
+ILLUMINA-GA_191110:1:1:6770:1157#0/2
ggaggggggeggagggggeg_dfdaff[facafcfdEdd^bbbdbgg]Rcebe_W]^\^WGYa^^_[^^Zc[]dbc_aaabMdba_aa\Yaaca]baabd]X]G^
@ILLUMINA-GA_191110:1:1:8784:1163#0/2
GCTTGTTCTTGCGTAGCCACATTAATGAACCTAGTGNAGCTAGAGATATCGTGAACACCAGATTTGCATTGGCTTCAAGCCCTTGCAAAACATAATCCCAAGTGA


$ tail s_1_1_QC_sequence.fq
+ILLUMINA-GA_191110:1:100:18210:21435#0/1
hghhhhhchheghhhhghhhhhhhffhhfhhhgfhchhhhhhhhhhfghghhhhhhhfgehaghhghggghhgdgeghgdhgcgdfgfggeggghfffgcffggg
@ILLUMINA-GA_191110:1:100:18834:21433#0/1
AGAGGATTATTGCTGAGAGGTTTGGGAAGAAGAAAGGTGGGAGTAGTGTTAAGAAGACGAGTTCGGTTAGGAAGAAGAAGAAGCGTGTTTCTGATGAGTCTGATT
+ILLUMINA-GA_191110:1:100:18834:21433#0/1
hcfhhhhhghhhhfhchhgghhhhhgfffahg]hgfdVff_\e^dfbfdfhhghg]cffcf_cfchcdhecehgfd]hhedceffTb^bbaaaR[a_aZabbabd
@ILLUMINA-GA_191110:1:100:19124:21432#0/1
GTAAATGATAAGGAAATCATTCATTTTTGTCATGAAGAACATAATCTAAGACTTGGTGGTGATGATGATGTTACCCGTTATGAAAAGATGCTGTGCGACGCTTGC
+ILLUMINA-GA_191110:1:100:19124:21432#0/1
f`fdfdff[f^faffdffcffffcff^ffVfffdffafffaafffgfgfggggg_f_faJa_W_[K^^Z\YU\aVQKWUUW^Z^^[^WI^YYWU\decebeabac

$ tail s_1_2_QC_sequence.fq
+ILLUMINA-GA_191110:1:100:18069:21438#0/2
hhhhhhhhhghhhghhghhhhhfhghhghhhfhhhhhghhhghhhhhhhhfehghehhghhhhhggfhfhfgghghhhhhgehhhggghghggehgggfhhggga
@ILLUMINA-GA_191110:1:100:18210:21435#0/2
TGCCAGTCTTAGCTTGCATAGATTTGATTGTTTCACCTCCTTTACCAATTATCAAACCAACCTTGTTATTCGGAATTTTCATAACAAATTGATCAGCCCCTGCTT
+ILLUMINA-GA_191110:1:100:18210:21435#0/2
hghhhhhhhhaffffhhhhggehhhhdhghhhghhhhhhhhhghhfhhhhhhhhhghhhhhhhhhhhhhhdhahghhfhhhhcggghhhhfhhggghhchgghh]
@ILLUMINA-GA_191110:1:100:18834:21433#0/2
CAGACTCATCAGAATCATCTTCATCTCTCTTTCTTCCTCTCCTCTCCTTTCTTCTCTTACTCCTTCCCTCTTCCTCATCCTCAGACTCACTCAAACTCCTTCTCT
+ILLUMINA-GA_191110:1:100:18834:21433#0/2
hhchhahhghhhghhhgghghhgdhhggfhgfgeggegcddfd_aedcdde_ebdggdbddffbdcdgddcfffcda]_d[^ZXYa\^Z_ad_^bc\_`VZ^^^^

I understand that having different reads may affect the analysis but remember that my analysis pipeline is the same for other lanes and other lanes are detected as paired-ends!
For s_3_1_sequence.fq and s_3_2_sequence.fq had 36272340 reads and after QC, the numbers are as follow:
s_3_1_sequence.fq -- 15265617 reads
s_3_2_sequence.fq -- 19079615 reads

and this time they are detected as paired-end reads by cufflinks.

As an aside, coming from this discussion, do you know of any piece of software that may do QC filtering on paired-end files maintaining the read-pairing?
dnusol is offline   Reply With Quote
Old 02-02-2011, 08:06 AM   #8
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

I'm guessing you found the problem. I'm not sure how much more help I can be as I just use the fastq files from our core were each file is identical. How are you generating these files? I'm a bit ignorant of the on machine steps, but my understanding was that the illumina qseq files are the unfiltered files and the "s_1_sequence.txt" files are those passing illumina filtering and the reads are paired properly. If you are filtering the "s_1_sequence.txt" files this may not be necessary and I doubt it is a common step (not that it would be incorrect to do so).

In regards to the one that is working, I still don't have a good answer but perhaps it keeps the correct order for a part of the file?

I'd suggest getting the input files to be the same number of reads with matching read IDs (ie. the s_1_sequence.txt files) renaming to the expected tophat convention (s_1_sequence_1.fq and s_1_sequence_2.fq) and that should work as expected.
Jon_Keats is offline   Reply With Quote
Old 02-02-2011, 09:43 AM   #9
dnusol
Senior Member
 
Location: Spain

Join Date: Jul 2009
Posts: 133
Default

Hi Jon, thanks for your input.
If I am not wrong, illumina filtering step is different from fastx QC filtering. Actually, if you have a look at the Phred scores for the raw _sequence.txt files in my experiment, supposing they are already quality filtered by Illumina, I shouldn't see the gradual drop in QC values down to below Q20 that I am actually seeing. This is why I filtered out some reads.

In this link
http://bioinfo-core.org/index.php/9t...8_October_2010

you could see in the first image an example of the kind of graph I am getting. They suggest trimming but I rather filter complete reads below a threshold this time.

I am now trying with the complete raw _sequence.txt file from illumina without filtering to see if the error comes up again.

Thanks

Dave
dnusol is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:03 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO