SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Count the Number of Paired-End Reads Mapped by Tophat Fernas Bioinformatics 5 10-23-2015 03:26 AM
Asymmetric trimmomatic output with paired-end RNA seq. data alpha2zee Bioinformatics 6 11-19-2014 05:57 AM
Metagenomics/ Number of paired end reads in metasim kumara Introductions 0 07-24-2014 12:49 PM
Trimming paired end RNAseq - Trimmomatic sjeschonek Bioinformatics 1 07-14-2014 12:52 PM
How to count number of mapped paired-end and single-end rna-seq reads repinementer Bioinformatics 8 01-06-2013 06:06 AM

Reply
 
Thread Tools
Old 10-20-2014, 11:01 PM   #1
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default Trimmomatic Paired End - Low number of surviving reads

Hi Friends,

I am having this problem with Illumina Hiseq (v. 1.9) paired end libraries (150nt reads). The number of surviving reads after trimming are very low.

This is my command:
Code:
java -jar trimmomatic-0.32.jar PE -threads 28 -phred33 WT_CTTGTA_L001_R1_001.fastq WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
My adapter FASTA file:
Quote:
>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>ReadThrough_PE/1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>ReadThrough_PE/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PCR_Primer/1.1/1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA
>PCR_Primer/1.2/1
ATCTCGTATGCCGTCTTCTGCTTG
>PCR_Primer/1.1/2
CAAGCAGAAGACGGCATACGAGAT
>PCR_Primer/1.2/2
TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
Stats from trimmomatic:
Quote:
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
Input Read Pairs: 22060013 Both Surviving: 14586934 (66.12%) Forward Only Surviving: 6580960 (29.83%) Reverse Only Surviving: 237640 (1.08%) Dropped: 654479 (2.97%)
TrimmomaticPE: Completed successfully
I would appreciate your help and/or suggestions.


BADE
BADE is offline   Reply With Quote
Old 10-20-2014, 11:35 PM   #2
avo
Member
 
Location: Germany

Join Date: Sep 2013
Posts: 14
Default

Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.

It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.
avo is offline   Reply With Quote
Old 10-21-2014, 05:08 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Give BBDuk a try on the side to see if you get better results.
GenoMax is offline   Reply With Quote
Old 10-21-2014, 07:42 AM   #4
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Hi Avo,

Quote:
Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.
I have e-mailed our sequencing center for the information and waiting for their reply.

Quote:
It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.
You are right - The quality of reverse reads is really low, and running Trimmomatic with just adapter trimming options (without quality trimming) reports back with 100% surviving reads in both:

Quote:
TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired__WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
Input Read Pairs: 22060013 Both Surviving: 22059856 (100.00%) Forward Only Surviving: 2 (0.00%) Reverse Only Surviving: 154 (0.00%) Dropped: 1 (0.00%)
TrimmomaticPE: Completed successfully
I have attached the quality scores for read1/Forward and read2/reverse below. What could be the reason for such low quality reverse reads? Is there a way I can rescue these low quality reads. For my analysis I need paired output.

Thanks

BADE
Attached Images
File Type: png Read1.png (10.1 KB, 46 views)
File Type: png Read2.png (10.5 KB, 46 views)
BADE is offline   Reply With Quote
Old 10-21-2014, 07:47 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores. Is this a MiSeq run?
GenoMax is offline   Reply With Quote
Old 10-21-2014, 08:26 AM   #6
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Hi Genomax,

Quote:
Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores.
But the problem is that I need paired files for my analysis and there are only 66% reads surviving in pair. Any suggestion on how to improve the number of surviving reads in both forward and reverse? Or should I combine unpaired reads and treat them as single end sequencing reads for my (RNA-seq) analysis to identify top expressed and differentially expressed genes?

Quote:
Is this a MiSeq run?
Its from HiSeq2500.

Thanks

BADE
BADE is offline   Reply With Quote
Old 10-21-2014, 08:53 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.

If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.
GenoMax is offline   Reply With Quote
Old 10-21-2014, 10:14 AM   #8
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Hi Genomax,

Quote:
Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.
The reason for such low reads in one paired library is because 6 samples (3 control and 3 test) were multiplexed on single lane. I am not sure is that's good or bad for a standard RNA-seq analysis for species with gold-standard reference genome like Mouse. Maybe you can comment on it.

Quote:
If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.
Thanks for your suggestion with parameter - SLIDINGWINDOW:4:5 - I am getting:

Quote:
TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:Trimmomatic-0.32/TruSeq3-PE-2.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
Input Read Pairs: 22060013 Both Surviving: 20067327 (90.97%) Forward Only Surviving: 1989348 (9.02%) Reverse Only Surviving: 3200 (0.01%) Dropped: 138 (0.00%)
I will continue with this and see how it goes. Actually, I was thinking of combining the unpaired reads and using the file as from single end sequencing for further analysis.

Any further suggestions would be helpful.

Bade
BADE is offline   Reply With Quote
Old 10-22-2014, 02:19 PM   #9
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Hi All,

As suggested in this thread I did the pre-processing of all the samples and proceeded to map the reads with TopHat2 keeping the standard analysis options (pasted below)

Quote:
* FASTQ Quality Scale: Sanger (PHRED33)
* Anchor length: 8
* Maximum number of mismatches that can appear in the anchor region of spliced alignment: 0
* The minimum intron length: 70
* The maximum intron length: 50000
* Minimum isoform fraction: 0.15
* Maximum number of alignments to be allowed: 20
* Minimum intron length that may be found during split-segment (default) search: 50
* Maximum intron length that may be found during split-segment (default) search: 500000
* Number of mismatches allowed in each segment alignment for reads mapped independently: 2
* Minimum length of read segments: 20
* Mate-Pair Inner Distance: 50
* Bowtie 2 speed and sensitivity: Sensitive (slower)
The TopHat alignment summary for WT and cKO samples is pasted below:

Quote:
Sample: WT_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 20067327
Mapped: 13674480 (68.1% of input)
of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
Right reads:
Input: 15421666
Mapped: 6407370 (41.5% of input)
of these: 605201 ( 9.4%) have multiple alignments (6202 have >20)
56.6% overall read alignment rate.

Aligned pairs: 5421652
of these: 48595 ( 0.9%) have multiple alignments
and: 5298036 (97.7%) are discordant alignments
0.8% concordant pair alignment rate.

Sample: cKO_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 18672105
Mapped: 10964334 (58.7% of input)
of these: 1118743 (10.2%) have multiple alignments (15345 have >20)
Right reads:
Input: 18672105
Mapped: 7635944 (40.9% of input)
of these: 736048 ( 9.6%) have multiple alignments (8155 have >20)
49.8% overall read alignment rate.

Aligned pairs: 7384018
of these: 550466 ( 7.5%) have multiple alignments
and: 13750 ( 0.2%) are discordant alignments
39.5% concordant pair alignment rate.
I am wondering why are there so many “discordant alignments” in WT sample? Can cKO sample be considered as “good” and used for further analysis?

Please suggest.

BADE
BADE is offline   Reply With Quote
Old 10-22-2014, 03:43 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That normally means that your read ordering got messed up by some preprocessing step, and thus the reads are no longer properly paired. Note, for example -

Quote:
Left reads:
Input: 20067327
Mapped: 13674480 (68.1% of input)
of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
Right reads:
Input: 15421666
Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.
Brian Bushnell is offline   Reply With Quote
Old 10-22-2014, 03:46 PM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

@BADE: You had 22059856 pairs surviving at the end of trimmomatic run. Did you do something to the files afterwards?
GenoMax is offline   Reply With Quote
Old 10-22-2014, 07:53 PM   #12
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Quote:
@ Brian Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.
Yes, ordering of two samples was messed up. I ran the TopHat again and below is the output:
Quote:
Sample: WT_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 20067327
Mapped: 12967700 (64.6% of input)
of these: 1229841 ( 9.5%) have multiple alignments (21956 have >20)
Right reads:
Input: 20067327
Mapped: 8661002 (43.2% of input)
of these: 784306 ( 9.1%) have multiple alignments (11071 have >20)
53.9% overall read alignment rate.

Aligned pairs: 8312616
of these: 551497 ( 6.6%) have multiple alignments
and: 33317 ( 0.4%) are discordant alignments
41.3% concordant pair alignment rate.
I understand the read alignment rate is 53 % that is because of low quality reverse reads. Also, concordant rate is only 41.2 %. I am getting similar alignment rate and concordant rate for all the other samples. Is it appropriate to proceed with this data to perform the next step- Cufflink?

Quote:
@GenoMax:You had 22059856 pairs surviving at the end of trimmomatic run. Did you do something to the files afterwards?
I have 20067327 reads surviving out of 22060013. For TopHat I used the out_paired reads.

Please suggest

Thanks,

BADE
BADE is offline   Reply With Quote
Old 10-22-2014, 08:07 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence. In that case, unless the genomic material is incredibly precious, you should just resequence it.

What organism is this, and do you have a reference or at least some assembly?
Brian Bushnell is offline   Reply With Quote
Old 10-22-2014, 11:08 PM   #14
relipmoc
Member
 
Location: Los Angeles, CA

Join Date: Jul 2011
Posts: 58
Default

hi BADE,
You may try skewer for preprocessing your data. It's demonstrated to produce better input for downstream analysis of RNA-Seq data. It's easy to use and runs fast.

Last edited by relipmoc; 10-22-2014 at 11:12 PM. Reason: typo
relipmoc is offline   Reply With Quote
Old 10-23-2014, 10:33 AM   #15
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Quote:
Brian @ The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence.In that case, unless the genomic material is incredibly precious, you should just resequence it.
I performed the quality trimming also but the number of surviving reads was only 66% (described in earlier post). I am not sure how many will align to the genome.

Quote:
What organism is this, and do you have a reference or at least some assembly?
That data is from mouse samples and the ref genome is Mus musculus

Quote:
@ Relipmoc: You may try skewer for preprocessing your data. It's demonstrated to produce better input for downstream analysis of RNA-Seq data. It's easy to use and runs fast.
Thanks for the suggestions. I will try it.
BADE is offline   Reply With Quote
Old 10-28-2014, 02:14 AM   #16
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by BADE View Post
I performed the quality trimming also but the number of surviving reads was only 66% (described in earlier post). I am not sure how many will align to the genome.
From the quality plots, i would guess the machine had some kind of problem during the cycle 2 and 3 in the reverse run. This could be caused by environmental problems (e.g. external vibration) or within the machine (bubbles in the flow cell etc). If you have access to the 'per tile' quality information there may be a clear pattern to the low quality reads.

I would suggest you perform a 'HEADCROP' of 3 bp on the data, immediately after the ILLUMINACLIP step, and before the other quality filtering steps. This will simply drop the problem bases entirely from all reads.

Alternatively, you could widen the window/lower the threshold of the SLIDINGWINDOW, which would help bridge these dodgy patches. However, even after getting the dodgy data past the trimming stage, you still have the issue of aligning it so the HEADCROP might be better.

You might also want to consider more liberal alignment settings in Tophat, since many of the reads are probably failing due to these poor quality bases.
tonybolger is offline   Reply With Quote
Old 10-29-2014, 10:00 AM   #17
BADE
Member
 
Location: Boston

Join Date: Jan 2014
Posts: 13
Default

Hi tonybolger,

Quote:
I would suggest you perform a 'HEADCROP' of 3 bp on the data, immediately after the ILLUMINACLIP step, and before the other quality filtering steps. This will simply drop the problem bases entirely from all reads.
I performed trimmomatic as per suggested settings. Here is my code and output:
Code:
TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:/home/kakrana/tools/Trimmomatic-0.32/TruSeq3-PE-2.fa:2:30:10:8:true HEADCROP:3 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
output:
Input Read Pairs: 22060013 Both Surviving: 14588404 (66.13%) Forward Only Surviving: 6416736 (29.09%) Reverse Only Surviving: 295779 (1.34%) Dropped: 759094 (3.44%)

As you see HEADCROP is not helping in getting more survive reads. What do you suggest? Should I use these paired reads to do the TopHat with changed settings?

Quote:
You might also want to consider more liberal alignment settings in Tophat, since many of the reads are probably failing due to these poor quality bases.
For the previous TopHat run I used the below standard analysis options:
* FASTQ Quality Scale: Sanger (PHRED33)
* Anchor length: 8
* Maximum number of mismatches that can appear in the anchor region of spliced alignment: 0
* The minimum intron length: 70
* The maximum intron length: 50000
* Minimum isoform fraction: 0.15
* Maximum number of alignments to be allowed: 20
* Minimum intron length that may be found during split-segment (default) search: 50
* Maximum intron length that may be found during split-segment (default) search: 500000
* Number of mismatches allowed in each segment alignment for reads mapped independently: 2
* Minimum length of read segments: 20
* Mate-Pair Inner Distance: 50
* Bowtie 2 speed and sensitivity: Sensitive (slower)

Would you please elaborate more about what setting should I use?

Thanks for your suggestions.
BADE is offline   Reply With Quote
Old 10-29-2014, 10:37 AM   #18
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by BADE View Post
Input Read Pairs: 22060013 Both Surviving: 14588404 (66.13%) Forward Only Surviving: 6416736 (29.09%) Reverse Only Surviving: 295779 (1.34%) Dropped: 759094 (3.44%)

As you see HEADCROP is not helping in getting more survive reads. What do you suggest?
OK, not as much improvement as i hoped. I guess you will also need to be a bit more liberal with the SLIDINGWINDOW - maybe an average of 10 or 12, rather than 15. You could already get a major improvement with 5, but that is very liberal.

You can also remove the MINLENGTH to see precisely how short the reads are getting after filtering.

Another alternative is the MAXINFO quality filter mode (rather than sliding window) - it adaptively gets stricter during the read, so almost all reads will get close to the target length.

Quote:
Originally Posted by BADE View Post
Should I use these paired reads to do the TopHat with changed settings?
I think you need to gain a few more reads, and then try to gain alignment rate with tophat. I would try alternative settings of --initial-read-mismatches and --segment-mismatches.

You could also consider aligning against the reference transcriptome - it won't get you the alternative splicing, but it will indicate if the reads are mostly ok with a few errors, or completely random, since most standard aligners are a bit more liberal than tophat.

In any case, given the quality plots and the mapping rates, it is a question of how much effort you want to spend on such low-quality data.
tonybolger is offline   Reply With Quote
Reply

Tags
illumina, paired end, trimming, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:28 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO