**avo** · 10-20-2014, 10:35 PM

Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.

It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.
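A minimal sketch of what running the two steps separately could look like, written as a small Python helper that only builds and prints the two Trimmomatic command lines. The jar path, the input and output file names, and the `SLIDINGWINDOW:4:15` threshold are assumptions for illustration; the adapter file and `ILLUMINACLIP` settings are the ones that appear later in this thread.

```python
import shlex

JAR = "trimmomatic-0.32.jar"   # assumed jar location
ADAPTERS = "TruSeq3-PE-2.fa"   # adapter file used later in this thread


def trimmomatic_pe(r1, r2, prefix, steps, threads=4):
    """Build a Trimmomatic PE command; output names follow a hypothetical scheme."""
    outputs = [
        f"{prefix}_R1_paired.fastq.gz", f"{prefix}_R1_unpaired.fastq.gz",
        f"{prefix}_R2_paired.fastq.gz", f"{prefix}_R2_unpaired.fastq.gz",
    ]
    return ["java", "-jar", JAR, "PE", "-threads", str(threads), "-phred33",
            r1, r2, *outputs, *steps]


# Step 1: adapter clipping only (no quality-based steps).
step1 = trimmomatic_pe("sample_R1.fastq", "sample_R2.fastq", "adapter_only",
                       [f"ILLUMINACLIP:{ADAPTERS}:2:30:10:8:true"])

# Step 2: quality trimming only, applied to the paired output of step 1.
step2 = trimmomatic_pe("adapter_only_R1_paired.fastq.gz",
                       "adapter_only_R2_paired.fastq.gz",
                       "quality_only",
                       ["SLIDINGWINDOW:4:15", "MINLEN:36"])  # assumed thresholds

for cmd in (step1, step2):
    print(shlex.join(cmd))
```

Comparing the "Both Surviving" percentage reported after each step shows whether the read pairs are lost to adapter clipping (short inserts) or to quality trimming (poor reverse reads).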

**GenoMax** · 10-21-2014, 04:08 AM

Give BBDuk a try on the side to see if you get better results.

**BADE** · 10-21-2014, 06:42 AM

Hi Avo,

> Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.

I have e-mailed our sequencing center for this information and am waiting for their reply.

> It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.

You are right: the quality of the reverse reads is really low. Running Trimmomatic with only the adapter-trimming option (no quality trimming) reports 100% of read pairs surviving:

TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired__WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
Input Read Pairs: 22060013 Both Surviving: 22059856 (100.00%) Forward Only Surviving: 2 (0.00%) Reverse Only Surviving: 154 (0.00%) Dropped: 1 (0.00%)
TrimmomaticPE: Completed successfully
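To compare runs like this one without reading the numbers by eye, the summary line can be parsed into counts. The following is a small sketch; the function name and the dictionary keys are made up for illustration, and the regular expression assumes the summary format shown in the log above.

```python
import re

SUMMARY = ("Input Read Pairs: 22060013 Both Surviving: 22059856 (100.00%) "
           "Forward Only Surviving: 2 (0.00%) Reverse Only Surviving: 154 (0.00%) "
           "Dropped: 1 (0.00%)")


def parse_trimmomatic_summary(line):
    """Extract the five read-pair counts from a Trimmomatic PE summary line."""
    pattern = (r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
               r"Forward Only Surviving: (\d+) .*?"
               r"Reverse Only Surviving: (\d+) .*?"
               r"Dropped: (\d+)")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a Trimmomatic PE summary line")
    keys = ("input", "both", "forward_only", "reverse_only", "dropped")
    return dict(zip(keys, map(int, match.groups())))


counts = parse_trimmomatic_summary(SUMMARY)
# The four outcome categories should add up to the number of input pairs.
assert sum(counts[k] for k in ("both", "forward_only", "reverse_only", "dropped")) == counts["input"]
print(f"pairs kept: {100 * counts['both'] / counts['input']:.2f}%")
```

A large "forward only" count relative to "reverse only" points at the reverse read as the one being discarded.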

I have attached the quality scores for read 1 (forward) and read 2 (reverse) below. What could be the reason for such low-quality reverse reads? Is there a way I can rescue these low-quality reads? For my analysis I need paired output.

Thanks

BADE

**GenoMax** · 10-21-2014, 06:47 AM

Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores. Is this a MiSeq run?

**BADE** · 10-21-2014, 07:26 AM

Hi Genomax,

> Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores.

But the problem is that I need paired files for my analysis, and only 66% of the reads survive as pairs. Any suggestions on how to increase the number of reads surviving in both forward and reverse? Or should I combine the unpaired reads and treat them as single-end reads for my (RNA-seq) analysis to identify top expressed and differentially expressed genes?

> Is this a MiSeq run?

It's from a HiSeq 2500.

Thanks

BADE

**GenoMax** · 10-21-2014, 07:53 AM

Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.

If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.

**BADE** · 10-21-2014, 09:14 AM

Hi Genomax,

> Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.

The reason for the low number of reads in one paired library is that 6 samples (3 control and 3 test) were multiplexed on a single lane. I am not sure if that's good or bad for a standard RNA-seq analysis of a species with a gold-standard reference genome like mouse. Maybe you can comment on it.

> If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.

Thanks for your suggestion. With the parameter SLIDINGWINDOW:4:5 I am getting:

TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:Trimmomatic-0.32/TruSeq3-PE-2.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:36
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
Input Read Pairs: 22060013 Both Surviving: 20067327 (90.97%) Forward Only Surviving: 1989348 (9.02%) Reverse Only Surviving: 3200 (0.01%) Dropped: 138 (0.00%)
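Why a lower `SLIDINGWINDOW` threshold keeps more pairs can be illustrated with a simplified model of window-based trimming. This is a sketch of the general idea (cut the read at the first window whose mean quality falls below the threshold), not Trimmomatic's actual implementation, and the quality values are invented for the example.

```python
def sliding_window_cut(quals, window=4, min_avg=5):
    """Return how many leading bases to keep.

    Scans windows from the 5' end and cuts at the start of the first
    window whose mean quality is below min_avg (simplified model).
    """
    if not quals:
        return 0
    for start in range(max(len(quals) - window + 1, 1)):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_avg:
            return start
    return len(quals)


# Hypothetical read whose quality decays towards the 3' end.
quals = [38, 38, 37, 36, 30, 20, 10, 2, 2, 2, 2, 2]

print(sliding_window_cut(quals, window=4, min_avg=5))   # lenient threshold
print(sliding_window_cut(quals, window=4, min_avg=20))  # stricter threshold
```

The stricter threshold cuts the read earlier, so more reads end up shorter than `MINLEN` and are dropped, which turns their mates into "forward only" or "reverse only" survivors.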

I will continue with this and see how it goes. Actually, I was thinking of combining the unpaired reads and treating the combined file as single-end data for further analysis.

Any further suggestions would be helpful.

Bade

**BADE** · 10-22-2014, 01:19 PM

Hi All,

As suggested in this thread, I pre-processed all the samples and then mapped the reads with TopHat2, keeping the standard analysis options (pasted below):

* FASTQ Quality Scale: Sanger (PHRED33)
* Anchor length: 8
* Maximum number of mismatches that can appear in the anchor region of spliced alignment: 0
* The minimum intron length: 70
* The maximum intron length: 50000
* Minimum isoform fraction: 0.15
* Maximum number of alignments to be allowed: 20
* Minimum intron length that may be found during split-segment (default) search: 50
* Maximum intron length that may be found during split-segment (default) search: 500000
* Number of mismatches allowed in each segment alignment for reads mapped independently: 2
* Minimum length of read segments: 20
* Mate-Pair Inner Distance: 50
* Bowtie 2 speed and sensitivity: Sensitive (slower)
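For anyone reproducing this outside a graphical front end, the settings above might translate to a TopHat2 command line roughly as follows. The mapping from each listed setting to a flag is an assumption, and the index name, thread count, and FASTQ file names are placeholders; Phred+33 quality encoding is TopHat's default and needs no flag.

```python
import shlex

# Assumed mapping from the settings listed above to TopHat2 flags.
settings = [
    ("--min-anchor-length", 8),
    ("--splice-mismatches", 0),
    ("--min-intron-length", 70),
    ("--max-intron-length", 50000),
    ("--min-isoform-fraction", 0.15),
    ("--max-multihits", 20),
    ("--min-segment-intron", 50),
    ("--max-segment-intron", 500000),
    ("--segment-mismatches", 2),
    ("--segment-length", 20),
    ("--mate-inner-dist", 50),
]

cmd = ["tophat2", "--num-threads", "8", "--b2-sensitive"]  # thread count is a placeholder
for flag, value in settings:
    cmd += [flag, str(value)]

# Placeholder Bowtie 2 index prefix and paired FASTQ files (left, then right).
cmd += ["mouse_index", "sample_R1_paired.fastq.gz", "sample_R2_paired.fastq.gz"]

print(shlex.join(cmd))
```

The left and right files must list the same reads in the same order, since TopHat pairs them by position.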

The TopHat alignment summary for WT and cKO samples is pasted below:

Sample: WT_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 20067327
Mapped: 13674480 (68.1% of input)
of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
Right reads:
Input: 15421666
Mapped: 6407370 (41.5% of input)
of these: 605201 ( 9.4%) have multiple alignments (6202 have >20)
56.6% overall read alignment rate.

Aligned pairs: 5421652
of these: 48595 ( 0.9%) have multiple alignments
and: 5298036 (97.7%) are discordant alignments
0.8% concordant pair alignment rate.

Sample: cKO_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 18672105
Mapped: 10964334 (58.7% of input)
of these: 1118743 (10.2%) have multiple alignments (15345 have >20)
Right reads:
Input: 18672105
Mapped: 7635944 (40.9% of input)
of these: 736048 ( 9.6%) have multiple alignments (8155 have >20)
49.8% overall read alignment rate.

Aligned pairs: 7384018
of these: 550466 ( 7.5%) have multiple alignments
and: 13750 ( 0.2%) are discordant alignments
39.5% concordant pair alignment rate.

I am wondering why there are so many "discordant alignments" in the WT sample. Can the cKO sample be considered "good" and used for further analysis?

Please suggest.

BADE

**Brian Bushnell** · 10-22-2014, 02:43 PM

That normally means that your read ordering got messed up by some preprocessing step, and thus the reads are no longer properly paired. Note, for example -

Left reads:
Input: 20067327
Mapped: 13674480 (68.1% of input)
of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
Right reads:
Input: 15421666

Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.
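A quick way to check this before mapping is to walk both files in parallel and compare read names. The sketch below assumes well-formed four-line FASTQ records and read names that match between mates once any `/1` or `/2` suffix and trailing comment are removed; the function names are made up for illustration.

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read name of each 4-line FASTQ record, without mate suffix."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            name = line.strip()[1:].split()[0]  # drop '@' and any comment
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def check_pairing(r1_handle, r2_handle):
    """Return (pairs_checked, problem); problem is None if the files are in sync."""
    n = 0
    for id1, id2 in zip_longest(read_ids(r1_handle), read_ids(r2_handle)):
        if id1 is None or id2 is None:
            return n, "files contain different numbers of reads"
        if id1 != id2:
            return n, f"record {n + 1}: {id1!r} does not match {id2!r}"
        n += 1
    return n, None


# Tiny in-memory example: R2 is missing the second read.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTGCA\n+\nIIII\n")
print(check_pairing(r1, r2))
```

For real data the handles could come from `gzip.open(path, "rt")` for the two paired output files.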

**GenoMax** · 10-22-2014, 02:46 PM

@BADE: You had 22059856 pairs surviving at the end of the Trimmomatic run. Did you do something to the files afterwards?

**BADE** · 10-22-2014, 06:53 PM

@Brian Bushnell:

> Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.

Yes, the read ordering of two samples was messed up. I ran TopHat again, and the output is below:

Sample: WT_CTTGTA_L001_R1_001.fastq
Left reads:
Input: 20067327
Mapped: 12967700 (64.6% of input)
of these: 1229841 ( 9.5%) have multiple alignments (21956 have >20)
Right reads:
Input: 20067327
Mapped: 8661002 (43.2% of input)
of these: 784306 ( 9.1%) have multiple alignments (11071 have >20)
53.9% overall read alignment rate.

Aligned pairs: 8312616
of these: 551497 ( 6.6%) have multiple alignments
and: 33317 ( 0.4%) are discordant alignments
41.3% concordant pair alignment rate.
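The two headline percentages in this summary can be reproduced from the raw counts. The formulas below are inferred from the numbers posted in this thread rather than taken from TopHat's documentation, so treat them as an assumption; the function name is made up.

```python
def tophat_rates(left_in, left_mapped, right_in, right_mapped,
                 aligned_pairs, discordant):
    """Recompute overall and concordant-pair alignment rates (in percent)."""
    overall = 100 * (left_mapped + right_mapped) / (left_in + right_in)
    # For properly paired input, left_in equals right_in (the number of pairs).
    concordant = 100 * (aligned_pairs - discordant) / left_in
    return overall, concordant


overall, concordant = tophat_rates(
    left_in=20067327, left_mapped=12967700,
    right_in=20067327, right_mapped=8661002,
    aligned_pairs=8312616, discordant=33317,
)
print(f"overall: {overall:.1f}%  concordant pairs: {concordant:.1f}%")
```

Seen this way, the concordant rate is capped by the weaker mate: only 43.2% of right reads map at all, so the pair rate cannot exceed that.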

I understand that the overall read alignment rate is only 53.9% because of the low-quality reverse reads. Also, the concordant pair alignment rate is only 41.3%. I am getting similar alignment and concordant rates for all the other samples. Is it appropriate to proceed with this data to the next step, Cufflinks?

@GenoMax:

> You had 22059856 pairs surviving at the end of trimmomatic run. Did you do something to the files afterwards?

I have 20067327 read pairs surviving out of 22060013. For TopHat I used the Out_paired files.

Please suggest

Thanks,

BADE

**Brian Bushnell** · 10-22-2014, 07:07 PM

The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence. In that case, unless the genomic material is incredibly precious, you should just resequence it.

What organism is this, and do you have a reference or at least some assembly?

**relipmoc** · 10-22-2014, 10:08 PM

hi BADE,
You may try skewer for preprocessing your data. It's demonstrated to produce better input for downstream analysis of RNA-Seq data. It's easy to use and runs fast.

**BADE** · 10-23-2014, 09:33 AM

@Brian Bushnell:

> The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence. In that case, unless the genomic material is incredibly precious, you should just resequence it.

I also performed quality trimming, but only 66% of the reads survived as pairs (as described in an earlier post). I am not sure how many of them will align to the genome.

> What organism is this, and do you have a reference or at least some assembly?

The data are from mouse samples, and the reference genome is Mus musculus.

@relipmoc:

> You may try skewer for preprocessing your data. It's demonstrated to produce better input for downstream analysis of RNA-Seq data. It's easy to use and runs fast.

Thanks for the suggestions. I will try it.

Trimmomatic Paired End - Low number of surviving reads