Problems using TopHat with RNA-seq data

balaena

Member

Join Date: Jan 2015

Posts: 29
#1

02-07-2015, 05:33 AM

Hi

I am totally new to all of this and promptly ran into problems using TopHat to align RNA-seq data.

I have an invertebrate draft genome of around 100 megabases in roughly 7000 scaffolds, and I wanted to align paired-end (PE) RNA-seq data to it with TopHat. I want to use the output in Maker to annotate the genome, because I have not yet done a de novo assembly.

TopHat finished without any errors, but the result was a tiny accepted_hits.bam and a huge unmapped.bam.

So what I did was:

- built the genome index with bowtie2-build
- trimmed the 76 bp paired-end reads with Trimmomatic (LEADING:20 TRAILING:20 MINLEN:50)
- ran TopHat with default parameters, except -r 150 (insert size 250-300 bp)
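
As a rough cross-check of the -r value: as far as I understand the TopHat manual, -r is the expected inner distance between the two mates, i.e. fragment length minus both read lengths. The sketch below assumes that the "insert size" above means the full fragment length without adapters; the function name is made up for illustration.

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Expected gap between the two mates: fragment minus both reads."""
    return fragment_length - 2 * read_length


read_length = 76  # untrimmed read length given in the post

# Fragment range quoted in the post (assumed to exclude adapters).
for fragment_length in (250, 300):
    inner = mate_inner_distance(fragment_length, read_length)
    print(f"fragment {fragment_length} bp -> inner distance {inner} bp")
```

Under that assumption the 76 bp reads give an inner distance of about 98 to 148 bp, so -r 150 sits at the upper end of the range. Trimmed reads are shorter, which pushes the effective inner distance up.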

I am not sure which of the many logs are important, and some seem to contain redundant information, but here are a few:

- tophat.log:

[2015-02-06 17:50:08] Beginning TopHat run (v2.0.0)
-----------------------------------------------
[2015-02-06 17:50:08] Checking for Bowtie
Bowtie version: 2.0.0.6
[2015-02-06 17:50:08] Checking for Samtools
Samtools version: 0.1.18.0
[2015-02-06 17:50:08] Checking for Bowtie index files
[2015-02-06 17:50:08] Checking for reference FASTA file
Warning: Could not find FASTA file H2genome.fa
[2015-02-06 17:50:08] Reconstituting reference FASTA file from Bowtie index
Executing: /local64/usr_local/bin/bowtie2-inspect H2genome > ./tophat_out/tmp/H2genome.fa
[2015-02-06 17:50:15] Generating SAM header for H2genome
format: fastq
quality scale: phred33 (default)
[2015-02-06 17:50:17] Preparing reads
left reads: min. length=50, count=31002899
right reads: min. length=50, count=31002899
[2015-02-06 18:15:30] Mapping left_kept_reads against H2genome with Bowtie2
[2015-02-06 18:39:35] Mapping left_kept_reads_seg1 against H2genome with Bowtie2 (1/3)
[2015-02-06 18:44:16] Mapping left_kept_reads_seg2 against H2genome with Bowtie2 (2/3)
[2015-02-06 18:49:13] Mapping left_kept_reads_seg3 against H2genome with Bowtie2 (3/3)
[2015-02-06 18:56:30] Mapping right_kept_reads against H2genome with Bowtie2
[2015-02-06 19:20:49] Mapping right_kept_reads_seg1 against H2genome with Bowtie2 (1/3)
[2015-02-06 19:25:54] Mapping right_kept_reads_seg2 against H2genome with Bowtie2 (2/3)
[2015-02-06 19:31:06] Mapping right_kept_reads_seg3 against H2genome with Bowtie2 (3/3)
[2015-02-06 19:36:59] Searching for junctions via segment mapping
[2015-02-06 19:41:42] Retrieving sequences for splices
[2015-02-06 19:41:48] Indexing splices
[2015-02-06 19:43:16] Mapping left_kept_reads_seg1 against segment_juncs with Bowtie2 (1/3)
[2015-02-06 19:45:25] Mapping left_kept_reads_seg2 against segment_juncs with Bowtie2 (2/3)
[2015-02-06 19:47:56] Mapping left_kept_reads_seg3 against segment_juncs with Bowtie2 (3/3)
[2015-02-06 19:50:27] Joining segment hits
[2015-02-06 20:06:22] Mapping right_kept_reads_seg1 against segment_juncs with Bowtie2 (1/3)
[2015-02-06 20:08:41] Mapping right_kept_reads_seg2 against segment_juncs with Bowtie2 (2/3)
[2015-02-06 20:11:01] Mapping right_kept_reads_seg3 against segment_juncs with Bowtie2 (3/3)
[2015-02-06 20:13:48] Joining segment hits
[2015-02-06 20:28:45] Reporting output tracks

- reports.log:

[samopen] SAM header is present: 6973 sequences.
Loaded 7 junctions
Reporting final accepted alignments...done.
Printing junction BED track...done
Printing insertions...done
Printing deletions...done
Found 5 junctions from happy spliced reads

- bowtie.left_kept_reads.fixmap.log:

49466 out of 31052365 reads have been filtered out
31002899 reads; of these:
31002899 (100.00%) were unpaired; of these:
5185062 (16.72%) aligned 0 times
23947202 (77.24%) aligned exactly 1 time
1870635 (6.03%) aligned >1 times
83.28% overall alignment rate

- bowtie.right_kept_reads.fixmap.log:

31002899 reads; of these:
31002899 (100.00%) were unpaired; of these:
5185062 (16.72%) aligned 0 times
23947202 (77.24%) aligned exactly 1 time
1870635 (6.03%) aligned >1 times
83.28% overall alignment rate

So it seems that most of the reads (>80%) align to the genome, right? It does look a bit odd to me, though, that the right reads have exactly the same values as the left reads.

Also, in every one of the "bowtie.right/left_kept_reads_seg.fixmap.log" files, the alignment rate is much lower.
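
A minimal sketch of how the two summaries could be compared programmatically. The regular expressions only assume the log layout pasted above, and the function name is hypothetical. Identical counts for two independent mate files would be a surprising coincidence, so one thing worth ruling out (an assumption, not something the logs prove) is that the same reads were supplied for both mates.

```python
import re

LEFT_LOG = """\
49466 out of 31052365 reads have been filtered out
31002899 reads; of these:
31002899 (100.00%) were unpaired; of these:
5185062 (16.72%) aligned 0 times
23947202 (77.24%) aligned exactly 1 time
1870635 (6.03%) aligned >1 times
83.28% overall alignment rate
"""

RIGHT_LOG = """\
31002899 reads; of these:
31002899 (100.00%) were unpaired; of these:
5185062 (16.72%) aligned 0 times
23947202 (77.24%) aligned exactly 1 time
1870635 (6.03%) aligned >1 times
83.28% overall alignment rate
"""


def parse_summary(text: str) -> dict:
    """Pull the read counts out of a Bowtie2-style alignment summary."""
    patterns = {
        "total": r"^\s*(\d+) reads; of these:",
        "unaligned": r"^\s*(\d+) \([\d.]+%\) aligned 0 times",
        "unique": r"^\s*(\d+) \([\d.]+%\) aligned exactly 1 time",
        "multi": r"^\s*(\d+) \([\d.]+%\) aligned >1 times",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            counts[key] = int(match.group(1))
    return counts


left = parse_summary(LEFT_LOG)
right = parse_summary(RIGHT_LOG)

# Recompute the overall rate from the counts instead of trusting the last line.
rate = 100 * (left["unique"] + left["multi"]) / left["total"]

print("left and right counts identical:", left == right)
print(f"recomputed alignment rate: {rate:.2f}%")
```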

e.g. bowtie.right_kept_reads_seg2.fixmap.log:

9740245 reads; of these:
9740245 (100.00%) were unpaired; of these:
6453402 (66.26%) aligned 0 times
2647488 (27.18%) aligned exactly 1 time
639355 (6.56%) aligned >1 times
33.74% overall alignment rate

So I would really like to understand what happened in my run and why TopHat found almost no accepted hits or junctions.

Any suggestions?

Thanks in advance!
balaena

Member

Join Date: Jan 2015

Posts: 29
#2

02-07-2015, 08:40 AM

Sorry,

could be a false alarm. I may have simply messed up my FASTQ files. I am rerunning the alignment right now.

Sorry again
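
For anyone who lands here with the same symptom, a quick sanity check on the two mate files might look like the sketch below. It is only an illustration: the function names are made up, it assumes plain four-line FASTQ records, and it only understands read names that are either identical in both files or end in /1 and /2.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield (name, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        block = list(islice(handle, 4))
        if len(block) < 4:
            return
        header, sequence = block[0].strip(), block[1].strip()
        name = header[1:].split()[0]      # drop '@' and any trailing comment
        if name.endswith(("/1", "/2")):   # drop old-style mate suffix
            name = name[:-2]
        yield name, sequence


def check_pairing(left_handle, right_handle, limit=1000):
    """Compare the first `limit` records of two mate files."""
    names_match = True
    sequences_identical = True
    checked = 0
    pairs = zip(
        islice(fastq_records(left_handle), limit),
        islice(fastq_records(right_handle), limit),
    )
    for (left_name, left_seq), (right_name, right_seq) in pairs:
        checked += 1
        names_match = names_match and left_name == right_name
        sequences_identical = sequences_identical and left_seq == right_seq
    return {
        "checked": checked,
        "names_match": names_match,
        # True would suggest the same reads were supplied for both mates.
        "same_sequences": checked > 0 and sequences_identical,
    }


# Tiny in-memory example; real use would pass open file handles instead.
left = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCA\n+\nIIII\n")
right = io.StringIO("@r1/2\nTTGA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n")
result = check_pairing(left, right)
print(result)
```

Properly paired files should report matching names and different sequences. Matching sequences throughout would point to one file having been used twice.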
Brian Bushnell

Super Moderator

Join Date: Jan 2014

Posts: 2709
#3

02-07-2015, 10:20 AM

Q20 is a pretty high threshold for quality trimming. Depending on your data quality, you may shorten your reads substantially, which will make them less sensitive to alternative splicing. Also, since you are (probably) doing a quantitative analysis, very aggressive trimming and discarding of reads shorter than 50bp can cause bias, as there is a correlation between bases/motifs and quality values.
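
To make the effect of the threshold concrete, here is a simplified sketch of leading/trailing quality trimming followed by a minimum-length filter. It is not Trimmomatic's code, just an approximation of what LEADING, TRAILING and MINLEN do, assuming Phred+33 qualities. The function names and the example read are invented for illustration.

```python
def trim_leading_trailing(quality: str, threshold: int = 20, offset: int = 33):
    """Return (start, end) of the bases kept after trimming low-quality ends."""
    scores = [ord(ch) - offset for ch in quality]
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return start, end


def survives(quality: str, min_length: int = 50, threshold: int = 20) -> bool:
    """True if the trimmed read is still at least `min_length` bases long."""
    start, end = trim_leading_trailing(quality, threshold)
    return (end - start) >= min_length


# Hypothetical 76 bp read: 46 bases at Q40 ('I') and a 30-base tail at Q10 ('+').
read_quality = "I" * 46 + "+" * 30

print("kept at Q20:", trim_leading_trailing(read_quality, threshold=20))
print("survives Q20 + MINLEN 50:", survives(read_quality, threshold=20))
print("survives Q10 + MINLEN 50:", survives(read_quality, threshold=10))
```

In this made-up case the Q20 threshold trims the read to 46 bases and the 50 bp minimum then discards it entirely, whereas a Q10 threshold keeps all 76 bases.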