**sdarko** · 01-13-2010, 08:43 AM

I am having the exact same problem and, in fact, was doing a search on the forum on this topic.

Does anyone have a working, tested UCSC hg18 or hg19 GFF3 file that they could pass along to me or post somewhere?

One problem with searching for the terms "GFF" or "GTF" on this forum is that they are too short to be accepted as search terms.

Thanks,
Sam Darko

**sjm** · 01-13-2010, 04:38 PM

have you thought of trying the cufflinks module?

Getting GFF3-formatted files, or constructing them from gtfs, is a real pain. Been there, done that, for the mouse genome. Have you considered using Cufflinks, from the authors that produced Tophat? (Disclosure: I'm not at all affiliated with them but their tools have worked very well for me!) While the software is advertised as being of most use for isoform quantitation / splicing junction work, you'll find that you can feed Cufflinks a gtf file (like those you already have) and it will annotate your reads against genes and their separate transcripts...

For mouse work, I downloaded a gtf file from Ensembl and have been using it with no issues with Cufflinks.

Your workflow becomes:
- run Tophat as per usual, and without the -G switch
- take the accepted_hits.sam file that Tophat produces, and give it to cufflinks together with your gtf annotation file
- look in the genes.expr and transcripts.expr files for your annotated RPKM data

**Gangcai** · 01-13-2010, 05:53 PM

Sorry I cannot give you an obvious answer, but maybe you can try to use Ensembl gtf format files:
ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
Then change the chromosome name:
1--> chr1
MT-->chrM
And then use gtf2gff3.pl to transform the gtf format into gff3 format.
This works well for me.
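The renaming step above can be sketched in shell. This is a minimal sketch, not a tested part of the original workflow: the function name `ensembl_to_ucsc_names` and the file names in the usage comment are hypothetical, and it assumes a tab-separated GTF whose first column holds the Ensembl sequence name.

```shell
# Hypothetical helper: rewrite column 1 of a tab-separated GTF so that
# Ensembl names match UCSC-style names (1 -> chr1, MT -> chrM).
# Lines starting with '#' are passed through unchanged.
ensembl_to_ucsc_names() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }' "$@"
}

# Usage (placeholder file names):
# ensembl_to_ucsc_names Homo_sapiens.GRCh37.56.gtf > Homo_sapiens.GRCh37.56.chr.gtf
```

Note that unplaced scaffolds would also get a plain "chr" prefix here, which may not match the names in a UCSC reference, so it is worth checking the result before converting to gff3.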

By the way, I think whether the start site is 0-based or 1-based depends on the reference genome you use.

The tophat command looks like this:

tophat -r 50 --mate-std-dev 20 -a 8 -m 0 -I 1000000 -p 4 -g 100 -o ~/gene/labs/phase16/result/tophat/human/s_1/withgff3_r50 --solexa1.3-quals --coverage-search --microexon-search --segment-mismatches 2 --segment-length 25 --max-segment-intron 1000000 -G ~/share/database/ensembl/Ensembl56/GRCh37/GTF/GFF3/v1/Homo_sapiens.GRCh37.56_v1.gff3 ~/share/database/tophat/index/hg19 ~/data/Processed_Solexa/Long_Solexa_Seq/s_1_1.fq ~/data/Processed_Solexa/Long_Solexa_Seq/s_1_2.fq

The gff3 file looks like this:

##gff-version 3
chr7 protein_coding gene 143701022 143702068 . + . ID=ENSG00000221813; NAME=OR6B1;
chr7 protein_coding mRNA 143701022 143702068 . + . ID=ENST00000408922; NAME=OR6B1-001; PARENT=ENSG00000221813;
chr7 protein_coding five_prime_utr 143701022 143701089 . + . ID=five_prime_utr:ENST00000408922:1; PARENT=ENST00000408922;
chr7 protein_coding exon 143701022 143702068 . + . ID=exon:ENST00000408922:1; PARENT=ENST00000408922;

**Ender985** · 01-14-2010, 03:48 AM

Glad to see I'm not the only one with the problem!

@sjm: I've been using Cufflinks already for some data analysis, but in this case I was trying to directly map my reads against the human transcriptome (hoping to find more splice junctions), rather than mapping them to the whole genome and looking at the transcriptome afterwards, like you suggest. I've already done that, I was interested in taking this alternative approach and seeing how different the results would be.

@Gangcai: I'll try to convert the Ensembl gtf file as you suggest and hopefully it will work, even though I can not see any differences between the gff3 file you posted and the one I produced.

Thanks for the suggestions people, keep them coming; I'll keep this thread updated if I find a working solution.

**sdarko** · 01-14-2010, 08:05 AM

Okay, I ended up using the solution posted in this thread. That is, I downloaded the gtf from Ensembl and converted it for use with UCSC reference using the program in the thread. I tested it yesterday and it looked good.

@sjm: I'm also going to try your approach in the future. Thanks so much for the suggestion!

I also want to give a general thanks to the good people of the forum who are happy to help strangers in need.

Cheers,
Sam

**edge** · 05-27-2011, 08:32 AM

Hi sjm,

Do you know how to make Tophat produce accepted_hits.sam?
After I run Tophat, why does it only generate accepted_hits.bam?

Thanks for any advice.

**sjm** · 05-27-2011, 01:33 PM

Tophat bam and sam files

I could be wrong, but my understanding is that the current version of tophat only produces files in .bam format. However, it's easy enough to convert back later - you already have the samtools package installed, otherwise tophat wouldn't work, so look in the help for samtools and use the command to convert .bam back to .sam :

samtools view input.bam > output.sam

**edge** · 05-27-2011, 04:14 PM

Thanks sjm

In general, should we include the "-h" option as well?
e.g.

Code:

samtools view -h -o output.sam input.bam

My aim is to use the *.sam file as input for Cufflinks afterwards.
To run Cufflinks with default settings, I'm just not sure whether I need to include the header or not.

Are you familiar with Cufflinks as well?
I'm using the latest version, Cufflinks 1.0.1, and ran the following command:

Code:

Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -G Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam 
Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks: /lib64/libz.so.1: no version information available (required by Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks)
Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
[19:26:23] Loading reference annotation.
[19:26:37] Inspecting reads and determining fragment length distribution.
Processed 33455 loci.                        [*************************] 100%
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
Map Properties:
       Total Map Mass: 77344272.10
       Read Type: 0bp single-end
       Fragment Length Distribution: Truncated Gaussian (default)
                     Default Mean: 200
                  Default Std Dev: 80
[19:32:30] Estimating transcript abundances.
Processed 33455 loci.                        [*************************] 100%

My input is 2x50 bp paired-end reads.
I'm not sure why Cufflinks detects it as "0bp single-end".
Do you have any idea about the message above?
If I use the "-g" option instead, the run works fine, but the result is the same as running Cufflinks with defaults (without -g and Homo_sapiens.GRCh37.62.gtf in the command).

Code:

Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -g Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam

My goal is just to reconstruct the annotated transcripts in my sample.
Thanks for any advice on improving this.

**edge** · 05-27-2011, 04:18 PM

Hi Gangcai,

Do you know how to calculate "--mate-std-dev" for RNA-seq data?
My input is 2x50 bp reads sequenced on an Illumina HiSeq; the library insert size is 210 bp and the adaptor sequence is 100 bp.
Apart from that, do you know how to calculate "--frag-len-mean" and "--frag-len-std-dev" as well?
I'm facing the problem shown in my post above.
Thanks.
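One way to approach this calculation, as a rough sketch only: TopHat's -r is the expected inner distance between mates, i.e. the fragment length minus both read lengths, and the standard deviation comes from the spread of observed fragment lengths. The helper names below are hypothetical. Whether the 210 bp figure includes the adaptors is not stated, so the fragment length passed in is an assumption you would need to confirm for your own library.

```shell
# Hypothetical helper: mate_inner_distance FRAGMENT_LEN READ_LEN
# Inner distance = fragment length minus the two read lengths.
mate_inner_distance() {
  echo $(( $1 - 2 * $2 ))
}

# Hypothetical helper: reads one fragment length per line on stdin and
# prints "mean sd" (population standard deviation), two decimals each.
length_mean_sd() {
  awk '{ n++; s += $1; ss += $1 * $1 }
       END {
         if (n == 0) exit 1
         m = s / n
         v = ss / n - m * m
         if (v < 0) v = 0          # guard against rounding below zero
         printf "%.2f %.2f\n", m, sqrt(v)
       }'
}

# Usage (values are assumptions, not taken from a real library):
# mate_inner_distance 210 50
# printf '180\n200\n220\n' | length_mean_sd
```

The list of fragment lengths has to come from somewhere, for example from a library QC trace or from read pairs aligned within single exons; pairs that span introns would inflate the estimate.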

**sjm** · 05-27-2011, 07:29 PM

paired-end data in correct format?

@edge: are you sure that your 2x50 bp paired-end data are in the correct format for Cufflinks, i.e. are you choosing the right --library-type option? Sorry I can't be more helpful, as I haven't used paired-end data with Cufflinks yet. I don't think converting between .bam and .sam is going to help you much if you don't have the right --library-type option set.

**edge** · 05-27-2011, 10:25 PM

Hi sjm,

I believe my input data are in the correct format.
When I run the following command:

Code:

Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -g Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam

[10:44:16] Loading reference annotation.
[10:44:31] Inspecting reads and determining fragment length distribution.
> Processed 217311 loci.                       [*************************] 100%
> Map Properties:
>       Total Map Mass: 77344272.10
>       Read Type: 50bp x 50bp
>       Fragment Length Distribution: Empirical (learned)
>                     Estimated Mean: 181.46
>                  Estimated Std Dev: 32.00
[11:07:40] Assembling transcripts and estimating abundances.
> Processed 217311 loci.                       [*************************] 100%

The output above shows that my input data is read as 50bp x 50bp.
So I'm not sure why "-g" works fine while "-G" gives the error.

**DZhang** · 05-28-2011, 06:02 PM

Hi All,

I just wanted to add one point: the old versions of Cufflinks worked only with the gtf format, while the latest version works with both gff and gtf formats. Check the cufflinks manual for details.

Thank you,
Douglas

https://www.contigexpress.com

TopHat and the GFF3 file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News