Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks transcripts have no annotation

    Hello everyone,

    I have a little problem with using the cufflinks pipeline. My data consists of RNA Seq reads (one experiment single end and one experiment paired end) which have to be mapped against the human hg19 genome and finally assembled to transcripts.
    I started my experiments with the new tophat version (1.4.1) to map the reads. I tried it with an annotation file (Illumina package for cufflinks with ncbi annotation) and without one. There seem to be no problem with the mapping, so i started the next step using cufflinks with the same annotation file.
    And here is the problem. My output consists of two typs of transcripts, exons... . I have transcripts which have a name like in the annotation file, but an FPKM value 0 and transcripts with an cufflinks identifier, that have an FPKM > 0. I tried also using cuffcompare to map the annotation to the cufflinks transcripts but nothing happens.

    I have tried three different annotations: ensembl, refseq, ucsc; 2 different parameters for the annotation (-g and -G) and also use the -b option for bias correction. But I dont understand why cufflinks do not label my transcripts with one of the annotation names.

    Have anybody the same problem and/or any solution for me?

    Here are my commands
    Code:
    cufflinks -p 4 -o ../cufflinks.out.deNovo/ -b ../../../human_genom/Illumina_paket/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -g ../../../human_genom/Illumina_paket/Homo_sapiens/NCBI/build37.2/Annotation/Genes/genes.gtf accepted_hits.bam
    You are using Cufflinks v1.3.0, which is the most recent release.
    [16:48:35] Loading reference annotation.
    [16:48:41] Inspecting reads and determining fragment length distribution.
    > Processed 408173 loci.                       [*************************] 100%
    > Map Properties:
    >    Total Map Mass: 141260632.27
    >    Fragment Length Distribution: Empirical (learned)
    >                  Estimated Mean: 161.78
    >               Estimated Std Dev: 47.67
    [17:55:29] Assembling transcripts and estimating abundances.
    > Processed 408173 loci.                       [*************************] 100%
    [18:51:09] Loading reference annotation and sequence.
    Warning: couldn't find fasta record for 'chr1'!
    This contig will not be bias corrected.
    Warning: couldn't find fasta record for 'chr10'!
    This contig will not be bias corrected.
    ...
    Warning: couldn't find fasta record for 'Un|NT_167236.1'!
    This contig will not be bias corrected.
    [18:51:53] Learning bias parameters.
    > Processed 66680 loci.                        [*************************] 100%
    [19:48:34] Re-estimating abundances with bias correction.
    > Processed 66680 loci.
    Code:
    cufflinks -p 4 -b /Daten2/rna-seq/human_genom/Illumina_paket/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -o cufflinks.out.ensembl -g ../../human_genom/Homo_sapiens_ENSEMBL.gtf tophat.out.sample1.reference/accepted_hits.bam
    And some lines of my output file transcripts.gtf:

    Code:
    chr	feature	start	end	score	strand	gene_id	transcript_id	FPKM	frac	conf_lo	conf_hi	cov	full_read_support
    chr1	transcript	135727	136178	1000	.	CUFF.3	 CUFF.3.1	0.2633463896	1	0.104543	0.42215	1.920432	 yes
    chr1	transcript	165727	169225	1	-	CUFF.2	 CUFF.2.1	0	0	0	0	0	 yes
    chr1	transcript	167772	169225	1000	-	CUFF.2	 CUFF.2.2	0.9961441143	1	0.769102	1.223186
    11	transcript	57219163	57219495	1	+	ENSG00000222998	 ENST00000411066	0	0	0	0	0	 no
    11	transcript	57174429	57194523	1	-	ENSG00000134802	 ENST00000395123	0	0	0	0	0	 no
    11	transcript	57174429	57194594	1	-	ENSG00000134802	 ENST00000395124	0	0	0	0	0	 no

  • #2
    Psikon,

    Your hg19 reference file has chromosome names in the format 'chr1' so that is the format they would be in the BAM file(s) generated by TopHat. Whereas the Ensembl GTF you showed in the example has chromosome names as just the numbers (e.g. '11'). Since the alignments in your BAM file(s) are to references named 'chrN' and the annotations in the GTF file are to references named just 'N' Cufflinks naturally won't be able to match them. You need to ensure that the chromosome names in the reference FASTA you use to map exactly match the names in the annotation file. This is typically best done by getting your reference and annotation from the same source.

    Comment


    • #3
      Thank you for your reply. Is it also possible to convert the annotation from ensembl to the needs of cufflinks. The only thing I have to do is to convert the chromosome names from e.g. "11" to "chr11", isn't it?

      Comment


      • #4
        Alternatively, I think you can go to cufflinks page to download hg19 annotion set and use the GTF file in the tar ball as the annotation.

        Comment


        • #5
          Sorry for the long delay. I have tested your suggestion and my machine is not the fastest. Also my Datasets are very huge. I have now downloaded two Igenome packages. One from NCBI and one from Ensembl. For my analysis i need the ensembl annotation. But after inspecting the annotation file i recognized that the chromsome names are not in the format of cufflinks.
          I have started an test run with with this command:
          Code:
          cufflinks -p 6 -o ../cufflinks.out.ensembl -g /Daten2/rna-seq/human_genom/Illumina_paket/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf -b /Daten2/rna-seq/human_genom/Illumina_paket/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa accepted_hits-sorted.bam
          But my assembled transcripts are at the moment unlabeled (run is still in progress)

          Take the labeling part at the end of the run or just in time? If it just in time the format of the annotation file is also wrong like the other ones. I need help please. I dont understand why there is no labeling or how i must change my annotation to label the transcripts.

          The problem also is, how is the command for changing the chromosome names in the annotation.gtf.

          I tried to change the chromosome names with R. Change the <name> to chr<name>. Is it enough for cufflinks or it there anything i forgot. The basis file for this manipulation was the iGenome annotation from ensembl.
          Last edited by Psikon; 03-28-2012, 07:04 AM. Reason: Forgot something

          Comment


          • #6
            Hello everybody,

            Thank you for your advises. The annotation now works fine and i am really happy that i can go on with my analysis. Thank you very much.

            Comment


            • #7
              Hello,

              I know this is an older topic but I had the exact same issue and this thread solved it. I wrote a very simple Perl script to convert GFF's which don't have "chr" in the chromosome number to those that do to resolve this. So no one has to duplicate my work, the code is attached below. Its not the most elegant solution but it worked for me!


              #!/usr/bin/perl -w
              # Converts GFF transcript files with chromosomes listed as "1" "2" etc to "chr1" "chr2"
              # First argument is GFF file name, second is desired output file name

              open (OUTPUT,'>',$ARGV[1]) or die;
              open (FIRSTFILE, $ARGV[0]) || die "can't open $ARGV[0]\n\n" ;

              while (<FIRSTFILE>){
              my @temp=split(/\t/,$_);
              $temp[0]="chr".$temp[0];
              my $tempstring = join("\t", @temp);
              print OUTPUT $tempstring;
              }
              close FIRSTFILE;
              close OUTPUT;

              Comment


              • #8
                I got the same problem, I read your previous tags but just confused... I am using latest version of cufflinks 2.1.1 and all data from ensembl.

                Warning: couldn't find fasta record for 'HSCHR9_3_CTG35'!
                This contig will not be bias corrected.

                if my .bam file is in different format then can I try converting in .sam. will it work or you have another solution...

                please do reply

                thanks

                Comment


                • #9
                  Originally posted by Charitra View Post
                  I got the same problem, I read your previous tags but just confused... I am using latest version of cufflinks 2.1.1 and all data from ensembl.

                  Warning: couldn't find fasta record for 'HSCHR9_3_CTG35'!
                  This contig will not be bias corrected.

                  if my .bam file is in different format then can I try converting in .sam. will it work or you have another solution...

                  please do reply

                  thanks
                  Are you getting that error for all your genes or just one specific one?

                  What does your transcripts.gtf file look like?

                  Comment


                  • #10
                    Dear Helical
                    I am gettin that error for most of my genes, not one specific at cuffmerge and cuffdiff.

                    Comment


                    • #11
                      That is strange. When you run cufflinks, are you running it with -g (use reference transcripts and assemble new ones) or -G (only use reference transcripts)? I am no expert but I imagine this error could arise depending on how you mapped to the known transcripts.

                      Comment


                      • #12
                        Originally posted by Helical View Post
                        That is strange. When you run cufflinks, are you running it with -g (use reference transcripts and assemble new ones) or -G (only use reference transcripts)? I am no expert but I imagine this error could arise depending on how you mapped to the known transcripts.
                        I got one solution here http://seqanswers.com/forums/showthr...646#post102646 please see my complete log details in attachment...

                        these seems to be lincRNA or pseudo... and excluded from process by cufflinks...
                        do you know something about what will happen if I just continue because these are just warnings not errors.....??

                        and what if I want to include these lincRNAs ??

                        please do reply

                        Comment


                        • #13
                          Originally posted by Helical View Post
                          That is strange. When you run cufflinks, are you running it with -g (use reference transcripts and assemble new ones) or -G (only use reference transcripts)? I am no expert but I imagine this error could arise depending on how you mapped to the known transcripts.
                          Charitra used the -G option based on the log file provided in the other post.

                          Comment


                          • #14
                            Originally posted by GenoMax View Post
                            Charitra used the -G option based on the log file provided in the other post.
                            Yes Dear Helical and GenoMax

                            I used -G for aligning using tophat and then -g in cuffmerge.

                            Comment


                            • #15
                              I got confusions again and posted here http://seqanswers.com/forums/showthr...818#post102818

                              is there anything to add large bundle ?

                              I got he answer for my questions... cufflinks exclude the large bundle possibly pseudogene and the default value is good enough for dealing with human and mouse genome... and if so then these warnings which I have are good to exclude or ignore... and I don't have to include as a asked here.

                              Thank you GenoMax and Helical for you very kind help in this great forum.
                              However, I will put my q here again if my differentially expressed genes are not good enough ..

                              Second thing is for me to know about sorting .bam file using sam tools.... is one command is enough to sort .bam files... samtools sort aln.bam aln.sorted
                              or there are bunch of commands to do complete sorting...?




                              Originally posted by Charitra View Post
                              Yes Dear Helical and GenoMax

                              I used -G for aligning using tophat and then -g in cuffmerge.
                              Last edited by Charitra; 04-26-2013, 05:42 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-27-2024, 06:37 PM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-27-2024, 06:07 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              69 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X