  • Annotation difference between refSeq and Gencode

    Hi all,

    I am trying to set up an RNA-seq workflow:

    1. Generated the genome index for STAR using .fna files from the NCBI FTP site and GTF files from GENCODE.

    2. Aligned the FASTQ files with STAR, converted SAM to BAM, and sorted the BAM files.

    3. Then I used the sorted BAM files to test Cufflinks and compared different GTF files for the -G option. The Cufflinks outputs somehow have different positions for the same genes:

    refSeq:
    gene_id gene_short_name locus
    PDIA3 - chr15:44038589-44064804
    CD276 - chr15:73976621-74006859
    PROM2 - chr2:95940200-95957055

    gencode:
    gene_id gene_short_name locus
    ENSG00000167004.12 PDIA3 chr15:43746391-43773279
    ENSG00000103855.17 CD276 chr15:73683965-73714518
    ENSG00000155066.15 PROM2 chr2:95274452-95291308

    As a result, the FPKM values are very different between the two outputs.

    What am I missing here, and how can I fix it? If the two GTF files inherently differ in their gene loci, which one should I trust?

    Best,
    Grace

  • #2
    As you have discovered the hard way, it is extremely important to use a consistent genome build/patch level throughout your analysis (I assume that is what the coordinate differences above reflect).

    If you want to avoid these types of issues, you could download sequence/annotation/index bundles from iGenomes (you will need to build your own indexes if you want to use STAR, but at least the sequence and annotation would be consistent).

    In terms of salvaging the analysis, check whether corresponding annotation files are available at NCBI, where you got the sequence files.

    • #3
      Example PDIA3:
      RefSeq co-ordinates are from Hg19/GRCh37.p19
      Gencode are from GRCh38.p2

      So if your sequence was from GRCh37/hg19, get the corresponding annotation file for that build.
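A quick way to see the mismatch is to compare the loci from the two Cufflinks outputs in the first post. The following is a minimal sketch, not part of any tool: `parse_locus` and `start_shifts` are hypothetical helper names, and the coordinates are copied from the tables above.

```python
import re

# Matches Cufflinks-style locus strings such as "chr15:44038589-44064804".
LOCUS_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_locus(locus):
    """Split a locus string into (chromosome, start, end)."""
    match = LOCUS_RE.match(locus.strip())
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def start_shifts(loci_a, loci_b):
    """Return start_a - start_b for genes present in both tables
    and located on the same chromosome in both."""
    shifts = {}
    for gene in sorted(set(loci_a) & set(loci_b)):
        chrom_a, start_a, _ = parse_locus(loci_a[gene])
        chrom_b, start_b, _ = parse_locus(loci_b[gene])
        if chrom_a == chrom_b:
            shifts[gene] = start_a - start_b
    return shifts


# Loci taken from the two Cufflinks outputs quoted in the first post.
refseq_loci = {
    "PDIA3": "chr15:44038589-44064804",
    "CD276": "chr15:73976621-74006859",
    "PROM2": "chr2:95940200-95957055",
}
gencode_loci = {
    "PDIA3": "chr15:43746391-43773279",
    "CD276": "chr15:73683965-73714518",
    "PROM2": "chr2:95274452-95291308",
}

shifts = start_shifts(refseq_loci, gencode_loci)
for gene, shift in shifts.items():
    print(gene, shift)
```

Every gene is shifted by hundreds of kilobases while the gene lengths stay roughly the same, which is what you would expect if the two annotations refer to different assemblies rather than disagreeing about the genes themselves.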
      Last edited by GenoMax; 10-07-2015, 08:49 AM.

      • #4
        Thanks for the responses.

        I used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna to build the genome index for STAR. Does that mean GENCODE is the right GTF to use here?

        Is it right that if I want to use the RefSeq annotation, I could just download the hg19 reference sequences from iGenomes?

        Also, the Cufflinks outputs with the RefSeq and GENCODE GTFs are very different: fewer than 30K genes with RefSeq and about 60K genes with GENCODE. Is there an explanation for this?

        • #5
          If you used the GRCh38 FASTA, then GENCODE should be the right GTF file to use.

          If you want to redo the alignments, you could go the iGenomes route and save yourself some trouble.

          Since you are sampling different areas of the genome with the two GTF files (coordinate differences), the Cufflinks outputs are different (though 2x is a big change). How are you handling multi-mappers? Perhaps there is a repeat region in one but not the other.

          • #6
            Thanks.

            By the 30K vs 60K difference I meant the number of rows in the Cufflinks output with the two different GTF files. The number of rows and the genes listed are fixed for each GTF regardless of the input BAM files. I checked and found that the GENCODE GTF returns a lot of rows for Y_RNA or 5S_rRNA. Is there a way to return only mRNA annotation with the GENCODE/GRCh38 GTF?

            • #7
              You can filter the rows you do not want/need from the GTF file using grep. Look into the -v option.
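If you would rather keep only one class of genes than exclude unwanted ones with grep -v, the same idea can be sketched in Python. This is a rough example, not a tested pipeline step: it assumes the GTF carries GENCODE's `gene_type "..."` attribute (Ensembl-style GTFs use `gene_biotype` instead), and `keep_gene_type` and `filter_gtf` are hypothetical names.

```python
import re

# GENCODE GTF attribute columns contain entries like: gene_type "protein_coding";
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def keep_gene_type(lines, wanted="protein_coding"):
    """Yield header lines (starting with '#') and records whose
    gene_type attribute equals `wanted`; drop everything else."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        match = GENE_TYPE_RE.search(line)
        if match and match.group(1) == wanted:
            yield line


def filter_gtf(in_path, out_path, wanted="protein_coding"):
    """Write a filtered copy of the GTF at in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(keep_gene_type(src, wanted))
```

Calling `filter_gtf` with your input and output paths would write a GTF containing only protein-coding records, which should drop the Y_RNA and 5S_rRNA rows. Check a few lines of the result before passing it to Cufflinks with -G.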
