  • Changing text in a GTF

    Hi,

    I have the following GTF file (from Gencode):
    HTML Code:
    chr10	Yale_UCSC	transcript	3117806	3119013	.	-	.	gene_id "PGOMOU00000268019"; transcript_id "PGOMOU00000268019"; gene_type "pseudogene"; gene_status "UNKNOWN"; gene_name "PGOMOU00000268019"; transcript_type "pseudogene"; transcript_status "UNKNOWN"; transcript_name "PGOMOU00000268019"; level 3; tag "2way_pseudo_cons"; yale_id "PGOMOU00000268019"; ucsc_id "NM_019986.3-3"; parent_id "ENSMUSG00000021476
    ";
    chr10	Yale_UCSC	transcript	3139466	3141067	.	-	.	gene_id "PGOMOU00000268020"; transcript_id "PGOMOU00000268020"; gene_type "pseudogene"; gene_status "UNKNOWN"; gene_name "PGOMOU00000268020"; transcript_type "pseudogene"; transcript_status "UNKNOWN"; transcript_name "PGOMOU00000268020"; level 3; tag "2way_pseudo_cons"; yale_id "PGOMOU00000268020"; ucsc_id "BC036983.1-1"; parent_id "ENSMUSG00000028228
    ";
    I would like to change all "PGOMOU*" names to "ENSMUSG*". How can this be done easily in R or awk? Sorry for the naive question, but I am struggling to make the changes in the GTF file with awk. The PGOMOU* nomenclature is not recognized by Bowtie/TopHat, so any input is welcome!

  • #2
    Make a backup copy of the file before trying the following:

    Code:
    $ sed 's/PGOMOU/ENSMUSG/g' your_file > new_file
    Last edited by GenoMax; 08-22-2014, 10:38 AM.
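
Since the question asked about awk: a minimal sketch of the same substitution restricted to column 9 (the attribute field), so that nothing outside the attributes can be touched. The function name `rename_prefix` is hypothetical, and the sketch assumes a tab-separated GTF with nine columns.

```shell
# Sketch only: rename_prefix is a hypothetical helper, not part of any tool.
# Replaces PGOMOU with ENSMUSG inside column 9 of each tab-separated record.
rename_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 9 { gsub(/PGOMOU/, "ENSMUSG", $9) }  # only full 9-column records
         { print }                                  # comments pass through as-is
    ' "$1"
}
```

Usage mirrors the sed command: `rename_prefix your_file > new_file`. As with sed, keep a backup of the original.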


    • #3
      It's unlikely that changing this will solve whatever problem you're having. Post the actual problem and we'll try to solve it.


      • #4
        Thanks all!

        I'm trying to use the GTF file from Gencode that contains all pseudogenes predicted by the Yale & UCSC pipelines (but not by Havana on reference chromosomes) (ftp://ftp.sanger.ac.uk/pub/gencode/G...pseudos.gtf.gz) with the latest GRCm38.p3 assembly, also from Gencode. The fastq files are OK. However, when I use Bowtie2, I always get Bowtie error = 1, which could be related to the PGOMOU gene nomenclature (my hypothesis). My first idea was to change all PGOMOU to ENSMUSG so that Bowtie can recognize the same ID in the genome... or am I wrong?


        • #5
          ENSMUSG* doesn't exist in the mouse genome (it's just used in the annotation). Please provide the exact command that produced the error and the entire error message, including all of the output that's printed to the screen.
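
What does have to agree between a GTF and a genome is the sequence name in column 1 (e.g. "chr10"), not the gene IDs. A rough sketch for listing GTF sequence names that have no matching FASTA header; `missing_seqnames` is a hypothetical helper name, and the sketch assumes the FASTA name is the first word after ">".

```shell
# Sketch only: missing_seqnames is a hypothetical helper.
# $1 = GTF file, $2 = genome FASTA file.
# Prints each column-1 name from the GTF that is not a FASTA header name.
missing_seqnames() {
    awk -F '\t' '
        FNR == NR {                       # first file: the FASTA
            if (/^>/) {
                name = substr($0, 2)      # drop the leading ">"
                sub(/[ \t].*/, "", name)  # keep only the first word
                seen[name] = 1
            }
            next
        }
        # second file: the GTF; report each unknown name once
        !/^#/ && $1 != "" && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$2" "$1"
}
```

For example, `missing_seqnames annotation.gtf genome.fa` (placeholder file names) prints nothing when every sequence name in the annotation is present in the genome.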


          • #6
            Here it goes:

            HTML Code:
            tophat2 -p4 -G gencode.v20.2wayconspseudos.gtf -o MSCd0Adip-12 GRCh38 SRR490218_output2.fastq
            
            [2014-08-22 15:59:44] Beginning TopHat run (v2.0.9)
            -----------------------------------------------
            [2014-08-22 15:59:44] Checking for Bowtie
                              Bowtie version:        2.1.0.0
            [2014-08-22 15:59:44] Checking for Samtools
                            Samtools version:        0.1.19.0
            [2014-08-22 15:59:44] Checking for Bowtie index files (genome)..
            [2014-08-22 15:59:44] Checking for reference FASTA file
                    Warning: Could not find FASTA file GRCh38.fa
            [2014-08-22 15:59:44] Reconstituting reference FASTA file from Bowtie index
              Executing: /usr/bin/bowtie2-inspect GRCh38 > MSCd0Adip-12/tmp/GRCh38.fa
            [2014-08-22 16:02:05] Generating SAM header for GRCh38
                    format:          fastq
                    quality scale:   phred33 (default)
            [2014-08-22 16:03:07] Reading known junctions from GTF file
                    Warning: TopHat did not find any junctions in GTF file
            [2014-08-22 16:03:07] Preparing reads
                     left reads: min. length=60, max. length=66, 53983 kept reads (176 discarded)
            [2014-08-22 16:03:09] Building transcriptome data files..
            [2014-08-22 16:04:02] Building Bowtie index from gencode.v20.2wayconspseudos.fa
                    [FAILED]
            Error: Couldn't build bowtie index with err = 1
            The output above is from the human GTF file and, of course, the latest human genome assembly from Gencode; the mouse files produce an identical error. I'm running Bowtie through TopHat (I know that is not necessary; Bowtie alone is sufficient for alignment).

            thanks again!


            • #7
              You can't expect a mouse annotation and a human reference sequence to be compatible (no amount of changing ID names will change that).


              • #8
                Yes, of course, but the example that I posted was for human (GTF AND genome): the alignment used the human Gencode pseudogene GTF with the human genome assembly. When I tested the murine Gencode pseudogene GTF AND the murine genome (also from Gencode), I got the same Bowtie error. If you look at the example that I posted, the genome is human and the GTF is human. No murine genome OR murine GTF was used in that example.


                • #9
                  Then look in the run log for the last command that tophat issued and run that yourself. You'll then get the actual underlying error message.
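
A small sketch for pulling that last command out of the log. It assumes the log is at `<output_dir>/logs/run.log` and that stage markers start with "#"; `last_logged_command` is a hypothetical helper name.

```shell
# Sketch only: last_logged_command is a hypothetical helper, and the
# logs/run.log location is an assumption about the TopHat output layout.
# $1 = TopHat output directory (the value given to -o).
last_logged_command() {
    # skip "#>" stage markers and blank lines, then keep the final line
    grep -v -e '^#' -e '^[[:space:]]*$' "$1/logs/run.log" | tail -n 1
}
```

Running `last_logged_command MSCd0Adip-12` prints the command; paste it into the shell to see the underlying tool's own error output.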


                  • #10
                    And to reinforce that I'm not mixing murine AND human data: each fastq is specific to its organism.

                    Again, any help is welcome!


                    • #11
                      OK. TopHat indicated that "TopHat did not find any junctions in GTF file", and the run log shows that the following commands were used:

                      PHP Code:
                      /usr/bin/tophat -p4 -G gencode.v20.2wayconspseudos.gtf -o MSCd0Adip-12 GRCh38 SRR490218_output2.fastq
                      /usr/bin/gtf_juncs gencode.v20.2wayconspseudos.gtf  MSCd0Adip-12/tmp/gencode.juncs
                      #>prep_reads:
                      /usr/bin/prep_reads --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir MSCd0Adip-12/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p4 --gtf-annotations gencode.v20.2wayconspseudos.gtf --gtf-juncs MSCd0Adip-12/tmp/gencode.juncs --no-closure-search --no-coverage-search --no-microexon-search --fastq --aux-outfile=MSCd0Adip-12/prep_reads.info --index-outfile=MSCd0Adip-12/tmp/left_kept_reads.bam.index --sam-header=MSCd0Adip-12/tmp/GRCh38_genome.bwt.samheader.sam --outfile=MSCd0Adip-12/tmp/left_kept_reads.bam SRR490218_output2.fastq
                      #>map_start:
                      /usr/bin/gtf_to_fasta --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir MSCd0Adip-12/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p4 --gtf-annotations gencode.v20.2wayconspseudos.gtf --gtf-juncs MSCd0Adip-12/tmp/gencode.juncs --no-closure-search --no-coverage-search --no-microexon-search gencode.v20.2wayconspseudos.gtf MSCd0Adip-12/tmp/GRCh38.fa MSCd0Adip-12/tmp/gencode.v20.2wayconspseudos.fa MSCd0Adip-12/logs/g2f.out
                      /usr/bin/bowtie2-build MSCd0Adip-12/tmp/gencode.v20.2wayconspseudos.fa MSCd0Adip-12/tmp/gencode.v20.2wayconspseudos 
                      Could TopHat's junction search be causing the Bowtie error? (Again, this is just my hypothesis. Excuse me if it is too naive, but I have been struggling with this error for some days.)


                      • #12
                        Possible, is there anything in "MSCd0Adip-12/tmp/gencode.v20.2wayconspseudos.fa"?
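
One quick way to check, as a sketch: count the FASTA records in the file. `count_fasta_records` is a hypothetical helper name; an empty file gives 0.

```shell
# Sketch only: count_fasta_records is a hypothetical helper.
# Prints the number of header lines (lines starting with ">") in $1.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"   # n + 0 prints 0 when unset
}
```

For example: `count_fasta_records MSCd0Adip-12/tmp/gencode.v20.2wayconspseudos.fa`. `ls -l` on the same path shows the size in bytes.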


                        • #13
                          No, it is empty (0 bytes).


                          • #14
                            OK, now we're getting somewhere. That file is made by gtf_to_fasta, so something is going wrong with it. This could be the lack of junctions or it could be something else. Try running that command without the "--gtf-juncs MSCd0Adip-12/tmp/gencode.juncs" option and see what happens (I haven't a clue if it'll even run). If it runs, check whether the resulting fasta file is empty or not.

                            Can you look through the GTF file and see if there are any spliced transcripts? I wonder if tophat ignores pseudogenes.
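
A rough way to do that check, as a sketch: tally the feature types in column 3, and list the transcripts that have two or more "exon" records (a junction needs at least two exons). Both function names are hypothetical, and the sketch assumes a tab-separated GTF with a transcript_id attribute in column 9.

```shell
# Sketch only: feature_counts and spliced_transcripts are hypothetical helpers.

# Tally of column 3 (feature type) over all 9-column, non-comment lines.
feature_counts() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# transcript_id values that appear on two or more exon records.
spliced_transcripts() {
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # skip the 15 characters of: transcript_id "  and the closing quote
            id = substr($9, RSTART + 15, RLENGTH - 16)
            exons[id]++
        }
        END { for (id in exons) if (exons[id] > 1) print id }
    ' "$1" | sort
}
```

Note that the two records quoted in the first post are both "transcript" features; on a file with no "exon" records at all, `spliced_transcripts` prints nothing.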


                            • #15
                              The command did not work either. In fact, the GTF file only contains spliced transcripts. Maybe running Bowtie alone would work. That's a weird problem...
