Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Please find the the gff file I have used at this url ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_004556.gff

    otherwise also please check a sample of it, hope it will be readable in the post
    ##gff-version 3
    #!gff-spec-version 1.14
    #!source-version NCBI C++ formatter 0.2
    ##Type DNA NC_004556.1
    NC_004556.1 RefSeq source 1 2519802 . + . organism=Xylella fastidiosa Temecula1;mol_type=genomic DNA;strain=Temecula1;db_xref=taxon:183190;note=Pierce%27s disease strain
    NC_004556.1 RefSeq gene 146 1465 . + . ID=NC_004556.1:dnaA;locus_tag=PD0001;db_xref=GeneID:1144259
    NC_004556.1 RefSeq CDS 146 1462 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq start_codon 146 148 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq stop_codon 1463 1465 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq gene 1747 2847 . + . ID=NC_004556.1:dnaN;locus_tag=PD0002;db_xref=GeneID:1144260
    NC_004556.1 RefSeq CDS 1747 2844 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq start_codon 1747 1749 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq stop_codon 2845 2847 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq gene 3153 4247 . + . ID=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;db_xref=GeneID:1144261
    NC_004556.1 RefSeq CDS 3153 4244 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    NC_004556.1 RefSeq start_codon 3153 3155 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    NC_004556.1 RefSeq stop_codon 4245 4247 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    I am looking for the count of all the reads mapped to each gene. Please let me know if it is correct way I am doing using locus_tag and CDs. It is also microbial genome.

    Thank you

    Comment


    • #17
      On the other hand: Is this your complete output? Not a single feature? Or did you cut them off.
      I just gave the last few lines where no features is shown, I also get the count of reads mapped to each locus_tag. eg
      PD0003 23
      PD0001 5
      etc

      Thank you very much

      Comment


      • #18
        Again: Are the five lines you quote the whole output that you get from htseq-count? And have you looked at your data and the GTF file with a genome browser?

        And what is a "locus tag"? (Sorry, I'm not very familiar with prokaryotic data.)

        Comment


        • #19
          Then, you should really check things with a genome browser. Check especially whether the reads are really on the same strand as the gene, to make sure that '-s yes' is correct.

          Comment


          • #20
            Locus tags are systematic, unique identifiers that are assigned to each gene in GenBank. Therefore I used them as identifiers in this context too. However I am not sure if the use of CDS is causing this problem of lack of identification.

            I will also look into the genome browser and get back to you with results. Thank you

            Comment


            • #21
              Results from genome browser view,

              I could see many reads actually mapping to each gene which we call locus tag. I get lesser than the actual count for each gene. Any way to increase it. When I remove the -i option then I get error, tRNAs have no gene id feature, error on 72 line and the program exit. Please let me know how to proceed.

              Any input from guys working with bacterial genomes will be very helpful

              Thank you

              Comment


              • #22
                There are also many regions in between the genes where the reads are mapping....but the ones mapping onto the gene too are not shown in the count

                Comment


                • #23
                  I tried to give command,

                  htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out

                  I get error line 72 gene feature doesnt exist for tRNA .......then the progam exits. When I checked the line I see that the feature in this line has 'exon'. Such lines exists for many tRNAs in this gff file. but which are other transcripts we have different information.
                  NC_004556.1 RefSeq gene 17033 17950 . + . ID=NC_004556.1:hemF;locus_tag=PD0015;db_xref=GeneID:1144215
                  NC_004556.1 RefSeq CDS 17033 17947 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq start_codon 17033 17035 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq stop_codon 17948 17950 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq gene 20728 20804 . - . locus_tag=PD0016;db_xref=GeneID:1144216
                  NC_004556.1 RefSeq exon 20728 20804 . - . gbkey=tRNA;locus_tag=PD0016;product=tRNA-Met;note=found by tRNAscan;db_xref=GeneID:1144216;exon_number=1
                  So I am exculsively giving the command htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out

                  Then in PD_out we see,
                  .........
                  ,..........
                  PD2111 4
                  PD2112 2
                  PD2113 7
                  PD2105 15
                  PD2106 11
                  PD2107 35
                  PD2108 10
                  PD2109 21
                  PD2110 17
                  .....
                  ...
                  ....
                  no_feature 14905057
                  ambiguous 13
                  too_low_aQual 0
                  not_aligned 1913205
                  alignment_not_unique 0

                  Something like this. Am I on the right path.

                  By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.

                  Any suggestions please.

                  Thank you

                  Comment


                  • #24
                    Originally posted by sunkorner View Post
                    htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out
                    This won't work with your GFF file, because it does not have 'gene_id' attributes, and you don't have a '-i'.


                    This is why you needed this here:
                    htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out
                    By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.
                    If you had many read falling onto overlapping genes, they would be counted as ambiguous. "no_feature" means that they don't fall anywhere.

                    What you should do is visualize the 'out' SAM file along your GFF file and see if you find any read that is marked as "no_feature" and appears to be on a feature. Then, make a screenshot and post it here, and we can see what is wrong.

                    Comment


                    • #25
                      I am using DESeq and I get the following error.

                      > exampleFile = "/home/sun/Desktop/xylella_normalized_rpm.tab"
                      > countsTable <- read.delim(exampleFile, header=TRUE, stringsAsFactors=TRUE)
                      > head(countsTable)
                      Locus_tag WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
                      1 PD0001 32.59 35.18 200.56 37.90 173.21 224.03
                      2 PD0002 26.55 11.80 149.89 18.16 130.71 162.92
                      3 PD0003 9.73 7.46 171.14 14.50 129.91 203.61
                      4 PD0004 7.36 6.46 20.83 12.31 16.76 29.63
                      5 PD0005 47.71 69.57 481.91 82.01 372.57 567.03
                      6 PD0006 5.13 4.90 175.65 13.04 154.31 202.34
                      > rownames( countsTable) <- countsTable$Locus_tag
                      > countsTable <- countsTable[ ,-1 ]
                      > conds <- c( "W", "W", "W", "M", "M", "M" )
                      > cds <- newCountDataSet( countsTable, conds )
                      Error in newCountDataSet(countsTable, conds) :
                      The countData is not integer.
                      > cds <- newCountDataSet( countsTable, conds)
                      Error in newCountDataSet(countsTable, conds) :
                      The countData is not integer.
                      > head(countsTable)
                      WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
                      PD0001 32.59 35.18 200.56 37.90 173.21 224.03
                      PD0002 26.55 11.80 149.89 18.16 130.71 162.92
                      PD0003 9.73 7.46 171.14 14.50 129.91 203.61
                      PD0004 7.36 6.46 20.83 12.31 16.76 29.63
                      PD0005 47.71 69.57 481.91 82.01 372.57 567.03
                      PD0006 5.13 4.90 175.65 13.04 154.31 202.34
                      > str(countsTable)
                      'data.frame': 2123 obs. of 6 variables:
                      $ WT.1 : num 32.59 26.55 9.73 7.36 47.71 ...
                      $ WT.2 : num 35.18 11.8 7.46 6.46 69.57 ...
                      $ WT.3 : num 200.6 149.9 171.1 20.8 481.9 ...
                      $ mutant.1: num 37.9 18.2 14.5 12.3 82 ...
                      $ mutant.2: num 173.2 130.7 129.9 16.8 372.6 ...
                      $ mutant.3: num 224 162.9 203.6 29.6 567 ...
                      I have to still visualize my out SAM file, so I will get back on that soon.

                      Thank you

                      Comment


                      • #26
                        Please, read the error message. It says clearly:

                        "Error in newCountDataSet(countsTable, conds) :The countData is not integer."

                        Sorry but you are about the thousandth person to ask this, and I'm a bit desperate because I thought I wrote it nice and clearly in the manual:

                        DESeq expects as input a table, which tells, for each sample and each gene, how many reads fall onto the gene in the sample.

                        It seems that many users feel an irresistible urge to do strange things to their counts and then supply something like the number of read per million reads, the number per kilobase transcript length, or whatever.

                        So, please, people, stop normalizing, dividing, rounding or doing whatever with your data, and just hand over the raw unnormalized integer counts.

                        Comment


                        • #27
                          Thank you very much and also sorry. I am new to RNAseq data analysis and so somehow must have overlooked the details.

                          Comment


                          • #28
                            Error running HTSeq

                            Hi,

                            I'm having the same problem as the original post in this thread. When I run HTSeq I'm getting the error:

                            Warning: Skipping read 'HWI-ST765:7:2307:9917:51869#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.
                            Warning: Skipping read 'HWI-ST765:7:2307:9917:76673#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.
                            This is the command I used:

                            python ../../programs/HTSeq-0.5.3p3/HTSeq/scripts/count.py --mode=intersection-nonempty --stranded=no --idattr=transcript_id -o 24.1.htseq 24.1.sorted.sam ../ec_txn.gtf
                            However, I can see that my SAM file and GTF file are using the same chromosome names so I'm not sure why HTSeq isn't recognizing them.

                            Here are examples of a few lines from my SAM and GTF files:

                            HWI-ST765:7:1101:10001:176633#0 147 chr1 97536 255 101M = 97470 -167 ACAAAGATGTCGCACTTGGCATTGATGTACATGCAGACCTCCTGGCAGACCTTGCGCTCGATGTGCTTCTTGAGCTTGTGGAGCTGTCCTGGGCATCGGAG DDDDDDDDDDDDCCDDDDDDDDDDEEFEDDDDDDDCADDFEFFEHHHHHJIJJJIGGIJJIJIHEIJIJJJJJJJJHEJJJJIJJJJIHBHHHFFDB;CC@ NM:i:0 NH:i:1
                            HWI-ST765:7:1101:10001:176633#0 99 chr1 97470 255 101M = 97536 167 CGGAGGTTTCTGTATCTTCTCCTGTCGTTAACCACGACTCTGTAGAAGTCGCAGTCCCCGACCAGCACAAAGATGTCGCACTTGGCATTGATGTACATGCA BCCFFFBDHHHHHJJJJJJJJJJJJJIFIIJJJJJJIJJJGGIJIJJJHJJIJJIIIJJHHFFDEDDDDDDCDDDCCBDDDDDDDDCDDDDEDEEEEEDDD NM:i:1 NH:i:1
                            HWI-ST765:7:1101:10002:95570#0 147 chr1 107247 255 101M = 107184 -164 TCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACAGGAATGTTCTTCTTTTTTGCCAGCGCCATGACAAGCTTGGAAATTCTTGGCTCTGCATTCTCG BDDDCDDDDDDDDDDDDDDDDDDDDDEDEEEEFEFFFFFFFHHHHHIIGIIIJIJJJJJJJJJJIJIJIIJJJIJIJIJJJJJJJJJJHHHHHFFFFFCCC NM:i:0 NH:i:1
                            HWI-ST765:7:1101:10002:95570#0 99 chr1 107184 255 101M = 107247 164 CGGCAACACAGCACCCCTTGCTCCTCACCTTCCCGCTGGAGCTGACATTCTCCACACCGACAATCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIIIJFHIJIJIIIJGGJIIJIGHHBECDDDCDDDDDDDDDCDDDDDDDDDDDDDDDDDDDDDDDDD NM:i:0 NH:i:1

                            chr1 tophat CDS 4117 4593 . + 0 gene_id "ECU01_0010"; transcript_id "ECU01_0010";
                            chr1 tophat CDS 4391 5086 . - 0 gene_id "ECU01_0020"; transcript_id "ECU01_0020";
                            chr1 tophat CDS 5014 7486 . - 0 gene_id "ECU01_0023"; transcript_id "ECU01_0023";
                            chr1 tophat CDS 7536 8832 . - 0 gene_id "ECU01_0025"; transcript_id "ECU01_0025";
                            chr1 tophat CDS 11158 11529 . - 0 gene_id "ECU01_0030"; transcript_id "ECU01_0030";
                            Let me know if you have any ideas on how to solve this issue.
                            Thanks in advance

                            Comment


                            • #29
                              So I was messing around with parameters and found that when I changed --type=CDS to match my GTF file the program ran perfectly. I guess I was assuming CDS was the standard feature in column 3 of a GTF file since that's what is shown in the example GTF (link in manual). Also, I'm not sure why not specifying the --type attribute resulted in the error message I was getting.

                              Anyways, it is working now. Thanks for this great tool.

                              Comment


                              • #30
                                The default value for 'type' is 'exon', not 'CDS'. This is because a GTF files usually have two lines for each exon, one labelled 'exon' and the other 'CDS', the difference being that the former includes untranslated regions and the latter only the coding part of the exon. (For interior exons, both lines contain the same coordinates.) I felt that most users would want htrseq-count to count a read also if it maps to a gene's UTR and this is why the default is 'exon', not 'CDS'.

                                In your GTF file, all the 'exon' lines seem to be missing, so you hit on the right solution to tell htseq-count to use the 'CDS' lines instead.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  Yesterday, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                58 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                45 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                55 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X