Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Raw read counts for RNAseq

    hi All,
    I am very new to this NGS business.. Presently I am trying my hands on RNA-seq data. I have managed to obtain the SAM files by running topHat.
    I also ran cufflinks on those files. I now want to conduct differential expression analysis using DESeq package. (and eventually compare the results of cufflinks and DESeq). How do I obtain the raw read counts from the SAM file? I read that that is the input to DESeq. Can someone please help?
    thanks.

  • #2
    wc -l <sam file>

    Comment


    • #3
      I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
      T1a T1b T2 T3 N1 N2
      Gene_00001 0 0 2 0 0 1
      Gene_00002 20 8 12 5 19 26
      Gene_00003 3 0 2 0 0 0
      Gene_00004 75 84 241 149 271 257
      Gene_00005 10 16 4 0 4 10
      Gene_00006 129 126 451 223 243 149

      Comment


      • #4
        Originally posted by biofreak View Post
        I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
        T1a T1b T2 T3 N1 N2
        Gene_00001 0 0 2 0 0 1
        Gene_00002 20 8 12 5 19 26
        Gene_00003 3 0 2 0 0 0
        Gene_00004 75 84 241 149 271 257
        Gene_00005 10 16 4 0 4 10
        Gene_00006 129 126 451 223 243 149
        Hi,

        Starting from the output of TopHat you can use this pipeline:

        Code:
        samtools sort -n accepted_hits.bam accepted_hits.nsorted  ## Sort by read name (necessary for htseq-count)
        samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
        python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf  ## See the options of htseq that suite you
        For each library, this will give you a table of read counts for each gene.

        Hope it helps

        Dario

        Comment


        • #5
          thanks Dario for the reply.
          I did run htSeq-Count and obtained the following output for each of the SAM files.
          .
          .
          NM_000019 246
          NM_000020 0
          NM_000021 0
          NM_000022 489
          NM_000023 2
          NR_038131 14
          NR_038133 0
          NR_038135 0
          NR_038146 0
          NR_038150 0
          no_feature 4796180
          ambiguous 1464598
          too_low_aQual 0
          not_aligned

          Is the number of ambiguous reads too high?

          Comment


          • #6
            Originally posted by biofreak View Post
            thanks Dario for the reply.
            I did run htSeq-Count and obtained the following output for each of the SAM files.
            .
            .
            NM_000019 246
            NM_000020 0
            NM_000021 0
            NM_000022 489
            NM_000023 2
            NR_038131 14
            NR_038133 0
            NR_038135 0
            NR_038146 0
            NR_038150 0
            no_feature 4796180
            ambiguous 1464598
            too_low_aQual 0
            not_aligned

            Is the number of ambiguous reads too high?
            Well how many reads were inyour data set? Can't say if the number is too high if we don't know what fraction they represent.

            Comment


            • #7
              Starting from the output of TopHat you can use this pipeline:

              samtools sort -n accepted_hits.bam accepted_hits.nsorted ## Sort by read name (necessary for htseq-count)
              samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
              python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf ## See the options of htseq that suite you

              Hi Dario,

              I am trying to do exactly you mention above, the first step is to sort the accepted_hits.bam. But it looks like to generate a lots of file. Does it looks correct? As attached. Thanks.
              Attached Files

              Comment


              • #8
                This is ok. It takes some time and after finishing the process the temporary files are deleted.

                Comment


                • #9
                  Originally posted by hanshart View Post
                  This is ok. It takes some time and after finishing the process the temporary files are deleted.
                  HI Hanshart,

                  Is there any problem with my sort? It showed segmentation fault but this accepeted_hits.bam is from tophat output. Thanks

                  [v1wwanm-lx03 S1_R1_thout]$ ls
                  accepted_hits.bam accepted_hits.sorted.0004.bam accepted_hits.sorted.0012.bam accepted_hits.sorted.0020.bam pullout
                  accepted_hits.bed accepted_hits.sorted.0005.bam accepted_hits.sorted.0013.bam accepted_hits.sorted.0021.bam test_extracted_locus.out
                  accepted_hits.index accepted_hits.sorted.0006.bam accepted_hits.sorted.0014.bam deletions.bed tmp
                  accepted_hits.sam accepted_hits.sorted.0007.bam accepted_hits.sorted.0015.bam extracted.out unmapped.bam
                  accepted_hits.sorted.0000.bam accepted_hits.sorted.0008.bam accepted_hits.sorted.0016.bam insertions.bed
                  accepted_hits.sorted.0001.bam accepted_hits.sorted.0009.bam accepted_hits.sorted.0017.bam junctions.bed
                  accepted_hits.sorted.0002.bam accepted_hits.sorted.0010.bam accepted_hits.sorted.0018.bam logs
                  accepted_hits.sorted.0003.bam accepted_hits.sorted.0011.bam accepted_hits.sorted.0019.bam prep_reads.info
                  [v1wwanm-lx03 S1_R1_thout]$ [bam_sort_core] merging from 25 files...
                  [bam_header_read] bgzf_check_EOF: Invalid argument
                  [bam_header_read] invalid BAM binary header (this is not a BAM file).
                  [bam_header_read] bgzf_check_EOF: Invalid argument
                  [bam_header_read] invalid BAM binary header (this is not a BAM file).

                  [1]+ Segmentation fault samtools sort accepted_hits.bam accepted_hits.sorted

                  Comment


                  • #10
                    Hi,

                    How can we change the identifier to real gene name. The gene name should be taken from genes.gtf, isn't it? Thank you


                    ENSG00000261838 0
                    ENSG00000261839 2
                    ENSG00000261840 3
                    ENSG00000261841 0
                    ENSG00000261842 0
                    no_feature 985727
                    ambiguous 435068
                    too_low_aQual 0
                    not_aligned 0
                    alignment_not_unique 778451

                    Comment


                    • #11
                      How does your genes.gtf file look like? Does it contain gene names?

                      Comment


                      • #12
                        Hi Simon,

                        I took the gtf file from ENSEMBL (genes.gtf). Here is the file looks like. It is possible to change from identifier to gene name? Thank you.

                        1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-002"; tss_id "TSS26614";
                        1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-201"; tss_id "TSS165962";
                        1 transcribed_unprocessed_pseudogene exon 11874 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; tss_id "TSS165964";
                        1 transcribed_unprocessed_pseudogene exon 12010 12057 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-001"; tss_id "TSS72060";

                        Comment


                        • #13
                          You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".

                          Comment


                          • #14
                            Originally posted by Simon Anders View Post
                            You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".
                            It's work! Tq simon..

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-27-2024, 06:37 PM
                            0 responses
                            12 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-27-2024, 06:07 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            69 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X