  • counting RNA-seq matches

    I have an RNA-seq data set and a set of genome-wide coding sequences.

    My goal is to count how many RNA-seq reads match each coding sequence. What kind of tools would fit my purpose?

  • #2
    htseq-count

    • #3
      Bowtie was also suggested to me. Bowtie basically outputs an alignment report, so how do I do the counting/statistics part?

      • #4
        The program I mentioned above, htseq-count, takes a SAM alignment file and a GTF/GFF file describing your features and gives you the number of reads aligning to each feature. These counts are then appropriate input for statistical packages such as DESeq and edgeR.
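To make the idea of "reads per feature" concrete, here is a minimal Python sketch of overlap-based counting. It is not htseq-count's implementation: it assumes ungapped alignments, ignores strand and multi-mapping, and the function name, counter labels, and toy data are all made up for illustration. Reads that overlap more than one feature are set aside as ambiguous, which loosely mirrors how htseq-count treats such reads by default.

```python
from collections import Counter


def count_reads_per_feature(sam_lines, features):
    """Count reads whose alignment overlaps exactly one feature.

    sam_lines: iterable of SAM-format text lines (header lines start with '@').
    features:  list of (name, chrom, start, end) tuples with 1-based,
               inclusive coordinates, as used in GTF/GFF files.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, seq = int(fields[1]), fields[2], int(fields[3]), fields[9]
        if flag & 4:
            counts["not_aligned"] += 1  # SAM flag 0x4 marks an unmapped read
            continue
        # Simplification (assumption): the alignment is ungapped, so the read
        # covers pos .. pos + len(seq) - 1 on the reference.
        read_start, read_end = pos, pos + len(seq) - 1
        hits = [
            name
            for name, f_chrom, f_start, f_end in features
            if f_chrom == chrom and read_start <= f_end and f_start <= read_end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical toy input: two overlapping genes and four reads.
toy_features = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 180, 300)]
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t110\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr1\t185\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t0\tchr1\t500\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
toy_counts = count_reads_per_feature(toy_sam, toy_features)
```

In this toy input, r1 falls in geneA only, r2 overlaps both genes, r3 hits no feature, and r4 is unmapped. For real data, htseq-count itself handles the cases this sketch skips.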

        • #5
          I like to use RSEM for mapping (uses Bowtie) and isoform expression estimations, followed by DESeq for the differential expression statistics.

          If you just want raw read counts, map your reads (take some time to find the right software and options here, as this depends on the sequencing technology, sample and reference) and extract the counts from the BAM file with "samtools idxstats".

          • #6
            Originally posted by arvid:
            I like to use RSEM for mapping (uses Bowtie) and isoform expression estimations, followed by DESeq for the differential expression statistics.

            If you just want raw read counts, map your reads (take some time to find the right software and options here, as this depends on the sequencing technology, sample and reference) and extract the counts from the BAM file with "samtools idxstats".

            "samtools idxstats file.bam" seems to compute the number of reads per reference sequence, which typically means per chromosome. To get the number of reads per gene, htseq-count is indeed a valid option. I think that BEDTools is another one.
            BTW, I am curious to know whether one is much faster than the other.
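For reference, "samtools idxstats" prints one tab-separated line per reference sequence: name, sequence length, number of mapped reads, number of unmapped reads. If the reads were mapped against the coding sequences themselves (as in the original question) rather than against chromosomes, the per-reference counts are already per-CDS counts. A minimal Python sketch for reading that output follows; the function name and the example text are hypothetical.

```python
def parse_idxstats(text):
    """Return {reference_name: mapped_read_count} from idxstats output."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length, mapped, unmapped = line.split("\t")
        if name == "*":
            continue  # the final '*' line covers reads without a reference
        counts[name] = int(mapped)
    return counts


# Made-up output for two coding sequences plus the trailing '*' line.
example_idxstats = "cds_0001\t1200\t35\t0\ncds_0002\t900\t0\t0\n*\t0\t0\t12\n"
cds_counts = parse_idxstats(example_idxstats)
```

The text could be captured with something like `subprocess.run(["samtools", "idxstats", "file.bam"], capture_output=True, text=True).stdout`, assuming samtools is installed and the BAM file is sorted and indexed.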

            • #7
              BEDTools coverageBed, given a BED file of genes, can be used with a BAM file of Bowtie alignments...

              • #8
                Besides Simon's Python-based HTSeq, if you have experience with R and Bioconductor, the summarizeOverlaps function in GenomicRanges is easy to use too; it follows the same pattern defined in HTSeq.

                countByOverlaps could also do the trick if you can get your data and features into the right form.

                • #9
                  I used bowtie for alignment and samtools idxstats for counting. It works!

                  However, Bowtie only lets me constrain alignments by the number of mismatches. Can I set a constraint on either percent identity or P-value, with Bowtie or with other tools?

                  • #10
                    My reads are all about 100 bp. I want to count any alignment with 90% identity or higher. I notice that Bowtie only allows a maximum of 3 mismatches. How do I increase the allowed mismatches to 10?

                    • #11
                      Originally posted by shuang:
                      My reads are all about 100 bp. I want to count any alignment with 90% identity or higher. I notice that Bowtie only allows a maximum of 3 mismatches. How do I increase the allowed mismatches to 10?

                      With Bowtie, the -n option (0-3) applies to the seed only (by default the first 28 bases of the read). If you increase -e (the maximum sum of mismatch qualities), more mismatches are allowed across the whole alignment. You could also use the -v option (report end-to-end hits with <= v mismatches; ignore qualities) instead of -e if you want to allow a specific number of mismatches.

                      The settings in Bowtie2 (currently in beta5) are simplified and might suit your purposes better...
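A different route to a percent-identity cutoff, not something suggested in this thread, is to align permissively and filter the alignments afterwards. The Python sketch below assumes SAM output (Bowtie's -S option) that carries the NM:i edit-distance tag and assumes ungapped alignments, so identity is simply (read length - NM) / read length. The function names and toy SAM lines are hypothetical.

```python
def percent_identity(sam_line):
    """Percent identity of an ungapped alignment, from the NM tag.

    Returns None if the read is unmapped or the line has no NM tag.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[1]) & 4:
        return None  # SAM flag 0x4: read is unmapped
    read_length = len(fields[9])
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
            return 100.0 * (read_length - edit_distance) / read_length
    return None


def keep_alignment(sam_line, min_identity=90.0):
    """True if the alignment reaches the identity threshold."""
    identity = percent_identity(sam_line)
    return identity is not None and identity >= min_identity


# Made-up 20 bp reads with 1 and 3 mismatches respectively.
toy_seq = "ACGTACGTACGTACGTACGT"
toy_qual = "I" * 20
good_line = f"r1\t0\tcds_0001\t5\t255\t20M\t*\t0\t0\t{toy_seq}\t{toy_qual}\tNM:i:1"
poor_line = f"r2\t0\tcds_0001\t9\t255\t20M\t*\t0\t0\t{toy_seq}\t{toy_qual}\tNM:i:3"
kept = [line.split("\t")[0] for line in (good_line, poor_line) if keep_alignment(line)]
```

Here r1 has 95% identity and is kept, while r2 has 85% identity and is dropped. For gapped alignments the denominator would need to come from the CIGAR string rather than the read length.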

                      • #12
                        The average read length of my RNA-seq data is 83 bp. The reference sequences are coding sequences, including genomic, chloroplast, and mitochondrial ones, from the same species and strain.

                        Ideally, I want to set a threshold of about 90% identity for finding matches. I set the parameters to -n 2 -l 15 -e 10.

                        However, only about 30% of the reads aligned, while I expected almost 100%. Where did I go wrong?

                        • #13
                          Originally posted by shuang:
                          The average read length of my RNA-seq data is 83 bp. The reference sequences are coding sequences, including genomic, chloroplast, and mitochondrial ones, from the same species and strain.

                          Ideally, I want to set a threshold of about 90% identity for finding matches. I set the parameters to -n 2 -l 15 -e 10.

                          However, only about 30% of the reads aligned, while I expected almost 100%. Where did I go wrong?

                          Not sure whether you'll see massive improvements, but you should set -e much higher (you set it lower than the default of 70). It is measured in summed mismatch qualities, not in bases. Try something extreme like "-e 9999999" to see whether that gives you more alignments...
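To see why -e 10 is so restrictive, here is a small Python sketch of the quantity -e limits: the sum of base qualities at mismatched positions. By default Bowtie, like Maq, rounds each quality to the nearest 10 and caps it at 30 before summing (this can be disabled with --nomaqround). The exact rounding of halfway values below is an assumption, and the function names are hypothetical.

```python
def maq_rounded_quality(phred):
    """Round a Phred quality to the nearest 10 and cap it at 30.

    Assumption: halfway values (e.g. 15) round up.
    """
    return min(30, ((phred + 5) // 10) * 10)


def mismatch_quality_sum(mismatch_phreds, maq_round=True):
    """Sum of (optionally Maq-rounded) qualities at mismatched positions."""
    if maq_round:
        return sum(maq_rounded_quality(q) for q in mismatch_phreds)
    return sum(mismatch_phreds)


def passes_e_threshold(mismatch_phreds, max_sum=70):
    """True if the mismatch quality sum stays within the -e style limit."""
    return mismatch_quality_sum(mismatch_phreds) <= max_sum


# About 10% mismatches in an 83 bp read, all at high-quality (Q40) bases.
eight_mismatches = [40] * 8
needed_e = mismatch_quality_sum(eight_mismatches)
single_ok_at_10 = passes_e_threshold([40], max_sum=10)
two_ok_at_default = passes_e_threshold([40, 40], max_sum=70)
```

Under these assumptions a single high-quality mismatch already costs 30, so -e 10 rejects it, the default of 70 tolerates two such mismatches, and eight of them need -e of at least 240.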

                          • #14
                            Thank you. This works!

                            I have one more question about the alignment. How do I set a threshold on the P-value or on the minimum alignment length? Basically, I don't want alignments that are too short, such as shorter than 50 bp.
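One possible way to enforce a minimum alignment length, offered as a sketch rather than an answer from the thread, is to filter the SAM output on the CIGAR string. Bowtie 1 aligns reads end to end, so there the aligned length equals the read length; the filter matters mostly for aligners that soft-clip reads. The Python below counts only bases aligned to the reference (CIGAR operations M, = and X); the function names and toy lines are hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_bases(cigar):
    """Number of read bases aligned to the reference (M, = and X operations)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def long_enough(sam_line, min_aligned=50):
    """True if the read is mapped and at least min_aligned bases are aligned."""
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[1]) & 4 or fields[5] == "*":
        return False  # unmapped read or no CIGAR available
    return aligned_bases(fields[5]) >= min_aligned


# Made-up alignments: a full-length 83 bp match and a heavily clipped one.
toy_seq = "A" * 83
toy_qual = "I" * 83
full_line = f"r1\t0\tcds_0001\t1\t255\t83M\t*\t0\t0\t{toy_seq}\t{toy_qual}"
clipped_line = f"r2\t0\tcds_0001\t1\t255\t40M43S\t*\t0\t0\t{toy_seq}\t{toy_qual}"
passing = [line.split("\t")[0] for line in (full_line, clipped_line) if long_enough(line)]
```

In this toy input only r1 passes the 50 bp cutoff, because r2 has just 40 aligned bases.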
