Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • featureCounts: 270 chr in GRCh38 and only assigning 1.5%?

    INPUT
    Stranded RNAseq, 101bp Illumina paired end reads mapped to GRCh38 release 78 using star.

    COMMAND
    featureCounts -a Homo_sapiens.GRCh38.78.gtf -o mySAMPLE/mySAMPLE.featureCounts -F GTF -t exon -g gene_name -s 1 -T 10 -p -P mySAMPLE/Aligned.out.sam

    QUESTIONS
    1. Why does it say, that 'ftp://ftp.ensembl.org/pub/release-78/gtf/homo_sapiens/Homo_sapiens.GRCh38.78.gtf.gz' has 270 chromosomes?
    2. What could be the reason for the low fragment assignment, 697168 / 46248243 = 1.5%
    3. Have I used wrong options?

    Cheers!

  • #2
    Obvious thing to check - if chromosome names in your sam file match the annotation.

    There are 270 "chromosomes" since there are entries of things like (besides the standard chromosomes 1,2,3 etc)

    CHR_HSCHR9_1_CTG2
    CHR_HSCHR17_5_CTG4
    CHR_HSCHR16_4_CTG1
    CHR_HSCHR19_5_CTG2
    CHR_HSCHR9_1_CTG4
    CHR_HSCHR19_4_CTG2
    KI270726.1
    KI270711.1
    KI270714.1
    KI270713.1
    CHR_HSCHR12_5_CTG2)
    in the GTF file.

    Comment


    • #3
      featureCounts also outputs assignment stat in a .summary file. Could you show the content of that file?

      Comment


      • #4
        Originally posted by shi View Post
        featureCounts also outputs assignment stat in a .summary file. Could you show the content of that file?
        Status mySAMPLE/Aligned.out.sam
        Assigned 696763
        Unassigned_Ambiguity 11448
        Unassigned_MultiMapping 13953741
        Unassigned_NoFeatures 17772725
        Unassigned_Unmapped 0
        Unassigned_MappingQuality 0
        Unassigned_FragementLength 13813566
        Unassigned_Chimera 0
        Unassigned_Secondary 0
        Unassigned_Nonjunction 0
        Unassigned_Duplicate 0

        Comment


        • #5
          Originally posted by GenoMax View Post
          Obvious thing to check - if chromosome names in your sam file match the annotation.
          How?
          Originally posted by GenoMax View Post
          There are 270 "chromosomes" since there are entries of things like (besides the standard chromosomes 1,2,3 etc) in the GTF file.
          What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?

          Comment


          • #6
            Originally posted by LeonDK View Post
            How?

            What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?
            The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

            BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.

            Comment


            • #7
              Originally posted by dpryan View Post
              The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

              BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.
              I'm using this annotation file:
              ftp://ftp.ensembl.org/pub/release-78...Ch38.78.gtf.gz

              And this genome:
              ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz

              The first column of the annotation file contain 1..22+MT/X/Y and other things, such as e.g. 'CHR_HG142_HG150_NOVEL_TEST' and the header of my SAM file contain:
              cat mySAMPLE/Aligned.out.sam | grep '@SQ'
              @SQ SN:10 LN:133797422
              @SQ SN:11 LN:135086622
              @SQ SN:12 LN:133275309
              @SQ SN:13 LN:114364328
              @SQ SN:14 LN:107043718
              @SQ SN:15 LN:101991189
              @SQ SN:16 LN:90338345
              @SQ SN:17 LN:83257441
              @SQ SN:18 LN:80373285
              @SQ SN:19 LN:58617616
              @SQ SN:1 LN:248956422
              @SQ SN:20 LN:64444167
              @SQ SN:21 LN:46709983
              @SQ SN:22 LN:50818468
              @SQ SN:2 LN:242193529
              @SQ SN:3 LN:198295559
              @SQ SN:4 LN:190214555
              @SQ SN:5 LN:181538259
              @SQ SN:6 LN:170805979
              @SQ SN:7 LN:159345973
              @SQ SN:8 LN:145138636
              @SQ SN:9 LN:138394717
              @SQ SN:MT LN:16569
              @SQ SN:X LN:156040895
              @SQ SN:Y LN:57227415

              Comment


              • #8
                If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

                Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.

                Comment


                • #9
                  Originally posted by dpryan View Post
                  If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

                  Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.
                  If I use "-s 2", which is the option for "reverse stranded", the mapping dramatically increases to ~50%. But the kit is called "TruSeq stranded", not "reverse stranded"?

                  Comment


                  • #10
                    The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".

                    Comment


                    • #11
                      Originally posted by dpryan View Post
                      The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".
                      Thank you!

                      Comment


                      • #12
                        There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

                        You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

                        Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.

                        Comment


                        • #13
                          Originally posted by dpryan View Post
                          BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).
                          What is the difference between 'primary' [1] and 'top level' [2] assembly?

                          1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
                          2. ftp://ftp.ensembl.org/pub/release-78...toplevel.fa.gz

                          Comment


                          • #14
                            Originally posted by shi View Post
                            There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

                            You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

                            Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.
                            I will try to re-do the mapping based on the 'top level' assembly, as suggested by dpryan and then I will re-run featureCounts with updated parameters as suggested by you. I will return with the results, once they are ready!

                            Cheers

                            Comment


                            • #15
                              The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

                              BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              31 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X