Seqanswers Leaderboard Ad

**GenoMax** · 01-13-2015, 04:57 AM

Obvious thing to check - if chromosome names in your sam file match the annotation.

There are 270 "chromosomes" since there are entries of things like (besides the standard chromosomes 1,2,3 etc)

CHR_HSCHR9_1_CTG2
CHR_HSCHR17_5_CTG4
CHR_HSCHR16_4_CTG1
CHR_HSCHR19_5_CTG2
CHR_HSCHR9_1_CTG4
CHR_HSCHR19_4_CTG2
KI270726.1
KI270711.1
KI270714.1
KI270713.1
CHR_HSCHR12_5_CTG2)

in the GTF file.

**shi** · 01-13-2015, 01:38 PM

featureCounts also outputs assignment stat in a .summary file. Could you show the content of that file?

**LeonDK** · 01-13-2015, 11:11 PM

Originally posted by shi View Post

featureCounts also outputs assignment stat in a .summary file. Could you show the content of that file?

Status mySAMPLE/Aligned.out.sam
Assigned 696763
Unassigned_Ambiguity 11448
Unassigned_MultiMapping 13953741
Unassigned_NoFeatures 17772725
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragementLength 13813566
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0

**LeonDK** · 01-13-2015, 11:30 PM

Originally posted by GenoMax View Post

Obvious thing to check - if chromosome names in your sam file match the annotation.

How?

Originally posted by GenoMax View Post

There are 270 "chromosomes" since there are entries of things like (besides the standard chromosomes 1,2,3 etc) in the GTF file.

What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?

**dpryan** · 01-14-2015, 12:51 AM

Originally posted by LeonDK View Post

How?

What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?

The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.

**LeonDK** · 01-14-2015, 01:03 AM

Originally posted by dpryan View Post

The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.

I'm using this annotation file:
ftp://ftp.ensembl.org/pub/release-78...Ch38.78.gtf.gz

And this genome:
ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz

The first column of the annotation file contain 1..22+MT/X/Y and other things, such as e.g. 'CHR_HG142_HG150_NOVEL_TEST' and the header of my SAM file contain:
cat mySAMPLE/Aligned.out.sam | grep '@SQ'
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:1 LN:248956422
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:2 LN:242193529
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:MT LN:16569
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415

**dpryan** · 01-14-2015, 01:12 AM

If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.

**LeonDK** · 01-14-2015, 01:28 AM

Originally posted by dpryan View Post

If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.

If I use "-s 2", which is the option for "reverse stranded", the mapping dramatically increases to ~50%. But the kit is called "TruSeq stranded", not "reverse stranded"?

**dpryan** · 01-14-2015, 01:32 AM

The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".

**LeonDK** · 01-14-2015, 03:53 AM

Originally posted by dpryan View Post

The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".

Thank you!

**shi** · 01-14-2015, 01:55 PM

There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.

**LeonDK** · 01-14-2015, 11:13 PM

Originally posted by dpryan View Post

BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

What is the difference between 'primary' [1] and 'top level' [2] assembly?

1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
2. ftp://ftp.ensembl.org/pub/release-78...toplevel.fa.gz

**LeonDK** · 01-14-2015, 11:15 PM

Originally posted by shi View Post

There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.

I will try to re-do the mapping based on the 'top level' assembly, as suggested by dpryan and then I will re-run featureCounts with updated parameters as suggested by you. I will return with the results, once they are ready!

Cheers

**dpryan** · 01-15-2015, 12:15 AM

The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 26 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

featureCounts: 270 chr in GRCh38 and only assigning 1.5%?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News