Seqanswers Leaderboard Ad

**LeonDK** · 01-15-2015, 12:47 AM

Originally posted by dpryan View Post

The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).

First: Thank you so much for your input - I really appreciate your time and hope that others, which later on potentially will end up reading this via google, will also benefit!

Next: Having recently finished my Bioinformatics PhD in statistical methodology and vaccine development, the world of NGS is new to me!

Finally: I am using the STAR aligner (STAR_2.4.0h1) and I have build genome indices using GRCh38 [1] as described in [2]. I have then aligned my human 101bp Illumina paired-end reads to this build, using options "ENCODE standard options for long RNA-seq" in [2] section 3.2.1 and then finally I am trying to use featureCounts (Version 1.4.6) to generate a count matrix. Based on this I aim at performing DEG analysis.

Questions:
a) Is this approach valid/recommendable or have I missed something?
b) Given the above scenario, would you recommend using the 'primary' or the 'top level' assembly?
c) Any other input?

Cheers again!

1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
2. https://github.com/alexdobin/STAR/bl...STARmanual.pdf

**dpryan** · 01-15-2015, 12:52 AM

Welcome to the NGS side of bioinfo!

a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

Anyway, it sounds like you're on the right track!

**LeonDK** · 01-15-2015, 01:02 AM

Originally posted by dpryan View Post

Welcome to the NGS side of bioinfo!

a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

Anyway, it sounds like you're on the right track!

Thanks! So far, I find it a bit of a strange world, with a lot of 'best guesses'!

Regarding 'trimming', I have been told, that this is not necessary, since the aligner will simply 'skip' adapters/bad quality bases, apparently using something called 'soft-clipping'?

**dpryan** · 01-15-2015, 01:10 AM

Originally posted by LeonDK View Post

Regarding 'trimming', I have been told, that this is not necessary, since the aligner will simply 'skip' adapters/bad quality bases, apparently using something called 'soft-clipping'?

Jein, as my German colleagues would say. Aligners can nicely soft-clip low-quality or otherwise mismatched stretches of sequence. However, adapter sequence is going to be both good quality and easy to spot. The general worry is that by leaving it in there, you can bias what should otherwise be multimapping reads/pairs to aberrantly become uniquely mapping. Since adapter sequences can be spotted with high fidelity, they can just be trimmed off to avoid this. Quality trimming doesn't gain much for RNAseq (there are some papers on this), though it seems that trimming off the really low quality bases (<5 or so) seems to have some benefit). Anyway, I believe that STAR has an option to adapter trim for you, which is probably the most convenient way to go.

**LeonDK** · 01-15-2015, 01:34 AM

Originally posted by dpryan View Post

Jein, as my German colleagues would say. Aligners can nicely soft-clip low-quality or otherwise mismatched stretches of sequence. However, adapter sequence is going to be both good quality and easy to spot. The general worry is that by leaving it in there, you can bias what should otherwise be multimapping reads/pairs to aberrantly become uniquely mapping. Since adapter sequences can be spotted with high fidelity, they can just be trimmed off to avoid this. Quality trimming doesn't gain much for RNAseq (there are some papers on this), though it seems that trimming off the really low quality bases (<5 or so) seems to have some benefit). Anyway, I believe that STAR has an option to adapter trim for you, which is probably the most convenient way to go.

Ok, I think that makes sense! And 'Jein' would be 'Nja' in Danish

Looking in the manual, STAR does indeed have options for adapter removal:
----------------------------------------
--clip3pAdapterSeq
default: -
string(s): adapter sequences to clip from 3p of each mate. If one value is given,
it will be assumed the same for both mates.
--clip3pAdapterMMp
default: 0.1
double(s): max proportion of mismatches for 3p adpater clipping for each mate.
If one value is given, it will be assumed the same for both mates.
--clip3pAfterAdapterNbases
default: 0
int(s): number of bases to clip from 3p of each mate after the adapter clipping.
If one value is given, it will be assumed the same for both mates.
----------------------------------------

However I am not quite sure of:
a) Which would be the right option to use
b) How do I find out the appropriate adapter sequence to state to STAR?

**dpryan** · 01-15-2015, 01:39 AM

a) --clip3pAdapterSeq
b) You'd need to either run fastQC (or something like that) or just ask whomever did the sequencing. They can give you the adapter sequence (it's presumably just the standard Illumina adapter sequence).

**LeonDK** · 01-15-2015, 03:50 AM

Originally posted by dpryan View Post

a) --clip3pAdapterSeq
b) You'd need to either run fastQC (or something like that) or just ask whomever did the sequencing. They can give you the adapter sequence (it's presumably just the standard Illumina adapter sequence).

Ok, so the kit used is [1] and the standard Illumina adapter sequences can be found in [2], where it on p12 says:

TruSeq Universal Adapter
5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

Is this then the sequence, I should state after the '--clip3pAdapterSeq' option? Or should I reverse it like so:

3' TCTAGCCTTCTCGCAGCACATCCCTTTCTCACATCTAGAGCCACCAGCGGCATAGTAA

?

1. http://www.illumina.com/products/tru..._prep_kit.html
2. http://supportres.illumina.com/docum...nce-letter.pdf

**shi** · 01-15-2015, 02:14 PM

For read mapping, you may also try Subread aligner. Its performance has been demonstrated in the recent SEQC/MAQC III study:

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/25150838

We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA …

This publication includes detailed results from the Subread-featureCounts-limma/voom pipeline and comparisons with other pipelines from using billions of RNA-seq reads generated in six different sequencing centers.

**LeonDK** · 01-16-2015, 01:05 AM

Originally posted by shi View Post

For read mapping, you may also try Subread aligner. Its performance has been demonstrated in the recent SEQC/MAQC III study:

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium - PubMed

http://www.ncbi.nlm.nih.gov/pubmed/25150838

We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA …

This publication includes detailed results from the Subread-featureCounts-limma/voom pipeline and comparisons with other pipelines from using billions of RNA-seq reads generated in six different sequencing centers.

Thank you for your input. The reason for using star is speed and that it "is-the-aligner-of-choice", where I am working

**suhaib58** · 03-12-2015, 10:56 AM

Hi all,

Similar to the question posted by Leon regarding the chromosomes no: 270 with latest release 78, GRCh38.gtf from ensemble I found the same issue in my analysis. I am using RNA-seq paired end data illumina 75bp read outs. After QC and trimming adapters and low quality bp, I have aligned trimmed paired reads with Tophat2.0/Bowtie2. Before aligning, concatenated all .fa files for chromosomes 1:22, +M+X+Y, and then built index on the concatenated.GrCh38.fa. Using the bam file generated featureCount program, read summarisation was performed keeping parameter s -2 (52.1% assigned fragments) as s -1 (~1.5%) was giving significantly low successfully assigned fragments.

Result
Meta-features : 58674
Chromosomes : 270
This was tested in standalone FeatureCount program coded in C.

When the same data was used in Rsubread featureCount function
Meta-features : 64253
Chromosomes : 270

Questions
1. Is there any discrepancy in the program in C and R package.
2. Chromosomes 270 and meta feature 58674 makes sense ?
3. Increasing number of threads to 20 in both (C and R) is not increasing the speed of the program, and none of the cores/threads are being utilised.

**shi** · 03-12-2015, 01:41 PM

Can you provide your commands and also the versions of programs you use? This will help find out the difference between your two runs.

Also, the percentages of assigned fragments suggest your data are reversely stranded reads. So you should use '-s 2', not '-s 1'.

Cheers,
Wei

**suhaib58** · 03-13-2015, 02:29 AM

I just found when when gene_name is used both the packages it provides consistent meta-features.
C version
featureCounts -p -s 2 -M \
-a ../Homo_sapiens.GRCh38.78.gtf -o counts.txt -t exon -g gene_name -T 10 *accepted_hits.bam

R subread
count<-featureCounts(files="accepted_hits.bam",
,annot.ext="Homo_sapiens.GRCh38.78.gtf",isGTFAnnotationFile=TRUE,
GTF.featureType="exon",GTF.attrType="gene_name",useMetaFeatures=TRUE,
allowMultiOverlap=FALSE,isPairedEnd=TRUE,requireBothEndsMapped=FALSE,
checkFragLength=FALSE,minFragLength=50,maxFragLength=1000,
nthreads=20,strandSpecific=2,minMQS=0,
readExtension5=0,readExtension3=0,read2pos=NULL,
minReadOverlap=1,countSplitAlignmentsOnly=FALSE,
countMultiMappingReads=TRUE,countPrimaryAlignmentsOnly=FALSE,
countChimericFragments=TRUE,ignoreDup=FALSE,chrAliases=NULL,reportReads=FALSE)

However, when multiple featurecounts ran on different tabs for different samples. Multi-threading does not get used up. However, for single run, its much faster.

**suhaib58** · 03-13-2015, 03:27 AM

total assigned fragments : 47.9%

Summary
Status 34b1_accepted_hits.bam
Assigned 28314328
Unassigned_Ambiguity 6375026
Unassigned_MultiMapping 0
Unassigned_NoFeatures 24480296
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragementLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0

Any reason why there are so many unassigned Nofeature and ambiguous reads. When the aligned bam file was tested with samtools flagstats function it showed 100% mapped reads and ~86% properly paired reads.

**shi** · 03-15-2015, 01:52 PM

Having a high mapping percentage does not mean you will have a high percentage of reads being assigned to genes included in your annotation. This is affected by many factors such as the completeness of your annotation, your mRNA enrichment quality and etc.

I typically see around 50-60% of reads successfully assigned to genes when NCBI RefSeq gene annotation is used.

Wei

**suhaib58** · 03-17-2015, 03:55 AM

Thank you very much for your advise and time.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News