SEQanswers

Old 02-19-2016, 01:29 PM   #201
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Hi Rui,

here are a few suggestions in addition to @GenoMax's suggestion.

1. You are using --alignEndsType EndToEnd, which requires end-to-end alignment for each read (no soft clipping). This might be too harsh for longer reads, which are more likely to have poor quality tails, adapters at the ends etc. Please try to map without this option.
2. Map read1 and read2 separately - you may have a problem with one of the reads.
3. Check sequencing quality by plotting quality scores vs position in read (Illumina pipelines typically produce these plots). If sequencing quality drops towards the ends of the reads for a substantial portion of the reads, this would explain poor mappability.

Cheers
Alex
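
A rough sketch of the check in suggestion 3, assuming plain four-line FASTQ records and Phred+33 quality encoding (both are assumptions, as is the function name `mean_quality_by_position`). It only averages the quality score at each read position and is not a substitute for the plots produced by the Illumina pipeline or FastQC.

```python
from collections import defaultdict


def mean_quality_by_position(lines, offset=33):
    """Return the mean Phred score at each read position.

    `lines` is any iterable of FASTQ lines; four lines per record is assumed.
    `offset` is the ASCII offset of the quality encoding (33 assumed here).
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            sums[pos] += ord(ch) - offset
            counts[pos] += 1
    return [sums[p] / counts[p] for p in sorted(counts)]


# Two toy records; the second read ends in a low-quality base ('#' = Q2).
example = ["@r1", "ACGT", "+", "IIII", "@r2", "ACG", "+", "II#"]
print(mean_quality_by_position(example))  # [40.0, 40.0, 21.0, 40.0]
```

For a gzipped file the lines could come from `gzip.open(path, "rt")`. A mean that falls off sharply over the last positions would be consistent with the poor-quality tails described above.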
Old 02-19-2016, 07:52 PM   #202
harrike
Member
 
Location: St. Louis, MO

Join Date: Jun 2010
Posts: 29
Default

Hi Alex,

Thanks for your suggestions.

I manually checked a couple of reads as GenoMax suggested, and found that the major reason for the low mapping rate is that most of the reads contain adapter sequence, due to poor construction of the RNA-seq library. I am going to trim the adapters and map again. The read quality is good according to FastQC.

I will also try relaxing the --alignEndsType option to see whether the mapping improves.

Rui
Old 04-04-2016, 09:08 AM   #204
SamCurt
Member
 
Location: Iowa

Join Date: May 2010
Posts: 40
Default

Just a quick question here. Is the parameters file used with --parametersFile just a list of command-line options, written the same way I would type them in the console?
Old 04-04-2016, 02:25 PM   #205
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by SamCurt View Post
Just a quick question here. Is the parameters file used with --parametersFile just a list of command-line options, written the same way I would type them in the console?
The parameters file should have each parameter on a separate line:
<parameterName> <parameterValue(s)>
parameterName should not include the leading --
For instance,
genomeChrBinNbits 18
genomeSAsparseD 1
readFilesIn Read1 Read2
readFilesCommand -
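
The format described above can be sketched as a small conversion from console-style options to parameter-file lines. This is only an illustration: `args_to_parameters` is a hypothetical helper, not part of STAR, and it assumes every option starts with `--` and that no value does.

```python
def args_to_parameters(argv):
    """Turn ['--name', 'v1', 'v2', ...] into 'name v1 v2' lines.

    Each '--option' starts a new line with the leading '--' removed;
    the tokens that follow it become that parameter's value(s).
    """
    lines = []
    for token in argv:
        if token.startswith("--"):
            lines.append([token[2:]])
        elif lines:
            lines[-1].append(token)
        else:
            raise ValueError(f"value {token!r} appears before any --option")
    return "\n".join(" ".join(parts) for parts in lines)


console = ["--genomeChrBinNbits", "18", "--readFilesIn", "Read1", "Read2",
           "--readFilesCommand", "-"]
print(args_to_parameters(console))
```

The printed text has one parameter per line, in the same shape as the example lines in the post.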
Old 04-05-2016, 08:42 PM   #206
SamCurt
Member
 
Location: Iowa

Join Date: May 2010
Posts: 40
Default

Thank you for the quick reply, Alex.

I also have another problem here. My new institution only has 2.4.0j on their cluster, and it'd take about a week to get a newer version installed. Do you think it's safe to run the first pass using 2.4.0j, and use its SJ.out.tab files for --sjdbFileChrStartEnd when I get, say, 2.5.1b?
Old 04-06-2016, 06:31 AM   #207
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by SamCurt View Post
Thank you for the quick reply, Alex.

I also have another problem here. My new institution only has 2.4.0j on their cluster, and it'd take about a week to get a newer version installed. Do you think it's safe to run the first pass using 2.4.0j, and use its SJ.out.tab files for --sjdbFileChrStartEnd when I get, say, 2.5.1b?

Hi Sam,

this would generally be safe; however, when you publish your method, the reviewers and readers will have a bone to pick with you.
STAR does not really require installation: you can download a pre-compiled executable and run it instead of the one "installed" on your cluster.
I recommend re-generating the genome indexes for 2.5.1b.

Cheers
Alex
Old 04-06-2016, 06:46 AM   #208
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784
Default

Quote:
Originally Posted by alexdobin View Post
I recommend re-generating the genome indexes for 2.5.1b.

Cheers
Alex
@Alex: Does that mean indexes generated with older versions won't work, or just that you recommend regenerating them?
Old 04-06-2016, 08:30 AM   #209
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by GenoMax View Post
@Alex: Does that mean indexes generated with older versions won't work, or just that you recommend regenerating them?
In rare cases, new versions of STAR may not work with old genome indexes, hence my recommendation to re-generate them with 2.5.1, which is very stable.
Old 05-02-2016, 08:55 AM   #210
SamCurt
Member
 
Location: Iowa

Join Date: May 2010
Posts: 40
Default

So, just for gene expression profiling purposes, should I keep the sjDb file set used for second-pass alignment constant?

Complete story: I have a set of ~40 samples that have already gone through the full two-pass alignment for both gene expression and variation analysis purposes. The sjDb files from the first passes of these samples were used for their second-pass alignments.

Now I have received a further ~15 samples within the same project, for which I'd perform gene expression analysis only. I wonder whether I should do a first pass on these new samples and pool their sjDb's with the old ones for the second pass, or just do a "second pass" with the old sjDb's. My concern is obviously not about time, but rather whether using a different sjDb set would make the gene counts less comparable.
Old 05-04-2016, 07:31 AM   #211
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by SamCurt View Post
So, just for gene expression profiling purposes, should I keep the sjDb file set used for second-pass alignment constant?

Complete story: I have a set of ~40 samples that have already gone through the full two-pass alignment for both gene expression and variation analysis purposes. The sjDb files from the first passes of these samples were used for their second-pass alignments.

Now I have received a further ~15 samples within the same project, for which I'd perform gene expression analysis only. I wonder whether I should do a first pass on these new samples and pool their sjDb's with the old ones for the second pass, or just do a "second pass" with the old sjDb's. My concern is obviously not about time, but rather whether using a different sjDb set would make the gene counts less comparable.
Hi Sam,

To avoid quantification bias, it's better to use the same splice junctions for the 2nd pass mapping. However, this affects only the novel (unannotated) junctions, so if you are quantifying only annotated genes, the bias is likely to be very small.

The ideal solution is to combine the splice junction files (SJ.out.tab) from the 1st pass of all samples (old and new), and then run the 2nd pass on *all* samples.

The 2nd best solution (for differential expression) is to use only the junctions from the old samples for the "2nd" pass mapping of the new samples (you would not need the 1st pass mapping for the new samples, nor another 2nd pass on the old samples). This way you would avoid bias for junctions detected only in the new samples.

Cheers
Alex
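
One way to picture the "combine SJ.out.tab from all samples" step is the sketch below. It assumes the usual SJ.out.tab column layout (chromosome, intron start, intron end, strand coded 0/1/2, ..., uniquely mapped read count in column 7) and writes chr/start/end/strand lines with the strand as `.`/`+`/`-`. The function name `merge_junctions` and the `min_unique` threshold are hypothetical additions for illustration; nothing in the thread prescribes a filter.

```python
def merge_junctions(sj_tables, min_unique=0):
    """Pool junctions from several SJ.out.tab tables into one unique list.

    `sj_tables` is an iterable of iterables of tab-separated lines.
    Junctions are keyed by (chr, start, end, strand); uniquely mapped
    read counts (column 7, assumed) are summed across samples.
    """
    strand_symbol = {"0": ".", "1": "+", "2": "-"}
    totals = {}
    for table in sj_tables:
        for line in table:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip blank or truncated lines
            key = (fields[0], int(fields[1]), int(fields[2]),
                   strand_symbol[fields[3]])
            totals[key] = totals.get(key, 0) + int(fields[6])
    kept = sorted(k for k, n in totals.items() if n >= min_unique)
    return ["\t".join(map(str, k)) for k in kept]


old_sample = ["chr1\t100\t200\t1\t1\t0\t5\t0\t30\n"]
new_sample = ["chr1\t100\t200\t1\t1\t0\t2\t0\t25\n",
              "chr2\t50\t90\t2\t2\t0\t0\t3\t12\n"]
for row in merge_junctions([old_sample, new_sample]):
    print(row)
```

Running every sample's 2nd pass against the same merged list is what keeps the junction set constant across old and new samples.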
Old 09-26-2016, 09:20 AM   #212
mdidish
Junior Member
 
Location: Paris

Join Date: Sep 2016
Posts: 2
Default

Hi everyone
Do you think we can align with STAR on a laptop with an Intel Core Extreme i7-4940MX and 32GB RAM, even overnight? I will have about 130 million reads to align to the human genome.
Thank you
Old 09-27-2016, 02:10 PM   #213
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by mdidish View Post
Hi everyone
Do you think we can align with STAR on a laptop with an Intel Core Extreme i7-4940MX and 32GB RAM, even overnight? I will have about 130 million reads to align to the human genome.
Thank you
Hi,

depending on the read length, the speed should be 20-50M reads per hour per core, so it should be doable. 32GB is just enough for the human genome.

Cheers
Alex
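
The estimate above works out as follows. This is plain arithmetic on the quoted figures; the `cores` parameter and the assumption that throughput scales linearly with cores are illustrative and not something the post states.

```python
def alignment_hours(n_reads, reads_per_hour_per_core, cores=1):
    """Hours needed, assuming throughput scales linearly with `cores`."""
    return n_reads / (reads_per_hour_per_core * cores)


n_reads = 130e6                         # reads mentioned in the question
slow = alignment_hours(n_reads, 20e6)   # lower end of the quoted speed
fast = alignment_hours(n_reads, 50e6)   # upper end of the quoted speed
print(f"single core: {fast:.1f} to {slow:.1f} hours")
```

Even on a single core the run would fit in an overnight window under these figures, which is why the answer calls it doable.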
Old 09-28-2016, 02:52 AM   #214
mdidish
Junior Member
 
Location: Paris

Join Date: Sep 2016
Posts: 2
Default

Hi,
Thank you for your response. In the end, I should have a laptop with an Intel Core Extreme i7-4940MX and 64GB RAM.
The duration is not important, I just wanted to make sure I can start the analysis.
Marc
Old 12-15-2016, 11:35 PM   #215
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default dear alex, or other star experts

Which parameters should I set to get ALL non-canonical (i.e. back-spliced) reads into the unmapped SAM file? I want to call circular RNAs with these reads.

I have 50 bp paired-end unstranded RNA-seq reads, and a genome index with a splice junction database built from the same data (2-pass over all samples). I hope back-spliced junctions are NOT present in this (joined) splice database; or should I filter these databases to remove the back-splice junctions?


Is

--outFilterIntronMotifs RemoveNoncanonicalUnannotated

the correct setting, i.e. will all spliced reads not present in the splice junction database end up in the unmapped SAM file?

Best wishes and thanks in advance,

dietmar
Old 12-21-2016, 11:04 AM   #216
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by dietmar13 View Post
Which parameters should I set to get ALL non-canonical (i.e. back-spliced) reads into the unmapped SAM file? I want to call circular RNAs with these reads.

I have 50 bp paired-end unstranded RNA-seq reads, and a genome index with a splice junction database built from the same data (2-pass over all samples). I hope back-spliced junctions are NOT present in this (joined) splice database; or should I filter these databases to remove the back-splice junctions?


Is

--outFilterIntronMotifs RemoveNoncanonicalUnannotated

the correct setting, i.e. will all spliced reads not present in the splice junction database end up in the unmapped SAM file?

Best wishes and thanks in advance,

dietmar
Hi Dietmar,

the non-canonical junctions have non-canonical motifs, but they are still "linear" in the genome, i.e. the acceptor site follows the donor site. The circular junctions are classified as "chimeric", so you need to enable chimeric detection, e.g. --chimSegmentMin 15 --chimJunctionOverhangMin 15. You can extract the circular junctions from the Chimeric.out.junction file (see this post); an example script is in the STAR source distribution: extras/scripts/filterCirc.awk. The chimeric alignments are also written to the SAM/BAM files.

Cheers
Alex
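
To make the "linear" versus circular distinction concrete, here is a minimal sketch of a back-splice filter over Chimeric.out.junction lines. It is not a port of filterCirc.awk. It assumes the first seven tab-separated columns are donor chromosome, donor position, donor strand, acceptor chromosome, acceptor position, acceptor strand and junction type (negative when the junction is not covered by a read), and the `max_span` cutoff and function names are hypothetical.

```python
def is_back_splice(fields, max_span=1_000_000):
    """True if donor and acceptor are on the same chromosome and strand
    and the donor lies downstream of the acceptor in transcript
    orientation, i.e. the reverse of a linear junction."""
    chr_donor, pos_donor, strand_donor = fields[0], int(fields[1]), fields[2]
    chr_acc, pos_acc, strand_acc = fields[3], int(fields[4]), fields[5]
    junction_type = int(fields[6])
    if chr_donor != chr_acc or strand_donor != strand_acc:
        return False
    if junction_type < 0:  # assumed: junction not spanned by a read
        return False
    if strand_donor == "+":
        span = pos_donor - pos_acc
    else:
        span = pos_acc - pos_donor
    return 0 < span < max_span


def back_splice_lines(lines, max_span=1_000_000):
    """Yield the input lines that look like back-splice junctions."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header, comment or truncated lines.
        if len(fields) < 7 or not fields[1].isdigit() or not fields[4].isdigit():
            continue
        if is_back_splice(fields, max_span):
            yield line


toy = ["chr1\t2000\t+\tchr1\t1000\t+\t1\t0\t0\tread1",   # donor after acceptor
       "chr1\t1000\t+\tchr1\t2000\t+\t1\t0\t0\tread2"]   # linear order
print(list(back_splice_lines(toy)))
```

On the two toy lines only the first is kept, because its donor coordinate lies after its acceptor coordinate on the + strand.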
Old 06-04-2017, 08:38 PM   #217
anorman07
Junior Member
 
Location: San Francisco, CA

Join Date: Jun 2017
Posts: 1
Default

Hi Alex

I've been using STAR for several weeks now and I love it! Thanks for creating such a great tool.

I'd like to use STAR to align macaque reads to the human genome. I think this might work best if I relax the alignment stringency. Do you have any recommendations for how I should do this?
Old 06-28-2017, 10:56 AM   #218
SamCurt
Member
 
Location: Iowa

Join Date: May 2010
Posts: 40
Default

Hi Alex,
I understand that with --quantMode TranscriptomeSAM --quantTranscriptomeBan Singleend I can generate a transcript-coordinate BAM file with indels and soft-clips. But do you consider it acceptable for variant calling (e.g. for allele-specific expression purposes)?
Old 04-02-2018, 05:59 PM   #219
drdna
Member
 
Location: Kentucky

Join Date: May 2012
Posts: 71
Default duplicated reference genomes

I don't understand why the genomeGenerate mode is creating a duplicated (concatenated) reference. This is resulting in at least two identical alignments for every read:

Command issued:
Code:
STAR --runMode genomeGenerate --genomeDir NPB_Pi9 --genomeFastaFiles NPB_Pi9.fasta --runThreadN 2 --genomeSAindexNbases 14
resulting SAM header and first two alignments:
Code:
@HD	VN:1.4
@SQ	SN:chr01	LN:43270923
@SQ	SN:chr02	LN:35937250
@SQ	SN:chr03	LN:36413819
@SQ	SN:chr04	LN:35502694
@SQ	SN:chr05	LN:29958434
@SQ	SN:chr06	LN:31248787
@SQ	SN:chr07	LN:29697621
@SQ	SN:chr08	LN:28443022
@SQ	SN:chr09	LN:23012720
@SQ	SN:chr10	LN:23207287
@SQ	SN:chr11	LN:29021106
@SQ	SN:chr12	LN:27531856
@SQ	SN:AC155918	LN:32941
@SQ	SN:AC156495	LN:88500
@SQ	SN:AC160949	LN:128256
@SQ	SN:AP008246	LN:206004
@SQ	SN:AP008247	LN:157458
@SQ	SN:AC174930	LN:15426
@SQ	SN:Syng_TIGR_002	LN:14476
@SQ	SN:Syng_TIGR_004	LN:19457
@SQ	SN:Syng_TIGR_005	LN:21787
@SQ	SN:Syng_TIGR_007	LN:7820
@SQ	SN:Syng_TIGR_008	LN:16676
@SQ	SN:Syng_TIGR_009	LN:10296
@SQ	SN:Syng_TIGR_010	LN:15493
@SQ	SN:Syng_TIGR_011	LN:10901
@SQ	SN:Syng_TIGR_012	LN:16417
@SQ	SN:Syng_TIGR_013	LN:10512
@SQ	SN:Syng_TIGR_014	LN:21421
@SQ	SN:Syng_TIGR_015	LN:10595
@SQ	SN:Syng_TIGR_016	LN:12792
@SQ	SN:Syng_TIGR_019	LN:10422
@SQ	SN:Syng_TIGR_020	LN:10699
@SQ	SN:Syng_TIGR_021	LN:17477
@SQ	SN:Syng_TIGR_022	LN:9889
@SQ	SN:Syng_TIGR_023	LN:24772
@SQ	SN:Syng_TIGR_024	LN:10060
@SQ	SN:Syng_TIGR_026	LN:19971
@SQ	SN:Syng_TIGR_027	LN:11522
@SQ	SN:Syng_TIGR_028	LN:31094
@SQ	SN:Syng_TIGR_029	LN:12884
@SQ	SN:Syng_TIGR_030	LN:10794
@SQ	SN:Syng_TIGR_031	LN:9548
@SQ	SN:Syng_TIGR_032	LN:9603
@SQ	SN:Syng_TIGR_033	LN:11093
@SQ	SN:Syng_TIGR_034	LN:10311
@SQ	SN:Syng_TIGR_035	LN:10686
@SQ	SN:Syng_TIGR_036	LN:10434
@SQ	SN:Syng_TIGR_037	LN:13061
@SQ	SN:Syng_TIGR_038	LN:8197
@SQ	SN:Syng_TIGR_039	LN:6269
@SQ	SN:Syng_TIGR_041	LN:10210
@SQ	SN:Syng_TIGR_042	LN:5510
@SQ	SN:Syng_TIGR_043	LN:4236
@SQ	SN:Syng_TIGR_044	LN:6000
@SQ	SN:Syng_TIGR_045	LN:22545
@SQ	SN:Syng_TIGR_046	LN:11447
@SQ	SN:Syng_TIGR_047	LN:20829
@SQ	SN:Syng_TIGR_048	LN:7140
@SQ	SN:Syng_TIGR_049	LN:6261
@SQ	SN:Syng_TIGR_050	LN:8529
@SQ	SN:Pi9_cDNA	LN:4650
@SQ	SN:chr01	LN:43270923
@SQ	SN:chr02	LN:35937250
@SQ	SN:chr03	LN:36413819
@SQ	SN:chr04	LN:35502694
@SQ	SN:chr05	LN:29958434
@SQ	SN:chr06	LN:31248787
@SQ	SN:chr07	LN:29697621
@SQ	SN:chr08	LN:28443022
@SQ	SN:chr09	LN:23012720
@SQ	SN:chr10	LN:23207287
@SQ	SN:chr11	LN:29021106
@SQ	SN:chr12	LN:27531856
@SQ	SN:AC155918	LN:32941
@SQ	SN:AC156495	LN:88500
@SQ	SN:AC160949	LN:128256
@SQ	SN:AP008246	LN:206004
@SQ	SN:AP008247	LN:157458
@SQ	SN:AC174930	LN:15426
@SQ	SN:Syng_TIGR_002	LN:14476
@SQ	SN:Syng_TIGR_004	LN:19457
@SQ	SN:Syng_TIGR_005	LN:21787
@SQ	SN:Syng_TIGR_007	LN:7820
@SQ	SN:Syng_TIGR_008	LN:16676
@SQ	SN:Syng_TIGR_009	LN:10296
@SQ	SN:Syng_TIGR_010	LN:15493
@SQ	SN:Syng_TIGR_011	LN:10901
@SQ	SN:Syng_TIGR_012	LN:16417
@SQ	SN:Syng_TIGR_013	LN:10512
@SQ	SN:Syng_TIGR_014	LN:21421
@SQ	SN:Syng_TIGR_015	LN:10595
@SQ	SN:Syng_TIGR_016	LN:12792
@SQ	SN:Syng_TIGR_019	LN:10422
@SQ	SN:Syng_TIGR_020	LN:10699
@SQ	SN:Syng_TIGR_021	LN:17477
@SQ	SN:Syng_TIGR_022	LN:9889
@SQ	SN:Syng_TIGR_023	LN:24772
@SQ	SN:Syng_TIGR_024	LN:10060
@SQ	SN:Syng_TIGR_026	LN:19971
@SQ	SN:Syng_TIGR_027	LN:11522
@SQ	SN:Syng_TIGR_028	LN:31094
@SQ	SN:Syng_TIGR_029	LN:12884
@SQ	SN:Syng_TIGR_030	LN:10794
@SQ	SN:Syng_TIGR_031	LN:9548
@SQ	SN:Syng_TIGR_032	LN:9603
@SQ	SN:Syng_TIGR_033	LN:11093
@SQ	SN:Syng_TIGR_034	LN:10311
@SQ	SN:Syng_TIGR_035	LN:10686
@SQ	SN:Syng_TIGR_036	LN:10434
@SQ	SN:Syng_TIGR_037	LN:13061
@SQ	SN:Syng_TIGR_038	LN:8197
@SQ	SN:Syng_TIGR_039	LN:6269
@SQ	SN:Syng_TIGR_041	LN:10210
@SQ	SN:Syng_TIGR_042	LN:5510
@SQ	SN:Syng_TIGR_043	LN:4236
@SQ	SN:Syng_TIGR_044	LN:6000
@SQ	SN:Syng_TIGR_045	LN:22545
@SQ	SN:Syng_TIGR_046	LN:11447
@SQ	SN:Syng_TIGR_047	LN:20829
@SQ	SN:Syng_TIGR_048	LN:7140
@SQ	SN:Syng_TIGR_049	LN:6261
@SQ	SN:Syng_TIGR_050	LN:8529
@SQ	SN:Pi9_cDNA	LN:4650
@PG	ID:STAR	PN:STAR	VN:STAR_2.5.4b	CL:STAR   --runThreadN 16   --genomeDir NPB_Pi9   --genomeFastaFiles NPB_Pi9.fasta      --genomeSAindexNbases 1   --readFilesIn STARFILES/MF046_S4_L002_R1_001.fastq.gz      --readFilesCommand gunzip   -c      --outFileNamePrefix STARFILES/MF046.NPB_Pi9   --outFilterMatchNmin 40
@CO	user command line: STAR --runThreadN 16 --genomeDir NPB_Pi9 --genomeFastaFiles NPB_Pi9.fasta --genomeSAindexNbases 1 --readFilesCommand gunzip -c --readFilesIn STARFILES/MF046_S4_L002_R1_001.fastq.gz --outFileNamePrefix STARFILES/MF046.NPB_Pi9 --outFilterMatchNmin 40
K00282:141:HJTJWBBXX:2:1101:2656:1068	16	chr01	12873883	3	50M1S	*	0	0	CTTGAGNCGANCACACTATAGCCATGTACATTAGTATAGGTTTACACTAGN	JJJJJJ#JJJ#J<FJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFAA#	NH:i:2	HI:i:1	AS:i:47	nM:i:0
K00282:141:HJTJWBBXX:2:1101:2656:1068	272	chr01	12873883	3	50M1S	*	0	0	CTTGAGNCGANCACACTATAGCCATGTACATTAGTATAGGTTTACACTAGN	JJJJJJ#JJJ#J<FJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFAA#	NH:i:2	HI:i:2	AS:i:47	nM:i:0

Last edited by GenoMax; 04-03-2018 at 03:35 AM.
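
A quick way to confirm which reference names are repeated in a header like the one above is to count the `SN:` values on the `@SQ` lines. This is only a detection sketch; `duplicated_sq_names` is a hypothetical helper, and it says nothing about why the duplication happened.

```python
from collections import Counter


def duplicated_sq_names(header_lines):
    """Return the sorted reference names that occur on more than one @SQ line."""
    names = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return sorted(name for name, n in Counter(names).items() if n > 1)


header = ["@HD\tVN:1.4",
          "@SQ\tSN:chr01\tLN:43270923",
          "@SQ\tSN:Pi9_cDNA\tLN:4650",
          "@SQ\tSN:chr01\tLN:43270923"]
print(duplicated_sq_names(header))  # ['chr01']
```

Fed the header lines of the SAM file in question, an empty result would mean every sequence is listed once, while a full list of names would match the duplication shown in the post.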