
**alexdobin** · 03-27-2013, 02:01 PM

Originally posted by Nino View Post

Hey, does anyone know whether STAR needs the reference genome indexed in its own format? I know that for tophat2 the reference genome needs to be indexed as *.bt2 (bowtie2).

Thanks,
Nino

You will need to generate special genome files for STAR.
This is done with the following command:
STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
If you want to use annotations for improved mapping accuracy, you also need to use:
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.
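
As a rough sketch of how the two pieces above fit together, the overhang can be derived from the read length and appended to the genome generation command. The read length of 100 and the thread count of 8 are assumed values for illustration, and the paths are the same placeholders as above. The command is only printed here, not run.

```bash
# Sketch: derive --sjdbOverhang from the mate length and print the
# genomeGenerate command. All paths and numeric values are placeholders.
read_length=100                   # assumed length of one mate
overhang=$((read_length - 1))     # N = ReadMateLength - 1

genome_cmd="STAR --runMode genomeGenerate \
--genomeDir /path/to/GenomeDir \
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 \
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang $overhang \
--runThreadN 8"

# Print instead of running, so the command can be inspected first.
echo "$genome_cmd"
```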

**Auction** · 03-29-2013, 05:37 PM

Originally posted by alexdobin View Post

You will need to generate special genome files for STAR.
This is done with the following command:
STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
If you want to use annotations for improved mapping accuracy, you also need to use:
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.

Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

Thanks.

**alexdobin** · 03-30-2013, 07:18 AM

Originally posted by Auction View Post

Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

Thanks.

To switch on chimeric detection and output, you need to specify a non-zero --chimSegmentMin, which is the minimum length of a segment (piece) from which chimeras are made. For example, if you have 2x100 PE reads and specify --chimSegmentMin 20, you could have a chimera in which one segment of 180 bases (100 from mate1 + 80 from mate2) maps non-chimerically to one chromosome, and another segment of 20 bases from mate2 maps to another chromosome.
The Chimeric output will go into Chimeric.out.sam and Chimeric.out.junction files.

Note that the same read can have both acceptable non-chimeric alignments (output to Aligned.out.sam) and chimeric alignments (output to Chimeric.out.*). A read is considered "unmapped" if it does not have an acceptable non-chimeric alignment. --outSAMunmapped Within will output "unmapped" reads into Aligned.out.sam without alignment coordinates (which allows the fastq file to be fully reconstructed from the SAM file), while --outReadsUnmapped Fastx will output them into fastq or fasta files.

There are other parameters that control chimeric detection:
chimJunctionOverhangMin 20
int>0: minimum overhang for a chimeric junction
chimScoreMin 0
int>0: minimum total (summed) score of the chimeric segments
chimScoreDropMax 20
int>0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
chimScoreSeparation 10
int>0: minimum difference (separation) between the best chimeric score and the next one
chimScoreJunctionNonGTAG -1
int: penalty for a non-GT/AG chimeric junction
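
A minimal sketch of a mapping command with chimeric detection switched on, using only the options discussed in this post. The read file names, genome path, thread count, and the value 20 for --chimSegmentMin are assumptions for illustration. The command is printed rather than executed.

```bash
# Sketch: assemble a mapping command with chimeric detection enabled.
chim_segment_min=20     # any non-zero value switches chimeric detection on
chim_overhang_min=20    # same as the default listed above

map_cmd="STAR --genomeDir /path/to/GenomeDir \
--readFilesIn mate1.fastq mate2.fastq \
--runThreadN 8 \
--chimSegmentMin $chim_segment_min \
--chimJunctionOverhangMin $chim_overhang_min \
--outSAMunmapped Within"

echo "$map_cmd"
# Output files described in this post:
#   Aligned.out.sam        non-chimeric alignments, plus unmapped reads
#                          (because of --outSAMunmapped Within)
#   Chimeric.out.sam       chimeric alignments
#   Chimeric.out.junction  chimeric junctions
```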

**[email protected]** · 04-04-2013, 06:24 AM

I am pretty new to RNA-seq analysis and I am now using STAR instead of Tophat and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss though is the .GTF file I get from Tophat that contains new genes predicted based on the reads and splice junctions.
Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?

**alexdobin** · 04-04-2013, 06:35 PM

Originally posted by [email protected] View Post

I am pretty new to RNA-seq analysis and I am now using STAR instead of Tophat and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss though is the .GTF file I get from Tophat that contains new genes predicted based on the reads and splice junctions.
Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?

As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.

You can run Cufflinks on STAR alignments.
If you have un-stranded RNA-seq data you will need to run STAR with --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.

If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the --library-type option. For example,
cufflinks ... ... --library-type fr-firststrand
should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
--outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated
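
The choice between the two cases above could be sketched as a small option selector. The variable `library` and its two values are hypothetical names introduced for this example; the STAR and Cufflinks options themselves are the ones named in this post.

```bash
# Sketch: pick STAR and Cufflinks options from the library type.
# "library" is a hypothetical variable: "unstranded" or "stranded" (dUTP).
library="unstranded"

# Non-canonical junctions are filtered in both cases.
star_opts="--outFilterIntronMotifs RemoveNoncanonical"
cufflinks_opts=""

case "$library" in
  unstranded)
    # Unstranded data: STAR must add the XS strand attribute.
    star_opts="$star_opts --outSAMstrandField intronMotif"
    ;;
  stranded)
    # Stranded data: no extra STAR option, tell Cufflinks the protocol.
    cufflinks_opts="--library-type fr-firststrand"
    ;;
  *)
    echo "unknown library type: $library" >&2
    exit 1
    ;;
esac

echo "STAR options:      $star_opts"
echo "Cufflinks options: $cufflinks_opts"
```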

**[email protected]** · 04-04-2013, 11:46 PM

As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.

You are right, sorry I mixed it up a bit. Thanks for the information on the options I should use.

**bruce01** · 05-07-2013, 04:42 AM

Hi all, sorry for the basic question:

I am writing a bash script to submit star jobs, remove duplicates, get counts etc. The dataset I have has multiple fastq per sample, but different numbers for each. I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? Ie:

STAR [options] readFilesIn $files/file_read1 $files/file_read2

?

Have tried a few ways to do this but can't figure it out or get STAR to accept input. I am a 'midrange' bioinformatics PhD, so don't hold back on most efficient or crazy way of doing this!

Thanks in advance,

Bruce.

**dpryan** · 05-07-2013, 08:40 AM

Originally posted by bruce01 View Post

I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? Ie:

STAR [options] readFilesIn $files/file_read1 $files/file_read2

Have you just tried the following?

Code:

STAR --readFilesIn Sample1_r1_1.fq,Sample1_r1_2.fq,Sample1_r1_3.fq... Sample1_r2_1.fq,Sample1_r2_2.fq,Sample1_r2_3.fq...

You could also just concatenate the files together as appropriate and use the result.

**bruce01** · 05-08-2013, 03:25 AM

Dpryan, yes have tried using wildcards as input to test it works, I get a segmentation fault. When I run it with all filenames included as standard it runs fine. I have a lot of samples, with variable numbers of fastq files per sample, and want a single script to submit to a queue. So inputting all fastq by hand is not an option, hence my original question.

Concatenating the fastqs would mean I have to uncompress them, which uses computing time, and I am keen to go straight from the .gz files that my facility has supplied. This can't be too big of a problem, can it?

**dpryan** · 05-08-2013, 03:34 AM

My example didn't use wildcards, so I'm not sure where that idea came from.

You can just concatenate the gzipped files together without uncompressing them first.

The other normal approach would be to simply have your script generate the comma-separated list that's then fed to STAR. You should be able to do that easily enough in bash, which whatever you're using for job scheduling can probably already handle.
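
The point about concatenating gzipped files can be checked on toy data. The sketch below writes two tiny gzipped fastq files into a temporary directory (the file names and reads are made up), joins them with cat, and counts the lines after decompression. The original files are not modified.

```bash
# Sketch: concatenated gzip files decompress to the concatenation
# of their contents, so no uncompress step is needed before joining.
tmpdir=$(mktemp -d)

printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$tmpdir/part1.fq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/part2.fq.gz"

# Join the compressed files directly; the inputs are left untouched.
cat "$tmpdir/part1.fq.gz" "$tmpdir/part2.fq.gz" > "$tmpdir/merged.fq.gz"

# Two records of four lines each should come back out.
merged_lines=$(gzip -dc "$tmpdir/merged.fq.gz" | wc -l | tr -d ' ')
echo "lines in merged file: $merged_lines"

rm -rf "$tmpdir"
```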

**bruce01** · 05-08-2013, 03:38 AM

Ok, asked over on Stackoverflow, this works:

group1=( $files/Sample1*r1* );
group2=( $files/Sample1*r2* );
( IFS=,; STAR --readFilesIn "${group1[*]}" "${group2[*]}" [OPTIONS]);

Thanks for the help and ideas Dpryan.

Edit: Dpryan, sorry, I was getting my wires crossed between here and Stackoverflow. I was asking how to give STAR the input that I had created, and the above works. I am reluctant to concatenate the gzip files: I don't want to create duplicates and don't want to change the gzips in any way before aligning. Paranoia!
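
For anyone reading later, the comma-joining trick above can also be written without arrays. This is only a sketch: `join_by_comma` is a helper name made up here, and the empty files in a temporary directory stand in for real fastq files. The STAR command is printed, not run.

```bash
# Sketch: join file names with commas, as --readFilesIn expects.
join_by_comma() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change from leaking out.
  ( IFS=,; printf '%s\n' "$*" )
}

# Empty placeholder files; a real run would point at the fastq directory.
files=$(mktemp -d)
touch "$files/Sample1_r1_1.fq.gz" "$files/Sample1_r1_2.fq.gz" \
      "$files/Sample1_r2_1.fq.gz" "$files/Sample1_r2_2.fq.gz"

# The shell expands and sorts the globs before the function sees them.
reads1=$(join_by_comma "$files"/Sample1*r1*)
reads2=$(join_by_comma "$files"/Sample1*r2*)

echo "STAR --readFilesIn $reads1 $reads2"
rm -rf "$files"
```

Because the globs are expanded per sample, the same lines work whatever number of fastq files a sample has.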

**Auction** · 05-09-2013, 07:25 AM

You can also try the following commands; they work for me.
fq1=`ls -m *_R1_*.fastq.gz | tr -d '\n' | tr -d ' '`
fq2=${fq1//"_R1_"/"_R2_"}
STAR --readFilesIn $fq1 $fq2
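
A slightly more defensive variant of the same idea, as a sketch: it builds both lists in a loop and checks that each R1 file really has an R2 mate before adding the pair. The file names below are placeholders created in a temporary directory. The --readFilesCommand zcat option is not mentioned in this thread; it is STAR's option for reading compressed input and is included here on the assumption that the files are gzipped.

```bash
# Sketch: build the R1 and R2 lists together and verify every mate exists.
workdir=$(mktemp -d)
touch "$workdir/S1_R1_001.fastq.gz" "$workdir/S1_R2_001.fastq.gz" \
      "$workdir/S1_R1_002.fastq.gz" "$workdir/S1_R2_002.fastq.gz"

fq1=""
fq2=""
missing=0
for r1 in "$workdir"/*_R1_*.fastq.gz; do
  dir=${r1%/*}
  base=${r1##*/}
  # Derive the mate name from the file name only, not the directory.
  r2="$dir/$(printf '%s\n' "$base" | sed 's/_R1_/_R2_/')"
  if [ ! -e "$r2" ]; then
    echo "missing mate for $r1" >&2
    missing=$((missing + 1))
    continue
  fi
  # Append with a comma only when the list is already non-empty.
  fq1="${fq1:+$fq1,}$r1"
  fq2="${fq2:+$fq2,}$r2"
done

echo "STAR --readFilesCommand zcat --readFilesIn $fq1 $fq2"
rm -rf "$workdir"
```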

**priya** · 05-29-2013, 05:55 AM

Originally posted by alexdobin View Post

If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the --library-type option. For example,
cufflinks ... ... --library-type fr-firststrand
should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
--outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated

Hi Alex,
I am trying to use STAR to align the reads and then Cufflinks to get expression values. I have stranded RNA-seq data. May I know why it is recommended to remove the non-canonical junctions for Cufflinks runs? How will it affect Cufflinks if I use the default parameter (no filtering)?

**NGSfan** · 05-29-2013, 06:04 AM

hi priya, you may want to post this and carry on the conversation at the google groups for rna-star:

https://groups.google.com/forum/#!forum/rna-star

**alexdobin** · 05-29-2013, 10:47 AM

I believe it's best to feed Cufflinks only with the highest confidence alignments, and non-canonical junctions in my experience contain more false positives.
Also, many non-canonical splices occur just a few bases away from highly expressed canonical junctions, which could be caused by sequencing/mapping errors, and possibly by spliceosome errors. These splices will likely throw the Cufflinks assembly off.
