#21
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
Please use the following parameters: --genomeChrBinNbits 6 --genomeSAindexNbases 4
With these I was able to run the genome generation step successfully for your small reference. It looks like you want to use STAR in a quite non-standard way. If you explain it a bit more, I could suggest parameters for the mapping step.
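A minimal sketch for sizing these two values on another small reference. It assumes the rule of thumb given in the STAR manual: --genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1) and --genomeChrBinNbits = min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))), rounded to the nearest integer. The function names and the example numbers are hypothetical, so check the manual for your STAR version.

```shell
# Sketch only: rule-of-thumb index parameters for small references.
# The formulas are assumed from the STAR manual; results are rounded to nearest.

# $1 = total genome length in bases
sa_index_nbases() {
  awk -v len="$1" 'BEGIN {
    n = int(log(len) / log(2) / 2 - 1 + 0.5)
    if (n > 14) n = 14
    if (n < 1) n = 1          # guard for tiny references (my assumption)
    print n
  }'
}

# $1 = total genome length, $2 = number of reference sequences, $3 = read length
chr_bin_nbits() {
  awk -v len="$1" -v nref="$2" -v rl="$3" 'BEGIN {
    x = len / nref
    if (rl > x) x = rl        # use the larger of mean reference length and read length
    n = int(log(x) / log(2) + 0.5)
    if (n > 18) n = 18
    print n
  }'
}

# Hypothetical example: 40 sequences of 25 bases each, 50-base reads
sa_index_nbases 1000
chr_bin_nbits 1000 40 50
```

For this hypothetical 1000-base reference the two calls print 4 and 6, which agrees with the values suggested above.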

#22
Member
Location: Ohio Join Date: Jul 2012
Posts: 68
So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length. Thank you for your time. I look forward to seeing how your software works in my studies.
Regards, -Tom Blomquist

#23
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
--outFilterMismatchNmax N (N = number of mismatches you wish to tolerate)
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin L (L = length of the shortest reference sequence)
Please let me know whether it works for you.
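A minimal sketch of how L could be obtained and the options assembled, assuming the reference is a plain FASTA file. The helper names and the example file names are hypothetical, and the second function only prints the command so it can be inspected before running.

```shell
# Print the length of the shortest sequence in a FASTA file ($1).
# Handles sequences wrapped over several lines.
shortest_seq_len() {
  awk '
    function record_len() {
      if (seen && (!have || len < min)) { min = len; have = 1 }
    }
    /^>/ { record_len(); seen = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END { record_len(); print min + 0 }
  ' "$1"
}

# Print (do not run) a mapping command using the options listed above.
# $1 = genome dir, $2 = reads file, $3 = N (mismatches), $4 = L (min matched bases)
short_ref_map_cmd() {
  printf 'STAR --genomeDir %s --readFilesIn %s' "$1" "$2"
  printf ' --outFilterMismatchNmax %s' "$3"
  printf ' --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0'
  printf ' --outFilterMatchNmin %s\n' "$4"
}

# Hypothetical usage:
#   L=$(shortest_seq_len targets.fa)
#   short_ref_map_cmd /path/to/GenomeDir reads.fq 1 "$L"
```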

#24
Member
Location: Ohio Join Date: Jul 2012
Posts: 68
Thanks again for your help. I will try the modifications later tonight.
-Tom

#25
Member
Location: Ohio Join Date: Jul 2012
Posts: 68
Had the chance to test your recommended parameters for my specific needs. This is not my normal mode of operation, but... HOLY POOP!!! I cut a 2 hour BFAST match and sorting into bins down to 10 seconds!!! And the specificity and sensitivity are off the hook! When I do some more formal testing I will report back with metrics.
You are a computer jedi master! This will make analysis of clinical sequencing data a breeze. Very intuitive command lines.
-Tom Blomquist

#26
Member
Location: /home/bob Join Date: Jun 2012
Posts: 59
Also, just to clarify, you don't know how STAR would perform with bacterial reads?
Last edited by bob-loblaw; 02-22-2013 at 06:54 AM.

#27
Member
Location: Ohio Join Date: Jul 2012
Posts: 68
For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?
Also, I want to tailor the output to report the single best match and, if two matches are equivalent, to report none.
Regards, -Tom

#28
Member
Location: New York City Join Date: Mar 2013
Posts: 27
Hey, does anyone know whether the reference genome needs to be indexed specifically for STAR? I know that for tophat2 the reference genome needs a bowtie2 index (*.bt2).
Thanks, Nino

#29
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
STAR will work with a database of any size; however, the required RAM scales linearly with database size, at ~10*DBsize bytes. STAR works fine with bacterial reads: we have used it on long and small RNA-seq data for some bacteria.
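A rough sketch of that estimate. It assumes DBsize can be approximated by the size of the reference FASTA file in bytes (headers and newlines included, so it slightly overestimates); the function name is hypothetical.

```shell
# $1 = path to the reference FASTA; prints the estimated RAM in bytes (~10 x size)
estimate_star_ram_bytes() (
  db_bytes=$(wc -c < "$1" | tr -d ' ')
  echo $((db_bytes * 10))
)

# Hypothetical usage:
#   estimate_star_ram_bytes /path/to/genome.fa
```

By the same rule, a reference of about 3 billion bases would need roughly 30 GB of RAM.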

#30
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
With default parameters, if two alignments differ by one mismatch, only the best one will be reported. This is controlled by --outFilterMultimapScoreRange (=1 by default), which defines the range of alignment scores that are reported as multi-mappers. The maximum number of alignments allowed for output is controlled by --outFilterMultimapNmax. It is 10 by default, so any read with 10 or fewer alignments with scores >= BestScore-1 will be reported.
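To make the rule concrete, here is a small sketch that mimics the filter described above on a list of alignment scores. It is not STAR's code: the function name and the example scores are hypothetical, and it assumes that a read with more qualifying alignments than --outFilterMultimapNmax is not reported at all.

```shell
# $1 = score range, $2 = max multimappers, remaining args = alignment scores
# Prints how many alignments would be reported for the read (0 = none).
reported_alignments() (
  range=$1; nmax=$2
  shift 2
  best=$1; count=0
  for s in "$@"; do
    if [ "$s" -gt "$best" ]; then best=$s; fi
  done
  for s in "$@"; do
    # an alignment qualifies if its score is within `range` of the best score
    if [ "$s" -ge $((best - range)) ]; then count=$((count + 1)); fi
  done
  if [ "$count" -gt "$nmax" ]; then echo 0; else echo "$count"; fi
)

reported_alignments 1 10 50 48   # clear winner: one alignment reported
reported_alignments 1 1 50 50    # tie with a limit of 1: nothing reported
```

Under these assumptions, keeping the default score range and lowering --outFilterMultimapNmax to 1 should give the "single best match, or nothing on a tie" behaviour asked about above.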

#31
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
This is done with the following command:
STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
If you want to use annotations for improved mapping accuracy, you also need to use:
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>
where ideally N=ReadMateLength-1, or you could generically use ~100.
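A minimal sketch of a wrapper that assembles this command, assuming you know the mate length and want --sjdbOverhang set to ReadMateLength-1. The function name and the example arguments are hypothetical; it prints the command instead of running it, and it does not handle paths containing spaces.

```shell
# $1 = genome dir, $2 = threads, $3 = read mate length,
# $4 = GTF file (empty string to skip annotations), remaining args = FASTA files
genome_generate_cmd() (
  genome_dir=$1; threads=$2; mate_len=$3; gtf=$4
  shift 4
  cmd="STAR --runMode genomeGenerate --genomeDir $genome_dir --genomeFastaFiles $* --runThreadN $threads"
  if [ -n "$gtf" ]; then
    # overhang is one less than the mate length, as recommended above
    cmd="$cmd --sjdbGTFfile $gtf --sjdbOverhang $((mate_len - 1))"
  fi
  printf '%s\n' "$cmd"
)

# Hypothetical usage:
#   genome_generate_cmd /path/to/GenomeDir 8 101 /path/to/Annot.gtf chr1.fa chr2.fa
```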

#32
Member
Location: california Join Date: Jul 2009
Posts: 24
Thanks.

#33
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
The chimeric output will go into the Chimeric.out.sam and Chimeric.out.junction files. Note that the same read can have both acceptable non-chimeric alignments (output to Aligned.out.sam) and chimeric alignments (output to Chimeric.out.*). A read is considered "unmapped" if it does not have an acceptable non-chimeric alignment. --outSAMunmapped Within will output "unmapped" reads into Aligned.out.sam without alignment coordinates (which allows the fastq file to be fully reconstructed from the SAM file), while --outReadsUnmapped Fastx will output them into fastq or fasta files.
There are other parameters that control chimeric detection:
chimJunctionOverhangMin 20 int>0: minimum overhang for a chimeric junction
chimScoreMin 0 int>0: minimum total (summed) score of the chimeric segments
chimScoreDropMax 20 int>0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
chimScoreSeparation 10 int>0: minimum difference (separation) between the best chimeric score and the next one
chimScoreJunctionNonGTAG -1 int: penalty for a non-GT/AG chimeric junction

#34
Member
Location: England Join Date: Mar 2013
Posts: 13
I am pretty new to RNA-seq analysis and I am now using STAR instead of Tophat, and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss, though, is the .GTF file I get from Tophat that contains new genes predicted based on the reads and splice junctions.
Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?

#35
Senior Member
Location: NY Join Date: Feb 2009
Posts: 161
You can run Cufflinks on STAR alignments.
If you have un-stranded RNA-seq data, you will need to run STAR with the --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.
If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the --library-type option. For example, cufflinks ... ... --library-type fr-firststrand should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options: --outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated
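A small sketch that collects the options above by library type, for use in a pipeline script. The labels "unstranded" and "dutp" and the function names are hypothetical, and RemoveNoncanonical is picked arbitrarily from the two alternatives mentioned.

```shell
# $1 = "unstranded" or "dutp" (hypothetical labels for the two cases above)
# Prints the extra STAR options to add at the mapping step.
star_opts_for_cufflinks() {
  case $1 in
    unstranded) printf '%s\n' "--outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical" ;;
    dutp)       printf '%s\n' "--outFilterIntronMotifs RemoveNoncanonical" ;;
    *)          echo "unknown library type: $1" >&2; return 1 ;;
  esac
}

# Prints the extra Cufflinks options (empty for un-stranded data).
cufflinks_opts() {
  case $1 in
    unstranded) printf '\n' ;;
    dutp)       printf '%s\n' "--library-type fr-firststrand" ;;
    *)          echo "unknown library type: $1" >&2; return 1 ;;
  esac
}

# Hypothetical usage:
#   STAR [options] $(star_opts_for_cufflinks dutp)
#   cufflinks [options] $(cufflinks_opts dutp) Aligned.out.sorted.bam
```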

#36
Member
Location: England Join Date: Mar 2013
Posts: 13

#37
Senior Member
Location: . Join Date: Mar 2011
Posts: 157
Hi all, sorry for the basic question:
I am writing a bash script to submit star jobs, remove duplicates, get counts etc. The dataset I have has multiple fastq per sample, but different numbers for each. I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? I.e.:
STAR [options] readFilesIn $files/file_read1 $files/file_read2 ?
Have tried a few ways to do this but can't figure it out or get STAR to accept input. I am a 'midrange' bioinformatics PhD, so don't hold back on most efficient or crazy way of doing this!
Thanks in advance, Bruce.
Last edited by bruce01; 05-07-2013 at 06:12 AM.

#38
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Code:
STAR --readFilesIn Sample1_r1_1.fq,Sample1_r1_2.fq,Sample1_r1_3.fq... Sample1_r2_1.fq,Sample1_r2_2.fq,Sample1_r2_3.fq...

#39
Senior Member
Location: . Join Date: Mar 2011
Posts: 157
Dpryan, yes, I have tried using wildcards as input to test whether it works, and I get a segmentation fault. When I run it with all filenames written out in full it runs fine. I have a lot of samples, with variable numbers of fastq files per sample, and want a single script to submit to a queue. So inputting all fastq by hand is not an option, hence my original question.
Concatenating the fastqs will mean I have to uncompress them, which uses computing time, and I am keen to work directly from the .gz files that my facility has supplied. This can't be too big of a problem, can it?

#40
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
My example didn't use wildcards, so I'm not sure where that idea came from.
You can just concatenate the gzipped files together without uncompressing them first. The other normal approach would be to simply write your script to generate the comma-separated list that's then fed to STAR. You should be able to do that easily enough in bash, which whatever you're using for job scheduling can probably already handle.
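A minimal sketch of the second approach: let the shell expand the file names and join them with commas. The file naming pattern (*_r1_*.fq.gz, *_r2_*.fq.gz) and the function name are hypothetical. The --readFilesCommand zcat option in the usage comment is what I would expect to need for reading .gz input directly, so check that your STAR version supports it.

```shell
# List one sample's files for a given mate, comma-joined.
# $1 = sample directory, $2 = mate number (1 or 2)
# The glob expands in sorted order, so the r1 and r2 lists line up
# as long as the mates are named consistently.
sample_reads() (
  set -- "$1"/*_r"$2"_*.fq.gz
  [ -e "$1" ] || exit 1      # the glob matched nothing
  IFS=,
  printf '%s\n' "$*"         # "$*" joins the arguments with the first character of IFS
)

# Hypothetical usage inside a job script:
#   r1=$(sample_reads "$sample_dir" 1)
#   r2=$(sample_reads "$sample_dir" 2)
#   STAR [options] --readFilesIn "$r1" "$r2" --readFilesCommand zcat
```

For the first approach, cat Sample1_r1_*.fq.gz > Sample1_r1.fq.gz produces a valid multi-member gzip file, so nothing needs to be uncompressed.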