SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
STAR vs Tophat (2.0.5/6) dvanic Bioinformatics 44 05-21-2014 07:08 AM
Using Star/ bowtie on cluster babi2305 Bioinformatics 7 02-06-2013 11:11 AM
Suggested aligner for local alignment of RNA-seq data Eric Fournier RNA Sequencing 9 01-23-2013 10:38 AM

Reply
 
Thread Tools
Old 02-20-2013, 05:30 AM   #21
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default small reference

Quote:
Originally Posted by thomasblomquist View Post
I'm having trouble generating a small reference genome for running STAR. Below is the command, Error output and the input "genome" file. Your thoughts on how I can rectify are welcome. At present, I am using BFAST and it works reasonably well, but am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

Regards,

-Tom Blomquist

University of Toledo, Ohio
Tom,
please use the following parameters:
--genomeChrBinNbits 6 --genomeSAindexNbases 4
I was able to run successful genome generation step for your small reference.
It looks like you want to use STAR in a quite non-standard way.
If you explain it a bit more I could suggest the parameters for the mapping step.
alexdobin is offline   Reply With Quote
Old 02-20-2013, 05:38 AM   #22
thomasblomquist
Member
 
Location: Ohio

Join Date: Jul 2012
Posts: 68
Default

Quote:
Originally Posted by alexdobin View Post
Tom,
please use the following parameters:
--genomeChrBinNbits 6 --genomeSAindexNbases 4
I was able to run successful genome generation step for your small reference.
It looks like you want to use STAR in a quite non-standard way.
If you explain it a bit more I could suggest the parameters for the mapping step.
Thank you. I do tend to go about things in a non-standard way . I am performing targeted resequencing of an RNA-seq library. Also known as amplicon sequencing. It is also semi-quantitative. My end goal being, I want to count the number of times a given sequence (e.g. for a gene target), or genetic variation (alleles from 1 base substitution up to 6 bases substituted). I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

Thank you for your time. I look forward to seeing how your software works in my studies.

Regards,

-Tom Blomquist
thomasblomquist is offline   Reply With Quote
Old 02-21-2013, 12:10 PM   #23
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by thomasblomquist View Post
Thank you. I do tend to go about things in a non-standard way . I am performing targeted resequencing of an RNA-seq library. Also known as amplicon sequencing. It is also semi-quantitative. My end goal being, I want to count the number of times a given sequence (e.g. for a gene target), or genetic variation (alleles from 1 base substitution up to 6 bases substituted). I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

Thank you for your time. I look forward to seeing how your software works in my studies.

Regards,

-Tom Blomquist
If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
--outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin L (L=shortest of the reference sequences)

Please let me know whether it works for you.
alexdobin is offline   Reply With Quote
Old 02-21-2013, 12:31 PM   #24
thomasblomquist
Member
 
Location: Ohio

Join Date: Jul 2012
Posts: 68
Default

Quote:
Originally Posted by alexdobin View Post
If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
--outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin L (L=shortest of the reference sequences)

Please let me know whether it works for you.
In essence. Since I generally know where the alleles are, the sequence reads are trimmed to include the variant loci and ~15 bases of sequencing read on either side of the expected loci(s). So, I'm matching ~40 base sequences to a reference library of chromosomes that are ~20-30 bases.

Thanks again for your help. I will try the modifications later tonight. -Tom
thomasblomquist is offline   Reply With Quote
Old 02-22-2013, 01:18 AM   #25
thomasblomquist
Member
 
Location: Ohio

Join Date: Jul 2012
Posts: 68
Default

Had the chance to test your recommended parameters for my specific needs. This is not my normal mode of operation, but... HOLY POOP!!! I cut a 2 hour BFAST match and sorting into bins down to 10 seconds!!! And the specificity and sensitivity is off the hook! When I do some more formal testing I will provide back metrics.

You are a computer jedi master! This will make analysis of clinical sequencing data a breeze. Very intuitive command lines.

-Tom Blomquist
thomasblomquist is offline   Reply With Quote
Old 02-22-2013, 01:24 AM   #26
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by alexdobin View Post
We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to combined genome it considers various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. On the other hand, in the 2-step method you force alignments to one of the species in the 1st step, so you results will be strongly biased towards the 1st species.
Interesting, thanks. Unfortunately for mixed-species mapping (At least when one has alternative splicing and the other does not) the way tophat works it's essentially the same as running bowtie for bacterial reads first (Though I think there might be a way around it). There's also problems with the max size of a bowtie index :/ Would the size be a concern for STAR? Or could it run a database of anysize? (Hopefully RAM shouldn't be too much of an issue for me)

Also just to clarify, you don't know STAR would perform with bacterial reads?

Last edited by bob-loblaw; 02-22-2013 at 05:54 AM.
bob-loblaw is offline   Reply With Quote
Old 02-22-2013, 06:28 AM   #27
thomasblomquist
Member
 
Location: Ohio

Join Date: Jul 2012
Posts: 68
Default

For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

Regards,

-Tom
thomasblomquist is offline   Reply With Quote
Old 03-27-2013, 11:54 AM   #28
Nino
Member
 
Location: New York City

Join Date: Mar 2013
Posts: 27
Default

Hey, does anyone know if you need the reference genome indexed according to Star because I know for tophat2 the reference genome needs to be indexed *.b2t (bowtie2)

Thanks,
Nino
Nino is offline   Reply With Quote
Old 03-27-2013, 01:46 PM   #29
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by bob-loblaw View Post
Interesting, thanks. Unfortunately for mixed-species mapping (At least when one has alternative splicing and the other does not) the way tophat works it's essentially the same as running bowtie for bacterial reads first (Though I think there might be a way around it). There's also problems with the max size of a bowtie index :/ Would the size be a concern for STAR? Or could it run a database of anysize? (Hopefully RAM shouldn't be too much of an issue for me)

Also just to clarify, you don't know STAR would perform with bacterial reads?
Sorry, did not check this thread for awhile, just found your questions.

STAR will work with database of any size, however, required RAM scales linearly with database size, at ~10*DBsize bytes.

STAR works fine with bacterial reads, this is the work we did with long and small RNA-seq for some bacteria.
alexdobin is offline   Reply With Quote
Old 03-27-2013, 01:56 PM   #30
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by thomasblomquist View Post
For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

Regards,

-Tom
Sorry, did not check this thread for awhile, just found your question.

With default parameters, if two alignments differ by one mismatch, only the best one would be reported. This is controlled by --outFilterMultimapScoreRange, (=1 by default) which defines the range of alignment scores that are reported as multi-mappers.

The max number of alignments that allowed for output is controlled by --outFilterMultimapNmax. It's 10 be default, so any read with 10 or fewer alignments with scores >= BestScore-1 will be reported.
alexdobin is offline   Reply With Quote
Old 03-27-2013, 02:01 PM   #31
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by Nino View Post
Hey, does anyone know if you need the reference genome indexed according to Star because I know for tophat2 the reference genome needs to be indexed *.b2t (bowtie2)

Thanks,
Nino
You will need to generate special genome files for STAR.
This is done with the following command:
STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
If you want to use annotations for improved mapping accuracy, you also need to use:
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.
alexdobin is offline   Reply With Quote
Old 03-29-2013, 05:37 PM   #32
Auction
Member
 
Location: california

Join Date: Jul 2009
Posts: 24
Default

Quote:
Originally Posted by alexdobin View Post
You will need to generate special genome files for STAR.
This is done with the following command:
STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <Nthreads>
If you want to use annotations for improved mapping accuracy, you also need to use:
--sjdbGTFfile /path/to/Annot.gtf --sjdbOverhang <N>, where ideally N=ReadMateLength-1, or you could generically use ~100.
Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

Thanks.
Auction is offline   Reply With Quote
Old 03-30-2013, 07:18 AM   #33
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by Auction View Post
Alex, I successfully used STAR to generate the SAM file. But I can't find how to specify the output for chimeric alignments. Should I use "--outSAMunmapped Within" to include everything in the SAM and use samtools to find chimeric alignments? And also for "--outReadsUnmapped", does it include chimeric and singleton?

Thanks.
To switch on chimeric detection and output, you would need to specify non-zero --chimSegmentMin, which is a minimum length of a segment (piece) of which chimeras are made. For example, if you have 2x100 PE reads and specify --chimSegmentMin, you could have a chimera in which one segment of (100-mate1+80-mate2) bases maps non-chimerically to one chromosome, and another segement of 20b-mate2 maps to another chromosome.
The Chimeric output will go into Chimeric.out.sam and Chimeric.out.junction files.

Note that the same read can have both acceptable non-chimeric (output to Aligned.out.sam) and chimeric alignments (output to Chimeric.out.*). A read is considered "unmapped" if it does not have an acceptable non-chimeric alignment, and --outSAMunmapped Within will output "unmapped" reads into Aligned.out.sam without alignment coordinates (which allows to fully reconstruct fastq file from the SAM file), while --outReadsUnmapped Fastx will output them into a fastq or fasta files.

There are other parameters that control chimeric detection:
chimJunctionOverhangMin 20
int>0: minimum overhang for a chimeric junction
chimScoreMin 0
int>0: minimum total (summed) score of the chimeric segments
chimScoreDropMax 20
int>0: max drop (difference) of chimeric score (the sum of scores of all chimeric segements) from the read length
chimScoreSeparation 10
int>0: minimum difference (separation) between the best chimeric score and the next one
chimScoreJunctionNonGTAG -1
int: penalty for a non-GT/AG chimeric junction
alexdobin is offline   Reply With Quote
Old 04-04-2013, 06:24 AM   #34
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default

I am pretty new to RNA-seq analysis and I am now using STAR instead of Tophat and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss though is the .GTF file I get from Tophat that contains new genes predicted based on the reads and splice junktions.
Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?
Sipkovandam@gmail.com is offline   Reply With Quote
Old 04-04-2013, 06:35 PM   #35
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by Sipkovandam@gmail.com View Post
I am pretty new to RNA-seq analysis and I am now using STAR instead of Tophat and I am very satisfied with both the quality of the results and the speed at which I get them. One thing I miss though is the .GTF file I get from Tophat that contains new genes predicted based on the reads and splice junktions.
Does anyone know if there is a way I can combine an existing GTF file with the .tab file to create a new .GTF (or GFF) file containing newly predicted gene sites (with random names for these)?
As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.

You can run Cufflinks on STAR alignments.
If you have un-stranded RNA-seq data you will need to run STAR with --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e. containing only non-canonical junctions) will be suppressed.

If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the library option --library-type options. For example,
cufflinks ... ... --library-type fr-firststrand
should be used for the “standard” dUTP protocol. This option has to be used only for Cufflinks runs and not for STAR runs.
It is recommended to remove the non-canonical junctions for Cufflinks runs using STAR's options:
--outFilterIntronMotifs RemoveNoncanonical OR RemoveNoncanonicalUnannotated
alexdobin is offline   Reply With Quote
Old 04-04-2013, 11:46 PM   #36
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default

Quote:
As far as I know TopHat does not produce a GTF file on its own, at least it was true for the last version I tried (~2.0.3). You need to feed the alignments to Cufflinks, which will assemble and quantify transcripts, and produce the GTF file.
You are right, sorry I mixed it up a bit. Thanks for the information on the options I should use.
Sipkovandam@gmail.com is offline   Reply With Quote
Old 05-07-2013, 04:42 AM   #37
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi all, sorry for the basic question:

I am writing a bash script to submit star jobs, remove duplicates, get counts etc. The dataset I have has multiple fastq per sample, but different numbers for each. I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? Ie:


STAR [options] readFilesIn $files/file_read1 $files/file_read2

?

Have tried a few ways to do this but can't figure it out or get STAR to accept input. I am a 'midrange' bioinformatics PhD, so don't hold back on most efficient or crazy way of doing this!

Thanks in advance,

Bruce.

Last edited by bruce01; 05-07-2013 at 05:12 AM.
bruce01 is offline   Reply With Quote
Old 05-07-2013, 08:40 AM   #38
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by bruce01 View Post
I have made files containing fastq in the specified format (fq_r1_1,..,fq_r1_n). Can I use these when submitting the STAR job? Ie:

STAR [options] readFilesIn $files/file_read1 $files/file_read2
Have you just tried the following?
Code:
STAR --readFilesIn Sample1_r1_1.fq,Sample1_r1_2.fq,Sample1_r1_3.fq... Sample1_r2_1.fq,Sample1_r2_2.fq,Sample1_r2_3.fq...
You could also just concatenate the files together as appropriate and use the result.
dpryan is offline   Reply With Quote
Old 05-08-2013, 03:25 AM   #39
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Dpryan, yes have tried using wildcards as input to test it works, I get a segmentation fault. When I run it with all filenames included as standard it runs fine. I have a lot of samples, with variable numbers of fastq files per sample, and want a single script to submit to a queue. So inputting all fastq by hand is not an option, hence my original question.

Concatenating the fastqs will mean I have to uncompress them, using computing time and I am keen to go from the .gz that my facility have supplied. This can't be too big of a problem is it?
bruce01 is offline   Reply With Quote
Old 05-08-2013, 03:34 AM   #40
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

My example didn't use wildcards, so I'm not sure where that idea came from.

You can just concatenate the gzipped files together without uncompressing them first.

The other normal process would be to simply write your script to generate the comma separated list that's then fed to STAR. You should be able to do that easily enough in bash, which whatever you're using for job scheduling probably already can handle.
dpryan is offline   Reply With Quote
Reply

Tags
alignment, genome, mapping, rna-seq, transcirptome

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:18 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO