SEQanswers

Old 07-12-2014, 12:29 AM   #101
kjusto
Junior Member
 
Location: china

Join Date: Apr 2014
Posts: 5
Default

Quote:
Originally Posted by GenoMax View Post
Pre-compiled linux binary is available here: https://code.google.com/p/rna-star/d...4.tgz&can=2&q=
Thanks for the link. I had to use proxies to get it, though (Google issues here). My question was about a 32-bit Linux OS: are there any binaries for it?
Old 07-12-2014, 02:35 AM   #102
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

I don't think Alex provides 32-bit binaries. If you have a large genome (roughly human-sized), 32-bit may not work.

Build from source if you must have 32-bit: https://code.google.com/p/rna-star/d...e.tgz&can=2&q=
Old 07-14-2014, 01:56 AM   #103
emmanouela
Junior Member
 
Location: Oxford

Join Date: Jul 2014
Posts: 4
Default

Quote:
Originally Posted by Brian Bushnell View Post
However, if that 200kbp corresponds exactly to a known intron in the GTF file, and only occurs when using the GTF file, it's probably OK. Does it?
Hi Brian,
No, I didn't use a GTF for the mapping in this case. Also, the mapped read corresponds to a known intron (of a short gene) on one side, but on the other side it lands in a random intergenic region well past the end of the gene it starts in (at least according to UCSC). The 200kb gap also overlaps 4 other known genes. So to my eyes that's definitely a mapping error too. The question now is how to filter those out (because there are quite a few of them).
Old 07-17-2014, 07:45 AM   #104
kjusto
Junior Member
 
Location: china

Join Date: Apr 2014
Posts: 5
Default

Hi,
I'm trying to generate a genome from the rice reference and I get the following error. I have tried several of the available STAR patches:

biostat1@biostat[STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
Jul 17 10:55:11 ..... Started STAR run
Jul 17 10:55:11 ... Starting to generate Genome files
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

Any ideas?
Thanks!
Old 07-18-2014, 02:08 PM   #105
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by emmanouela View Post
Hi Brian,
No, I didn't use a GTF for the mapping in this case. Also, the mapped read corresponds to a known intron (of a short gene) on one side, but on the other side it lands in a random intergenic region well past the end of the gene it starts in (at least according to UCSC). The 200kb gap also overlaps 4 other known genes. So to my eyes that's definitely a mapping error too. The question now is how to filter those out (because there are quite a few of them).
Hi Emma,

These long-gap splices, often connecting adjacent genes, are somewhat common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events or some kind of wet-lab or mapping artifact.
They would clearly be mapping artifacts if "better" alignments of these sequences could be found; however, BLATing or BLASTing them did not result in any better alignments.
One way to get rid of them is to prohibit long gaps completely with --alignIntronMax N, which prohibits any gap longer than N (by default this is ~600000). However, if you make this too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundreds of kilobases long.
A better approach is to filter out long-gap alignments supported by too few reads, e.g.:
--outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >=3 reads, <=100kb by >=4 reads.
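
The gap-versus-read-count rule can be sketched in a few lines of Python. This is only an illustration of the rule as worded in this post, not STAR's own code: the function name `passes_gap_filter` is made up, and the behaviour for junctions supported by more reads than there are thresholds (any gap is accepted here, leaving --alignIntronMax as the only cap) is an assumption.

```python
def passes_gap_filter(gap, n_reads, max_gap_by_reads=(10000, 20000, 50000, 100000)):
    """Return True if an unannotated junction survives the gap-vs-read-count rule.

    max_gap_by_reads[i] is the largest gap allowed for a junction supported
    by i + 1 spliced reads (the values given to --outSJfilterIntronMaxVsReadN).
    """
    if n_reads < 1:
        return False  # no supporting reads, nothing to keep
    if n_reads > len(max_gap_by_reads):
        # Assumption: past the last threshold this rule adds no further limit.
        return True
    return gap <= max_gap_by_reads[n_reads - 1]


# A 45kb gap needs at least 3 supporting reads with the thresholds above.
print(passes_gap_filter(45000, 2))  # False
print(passes_gap_filter(45000, 3))  # True
```

Raising or lowering individual thresholds changes how much read support a long gap needs before it is reported.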

There is more discussion on this type of filtering in this post.

Cheers
Alex
Old 07-18-2014, 02:12 PM   #106
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by kjusto View Post
Hi,
I'm trying to generate a genome from the rice reference and I get the following error. I have tried several of the available STAR patches:

biostat1@biostat[STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
Jul 17 10:55:11 ..... Started STAR run
Jul 17 10:55:11 ... Starting to generate Genome files
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

Any ideas?
Thanks!
Hi @kjusto,
Is it possible that your L1_1.fq and L1_2.fq files are not FASTA files? They seem to have a FASTQ extension, and --genomeFastaFiles expects the reference genome sequences in FASTA format.
If they are valid FASTA, please send me a link to them, and I will try to figure out the problem.
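
A quick way to check what a file really contains, regardless of its extension, is to look at the first record marker: FASTA headers start with ">" and FASTQ headers start with "@". The following is a minimal sketch of that check, assuming uncompressed text files; the function name `guess_sequence_format` is hypothetical.

```python
def guess_sequence_format(path):
    """Guess 'fasta', 'fastq' or 'unknown' from the first non-blank line."""
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file
```

Calling `guess_sequence_format("L1_1.fq")` before running genomeGenerate would show whether the file is a set of reads rather than a reference sequence.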
Cheers
Alex
Old 09-03-2014, 11:16 AM   #107
ndaniel
Member
 
Location: Helsinki

Join Date: Feb 2009
Posts: 33
Default

Quote:
Originally Posted by alexdobin View Post
Hi Emma,

These long-gap splices, often connecting adjacent genes, are somewhat common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events or some kind of wet-lab or mapping artifact.
They would clearly be mapping artifacts if "better" alignments of these sequences could be found; however, BLATing or BLASTing them did not result in any better alignments.
One way to get rid of them is to prohibit long gaps completely with --alignIntronMax N, which prohibits any gap longer than N (by default this is ~600000). However, if you make this too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundreds of kilobases long.
A better approach is to filter out long-gap alignments supported by too few reads, e.g.:
--outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >=3 reads, <=100kb by >=4 reads.
This looks like a very bad idea. There are plenty of genuine fusion genes whose partner genes are known to be "adjacent".

For example, in the famous fusion gene FGFR3-TACC3 (please Google it for more information), the genes FGFR3 and TACC3 are adjacent and the distance between them is less than 50kb. If this fusion is moderately expressed, there could be, for example, only 2 reads supporting it, and it would be discarded under the criteria above.

There are several fusion finders out there that can handle this kind of case with a very low false-positive rate, e.g. https://code.google.com/p/fusioncatcher/wiki/comparison

Last edited by ndaniel; 09-03-2014 at 11:18 AM.
Old 10-23-2014, 08:57 AM   #108
coryfunk
Junior Member
 
Location: Seattle, WA

Join Date: Mar 2014
Posts: 8
Default

After using STAR for my alignments, I'm using samtools for some basic evaluations (and GATK after that to generate a VCF file). An issue I'm having is that even though STAR correctly includes the mapping location for each read in the SAM file, it is not setting the flag that says whether a read is mapped or not (-f 4). As a result, when I use samtools to count the number of mapped reads, it's always zero. Samtools accurately counts the number of total reads and unmapped reads.

Is this an issue with STAR or possibly something I'm doing incorrectly?
Old 10-23-2014, 11:08 AM   #109
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

STAR doesn't output unmapped reads in the SAM file (at least not by default), so bit 4 shouldn't be set there for any record.
Old 10-23-2014, 11:16 AM   #110
coryfunk
Junior Member
 
Location: Seattle, WA

Join Date: Mar 2014
Posts: 8
Default

That's a bit of an issue because the default of not having a flag set is to call it "unmapped".

Running:

samtools view -c -F 4 <file.name>

counts all the reads as unmapped. Downstream tools that rely on that flag are thus rendered incompatible.
Old 10-23-2014, 11:25 AM   #111
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

No, not setting that bit means the read is mapped.
Edit: I'll add that samtools view -c -F 4 foo.bam will count the mapped reads. Using -f 4 will count the unmapped reads, which will be 0.
Old 10-23-2014, 11:26 AM   #112
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

0x4 means 'unmapped'. If no flags are set, that means it's a single-ended read mapped to the plus strand.
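
The flag logic discussed above can be illustrated with a small Python sketch. The bit values 0x1, 0x4 and 0x10 are the ones defined for SAM flags; the function `count_reads` and the example flag values are made up for illustration, and the function only imitates the spirit of `samtools view -c -f ... -F ...` rather than reproducing it.

```python
PAIRED = 0x1     # read is part of a pair
UNMAPPED = 0x4   # read is unmapped
REVERSE = 0x10   # read maps to the reverse strand


def is_mapped(flag):
    """A read counts as mapped when the 0x4 bit is NOT set."""
    return (flag & UNMAPPED) == 0


def count_reads(flags, require=0, exclude=0):
    """Count flags with all `require` bits set and none of the `exclude` bits."""
    return sum(1 for f in flags
               if (f & require) == require and (f & exclude) == 0)


# Hypothetical flags from an output file that contains no unmapped records.
flags = [0, 16, 0, 16, 256]
print(count_reads(flags, exclude=UNMAPPED))  # like -F 4: prints 5
print(count_reads(flags, require=UNMAPPED))  # like -f 4: prints 0
```

A flag of 0 passes `is_mapped`, which matches the point that a record with no flags set is a mapped, single-ended, plus-strand read.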
Old 10-23-2014, 11:31 AM   #113
coryfunk
Junior Member
 
Location: Seattle, WA

Join Date: Mar 2014
Posts: 8
Default

My mistake. Thanks!
Old 11-14-2014, 03:47 AM   #114
serash
Junior Member
 
Location: Belgium

Join Date: Nov 2014
Posts: 3
Default

Hi,
I'm currently doing 2-pass alignment with the --twopass1readsN option, which adds the 1st-pass junctions to the genome in memory. I'm curious to know whether this works when a genome already in memory is used (--genomeLoad LoadAndKeep) from previous runs. Will this give the correct results if I run additional 2-pass alignments with this genome in memory, or do I need to reload the genome for each run?

Thanks!
Old 11-18-2014, 01:23 PM   #115
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by serash View Post
Hi,
I'm currently doing 2-pass alignment with the --twopass1readsN option, which adds the 1st-pass junctions to the genome in memory. I'm curious to know whether this works when a genome already in memory is used (--genomeLoad LoadAndKeep) from previous runs. Will this give the correct results if I run additional 2-pass alignments with this genome in memory, or do I need to reload the genome for each run?

Thanks!
Hi,

The --twopass1readsN option cannot be used with the --genomeLoad LoadAndKeep option; you will need to reload the genome for every sample. The 2nd-pass genome indices will be different for each sample, since the junctions discovered in the 1st pass change from sample to sample.

For multiple samples, one option is to map all the samples together (2-pass with --twopass1readsN -1, which maps all reads in the first pass) and use the Read Group tags to mark the different samples.

Another option is the manual 2-step operation: run the 1st pass on all samples, collect the detected junctions from all samples, re-generate the genome, and run the 2nd pass with the new genome, which can now be loaded with the --genomeLoad LoadAndKeep option.
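
The "collect the detected junctions from all samples" step could look roughly like the Python sketch below. It assumes the usual SJ.out.tab layout (tab-separated; column 1 chromosome, column 2 intron start, column 3 intron end, column 4 strand coded as 0/1/2) and writes chromosome, start, end and strand as four tab-separated columns; whether that output suits your STAR version as a junction file should be checked against the manual. The function names are hypothetical.

```python
STRAND = {"0": ".", "1": "+", "2": "-"}  # assumed SJ.out.tab strand codes


def merge_junctions(sj_tables):
    """Union the junctions found in several SJ.out.tab tables.

    sj_tables is an iterable of line iterables (e.g. open file handles).
    Returns a sorted list of (chrom, start, end, strand) tuples.
    """
    merged = set()
    for lines in sj_tables:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, start, end, strand = fields[:4]
            merged.add((chrom, int(start), int(end), STRAND.get(strand, ".")))
    return sorted(merged)


def format_junctions(junctions):
    """Render junctions as tab-separated lines, one junction per line."""
    return ["%s\t%d\t%d\t%s" % j for j in junctions]
```

Each sample's SJ.out.tab can be passed in as an open file handle, and the formatted lines written to a single file that is then used when re-generating the genome.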

Cheers
Alex
Old 12-28-2014, 03:46 AM   #116
serash
Junior Member
 
Location: Belgium

Join Date: Nov 2014
Posts: 3
Default

Hi again,

It makes sense that I need to rebuild the genome first and can then use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1, the result is obtained much faster because the SJ.out.tab junctions from the first pass are added in memory.
That said, I was wondering if it's possible to do the same as the 2-pass run in shared memory: add the data (giving the file with --sjdbFileChrStartEnd?) to an existing genome in memory, to avoid rebuilding the entire genome and thus save time.

cheers,
Dries
Old 01-08-2015, 01:52 PM   #117
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by serash View Post
Hi again,

It makes sense that I need to rebuild the genome first and can then use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1, the result is obtained much faster because the SJ.out.tab junctions from the first pass are added in memory.
That said, I was wondering if it's possible to do the same as the 2-pass run in shared memory: add the data (giving the file with --sjdbFileChrStartEnd?) to an existing genome in memory, to avoid rebuilding the entire genome and thus save time.

cheers,
Dries
This is a good suggestion, and it's already on my (short) TODO list: adding annotated junctions on the fly, so that there is no need to re-generate the whole genome index when you only want to change annotations.

Cheers
Alex
Old 01-08-2015, 11:04 PM   #118
serash
Junior Member
 
Location: Belgium

Join Date: Nov 2014
Posts: 3
Default

That's great to know! Thanks
Old 01-15-2015, 04:03 AM   #119
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 266
Question

STAR is generating files like this in the directory I'm running it from:
  • SJ.out.tab
  • Log.std.out
  • Log.out
  • Log.final.out
  • Log.progress.out

How can I put those files into a specific directory so that I can run multiple jobs in parallel from a single directory?


Cheers,
Dan.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
Old 01-15-2015, 04:08 AM   #120
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Just use the --outFileNamePrefix option to give everything an appropriate prefix.
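
As a rough sketch of how this could be scripted for parallel jobs, the Python below builds one STAR command per sample, each with its own prefix ending in "/" so that SJ.out.tab and the Log files land in a separate directory. The sample names, paths and the helper `star_command` are hypothetical; it is safest to create each output directory before launching the job.

```python
def star_command(sample, fastq, genome_dir, out_root):
    """Build a STAR command line whose outputs go under out_root/sample/."""
    prefix = "%s/%s/" % (out_root.rstrip("/"), sample)
    return ["STAR",
            "--genomeDir", genome_dir,
            "--readFilesIn", fastq,
            "--outFileNamePrefix", prefix]


# Hypothetical samples; each job gets a distinct output prefix.
samples = {"sampleA": "sampleA.fastq", "sampleB": "sampleB.fastq"}
for name, fastq in sorted(samples.items()):
    print(" ".join(star_command(name, fastq, "genome_index", "runs")))
```

Each list can be handed to a job scheduler or to `subprocess.run` once the corresponding directory exists.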

Tags
alignment, genome, mapping, rna-seq, transcriptome
