
**dpryan** · 09-24-2013, 07:21 AM

Originally posted by Nino

Hey Devon,

It turns out it is not difficult, since a group from Case Western Reserve University, Cleveland, OH published a paper on a program they developed called LoQuM, which does exactly what I wanted. I have not tried the program yet, but here is the title of the article if you would like to read it yourself:

"Accurate estimation of short read mapping quality for next-generation genome sequencing"

Thanks,
Nino

Interesting, I'll have to give that paper a read, thanks!

**apredeus** · 03-22-2014, 10:34 PM

Originally posted by alexdobin

We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post

Alex, thank you for the great tool - STAR is indeed very impressive!

Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?

**alexdobin** · 03-26-2014, 08:11 AM

Originally posted by apredeus

Alex, thank you for the great tool - STAR is indeed very impressive!

Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?

Hi Alex,

to generate a combined mouse/human genome with STAR, you need to modify your FASTA and GTF files slightly:
1. Modify the chromosome names so that mouse and human chromosomes have distinct names, e.g. chr1h/chr1m etc. In the FASTA files you need to make these modifications in all sequence name lines (i.e. those starting with ">"). In the GTF files you need to modify all chromosome names in field 1.
2. Make sure that the transcript_id values in the GTF files are distinct for mouse and human. This is usually the case; for instance, Gencode has "ENSMUSTxxxxx" for mouse and "ENSTxxxxx" for human.
3. Concatenate the GTF files for mouse and human into a single GTF file.
4. Run genome generation with
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir ./ --genomeFastaFiles /path/to/human.fa /path/to/mouse.fa --sjdbGTFfile /path/to/mouse_human.gtf --sjdbOverhang 100
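
The renaming in steps 1 to 3 could be scripted roughly as follows. This is only a sketch: the function names, the "h"/"m" suffixes, and the file names in the usage comments are illustrative assumptions, not anything STAR provides. It assumes the chromosome name is the first whitespace-delimited token on each ">" line and the first tab-separated field of each GTF record.

```python
def tag_fasta_lines(lines, suffix):
    """Append `suffix` to the sequence name on every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            # Split the header into the sequence name and an optional description.
            parts = line[1:].rstrip("\n").split(None, 1)
            if not parts:
                yield line  # empty header: leave untouched
                continue
            rest = " " + parts[1] if len(parts) > 1 else ""
            yield ">" + parts[0] + suffix + rest + "\n"
        else:
            yield line  # sequence lines pass through unchanged


def tag_gtf_lines(lines, suffix):
    """Append `suffix` to the chromosome name in field 1 of every GTF record."""
    for line in lines:
        chrom, sep, rest = line.partition("\t")
        if line.startswith("#") or not sep:
            yield line  # comments, blank lines, and lines without tabs pass through
        else:
            yield chrom + suffix + sep + rest


# Hypothetical usage (file names are placeholders):
# with open("human.fa") as src, open("human.tagged.fa", "w") as dst:
#     dst.writelines(tag_fasta_lines(src, "h"))
# with open("mouse_human.gtf", "w") as dst:
#     for path, suffix in (("human.gtf", "h"), ("mouse.gtf", "m")):
#         with open(path) as src:
#             dst.writelines(tag_gtf_lines(src, suffix))
```

The same suffix has to be used for the FASTA and the GTF of a given species, otherwise the chromosome names in the annotation will not match the genome.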

If you want to use mRNA GTF files instead of or in addition to standard annotations, I would recommend checking the splice junctions in these files for very short introns and filtering them out - please see this post.

Cheers
Alex

**apredeus** · 03-26-2014, 08:35 AM

Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.

Originally posted by alexdobin

If you want to use mRNA GTF files instead of or in addition to standard annotations, I would recommend checking the splice junctions in these files for very short introns and filtering them out - please see this post.

What would the "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collection for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hg19), so I'm not sure what else people use.

**apredeus** · 03-26-2014, 08:37 AM

Also, I meant to ask - do you normally include the random chromosomes and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but there are quite a few transcripts that map to these.

Thank you!

**alexdobin** · 03-28-2014, 07:18 AM

Originally posted by apredeus

Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.

What would the "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collection for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hg19), so I'm not sure what else people use.

There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do it is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
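
The filtering step could look something like the sketch below. It assumes the junction file is tab-separated with chromosome, intron start, and intron end (1-based, inclusive) in the first three columns; the function name and the 20-base cutoff are only placeholders, so pick a threshold that suits your data.

```python
def filter_short_introns(lines, min_intron=20):
    """Yield junction lines whose intron spans at least `min_intron` bases."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end = int(fields[1]), int(fields[2])
        # Intron length for inclusive coordinates.
        if end - start + 1 >= min_intron:
            yield line


# Hypothetical usage (paths are placeholders):
# with open("sjdbList.out.tab") as src, open("sjdb.filtered.tab", "w") as dst:
#     dst.writelines(filter_short_introns(src, min_intron=20))
```

The filtered file is what would then be passed to --sjdbFileChrStartEnd when re-generating the genome.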

**alexdobin** · 03-28-2014, 07:30 AM

Originally posted by apredeus

Also, I meant to ask - do you normally include the random chromosomes and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but there are quite a few transcripts that map to these.

Thank you!

I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.

On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
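
As a rough sketch, the haplotype scaffolds could be dropped from the FASTA before genome generation like this. It assumes the scaffold names contain "_hap" (as in UCSC-style names such as chr6_cox_hap2); the function name and the marker string are illustrative, so check them against your own FASTA headers.

```python
def drop_haplotype_records(lines, marker="_hap"):
    """Yield FASTA lines, skipping records whose sequence name contains `marker`."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after ">".
            name = line[1:].split(None, 1)
            keep = not (name and marker in name[0])
        if keep:
            yield line


# Hypothetical usage (paths are placeholders):
# with open("human.fa") as src, open("human.nohap.fa", "w") as dst:
#     dst.writelines(drop_haplotype_records(src))
```

Unplaced scaffolds such as chrUn_gl000220 do not contain the marker, so they are kept.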

**apredeus** · 03-28-2014, 08:07 AM

Originally posted by alexdobin

I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.

On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.

Yes, very logical. Funny, after I posted this question, this whole story came up:

Find all annotated rRNA (rDNA) sequences: http://seqanswers.com/forums/showthread.php?p=136425

So, definitely including the extra scaffolds!

**apredeus** · 03-28-2014, 08:14 AM

Originally posted by alexdobin

There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do it is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).

Yes, and in order to avoid Gencode/UCSC scaffold naming differences, I will just use the Gencode GTF with no "random" annotations. Sounds great. Thanks again for the answers!

**shangzhong0619** · 06-19-2014, 02:06 PM

failed to generate genome using STAR

Hi, I built the genome using the command:
STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles file.fa.gz --runThreadN 10
It failed with the message: "BUG: next index is smaller than previous, EXITING".

Also, does anyone have a more detailed manual for STAR? The manual I downloaded from the website shows:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
What are the other options in "…"? I tried unzipping the fa.gz file to a fa file and then got this error message: "limitGenomeGenerateRAM=28is too small for your genome
SOLUTION: please specify limitGenomeGenerateRAM not less than114 GB and make that much RAM available".

For other aligners we can type -h or --help to find the details, but not for STAR...

**alexdobin** · 06-20-2014, 01:55 PM

Hi Shangzhong,

please try one of the latest STAR patches.
Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.
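
For choosing a value, the STAR manual gives a rule of thumb, quoted here from memory, so please check it against your copy of the manual: min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A small sketch of that arithmetic follows; the function name is made up for illustration, and rounding down to an integer is an assumption on my part.

```python
import math


def suggest_chr_bin_nbits(genome_length, n_references, read_length):
    """Rule-of-thumb --genomeChrBinNbits value, rounded down and capped at 18."""
    # Average reference length, but never smaller than the read length.
    per_reference = max(genome_length / n_references, read_length)
    return min(18, int(math.log2(per_reference)))


# Example: a 3 Gb assembly split into 200,000 scaffolds, 100-base reads.
print(suggest_chr_bin_nbits(3_000_000_000, 200_000, 100))
```

With only a few dozen chromosomes the cap of 18 applies, which is why the setting only matters for assemblies with many scaffolds.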

You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:

sjdbGTFfile -
string: path to the GTF file with annotations

sjdbOverhang 0
int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
if =0, splice junction database is not used

Cheers
Alex

**shangzhong0619** · 06-20-2014, 02:07 PM

Originally posted by alexdobin

Hi Shangzhong,

please try one of the latest STAR patches.
Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.

You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:

sjdbGTFfile -
string: path to the GTF file with annotations

sjdbOverhang 0
int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
if =0, splice junction database is not used

Cheers
Alex

Hi Alex,
Thanks for your reply. Yes, my reference FASTA has many scaffolds. When I try to install the latest version, it shows the following error:

samtools/libbam.a(bgzf.o): In function `bgzf_compress':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:187: undefined reference to `deflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:188: undefined reference to `deflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:189: undefined reference to `deflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
samtools/libbam.a(bgzf.o): In function `bgzf_dopen':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:160: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `bgzf_open':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:142: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `inflate_block':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:224: undefined reference to `inflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:228: undefined reference to `inflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:229: undefined reference to `inflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:233: undefined reference to `inflateEnd'
samtools/libbam.a(bam_import.o): In function `ks_getuntil2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `__bam_get_lines':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:92: undefined reference to `gzclose'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_close':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:485: undefined reference to `gzclose'
samtools/libbam.a(bam_import.o): In function `sam_open':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `ks_getc':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:147: undefined reference to `gzclose'
collect2: ld returned 1 exit status
make: *** [STAR] Error 1

I have samtools-0.1.19 on my computer. What is this error about? Thank you.

**alexdobin** · 06-20-2014, 02:58 PM

Hi Shangzhong,

please try this patch http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz

If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.

Cheers
Alex

**shangzhong0619** · 06-20-2014, 03:43 PM

Originally posted by alexdobin

Hi Shangzhong,

please try this patch http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz

If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.

Cheers
Alex

Thanks, it works. I have another problem: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me and I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.

**alexdobin** · 06-23-2014, 08:11 AM

Originally posted by shangzhong0619

Thanks, it works. I have another problem: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me and I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.

Hi Shangzhong,

for genome generation, STAR needs an unzipped FASTA file. You only do this once per genome, and you can delete the FASTA file after the genome is generated. The '--readFilesCommand zcat' option only applies to FASTQ/FASTA reads at the mapping stage.
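
If you prefer to script the decompression, a minimal sketch is below. The function name and paths are placeholders; plain gunzip on the command line does the same job.

```python
import gzip
import shutil


def gunzip_fasta(gz_path, out_path):
    """Decompress a gzipped FASTA to `out_path` and return that path."""
    # Stream the data so large genomes are never held in memory at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Hypothetical usage (paths are placeholders):
# gunzip_fasta("file.fa.gz", "file.fa")
```

The resulting plain FASTA is what goes to --genomeFastaFiles, and it can be removed once the genome has been generated.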

Cheers
Alex
