SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 09-20-2015, 07:13 PM   #241
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Well... I'm not quite sure why you're mapping DNA reads to the transcriptome. If the annotation is accurate, you already have the full mRNA nucleotide sequences in the file "rna.fa.gz" and don't need to generate a sam file at all, nor will WGS help you find the mRNA sequences with any methodology I can think of.

Are you sure you meant WGS (ie, DNA reads) rather than RNA-seq data? For RNA-seq data, to find full-length mRNAs, you would probably assemble it using an assembler like Trinity, or map it to the genome and use some other tool to guess isoforms from mapping data.

As for the high standard deviation - BBMap by default will map reads that could map to multiple locations (aka multimapping, ambiguous, or non-uniquely-mapping reads) to the first genomic location they can map to optimally. For this kind of purpose (RNA-seq, or mapping to a transcriptome) I recommend you set "ambig=random" or "ambig=all". Otherwise, for genes with multiple isoforms, all the reads from that gene will map to the first isoform containing the exon in question, leading to a high standard deviation. For mapping RNA to a transcriptome I generally recommend "ambig=random".

Also, if you ARE mapping DNA to the transcriptome (which, again, is probably not useful), you should set the "local" flag - otherwise reads that span intron/exon boundaries will have a lot of mismatches since there is intronic content in DNA reads, but none in the transcriptome.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 09-23-2015, 07:25 AM   #242
DarthMapper
Junior Member
 
Location: PR

Join Date: Sep 2015
Posts: 4
Default

I tried using the settings you had recommended, but the standard deviation is still awfully high. Would this be related to the fact that we are trying to map WGS sequences to a transcriptome?

Here are the results:

With ambig=random setting and local=t

Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 31.27
Standard deviation: 628.72
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 75.89

Time: 2084.076 seconds.

----------------------------------------------------------------------------------------------------------------------


With ambig=random setting, but no local flag

Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 31.27
Standard deviation: 635.37
Percent scaffolds with any coverage: 99.94
Percent of reference bases covered: 77.75

Time: 1935.627 seconds.
--------------------------------------------------------------------------------------------------------------------
Thank you again Brian for your last reply!
DarthMapper is offline   Reply With Quote
Old 09-23-2015, 10:56 AM   #243
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I don't understand why you are aligning DNA reads to a transcriptome. To clarify, are these DNA reads or RNA-seq reads? And can you explain what you're trying to do?
Brian Bushnell is offline   Reply With Quote
Old 09-23-2015, 11:04 AM   #244
DarthMapper
Junior Member
 
Location: PR

Join Date: Sep 2015
Posts: 4
Default

I'm so sorry; I was editing my last reply when you responded.

I forgot to mention that we are trying to determine copy number via read depth analysis, so we need to know the average and SD read mapping to all genes present in the genome (to compare against the ambiguous genes in question).

We are using WGS reads, which are DNA, and we are mapping these to the transcriptome.
DarthMapper is offline   Reply With Quote
Old 09-23-2015, 03:01 PM   #245
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 498
Default

You should be mapping to the genome, then extracting the information that corresponds to the transcriptome. I'm not surprised that your standard deviations are wonky. It's probably due to repeat elements in the genome that are under-represented in the transcriptome, which would produce much higher-than-expected coverage. You can check by BLASTing a few examples of your high-coverage regions against the genome, and see how many hits you get.
HESmith is offline   Reply With Quote
Old 09-24-2015, 08:11 AM   #246
DarthMapper
Junior Member
 
Location: PR

Join Date: Sep 2015
Posts: 4
Default

Well, usually we map the WGS reads to the genome assembly for quality testing. But, how exactly would we extract the information corresponding to the transcriptome? Would I be able to use BBMap for this?
DarthMapper is offline   Reply With Quote
Old 09-24-2015, 09:49 AM   #247
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I think you can do that with Bedtools and a gene annotation file.
Brian Bushnell is offline   Reply With Quote
Old 09-24-2015, 09:51 AM   #248
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

One would need BBMap to create the alignment file that can then be used with bedtools
GenoMax is offline   Reply With Quote
Old 10-04-2015, 01:23 PM   #249
JamesJoyce
Junior Member
 
Location: Haiti

Join Date: Oct 2015
Posts: 5
Default

Hello Brian;
I am trying to do something similar to what DarthMapper was suggesting.
My PI had instructed me to determine the copy number of certain genes in a single organism by mapping WGS reads to said gene sequences as reference. I then am supposed to compare coverage values of these individual genes with that obtained from mapping the WGS reads to the whole set of mRNA sequences present in the studied organism's transcriptome. The problem is that I am also getting strange standard deviations when using BBMap.
What is wrong with my methodology?
JamesJoyce is offline   Reply With Quote
Old 10-04-2015, 04:54 PM   #250
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi James,

Can you give me the exact command line you are using?

Also, the complete output printed to the screen would be helpful, as well as the name of the organism.

Last edited by Brian Bushnell; 10-04-2015 at 05:27 PM.
Brian Bushnell is offline   Reply With Quote
Old 10-04-2015, 07:04 PM   #251
JamesJoyce
Junior Member
 
Location: Haiti

Join Date: Oct 2015
Posts: 5
Default

Thanks for your reply Brian.
I am using BBMap to align WGS reads to a set of mRNA sequences extracted from the organism's NCBI transcriptome. This is being done with different organisms, such as mammals, reptiles, and fungi. So far, I have tested a few mammals, including Felix Catus (cat) and Rattus Norvegicus (Rat). Again, I was instructed to find the copy number of certain genes within a single organism. My PI told me to map WGS sequences to the transcriptome's mRNA sequences as reference before mapping to individual gene sequences, but we stopped once we saw the results.

The results for both mammals are very similar, so I will only post one example

Here is the commandline I am using and the results:

Index Build
bbmap.sh ref=mRNA.fa build=1

Alignment
bbmap.sh build=1 in1=catreads/read_1.fastq in2=catreads/read_2.fastq ambig=random local=t out=mRNA.sam

Coverage and Standard Deviation:
pileup.sh in=mRNA.sam out=covstats.txt hist=histogram.txt

P.S. I have tried to use samtools depth to calculate coverage and STDV with BBMap's sam outputs but it does not let me - even when I have converted the sam's to bam, and sorted the bam. How can this be achieved?

CoverageResults
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average coverage: 15.29
Standard deviation: 384.14
Percent scaffolds with any coverage: 99.81
Percent of reference bases covered: 69.32

Is there any way to lower that STDV?
JamesJoyce is offline   Reply With Quote
Old 10-05-2015, 10:42 AM   #252
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

First off, you should expect a high standard deviation in this scenario. If one gene has 10 isoforms and another has 1 isoform, the gene with 1 isoform will have (on average) 10x the coverage of the isoforms from the gene with 10, when ambiguously-mapped reads are placed randomly. You need to use exactly one isoform per gene (preferably the longest).

However, it's also possible that there's a problem with the random selection of sites for ambiguously-mapped paired reads. Would you mind trying this (mapping with read 1 only):

bbmap.sh build=1 in1=catreads/read_1.fastq ambig=random local=t out=mRNA.sam covstats=covstats.txt hist=histogram.txt

Note that bbmap can internally generate coverage information and does not need a second pass from pileup. As for why samtools depth is not working, I have no idea. Does it give some kind of error message?
Brian Bushnell is offline   Reply With Quote
Old 10-05-2015, 10:47 AM   #253
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

@Brian: How is this analysis is going to give information about number of copies of a gene to @DarthMapper/@JamesJoyce? Am I missing something obvious?
GenoMax is offline   Reply With Quote
Old 10-05-2015, 10:59 AM   #254
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Well... it is kind of a strange analysis... and a normal transcriptome file, with many isoforms per gene, will not very useful. But, if you had exactly one transcript per gene in your reference, then theoretically, genes with 2 copies should get double the coverage of genes with 1 copy, if you map uniform WGS data. Using the "local" flag would be helpful because wherever reads go into introns they will not match the transcriptome. Also, pseudogenes will cause confusion. So I'm not sure how much signal there will be compared to the noise, but with those assumptions (one transcript per gene, local alignments, fairly uniform coverage, no pseudogenes), it should work.

Last edited by Brian Bushnell; 10-05-2015 at 11:07 AM.
Brian Bushnell is offline   Reply With Quote
Old 10-05-2015, 11:22 AM   #255
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

It is a strange analysis and with all those caveats going to be of limited real world use for eukaryotic genomes.
GenoMax is offline   Reply With Quote
Old 10-05-2015, 02:45 PM   #256
JamesJoyce
Junior Member
 
Location: Haiti

Join Date: Oct 2015
Posts: 5
Default

Thank you all for your replies. I will be trying some of the things you have all suggested.
As for running samtools depth with BBMap's outputs, I get 0 coverage and Standard Deviation values. I first convert BBMap's sam output to bam, and then proceed to sort it using samtools. It is then that I try using samtools depth.

Edit:
I tried using one read to verify if there was a problem with the random selection of sites for ambiguously-mapped paired reads. However, the results were very similar, leading me to think that I should be focusing on extracting the longest isoform per gene and seeing how that looks like. However, I am unable to produce coverage and STDV directly from BBMap, as it tells me that "hist=" is an unknown command. Thank you again Brian.

Last edited by JamesJoyce; 10-06-2015 at 06:39 AM.
JamesJoyce is offline   Reply With Quote
Old 10-08-2015, 07:08 AM   #257
JamesJoyce
Junior Member
 
Location: Haiti

Join Date: Oct 2015
Posts: 5
Default

Update: Results still have high standard deviations, even after only keeping the longest isoforms of the sequences found in the reference transcript.

I really like all the features BBMap holds, and as such I would like to keep using BBMap for a foreseeable future. Is there any way I could use BBMap to do this sort of analysis (determining copy number of certain genes within an individual organism), but instead by using RNA-Seq read sequences? Thanks again for your time guys.
JamesJoyce is offline   Reply With Quote
Old 10-08-2015, 09:25 AM   #258
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can't determine copy number of genes from RNA-seq data, as the expression is not really determined by the copy-number (and is highly variable anyway). The best way is probably to map WGS DNA reads to the genome and then use bedtools to extract coverage over coding regions, and look for regions with different-than-expected coverage. This will remove a lot of the bias caused by mapping DNA reads to a transcriptome. Also, note that exome-capture will not yield useful quantitative results, either, due to extreme bias, without extensive calibration for a given kit.

Alternately, it would probably be a lot cheaper to run a SNP chip, or qPCR, or a probe array, or something like that, rather than doing sequencing. I can't remember which one is best for this but at least one is designed to do copy-count quantification without sequencing, across a genome.
Brian Bushnell is offline   Reply With Quote
Old 10-13-2015, 02:35 PM   #259
mhgseq
Junior Member
 
Location: TX

Join Date: Oct 2015
Posts: 7
Default out of memory error

even with a very small ref and input file, I got the out of memory error. Would you check what is happen?


bbmap.sh in=~/Documents/align/test.fa out=~/Documents/align/outputu.sam

java -Djava.library.path=/home/local/bbmap/jni/ -ea -Xmx700m -cp /home/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in=/home/Documents/align/HLAtest.fa out=/home/Documents/align/outputu.sam
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in=/home/Documents/align/test.fa, out=/home/Documents/align/outputu.sam]


BBMap version 35.51
Retaining first best site only for ambiguous mappings.
Set genome to 1

Loaded Reference: 0.023 seconds.
Loading index for chunk 1-1, build 1
Generated Index: 0.817 seconds.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at align2.BBIndex.analyzeIndex(BBIndex.java:106)
at align2.BBMap.loadIndex(BBMap.java:410)
at align2.BBMap.main(BBMap.java:32)
mhgseq is offline   Reply With Quote
Old 10-13-2015, 02:49 PM   #260
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBMap needs a fairly large amount of memory dependent on kmer size, regardless of how small the reference is. You can estimate it like this:

Code:
C:\temp\ecoli>java jgi.AssemblyStats2 in=ecoli_K12.fa k=13
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2462  0.2542  0.2537  0.2459  0.0000  0.0000  0.0000  0.5079  0.0000

Main genome scaffold total:             1
Main genome contig total:               1
Main genome scaffold sequence total:    4.640 MB
(stuff)...

BBMap minimum memory estimate at k=13:     -Xmx900m     (at least 1000 MB physical RAM)
So, it looks like K=13 with a tiny genome like E.coli still needs 900MB! Let's try K=12:

Code:
C:\temp\ecoli>java jgi.AssemblyStats2 in=ecoli_K12.fa k=12
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2462  0.2542  0.2537  0.2459  0.0000  0.0000  0.0000  0.5079  0.0000

Main genome scaffold total:             1
Main genome contig total:               1
Main genome scaffold sequence total:    4.640 MB
(stuff)...

BBMap minimum memory estimate at k=12:     -Xmx480m     (at least 540 MB physical RAM)
That's more reasonable; K=12 will work with -Xmx480m. Note that you must specify K both when building the index AND mapping.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO