SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie mapping with more than 3 mismatches? rndouglas Bioinformatics 4 07-29-2015 06:50 AM
mapping mismatches with Rsamtools, or best approach darren.obbard Bioinformatics 0 01-31-2012 02:40 AM
Global Alignment With Mummer (Parameter for Gap Extension & Opening Cost) peveralldubois Bioinformatics 0 03-07-2011 12:45 AM
Bowtie mapping with more mismatches mapper Bioinformatics 0 01-05-2011 03:11 AM
Multi-Genome Alignment for QC... james hadfield Bioinformatics 0 08-17-2010 08:51 AM

Reply
 
Thread Tools
Old 02-20-2013, 08:10 AM   #1
babi2305
Member
 
Location: Barcelona

Join Date: Feb 2013
Posts: 14
Default STAR-alignment/confusion with mismatches & multi-mapping parameter

Hello everyone,


1. I used STAR for mapping RNA seq data with default parameters only changing --outFilterMismatchNmax 5 (allowing 5 mismatches, default was 10). I am now confused if I should have gone for 3 or less? My read length varies between 10-50. What is the ideal cutoff one should allow for mismatch parameter?

2. Also, one should go for multi-mapping reads or not? because around 24% of reads in my case is showing multi-mapping, is it significant? i don't think so.. should I switch it to zero multi-mapping?

3. I also want to do some statistics to see the correlation between my replicates, and some post-mapping statistical analysis like length counts, log(n) etc..can anybody refer a good tool for this..?

Please suggest. thanks.
babi2305 is offline   Reply With Quote
Old 02-21-2013, 02:00 PM   #2
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
--outFilterMismatchNoverLmax 0.05
This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on.

2. Multi-mapping reads are quite common especially when you map short reads. 24% is a decent number for short RNA-seq. What to do with multi-mappers is a complicated issue. Typically they are used to represent expression of whole classes/families of RNA (repeats (e.g. transposons), gene families etc). If you are looking at your data in a browser, it's useful to have two separate tracks - for unique mappers and multi-mappers.

3. We typically check correlation between bio-reps for read counts on annotated small RNAs. A good tool to do it is 'coverageBed' from BEDtools package.
Another nice tool is HTseq.
alexdobin is offline   Reply With Quote
Old 02-22-2013, 12:51 AM   #3
babi2305
Member
 
Location: Barcelona

Join Date: Feb 2013
Posts: 14
Default

Quote:
Originally Posted by alexdobin View Post
1. The max number of mismatches depends on many factors, the most important being read length (more mismatches for longer reads), sequencing quality (more mismatches for poorer quality), sequence divergence of your sample with respect to the reference.

.

Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

Thanks again,
babi2305 is offline   Reply With Quote
Old 02-25-2013, 05:20 PM   #4
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by babi2305 View Post
Thankyou alexdobin. It is much help. I am just in starting phase right now.By the way I also need to compare between ployA+ and poly A- RNA-seq data of cytosol and nucleus(A+/A-) taken from encode, for some preliminary analysis. Can you please refer some papers or give any idea what things can I do with this data. I am not finding any reference on poly A- analysis.

Thanks again,
We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts. Many RNAs are not not to be polyadenylated, histone genes being the well-known example, while it is not as well understood for more recently discovered classes such as lncRNAs or eRNAs.
alexdobin is offline   Reply With Quote
Old 02-26-2013, 08:56 AM   #5
babi2305
Member
 
Location: Barcelona

Join Date: Feb 2013
Posts: 14
Default Cufflink producing 0.0 FPKM values for Ensembl matched transcripts

Quote:
Originally Posted by alexdobin View Post
We have not done a lot of analysis within ENCODE comparing A+ vs A- samples. There are a number of interesting questions one could try to explore. The obvious thing is that A- data contain a significant number intronic reads, probably originating from primary transcripts.
Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

Thanks.
babi2305 is offline   Reply With Quote
Old 02-26-2013, 07:52 PM   #6
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by babi2305 View Post
Thanks so much for replying. I couldn't find much reference on poly A+/polyA-..still i have started my work with the analysis of cufflinks output. Now problem is using cufflink '-g' option and giving Ensembl GTF file (hg19), & .bam as input, I have got different output files of genes/isoforms/transcripts.fpkm.tracking. But cufflink is only showing FPKM values for novel transcripts (eg CUFF.1.X CUFF.2.X etc) but no FPKM values for ENSEMBL matched transcripts, only zeros. Also, in former case FPKM values are too high like >400 or even >800. One reason i can think of is that ENCODE data is based on STAR alignment and cufflinks works better with TopHat sam/bam. Do you have any idea what should I do to get the FPKM values? do I need to change my input files? can any other tool calculate it for me?

Thanks.
The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.
alexdobin is offline   Reply With Quote
Old 03-05-2013, 06:00 AM   #7
babi2305
Member
 
Location: Barcelona

Join Date: Feb 2013
Posts: 14
Default How cuffdiff works? Need to know the novel transcripts

Quote:
Originally Posted by alexdobin View Post
The chromosome names in hg19 start with 'chr' (like chr1,chr2...), while ENSEMBL does not use the 'chr' prefix. I suspect this causes Cufflinks to not report any FPKM for annotated transcripts. You need to rename ENSEMBL chromosomes to be compatible with hg19. Note that 'chrM' in hg19 is called 'MT' in ENSEMBL, and I believe some scaffolds are named entirely differently. Another option is to use GENCODE annotations, which are basically ENSEMBL but compatible with hg19.

I recommend first to run Cufflinks with -G annot.gtf option (not the -g), which quantifies the annotated ones but does not assemble novel transcripts. Once you are satisfies with the results, you can run the (guided) assembly.
Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?
babi2305 is offline   Reply With Quote
Old 03-05-2013, 08:02 AM   #8
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Quote:
Originally Posted by babi2305 View Post
Thanks so much, I used the gtf file from gencode v15 and it worked fine. Now I am stuck with cuffdiff output. Can you also give me an idea about how cuffdiff works? I am more interested in finding novel genes/isoforms. I gave the combined.gtf (from cuffcompare) files & sam files of both my samples to cuffdiff. I wanted to know, which genes/isoforms are the one that are novel with respect to the other one. Does cuffdiff returns the novel ones also or just the differently expressed known genes ?

I ran cufflinks with -g option to get the novel transcripts. Somebody please tell me how cuffdiff works?
Sorry I cannot give you any solid advise on using Cuffdiff and Cuffcompare - I have only been using Cufflinks for transcripts assembly from our data (and have not been happy with the results).

As far as I understand, Cuffdiff will give you the differentially expressed isoforms given the same list of transcripts for the two samples, but it will not tell you which isoforms are "structurally" different between two samples. You probably should be able to parse some of the Cuffcompare outputs to get the novel isoforms.
alexdobin is offline   Reply With Quote
Old 05-07-2013, 07:12 AM   #9
babi2305
Member
 
Location: Barcelona

Join Date: Feb 2013
Posts: 14
Default Star doubt

Quote:
Originally Posted by alexdobin View Post
Since you have reads of widely varying length (are these short RNA?), I would recommend setting the max number of mismatches relative to the read length using
--outFilterMismatchNoverLmax 0.05
This is what we typically use for ENCODE small RNA-seq. This means that for reads <20b no mismatches are allowed, 20-39b: 1 mismatch, 40-59b 2 mismatches and so on..
Hello Alexander,

I am using STAR for my single-end small RNAseq data (15-46 read length) and its working great. thanks! However, since its new, I can't find much help on public forums, there are couple of things I need to know, and who can tell it better than you..

I used STAR with the following parameters:
--genomeDir </path/genome/hg19>
--genomeLoad NoSharedMemory
--readFilesIn <path/inputReadsFile/min15_max_46length_reads>
--runThreadN 4
--outFilterMismatchNoverLmax 0.05
--outSAMattributes All
--outSJfilterOverhangMin 20 10 10 10
--outFileNamePrefix <output_location>
--sjdbFileChrStartEnd <introns_ucsc_hg19>
--sjdbOverhang 45


Now, when I created contigs / or visualize it on genome browser I am getting large chunks of regions spanning 100,000bp and more in the genome, and there are many such occurrences in multiple files. which therefore is giving large set of contigs as well. Is this a mapping error?

Also, when I identified those regions from my bam files, I found bitwise flag '64' in the second column, and '255' in 5th (for uniquely mapped).

I have attached the screenshot, first one is contig file and second one is reads extracted from bam file.

Please help,

Thanks.
Attached Images
File Type: png Screenshot-1.png (122.6 KB, 59 views)
babi2305 is offline   Reply With Quote
Old 05-07-2013, 01:45 PM   #10
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161
Default

Hi @babi2305,

I am not sure how you make contigs, but I suspect that these long clusters are made of alignments that are spliced. Some small RNA can be spliced (e.g. short mRNA decay products that happen to cross exon-exon junction). Unless you are looking for them specifically, I would suggest prohibiting splicing with --alignIntronMax 10.
If you do not want splicing, you also do not --sjdbFileChrStartEnd <introns_ucsc_hg19> --sjdbOverhang 45, and in any case, these parameters should be used at the genome generation step, not mapping step.

Note, that you can also filter out spliced alignments after the mapping - they will contain "N" in the CIGAR string.

We are routinely using STAR to map "small RNA" (~<200b) data within the ENCODE project with the following parameters:
--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1
(>=16b matched to the genome, number of mismatches <= 5% of mapped length, i.e. 0MM for 16-19b, 1MM for 20-39b etc, splicing switched off).

You can clip 3' adapter before feeding the reads to STAR, or you can use simple built-in clipper --clip3pAdapterSeq TGGAATTCTC --clip3pAdapterMMp 0.1 (second parameter is the proportion of mismatches in the matched adapter length).

You would also likely want to filter out reads that STAR "genomically" trims at the 5'.
This simple awk one-liner will filter out all alignments that are trimmed by more than 1 base from the 5':
Code:
awk '{S=0; split($6,C,/[0-9]*/); n=split($6,L,/[NMSID]/);  if (and($2,0x10)>0 && C[n]=="S") {S=L[n-1]} else if (and($2,0x10)==0 && C[2]=="S") {S=L[1]}; if (S<=1) print }' Aligned.out.sam > Aligned.filtered.sam
alexdobin is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO