Seqanswers Leaderboard Ad

**zee** · 07-17-2009, 12:12 AM

I was looking for the same thing, however the -s also gave me the individual bases. I settled with a perl one-liner:

Code:

samtools pileup  -cf genome.fasta  file.bam |perl -lane 'print join"\t",@F[0..7]'

For more complicated things I am using the Bio:: DB::Sam pileup methods.
Hope this helps.

Originally posted by ohofmann View Post

Bit of a newbie question. I've been trying to use the pileup analysis on a BWA dataset. Is there any way to switch of the read bases, read quality and alignment quality information in the output file and get a summarized format instead?

I'm looking at a small number of sequences that have a coverage of 50.000X upwards, and as a result the pileup output becomes almost unmanageable.

Thanks!

**pparg** · 07-21-2009, 02:22 PM

I tried samtools tview to browse the alignment results in bam format. However, several keys do not work, including:
Arrows Small scroll movement
J, K Large scroll movement
backspace Scroll back one screen
..
I haven’t checked other keys. But would like to report what I saw. BTW, is there any other ways besides tview to view the alignment results in sam/bam format? Thanks a lot!

**lparsons** · 07-22-2009, 06:03 AM

I have noticed the same behavior with samtools tview. The biggest issue I have is with very deep alignments since I can find no way to scroll up or down in large chunks.

I also have a question about how people generally use tview. I have an alignment from a virus that is very deep (2000x) since it is such a small genome. I suspect this may be why there are MANY indels reported by BWA. However, most of those indels do not pass varFilter. Is there a way to regenerate the SAM/BAM file with the "false" variations removed? In other words, to view the SAM/BAM as filtered by samtools varFilter? Thanks.

**jess** · 07-22-2009, 11:31 AM

Does anyone know of any way SAM generates a .snp file?? This is needed as the input when i use the SNPfilter option in samtools.pl...so i was wondering if i am not aware of any option which generated .snp file. Let me knw ASAP

**iloveneworleans** · 07-27-2009, 09:12 PM

I have a problem about converting sam format file into bam format file with samtools.
I downloaded the latest version:v0.1.5c
and my fasta file is about 3G; I met a "Bus error" when I run the below command,
samtools view -bt h_sapiens.fa -o output.bam h_sapiens.sam

The error message is:
[sam_header_read2] 23631565 sequence loaded.
Bus error

Does anyone meet the kind of problem before?
Could you tell me how to solve this problem?
thanks a lot

**nilshomer** · 07-27-2009, 09:27 PM

Originally posted by iloveneworleans View Post

I have a problem about converting sam format file into bam format file with samtools.
I downloaded the latest version:v0.1.5c
and my fasta file is about 3G; I met a "Bus error" when I run the below command,
samtools view -bt h_sapiens.fa -o output.bam h_sapiens.sam

The error message is:
[sam_header_read2] 23631565 sequence loaded.
Bus error

Does anyone meet the kind of problem before?
Could you tell me how to solve this problem?
thanks a lot

I think the problem is the "-t" option:

Code:

         -t FILE  list of reference names and lengths (force -S) [null]

It requires the names and lengths of the contigs in the reference. To create this file, use "samtools faidx" on your reference. A ".fai" should be created, which is what you should use instead for the "-t" above.

**iloveneworleans** · 07-28-2009, 07:03 PM

Thanks for nilshomer's help!
Now the error message is disappeared.
But I only can see a black screen when I typed the command:
samtools tview aln.sorted.bam ref.fasta
I want to see how many short reads are aligned on certain one position, I don't know if the command can show the information.

**nilshomer** · 07-28-2009, 07:09 PM

Originally posted by iloveneworleans View Post

Thanks for nilshomer's help!
Now the error message is disappeared.
But I only can see a black screen when I typed the command:
samtools tview aln.sorted.bam ref.fasta
I want to see how many short reads are aligned on certain one position, I don't know if the command can show the information.

When you enter tview, you are put at the first contig, first position. Try navigating to the region of interest (type "g" etc.). I think the help screen is the "?" key.

**xguo** · 07-31-2009, 06:26 AM

Can anyone help me with two options in samtools view?

-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]

I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?

In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

thanks
Xiang

**nilshomer** · 07-31-2009, 09:52 AM

Originally posted by xguo View Post

Can anyone help me with two options in samtools view?

-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]

I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?

See the SAM format for a description of the "FLAG" field. With the "FLAG" field, you can filter out unmapped reads, unpaired reads (paired end reads where one end maps), among other things. To remove low quality mappings (enforce a mapping quality of 50 or greater), I would just use the following command:

Code:

samtools view <in.bam> | awk '{ if($5 > 50) { print $0 }}'

You can save the above to a file, or import it back into a "bam" file.

In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

thanks
Xiang

In my opinion it is better to use as much information during mapping as the alignment program should be able to squeeze out extra information from the last five bases. You can then "filter" the alignment file (SAM) and remove the last five bases (you will have to do this yourself).

**xguo** · 07-31-2009, 11:05 AM

nilshomer, thanks for your prompt reply.

I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth? The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq? Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them? In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

Sorry for so many questions. Thanks a lot for any insight.

**nilshomer** · 07-31-2009, 07:42 PM

Originally posted by xguo View Post

nilshomer, thanks for your prompt reply.

I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?

Why did you restrict it in the first place? I am probably missing something obvious.

The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?

If you use MAQ then the maximum mapping quality is the same. If you go to this page you will see a description of the consensus calling, which mentions the maximum mapping quality among other things. I don't see RMS though.

Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?

Assuming they represent true probabilities (a liberal assumption) then you should select it to match your maximum false-positive rate to obtain the largest number of SNPS.

In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

The above is a very crude approximation, due to the extreme variability of coverage as well as the large dependency on the local repeat structure (context) when it comes to mapping. In the pileup format, there is a column for the # of reads covering the given consensus call.

Sorry for so many questions. Thanks a lot for any insight.

I don't use MAQ very much, since I am an author of my own aligner BFAST. On the other hand, samtools was developed by the same author as MAQ (Heng Li) so you should be safe making some of the above assumptions about MAQ and samtools.

**xguo** · 08-01-2009, 06:45 AM

[QUOTE=nilshomer;7023]Why did you restrict it in the first place? I am probably missing something obvious.

I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".

**nilshomer** · 08-01-2009, 10:09 AM

I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".

The question is how do you identify PCR duplicates with high coverage data? If you have 50bp reads with 1000x coverage, then on average you should get 20 reads with the same start site at a given position. So if you remove PCR duplicates in this situation, you should on average reduce your coverage by a factor of 20.

Now if you have paired end reads, and the insert distribution is not tight (say >500bp), you could enforce both ends must map to the same start site, which would only remove say >1% of the true data (making this up).

If you are trying to measure gene expression, then I would not filter out PCR duplicates at high coverage data. If you are trying to find novel splice events, then the case is more ambiguous.

**griffon42** · 08-04-2009, 09:39 AM

To add a bit to xguo's questions regarding the varFilter in samtools:

I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.

Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?

Thanks!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 47 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 48 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News