Seqanswers Leaderboard Ad

**dariober** · 02-14-2013, 12:46 AM

Originally posted by obk View Post

Hi,

Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

Code:

POS A  C  G  T N I D
1   50 0  0  0 0 0 0
2   0  50 0  0 0 0 0
3   0  25 25 0 0 0 0
4   50 0  0  0 0 0 0
5   0  50 0  0 0 0 0

or alternatively, the total coverage (50 for all POSitions in the above example) with % of each of ACGTNID.

Hi- Have a look at bedtools nuc, it should do exactly what you need or get you very close. From the help:

Code:

bedtools nuc -h

Tool:    bedtools nuc (aka nucBed)
Version: v2.16.2
Summary: Profiles the nucleotide content of intervals in a fasta file.

Usage:   bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
	-fi	Input FASTA file

	-bed	BED/GFF/VCF file of ranges to extract from -fi

	-s	Profile the sequence according to strand.

	-seq	Print the extracted sequence

	-pattern	Report the number of times a user-defined sequence
			is observed (case-sensitive).

	-C	Igore case when matching -pattern. By defaulty, case matters.

Output format: 
	The following information will be reported after each BED entry:
	    1) %AT content
	    2) %GC content
	    3) Number of As observed
	    4) Number of Cs observed
	    5) Number of Gs observed
	    6) Number of Ts observed
	    7) Number of Ns observed
	    8) Number of other bases observed
	    9) The length of the explored sequence/interval.
	    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
	    11) The number of times a user's pattern was observed.
	        (opt., if -pattern is used.)

Dario

**obk** · 02-14-2013, 09:12 AM

Hi Dario - thanks for the reply.
As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.

**dariober** · 02-14-2013, 10:02 AM

Originally posted by obk View Post

Hi Dario - thanks for the reply.
As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.

...I see, sorry I misunderstood your question. Indded bedtools nuc gives you the base composition of the reference fasta, but not of the reads aligned to it. I'll try again...
What about samtools mpileup? The input/output looks like this (haven't tested, check the syntax, comments are mine):

Code:

samtools mpileup -B -Q0 -d100000 myreads.bam -f myreference.fasta
chr7    3036461 G       2       ,,      I+    ## Reference at this pos is G, there are two reads aligned on the reverse (,,) 
chr7    3047000 C       2       TT      IG    ## Ref is C, two reads aligned to forward as TT
chr7    3047001 G       2       ..      HF    ## Ref is G, two reads aligned on forward

See if this helps... samtools view and mpileup are quite flexible and you can do a lot of filtering on reads and regions of interest.

Dario

**obk** · 02-14-2013, 10:10 AM

Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.

As mentioned in the original post (screenshot), this data is already viewable in igv... I just wished I could generate and access that data as a text file...

Thanks for your example though!

**dariober** · 02-14-2013, 10:22 AM

Originally posted by obk View Post

Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.

Ooops... Indeed you mention mpileup in your original post. No, in this case I can't think of anything more direct right now...

Dario

**obk** · 02-14-2013, 10:30 AM

Ok, thanks for your help! You saved me a few hours of searching manuals/docs and forum threads!

**olorin** · 03-22-2013, 01:45 AM

Hi. I know this is probably too late for you, but I was caught in a similar situation.

After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...

**dariober** · 03-22-2013, 01:55 AM

Originally posted by olorin View Post

Hi. I know this is probably too late for you, but I was caught in a similar situation.

After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...

Thanks for posting this! That's good to know indeed!
Dario

**obk** · 03-22-2013, 06:06 AM

Thanks for the reply! I did find igvtools count, but as far as I can tell it doesn't output the result in non-binary format...

From the manual (http://www.broadinstitute.org/igv/ig..._commandline):

Code:

Usage:
          igvtools count [options] [inputFile] [outputFile] [genome]

Required arguments:
          inputFile    The input file (see supported formats above).
          [B][COLOR="Red"]outputFile   Binary output file[/COLOR][/B].  Must end in ".tdf" or ".wig".  To indicate that you want to
                             output both a .tdf and a .wig file, list both output filenames as a single string,
                             separated by a comma with no other delimiters.  To display feature
                             intensity in IGV, the density must be computed with this command, and the
                             resulting file must be named <feature track filename>.tdf.

... unless I'm still missing something...

I still want to implement this feature in my script (though low in priority), so if you/anyone knows how to do it I'd love to know!

**yl01** · 03-22-2013, 08:17 AM

The outputFile doesn't need to be binary .tdf. You can give a file name like myfile.wig, Wig file is a text file.

**Giffredo** · 02-27-2014, 04:07 AM

I am interested in this conversation..
I d like to reach the number of reads that map a selected base on my reference genome to understand if that site is or not an editing site (a changed from C to T).

I tried the command:

Code:

igvtools count -z 5 -w 1 --strands [read] --bases  sorted.bam sorted.bam.wig reference.fas &

I don t know what -z in the practise want to mean... anyway do you think that this can work for reaching my purpose?

**TiborNagy** · 02-27-2014, 06:29 AM

z specifies the max zoom level and you do not need [] around 'read'

**Giffredo** · 02-28-2014, 01:18 AM

OK.. so which is the difference between z and w?

**dariober** · 03-04-2014, 12:42 AM

Originally posted by obk View Post

Hi,

Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

Code:

POS A  C  G  T N I D
1   50 0  0  0 0 0 0
2   0  50 0  0 0 0 0
3   0  25 25 0 0 0 0
4   50 0  0  0 0 0 0
5   0  50 0  0 0 0 0

Hello,

As mentioned in previous posts, igvtools count does the job. However, I just came across this tool pysamstats which, among many other things, can output matches and mismatches per position, including indels. E.g.:

Code:

pysamstats --fasta myref.fa --type variation myaln.bam

Which gives a nice tabular file:

Code:

chrom	pos	ref	reads_all	reads_pp	matches	matches_pp	mismatches	mismatches_pp	deletions	deletions_pp	insertions	insertions_pp	A	A_pp	C	C_pp	T	T_pp	G	G_pp	N	N_pp
CP001509.3	3	C	1	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
CP001509.3	4	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	5	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	6	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	7	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	8	C	4	0	2	0	2	0	0	0	0	0	0	0	2	0	2	0	0	0	0	0
CP001509.3	9	A	4	0	4	0	0	0	0	0	0	0	4	0	0	0	0	0	0	0	0	0
CP001509.3	10	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	11	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
...

Many thanks to the developer!

Dario

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Get igv coverage track

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News