SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
IGV:export track sequence karnu General 5 02-13-2014 12:43 PM
IGV v2 performance much worse than IGV 1.5? pmiguel Bioinformatics 1 05-17-2012 08:05 PM
How to create GC% track for IGV? pmiguel Bioinformatics 3 05-18-2011 07:50 AM
IGV display of coverage john_mu Bioinformatics 2 06-05-2010 11:40 PM

Reply
 
Thread Tools
Old 02-13-2013, 05:54 PM   #1
obk
Member
 
Location: West Coast

Join Date: Dec 2012
Posts: 12
Default Get igv coverage track

Hi,

Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

Code:
POS A  C  G  T N I D
1   50 0  0  0 0 0 0
2   0  50 0  0 0 0 0
3   0  25 25 0 0 0 0
4   50 0  0  0 0 0 0
5   0  50 0  0 0 0 0
or alternatively, the total coverage (50 for all POSitions in the above example) with % of each of ACGTNID.

This data can be found/displayed in IGV (coverage track) as in the screenshot here, and probably exported from IGV, but I want to get at this data with a command line tool, without the gui IGV.



Other comments:
- igv count almost does what I want, but I can only seem to load and view its output in IGV.
- coverageBed also almost does what I want, but I don't think it will report the base composition at each position.
- I could parse the samtools mpileup output for the particular POSitions I'm interested in, but I was wondering if this has been implemented already somewhere.
- I also found the thread below, but is quite old
http://seqanswers.com/forums/showthread.php?t=3650

Thanks for your help.

PS I'm new to bioinformatics and apologies if this is trivial.
obk is offline   Reply With Quote
Old 02-14-2013, 12:46 AM   #2
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by obk View Post
Hi,

Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

Code:
POS A  C  G  T N I D
1   50 0  0  0 0 0 0
2   0  50 0  0 0 0 0
3   0  25 25 0 0 0 0
4   50 0  0  0 0 0 0
5   0  50 0  0 0 0 0
or alternatively, the total coverage (50 for all POSitions in the above example) with % of each of ACGTNID.
Hi- Have a look at bedtools nuc, it should do exactly what you need or get you very close. From the help:
Code:
bedtools nuc -h

Tool:    bedtools nuc (aka nucBed)
Version: v2.16.2
Summary: Profiles the nucleotide content of intervals in a fasta file.

Usage:   bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
	-fi	Input FASTA file

	-bed	BED/GFF/VCF file of ranges to extract from -fi

	-s	Profile the sequence according to strand.

	-seq	Print the extracted sequence

	-pattern	Report the number of times a user-defined sequence
			is observed (case-sensitive).

	-C	Igore case when matching -pattern. By defaulty, case matters.

Output format: 
	The following information will be reported after each BED entry:
	    1) %AT content
	    2) %GC content
	    3) Number of As observed
	    4) Number of Cs observed
	    5) Number of Gs observed
	    6) Number of Ts observed
	    7) Number of Ns observed
	    8) Number of other bases observed
	    9) The length of the explored sequence/interval.
	    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
	    11) The number of times a user's pattern was observed.
	        (opt., if -pattern is used.)
Dario
dariober is offline   Reply With Quote
Old 02-14-2013, 09:12 AM   #3
obk
Member
 
Location: West Coast

Join Date: Dec 2012
Posts: 12
Default

Hi Dario - thanks for the reply.
As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.
obk is offline   Reply With Quote
Old 02-14-2013, 10:02 AM   #4
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by obk View Post
Hi Dario - thanks for the reply.
As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.
...I see, sorry I misunderstood your question. Indded bedtools nuc gives you the base composition of the reference fasta, but not of the reads aligned to it. I'll try again...
What about samtools mpileup? The input/output looks like this (haven't tested, check the syntax, comments are mine):

Code:
samtools mpileup -B -Q0 -d100000 myreads.bam -f myreference.fasta
chr7    3036461 G       2       ,,      I+    ## Reference at this pos is G, there are two reads aligned on the reverse (,,) 
chr7    3047000 C       2       TT      IG    ## Ref is C, two reads aligned to forward as TT
chr7    3047001 G       2       ..      HF    ## Ref is G, two reads aligned on forward
See if this helps... samtools view and mpileup are quite flexible and you can do a lot of filtering on reads and regions of interest.

Dario
dariober is offline   Reply With Quote
Old 02-14-2013, 10:10 AM   #5
obk
Member
 
Location: West Coast

Join Date: Dec 2012
Posts: 12
Default

Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.

As mentioned in the original post (screenshot), this data is already viewable in igv... I just wished I could generate and access that data as a text file...

Thanks for your example though!
obk is offline   Reply With Quote
Old 02-14-2013, 10:22 AM   #6
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by obk View Post
Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.
Ooops... Indeed you mention mpileup in your original post. No, in this case I can't think of anything more direct right now...

Dario
dariober is offline   Reply With Quote
Old 02-14-2013, 10:30 AM   #7
obk
Member
 
Location: West Coast

Join Date: Dec 2012
Posts: 12
Default

Ok, thanks for your help! You saved me a few hours of searching manuals/docs and forum threads!
obk is offline   Reply With Quote
Old 03-22-2013, 02:45 AM   #8
olorin
Junior Member
 
Location: Asia-Pacific

Join Date: Oct 2008
Posts: 1
Default

Hi. I know this is probably too late for you, but I was caught in a similar situation.

After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...
olorin is offline   Reply With Quote
Old 03-22-2013, 02:55 AM   #9
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by olorin View Post
Hi. I know this is probably too late for you, but I was caught in a similar situation.

After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...
Thanks for posting this! That's good to know indeed!
Dario
dariober is offline   Reply With Quote
Old 03-22-2013, 07:06 AM   #10
obk
Member
 
Location: West Coast

Join Date: Dec 2012
Posts: 12
Default

Thanks for the reply! I did find igvtools count, but as far as I can tell it doesn't output the result in non-binary format...

From the manual (http://www.broadinstitute.org/igv/ig..._commandline):

Code:
Usage:
          igvtools count [options] [inputFile] [outputFile] [genome]

Required arguments:
          inputFile    The input file (see supported formats above).
          outputFile   Binary output file.  Must end in ".tdf" or ".wig".  To indicate that you want to
                             output both a .tdf and a .wig file, list both output filenames as a single string,
                             separated by a comma with no other delimiters.  To display feature
                             intensity in IGV, the density must be computed with this command, and the
                             resulting file must be named <feature track filename>.tdf.
... unless I'm still missing something...

I still want to implement this feature in my script (though low in priority), so if you/anyone knows how to do it I'd love to know!
obk is offline   Reply With Quote
Old 03-22-2013, 09:17 AM   #11
yl01
Member
 
Location: Germany

Join Date: Aug 2012
Posts: 25
Default

The outputFile doesn't need to be binary .tdf. You can give a file name like myfile.wig, Wig file is a text file.
yl01 is offline   Reply With Quote
Old 02-27-2014, 04:07 AM   #12
Giffredo
Member
 
Location: Paris

Join Date: Feb 2014
Posts: 36
Smile

I am interested in this conversation..
I d like to reach the number of reads that map a selected base on my reference genome to understand if that site is or not an editing site (a changed from C to T).

I tried the command:

Code:
igvtools count -z 5 -w 1 --strands [read] --bases  sorted.bam sorted.bam.wig reference.fas &
I don t know what -z in the practise want to mean... anyway do you think that this can work for reaching my purpose?
Giffredo is offline   Reply With Quote
Old 02-27-2014, 06:29 AM   #13
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

z specifies the max zoom level and you do not need [] around 'read'
TiborNagy is offline   Reply With Quote
Old 02-28-2014, 01:18 AM   #14
Giffredo
Member
 
Location: Paris

Join Date: Feb 2014
Posts: 36
Default

OK.. so which is the difference between z and w?
Giffredo is offline   Reply With Quote
Old 03-04-2014, 12:42 AM   #15
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by obk View Post
Hi,

Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

Code:
POS A  C  G  T N I D
1   50 0  0  0 0 0 0
2   0  50 0  0 0 0 0
3   0  25 25 0 0 0 0
4   50 0  0  0 0 0 0
5   0  50 0  0 0 0 0
Hello,

As mentioned in previous posts, igvtools count does the job. However, I just came across this tool pysamstats which, among many other things, can output matches and mismatches per position, including indels. E.g.:

Code:
pysamstats --fasta myref.fa --type variation myaln.bam
Which gives a nice tabular file:

Code:
chrom	pos	ref	reads_all	reads_pp	matches	matches_pp	mismatches	mismatches_pp	deletions	deletions_pp	insertions	insertions_pp	A	A_pp	C	C_pp	T	T_pp	G	G_pp	N	N_pp
CP001509.3	3	C	1	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
CP001509.3	4	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	5	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	6	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	7	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	8	C	4	0	2	0	2	0	0	0	0	0	0	0	2	0	2	0	0	0	0	0
CP001509.3	9	A	4	0	4	0	0	0	0	0	0	0	4	0	0	0	0	0	0	0	0	0
CP001509.3	10	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
CP001509.3	11	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
...
Many thanks to the developer!

Dario

Last edited by dariober; 03-04-2014 at 12:43 AM. Reason: Correct formatting
dariober is offline   Reply With Quote
Old 03-26-2014, 05:06 AM   #16
Giffredo
Member
 
Location: Paris

Join Date: Feb 2014
Posts: 36
Default

I am very interested on this discussion..
Can Pysamstats give me the difference between reds (forward or reverse)?

I need a table with these informations:

Code:
Chr/position/reference_base/read_base (number or %(A,C,G,T) found in forward or reverse strand)/insertion/deletion

Example:

chr1 20 A A:30 C:20 G:0 T:0 a:0 c:0 g:0 t:0 Ins:0 del:0
IGV give me something like this when I put the arrow of the mouse over a read but I need of a table...
Giffredo is offline   Reply With Quote
Old 03-26-2014, 05:19 AM   #17
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Giffredo View Post
I am very interested on this discussion..
Can Pysamstats give me the difference between reds (forward or reverse)?
Use variation_strand instead of variation. From the manual (https://pypi.python.org/pypi/pysamstats)

Quote:
* variation_strand - as variation but with forward/reverse strand counts
dariober is offline   Reply With Quote
Old 04-10-2014, 01:44 AM   #18
Giffredo
Member
 
Location: Paris

Join Date: Feb 2014
Posts: 36
Default

Unfortunately I cannot install pysamstat package..
So, Is it possible to reach the same results using GATK for you?.. looking at the tutorial I think it cannot discriminate the variation in one strand from the other.
Giffredo is offline   Reply With Quote
Old 04-10-2014, 02:06 AM   #19
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by Giffredo View Post
Unfortunately I cannot install pysamstat package..
What problems do you encounter? Installation errors or permissions?
pysamstats depends on numpy and pysam which can be tricky to install but it should be possible.
dariober is offline   Reply With Quote
Old 04-10-2014, 02:12 AM   #20
Giffredo
Member
 
Location: Paris

Join Date: Feb 2014
Posts: 36
Default

I have not any idea.. I cannot install anything in this Lab. there is an expert for this and he told me that is not possible. I guess it is a problem about compatibility..
Giffredo is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO