Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Get igv coverage track

    Hi,

    Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

    Code:
    POS A  C  G  T N I D
    1   50 0  0  0 0 0 0
    2   0  50 0  0 0 0 0
    3   0  25 25 0 0 0 0
    4   50 0  0  0 0 0 0
    5   0  50 0  0 0 0 0
    or alternatively, the total coverage (50 for all POSitions in the above example) with % of each of ACGTNID.

    This data can be found/displayed in IGV (coverage track) as in the screenshot here, and probably exported from IGV, but I want to get at this data with a command line tool, without the gui IGV.



    Other comments:
    - igv count almost does what I want, but I can only seem to load and view its output in IGV.
    - coverageBed also almost does what I want, but I don't think it will report the base composition at each position.
    - I could parse the samtools mpileup output for the particular POSitions I'm interested in, but I was wondering if this has been implemented already somewhere.
    - I also found the thread below, but is quite old
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    Thanks for your help.

    PS I'm new to bioinformatics and apologies if this is trivial.

  • #2
    Originally posted by obk View Post
    Hi,

    Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

    Code:
    POS A  C  G  T N I D
    1   50 0  0  0 0 0 0
    2   0  50 0  0 0 0 0
    3   0  25 25 0 0 0 0
    4   50 0  0  0 0 0 0
    5   0  50 0  0 0 0 0
    or alternatively, the total coverage (50 for all POSitions in the above example) with % of each of ACGTNID.
    Hi- Have a look at bedtools nuc, it should do exactly what you need or get you very close. From the help:
    Code:
    bedtools nuc -h
    
    Tool:    bedtools nuc (aka nucBed)
    Version: v2.16.2
    Summary: Profiles the nucleotide content of intervals in a fasta file.
    
    Usage:   bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
    
    Options: 
    	-fi	Input FASTA file
    
    	-bed	BED/GFF/VCF file of ranges to extract from -fi
    
    	-s	Profile the sequence according to strand.
    
    	-seq	Print the extracted sequence
    
    	-pattern	Report the number of times a user-defined sequence
    			is observed (case-sensitive).
    
    	-C	Igore case when matching -pattern. By defaulty, case matters.
    
    Output format: 
    	The following information will be reported after each BED entry:
    	    1) %AT content
    	    2) %GC content
    	    3) Number of As observed
    	    4) Number of Cs observed
    	    5) Number of Gs observed
    	    6) Number of Ts observed
    	    7) Number of Ns observed
    	    8) Number of other bases observed
    	    9) The length of the explored sequence/interval.
    	    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
    	    11) The number of times a user's pattern was observed.
    	        (opt., if -pattern is used.)
    Dario

    Comment


    • #3
      Hi Dario - thanks for the reply.
      As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.

      Comment


      • #4
        Originally posted by obk View Post
        Hi Dario - thanks for the reply.
        As far as I can tell, this will provide statistics for a window (which I suppose could be one base in length) in a single sequence or multiple sequences separately. What I'm attempting to do is get the base composition (histogram) -- just like the output from bedtools nuc -- but of each base position, and from a number of reads that are aligned to a reference.
        ...I see, sorry I misunderstood your question. Indded bedtools nuc gives you the base composition of the reference fasta, but not of the reads aligned to it. I'll try again...
        What about samtools mpileup? The input/output looks like this (haven't tested, check the syntax, comments are mine):

        Code:
        samtools mpileup -B -Q0 -d100000 myreads.bam -f myreference.fasta
        chr7    3036461 G       2       ,,      I+    ## Reference at this pos is G, there are two reads aligned on the reverse (,,) 
        chr7    3047000 C       2       TT      IG    ## Ref is C, two reads aligned to forward as TT
        chr7    3047001 G       2       ..      HF    ## Ref is G, two reads aligned on forward
        See if this helps... samtools view and mpileup are quite flexible and you can do a lot of filtering on reads and regions of interest.

        Dario

        Comment


        • #5
          Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.

          As mentioned in the original post (screenshot), this data is already viewable in igv... I just wished I could generate and access that data as a text file...

          Thanks for your example though!

          Comment


          • #6
            Originally posted by obk View Post
            Yes, I figured I could parse the mpileup output as you mentioned, but was curious if there was a tool that will generate the table I wanted.
            Ooops... Indeed you mention mpileup in your original post. No, in this case I can't think of anything more direct right now...

            Dario

            Comment


            • #7
              Ok, thanks for your help! You saved me a few hours of searching manuals/docs and forum threads!

              Comment


              • #8
                Hi. I know this is probably too late for you, but I was caught in a similar situation.

                After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...

                Comment


                • #9
                  Originally posted by olorin View Post
                  Hi. I know this is probably too late for you, but I was caught in a similar situation.

                  After some RTFM, it turns out that the info we are after could be exported using "igvtools count". The IGV manual didn't make this very obvious ...
                  Thanks for posting this! That's good to know indeed!
                  Dario

                  Comment


                  • #10
                    Thanks for the reply! I did find igvtools count, but as far as I can tell it doesn't output the result in non-binary format...

                    From the manual (http://www.broadinstitute.org/igv/ig..._commandline):

                    Code:
                    Usage:
                              igvtools count [options] [inputFile] [outputFile] [genome]
                    
                    Required arguments:
                              inputFile    The input file (see supported formats above).
                              [B][COLOR="Red"]outputFile   Binary output file[/COLOR][/B].  Must end in ".tdf" or ".wig".  To indicate that you want to
                                                 output both a .tdf and a .wig file, list both output filenames as a single string,
                                                 separated by a comma with no other delimiters.  To display feature
                                                 intensity in IGV, the density must be computed with this command, and the
                                                 resulting file must be named <feature track filename>.tdf.
                    ... unless I'm still missing something...

                    I still want to implement this feature in my script (though low in priority), so if you/anyone knows how to do it I'd love to know!

                    Comment


                    • #11
                      The outputFile doesn't need to be binary .tdf. You can give a file name like myfile.wig, Wig file is a text file.

                      Comment


                      • #12
                        I am interested in this conversation..
                        I d like to reach the number of reads that map a selected base on my reference genome to understand if that site is or not an editing site (a changed from C to T).

                        I tried the command:

                        Code:
                        igvtools count -z 5 -w 1 --strands [read] --bases  sorted.bam sorted.bam.wig reference.fas &
                        I don t know what -z in the practise want to mean... anyway do you think that this can work for reaching my purpose?

                        Comment


                        • #13
                          z specifies the max zoom level and you do not need [] around 'read'

                          Comment


                          • #14
                            OK.. so which is the difference between z and w?

                            Comment


                            • #15
                              Originally posted by obk View Post
                              Hi,

                              Are there any tools that can output the per-base base composition in a non-binary format? I'm looking for something like this:

                              Code:
                              POS A  C  G  T N I D
                              1   50 0  0  0 0 0 0
                              2   0  50 0  0 0 0 0
                              3   0  25 25 0 0 0 0
                              4   50 0  0  0 0 0 0
                              5   0  50 0  0 0 0 0
                              Hello,

                              As mentioned in previous posts, igvtools count does the job. However, I just came across this tool pysamstats which, among many other things, can output matches and mismatches per position, including indels. E.g.:

                              Code:
                              pysamstats --fasta myref.fa --type variation myaln.bam
                              Which gives a nice tabular file:

                              Code:
                              chrom	pos	ref	reads_all	reads_pp	matches	matches_pp	mismatches	mismatches_pp	deletions	deletions_pp	insertions	insertions_pp	A	A_pp	C	C_pp	T	T_pp	G	G_pp	N	N_pp
                              CP001509.3	3	C	1	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
                              CP001509.3	4	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              CP001509.3	5	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              CP001509.3	6	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              CP001509.3	7	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              CP001509.3	8	C	4	0	2	0	2	0	0	0	0	0	0	0	2	0	2	0	0	0	0	0
                              CP001509.3	9	A	4	0	4	0	0	0	0	0	0	0	4	0	0	0	0	0	0	0	0	0
                              CP001509.3	10	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              CP001509.3	11	T	4	0	4	0	0	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0
                              ...
                              Many thanks to the developer!

                              Dario
                              Last edited by dariober; 03-04-2014, 12:43 AM. Reason: Correct formatting

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X