Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to calculate average gap size

    After completing an alignment to a reference how do I calculate the average size of the gaps between contigs? I've looked at HTSeq but can't figure out if this program does this. Any other suggestions besides writing a custom script?

    Thanks in advance!

  • #2
    Can you be a bit more specific.
    Which contigs are you talking about - the reference? i.e. you want to use paired end sequences that map to different reference contigs to calculate the size of the gaps between your reference contigs?
    Or do you mean you just want to calculate the average gap between mapped paired reads over all reference sequences?

    Comment


    • #3
      Originally posted by Graham Etherington View Post
      Or do you mean you just want to calculate the average gap between mapped paired reads over all reference sequences?
      Yes the gaps between the mapped reads over all the reference sequence. Ideally I would like to see a histogram or distribution of the gap sizes as well.
      Sorry if I wasn't clear before.

      Thanks!

      Comment


      • #4
        For each read, the distance to the mate right there in the .sam file. So you could use awk or something to pull that out.

        I use bwa to align and make same files, and it tells me what the average is as it makes the .sam file, though it doesn't seem to save that info anywhere.

        Comment


        • #5
          I was hoping to avoid making a custom script. I would have thought this was common enough that someone has made it easier to get. CLCgenomics calculates it I think, but of course that program is very expensive.

          Comment


          • #6
            You should use Picard for this:


            The method 'CollectInsertSizeMetrics' reads a SAM or BAM file and writes a file containing metrics about the statistical distribution of insert size (excluding duplicates) and generates a histogram plot.

            Cheers,
            Graham

            Comment


            • #7
              Thanks I'll check out picard but I don't see a way to calculate gap size. Am I missing something?

              Originally posted by Graham Etherington View Post
              You should use Picard for this:


              The method 'CollectInsertSizeMetrics' reads a SAM or BAM file and writes a file containing metrics about the statistical distribution of insert size (excluding duplicates) and generates a histogram plot.

              Cheers,
              Graham

              Comment


              • #8
                Originally posted by odoyle81 View Post
                Am I missing something?
                Yes.
                The gap size (or 'insert size' to give it the correct name) is included in the OUTPUT file of CollectInsertSizeMetrics. Amongst other statistical information, the following is given in relation to insert size:
                MEDIAN_INSERT_SIZE
                MEDIAN_ABSOLUTE_DEVIATION
                MIN_INSERT_SIZE
                MAX_INSERT_SIZE
                MEAN_INSERT_SIZE
                STANDARD_DEVIATION
                READ_PAIRS
                PAIR_ORIENTATION

                Cheers,
                Graham

                Comment


                • #9
                  Thanks for your help - I'm new to this stuff and still learning the right words.
                  I'm getting an error when I try to run this on a sorted bam file. Maybe I should continue this discussion via the picard help list, but I thought since you are being some helpful maybe you would know what the error means:

                  All data categories were discarded because they contained < 0.05 of the total aligned paired data
                  the bam file that I'm trying to process is illumina reads aligned to a reference genome. Here is the full output. It doesn't output any files either. I read about the data categories on the picard website but I don't understand what the data categories are?

                  java -Xmx2g -jar CollectInsertSizeMetrics.jar HISTOGRAM_FILE=picard_stats INPUT=../../clc\ output/all_mutants_combined_sorted.bam OUTPUT=picard_out
                  [Wed Sep 07 14:43:02 MDT 2011] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=picard_stats INPUT=../../clc output/all_mutants_combined_sorted.bam OUTPUT=picard_out DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
                  [Wed Sep 07 14:43:02 MDT 2011] Executing as paul on Mac OS X 10.5.8 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_26-b03-384-9M3425
                  WARNING 2011-09-07 14:43:02 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
                  INFO 2011-09-07 14:43:05 SinglePassSamProgram Processed 1,000,000 records.
                  INFO 2011-09-07 14:43:07 SinglePassSamProgram Processed 2,000,000 records.
                  INFO 2011-09-07 14:43:08 SinglePassSamProgram Processed 3,000,000 records.
                  INFO 2011-09-07 14:43:10 SinglePassSamProgram Processed 4,000,000 records.
                  INFO 2011-09-07 14:43:12 SinglePassSamProgram Processed 5,000,000 records.
                  INFO 2011-09-07 14:43:14 SinglePassSamProgram Processed 6,000,000 records.
                  INFO 2011-09-07 14:43:15 SinglePassSamProgram Processed 7,000,000 records.
                  INFO 2011-09-07 14:43:17 SinglePassSamProgram Processed 8,000,000 records.
                  INFO 2011-09-07 14:43:19 SinglePassSamProgram Processed 9,000,000 records.
                  INFO 2011-09-07 14:43:21 SinglePassSamProgram Processed 10,000,000 records.
                  INFO 2011-09-07 14:43:22 SinglePassSamProgram Processed 11,000,000 records.
                  INFO 2011-09-07 14:43:24 SinglePassSamProgram Processed 12,000,000 records.
                  INFO 2011-09-07 14:43:26 SinglePassSamProgram Processed 13,000,000 records.
                  INFO 2011-09-07 14:43:27 SinglePassSamProgram Processed 14,000,000 records.
                  INFO 2011-09-07 14:43:29 SinglePassSamProgram Processed 15,000,000 records.
                  INFO 2011-09-07 14:43:31 SinglePassSamProgram Processed 16,000,000 records.
                  INFO 2011-09-07 14:43:33 SinglePassSamProgram Processed 17,000,000 records.
                  INFO 2011-09-07 14:43:35 SinglePassSamProgram Processed 18,000,000 records.
                  INFO 2011-09-07 14:43:36 SinglePassSamProgram Processed 19,000,000 records.
                  INFO 2011-09-07 14:43:38 SinglePassSamProgram Processed 20,000,000 records.
                  INFO 2011-09-07 14:43:40 SinglePassSamProgram Processed 21,000,000 records.
                  INFO 2011-09-07 14:43:41 SinglePassSamProgram Processed 22,000,000 records.
                  INFO 2011-09-07 14:43:43 SinglePassSamProgram Processed 23,000,000 records.
                  INFO 2011-09-07 14:43:45 SinglePassSamProgram Processed 24,000,000 records.
                  INFO 2011-09-07 14:43:47 SinglePassSamProgram Processed 25,000,000 records.
                  INFO 2011-09-07 14:43:48 SinglePassSamProgram Processed 26,000,000 records.
                  INFO 2011-09-07 14:43:50 SinglePassSamProgram Processed 27,000,000 records.
                  INFO 2011-09-07 14:43:52 SinglePassSamProgram Processed 28,000,000 records.
                  WARNING 2011-09-07 14:43:52 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data.
                  WARNING 2011-09-07 14:43:52 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0
                  [Wed Sep 07 14:43:52 MDT 2011] net.sf.picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.84 minutes.
                  Runtime.totalMemory()=85000192
                  Last edited by odoyle81; 09-07-2011, 03:05 PM. Reason: removed computer specific information

                  Comment


                  • #10
                    There are a couple of things going on here. Firstly CollectInsertSizeMetrics thinks the BAM file isn't sorted, so you could try resorting it with samtools sort (and of course samtools index on the new sorted bam file). That might fix things.
                    But it might also be simply that none of your reads map to the reference (as pairs anyway). You could knock up a quick Perl script to check and see if the reads are paired or use Picard CollectAlignmentSummaryMetrics using ASSUME_SORTED=false.

                    The following Perl script might also be of help:
                    use strict;
                    use warnings;
                    use Bio:B::Sam;

                    my $bamfile = shift or die; #which bamfile to examine
                    my $fastaref = shift or die; #the fasta reference

                    #open the bam file
                    my $sam = Bio:B::Sam->new(-bam => $bamfile, -fasta => $fastaref);

                    my $mapped_pairs = 0;

                    #get paired reads
                    my @pairs = $sam->features(-type=>'read_pair');

                    foreach my $p (@pairs)
                    {
                    my @ends = $p->get_SeqFeatures;
                    #test to see if the left and right reads are unmapped...
                    my $left_unmapped = $ends[0]->unmapped;
                    my $right_unmapped = $ends[1]->unmapped;
                    #..if they're not increment mapped_pairs by one
                    if ($left_unmapped == 0 && $right_unmapped == 0)
                    {
                    $mapped_pairs++;
                    }
                    }

                    print "Total paired reads = $mapped_pairs\n";

                    Comment


                    • #11
                      Picard tools CollectAlignmentSummaryMetrics

                      Hi,

                      I have a output SAM file from SOAP deNovo.When I use Validatesam in picard tools it gives me a error saying " not enough fields". I had the same error while using CollectAlignmentSummaryMetrics. I am I missing any filed in the input sam file??? Here is the error :

                      java -jar /opt/picard-tools-1.58/CollectAlignmentSummaryMetrics.jar I=SOAP_300_49_ctg_SAM O= summary AS= true VALIDATION_STRINGENCY= LENIENT

                      [Fri Jan 20 10:48:53 MST 2012] net.sf.picard.analysis.CollectAlignmentSummaryMetrics done. Elapsed time: 0.00 minutes.
                      Runtime.totalMemory()=2026569728
                      Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File SOAP_300_49_ctg_SAM; Line 1
                      Line: HWI-ST967_66:1:1101:1126:2182#0/2 + 2038511 2569 TTCCGTGGGTGAACGGTTAGTTCATTTCACGGCATGGGCGTCTAAGCGTGTCCTATACAATGGTGTTGTGCCACGTGGGGAACGACTGNNNATTTGCCCA @@@FFFDFF<DDHIJIHIJJHHHHIIJDIJJGIJJIIJHHHHIIIIIJJGF?EHFFFEFFECE>C=?A;15@:@B?BDDBBBDD9@B@###++2<@ACCC 0 88:A>N,89:T>N,90:T>N
                      at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(SAMTextReader.java:223)
                      at net.sf.samtools.SAMTextReader.access$400(SAMTextReader.java:40)
                      at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:315)
                      at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:278)
                      at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:250)
                      at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:641)
                      at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:619)
                      at net.sf.picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:109)
                      at net.sf.picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:54)
                      at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
                      at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
                      at net.sf.picard.analysis.CollectAlignmentSummaryMetrics.main(CollectAlignmentSummaryMetrics.java:111)

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      24 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      25 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      22 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X