  • #16
    The sorted and unsorted BAM files are the same size:

    Code:
    -bash-4.1$ pwd
    /usit/abel/u1/maxib/1_data/1_project/1st_assembly_strategy
    -bash-4.1$ du -sh *
    7,0G	1_align.sam
    84K	chrysanthemum_indicum_chloroplast.fasta
    3,5K	chrysanthemum_indicum_chloroplast.fasta.fai
    15G	contig.fa
    1,8G	file.bam
    1,8G	file_sorted.bam
    6,5K	file_sorted.bam.bai
    0	file.vcf.gz
    0	out.fa
    512	sam.sh
    6,3G	scafseq.fa
    1,5K	test.vcf.gz
    0	vcffile
    Running mpileup manually produces the same error:

    Code:
    -bash-4.1$ /usit/abel/u1/maxib/8_samtools/bin/samtools mpileup  -v -f chrysanthemum_indicum_chloroplast.fasta file_sorted.bam -o file.vcf.gz 
    [mpileup] 1 samples in 1 input files
    <mpileup> Set max per-file depth to 8000
    Abandon



    • #17
      By any chance, do you have extremely deep coverage (> 8000)? That is a small genome, yet the resulting BAM file is large.



      • #18
        Well, I'm working with WGS data and, for the moment, only extracting the chloroplast genome.
        I have 307,210,727 reads of mean length 151, which equals 46,696,030,504 base pairs.
        The chloroplast reference I mapped them to is 86,444 bp.
        So the coverage is around 540,188x...
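
For anyone checking the arithmetic, here is a minimal sketch of the depth estimate (reads × mean read length ÷ reference length). The function name `expected_depth` is made up for illustration. The base-pair total quoted above works out to 152 bp per read rather than 151, so both values are shown. Either way the figure is an upper bound, because it assumes every WGS read maps to the chloroplast.

```shell
# Expected depth if every read mapped to the reference:
#   depth = reads * mean_read_length / reference_length
# Usage: expected_depth READS MEAN_LEN REF_LEN   (hypothetical helper)
expected_depth() {
    awk -v n="$1" -v len="$2" -v ref="$3" \
        'BEGIN { printf "%.0f\n", n * len / ref }'
}

expected_depth 307210727 151 86444
expected_depth 307210727 152 86444
```

With the numbers from this post, the two calls print 536634 and 540188, both far above the per-file depth of 8000 that mpileup reports.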

        Well, I think you found the problem! Thanks for your help. I'll randomly subsample my FASTQ files 1000-fold before alignment.
        PS: for anyone who might be interested, here is a script to do it:

        Code:
        # Written by  Aaronquinlan
        # https://www.biostars.org/p/6544/
        # Starting FASTQ files
        export FQ1=1.fq
        export FQ2=2.fq
        
        # The names of the random subsets you wish to create
        export FQ1SUBSET=1.rand.fq
        export FQ2SUBSET=2.rand.fq
        
        # How many random pairs do we want?
        export N=100
        
        # Paste the two FASTQ files so that the
        # headers, seqs, seps, and quals of each pair sit next to one another
          paste "$FQ1" "$FQ2" | \
        # "Linearize" the two mates into a single tab-separated record, prefixed with a random number
          awk 'BEGIN{srand()}; {OFS="\t"; \
                                getline seqs; getline sep; getline quals; \
                                print rand(),$0,seqs,sep,quals}' | \
        # Sort by the random number
          sort -k1,1 | \
        # Grab the first N records
          head -n "$N" | \
        # Convert the stream back to 2 separate FASTQ files (split on tabs only, so headers with spaces stay intact; existing subset files are overwritten)
          awk 'BEGIN{FS="\t"; OFS="\n"} { \
                print $2,$4,$6,$8 > ENVIRON["FQ1SUBSET"]; \
                print $3,$5,$7,$9 > ENVIRON["FQ2SUBSET"]}'
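
If an approximate subset size is acceptable, a 1000-fold subsample can also be sketched in a single pass without the sort, by keeping each pair with a fixed probability. This is a different technique from the script above, which returns exactly N pairs. The function name `subsample_pairs` and its argument order are hypothetical, and the sketch assumes 4-line FASTQ records that contain no tab characters.

```shell
# Keep each read pair with probability FRACTION in one pass (no sort needed).
# Usage: subsample_pairs FQ1 FQ2 OUT1 OUT2 FRACTION SEED   (hypothetical helper)
subsample_pairs() {
    paste "$1" "$2" |
    awk -F '\t' -v out1="$3" -v out2="$4" -v p="$5" -v seed="$6" '
        # Seed the generator and create (or truncate) both output files.
        BEGIN { srand(seed); printf "" > out1; printf "" > out2 }
        # Draw once per 4-line record so both mates share the decision.
        NR % 4 == 1 { keep = (rand() < p) }
        # Column 1 is the line from FQ1, column 2 the matching line from FQ2.
        keep { print $1 > out1; print $2 > out2 }
    '
}

# Example call for a roughly 1000-fold subsample with a fixed seed:
#   subsample_pairs 1.fq 2.fq 1.rand.fq 2.rand.fq 0.001 42
```

Passing the same seed reproduces the same subset, which makes the downstream alignment easier to rerun.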
        Last edited by MaximeOfOslo; 06-18-2015, 07:15 AM.



        • #19
          540,000x

          Reformat.sh from BBMap can also subsample directly from sam/bam.

          This can save you some alignment time.
          Code:
          $ reformat.sh in=in.sam out=out.sam sample=some_number_here



          • #20
            Picard tools can also randomly downsample a .bam file.

            And the cheesy way to do it yourself would be to use awk or grep to grab only the reads from a particular tile; one not on the edge would be preferable.
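
A rough sketch of the tile trick on SAM text, assuming Illumina read names of the form `instrument:run:flowcell:lane:tile:x:y`, so that the tile is the 5th colon-separated field. Older name formats put it elsewhere. The function name `filter_tile` and the tile number in the example are placeholders, and the resulting subset is not a uniform random sample.

```shell
# Keep SAM header lines plus reads whose name carries the given tile.
# Assumes names like instrument:run:flowcell:lane:tile:x:y (tile = field 5).
# Usage: filter_tile TILE < in.sam > out.sam   (hypothetical helper)
filter_tile() {
    awk -F '\t' -v tile="$1" '
        /^@/ { print; next }                 # pass header lines through
        { split($1, name, ":") }             # QNAME is the first SAM column
        (name[5] "") == (tile "")            # print reads from the chosen tile
    '
}

# Example with a placeholder tile number:
#   samtools view -h file_sorted.bam | filter_tile 1101 > tile_1101.sam
```

How much this reduces the data depends on how many tiles the run has, so check the depth again afterwards.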
