Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem sequencing Hapmap Exome - no variant found!

    Hello,
    I recently started to work with exome. To test my competency, I aligned exome from the HapMap project using the following commands:

    first i downloaded SRR292250.sra form SRA

    $ fastqdump SRR292250.sra SRR292250.fastq
    $ bwa aln -t 8 SRR292250.fastq > SRR292250.sai
    $ bwa samse -r "@RG\tID:IDa\tSM:SM\tPL:Illumina" gatk/ensembl_g37/human_g1k_v37.fasta SRR292250.sai SRR292250.fastq > SRR292250.sam
    $java -Xmx4g -Djava.io.tmpdir=/tmp -jar picard/SortSam.jar SO=coordinate INPUT=SRR292250.sam OUTPUT=SRR292250.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

    I continued with the GATK pipeline to call variants. But it didn't work. I think my problem is in my alignment. Below are the results from flagstat and idx stat. Do they look correct?

    $ samtools flagstat SRR292250.bam
    85493722 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1680 + 0 mapped (0.00%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    $ samtools idxstats SRR292250.bam
    1 249250621 22 0
    2 243199373 1332 0
    3 198022430 26 0
    4 191154276 30 0
    5 180915260 23 0
    6 171115067 24 0
    7 159138663 9 0
    8 146364022 20 0
    9 141213431 19 0
    10 135534747 9 0
    11 135006516 10 0
    12 133851895 20 0
    13 115169878 21 0
    14 107349540 16 0
    15 102531392 5 0
    16 90354753 16 0
    17 81195210 10 0
    18 78077248 5 0
    19 59128983 2 0
    20 63025520 7 0
    21 48129895 8 0
    22 51304566 3 0
    X 155270560 40 0
    Y 59373566 3 0
    MT 16569 0 0
    GL000207.1 4262 0 0
    ...
    * 0 0 85492042
    Is it correct?
    if you want i can also give the results from the coverage.

    any hints?
    thank you very much for your help.
    tuka.
    Last edited by aituka; 03-24-2012, 04:25 AM.

  • #2
    You have 85 million reads, and only 1600 mapped. That's obviously your problem. Maybe these reads aren't human?

    Comment


    • #3
      Code:
      $ bwa aln -t 8 SRR292250.fastq > SRR292250.sai
      You don't seem to have specified a reference genome for BWA aln here.

      If that was just a typo in your post and you did specify a reference genome when you ran it, could it possibly be an error with the bwa index? BWA changed its index format with version 0.6.0, so if you're running 0.6.0+ with a pre-0.6.0 index, that might cause something weird like this. I've never tried it myself though, so I've no idea what actually happens when you do that.

      Comment


      • #4
        Thanks for your hint Rocketknight.
        First, as you suggested, yes it was a typo: i provided the reference genome.

        But as you said, i might have indexed with a different version of BWA. so i reindexed the reference genome with BWA 0.6.1. however, I don't get the right files:

        $ bwa index -a bwtsw human_g1k_v37.fasta

        give me:
        human_g1k_v37.fasta human_g1k_v37.fasta.ann human_g1k_v37.fasta.pac
        human_g1k_v37.fasta.amb human_g1k_v37.fasta.bwt human_g1k_v37.fasta.sa

        I don't know why, but there is no .rbwt file.
        did i missed something?
        I will check with BWA version 0.6.0
        thx
        tuka.

        Comment


        • #5
          I think the RBWT file may no longer be created in the new index format. You should be able to continue with aln or samse without it.

          Comment


          • #6
            Hi,
            i re-apply the pipeline
            First, download of one HUMAN exome: from http://sra.dnanexus.com/studies/SRP007298 , http://dnanexus-sra.commondatastorag...92250.fastq.gz

            second i installed bwa6.1 and I realigned my reference genome
            bwa index -a bwtsw human_g1k_v37.fasta
            then i computed the bam:
            bwa aln -t 8 /gatk/ensembl_g37/bwa6.1/human_g1k_v37.fasta SRR292250.fastq > SRR292250.sai

            bwa samse -r "@RG\tID:IDa\tSM:SM\tPL:Illumina" /gatk/ensembl_g37/bwa6.1/human_g1k_v37.fasta SRR292250.sai SRR292250.fastq >SRR292250.sam

            java -Xmx4g -Djava.io.tmpdir=/tmp -jar /home/insilico/nextgen/picard-tools-1.62/SortSam.jar SO=coordinate INPUT=SRR292250.sam OUTPUT=SRR292250.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
            Still, I obtain the same result:
            $samtools flagstat SRR292250.bam
            85493722 + 0 in total (QC-passed reads + QC-failed reads)
            0 + 0 duplicates
            1680 + 0 mapped (0.00%:-nan%)
            0 + 0 paired in sequencing
            0 + 0 read1
            0 + 0 read2
            0 + 0 properly paired (-nan%:-nan%)
            0 + 0 with itself and mate mapped
            0 + 0 singletons (-nan%:-nan%)
            0 + 0 with mate mapped to a different chr
            0 + 0 with mate mapped to a different chr (mapQ>=5)
            everything is done with bwa6.1

            Maybe the fastq is not in the good format, below are the first reads
            @SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y length=100
            GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAGGGTCTTGAACTCCTGGTTTGCACATCTAAGGGGCTGCGACAGGACCTGTG
            +SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y length=100
            HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGGHHFHHHHHHHHHHHHHHHDHHEHHGHHHHHHHHHHHHBCHCFFEFFFHFG
            @SRR292250.2 80DYAABXX_000168:7:1:10000:103139:Y length=100
            ACCTGCAGGAACTCTCCAGTGGACTGGCCGTAGCGGTGCACGCTATGCTCGGTCATGGTGTAAGCATTTTCTCAAGCTATTTTCCTTCTTGCCTCATCTG
            +SRR292250.2 80DYAABXX_000168:7:1:10000:103139:Y length=100
            GHHHHHHHHGHHHGHHHHHH?HEGHHHHHGEFBFFHFHBHEFB6F;DADAHGHHHGGGEGFFGGGHHHHHHHHBHHHHHEHHHHHHHHDHFEHHHFHHHH
            @SRR292250.3 80DYAABXX_000168:7:1:10000:104494:Y length=100
            GAAGAGACCATCATCTTTGCATAAACGAAGTGGCACTGATACATCATTCTATTGCTTTGTTTTGGAGTCCCCCTTTTCTTCTTTCCTCTATAGAATGATG
            +SRR292250.3 80DYAABXX_000168:7:1:10000:104494:Y length=100
            HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHGHHHGHHFHFHFHHHHHEHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHGGHHHHHHHHH
            @SRR292250.4 80DYAABXX_000168:7:1:10000:105301:Y length=100
            AGAGAAGGCACCGGGCTCACGGAGGGCTGCCTGGTGTCCGTGGACGACTTACTGCTGGCTCTGCGGCGGCCAGTACTGCAAGAATGTGTCCACGTCCACC
            +SRR292250.4 80DYAABXX_000168:7:1:10000:105301:Y length=100
            GHHHHHHHHGHHHHHHHHHHGHHHHFHHFHHHFHEEFBGGFFGFHHHGFHHHHHHHHHHHHHHFHFHHHAC9A5?8.:<9E<AD*;391;==A5=<>.6B
            @SRR292250.5 80DYAABXX_000168:7:1:10000:107861:Y length=100
            GCCTTTTTTTCCTTTTATCTGTGGTTTAAATTCCTAAGTTTGTATGTCTCAAGTGAGCATAGCAAGTATATTTTAAAACATGTTCATTTCAGTCGTCACA
            +SRR292250.5 80DYAABXX_000168:7:1:10000:107861:Y length=100
            HFHHHHHHHHHHHHHGHHHHHHFHHHHFHHHHHFDFEHEHHGFAHGFDCFHHHHHHHHHHHHHFHHHHHHHHHHHEGHGHHHHHGHHFGHGHGBHGGBG@
            any idea?
            thanks
            tuka.

            Comment


            • #7
              <duplicate error>
              Last edited by aituka; 03-26-2012, 12:00 PM.

              Comment


              • #8
                I threw the first 250000 reads from that file at bwa on my own system and only 2 of them mapped to the human genome for me. The read format looks fine, but something is definitely wrong with the reads. Possibly untrimmed barcode sequences? Or they could be mislabelled results from another run. Either way, this issue isn't your fault.

                If you're looking for proper exome fastq files to test your pipeline on, try:

                ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (1st reads of pair)

                ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (2nd reads of pair)

                ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (unpaired reads)

                Comment


                • #9
                  thanks for your answer Rocketknight.
                  You were right. It seems that the list of command I am executing are ok. The problem was with the fastq coming from the Hapmap SRP007298.
                  This leads to two I have two questions.

                  1/First what do you mean by "Possibly untrimmed barcode sequences". Is there a way to check if the barcode are trimmed, and if not, an algorithm to trim them?
                  In fact I was suspecting that problem and I tested the same pipeline to samples coming from two other datasets form SRA: SRR004076 and SRP008162 (both coming from SRA). Statistics from samtools flagstat gives me respectively '0 mapped (0.00%)' and '115 + 0 mapped (0.00%:-nan%)'. What is wrong?

                  2/ I checked the samples you suggested and indeed it do work. I tried with the .filt.fastq.gz file coming form the 1000 genome project and with the fastq file coming from sra (SRR231270 coming from SRP004061) which is not filtered. My question is: what is that filtering doing? results from flagstat are without filtering:
                  148318736 in total
                  0 QC failure
                  0 duplicates
                  141922418 mapped (95.69%)
                  148318736 paired in sequencing
                  74159368 read1
                  74159368 read2
                  139291198 properly paired (93.91%)
                  141011182 with itself and mate mapped
                  911236 singletons (0.61%)
                  1556783 with mate mapped to a different chr
                  1448948 with mate mapped to a different chr (mapQ>=5)
                  with .filt.
                  146907704 in total
                  0 QC failure
                  0 duplicates
                  140747864 mapped (95.81%)
                  146907704 paired in sequencing
                  73453852 read1
                  73453852 read2
                  138143478 properly paired (94.03%)
                  139853076 with itself and mate mapped
                  894788 singletons (0.61%)
                  1547655 with mate mapped to a different chr
                  1440436 with mate mapped to a different chr (mapQ>=5)
                  3/ additional question: any idea why sometimes flagstat's output has 2 values per line and somtimes just one
                  eg:
                  85493722 + 0 in total (QC-passed reads + QC-failed reads)
                  0 + 0 duplicates
                  1680 + 0 mapped (0.00%:-nan%)

                  vs

                  146907704 in total
                  0 QC failure
                  0 duplicates
                  140747864 mapped (95.81%)
                  thank you for any hints
                  tuka.

                  Comment


                  • #10
                    Trimming barcodes/indexes requires you to know their length and position in the reads. For standard Illumina primers, the indexes won't be within the normal reads and will instead be read with a different index read primer, so trimming is unneccessary, unless your insert size is so small that you read through the entire insert and out into the adapter at the other side. For people making their own barcodes, it will differ depending on the exact barcode structure used.

                    As for what the difference between the .filt and the unfiltered files is, I'm afraid I don't actually know.
                    Last edited by Rocketknight; 03-27-2012, 06:14 AM.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    30 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    32 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X