Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • No reads mapped using BWA SOLiD

    I am trying to troubleshoot why no reads were mapped in my call to BWA SoLiD in Galaxy. I also posted about this on Biostar here (https://biostar.usegalaxy.org/p/7444/#7459).

    Basically, below is the background of my data
    1) Downloaded .sra file.
    2) Used fastq-dump to convert .sra file to .fastq file
    3) Only kept 1/10 sequences from .fastq file (it was too large).
    4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
    5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
    6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file

    I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

    Jen, a member from Galaxy, noted: "It appears that your mapping run did not produce any hits. Look further down in the SAM file (past the @SQ lines). You can also generate general counts in Galaxy: run a tool from the SAMTools or Picard group such as 'flagstat' or 'SAM/BAM Alignment Summary Metrics'. From there, you may need to adjust the BWA setting to capture hits. If I remember correctly, the sequence fragments in your mrna.fa custom reference genome were quite short."

    So, I ran both of her suggestions, and these were the outputs:

    1) I ran "flagstat" on one of my BAM files, and it gave me an 11-line output:

    6263972 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    406 + 0 mapped (0.01%:-nan%)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (-nan%:-nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (-nan%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    2) Below is the output "log" of the SAM/BAM Alignment Summary Metrics on a BAM file (I bolded a line that does not look too good):

    INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CreateSequenceDictionary.jar REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=None returned status 1 and stderr:
    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
    [Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=false NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Mon May 05 10:49:41 CDT 2014] Executing as [email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_40-b43
    [Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=2025324544
    Exception in thread "main" net.sf.samtools.SAMException: Sequence name contains invalid character: AF001540 1
    at net.sf.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:80)
    at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceRecord(CreateSequenceDictionary.java:147)
    at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:138)
    at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
    at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)


    INFO:root:## executing samtools sort /galaxy-repl/main/files/008/119/dataset_8119150.dat dataset_8119150.dat.sorted returned status 0 and nothing on stderr

    INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt R=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta TMP_DIR=/galaxy/main/scratch INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam returned status 0 and nothing on stderr



    And the output "txt" of the same program:

    ## net.sf.picard.metrics.StringHeader
    # net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=[/galaxy/main/scratch] VALIDATION_STRINGENCY=LENIENT METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    ## net.sf.picard.metrics.StringHeader
    # Started on: Mon May 05 10:50:52 CDT 2014

    ## METRICS CLASS net.sf.picard.analysis.AlignmentSummaryMetrics
    CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP
    UNPAIRED 6263972 6263972 1 0 406 0.000065 19883 63 3078 2442 0 0.00176 0.004873 0.001257 49.999925 0 0 0 0.578818 0 0



    The seq fragments in my mrna.fa reference genome varied in length. In my .fastq files, they were 50 nt.


    I have no idea how to adjust my BWA to capture this, though. Could anyone help me with how to interpret this? I guess there is no solution but to remap (but if there is a faster solution, I am going for faster time right now), but what should I change in my BWA SOLiD to remap , and hope to get some hits this time?

    Many thanks...

  • #2
    In case it is useful, below is the head of some of my files:

    Below is the head output of some of my files:

    hg19.gtf:

    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";


    mrna.fasta:

    >AF001540 1
    ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
    ctgatcccgctgctattagaatgcattgtgaaacgactggagtatgatta
    aaagttgtgttccccaatgcttggagtagtgattgttgaaggaaaaaatc
    cagctgagtgataaggctgagtgttgaggaaatttctgcagttttaagca
    gtcgtatttgtgattgaagctgagtacatttgctggtgtatttttaggta
    aaatgcttttttgttcatttctgggtggtgggaggggactgaagccttta
    gtcttttccagatgcaaccttaaaatcagtgacaagaaacattccaaaca
    agcaacagtcttcaagaaattaaactggcaagtggaaatgtttaaacagt
    tcagtgatctttagtgcattgtttatgtgtgggtttctctctcccctccc


    FourA.sam:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    FourA.count (head and tail)

    head:
    NM_000014 0
    NM_000015 0
    NM_000016 0
    NM_000017 0
    NM_000018 0
    NM_000019 0
    NM_000020 0
    NM_000021 0
    NM_000022 0
    NM_000023 0

    tail:
    NR_046431 0
    NR_046432 0
    NR_046433 0
    NR_046435 0
    NR_046436 0
    no_feature 257
    ambiguous 0
    too_low_aQual 0
    not_aligned 6366159
    alignment_not_unique 0

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM
    • seqadmin
      Strategies for Sequencing Challenging Samples
      by seqadmin


      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
      03-22-2024, 06:39 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    27 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    31 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 09:21 AM
    0 responses
    27 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-04-2024, 09:00 AM
    0 responses
    52 views
    0 likes
    Last Post seqadmin  
    Working...
    X