I am trying to troubleshoot why no reads were mapped in my call to BWA SoLiD in Galaxy. I also posted about this on Biostar here (https://biostar.usegalaxy.org/p/7444/#7459).
Basically, below is the background of my data
1) Downloaded .sra file.
2) Used fastq-dump to convert .sra file to .fastq file
3) Only kept 1/10 sequences from .fastq file (it was too large).
4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file
I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)
Jen, a member from Galaxy, noted: "It appears that your mapping run did not produce any hits. Look further down in the SAM file (past the @SQ lines). You can also generate general counts in Galaxy: run a tool from the SAMTools or Picard group such as 'flagstat' or 'SAM/BAM Alignment Summary Metrics'. From there, you may need to adjust the BWA setting to capture hits. If I remember correctly, the sequence fragments in your mrna.fa custom reference genome were quite short."
So, I ran both of her suggestions, and these were the outputs:
1) I ran "flagstat" on one of my BAM files, and it gave me an 11-line output:
6263972 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
406 + 0 mapped (0.01%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
2) Below is the output "log" of the SAM/BAM Alignment Summary Metrics on a BAM file (I bolded a line that does not look too good):
INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CreateSequenceDictionary.jar REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=None returned status 1 and stderr:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=false NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon May 05 10:49:41 CDT 2014] Executing as [email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_40-b43
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2025324544
Exception in thread "main" net.sf.samtools.SAMException: Sequence name contains invalid character: AF001540 1
at net.sf.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:80)
at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceRecord(CreateSequenceDictionary.java:147)
at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:138)
at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)
INFO:root:## executing samtools sort /galaxy-repl/main/files/008/119/dataset_8119150.dat dataset_8119150.dat.sorted returned status 0 and nothing on stderr
INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt R=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta TMP_DIR=/galaxy/main/scratch INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam returned status 0 and nothing on stderr
And the output "txt" of the same program:
## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=[/galaxy/main/scratch] VALIDATION_STRINGENCY=LENIENT METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon May 05 10:50:52 CDT 2014
## METRICS CLASS net.sf.picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP
UNPAIRED 6263972 6263972 1 0 406 0.000065 19883 63 3078 2442 0 0.00176 0.004873 0.001257 49.999925 0 0 0 0.578818 0 0
The seq fragments in my mrna.fa reference genome varied in length. In my .fastq files, they were 50 nt.
I have no idea how to adjust my BWA to capture this, though. Could anyone help me with how to interpret this? I guess there is no solution but to remap (but if there is a faster solution, I am going for faster time right now), but what should I change in my BWA SOLiD to remap , and hope to get some hits this time?
Many thanks...
Basically, below is the background of my data
1) Downloaded .sra file.
2) Used fastq-dump to convert .sra file to .fastq file
3) Only kept 1/10 sequences from .fastq file (it was too large).
4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file
I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)
Jen, a member from Galaxy, noted: "It appears that your mapping run did not produce any hits. Look further down in the SAM file (past the @SQ lines). You can also generate general counts in Galaxy: run a tool from the SAMTools or Picard group such as 'flagstat' or 'SAM/BAM Alignment Summary Metrics'. From there, you may need to adjust the BWA setting to capture hits. If I remember correctly, the sequence fragments in your mrna.fa custom reference genome were quite short."
So, I ran both of her suggestions, and these were the outputs:
1) I ran "flagstat" on one of my BAM files, and it gave me an 11-line output:
6263972 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
406 + 0 mapped (0.01%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
2) Below is the output "log" of the SAM/BAM Alignment Summary Metrics on a BAM file (I bolded a line that does not look too good):
INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CreateSequenceDictionary.jar REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=None returned status 1 and stderr:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=false NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon May 05 10:49:41 CDT 2014] Executing as [email protected] on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_40-b43
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2025324544
Exception in thread "main" net.sf.samtools.SAMException: Sequence name contains invalid character: AF001540 1
at net.sf.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:80)
at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceRecord(CreateSequenceDictionary.java:147)
at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:138)
at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)
INFO:root:## executing samtools sort /galaxy-repl/main/files/008/119/dataset_8119150.dat dataset_8119150.dat.sorted returned status 0 and nothing on stderr
INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt R=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta TMP_DIR=/galaxy/main/scratch INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam returned status 0 and nothing on stderr
And the output "txt" of the same program:
## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=[/galaxy/main/scratch] VALIDATION_STRINGENCY=LENIENT METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon May 05 10:50:52 CDT 2014
## METRICS CLASS net.sf.picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP
UNPAIRED 6263972 6263972 1 0 406 0.000065 19883 63 3078 2442 0 0.00176 0.004873 0.001257 49.999925 0 0 0 0.578818 0 0
The seq fragments in my mrna.fa reference genome varied in length. In my .fastq files, they were 50 nt.
I have no idea how to adjust my BWA to capture this, though. Could anyone help me with how to interpret this? I guess there is no solution but to remap (but if there is a faster solution, I am going for faster time right now), but what should I change in my BWA SOLiD to remap , and hope to get some hits this time?
Many thanks...
Comment