Dear all,
I have used the fasta genome provided by NCBI. The headers of this file are:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
>chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
>chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
[...]
>chrUn_GL000218v1 AC:GL000218.1 gi:224183305 LN:161147 rl:unplaced M5:1d708b54644c26c7e01c2dad5426d38c AS:GRCh38
>chrEBV AC:AJ507799.2 gi:86261677 LN:171823 rl:decoy M5:6743bd63b3ff2b5b8985d8933c53290a SP:Human_herpesvirus_4 tp:circular
I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.
I prepared the header for the virus genome as follows:
>chrV AC:XXXXXXXX.1 gi:00000000 LN:370064105 rl:Chromosome M5:5aa5be7025d7baa666a8651e0909e4ce AS:1 SP:All_viruses tp:linear
I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).
Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.
LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.
I separated the fields with two spaces.
I concatenated the human genome and the made up virus genome with `cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with
bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>
However, I got this error message:
[bns_restore_core] Parse error reading <fusion.fa>.amb
and the SAM file is virtually empty:
cat <file_ALN-SRT.sam>
@HD VN:1.3 SO:coordinate
May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.
I tried with both sed -i 's/\s*$//g' (for spaces in the sequence) and sed -i 's/^[^>]\s*$//g' (for spaces in the header) followed by bwa index but the result was always the same.
Any clues?
Thank you
I have used the fasta genome provided by NCBI. The headers of this file are:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
>chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
>chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
[...]
>chrUn_GL000218v1 AC:GL000218.1 gi:224183305 LN:161147 rl:unplaced M5:1d708b54644c26c7e01c2dad5426d38c AS:GRCh38
>chrEBV AC:AJ507799.2 gi:86261677 LN:171823 rl:decoy M5:6743bd63b3ff2b5b8985d8933c53290a SP:Human_herpesvirus_4 tp:circular
I need to use a fusion genome built by concatenating this human genome with one obtained from selected virus sequences. This virus genome is formed by a single header and a long stretch of nucleotides derived from individual virus sequences.
I prepared the header for the virus genome as follows:
>chrV AC:XXXXXXXX.1 gi:00000000 LN:370064105 rl:Chromosome M5:5aa5be7025d7baa666a8651e0909e4ce AS:1 SP:All_viruses tp:linear
I made up accession number AC to XXXXXXXX.1 because there is no real entry for my made-up genome in Genbank & NCBI; since the IDs given in the human genome are 8 digit long, I gave a 8 letters fake entry and a ".1" because this is first time I am using this genome (maybe I should have used two letter, 6 numbers?).
Same for the GI number: the made up genome is not recorded in GenBank, thus I simply gave a fake 8 digit number.
LN is the length of the genome, I treated it as a real chromosome and M5 derives for md5sum I made on the fasta file. AS and SP are free text fields (I assumed) and the genome is linear.
I separated the fields with two spaces.
I concatenated the human genome and the made up virus genome with `cat <human.fa> <virus.fa> > <fusion.fa> and I prepared the indices for this genome and aligned the samples with
bwa index -a bwtsw <fusion.fa>
bwa mem -t 8 -R <read_group> <fusion.fa> <R1.fq.gz> <R2.fq.gz> | \
samtools sort -o <file_ALN-SRT.sam>
However, I got this error message:
[bns_restore_core] Parse error reading <fusion.fa>.amb
and the SAM file is virtually empty:
cat <file_ALN-SRT.sam>
@HD VN:1.3 SO:coordinate
May I ask what I got wrong? When aligning against either one or the other genome separately the alignment is OK, thus it must be a problem with the headers I guess.
I tried with both sed -i 's/\s*$//g' (for spaces in the sequence) and sed -i 's/^[^>]\s*$//g' (for spaces in the header) followed by bwa index but the result was always the same.
Any clues?
Thank you