Hello, guys.
I have several questions about NGS and would appreciate your help.
1. Where should I download the hg19 reference file?
As far as I know, NCBI and UCSC both provide the reference sequence in FASTA format, one file per chromosome.
Does it matter where I get the reference sequence?
I actually downloaded the per-chromosome reference sequence files from NCBI:
http://www.ncbi.nlm.nih.gov/genome/?term=homo%20sapiens -> 'genome' tab -> files from Genome Reference Consortium.
However, in other posts I found that users recommend downloading the files from 1000 Genomes or the UCSC Golden Path instead.
Is there any difference?
2. When using BWA, what is the minimal unit for a reference file?
Do I have to use the whole hg19 reference sequence as ref.fasta, or can I use a single chromosome (e.g. chr3.fasta)?
Or can I use the FASTA sequence file of a specific gene as the reference?
(I also downloaded this from NCBI. For example, to use EGFR as a reference, I entered 'EGFR' in the 'gene' category, clicked the Homo sapiens result, and downloaded the sequence in FASTA format.)
As you know, a FASTA file starts with a header line, '>~~~~~~~~', followed from the next line by the sequence, 'AGCTCCTG~~~~'.
Is the first line ('>~~~~') important when using BWA?
If I use a specific gene's FASTA file, what should I write on the first line?
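To make the header question concrete, here is a minimal Python sketch of how a FASTA header line is commonly split into a sequence name and a free-text description. It assumes the usual convention that the name is the first whitespace-delimited token after '>', which is the token aligners generally report as the reference name. The function name parse_fasta_header and the example header text are hypothetical, used only for illustration.

```python
def parse_fasta_header(line):
    """Split a FASTA header line into (name, description).

    Assumption: the name is the first whitespace-delimited token after '>';
    everything after it is treated as a free-text description.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    parts = line[1:].rstrip("\n").split(None, 1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


if __name__ == "__main__":
    # Hypothetical header for a single-gene reference file.
    print(parse_fasta_header(">EGFR Homo sapiens epidermal growth factor receptor"))
```

Under that convention, a short name without spaces (for example '>EGFR' or '>chr3') is what would identify the sequence in the alignment output.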
3. When I run BWA paired-end alignment, the command looks like this:
'bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam'
The reads in read1.fq carry a 6 bp barcode, so I trimmed the barcode sequence from read1.fq with my own code rather than with a command-line option. (I also trimmed the corresponding 6 characters of the quality scores.)
As a result, the read lengths in read1.fq and read2.fq are no longer the same.
When I ran the paired-end alignment command, the terminal showed 'weird pair' messages, but it still produced the result file 'aln.sam'.
Is this okay? Has anyone had the same experience?
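For reference, the kind of trimming described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the exact code that was used: it assumes plain four-line FASTQ records, a barcode at the start (5' end) of each read, and a fixed barcode length of 6. The function name trim_fastq is hypothetical.

```python
import io


def trim_fastq(src, dst, barcode_len=6):
    """Copy FASTQ records from src to dst, dropping a leading barcode.

    Assumptions: four lines per record (header, sequence, '+', quality)
    and a barcode occupying the first `barcode_len` bases of each read.
    The sequence and quality strings are trimmed by the same amount so
    that they stay the same length.
    """
    while True:
        header = src.readline()
        if not header:
            break  # end of input
        seq = src.readline().rstrip("\n")
        plus = src.readline().rstrip("\n")
        qual = src.readline().rstrip("\n")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ in " + header.strip())
        dst.write(header.rstrip("\n") + "\n")
        dst.write(seq[barcode_len:] + "\n")
        dst.write(plus + "\n")
        dst.write(qual[barcode_len:] + "\n")


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles instead.
    example = io.StringIO("@r1\nACGTACGGGTT\n+\nIIIIIIHHHHH\n")
    out = io.StringIO()
    trim_fastq(example, out)
    print(out.getvalue(), end="")
```

Only the sequence and quality lines change; the header lines and the order of records are left as they were, so read1.fq and read2.fq still list the same reads in the same order.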