**Xianglinwu** · 07-30-2014, 11:07 AM

SureSelect adapter sequence

What are the SureSelect adapter sequences? The indexing strand of SureSelect is different from the indexing strand of TruSeq LT. Can anyone give some information?

**bioBob** · 07-31-2014, 04:26 AM

Love the location. You should sign your name the same way.

I have not done the 'correction' method you have referenced. I normally download the reference and annotation file from igenomes in tophat ready format. Can you post a snippet from your corrected gtf file?

I believe HTSeq will now output a SAM file with a field indicating which feature each read contributed to. Have you looked at that to see where they map?

-o <samout>, --samout=<samout>
write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)
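A minimal sketch of how that annotated SAM file could be summarised, assuming the assignment sits in an optional field of the form `XF:Z:<value>` as described above. The function name `tally_xf` is just for illustration and is not part of HTSeq.

```python
from collections import Counter


def tally_xf(sam_lines):
    """Count alignment records per XF assignment in an htseq-count --samout file.

    Assumes the tag is written as an optional field 'XF:Z:<value>'.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines, if any are present
        fields = line.rstrip("\n").split("\t")
        # Optional fields start after the 11 mandatory SAM columns.
        for field in fields[11:]:
            if field.startswith("XF:Z:"):
                counts[field[5:]] += 1
                break
    return counts
```

Calling `tally_xf(open("samout.sam")).most_common(10)` (the file name is a placeholder) lists the most frequent assignments, which should show quickly whether the reads pile up under a special counter or under a handful of features.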

**ExMachina** · 07-31-2014, 04:31 AM

One other change I just tried was the HTSeq option "--mode=intersection-nonempty".

I was hoping to get more information, but this option only decreased my "no_feature" count by about 5000. So I'm still left with the question: is it reasonable to have only ~15% of my mapped reads associate with annotated genes?
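For reference, a rough sketch of how a percentage like that can be computed from the htseq-count output table (tab-separated feature name and count). The special counter names listed here are the ones I recall from the htseq-count documentation; newer releases prefix them with two underscores, so both spellings are handled. Treat the names and the helper `summarize_counts` as assumptions, not as HTSeq's own API.

```python
# Special counters reported by htseq-count (assumed names; newer versions
# write them with a leading "__", e.g. "__no_feature").
SPECIAL = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def summarize_counts(lines):
    """Return (assigned reads, special counters, assigned fraction)."""
    assigned = 0
    special = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        key = name.lstrip("_")
        if key in SPECIAL:
            special[key] = int(value)
        else:
            assigned += int(value)
    total = assigned + sum(special.values())
    fraction = assigned / total if total else 0.0
    return assigned, special, fraction
```

Note that the denominator here includes every special counter, so the fraction is relative to all reads htseq-count saw, not only to the mapped ones.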

**bioBob** · 07-31-2014, 04:33 AM

No, that is not reasonable. Can you post a few lines of your corrected gtf?

**ExMachina** · 07-31-2014, 04:48 AM

> Originally posted by bioBob
>
> I have not done the 'correction' method you have referenced. I normally download the reference and annotation file from igenomes in tophat ready format. Can you post a snippet from your corrected gtf file?

Here's a small chunk from chr1. As you can see some genes have corrected formatting but others (generally non-coding, I think) remain uncorrected:

chr1 knownGene exon 69091 70008 . + . gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
chr1 knownGene CDS 69091 70005 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
chr1 knownGene start_codon 69091 69093 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
chr1 knownGene stop_codon 70006 70008 . + 0 gene_id "Q8NH21"; transcript_id "uc001aal.1"; exon_number "1"; exon_id "uc001aal.1.1"; gene_name "Q8NH21";
chr1 knownGene exon 134773 139696 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
chr1 knownGene CDS 138533 139696 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
chr1 knownGene exon 139790 139847 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "2"; exon_id "uc021oeg.2.2"; gene_name "B4DF06";
chr1 knownGene CDS 139790 139792 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "2"; exon_id "uc021oeg.2.2"; gene_name "B4DF06";
chr1 knownGene exon 140075 140566 . - . gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "3"; exon_id "uc021oeg.2.3"; gene_name "B4DF06";
chr1 knownGene start_codon 139790 139792 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
chr1 knownGene stop_codon 138530 138532 . - 0 gene_id "B4DF06"; transcript_id "uc021oeg.2"; exon_number "1"; exon_id "uc021oeg.2.1"; gene_name "B4DF06";
chr1 knownGene exon 182393 182746 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "1"; exon_id "uc031tlc.1.1";
chr1 knownGene exon 183132 183240 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "2"; exon_id "uc031tlc.1.2";
chr1 knownGene exon 183740 184878 . + . gene_id "uc031tlc.1"; transcript_id "uc031tlc.1"; exon_number "3"; exon_id "uc031tlc.1.3";

I'm also wondering if we're jumping the gun on trying to use the newer hg38 assembly and annotation? As you have alluded, there are more refined input files available for hg19/37.

Currently I am waiting on the imminent ENSEMBL release of 38 to run the same pipeline and compare the results to the UCSC-based results.

> I believe HTSeq will now output a SAM file with a field indicating which feature each read contributed to. Have you looked at that to see where they map?
>
> -o <samout>, --samout=<samout>
> write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’)

Wow, that's a great tip. That's exactly what I need to have to see what's going on. Thanks!

**bioBob** · 07-31-2014, 05:56 AM

If it were me, I would align and count to the previous release just to see what the count assignments look like in terms of the number of reads in no_feature etc.

Another thing, are you sure the chromosome id's are all identical in the fasta and gtf? In the build I have, the chromosomes are labeled as 1, 2 etc instead of chr1, chr2 etc.
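A quick way to check this, as a sketch only: collect the sequence names from the GTF (first column) and from the FASTA headers (first word after `>`), then compare the two sets. The function names are made up for illustration.

```python
def gtf_seqnames(gtf_lines):
    """Sequence names from column 1 of a GTF, ignoring comments and blanks."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def fasta_seqnames(fasta_lines):
    """Sequence names from FASTA headers (first word after '>')."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def compare_seqnames(gtf_lines, fasta_lines):
    """Report names shared by both files and names unique to each."""
    gtf = gtf_seqnames(gtf_lines)
    fasta = fasta_seqnames(fasta_lines)
    return {
        "shared": sorted(gtf & fasta),
        "only_gtf": sorted(gtf - fasta),
        "only_fasta": sorted(fasta - gtf),
    }
```

An empty `shared` list would mean that no feature in the annotation can ever be matched to a read, which would push everything into no_feature.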

**GenoMax** · 07-31-2014, 06:00 AM

Give "featureCounts" a try, while you are at it.

As bioBob pointed out, check to make sure your chrom IDs match in your ref/BAM/GTF files.

**bioBob** · 07-31-2014, 06:07 AM

Oh, and make sure the reads are sorted in the BAM file. HTSeq now has a flag to indicate how the file was sorted, i.e. by name or by position.

**ExMachina** · 07-31-2014, 06:19 AM

> If it were me, I would align and count to the previous release just to see what the count assignments look like in terms of the number of reads in no_feature etc.

Good idea, and I already tried that. Same general results.

> Another thing, are you sure the chromosome id's are all identical in the fasta and gtf? In the build I have, the chromosomes are labeled as 1, 2 etc instead of chr1, chr2 etc.

The "1, 2..." IDs are ENSEMBL while UCSC uses "chr1, chr2..."

And good point on the sorting, bioBob. I checked and all my BAM files are sorted (SO:coordinate).

**bioBob** · 07-31-2014, 06:24 AM

Right, so you need to set that flag as the default is name. You probably had a lot of messages in the console about mate not found.

-r <order>, --order=<order>
For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.
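For paired-end data and a version of htseq-count that has this option, the value to pass can be read off the SAM/BAM header. A small sketch, assuming the header text is available (for example from `samtools view -H`) and that the `@HD` line carries a standard `SO:` field; the helper name is hypothetical.

```python
def htseq_order_from_header(header_lines):
    """Map the SAM @HD sort order to the htseq-count --order value.

    Returns "pos" for SO:coordinate, "name" for SO:queryname, and None
    when the header gives no usable sort order (e.g. unsorted or unknown).
    """
    mapping = {"coordinate": "pos", "queryname": "name"}
    for line in header_lines:
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return mapping.get(field[3:])
    return None
```

A result of `None` means the file should be sorted first, or at least that the header cannot be trusted to say how it was sorted.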

**ExMachina** · 07-31-2014, 06:39 AM

> Originally posted by bioBob
>
> Right, so you need to set that flag as the default is name. You probably had a lot of messages in the console about mate not found.
>
> -r <order>, --order=<order>
> For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.

Interesting. It looks like "-r" and "--order" are not options in my version (0.5.4). I wonder if that's the problem right there?!

EDIT: actually, I don't think this is a problem for me here, since all I have are SE reads.

**ExMachina** · 07-31-2014, 11:22 AM

Thanks for the input on this. As an update, here's what I've figured out:

The UCSC GTF file should not be made from the "knownGene" table but from the "refGene" table, because the "knownGene" table has too many overlapping coordinates.
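One rough way to gauge how much overlap an annotation contains is to count gene_ids whose exon spans overlap another gene_id on the same chromosome and strand. This is only a sketch: it compares whole gene spans (first exon start to last exon end), which is a coarser test than the exon-level overlap htseq-count uses, and the strand-aware grouping is my assumption about what matters for a stranded count.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Return {(chrom, strand, gene_id): (start, end)} built from exon lines."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if not match:
            continue
        key = (fields[0], fields[6], match.group(1))
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


def overlapping_genes(gtf_lines):
    """Return gene_ids whose span overlaps another gene on the same strand."""
    by_locus = defaultdict(list)
    for (chrom, strand, gene), (start, end) in gene_spans(gtf_lines).items():
        by_locus[(chrom, strand)].append((start, end, gene))
    overlapping = set()
    for intervals in by_locus.values():
        intervals.sort()
        max_end, max_gene = -1, None
        for start, end, gene in intervals:
            # Overlaps the furthest-reaching interval seen so far.
            if start <= max_end:
                overlapping.update((gene, max_gene))
            if end > max_end:
                max_end, max_gene = end, gene
    return overlapping
```

Comparing `len(overlapping_genes(...))` between the two GTF files gives a feel for how many more reads could end up as ambiguous with one annotation than with the other.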

Using the refGene annotations, I now get ~33% of my mapped reads mapping (uniquely) to known genes. I feel a little more confident with this result, especially given the RiboMinus library prep.

More comments are always welcome and I will report back here once the new ENSEMBL annotation is released.

Thanks for the help!

-David


HTSeq, human genomes and low read counts: am I doing anything wrong?
