Apart from the problem when using more than one core, and the random problems that come with it, I have some other issues:
The FASTQ file has 75400120 lines, so I have 18850030 reads. Tophat, however, reports only 18645993 reads.
OK, maybe some of them were rejected a priori because of some quality issue.
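Then samtools flagstat reports 41956190 mapped alignments in accepted_hits.bam, far more than the number of reads.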
I also have a lot of lines in the .sam files without read IDs. In another dataset this caused problems with htseq-count; here it does not, but since I never noticed anything like this in the past, I would like to understand it. Is this OK? Are the lines without IDs reported for reads that map to more than one position on the transcriptome/genome?
The SAM format guide says that missing read names should be marked with a '*'...
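To put a number on the lines without read IDs, something like the following could be run over the SAM text. It is only a minimal sketch: `summarize_qnames` is a hypothetical helper, and it assumes plain tab-separated SAM lines where the first column is the read name (QNAME) and header lines start with `@`. It counts records whose QNAME is empty or `*`, and how many distinct read names occur more than once.

```python
from collections import Counter


def summarize_qnames(sam_lines):
    """Summarize read names (column 1) in an iterable of SAM text lines."""
    records = 0            # alignment records seen (header lines excluded)
    missing = 0            # records whose QNAME is empty or '*'
    per_read = Counter()   # number of records per named read

    for line in sam_lines:
        if line.startswith("@"):      # SAM header line
            continue
        line = line.rstrip("\r\n")
        if not line:                  # skip completely blank lines
            continue
        records += 1
        qname = line.split("\t", 1)[0]
        if qname in ("", "*"):
            missing += 1
        else:
            per_read[qname] += 1

    return {
        "records": records,
        "missing_qname": missing,
        "distinct_reads": len(per_read),
        "reads_with_multiple_records": sum(1 for n in per_read.values() if n > 1),
    }


if __name__ == "__main__":
    # Tiny made-up example, not data from this thread.
    demo = [
        "@HD\tVN:1.0\tSO:coordinate\n",
        "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
        "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\n",
        "r2\t256\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\tIIII\n",
        "\t0\tchr1\t400\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
        "*\t0\tchr1\t500\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    ]
    print(summarize_qnames(demo))
```

On a real file the same function can be fed `sys.stdin` and the text output of `samtools view accepted_hits.bam` piped in. If `missing_qname` is large, looking at a few of those lines directly should show whether the name column is really empty or the lines are damaged in some other way.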
Code:
$ wc -l ../reads/SRR306839.fastq
75400120 ../reads/SRR306839.fastq
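The read count follows from this line count if every FASTQ record takes exactly four lines (header, sequence, `+`, qualities). A minimal sketch of that conversion, assuming unwrapped four-line records; `reads_from_line_count` is a hypothetical helper name:

```python
def reads_from_line_count(n_lines: int, lines_per_record: int = 4) -> int:
    """Convert a FASTQ line count into a read count.

    Assumes every record occupies exactly `lines_per_record` lines,
    i.e. no wrapped sequence or quality lines.
    """
    if n_lines % lines_per_record != 0:
        raise ValueError("line count is not a multiple of the record size")
    return n_lines // lines_per_record


if __name__ == "__main__":
    # Line count reported by `wc -l` for SRR306839.fastq
    print(reads_from_line_count(75400120))  # 18850030
```

The check on the remainder is there because a line count that is not a multiple of four would point to a truncated or wrapped file rather than to a real read count.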
Code:
18645993 reads; of these:
  18645993 (100.00%) were unpaired; of these:
    7576936 (40.64%) aligned 0 times
    5480546 (29.39%) aligned exactly 1 time
    5588511 (29.97%) aligned >1 times
59.36% overall alignment rate
Then:
Code:
$ samtools flagstat accepted_hits.bam
41956190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
41956190 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
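One way to read these numbers together, offered as a sketch rather than a diagnosis: `samtools flagstat` counts alignment records, not distinct reads, so a read reported at several loci contributes several lines to the total. The arithmetic below uses only the figures quoted in this post. The assumption that every uniquely aligned read contributes exactly one record is mine, and `reconcile_counts` is a hypothetical helper.

```python
def reconcile_counts(fastq_lines, tophat_input, unique, multi, bam_records):
    """Relate FASTQ, alignment-summary and flagstat counts to each other."""
    fastq_reads = fastq_lines // 4             # four lines per FASTQ record
    aligned_reads = unique + multi             # reads with at least one hit
    # Assumption: each uniquely aligned read yields exactly one BAM record,
    # so the remaining records belong to multi-mapped reads.
    multi_records = bam_records - unique
    return {
        "reads_missing_from_summary": fastq_reads - tophat_input,
        "aligned_reads": aligned_reads,
        "alignment_rate_pct": 100.0 * aligned_reads / tophat_input,
        "records_from_multi_mappers": multi_records,
        "mean_records_per_multi_mapper": multi_records / multi,
    }


if __name__ == "__main__":
    summary = reconcile_counts(
        fastq_lines=75400120,    # wc -l on the FASTQ file
        tophat_input=18645993,   # reads in the alignment summary
        unique=5480546,          # aligned exactly 1 time
        multi=5588511,           # aligned >1 times
        bam_records=41956190,    # flagstat "in total"
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Under that assumption, 204037 reads never reach the alignment summary, 11069057 reads align at least once, and the multi-mapped reads average roughly 6.5 records each, which would be enough to account for the flagstat total being far larger than the read count.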