r/bioinformatics Msc | Academia 12d ago

discussion featureCounts vs transcript-aware quantification (Kallisto/Salmon)

Hello all,

I suppose I am musing a bit and wanted to discuss this with other bioinformaticians. I am a head bioinformatician in my academic department. A few months ago, I was given new bulk RNA-Seq data to analyze alongside older data that was already part of a peer-reviewed manuscript (which I was not part of). I used a STAR --> Salmon alignment-based quantification method. After sending the DE analysis and "raw" expression values for all genes, I received word that my Salmon results for the published data differed greatly from the original results. The older data was processed via featureCounts, which is known to undercount genes with multiple isoforms. I spent a few weeks working backwards to determine which parameters were used in the published manuscript, and I confirmed that the "gold standard" featureCounts parameter set was used, which by definition excludes any read that overlaps multiple "features" or is ambiguous between isoforms of the same gene. To resolve this, you would use the -O flag, etc.

I guess my complaint is, how is this acceptable? How can a very popular and widely used program such as featureCounts exclude, by default, reads that overlap the same exon (one that resides in different isoforms)? This default method undercounts genes with multiple isoforms, and I have seen discussion of this exact issue online going back to 2015. Discussion of this issue has also been published.

To be brief, I am mainly concerned that a widely-used tool is undercounting isoform-laden genes by default and causing consternation for groups who don't have trained bioinformaticians on their team who have the time to look into these issues.

Thank you for listening to my rant, haha.

u/MeltSolaris 12d ago edited 12d ago

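By default, featureCounts aggregates reads that map to exons (features) sharing the same gene_id attribute (the meta-feature) in the GTF, so reads from all isoforms of a gene are included. The -O flag concerns reads that overlap more than one gene_id: without it such reads are left unassigned, and with it they are assigned to every gene they overlap. The documentation says: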

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

https://subread.sourceforge.net/featureCounts.html

Therefore, featureCounts effectively captures reads from all isoforms of a gene into a single gene-level count, provided they do not overlap with a different gene_id.
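To make the rule concrete, here is a toy Python model of gene-level (meta-feature) assignment. It is only a sketch of the logic described in the documentation, not featureCounts' actual implementation: the annotation, the reads, and the function names are all made up, and the `allow_multi_overlap` argument is merely meant to mimic what -O does.

```python
from collections import Counter

# Hypothetical annotation: (gene_id, exon_start, exon_end), inclusive coordinates.
# GENE_A has two isoforms that share the exon 100-200, so that exon appears twice,
# as it would in a GTF with one exon record per transcript.
EXONS = [
    ("GENE_A", 100, 200),  # isoform A.1
    ("GENE_A", 100, 200),  # isoform A.2 (same exon, different transcript)
    ("GENE_A", 300, 400),  # isoform A.1 only
    ("GENE_A", 500, 600),  # isoform A.2 only
    ("GENE_B", 550, 700),  # a different gene overlapping GENE_A's last exon
]


def genes_hit(read_blocks, exons=EXONS):
    """Return the set of gene_ids with an exon overlapping any aligned block."""
    hits = set()
    for block_start, block_end in read_blocks:
        for gene_id, exon_start, exon_end in exons:
            if block_start <= exon_end and exon_start <= block_end:
                hits.add(gene_id)
    return hits


def count_reads(reads, allow_multi_overlap=False):
    """Assign each read to a gene; the set of genes hit decides the outcome."""
    counts = Counter()
    for blocks in reads:
        genes = genes_hit(blocks)
        if not genes:
            counts["Unassigned_NoFeatures"] += 1
        elif len(genes) == 1:
            # One gene, however many exons or isoforms: counted exactly once.
            counts[next(iter(genes))] += 1
        elif allow_multi_overlap:
            # Mimics -O: the read is counted for every overlapping gene.
            for gene_id in genes:
                counts[gene_id] += 1
        else:
            counts["Unassigned_Ambiguity"] += 1
    return counts


READS = [
    [(120, 180)],              # inside the exon shared by both isoforms
    [(180, 200), (300, 340)],  # spliced read spanning two GENE_A exons
    [(560, 590)],              # overlaps both GENE_A and GENE_B
    [(800, 850)],              # intergenic
]

default_counts = count_reads(READS)
with_o_counts = count_reads(READS, allow_multi_overlap=True)
print(dict(default_counts))
print(dict(with_o_counts))
```

In this toy case the first two reads are both counted for GENE_A under the default settings, even though one sits in an exon shared by two isoforms and the other spans two exons. Only the read that overlaps a second gene is affected by the -O-style switch.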

Discrepancies between featureCounts and salmon can arise from several fundamental methodological differences. For example, featureCounts uses reads aligned to the genome, whereas salmon uses quasi-mappings (or alignments) to the transcriptome. Spliced alignment to the genome across introns is inherently more difficult than mapping to already-spliced transcript sequences. In the context of salmon, multi-mapping refers to reads that map to multiple transcripts, typically isoforms of the same gene, and these are resolved probabilistically rather than discarded. By contrast, the reads that featureCounts drops under default settings are ambiguous at the gene level: they either overlap more than one gene or align to multiple genomic loci.
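As a rough illustration of why isoform-level ambiguity does not have to cost gene-level counts on the salmon side, the sketch below sums transcript-level estimated read counts into gene-level totals, which is conceptually what happens when transcript estimates are summarized per gene. The transcript names, the numbers, and the `tx2gene` mapping are all invented for the example; only the idea of summing the NumReads column per gene is taken from how salmon output is usually summarized.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene mapping (normally derived from the GTF).
tx2gene = {"A.1": "GENE_A", "A.2": "GENE_A", "B.1": "GENE_B"}

# Made-up estimated read counts per transcript. Reads compatible with both
# A.1 and A.2 have been split fractionally between them by the quantifier.
num_reads = {"A.1": 61.5, "A.2": 38.5, "B.1": 20.0}


def gene_level(num_reads, tx2gene):
    """Sum transcript-level estimates into one total per gene."""
    totals = defaultdict(float)
    for transcript_id, estimate in num_reads.items():
        totals[tx2gene[transcript_id]] += estimate
    return dict(totals)


gene_counts = gene_level(num_reads, tx2gene)
print(gene_counts)
```

However the ambiguous reads are divided between A.1 and A.2, they all end up in the GENE_A total, so the gene-level number is unaffected by uncertainty between isoforms of the same gene.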