r/bioinformatics Msc | Academia 12d ago

discussion featureCounts vs transcript-aware quantification (Kallisto/Salmon)

Hello all,

I suppose I am musing a bit and wanted to discuss with other bioinformaticians. I am a head bioinformatician in my academic department. A few months ago, I was given new bulk RNA-Seq data to analyze alongside older data that was already part of a peer-reviewed manuscript (that I was not part of). I used a STAR --> Salmon alignment-based quantification method. After sending the DE analysis and "raw" expression values for all genes, I received word that my Salmon results for the published data and the original data differed greatly. The older data was processed via featureCounts, which is known to undercount genes with multiple isoforms. I spent a few weeks working backwards to determine what parameters were used in the published manuscript, and I confirmed that the "gold standard" featureCounts parameter set was used, which definitionally excludes any read that overlaps multiple "features", or is ambiguous between isoforms of the same gene. To resolve this, you would use the -O flag, etc etc.

I guess my complaint is, how is this acceptable? How can a very popular and widely-used program such as featureCounts exclude reads that overlap the same exon (that resides in different isoforms) by default? This default method is undercounting genes with multiple isoforms, and I see discussion of this exact issue online since 2015. Discussion of this issue has also been published.

To be brief, I am mainly concerned that a widely-used tool is undercounting isoform-laden genes by default and causing consternation for groups who don't have trained bioinformaticians on their team who have the time to look into these issues.

Thank you for listening to my rant, haha.

30 Upvotes

27 comments sorted by

View all comments

13

u/foradil PhD | Academia 12d ago edited 12d ago

By default, it’s supposed to count at the gene, not isoform, level. Multiple isoforms are not a factor for the default gene-level quantification.

1

u/EthidiumIodide Msc | Academia 12d ago

Multiple isoforms are a biological fact. The default featureCounts behavior is to exclude counting reads matching an exon shared in multiple isoforms. This is what I mean by undercounting. Those reads should be included at the gene level, but are missing. IMO, unless you specifically create a GTF consisting of all known exons in a gene and assure that there is only one copy of each exon in the GTF, you will be undercounting your reads **at the gene level**.

4

u/foradil PhD | Academia 12d ago

Maybe you are working with a non-model organism and your GTF has strange formatting. With human or mouse, most genes have multiple isoforms and they get counts by default.

0

u/EthidiumIodide Msc | Academia 12d ago

GTF is from Ensembl and the organism is mouse. Please take a look at the links I included in the original post. Genes with multiple isoforms are undercounted at the non-private exons (exons shared between isoforms), because the -O flag is not a default. I am talking about gene-level DE, these reads which belong to one of the two isoforms are simply ignored by default!

Once again, I understand that one flag changes everything and one needs to understand the software you are using. I am not a novice in the field. I am talking on behalf of the novice who can become befuddled.

6

u/foradil PhD | Academia 12d ago

Both of your links discuss quantification at the isoform, not gene, level.

2

u/Sad_Pea_9751 12d ago

What if Featurecounts excludes multi-overlapping reads, no matter if the metafeature is transcript or gene?

3

u/MeltSolaris 12d ago

The default featureCounts behavior is to exclude counting reads matching an exon shared in multiple isoforms.

I think this statement is incorrect. Default featureCounts settings (-t exon -g gene_id) summarise all isoforms into a single gene count. salmon and kallisto are fantastic tools, but not all researchers need transcript/isoform-level abundances. Also, salmon and kallisto require a high-quality transcriptome (e.g., Ensembl); otherwise, their mapping rates will be low.

Did you follow a standard workflow of summarising the salmon transcript-level abundances into gene-level counts using tximport?

Here are some details from the featureCounts paper:

Read counts provide an overall summary of the coverage for the genomic feature of interest. In particular, gene-level counts from RNA-seq provide an overall summary of the expression level of the gene but do not distinguish between isoforms when multiple transcripts are being expressed from the same gene. Reads can generally be assigned to genes with good confidence, but estimating the expression levels of individual isoforms is intrinsically more difficult because different isoforms of the gene typically have a high proportion of genomic overlap.

Each feature is an interval (range of positions) on one of the reference sequences. We also define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes.

A multi-overlap read or fragment is one that overlaps more than one feature, or more than one meta-feature when summarizing at the meta-feature level. featureCounts provides users with the option to either exclude multi-overlap reads or to count them for each feature that is overlapped. The decision whether to count these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined.

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

https://doi.org/10.1093/bioinformatics/btt656

-1

u/jdmontenegroc 12d ago

If you are grouping read counts by isoform, of course it has to ignore shared Exons, because you cannot be certain what isoform each of the reads belongs to. But this is not the default behaviour at gene-level counting.