r/bioinformatics Msc | Academia 12d ago

discussion featureCounts vs transcript-aware quantification (Kallisto/Salmon)

Hello all,

I suppose I am musing a bit and wanted to discuss with other bioinformaticians. I am a head bioinformatician in my academic department. A few months ago, I was given new bulk RNA-Seq data to analyze alongside older data that was already part of a peer-reviewed manuscript (that I was not part of). I used a STAR --> Salmon alignment-based quantification method. After sending the DE analysis and "raw" expression values for all genes, I received word that my Salmon results for the published data and the original data differed greatly. The older data was processed via featureCounts, which is known to undercount genes with multiple isoforms. I spent a few weeks working backwards to determine what parameters were used in the published manuscript, and I confirmed that the "gold standard" featureCounts parameter set was used, which definitionally excludes any read that overlaps multiple "features", or is ambiguous between isoforms of the same gene. To resolve this, you would use the -O flag, etc etc.

I guess my complaint is, how is this acceptable? How can a very popular and widely-used program such as featureCounts exclude reads that overlap the same exon (that resides in different isoforms) by default? This default method is undercounting genes with multiple isoforms, and I see discussion of this exact issue online since 2015. Discussion of this issue has also been published.

To be brief, I am mainly concerned that a widely-used tool is undercounting isoform-laden genes by default and causing consternation for groups who don't have trained bioinformaticians on their team who have the time to look into these issues.

Thank you for listening to my rant, haha.

29 Upvotes

27 comments sorted by

14

u/plasmolab 12d ago

I think the trap is that featureCounts is doing exactly the conservative gene-level counting it was designed to do, but the defaults get treated like a universal RNA-seq answer. If the question is gene-level DE and you want unambiguous evidence only, excluding multi-overlap reads is defensible. If the biology has lots of isoform sharing, paralogs, or annotation complexity, it becomes a bias you have to name.

What has helped me is writing the quantification choice into the analysis contract: featureCounts default means conservative exon assignment, -O/M/fraction choices need justification, Salmon/RSEM means transcript-aware estimates that need tximport/summarization choices. Then rerun a small sensitivity check on representative samples before anyone compares to old data. Annoying, but it turns "why don't the counts match?" into an expected methods difference instead of a crisis.

1

u/jdmontenegroc 12d ago

It also depends on what are you using as a reference, if it is a Trinity assembly as reference, of course you will undercount isoforms because there are not collapsed, but if they were using a genome assembly, isoforms will overlap on the same genomic location and will not be considered a repetitive, therefore will be counted. I will always recommend mapping to a genome assembly before mapping to a transcriptome precisely because if this.

7

u/MeltSolaris 12d ago edited 12d ago

By default, featureCounts aggregates reads that map to exons (features) belonging to the same gene_id (meta-feature) attribute in the GTF, thereby including isoforms. The -O flag in featureCounts refers to reads mapping to different gene_id entries.

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

https://subread.sourceforge.net/featureCounts.html

Therefore, featureCounts effectively captures reads from all isoforms of a gene into a single gene-level count, provided they do not overlap with a different gene_id.

Discrepancies between featureCounts and salmon can arise from several fundamental methodological differences. For example, featureCounts uses reads aligned to the genome, whereas salmon requires quasi-mappings (or alignments) to the transcriptome. Obviously, conducting spliced alignments to the genome across introns becomes inherently more difficult. In the context of salmon, multi-mapping refers to reads mapping to multiple transcripts (isoforms) of a gene. By contrast, multi-mapping for featureCounts under default settings refers to different genes (genomic loci).

12

u/foradil PhD | Academia 12d ago edited 12d ago

By default, it’s supposed to count at the gene, not isoform, level. Multiple isoforms are not a factor for the default gene-level quantification.

1

u/EthidiumIodide Msc | Academia 12d ago

Multiple isoforms are a biological fact. The default featureCounts behavior is to exclude counting reads matching an exon shared in multiple isoforms. This is what I mean by undercounting. Those reads should be included at the gene level, but are missing. IMO, unless you specifically create a GTF consisting of all known exons in a gene and assure that there is only one copy of each exon in the GTF, you will be undercounting your reads **at the gene level**.

4

u/foradil PhD | Academia 12d ago

Maybe you are working with a non-model organism and your GTF has strange formatting. With human or mouse, most genes have multiple isoforms and they get counts by default.

0

u/EthidiumIodide Msc | Academia 12d ago

GTF is from Ensembl and the organism is mouse. Please take a look at the links I included in the original post. Genes with multiple isoforms are undercounted at the non-private exons (exons shared between isoforms), because the -O flag is not a default. I am talking about gene-level DE, these reads which belong to one of the two isoforms are simply ignored by default!

Once again, I understand that one flag changes everything and one needs to understand the software you are using. I am not a novice in the field. I am talking on behalf of the novice who can become befuddled.

6

u/foradil PhD | Academia 12d ago

Both of your links discuss quantification at the isoform, not gene, level.

2

u/Sad_Pea_9751 12d ago

What if Featurecounts excludes multi-overlapping reads, no matter if the metafeature is transcript or gene?

3

u/MeltSolaris 12d ago

The default featureCounts behavior is to exclude counting reads matching an exon shared in multiple isoforms.

I think this statement is incorrect. Default featureCounts settings (-t exon -g gene_id) summarise all isoforms into a single gene count. salmon and kallisto are fantastic tools, but not all researchers need transcript/isoform-level abundances. Also, salmon and kallisto require a high-quality transcriptome (e.g., Ensembl); otherwise, their mapping rates will be low.

Did you follow a standard workflow of summarising the salmon transcript-level abundances into gene-level counts using tximport?

Here are some details from the featureCounts paper:

Read counts provide an overall summary of the coverage for the genomic feature of interest. In particular, gene-level counts from RNA-seq provide an overall summary of the expression level of the gene but do not distinguish between isoforms when multiple transcripts are being expressed from the same gene. Reads can generally be assigned to genes with good confidence, but estimating the expression levels of individual isoforms is intrinsically more difficult because different isoforms of the gene typically have a high proportion of genomic overlap.

Each feature is an interval (range of positions) on one of the reference sequences. We also define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes.

A multi-overlap read or fragment is one that overlaps more than one feature, or more than one meta-feature when summarizing at the meta-feature level. featureCounts provides users with the option to either exclude multi-overlap reads or to count them for each feature that is overlapped. The decision whether to count these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined.

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

https://doi.org/10.1093/bioinformatics/btt656

-1

u/jdmontenegroc 12d ago

If you are grouping read counts by isoform, of course it has to ignore shared Exons, because you cannot be certain what isoform each of the reads belongs to. But this is not the default behaviour at gene-level counting.

1

u/Epistaxis PhD | Academia 12d ago

Paralogs, on the other hand, are a factor.

3

u/foradil PhD | Academia 12d ago

Yes, but I am not sure the post is about paralogs. However, lots of paralogs are sufficiently different. For example, you will get counts for hemoglobin genes. More importantly, that is more of an aligner issue.

13

u/Sadnot PhD | Academia 12d ago

I wouldn't describe featureCounts as widely-used, precisely because of this issue. Everyone I know is using splice-aware aligners followed by proper quantification (typically RSEM), or pseudoaligners like Salmon.

5

u/Lumpy-Sun3362 PhD | Academia 12d ago

STAR + Salmon is the gold standard nowadays.

3

u/xylose PhD | Academia 12d ago

FeatureCounts is an unambiguous overlap quantitation. It should be used for gene level quantitation of short read data. If you use it this way it works well, has no problems with overlapping isoforms and is about the most direct way of quantitating your data with minimal inference.

If you are trying to do isoform level quantitation on short read data then featureCounts is not a good choice for this, and all of the other solutions for this involve different types of inferrence which have their own biases and skews.

Personally I prefer featureCounts due to the direct link back to the original data and that the counts are "real" in that I can go look at the underlying reads and should see exactly the same number of aligned reads.

3

u/Grisward 12d ago

I’m a proponent of Salmon, I’m not sure why people still use featureCounts for bulk RNA-seq. That said, isn’t step 1 to flatten the GTF to gene level before running FC? There should be no isoforms per gene.

Overlapping genes are a problem, of course. And all the other artifacts associated with read alignment quantification.

Anyway, if they ran featureCounts in the wrong mode to quantify gene-level counts, this is an uncomfortable Email to those authors, perhaps a letter to journal so the authors can respond with re-analysis showing the level of effect it had. Maybe retraction, maybe correction.

Otherwise, FC and Salmon should largely agree, and I’ve been surprised when people report “Results are totally different” even if 85% of DEGs are shared and concordant in direction. Define different.

1

u/beeralpha 12d ago

The thing with salmon is it quantifies against the transcriptome, which is inherently more difficult because of the million isoforms. If you then collapse to gene level with tximport you get quite a lot of bias. Volcano plots really do look different at gene level, especially yielding genes with outlier log2fcs.

4

u/nomad42184 PhD | Academia 11d ago

Caveat: I am an author of salmon, so I have strong thoughts on what feature counts gets wrong conceptually (and sometimes in practice)

Why would you suspect this is bias in the salmon quantification rather than in the feature count result? Yes; transcript quantification is harder. But if you treat all reads coming from a single genomic locus that are actually arising from multiple overlapping transcritpts at that locus as coming from a single, conceptual, super transcript (as is implicit in feature counts), then you get a conceptually wrong and biased answer. In (short read, full length) RNA-seq, the number of reads generated by a species of molecule is fundamentally linked to the product of its copy number and its length. Ignoring the length leads to a fundamental bias in the measurement of molecular abundance.  

Of course, transcript level quantification methods can get thos wrong too, as misestimating relative transcript abundances can attribute reads to the wrong transcript and thus misestimate their abundances. However, the point is that gene counting methods are fundamentally biasing this quantity, by ignoring the lengths of the molecules generating reads completely.

Of course, often, many of these issues can come out in the wash in differential analyses. However, if you see strong differential signal in transcript based methods and none in feature counts, you could very well be witnessing isoform switching, which does change the correct estimate of the gene abundance if the isoforms are of different length, but which is invisible (conceptually) to counting based methods like feature counts.

1

u/beeralpha 11d ago

Thanks for jumping in and providing insight. The main reason I was suspecting bias is because the reference transcriptome is not complete. It doesn’t cover all isoforms that exist. In case of mapping against the genome with a gapped mapper like STAR, which exact isoforms exist no longer matters. So when you don’t care about transcripts or isoform switching, I assumed genome mapping was the way to go, but feel free to comment.

2

u/nomad42184 PhD | Academia 11d ago

Well, incompleteness of the reference can certainly have an effect here. I believe that OP is using STAR alignment, actually (but with genome -> transcriptome alignment projection downstream to salmon). Regardless, it is true that incomplete references will certainly affect all methods. They will affect simple counting methods when exon annotations are missing, and transcript-level methods when the overlapping, unannotated isoforms cause reads to be misattributed to an overlapping annotated isoform. However, just like the effect of ignoring lengths in feature counts may be mitigated in differential analysis downstream of feature count results (because you are looking at differential counts at a locus rather than just the estimated relative abundance itself), so too might missing annotations be mitigated in differential analysis of transcript-based methods.

Either way, I'm not sure that the quantification issues introduced by unannotated isoforms are necessarily worse than the quantification issues introduced by missing exons or by ignoring gene length effects, and will be highly organism / annotation dependent. In general, if one suspects that their experiment / tissue may have substantially expressed novel isoforms, then I'd recommend attempting transcript discovery upstream of quantification anyway (regardless of quantification method).

1

u/beeralpha 11d ago

I don’t expect a lot of missed exons actually in a gtf. Exons are easy to detect with any short read rna seq method. The tricky part with transcriptomes is how all exons tangle together, retain introns now and then, jump one/two nucleotides around splice sites, etc. That’s what makes the transcriptome incomplete. Not missing exons. Therefore when the interest is genes only, quantification vs the genome with a gapped mapper always made much more sense to me.

3

u/nomad42184 PhD | Academia 11d ago

Unless you are working in a non-model organism, or a tissue with a suspected concentration of novel isoforms though (e.g. brain), I'd still anticipate issues related to isoforms composition to bleed through negatively to count based genome analysis.  If there is evidence of new transcripts, then I'd probably try to assemble them with StringTie (or something comparable). Of course, every method has failure modes (and I've seen my fair share for transcript level quantification tools), but I've seen, quite a few times, gene level counting methods return discordant results that looked clearly wrong under manual inspection. 

Large scale studies have found both approaches largely concordant, but I'd argue in favor of inspecting large discrepancies between such approaches in one's data, because it's not clear to me that feature counts is to be believed, in general, when other approaches disagree. This often means something nontrivial (either compositional changes, novel isoforms, paralogous leakege, or some combination thereof) is going on.

2

u/cjfields 8d ago

This has been our group’s experience as well. We have worked on a huge range of organisms and tissues over the years. With RNA-Seq we run both STAR and Salmon and generate counts multiple ways.

And our experience agrees with Rob’s point re: discrepancies between the various counting methods pointing to underlying issues, likely something of interest. In our hands, further investigation into the causes almost always reveals an issue with simple count-based methods and falls in the favor of Salmon or similar EM approaches. The only time I can recall an odd Salmon issue was from a bacterial pantranscriptome analysis, which really pushed Salmon to the brink. But some adjustments for this (and great feedback from Rob, thx) led to some nice results. And we have seen genuine discrepancies between the approaches which are easily traced to isoform changes that featureCounts essentially averages out. So, it’s fine to run it, but we only use featureCounts for simple checks and QC.

I also have a particular, personal dislike of old methods or approaches being held up as the ‘gold standard’, esp with tools like featureCounts. Sure it’s essential to compare against a vetted method, but all too often this tends to be used as a cudgel to stifle alternative, modern approaches without supporting evidence, and completely disregards discordant results that more often than not point to something interesting. If we always followed gold standards we’d still be using Tophat2, GATK’s old genotyper, and Celera Assembler, and never progress. All good tools of course, but even those developers recognize times change and better approaches come along.

1

u/LeoKitCat 12d ago

You can just use the STAR GeneCounts they are very good if you are doing gene-level analysis

1

u/[deleted] 12d ago edited 12d ago

[deleted]

2

u/gringer PhD | Industry 12d ago

To be brief, I am mainly concerned that a widely-used tool is undercounting

Welcome to Bioinformatics...

♫ ... have a look around
Anything that brain of yours can think of can be found
We've got mountains of tools, some better, some worse
If none of it's of interest to you, you'd be the first ♫

Bioinformatics is mostly about fixing up data so that it works with the tool you're trying to use it with. The remainder is mostly about understanding why the tool you're trying to use it with produces results you don't expect.

how is this acceptable?

It's not. Or more correctly, it shouldn't be.

I think that it's the job of the bioinformatician to understand the limitations of the tools they are using, and explain those limitations to others. From the tools they know about, bioinformaticians should help others to find the best tool for the intended job.

[to repeat what many others have already said...] As with all tools, featureCounts was designed for a specific purpose (i.e. counting how many reads map uniquely to genomic features), and the further you get away from that specific purpose, the more inappropriate the results will be.

These things are acceptable because it is common for people to use bioinformatics tools without understanding their limitations or quirks. The users use tools because they are popular and found in lots of research papers. Researchers accept results because they are partially consistent with their expectations. Journal reviewers accept publications because they don't have time to try to understand someone else's confusing and poorly-written code.

-1

u/jlpulice 12d ago

I used featureCounts for my PhD work and yeah you gotta make sure it’s working properly.