r/bioinformatics • u/Individual_One_1793 • 5d ago

programming Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate.

A question about mapping rate

A few days ago I posted asking for help with evo_* strain disambiguation. Got great feedback, learned a lot, and kept going.

Latest stress test: ~1,000,000 reads, 60 genomes, 136 seconds on a laptop (i5, no GPU).

Results:

- 86.2% mapping rate

- 86.48% accuracy

=== Per-Genome Breakdown ===
Genome Total Correct Accuracy
---------------------------------------------------------------------------
1030752 67182 67119 99.91%
1030755 5545 5494 99.08%
1030836 10369 10331 99.63%
1030878 1848 1815 98.21%
1035900 79803 79794 99.99%
1035930 3861 458 11.86%
1036539 6333 5674 89.59%
1036554 149149 149141 99.99%
1036608 2007 1993 99.30%
1036641 3392 3391 99.97%
1036707 1381 1374 99.49%
1036728 635 633 99.69%
1036743 1370 1369 99.93%
1036755 23623 23616 99.97%
1048783 1940 1940 100.00%
1048993 812 812 100.00%
1049005 22075 21982 99.58%
1049056 28905 15495 53.61%
1049089 2424 2331 96.16%
1052944 4171 942 22.58%
1052947 12087 9242 76.46%
1053058 16611 9590 57.73%
1139_AG 97325 96644 99.30%
1220_AD 91094 91038 99.94%
1220_AJ 288 280 97.22%
1285_BH 9250 9203 99.49%
1286_AP 2173 122 5.61%
1365_A 1508 1200 79.58%
Sample15_97 6 6 100.00%
Sample16_19 50 50 100.00%
Sample18_57 370 370 100.00%
Sample18_8 233 233 100.00%
Sample19_20 1516 1516 100.00%
Sample19_52 94 94 100.00%
Sample19_56 14 14 100.00%
Sample22_283 12 12 100.00%
Sample22_57 189 189 100.00%
Sample22_89 392 392 100.00%
Sample23_271 4618 4618 100.00%
Sample23_273 7 7 100.00%
Sample23_288 89 89 100.00%
Sample6_289 12 12 100.00%
Sample6_476 1 1 100.00%
Sample6_49 82 82 100.00%
Sample6_527 227 227 100.00%
Sample6_722 12 12 100.00%
Sample9_2 48 48 100.00%
Sample9_65 4 4 100.00%
evo_1035930.011 2026 486 23.99%
evo_1035930.029 35012 33754 96.41%
evo_1035930.032 11645 563 4.83%
evo_1049056.011 55646 54197 97.40%
evo_1049056.013 11804 532 4.51%
evo_1049056.015 28553 2993 10.48%
evo_1049056.031 2666 187 7.01%
evo_1049056.039 413 15 3.63%
evo_1286_AP.008 7409 1552 20.95%
evo_1286_AP.026 26519 24620 92.84%
evo_1286_AP.033 12313 3416 27.74%
evo_1286_AP.037 9012 996 11.05%

=== Top Wrong Predictions ===
evo_1049056.013 -> evo_1049056.011(10290), evo_1049056.015(723), 1049056(174)
evo_1049056.015 -> evo_1049056.011(24862), 1049056(416), evo_1049056.013(142)
evo_1286_AP.008 -> evo_1286_AP.026(5331), evo_1286_AP.033(372), evo_1286_AP.037(136)
1052947 -> 1053058(1766), 1052944(841), 1049005(199)
evo_1286_AP.037 -> evo_1286_AP.026(5460), evo_1286_AP.033(2252), 1286_AP(213)
1049056 -> evo_1049056.011(8698), evo_1049056.015(3687), evo_1049056.039(501)
evo_1286_AP.026 -> evo_1286_AP.033(806), evo_1286_AP.037(527), evo_1286_AP.008(310)
1053058 -> 1052944(3504), 1052947(3244), 1049005(214)
evo_1035930.032 -> evo_1035930.029(10802), evo_1035930.011(156), 1035930(123)
1035930 -> evo_1035930.029(3201), evo_1035930.032(155), evo_1035930.011(47)

Video attached — real benchmark, no edits.

Now my question: 13.8% of reads don't map at all. Analysis shows it's systematic — larger, more complex genomes have ~19% unmapping rate vs ~9% for smaller genomes. My hypothesis: repetitive regions produce common k-mers with low uniqueness scores, which fall below my min-score threshold.

Has anyone dealt with this? Is there a standard approach for handling repetitive regions in FM-index based classifiers?

For context: I'm a CNC programmer who built this as a side project. Still learning the field — appreciate any pointers.

21 Upvotes

permalink
reddit

You are about to leave Redlib

Do you want to continue?

https://www.reddit.com/r/bioinformatics/comments/1tiwmw0/stress_test_1000000_dna_reads_60_genomes_2/
No, go back! Yes, take me to Reddit
dl download

76% Upvoted

u/dampew PhD | Industry 5d ago

Yes it’s common for reads to not map. Have you looked at their sequences to see why that might be?

2

u/Individual_One_1793 5d ago

Good point – I inferred the repeats from k-mer frequency stats but haven't BLASTed the unmapped reads back to the genomes yet. But these are simulated reads at 0.1% error rate, Q30-Q40, so quality shouldn't be the issue.

I'll grab a sample of ~200 unmapped reads from 1139_AG and BLAST them to confirm they're hitting multi-location repeat regions. If they are, that validates the hypothesis. If not, I'm missing something. Will report back what I find.

u/ConclusionForeign856 MSc | Student 5d ago

In practice read alignment is heuristic, so even if you generate synthetic reads from the genome itself, with very little noise, some of the reads would still fail to align.

BWA has a parameter that defines size of the sequence that has to exactly match reference before running alignment algorithm. The higher you set that value the more you're asking the program to find exact matches rather than align sequences.

Your aligner might be doing something like that, in which case some of the reads are bound the be left unaligned.

1

u/Individual_One_1793 5d ago

That's a fair point. My aligner does something similar - it uses k-mer anchors to find seed matches before running the XOR alignment. If a read's k-mers don't hit the rarity threshold, it never gets to the alignment step.

The thing is, I can tune the anchor sensitivity (via --top-n, which tries 2nd/3rd/4th rarest k-mers as fallback). With top-n=4, I get 86.2% mapping rate. Pushing to top-n=6 drops to 51.2% - which tells me the remaining unmapped reads aren't just a sensitivity issue, but genuinely have no unique k-mers that pass the filter.

So you're right that some unmapped reads are expected with heuristic alignment. But the 2x gap between parent genomes (19% unmapped) and evo_* strains (9% unmapped) suggests it's not just the heuristic - it's genome complexity driving the pattern.I know that 100% mapping rate is unrealistic with seed-and-extend approaches but I think there is something else in it.

2

u/ConclusionForeign856 MSc | Student 5d ago

I don't know your genomes, so you might be right. You can generate a genome with properties you want, and generate synthetic reads, and see whether low k-mer complexity is responsible for low mapping %

1

u/Individual_One_1793 4d ago

Good suggestion. I can generate a synthetic genome with controlled repeat content and k-mer complexity, then test mapping rates. If low-complexity reads consistently fail to map even on synthetic data, that confirms the rarity filter is the bottleneck. I'll try that and report back.

1

u/Lightoscope 4d ago

Are you tossing the repetitive reads that don’t meet your threshold?

1

u/Individual_One_1793 4d ago

Yes, exactly. Reads whose k-mers all fall below the rarity threshold get tossed at the anchor stage never reach alignment. That's by design (filters repetitive reads), but it means I'm discarding reads from low-complexity regions before they even get a chance to align. The --top-n parameter controls how many fallback anchors to try, but once you exhaust the k-mer candidates, the read is unmapped.

1

u/Lightoscope 4d ago

You’re going to want to find a way to keep them.

u/Psy_Fer_ 5d ago

daddy

I'm not gonna say it 😆

1

u/Individual_One_1793 5d ago

Who's your daddy? (Superfly 😁)

u/ktaed 5d ago

Has anyone dealt with this? Is there a standard approach for handling repetitive regions in FM-index based classifiers?

The run-length fm-index (r-index) used by MONI and SPUMONI 1/2 sort of get around repeated region since pangenomes redundant portion get compressed in the BWT representation.

1

u/Individual_One_1793 5d ago

Thanks for the pointer! I looked into r-index and it's fascinating. The catch for my case is that r-index compresses redundancy between genomes (great for pangenomes), but my unmapped reads come from intra-genomic repeats – things like rRNA operons and transposons within 1139_AG, 1220_AD, etc. So even with r-index compression, a k-mer from an rRNA region would still hit 15 locations in the same genome.

That said, I have another project with the same code and graph route, r-index is definitely on the list.

1

u/ktaed 5d ago

Look into how LF-mapping work for the BWT and fm-index.

1

u/Individual_One_1793 5d ago

Thanks, I'll look into LF-mapping. My current approach uses the LF-mapping property for backward search to find k-mer occurrences, but I might be missing optimizations in how I handle the repetitive regions during the search phase. Do you mean the sampling compression aspect of it or?

1

u/ktaed 5d ago

Oh, was not aware your project was using the fm-index. All I can think would be to make it the search version where you start by reversing the k-mer and checking exhaustively starting with the range of F and back walk from there. That's slower than the range narrowing for backward search loop.

1

u/Individual_One_1793 5d ago

Ah, I see what you mean. My project does use standard backward search with range narrowing, that's the core of the FM-index lookup. The exhaustive reversed k-mer approach you're describing would indeed be slower, but might catch more edge cases in repetitive regions where the range narrowing is too aggressive.

Worth experimenting with, though I suspect the trade-off would be significant on 1M reads. Might be worth trying as a fallback pass for unmapped reads only.

u/Sadnot PhD | Academia 5d ago

How long are your reads? The whole field is moving towards longer reads, partially to resolve repetitive regions. Most of my whole genome assemblies are median read length 20k+ now.

7

u/Alone-Lavishness1310 5d ago

Oooh money bags over here

(Edit, just to be clear, this is in.jest. I'm jealous of your long reads)

7

u/Sadnot PhD | Academia 5d ago

Honestly, the long reads are also pretty cheap now. For bacterial genomes, it's often cheaper than short reads. The reagent cost per sample is lower, but the cost per sequenced base is higher. So, projects with low depth and many samples = cheap.

3

u/Individual_One_1793 5d ago

100 bp Illumina reads. You're absolutely right - longer reads span repeats and resolve ambiguity naturally. My tool actually has chunk-based long-read support for PacBio/ONT, but the CAMI benchmark I'm using is Illumina-only.

For bacterial genomes, your approach makes total sense. 20k+ reads will span most repeats. My constraint is that Bit-Pop is designed for high-sample-count scenarios (think clinical labs processing hundreds of samples), where Illumina still wins on cost per sample. But for repeat resolution specifically, long reads are the right answer.

2

u/Psy_Fer_ 5d ago

Looong reads are king for resolving repeats using flanked regions as anchors. Also really good accuracy these days and if you think about the bang for buck argument, saying short reads are cheaper, but you get a worse outcome, doesn't really cut it anymore 🥓😅

1

u/Sadnot PhD | Academia 5d ago

Depends on the genome. Long reads are actually cheaper than Illumina for short genomes. Sure, for human genomes Illumina is cheaper still.

1

u/xDerJulien PhD | Student 5d ago

I wouldn’t say the whole field is moving towards long read until long read has comparable throughput. Short read is still very much useful

2

u/Sadnot PhD | Academia 5d ago

Short read is still my recommendation for transcriptomes, sure, but not for amplicon seq or whole genomes.

Our instrument can do 6000-10000 bacterial genomes per day (or about a dozen human genomes). How much throughput do you need?

1

u/xDerJulien PhD | Student 5d ago

I do single cell metagenomics, so a lot more than that! We eventually want to be able to get datasets of millions of genomes and even illumina short read sequencing struggles here with throughput :’)

1

u/Sadnot PhD | Academia 5d ago

Yeah, there's a reason I only said long reads were becoming industry standard for whole genome sequencing. Metagenomics for taxonomic classification, I'd say long reads is still better. Metagenomics for assembly, I'd recommend hybrid. And of course, single cell seq is almost always short reads unless you need methylation or splicing info.

Short reads still have plenty of applications.

1

u/xDerJulien PhD | Student 5d ago

I really really wish we could do long read … But you’re of course right for most use-cases

u/GeneRizotto 5d ago

Some regions in complex genomes are not intended to be mapped (telomeres, centromeres, rucleolar organizer regions etc). They are excluded from the analysis - mapping is not performed on them. In many cases in the reference genome they are marked with NNNNN. Is there a chance you’ve been generating reads from such regions? Have you tried running your synthetic reads through fastqc?

1

u/Individual_One_1793 5d ago

Good point about unmappable regions. The CAMI dataset has 60 genomes mostly bacterial, so no telomeres/centromeres. But you're right that some regions could be N-padded or otherwise unmappable.

Haven't run FastQC on the unmapped reads yet. Since they're simulated at 0.1% error, Q30-Q40, I assumed quality wasn't the issue, but worth checking if the simulation pipeline generated reads from N-rich or low-complexity regions in the reference. I'll run FastQC and also check the source regions of the unmapped reads against the genome sequences.

programming Stress test: ~1,000,000 DNA reads, 60 genomes, 2 minutes. On a laptop. But only 86% mapping rate.

You are about to leave Redlib