Your Bioinformatics Tools Need to be AI-Ready
If you're not building tools that emit rich data for machine learning, you're wasting your compute.
Every bioinformatics tool I’ve written over the past two decades has had the same job: take biological data in, produce a result out. Align reads. Call variants. Build consensus sequences. And over the years, I’ve learned (sometimes the hard way) that the primary output alone is never enough. You need QC metrics. You need debug output. You need the evidence trail that lets you and your users understand what happened and why. I’ve put a lot of work into making tools like fgbio produce rich, useful metrics at every step, and it has consistently paid off.
But I’ve been designing those metrics for humans. For a scientist staring at a MultiQC report, or for myself trying to track down a subtle bug. The question I keep coming back to is: what if that same data were structured and complete enough to train a model?
We are in the middle of two big shifts in how bioinformatics gets done. First, AI research and coding assistants are not just writing code; they are reasoning about experimental designs, interpreting analysis results, and proposing next steps. They are becoming junior scientists in our labs. Second, machine learning is eating every problem where sufficient training data exists. In genomics, we have oceans of data. Both shifts demand that we rethink how we design bioinformatics tools.
I want to make a simple argument: every new bioinformatics tool and method you write should be AI-ready. AI assistants need to work with your code, your outputs, and your analysis logic. Machine learning needs to train on the rich data your tools produce. If you’re not gathering this data as you go, you’re leaving value on the table.
Halfway there
The idea that tools should produce rich, structured metrics is not new. Picard, developed at the Broad Institute (originally written by my co-founder Tim Fennell; I later helped maintain it, along with htsjdk), was doing this over a decade ago. CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, CollectWgsMetrics, MarkDuplicates: the Picard philosophy was that every tool should emit detailed, structured metrics alongside its primary output. That approach shaped how Tim and I thought about tool design when we founded Fulcrum Genomics and built fgbio.
In fgbio, when we call consensus reads from UMI-tagged data, we compute a lot of information along the way and capture much of it in BAM tags and metrics files. CollectDuplexSeqMetrics alone produces eight output files, continuing that Picard tradition.
That makes tools like Picard and fgbio halfway AI-ready. The data is there. The metrics exist. But they were designed for human interpretation: summary tables, aggregate statistics, plots you eyeball. They weren’t designed to be feature vectors that a model can ingest at scale across thousands of samples. The gap between “useful QC” and “ML-ready features” is smaller than most people think, but it’s real, and closing it requires intentional design.
Most tools don’t even get halfway. Your aligner gives you a BAM with mapping qualities, but not the distribution of sub-optimal alignments it considered. Your variant caller gives you a VCF with QUAL scores, but not the per-site feature vectors that went into that decision. All of that discarded information is training data for models you haven’t built yet. In the ML engineering world, they call this “data exhaust”. In bioinformatics, we are incinerating ours.
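To make the "data exhaust" concrete, here is a minimal sketch of what capturing it could look like: an aligner writing a JSON-lines sidecar with the full candidate-alignment distribution for each read, not just the winner. The function name and the `(contig, pos, score)` candidate structure are illustrative assumptions, not any real aligner's internals.

```python
import json


def emit_alignment_evidence(read_name, candidates, out):
    """Record every alignment a read's aligner considered, as one JSON line.

    `candidates` is a hypothetical list of (contig, pos, score) tuples;
    real aligners hold something similar internally and usually discard
    everything but the best hit.
    """
    best = max(candidates, key=lambda c: c[2])
    record = {
        "read": read_name,
        "best": {"contig": best[0], "pos": best[1], "score": best[2]},
        # The "data exhaust": the sub-optimal alignments normally thrown away.
        "suboptimal": [
            {"contig": c, "pos": p, "score": s}
            for (c, p, s) in candidates
            if (c, p, s) != best
        ],
    }
    out.write(json.dumps(record) + "\n")
```

A sidecar like this is append-only and line-oriented, so it costs almost nothing to write during alignment but preserves exactly the evidence a future mapping-quality model would want to train on.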
Two kinds of AI-ready
When I say “AI-ready,” I mean two things.
Your tools need to be legible to AI. Not just to coding assistants that autocomplete your Python, but to AI agents acting as junior scientists in your lab: designing analyses, interpreting outputs, proposing hypotheses, troubleshooting failures. A tool with clean APIs, consistent output formats, well-documented parameters, and structured logs is useful not only to humans but also to an AI scientist that must reason about it. A tool that dumps cryptic column names and unstructured stderr is hostile to humans and AI alike.
This is already a real problem. The emerging field of “agentic bioinformatics” (Phan et al. 2025) is running headfirst into inconsistent interfaces and poorly documented outputs, as tools like BioAgents (Li et al. 2025), AutoBA (Fan et al. 2024), and MCPmed (Li et al. 2025) all demonstrate. Our tools were built for humans reading man pages. The next generation of consumers won’t be human.
Your tools also need to be generative with metadata. Every step in a pipeline is an opportunity to emit features: quality scores, distributional summaries, error profiles, alignment characteristics, signal-to-noise ratios. These aren’t just nice-to-have QC metrics. They’re input features for downstream ML models that can learn to make better decisions than your hand-tuned heuristics.
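One cheap way to be generative with metadata is to define the per-record features as a typed structure and serialize it with explicit column headers. The sketch below is a hypothetical per-UMI-family metrics table for consensus calling; the field names are illustrative, not fgbio's actual metric names.

```python
import csv
from dataclasses import astuple, dataclass, fields


@dataclass
class ConsensusFamilyMetrics:
    """Hypothetical per-UMI-family features emitted alongside the
    consensus read itself (names are illustrative)."""
    family_id: str
    read_count: int
    mean_base_quality: float
    error_rate: float
    strand_balance: float


def write_metrics(rows, out):
    """Write metrics as TSV with a header row derived from the dataclass,
    so column names can never drift out of sync with the code."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([f.name for f in fields(ConsensusFamilyMetrics)])
    for row in rows:
        writer.writerow(astuple(row))
```

Because the schema lives in one dataclass, the same definition documents the output for a human, validates it for the tool, and hands a downstream ML pipeline a ready-made feature vector.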
Where this matters most: filtering
If there’s one place where AI-ready output would have the most immediate impact, it’s filtering. Every domain in genomics has its own version of the signal-vs-artifact problem. HLA typing, where alignment ambiguity and allele-level read support bury the difference between right and wrong. MSI detection, where true microsatellite length changes hide behind polymerase stutter. SV calling, where split reads, discordant pairs, read depth changes, and assembly contigs form a multi-dimensional feature set that ML thrives on. Somatic variant calling, where the difference between a real low-frequency mutation and a sequencing artifact is a subtle pattern no single hard filter captures well.
In all of these cases, the path to better filtering is the same: emit the features, not just the calls. Give downstream models (and downstream scientists, human or AI) the evidence they need to find your false positives and recover your false negatives.
We already have existence proofs. DeepVariant (Poplin et al. 2018) works because the upstream pileup data, base qualities, mapping qualities, and strand information were already captured in a structured format, enabling a CNN to outperform hand-tuned statistical models (Barbitoff et al. 2022). GATK’s VQSR works because the variant caller emits rich per-variant annotations (QD, FS, SOR, MQ, MQRankSum, ReadPosRankSum) that a Gaussian mixture model can train on. If those callers had only emitted PASS/FAIL, neither approach would have been possible. The features are the product.
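The VQSR case illustrates how little machinery is needed once the annotations exist. A sketch, assuming standard VCF INFO encoding (`KEY=VALUE` pairs separated by semicolons): turning one INFO string into a numeric feature vector over the annotations named above, keeping missing values explicit so a model can learn from missingness rather than having it silently dropped.

```python
VQSR_KEYS = ("QD", "FS", "SOR", "MQ", "MQRankSum", "ReadPosRankSum")


def vcf_info_features(info, keys=VQSR_KEYS):
    """Parse a VCF INFO string into an ordered numeric feature vector.

    Annotations absent from this site become None instead of being
    skipped, so every site yields a vector of the same shape.
    """
    parsed = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return [float(parsed[k]) if k in parsed else None for k in keys]
```

This is the whole point of "the features are the product": given INFO strings like these across a cohort, a training set is one list comprehension away. A caller that emitted only PASS/FAIL would leave nothing to parse.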
Prior art
I’m not the first person to think about this. Karpathy’s “Software 2.0” (2017) argued that neural networks shift the work from writing code to curating data; if Software 2.0 is coming for bioinformatics, our Software 1.0 tools need to emit the training data it will consume. Andrew Ng’s data-centric AI movement (2021) makes the case that the bottleneck is data quality, not model architecture. Most directly relevant is the NIH Bridge2AI program ($130M), whose Standards Working Group (Jain et al. 2024) defines AI-readiness criteria for biomedical data and observes that most data generated for other tools is not suitable for ML without significant rework. My argument is that we should stop generating unsuitable data in the first place.
The pushback
“You want me to 10x my output size for features nobody uses yet.” The marginal cost of emitting structured metrics alongside your primary output is small. Storage is cheap. Compute is expensive. You’re already doing the computation; I’m asking you to write down what you learned. The cost of not having this data when you need it is re-running everything, or discovering you can’t answer a question because the intermediate data is gone.
“I’ll add metrics when someone asks for them.” By then it’s too late. The value of these features comes from having them across large cohorts, computed consistently, from the beginning. Retroactively adding feature emission and re-processing thousands of samples is orders of magnitude more expensive than doing it right the first time.
Design principles
If I were writing a set of principles for AI-native bioinformatics tools:
Observable. Emit structured, semantically rich metadata at every step. Not just final answers, but the evidence trail. Per-read, per-site, per-family, per-molecule. Use well-defined tags, clear column headers, and machine-readable formats.
Composable. Consistent, well-documented interfaces that both humans and AI agents can discover, chain, and reason about. Standard formats. Clear help text. Predictable behavior. Outputs parseable by a post-doc who happens to be an LLM.
Trainable. Feature-rich representations that serve as training data for downstream ML. Emit the features, not just the conclusions.
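As a toy illustration of all three principles together, here is a skeleton of a tool that pairs its primary output with a structured, machine-readable metrics sidecar. Everything here (the tool name, sidecar convention, and metric fields) is a hypothetical sketch, not a prescribed standard.

```python
import hashlib
import json
from pathlib import Path


def run_tool(in_path, out_path):
    """Toy AI-ready tool: produce the primary output, plus a JSON metrics
    sidecar discoverable by naming convention (<output>.metrics.json)."""
    data = Path(in_path).read_bytes()
    Path(out_path).write_bytes(data)  # stand-in for the real analysis

    metrics = {
        "tool": "toy-tool",                                 # composable: identifies itself
        "version": "0.1.0",
        "input": str(in_path),
        "input_sha256": hashlib.sha256(data).hexdigest(),   # observable: provenance
        "records_in": data.count(b"\n"),                    # trainable: a feature, not a log line
    }
    Path(str(out_path) + ".metrics.json").write_text(json.dumps(metrics, indent=2))
    return metrics
```

The sidecar costs a few lines of code, yet it is exactly the kind of output an AI agent can discover, parse, and reason about without scraping stderr.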
The pitch
You’re already paying for the compute to run these analyses. The marginal cost of capturing richer output is small. Every time you run a pipeline without capturing intermediate features, you’re burning money twice: once on the compute you’re using now, and once on the compute you’ll need later to regenerate what you threw away.
I should be honest: AI is already changing how we work at Fulcrum Genomics. It has changed how we write code, how we review analyses, and how we think about tool design. We are adapting our business to leverage it, and we think every genomics organization should be thinking about the same.
But here’s what we keep seeing: AI makes expert bioinformaticians more productive, but it doesn’t replace the judgment needed to produce AI-ready data, design the right experiments, or know when the model is wrong. The gap between experts and everyone else is widening, not narrowing. You still need human experts to build the foundation that AI stands on.
For most of my career, the bottleneck has been doing the analysis: writing the pipeline, wrangling the formats, debugging the failures. AI is changing that. It handles the routine work. We get to spend more time on the science: asking the right questions and directing AI to explore the ideas. But the AI post-doc is only as good as the data you give it. With rich, structured features, it can spot batch effects in your consensus calling, flag GC-correlated false positives in your SV calls, catch allele balance inconsistencies in your HLA typing. With minimal, undocumented output, it’s working blind.
That’s what the fulcrum in our name has always meant: a small, well-placed point of leverage that amplifies force. AI is a lot of force. We’re here to help you apply it well.
The tools we build today will either be the foundation for tomorrow’s ML models, or they’ll be replaced by tools that are. I know which side of that I want to be on.
References
Barbitoff YA, Abasov R, Tvorogova VE, et al. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23(1):396. doi: 10.1186/s12864-022-08365-3
Fan J, Chen Z, Chen J, et al. AutoBA: An Autonomous Bioinformatics Agent. Bioinformatics. 2024. PMC11600294
Fennell T, et al. Picard: A set of command line tools for manipulating high-throughput sequencing data. Broad Institute. broadinstitute.github.io/picard
Homer N. fgbio: Tools for working with genomic and high throughput sequencing data. github.com/fulcrumgenomics/fgbio
Jain S, Neumann M, Goenaga-Infante H, et al. Defining AI-Readiness for Biomedical Data: Bridge2AI Standards Working Group Recommendations. Nature Scientific Data. 2024. PMC11526931
Karpathy A. Software 2.0. Medium. 2017. karpathy.medium.com/software-2-0-a64152b37c35
Li H, Wang Q, Wang Y, et al. BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems. arXiv. 2025. arxiv.org/html/2501.06314v1
Li Z, Zeng Y, Zhu X, et al. MCPmed: A Standardized Protocol for Biomedical Tool Integration with LLM Agents. arXiv. 2025. arxiv.org/html/2507.08055v1
Ng A. Why it’s time for data-centric artificial intelligence. MIT Sloan Management Review. 2021. mitsloan.mit.edu
Phan L, Gururajan S, Goh B, et al. Agentic Bioinformatics. Briefings in Bioinformatics. 2025;26(5):bbaf505. doi:10.1093/bib/bbaf505
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36:983-987. github.com/google/deepvariant
Nils Homer is a Founding Partner at Fulcrum Genomics, where he builds bioinformatics tools and pipelines for the genomics community. He is the creator of fgbio and a co-author of the SAMtools paper. You can find him on LinkedIn or reach Fulcrum at contact@fulcrumgenomics.com