What Human Genome Is This, Really?
Every bioinformatician has been there. You receive a file from a collaborator, or dig one out of cold storage, and it references a genome file like human_ref.fa or hg38.fasta. But which hg38? From what source, with what patch release, with how many alt contigs? The file name is, at best, a hint. At worst, it’s actively misleading.
Last year we received an email from a client struggling with exactly this problem. They had downloaded what they believed was the right reference FASTA from UCSC (455 contigs), but the BAMs they were trying to work with had been aligned to a reference with 595 contigs. Contigs like chr11_KZ559110v1_alt appeared in the BAM headers but were nowhere to be found in the downloaded FASTA. After some sleuthing, the culprit turned out to be a specific GRCh38 patch release. Close, but not the same, and in genomics, “close but not the same” can silently corrupt an entire analysis.
This is a problem we’ve hit repeatedly in client work over the years, and it’s surprisingly hard to solve cleanly. Grepping the header for a version comment works maybe half the time. MD5-checking sequences requires having the right sequences to check against. Asking the data provider what reference they used is often met with a shrug or a filename. And trial-and-error alignment is expensive and frustrating.
So we built something better.
Introducing ref-solver
ref-solver is a tool that identifies which human genome reference a file is associated with by comparing its sequence metadata (contig names, lengths, and ordering) against a curated catalog of known references. Crucially, it never looks at the actual sequence data. That makes it fast, lightweight, and appropriate even for sensitive datasets where you wouldn’t want to upload raw sequences to a third party.
The tool accepts a wide range of input formats: SAM/BAM/CRAM headers, .dict files, .fai index files, or a full FASTA. It extracts the sequence dictionary and scores it against the catalog, returning a ranked list of the closest matching references, down to the specific patch release.
You can use it two ways:
Web app: Head to whatsmygenome.fulcrumgenomics.com, paste or upload your file, and get an answer in seconds. No installation, no accounts, nothing to configure.
Command line: Install via Bioconda (conda install ref-solver) for integration into pipelines and automated workflows.
How It Works
The core idea is simple: two files derived from the same reference genome will have matching sequence dictionaries, meaning the same contig names, with the same lengths, in the same order. Different reference versions, sources, or patch releases will differ in at least one of those dimensions.
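To make that concrete, here is a minimal Python sketch of the comparison. It is not ref-solver's actual code. The function names are hypothetical and the contig lengths are made up. The only format details it relies on are the SN and LN tags on @SQ lines in a SAM header, and the first two tab-separated columns (name, length) of a .fai index.

```python
from typing import List, Tuple

Contig = Tuple[str, int]  # (name, length)


def parse_sam_header(text: str) -> List[Contig]:
    """Collect (SN, LN) from each @SQ line of a SAM header, in file order."""
    contigs = []
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs.append((tags["SN"], int(tags["LN"])))
    return contigs


def parse_fai(text: str) -> List[Contig]:
    """Read contig name and length from the first two columns of a .fai."""
    contigs = []
    for line in text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs.append((name, int(length)))
    return contigs


def compare(a: List[Contig], b: List[Contig]) -> dict:
    """Report differences in contig names, lengths, and relative ordering."""
    len_a, len_b = dict(a), dict(b)
    shared = [name for name, _ in a if name in len_b]
    return {
        "only_in_a": [name for name, _ in a if name not in len_b],
        "only_in_b": [name for name, _ in b if name not in len_a],
        "length_mismatch": [n for n in shared if len_a[n] != len_b[n]],
        "same_order": shared == [name for name, _ in b if name in len_a],
    }


# Toy inputs with invented lengths: the header has an alt contig the index lacks.
header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2_alt\tLN:50\n"
fai = "chr1\t1000\t6\t60\t61\n"
report = compare(parse_sam_header(header), parse_fai(fai))
print(report)
```

On the toy inputs, the alt contig shows up under only_in_a, which is the same shape of mismatch our client ran into: contigs present in the BAM header but absent from the FASTA.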
ref-solver builds a fingerprint from the sequence dictionary and compares it against a catalog that covers the major human genome reference flavors: UCSC (hg19, hg38), NCBI/Ensembl (GRCh37, GRCh38), Broad bundle releases, T2T-CHM13, and a range of GRCh38 patch releases from p1 through p14. When an exact match exists, you get a definitive answer. When the match is partial (for example, a BAM that was aligned to a reference with a subset of the catalog’s contigs), ref-solver returns a similarity score so you can identify the closest known reference and understand what’s missing or different.
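As an illustration of what scoring against a catalog could look like, the sketch below ranks a query dictionary against a toy catalog using Jaccard similarity over (name, length) pairs. This is an assumption chosen for simplicity, not a description of ref-solver's scoring. The catalog entries, names, and lengths are invented, and this simple fingerprint ignores contig order.

```python
from typing import Dict, FrozenSet, List, Tuple

Contig = Tuple[str, int]  # (name, length)


def fingerprint(contigs: List[Contig]) -> FrozenSet[Contig]:
    """Order-insensitive fingerprint: the set of (name, length) pairs."""
    return frozenset(contigs)


def jaccard(a: FrozenSet[Contig], b: FrozenSet[Contig]) -> float:
    """Shared pairs divided by all pairs seen in either dictionary."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank(query: List[Contig],
         catalog: Dict[str, List[Contig]]) -> List[Tuple[str, float]]:
    """Score the query against every catalog entry, best match first."""
    q = fingerprint(query)
    scores = [(name, jaccard(q, fingerprint(ref)))
              for name, ref in catalog.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical catalog; real references have hundreds of contigs.
catalog = {
    "toy_ref_v1": [("chr1", 1000), ("chr2", 800)],
    "toy_ref_v2": [("chr1", 1000), ("chr2", 800), ("chr2_alt", 50)],
    "toy_other": [("1", 990), ("2", 800)],
}
query = [("chr1", 1000), ("chr2", 800), ("chr2_alt", 50)]
ranking = rank(query, catalog)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

Here toy_ref_v2 is an exact match, toy_ref_v1 is a partial match missing the alt contig, and toy_other shares nothing because both its names and its lengths differ. A production scorer would presumably also account for ordering and naming conventions.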
When This Matters
The most obvious use case is provenance recovery: you have a BAM and you need to know what it was aligned to before you can do anything useful with it. This comes up constantly when working with legacy datasets, public repositories, or data shared between institutions where documentation is incomplete.
But it’s equally valuable as a validation step in active pipelines. Before you run variant calling or any reference-dependent analysis, you want to confirm that the FASTA you downloaded from UCSC last Tuesday is actually what you think it is. Reference files get updated, mirrors can serve stale content, and download errors are real. Spending thirty seconds running ref-solver before kicking off a multi-day pipeline is cheap insurance.
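A pre-flight check of this kind can be sketched in a few lines of Python. The version below is one possible approach of our own devising, not ref-solver functionality: it hashes the ordered list of contig names and lengths, and refuses to continue if the digest differs from one pinned in the pipeline configuration. The function names and toy contigs are hypothetical.

```python
import hashlib
from typing import List, Tuple

Contig = Tuple[str, int]  # (name, length)


def dictionary_digest(contigs: List[Contig]) -> str:
    """SHA-256 over 'name<TAB>length' lines, so names, lengths, and order all count."""
    h = hashlib.sha256()
    for name, length in contigs:
        h.update(f"{name}\t{length}\n".encode("utf-8"))
    return h.hexdigest()


def require_reference(contigs: List[Contig], pinned_digest: str) -> None:
    """Fail loudly if the sequence dictionary is not the one we pinned."""
    actual = dictionary_digest(contigs)
    if actual != pinned_digest:
        raise ValueError(
            f"sequence dictionary digest {actual} does not match "
            f"pinned digest {pinned_digest}"
        )


# Toy dictionary with invented lengths; in practice the digest would be
# computed once from a trusted reference and stored with the pipeline config.
toy = [("chr1", 1000), ("chr2", 800)]
pinned = dictionary_digest(toy)
require_reference(toy, pinned)  # passes silently
```

A guard like this only tells you that something changed. Working out which reference you actually have is the part ref-solver is built for.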
It’s also useful in multi-site studies or collaborations where different groups may have independently downloaded “the same” reference from different sources (UCSC versus Ensembl versus a Broad bundle) and ended up with subtly different files that will cause headaches downstream when you try to merge or compare results.
Try It
If you have a BAM, CRAM, FASTA, .dict, or .fai lying around whose exact provenance you’re not 100% certain of, give it a try at whatsmygenome.fulcrumgenomics.com. The web app takes seconds and requires nothing but the file.
The source code is on GitHub at fulcrumgenomics/ref-solver, and contributions, especially additions to the reference catalog, are very welcome. If there’s a reference build you’d like to see covered, open an issue or submit a PR.
Reference genomes are the foundation everything else is built on. Getting them wrong quietly is far worse than failing loudly.




