SamBuilder for Test SAM Files and Records
Have you ever written Python code to manipulate a BAM file?
How can you be sure the code does what you intended?
Does your code handle all edge cases properly?
Maybe you’ve run your code with existing BAM files,
or added those BAMs to a suite of test data.
But does the data in those BAMs really flex all edge cases of your code?
How easy is it to maintain a suite of test data files?
Are you tempted to re-use a data file across multiple tests?
In this post I’m going to show you how to use fgpyo’s SamBuilder to quickly generate SAM/BAM data to assert your bioinformatics Python code does what you think it does.
Motivation for Programmatic Test Data
Why use programmatically generated test data instead of flat files?
Keeps like concerns together. Viewing a separate test data file to see the specifics of a test case violates keeping like concerns together. Modifying existing test files to suit new unit tests can cause bugs. Modular code is easier to maintain and review.
Declarative approach. Using a programmatic API to build up a test file is declarative—a reader knows exactly which field or parameter is relevant to the test.
Inline comments. SAM/BAM files don’t support inline comments, and for other file types, comments in a test file violates (1) since the comment is separate from the unit test.
Avoids coupling multiple files (potentially of different types). If your function/tool manipulates SAM records based on information from another file type, e.g. filtering a BAM according to intervals in a BED file, then having multiple test files violates (1). Instead, you can generate the test data programmatically prior to each test.
Basic Usage of SamBuilder
Common Test Patterns Using SamBuilder
There are two common test patterns that make SamBuilder
valuable.
First we’re going to look at generating test data files on-the-fly. This pattern is most useful for integration testing. In bioinformatics-land, this usually means either:
Testing a standalone CLI tool, or
Testing a workflow that combines several tools (e.g. Nextflow or Snakemake)
Second, we’ll use SamBuilder
to write unit tests for Python functions. This pattern builds up reads in memory without ever writing them to disk and is most useful for unit testing. I’ll use this example to demonstrate some of the most commonly used SamBuilder
options.
Build the Working Environment
First, let’s build our working environment using the Conda/Mamba package manager. I’m using Mamba here, but the Conda package manager will also work. The YAML environment file and the following shell commands will create the environment sam_builder_example
:
name: sam_builder_example | |
dependencies: | |
- python=3.12 | |
- fgpyo=0.3.0 | |
- samtools=1.20 | |
- pysam=0.22.1 | |
- pytest=8.2.2 |
mamba env create -f "environment.yml" | |
mamba activate "sam_builder_example" |
Basic Usage of SamBuilder
First thing’s first; we’ll look at a simple script that uses SamBuilder
to generate aligned single-end reads in BAM
format. Download the script and try it out.
python "SamBuilder_create_se_test_bam.py" --output "test.bam" --num_reads 5 |
On line 4 the script imports SamBuilder
from the fgpyo
library.
from fgpyo.sam.builder import SamBuilder |
On lines 18-23, a SamBuilder
object is created to build “num_reads
“ single-end reads, and finally the script outputs those reads to a results file.
builder = SamBuilder(r1_len=10) | |
for offset in range(num_reads): | |
builder.add_single(chrom="chr1", start=100 + offset) | |
builder.to_path(test_bam_file) |
Let’s take a closer look at the important pieces.
First Create a SamBuilder
Object
builder = SamBuilder(r1_len=10) |
This line creates a SamBuilder
object with default values for all parameters except the R1 length, which is set to 10.
You can set quite a few useful values at this time:
Parameter | Definition | |
---|---|---|
r1_len | The default length of read 1 | |
r2_len | The default length of read 2 (if paired end) | |
base_quality | The default base quality | |
sd | A sequence dictionary defining reference sequences and lengths | |
rg | A default read group; some tools are strict about read groups | |
extra_header | Any extra headers for advanced usage | |
seed | Seed for anything randomly generated | |
sort_order | The output sort order which defaults to coordinate sort |
Add an Aligned Single-end Read
builder.add_single(chrom="chr1", start=100 + offset) |
Adds a single-end read to the builder (returns it too, but we’ll come to that later!) with specified mapping on chr1
starting at position 100 plus the loop index as a positional offset.
Write the Reads to a BAM file
builder.to_path(test_bam_file) |
This line writes all the accumulated reads to a BAM file, with a header determined in part by the sequence dictionary and read group information that can be passed to the constructor, and with sort order determined by the constructor. Writes an index by default if coordinate-sorted.
Run the script and take a look at the output file this script generates.
Yours will be the same as mine, because we set the random seed to the same value—convenient for me writing this blog post, essential for developers using SamBuilder
to generate test data!
samtools view "test.bam" |
QNAME | FLAG | RNAME | POS | CIGAR | MAPQ | CIGAR | RNEXT | TLEN | SEQ | QUAL | RG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
q0000 | 0 | chr1 | 101 | 60 | 10M | * | 0 | 0 | GACAGGTACA | ?????????? | RG:Z:1 | |
q0001 | 0 | chr1 | 102 | 60 | 10M | * | 0 | 0 | AGAAGGAGTA | ?????????? | RG:Z:1 | |
q0002 | 0 | chr1 | 103 | 60 | 10M | * | 0 | 0 | TGCATCAATG | ?????????? | RG:Z:1 | |
q0003 | 0 | chr1 | 104 | 60 | 10M | * | 0 | 0 | TGGTCGTGTG | ?????????? | RG:Z:1 | |
q0004 | 0 | chr1 | 105 | 60 | 10M | * | 0 | 0 | GAACAAACGC | ?????????? | RG:Z:1 |
Test Pattern 1: Test Data Files On-The-Fly
Let’s say we want to test the functionality of a command that performs simple read filtering based on SAM flags with samtools view
. Let’s test the following command:
samtools view -F 0x4 -hb |
This command removes unmapped reads from an input BAM file (-F 0x4
) and outputs the retained reads in BAM format (-b
) with the header (-h
). Our expected input will be any valid BAM file, but ideally one with a mix of mapped and unmapped reads. Our expected output will be a BAM file with only mapped reads retained. So, we need to generate two BAM files for integration testing:
Test Input: a BAM file with mapped and unmapped reads
Expected Output: a BAM file with only the mapped reads from Test Input
We will modify SamBuilder_create_se_test_bam.py
to perform this functionality. The final version of the script is available here.
Output a BAM File Containing Test Records
Essential arguments to parse will be added to the command line parser:
parser.add_argument( | |
"-e", "–expected", help="Output expected BAM file", | |
type=str, required=True | |
) | |
parser.add_argument( | |
"-u", "–num_unmapped", | |
help="Number of unmapped reads to write in test BAM file", | |
type=int, required=True | |
) |
And a create_test_bam
function needs to be defined:
def create_test_bam( | |
test_bam_file: Path, | |
num_reads: int, | |
num_unmapped: int, | |
) -> None: |
Finally, the new create_test_bam
function in a script main()
is called:
create_test_bam( | |
test_bam_file=Path(args.output), | |
num_reads=args.num_reads, | |
num_unmapped=args.num_unmapped, | |
) |
Add Unmapped Reads BAM to the Output
The expected output file will include only the mapped reads. The first num_reads
that are added to builder
are all mapped, because they all have chr
and start
specified. Let’s output those to expect_bam_file
by replacing test_bam_file
with expect_bam_file
in the call to builder.to_path()
.
builder.to_path(expect_bam_file) |
We can add the unmapped reads to the builder with another loop, this time adding num_unmapped
reads. Note that chr
and start
are not specified, so SamBuilder
knows to use ‘*’ in the output BAM
file to indicate they are unmapped.
for offset in range(num_unmapped): | |
builder.add_single() |
Now output all the reads to the test_bam_file
.
builder.to_path(test_bam_file) |
SamBuilder
is particularly useful in these scenarios where some reads should be filtered out using the pattern:
Build and add all reads that should be output
Output expected reads to an expected data file
Build and add all reads that should not be output
Output all reads to an input data file
You can use this pattern for unit testing as well, such as if you have a function that filters a
list
of reads. Instead of outputting the reads to a file, collect them in a list, saving a copy asexpected
output before adding the reads that should be filtered out.
Test Simple Read Filtering
Create the input and expected data files:
python "SamBuilder_create_samtools_view_test_data.py" \ --num_reads 5 \ --num_unmapped 2 \ --output "test.bam" \ --expected "expect.bam"
Run the command:
samtools view -F 0x4 "test.bam" > "output.sam"
Compare the expected and output files:
samtools view "expect.bam" | diff - "output.sam"
In this example, the tool behaves as expected since the difference between the files is zero. Good! In the real world, it might not. But now you have a chance to debug and rerun the test until you are certain the behavior is correct.
Test Pattern 2: Unit Test FASTQ to BAM Functions
Download SamBuilder_bam2fastq.py
which is a toy script for converting a BAM to a FASTQ file. To see how this script works, let’s first create a new test BAM file with 3 single-end reads.
python "SamBuilder_create_se_test_bam.py" --output "test.bam" --num_reads 3 |
Then convert that BAM to FASTQ with default parameters.
python "SamBuilder_bam2fastq.py" --input "test.bam" --output "test.fastq" |
Compare the BAM File and the FASTQ File
samtools view "test.bam" |
QNAME | FLAG | RNAME | POS | CIGAR | MAPQ | CIGAR | RNEXT | TLEN | SEQ | QUAL | RG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
q0000 | 0 | chr1 | 101 | 60 | 10M | * | 0 | 0 | GACAGGTACA | ?????????? | RG:Z:1 | |
q0001 | 0 | chr1 | 102 | 60 | 10M | * | 0 | 0 | AGAAGGAGTA | ?????????? | RG:Z:1 | |
q0002 | 0 | chr1 | 103 | 60 | 10M | * | 0 | 0 | TGCATCAATG | ?????????? | RG:Z:1 |
Look at the output of SamBuilder_bam2fastq.py
and notice that each of the entries in the BAM file were converted successfully to FASTQ format.
cat "test.fastq" |
@q0000 | |
GACAGGTACA | |
+ | |
?????????? | |
@q0001 | |
AGAAGGAGTA | |
+ | |
?????????? | |
@q0002 | |
TGCATCAATG | |
+ | |
?????????? |
Unit Tests for SamBuilder_bam2fastq.py
There are three functions in addition to main()
in the script. We’ll use SamBuilder
to test all three. Download the the script SamBuilder_test_bam2fastq.py
and we’ll walk through it together.
Testing the Function alignment_to_fastq
This function converts an alignment as represented by a pysam AlignedSegment
into a FASTQ string representation:
def alignment_to_fastq(alignment: AlignedSegment) -> str: | |
""" | |
Converts a pysam AlignedSegment to a FASTQ string representation. | |
Args: | |
alignment: pysam AlignedSegment object to convert. | |
Returns: | |
FASTQ string representation of the input alignment. | |
""" | |
qualities: str = "".join(map(lambda qual: chr(qual + 33), alignment.query_qualities)) | |
fastq_string: str = f"@{alignment.query_name}\n" + f"{alignment.query_sequence}\n" + f"+\n{qualities}\n" | |
return fastq_string |
Building the Unit Test
To test this function we’ll need some helper functions and classes including
SamBuilder
andAlignedSegment
.from fgpyo.sam.builder import SamBuilder from pysam import AlignedSegment from bam2fastq import bam2fastq
We then need an
AlignedSegment
with defined sequence bases and quality scores. We could rely on the default query name being of the formatqNNNN
withNNNN
auto-incremented starting from 0, and the default quality score being 30, but it’s better to be explicit in a unit test. Note that this is an unmapped read becausechrom
andstart
were not specified. The function we are testing doesn’t care whether or not a read is mapped.def test_alignment_to_fastq() -> None: """Tests generating a FASTQ string from a pysam AlignedSegment object.""" builder = SamBuilder(r1_len=10, base_quality=30) alignment: AlignedSegment = builder.add_single(name="query", bases="A" * 10)
Call
alignment_to_fastq
:test_fastq: str = alignment_to_fastq(alignment)
Compare the output to the expected FASTQ representation based on the known data. Note that
expected_fastq
contains the ASCII representation of base quality 30 (?
).expected_fastq = f"@query\n{‘A’*10}\n+\n{‘?’*10}\n" assert test_fastq == expected_fastq
Run the Test with pytest
Your output should look something like this:
pytest "SamBuilder_test_bam2fastq.py" |
==================== test session starts ==================== | |
platform darwin -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0 | |
rootdir: /Users/ameynert/scratch/sam_builder_blog | |
collected 1 item | |
SamBuilder_test_bam2fastq.py . [100%] | |
===================== 1 passed in 0.04s ===================== |
Testing the Function include_alignment
This function checks if an alignment should be included in the output according to the given flags. It would not be a function in production code, but I’ve included it here as it’s useful for demonstration.
def include_alignment( | |
alignment: AlignedSegment, include_supplementary: bool = False | |
) -> bool: | |
""" | |
Determines whether an alignment should be included in the output. | |
Args: | |
alignment: pysam AlignedSegment object to check. | |
include_supplementary: Whether to include supplementary reads | |
in the output (default: False). | |
Returns: | |
True if the alignment should be included, False otherwise. | |
""" | |
return include_supplementary or not alignment.is_supplementary |
Building the Unit Test
For this unit test, we’ll create alignments that are either primary or supplementary.
We’ll need to import our function and
pytest
too for some advanced testing features:import pytest from bam2fastq import include_alignment
pytest.mark.parametrize
will be used to generate the list of parameters to test over. This ensures each test gets run and that a failure of oneassert
does not prevent the others from being tested. We parametrize over("is_supplementary", "include_supplementary", "expected")
. A primary alignment should be included regardless of the value ofinclude_supplementary
. A supplementary alignment should only be included ifinclude_supplementary=True
@pytest.mark.parametrize( ("is_supplementary", "include_supplementary", "expected"), [ (False, True, True), (False, False, True), (True, True, True), (True, False, False), ], ) def test_include_alignment_primary_only( is_supplementary: bool, include_supplementary: bool, expected: bool ) -> None:
In the body of the test function, build an
AlignedSegment
object and testinclude_alignment
.builder = SamBuilder() alignment = builder.add_single(chrom="chr2", start=200, supplementary=is_supplementary) assert include_alignment(alignment=alignment, include_supplementary=include_supplementary) == expected
Run the Test with pytest
pytest "SamBuilder_test_bam2fastq.py" |
==================== test session starts ==================== | |
platform darwin -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0 | |
rootdir: /Users/ameynert/scratch/sam_builder_blog | |
collected 5 items | |
SamBuilder_test_bam2fastq.py ..... [100%] | |
===================== 5 passed in 0.09s ===================== |
Notice there are now 5 tests that have been run–the first test from above and an additional 4 tests from the parameterized test we just wrote.
Testing the Function bam2fastq
The final function is the one that’s actually reading in the BAM file and writing it to FASTQ.
def bam2fastq( | |
bam_file: Path, | |
fastq_file: Path, | |
include_supplementary: bool = False, | |
) -> None: | |
""" | |
Converts reads in a BAM file to FASTQ format. | |
Excludes supplementary reads by default. | |
Conversion to ASCII encoded quality scores adds 33 to each integer score and converts it to an | |
ASCII character, then joins the converted quality scores together into a string. | |
Args: | |
bam_file: Path to the input BAM file. | |
fastq_file: Path to the output FASTQ file. | |
include_supplementary: Whether to include supplementary reads in the output (default: | |
False). | |
""" | |
with reader(bam_file) as bam, open(fastq_file, "w") as fastq: | |
for alignment in bam: | |
if include_alignment( | |
alignment=alignment, | |
include_supplementary=include_supplementary, | |
): | |
fastq.write(f"{alignment_to_fastq(alignment)}") |
Building the Unit Test
Given that we already have tests for the two helper functions, a sufficient unit test for this function is to create a BAM file with a mix of primary and supplementary alignments, call the function with include_supplementary=False
, and read back in the FASTQ to see if it only includes the primary alignments.
Add imports for reading FASTQ files from
pysam
:from pathlib import Path from pysam import FastqProxyfrom pysam import FastxFile from bam2fastq import bam2fastq
Create a BAM file with a mix of reads:
def test_bam2fastq(tmp_path: Path) -> None: """ Tests converting a BAM file to FASTQ format, excluding supplementary reads. """ builder = SamBuilder(r1_len=10, base_quality=30) builder.add_single(name="query1", bases="A" * 10) builder.add_single(name="query2", bases="C" * 10) builder.add_single(name="query3", bases="G" * 10, supplementary=True) bam_file = tmp_path / "test.bam" builder.to_path(bam_file)
Convert it to FASTQ:
fastq_file = tmp_path / “test.fastq” bam2fastq(bam_file=bam_file, fastq_file=fastq_file, include_supplementary=False)
Read in the FASTQ file and verify the contents are as expected:
with FastxFile(fastq_file) as fh: query_to_fastq: dict[str, FastqProxy] = {entry.name: entry for entry in fh} assert len(query_to_fastq.keys()) == 2 assert "query1" in query_to_fastq assert "query2" in query_to_fastq assert "query3" not in query_to_fastq
Run the Test with pytest
pytest "SamBuilder_test_bam2fastq.py" |
==================== test session starts ==================== | |
platform darwin -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0 | |
rootdir: /Users/ameynert/scratch/sam_builder_blog | |
collected 6 items | |
SamBuilder_test_bam2fastq.py ...... [100%] | |
===================== 6 passed in 0.04s ===================== |
Additional SamBuilder Functionality
Creating Paired-end Read Alignments
Take a look at SamBuilder_create_pe_test_bam.py
. The only functional differences between this and the previously written scripts to generate single-end read data are in the call to SamBuilder
, where a read length for R2
is provided, and the substitution of add_pair()
for add_single()
. The default strand orientation for add_pair()
is F1R2
(strand1='+', strand2='-'
).
builder = SamBuilder(r1_len=10, r2_len=10) | |
for offset in range(num_reads): | |
builder.add_pair( | |
chrom1=”chr1″, | |
start1=100 + offset, | |
chrom2=”chr2″, | |
start2=200 – offset | |
) | |
for offset in range(num_unmapped): | |
builder.add_pair() | |
builder.to_path(test_bam_file) |
Create a paired-end BAM
file.
python "SamBuilder_create_pe_test_bam.py" -o "test_pe.bam" -n 3 -u 2 |
View the SAM
records with samtools view
.
samtools view "test_pe.bam" |
QNAME | FLAG | RNAME | POS | CIGAR | MAPQ | CIGAR | RNEXT | TLEN | SEQ | QUAL | RG | MATE_CIGAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
q0000 | 97 | chr1 | 101 | 60 | 10M | chr2 | 201 | 0 | GACAGGTACA | ?????????? | RG:Z:1 | MC:Z:10M | |
q0001 | 97 | chr1 | 102 | 60 | 10M | chr2 | 200 | 0 | TGCATCAATG | ?????????? | RG:Z:1 | MC:Z:10M | |
q0002 | 97 | chr1 | 103 | 60 | 10M | chr2 | 199 | 0 | GAACAAACGC | ?????????? | RG:Z:1 | MC:Z:10M | |
q0002 | 145 | chr2 | 199 | 60 | 10M | chr1 | 103 | 0 | CACTGGAGAC | ?????????? | RG:Z:1 | MC:Z:10M | |
q0001 | 145 | chr2 | 200 | 60 | 10M | chr1 | 102 | 0 | TGGTCGTGTG | ?????????? | RG:Z:1 | MC:Z:10M | |
q0000 | 145 | chr2 | 201 | 60 | 10M | chr1 | 101 | 0 | AGAAGGAGTA | ?????????? | RG:Z:1 | MC:Z:10M | |
q0003 | 77 | * | 0 | 0 | * | * | 0 | 0 | TGGGTTAACC | ?????????? | RG:Z:1 | ||
q0003 | 141 | * | 0 | 0 | * | * | 0 | 0 | ATTCGCTCCA | ?????????? | RG:Z:1 |
When unit testing, builder.add_pair
returns the read pair as a Tuple
of two AlignedSegment
objects.
builder = SamBuilder() | |
(read1, read2) = builder.add_pair() |
The Template
Container
fgpyo
has a handy Template
container for all the AlignedSegment
objects belonging to the same original sequence template. The Template.build()
function is used to group these together with some basic validation.
builder = SamBuilder() | |
(read1, read2) = builder.add_pair() | |
template = Template.build([read1, read2]) |
Template
has only 7 properties, a name
plus 3 for each of R1
and R2
:
r[1,2]
: optional primary alignmentr[1,2]_supplementals
: supplementary alignmentsr[1,2]_secondaries
: secondary alignments
This blog hasn’t looked at secondary alignments at all, but just to remind you a secondary alignment is an alternative mapping of the representative alignment, and a supplementary alignment is an alignment of a part of the read that was not aligned in the primary alignment.
Supplementary Alignments
The examples above with supplementary alignments were creating standalone alignments that were not associated with a given primary alignment. Template
can be useful to group these together.
Here’s an example where R1
has a supplementary alignment.
builder = SamBuilder(r1_len=20, r2_len=20) | |
(read1, read2) = builder.add_pair( | |
name="query", | |
chrom1="chr1", | |
start1=100, | |
bases1="A" * 10, | |
chrom2="chr1", | |
start2=200, | |
bases2="C" * 20, | |
) | |
read1_supp = builder.add_single( | |
name="query", | |
read_num=1, | |
chrom="chr2", | |
start=190, | |
supplementary=True | |
) | |
template = Template.build([read1, read2, read1_supp]) |
Complex sets of pairs and supplementary alignments can be built to represent split-read and read pair evidence for testing tools that detect structural variation or genome editing.
Suggested Advanced Uses
Use cigar strings to test clipping or variant detection. Check out the
cigar
submodule infgpyo
for some additional handy manipulation functions.Use the optional predicate function in
builder.to_path()
to filter SAM records on-the-fly.Add header and read group information to mock the output from picky tools.
Use a custom sequence dictionary for testing code that’s designed to work on PCR constructs or transcripts.
Further Reading
Kamali et al 2015 ‘How to test bioinformatics software?’ – discusses the difficulty of covering the test space in bioinformatics: “the test case selection problem”