| Title: | DuckDB-Backed Interface for Large-Scale Genomic Data |
|---|---|
| Description: | Provides DuckDB-backed infrastructure for working with large-scale genomic data that exceeds available memory. Supports lazy evaluation of genomic ranges and efficient overlap queries. Integrates with Bioconductor classes (GRanges, BiocIO) and provides a plyranges-compatible API. |
| Authors: | Edward C. Ruiz [aut, cre] (ORCID: <https://orcid.org/0000-0002-9174-5387>) |
| Maintainer: | Edward C. Ruiz <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.99.0 |
| Built: | 2026-05-28 02:56:15 UTC |
| Source: | https://github.com/dbverse-org/dbSequence |
S3 generic for converting various objects to GRanges. This generic is provided for plyranges compatibility when plyranges is not loaded.
S3 method for plyranges compatibility. Converts a dbSequence object to a GRanges object by collecting data from DuckDB.

    as_granges(.data, ..., keep_mcols = TRUE)

    ## Default S3 method:
    as_granges(.data, ..., keep_mcols = TRUE)

    ## S3 method for class 'dbSequence'
    as_granges(.data, ..., keep_mcols = TRUE)

| Argument | Description |
|---|---|
| .data | A dbSequence object |
| ... | Additional arguments (currently unused) |
| keep_mcols | logical: whether to keep metadata columns (default: TRUE) |

This method enables plyranges-style workflows:
db_seq %>% as_granges()
Note that this collects all data from DuckDB into memory. For large datasets, consider filtering first using dplyr verbs.
A GRanges object

    # Import BED file to dbSequence
    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)

    # Convert to GRanges (collects data)
    gr <- as_granges(db_seq)

Returns a view of the dbSequence with only the core range columns (seqnames, start, end, strand). This is useful for range operations that don't need metadata columns.

    ## S4 method for signature 'dbSequence'
    asRanges(x, ...)

| Argument | Description |
|---|---|
| x | (required) A dbSequence object |
| ... | (optional) Additional arguments (currently unused) |

A dbSequence view with only range columns

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_path <- tempfile(fileext = ".duckdb")
    db_seq <- read_bed(bed, dest = DuckDBFile(db_path), lazy = FALSE)
    asRanges(db_seq)

Extract genomic ranges from dbSequence objects

    asRanges(x, ...)

| Argument | Description |
|---|---|
| x | (required) A dbSequence object |
| ... | (optional) Additional arguments |

A view of the dbSequence with only range columns (seqnames, start, end, strand)

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_path <- tempfile(fileext = ".duckdb")
    db_seq <- read_bed(bed, dest = DuckDBFile(db_path), lazy = FALSE)
    asRanges(db_seq)

Methods for converting between dbSequence and GRanges objects.
as(dbSequence, "GRanges") materializes a dbSequence object into an
in-memory GRanges object, triggering data collection from the DuckDB database.
as(GRanges, "dbSequence") loads a GRanges object into a DuckDB table
and wraps it as a dbSequence, enabling lazy evaluation of operations.

| Argument | Description |
|---|---|
| from | Object to convert (dbSequence or GRanges) |

Converted object (GRanges or dbSequence respectively)
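
In S4, coercions invoked through as() are usually registered with methods::setAs(). The sketch below illustrates the materialize-on-coerce idea using only base R. The classes LazyRanges and MemRanges are hypothetical stand-ins and are not part of dbSequence; how the package itself registers its coercions is not shown in this manual.

```r
library(methods)

# Hypothetical toy classes (assumptions, not dbSequence classes):
# "LazyRanges" stands in for a database-backed object,
# "MemRanges" for an in-memory one.
setClass("MemRanges", representation(data = "data.frame"))
setClass("LazyRanges", representation(fetch = "function"))

# Coercing to the in-memory class is the step that runs the deferred fetch.
setAs("LazyRanges", "MemRanges", function(from) {
  new("MemRanges", data = from@fetch())
})

# Coercing back wraps the data in a closure that is only evaluated on demand.
setAs("MemRanges", "LazyRanges", function(from) {
  stored <- from@data
  new("LazyRanges", fetch = function() stored)
})

lazy <- new("LazyRanges", fetch = function() {
  data.frame(seqnames = "chr1", start = 1L, end = 10L, stringsAsFactors = FALSE)
})
mem <- as(lazy, "MemRanges")                           # data is collected here
round_trip <- as(as(mem, "LazyRanges"), "MemRanges")   # and survives a round trip
```

The point of the sketch is that the cost of collection is paid at the as() call, which is why large objects are best filtered before coercion.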
Compute coverage (count of overlapping fragments) in bins across a genomic region. This is useful for visualization and summarizing fragment density.

    compute_coverage(x, region, window = 100, ...)

    ## S3 method for class 'dbSequence'
    compute_coverage(x, region, window = 100, ...)

| Argument | Description |
|---|---|
| x | A dbSequence object |
| region | A GRanges object or string like "chr1:1000-2000" |
| window | Bin size in base pairs (default: 100) |
| ... | Additional arguments |

A data frame with columns: bin_start, bin_end, count
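
To make the binning concrete, here is a minimal base R sketch of the same idea: parse a region string, cut it into fixed-width bins, and count the fragments that overlap each bin. The helpers parse_region() and bin_coverage() are hypothetical names introduced for illustration, and the closed-interval bin boundaries are an assumption. The package performs this work in DuckDB and may treat boundaries differently.

```r
# Parse a region string such as "chr1:1-400" into its components.
# (Hypothetical helper, not part of dbSequence.)
parse_region <- function(region) {
  m <- regmatches(region, regexec("^([^:]+):([0-9]+)-([0-9]+)$", region))[[1]]
  if (length(m) != 4L) stop("region must look like 'chr1:1000-2000'")
  list(seqnames = m[2], start = as.integer(m[3]), end = as.integer(m[4]))
}

# Count the fragments overlapping each fixed-width bin of the region.
# Bins are closed intervals [bin_start, bin_end]; this is an assumption.
bin_coverage <- function(fragments, region, window = 100L) {
  r <- parse_region(region)
  bin_start <- seq(r$start, r$end, by = window)
  bin_end <- pmin(bin_start + window - 1L, r$end)
  on_chr <- fragments[fragments$seqnames == r$seqnames, , drop = FALSE]
  count <- vapply(seq_along(bin_start), function(i) {
    sum(on_chr$start <= bin_end[i] & on_chr$end >= bin_start[i])
  }, integer(1))
  data.frame(bin_start = bin_start, bin_end = bin_end, count = count)
}

fragments <- data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr2"),
  start    = c(50L, 120L, 390L, 10L),
  end      = c(150L, 180L, 450L, 90L),
  stringsAsFactors = FALSE
)
cov <- bin_coverage(fragments, "chr1:1-400", window = 100L)
print(cov)
```

A fragment that spans a bin boundary is counted in every bin it touches, so the counts do not sum to the number of fragments.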

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    fragments <- read_bed(bed)
    coverage <- compute_coverage(fragments, "chr1:100-500", window = 100)
    coverage

S3 method for dplyr::compute() that materializes the underlying lazy
query into a DuckDB table, returning a new dbSequence pointing at the
computed table.

    compute.dbSequence(
      x,
      name = tableName(x),
      temporary = TRUE,
      overwrite = TRUE,
      ...
    )

| Argument | Description |
|---|---|
| x | A dbSequence object |
| name | Character table name to create. Defaults to |
| temporary | Logical, create a temporary table (default: TRUE) |
| overwrite | Logical, overwrite existing table (default: TRUE) |
| ... | Passed to |

A dbSequence object backed by the materialized table.

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)
    materialized <- compute.dbSequence(db_seq, name = "bed_materialized")
    materialized

Creates a dbSequence object by wrapping a table name. This is typically called internally by import methods.

    dbSequence(table_name, file_source = NULL, .conn = NULL)

| Argument | Description |
|---|---|
| table_name | (required) character: name of the table in the DuckDB database |
| file_source | (required) character or DuckDBFile: source file or database |
| .conn | (optional) DBIConnection: existing connection for in-memory databases (default: NULL) |

A dbSequence object

    db_path <- tempfile(fileext = ".duckdb")
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path)
    DBI::dbWriteTable(con, "ranges", data.frame(seqnames = "chr1", start = 1, end = 10))
    DBI::dbDisconnect(con, shutdown = TRUE)
    db_seq <- dbSequence("ranges", DuckDBFile(db_path))
    db_seq

S4 class for genomic sequences stored in DuckDB
Extends dbData to wrap genomic data stored in DuckDB tables.

| Slot | Description |
|---|---|
| value | dplyr tbl representing the genomic data in the database (inherited from dbData) |
| name | character table name in database (inherited from dbData) |
| file_source | character source file path or identifier (immutable after creation) |


    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)
    db_seq

Creates a DuckDBFile object that wraps a file path for DuckDB databases. This is a simple path wrapper following the BiocFile pattern and supports both file paths and in-memory databases.

    DuckDBFile(resource)

| Argument | Description |
|---|---|
| resource | (required) character: path to DuckDB file. Can be a file path or ":memory:" for in-memory databases |

DuckDBFile object containing the path information

    # Create DuckDBFile for existing database
    db_file <- DuckDBFile(tempfile(fileext = ".duckdb"))

    # Create DuckDBFile for in-memory database
    mem_db <- DuckDBFile(":memory:")

File class for DuckDB database files, extending BiocFile
This mirrors the File classes in rtracklayer, giving us
a concrete object to dispatch import() / export() methods to.
Extends BiocFile to be compatible with BiocIO import/export framework.

| Slot | Description |
|---|---|
| path | Character string specifying the path to the DuckDB database file. Can be a file path or ":memory:" for in-memory databases. |

    db_file <- DuckDBFile(tempfile(fileext = ".duckdb"))
    db_file

Accessor for file_source slot

    fileSource(object)

    ## S4 method for signature 'dbSequence'
    fileSource(object)

    ## S4 replacement method for signature 'dbSequence'
    fileSource(object) <- value

| Argument | Description |
|---|---|
| object | A dbSequence object |
| value | New value (will be rejected) |

The file_source value

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)
    fileSource(db_seq)
    try(fileSource(db_seq) <- "other.bed")

This method prevents modification of the file_source slot after object creation. The file_source is set during import and should not be changed to maintain data integrity.

    fileSource(object) <- value

| Argument | Description |
|---|---|
| object | A dbSequence object |
| value | New value (will be rejected) |

Always errors because file sources are immutable after creation.
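
One common way to get this behavior in S4 is to define the replacement generic and give it a method that always calls stop(). The sketch below shows the pattern on a hypothetical toy class, DemoSeq, which is not the dbSequence class. The error message is also an assumption, since the manual does not give the package's actual wording.

```r
library(methods)

# Hypothetical toy class standing in for dbSequence (assumption).
setClass("DemoSeq", representation(file_source = "character"))

setGeneric("fileSource", function(object) standardGeneric("fileSource"))
setGeneric("fileSource<-", function(object, value) standardGeneric("fileSource<-"))

# Read access returns the slot unchanged.
setMethod("fileSource", "DemoSeq", function(object) object@file_source)

# The replacement method rejects every assignment, whatever the value.
setReplaceMethod("fileSource", "DemoSeq", function(object, value) {
  stop("file_source is immutable after creation")
})

demo <- new("DemoSeq", file_source = "example.bed")
result <- tryCatch(
  {
    fileSource(demo) <- "other.bed"
    "modified"
  },
  error = function(e) "rejected"
)
```

Because the assignment fails before it completes, the original object keeps its file_source value.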

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)
    try(fileSource(db_seq) <- "other.bed")

Filter a dbSequence object to only include ranges that overlap with a set of query ranges. This operation is performed in DuckDB using efficient SQL filtering.

    filter_by_overlaps(x, y, ...)

    ## S3 method for class 'dbSequence'
    filter_by_overlaps(x, y, ...)

| Argument | Description |
|---|---|
| x | A dbSequence object |
| y | A GRanges object or dbSequence object representing query ranges |
| ... | Additional arguments (currently unused) |

Two intervals overlap if: start1 <= end2 AND start2 <= end1 AND chr1 == chr2
The operation is pushed down to DuckDB, so only matching rows are retrieved.
A dbSequence object filtered to overlapping ranges
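
The overlap rule stated above can be checked on plain data frames. The following base R sketch applies the same predicate row by row. overlaps_any() is a hypothetical helper written for illustration, and the loop is only meant to show the logic: in the package the condition is evaluated by DuckDB rather than in R.

```r
# TRUE for each range in x that overlaps at least one range in y, using
# start1 <= end2 AND start2 <= end1 AND chr1 == chr2 (closed intervals).
# (Hypothetical helper, not part of dbSequence.)
overlaps_any <- function(x, y) {
  vapply(seq_len(nrow(x)), function(i) {
    any(x$seqnames[i] == y$seqnames &
          x$start[i] <= y$end &
          y$start <= x$end[i])
  }, logical(1))
}

x <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start    = c(50L, 600L, 100L),
  end      = c(120L, 700L, 200L),
  stringsAsFactors = FALSE
)
y <- data.frame(seqnames = "chr1", start = 100L, end = 500L,
                stringsAsFactors = FALSE)

kept <- x[overlaps_any(x, y), , drop = FALSE]
print(kept)
```

Only the first range is kept: the second lies past the end of the query, and the third has matching coordinates but sits on a different chromosome.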

    # Filter fragments to a specific region
    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    fragments <- read_bed(bed)
    region <- GenomicRanges::GRanges("chr1:100-500")
    filtered <- filter_by_overlaps(fragments, region)
    filtered

Import BAM files into DuckDB

    ## S4 method for signature 'BamFile,missing,ANY'
    import(con, format, text, dest, table_name = "bam_data", .conn = NULL, ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| .conn | Optional existing DBI connection. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Import BED files into DuckDB

    ## S4 method for signature 'BEDFile,missing,ANY'
    import(con, format, text, dest, table_name = "bed_data", ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Import character file paths into DuckDB (auto-detect format)

    ## S4 method for signature 'character,missing,ANY'
    import(con, format, text, dest, table_name = NULL, ...)

| Argument | Description |
|---|---|
| con | A file path. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. If |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
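
The manual does not say how the format is auto-detected. One plausible approach, shown here purely as an assumption, is to look at the file extension after stripping a compression suffix. detect_format() and its extension table are hypothetical and do not describe the package's actual dispatch logic.

```r
# Guess a genomic file format from its extension.
# (Hypothetical helper; the extension table is an assumption.)
detect_format <- function(path) {
  # Drop a trailing compression suffix before reading the extension.
  base <- sub("\\.(gz|bgz|bz2)$", "", basename(path), ignore.case = TRUE)
  ext <- tolower(tools::file_ext(base))
  formats <- c(
    bed = "bed", gff = "gff", gff3 = "gff", gtf = "gtf",
    vcf = "vcf", bam = "bam", fa = "fasta", fasta = "fasta"
  )
  if (!ext %in% names(formats)) {
    stop("cannot detect format of: ", path)
  }
  unname(formats[ext])
}

fmt_bed <- detect_format("peaks.bed")
fmt_vcf <- detect_format("calls.VCF.gz")
fmt_unknown <- tryCatch(detect_format("notes.txt"), error = function(e) NA_character_)
```

Failing loudly on an unknown extension is a deliberate choice in this sketch, since silently falling back to a default format could load a file with the wrong column layout.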
Import FASTA files into DuckDB

    ## S4 method for signature 'FastaFile,missing,ANY'
    import(con, format, text, dest, table_name = "fasta_data", ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Import GFF files into DuckDB

    ## S4 method for signature 'GFFFile,missing,ANY'
    import(con, format, text, dest, table_name = "gff_data", ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Import GTF files into DuckDB

    ## S4 method for signature 'GTFFile,missing,ANY'
    import(con, format, text, dest, table_name = "gtf_data", ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Import VCF files into DuckDB

    ## S4 method for signature 'VcfFile,missing,ANY'
    import(con, format, text, dest, table_name = "vcf_data", ...)

| Argument | Description |
|---|---|
| con | A file connection object. |
| format | Import format; unused for these methods. |
| text | Text input; unused for these methods. |
| dest | A |
| table_name | Table name to create. |
| ... | Additional arguments. |

A dbSequence object backed by a DuckDB table.
Aggregates values in a lazy database table by grouping columns. This is
the database-native equivalent of dplyr::group_by() + summarise() but
with a simpler interface optimized for common pooling operations.

    pool(x, group_by, value_col = NULL, fun = "sum", name = NULL,
         filter_zero = TRUE, temporary = TRUE, overwrite = TRUE, ...)

    ## Default S3 method:
    pool(x, group_by, value_col = NULL, fun = "sum", name = NULL,
         filter_zero = TRUE, temporary = TRUE, overwrite = TRUE, ...)

    ## S3 method for class 'tbl_duckdb_connection'
    pool(x, group_by, value_col = "x", fun = "sum", name = NULL,
         filter_zero = TRUE, temporary = TRUE, overwrite = TRUE, ...)

    ## S3 method for class 'tbl_dbi'
    pool(x, group_by, value_col = "x", fun = "sum", name = NULL,
         filter_zero = TRUE, temporary = TRUE, overwrite = TRUE, ...)

    ## S3 method for class 'dbSequence'
    pool(x, group_by, value_col = "score", fun = "sum", name = NULL,
         filter_zero = TRUE, temporary = TRUE, overwrite = TRUE, ...)

| Argument | Description |
|---|---|
| x | A dbSequence object, tbl_duckdb_connection, or other lazy table |
| group_by | Character vector of column names to group by |
| value_col | Character, the column to aggregate. Default is "score" for dbSequence, "x" for tbl objects. |
| fun | Character, aggregation function: "sum" (default), "mean", "count", "min", "max" |
| name | Character, optional name for the resulting computed table. If NULL, returns a lazy query without materializing. |
| filter_zero | Logical, if TRUE (default) filter out rows where aggregated value == 0 |
| temporary | Logical, if TRUE (default) create a temporary table, otherwise create a permanent table |
| overwrite | Logical, if TRUE (default) overwrite existing table |
| ... | Additional arguments passed to dplyr::compute() |

Same type as input (lazy, still in database). For dbSequence, returns dbSequence if range columns are preserved, otherwise returns tbl.
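
For intuition, the grouping, aggregation, and zero-filtering steps can be reproduced in memory with base R. pool_sketch() below is a hypothetical stand-in that mirrors the group_by, value_col, fun, and filter_zero arguments on an ordinary data frame. It is not the package's implementation, which runs in the database, and mapping "count" to length() is an assumption.

```r
# In-memory analogue of pooling: group, aggregate, optionally drop zeros.
# (Hypothetical helper, not part of dbSequence.)
pool_sketch <- function(df, group_by, value_col, fun = "sum", filter_zero = TRUE) {
  agg <- switch(fun,
    sum = sum, mean = mean, min = min, max = max,
    count = length,  # assumption: "count" means rows per group
    stop("unsupported fun: ", fun)
  )
  out <- aggregate(df[value_col], by = df[group_by], FUN = agg)
  if (filter_zero) {
    out <- out[out[[value_col]] != 0, , drop = FALSE]
  }
  rownames(out) <- NULL
  out
}

features <- data.frame(
  name  = c("a", "a", "b", "c"),
  score = c(1, 2, 0, 5),
  stringsAsFactors = FALSE
)
pooled <- pool_sketch(features, group_by = "name", value_col = "score")
print(pooled)
```

Group "b" sums to zero and is therefore dropped under the default filter_zero = TRUE; passing filter_zero = FALSE keeps it.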

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)

    # Pool feature scores by feature name
    pooled <- pool(db_seq, group_by = "name", value_col = "score")
    pooled

plyranges-style function to read BAM files. Returns a dbSequence object backed by DuckDB.

    read_bam(file, dest = DuckDBFile(":memory:"), table_name = "alignments", ...)

| Argument | Description |
|---|---|
| file | Path to BAM file |
| dest | DuckDBFile or path: destination database (default: in-memory) |
| table_name | character: name for the table (default: "alignments") |
| ... | Additional arguments passed to import() |

A dbSequence object
Unlike plyranges::read_bam which returns a DeferredGenomicRanges, this function immediately imports the BAM data into DuckDB.

    if (nzchar(system.file(package = "exonr"))) {
      bam <- system.file("extdata", "example.bam", package = "dbSequence")
      db_seq <- read_bam(bam, dest = DuckDBFile(tempfile(fileext = ".duckdb")))
      db_seq
    }

plyranges-style function to read BED files. Unlike the plyranges version which returns GRanges, this returns a dbSequence object backed by DuckDB for lazy evaluation.

    read_bed(
      file,
      dest = DuckDBFile(":memory:"),
      table_name = "bed_data",
      lazy = TRUE,
      ...
    )

| Argument | Description |
|---|---|
| file | Path to BED file |
| dest | DuckDBFile or path: destination database (default: in-memory) |
| table_name | character: name for the table (default: "bed_data") |
| lazy | logical: if TRUE (default), do not import/copy the file into a DuckDB table. Instead, return a dbSequence backed by a DuckDB scan query (an inline SQL SELECT over the file). Call |
| ... | Additional arguments passed to import() |

A dbSequence object

    # Read BED file (lazy, stays in DuckDB)
    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)

    # Collect to GRanges when needed
    gr <- as_granges(db_seq)

plyranges-style function to read GFF files. Returns a dbSequence object backed by DuckDB.

    read_gff(
      file,
      dest = DuckDBFile(":memory:"),
      table_name = "gff_data",
      lazy = TRUE,
      ...
    )

    read_gff3(
      file,
      dest = DuckDBFile(":memory:"),
      table_name = "gff_data",
      lazy = TRUE,
      ...
    )

| Argument | Description |
|---|---|
| file | Path to GFF file |
| dest | DuckDBFile or path: destination database (default: in-memory) |
| table_name | character: name for the table (default: "gff_data") |
| lazy | logical: if TRUE (default), do not import/copy the file into a DuckDB table. Instead, return a dbSequence backed by a DuckDB scan query (an inline SQL SELECT over the file). Call |
| ... | Additional arguments passed to import() |

A dbSequence object

    gff <- system.file("extdata", "example.gff3", package = "dbSequence")
    db_seq <- read_gff(gff)
    db_seq

Read VCF variant files into DuckDB. Returns a dbSequence object.

    read_vcf(
      file,
      dest = DuckDBFile(":memory:"),
      table_name = "variants",
      lazy = TRUE,
      ...
    )

| Argument | Description |
|---|---|
| file | Path to VCF file |
| dest | DuckDBFile or path: destination database (default: in-memory) |
| table_name | character: name for the table (default: "variants") |
| lazy | logical: if TRUE (default), do not import/copy the file into a DuckDB table. Instead, return a dbSequence backed by a DuckDB scan query (an inline SQL SELECT over the file). Call |
| ... | Additional arguments passed to import() |

A dbSequence object

    vcf <- system.file("extdata", "example.vcf", package = "dbSequence")
    db_seq <- read_vcf(vcf)
    db_seq

Display a summary of the dbSequence object and preview the table data

    ## S4 method for signature 'dbSequence'
    show(object)

| Argument | Description |
|---|---|
| object | A dbSequence object |

Invisibly returns NULL.
Display information about the DuckDBFile object.

    ## S4 method for signature 'DuckDBFile'
    show(object)

| Argument | Description |
|---|---|
| object | A DuckDBFile object |

Invisibly returns NULL.
Extract the table name from a dbSequence object.

    tableName(x)

    ## S4 method for signature 'dbSequence'
    tableName(x)

| Argument | Description |
|---|---|
| x | (required) A dbSequence object |

character: the table name

    bed <- system.file("extdata", "example.bed", package = "dbSequence")
    db_seq <- read_bed(bed)
    tableName(db_seq)
