--- title: "plyranges API" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{plyranges API} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, warning = FALSE, message = FALSE ) library(dbSequence) library(GenomicRanges) library(dbplyr) ``` ## Overview dbSequence provides plyranges-compatible functions that translate to SQL queries. Operations execute in DuckDB, not R, enabling analysis of datasets larger than memory. ## Core Functions | Function | Description | Returns | |----------|-------------|---------| | `filter_by_overlaps(x, y)` | Ranges in x overlapping y | `dbSequence` (lazy) | | `compute_coverage(x, region, window)` | Binned coverage counts | `data.frame` | | `as_granges(x)` | Convert to GRanges | `GRanges` (collected) | ## filter_by_overlaps Find ranges overlapping a region of interest: ```{r filter-overlaps} # Load example data bed_file <- system.file("extdata", "example.bed", package = "dbSequence") fragments <- read_bed(bed_file) # Define region of interest (example.bed has features on chr1:100-200, 300-400) region <- GRanges("chr1:100-500") # Filter to overlapping fragments - stays lazy overlapping <- filter_by_overlaps(fragments, region) overlapping ``` ### Multiple Regions ```{r multi-region} # Filter against multiple peaks peaks <- GRanges(c("chr1:100-150", "chr1:300-350")) overlapping <- filter_by_overlaps(fragments, peaks) ``` ### dbSequence to dbSequence Both arguments can be lazy. To join them, they must reside in the same database and **share the same connection object**. ```{r lazy-join} # Use a temporary persistent DB temp_db <- tempfile(fileext = ".duckdb") dest <- DuckDBFile(temp_db) # Import both tables to the same database file peaks_orig <- read_bed(bed_file, dest = dest, table_name = "peaks", lazy = FALSE) frags_orig <- read_bed(bed_file, dest = dest, table_name = "fragments", lazy = FALSE) # Create a shared connection (reuse the one from peaks_orig) # This is required because dplyr queries cannot span multiple connections con <- dbProject::conn(peaks_orig) frags_shared <- dbSequence("fragments", file_source = dest, .conn = con) # Now both objects share 'con', so we can join them lazily overlapping <- filter_by_overlaps(frags_shared, peaks_orig) ``` ## compute_coverage Calculate binned coverage for a region: ```{r coverage} # 50bp bins across region coverage <- compute_coverage( fragments, region = "chr1:0-500", window = 50 ) # Returns data.frame with bin_start, bin_end, count head(coverage) ``` ## as_granges Convert to GRanges when you need Bioconductor operations: ```{r as-granges} # Collect subset to GRanges gr <- fragments |> filter_by_overlaps(region) |> as_granges() # Now use standard GenomicRanges operations GenomicRanges::reduce(gr) ``` > **Note:** `as_granges()` collects data into R memory. Filter first. ## Workflow Example Typical analysis pattern: ```{r workflow} # 1. Load data (lazy) - using example.bed fragments <- read_bed(bed_file) # 2. Define analysis region promoter <- GRanges("chr1:50-250") # 3. Filter to region (still lazy) promoter_fragments <- filter_by_overlaps(fragments, promoter) # 4. Compute coverage coverage <- compute_coverage(promoter_fragments, promoter, window = 10) # 5. Collect if needed for custom analysis gr <- as_granges(promoter_fragments) ``` ## File Type Restrictions BAM/CRAM files are blocked from `filter_by_overlaps` due to their indexed nature—use `Rsamtools` instead: ```{r bam-error, error=TRUE} bam_file <- system.file("extdata", "example.bam", package = "dbSequence") bam_data <- read_bam(bam_file) # This should error filter_by_overlaps(bam_data, region) ``` ## Next Steps ```{r sessionInfo} sessionInfo() ```