---
title: "Data Ingestion"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Ingestion}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  warning = FALSE,
  message = FALSE
)
```

## Overview

dbSequence stores genomic data in DuckDB, enabling analysis of datasets larger than available memory. Data stays in the database; R operates on lazy references.

## Supported Formats

| Format  | Function     | BiocIO Class |
|---------|--------------|--------------|
| BED     | `read_bed()` | `BEDFile`    |
| VCF     | `read_vcf()` | `VcfFile`    |
| GFF/GTF | `read_gff()` | `GFFFile`    |
| BAM     | `read_bam()` | `BamFile`    |

## Quick Start

```{r quick-start}
library(dbSequence)

bed_file <- system.file("extdata", "example.bed", package = "dbSequence")

# Import a BED file
db_seq <- read_bed(bed_file)

# Data stays in DuckDB - this is a lazy reference
db_seq
```

## BiocIO Pattern

For more control, use the BiocIO `import()` function:

```{r biocio-pattern}
library(BiocIO)
library(rtracklayer)

bed_file <- system.file("extdata", "example.bed", package = "dbSequence")
db_file <- tempfile(fileext = ".duckdb")

# Specify destination database
db_seq <- import(
  BEDFile(bed_file),
  dest = DuckDBFile(db_file),
  table_name = "fragments"
)
```

## Lazy Evaluation

The key feature: data does not load into R memory until you explicitly collect it.

```{r lazy-demo}
# This does NOT load data
db_seq <- read_bed(bed_file)

# Still no data in memory - just adds a filter condition
region <- GenomicRanges::GRanges("chr1:100-500")
filtered <- filter_by_overlaps(db_seq, region)

# Data only loads when you explicitly collect
result <- as_granges(filtered)  # NOW data enters R
```

## Multiple Tables

Store multiple datasets in one database:

```{r multiple-tables}
db <- DuckDBFile(tempfile(fileext = ".duckdb"))
gff_file <- system.file("extdata", "example.gff3", package = "dbSequence")

# Import different files into the same database
peaks <- import(BEDFile(bed_file), dest = db, table_name = "peaks")
genes <- import(GFFFile(gff_file), dest = db, table_name = "genes")
```

## Connection Management

dbSequence manages connections automatically, but in long sessions you may want to access the connection or manage one yourself:

```{r connections}
# Access the underlying connection
con <- dbProject::conn(db_seq)

# For manual control, create your own connection
manual_con <- DBI::dbConnect(duckdb::duckdb(), tempfile(fileext = ".duckdb"))
# ... do work ...
# Close the connection and shut down its embedded DuckDB instance
DBI::dbDisconnect(manual_con, shutdown = TRUE)
```

## Next Steps

- [plyranges API](plyranges-api.html) - Range operations on lazy data

```{r sessionInfo}
sessionInfo()
```