Converts a samples x SNPs genotype matrix coded as 0/1/2/NA or AA/AB/BB to a snpStats::SnpMatrix object. The function validates the genotype coding, maps missing-value codes to missing genotypes, optionally checks for duplicate sample or SNP IDs, and preserves the row and column names of the input.

as_snpmatrix(
  geno,
  coding = c("012", "AAABBB"),
  missing_codes = c("NA", "-9", ".", ""),
  check_ids = TRUE
)

Arguments

geno

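A samples x SNPs matrix or data.frame of genotypes. Numeric or integer input must be coded as 0, 1, 2, or NA. Character input must follow the scheme given by coding ("0"/"1"/"2" or "AA"/"AB"/"BB"), with missing values written as one of missing_codes. Row names are sample IDs and column names are SNP IDs.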

coding

One of "012" or "AAABBB". Only used when geno is character. "012" expects "0", "1", "2", and any of missing_codes. "AAABBB" expects "AA", "AB", "BB", and any of missing_codes.

missing_codes

Character values to treat as missing (only used when geno is character). Defaults to c("NA", "-9", ".", "").

check_ids

If TRUE, verifies that row and column names are unique (recommended).
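
The uniqueness check described for check_ids can be approximated in base R. This is a minimal sketch for illustration, not the function's actual implementation; the helper name find_duplicate_ids is hypothetical.

```r
# Hypothetical helper (not part of the package): return IDs that occur
# more than once in a character vector.
find_duplicate_ids <- function(ids) {
  unique(ids[duplicated(ids)])
}

# Toy genotype matrix with a repeated sample ID.
geno_dup <- matrix(
  0L, nrow = 3, ncol = 2,
  dimnames = list(c("ind1", "ind2", "ind1"), c("snp1", "snp2"))
)

dup_samples <- find_duplicate_ids(rownames(geno_dup))
dup_snps    <- find_duplicate_ids(colnames(geno_dup))

dup_samples          # "ind1"
length(dup_snps)     # 0
```

Running a check like this before conversion makes it easy to report which IDs are duplicated rather than only that a duplicate exists.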

Value

A snpStats::SnpMatrix with the same dimnames as geno.

Details

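The function accepts both matrix and data.frame inputs. A data.frame is converted with as.matrix(), which coerces all columns to a common type (character if any column is character) and keeps the column names and any explicitly set row names. Automatic row names (1, 2, ...) are not carried over as rownames, so set sample IDs explicitly before calling the function.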
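
The coercion step can be seen with base R alone. The sketch below shows what as.matrix() does to a mixed-type data.frame, followed by one possible way to recode the resulting character matrix. The recoding is an assumption about how missing_codes might be applied, not the function's actual code.

```r
# A data.frame that mixes an integer column with a character column.
df <- data.frame(
  snp1 = c(0L, 1L, 2L),
  snp2 = c("2", ".", "0"),
  row.names = c("ind1", "ind2", "ind3"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)   # mixed columns collapse to one character matrix
typeof(m)            # "character"
dimnames(m)          # explicit row names and column names are kept

# Assumed recoding step: set the listed codes to NA, then convert to integer.
missing_codes <- c("NA", "-9", ".", "")
m[m %in% missing_codes] <- NA
g <- matrix(as.integer(m), nrow = nrow(m), dimnames = dimnames(m))
g
```

Because a single character column turns the whole matrix into character, the coding and missing_codes arguments apply to every column of such a data.frame, including those that were numeric in the original.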

The returned SnpMatrix object stores each genotype as a single byte, a quarter of the 4 bytes R uses per integer. Large datasets still require substantial RAM, because the whole matrix is held in memory. For very large genotype sets, consider a disk-backed representation instead, such as GDS files (SNPRelate) or file-backed matrices (bigsnpr).
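
The storage difference can be checked with a rough base R comparison. This sketch uses a plain raw matrix as a stand-in for one-byte-per-genotype storage, so it runs without snpStats; the dimensions are arbitrary illustration values.

```r
n_samples <- 1000
n_snps    <- 2000

raw_m <- matrix(as.raw(0), nrow = n_samples, ncol = n_snps)  # 1 byte per cell
int_m <- matrix(0L,        nrow = n_samples, ncol = n_snps)  # 4 bytes per cell

raw_bytes <- as.numeric(object.size(raw_m))
int_bytes <- as.numeric(object.size(int_m))

raw_bytes              # roughly 2e6 bytes plus a small header
int_bytes              # roughly 8e6 bytes plus a small header
int_bytes / raw_bytes  # close to 4

# Back-of-envelope estimate for a hypothetical 10,000 x 500,000 dataset
# at one byte per genotype, in GiB:
est_gib <- 10000 * 500000 / 1024^3
est_gib                # about 4.7
```

An estimate like the last line is a quick way to decide whether an in-memory SnpMatrix is feasible on a given machine before attempting the conversion.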

Examples

# Numeric 0/1/2 with NAs
set.seed(1)
geno <- matrix(sample(c(0L,1L,2L,NA), 20, replace=TRUE), nrow=5)
rownames(geno) <- paste0("ind", 1:5)
colnames(geno) <- paste0("snp", 1:4)
SM <- as_snpmatrix(geno)

# Character AA/AB/BB
geno_c <- matrix(sample(c("AA","AB","BB","."), 20, replace=TRUE,
                        prob=c(.35,.3,.3,.05)), nrow=5)
rownames(geno_c) <- rownames(geno)
colnames(geno_c) <- colnames(geno)
SMc <- as_snpmatrix(geno_c, coding="AAABBB", missing_codes=".")