Simulate metabolic labeling - nucleotide conversion RNA-seq data.

This function takes a vector of true relative abundances and NTRs, and then simulates (i) read counts per gene and (ii) 4sU incorporation and conversion events. Subsequently, it uses the same approach as implemented in the GRAND-SLAM 2.0 software (Juerges et al., Bioinformatics 2018) to estimate the NTR from these simulated data.

Usage

SimulateReadsForSample(
  num.reads = 2e+07,
  rel.abundance = setNames(rlnorm(10000, meanlog = 4.5, sdlog = 1), paste0("Gene",
    1:10000)),
  ntr = setNames(rbeta(10000, 1.5, 3), paste0("Gene", 1:10000)),
  dispersion = 0.05,
  beta.approx = FALSE,
  conversion.reads = FALSE,
  u.content = 0.25,
  u.content.sd = 0.05,
  read.length = 75,
  p.old = 1e-04,
  p.new = 0.04,
  p.new.fit = p.new,
  enforce.parallelized = NA,
  seed = NULL
)

Arguments

num.reads: the total amount of reads for simulation
rel.abundance: named (according to genes) vector of the true relative abundances. Is divided by its sum.
ntr: vector of true NTRs
dispersion: vector of dispersion parameters (should best be estimated by DESeq2)
beta.approx: should the beta approximation of the NTR posterior be computed?
conversion.reads: also output the number of reads with conversion
u.content: the relative frequency of uridines in the reads
u.content.sd: the standard deviation of the u content
read.length: the read length for simulation
p.old: the probability for a conversion in reads originating from old RNA
p.new: the probability for a conversion in reads originating from new RNA
p.new.fit: the probability for a conversion in reads originating from new RNA that is used for fitting (to simulate bias in the estimation of p.new)
enforce.parallelized: should parallelization be used (NA: use state of IsParallel())
seed: seed value for the random number generator (set to make it deterministic!)

Value

a matrix containing, per column, the simulated counts, the simulated NTRs, (potentially the shape parameters of the beta distribution approximation,) and the true relative frequencies and ntrs

Details

The simulation proceeds as follows:

Draw for each gene the number of reads from a negative binomial distribution parametrized with the relative abundances x read number and the dispersion parameter
For each gene: Draw for each read the number of uridines according to a beta binomial distribution for the given read length (the beta prior is parametrized to match the u.content and u.content.sd parameters)
For each read: Draw the number of conversions according to the binomial mixture model of GRAND-SLAM (parametrized with p_old, p_new, the gene specific NTR and the read specific number of uridines)
Estimate the NTR by using the GRAND-SLAM approach

Examples

SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9)
#>      count       ntr true_freq true_ntr
#> [1,]  2013 0.9317620       0.2      0.9
#> [2,]  2818 0.8905261       0.2      0.9
#> [3,]  1849 0.9493985       0.2      0.9
#> [4,]  1195 0.8917638       0.2      0.9
#> [5,]  1671 0.8820507       0.2      0.9
SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9,seed=1337)
#>      count       ntr true_freq true_ntr
#> [1,]  1970 0.9088477       0.2      0.9
#> [2,]  1733 0.8445619       0.2      0.9
#> [3,]  3158 0.8908687       0.2      0.9
#> [4,]  2866 0.8973888       0.2      0.9
#> [5,]  2173 0.9158799       0.2      0.9
SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9,seed=1337)
#>      count       ntr true_freq true_ntr
#> [1,]  1970 0.9088477       0.2      0.9
#> [2,]  1733 0.8445619       0.2      0.9
#> [3,]  3158 0.8908687       0.2      0.9
#> [4,]  2866 0.8973888       0.2      0.9
#> [5,]  2173 0.9158799       0.2      0.9
# the second and third matrix should be equal, the first should be distinct