Simulate metabolic labeling - nucleotide conversion RNA-seq data.
Source:R/readsimulator.R
SimulateReadsForSample.Rd
This function takes a vector of true relative abundances and NTRs, and then simulates (i) read counts per gene and (ii) 4sU incorporation and conversion events. Subsequently, it uses the same approach as implemented in the GRAND-SLAM 2.0 software (Juerges et al., Bioinformatics 2018) to estimate the NTR from these simulated data.
Usage
SimulateReadsForSample(
num.reads = 2e+07,
rel.abundance = setNames(rlnorm(10000, meanlog = 4.5, sdlog = 1), paste0("Gene",
1:10000)),
ntr = setNames(rbeta(10000, 1.5, 3), paste0("Gene", 1:10000)),
dispersion = 0.05,
beta.approx = FALSE,
conversion.reads = FALSE,
u.content = 0.25,
u.content.sd = 0.05,
read.length = 75,
p.old = 1e-04,
p.new = 0.04,
p.new.fit = p.new,
seed = NULL
)
Arguments
- num.reads
the total amount of reads for simulation
- rel.abundance
named (according to genes) vector of the true relative abundances. Is divided by its sum.
- ntr
vector of true NTRs
- dispersion
vector of dispersion parameters (should best be estimated by DESeq2)
- beta.approx
should the beta approximation of the NTR posterior be computed?
- conversion.reads
also output the number of reads with conversion
- u.content
the relative frequency of uridines in the reads
- u.content.sd
the standard deviation of the u content
- read.length
the read length for simulation
- p.old
the probability for a conversion in reads originating from old RNA
- p.new
the probability for a conversion in reads originating from new RNA
- p.new.fit
the probability for a conversion in reads originating from new RNA that is used for fitting (to simulate bias in the estimation of p.new)
- seed
seed value for the random number generator (set to make it deterministic!)
Value
a matrix containing, per column, the simulated counts, the simulated NTRs, (potentially the shape parameters of the beta distribution approximation,) and the true relative frequencies and ntrs
Details
The simulation proceeds as follows:
Draw for each gene the number of reads from a negative binomial distribution parametrized with the relative abundances x read number and the dispersion parameter
For each gene: Draw for each read the number of uridines according to a beta binomial distribution for the given read length (the beta prior is parametrized to match the u.content and u.content.sd parameters)
For each read: Draw the number of conversions according to the binomial mixture model of GRAND-SLAM (parametrized with p_old, p_new, the gene specific NTR and the read specific number of uridines)
Estimate the NTR by using the GRAND-SLAM approach
Examples
SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9)
#> count ntr true_freq true_ntr
#> [1,] 2013 0.9317620 0.2 0.9
#> [2,] 2818 0.8905261 0.2 0.9
#> [3,] 1849 0.9493985 0.2 0.9
#> [4,] 1195 0.8917638 0.2 0.9
#> [5,] 1671 0.8820507 0.2 0.9
SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9,seed=1337)
#> count ntr true_freq true_ntr
#> [1,] 1970 0.9088477 0.2 0.9
#> [2,] 1733 0.8445619 0.2 0.9
#> [3,] 3158 0.8908687 0.2 0.9
#> [4,] 2866 0.8973888 0.2 0.9
#> [5,] 2173 0.9158799 0.2 0.9
SimulateReadsForSample(num.reads = 10000,rel.abundance = rep(1,5),ntr=0.9,seed=1337)
#> count ntr true_freq true_ntr
#> [1,] 1970 0.9088477 0.2 0.9
#> [2,] 1733 0.8445619 0.2 0.9
#> [3,] 3158 0.8908687 0.2 0.9
#> [4,] 2866 0.8973888 0.2 0.9
#> [5,] 2173 0.9158799 0.2 0.9
# the second and third matrix should be equal, the first should be distinct