Research | NewLimit

All cells in the human body contain the same DNA code, yet they perform a wide variety of functions. Through diverse epigenetic codes, human cells execute distinct programs from a common genome, analogous to the control flow that guides the execution of specific functions in a large software codebase. Reprogramming the epigenetic code to elicit desired cell functions is both an important therapeutic challenge and a fundamental problem in molecular biology. Activating “payloads” of only 1-6 transcription factors (TFs) is one approach to dramatically reprogram cell type and state, sufficient to convert adult cells to embryonic stem cells or skin cells to neurons, or to restore youthful features in old cells (1, 2, 3). However, the payloads that evoke a mapping between two arbitrary cell states \(A\) and \(B\) are unknown in most cases.

Even if payloads are limited to a handful of TFs (<6), there are ~10^16 combinations within the roughly 2,000 TFs encoded by the human genome. This is 1,000-10,000 times the number of stars in the Milky Way galaxy. The highest throughput experimental methods today leverage single cell genomics to evaluate the effect of ~100-1,000 payloads in parallel. These experiments yield a gene expression measurement for each payload.
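The size of this hypothesis space can be checked with a short count. The sketch below assumes payloads are unordered sets of distinct TFs drawn from roughly 2,000 candidates; the function name `n_payloads` is ours. The total is dominated by the largest payload size, so it shifts by more than two orders of magnitude depending on whether 6-TF payloads are counted.

```python
from math import comb

N_TFS = 2000  # approximate number of human TFs quoted in the text


def n_payloads(max_size: int, n_tfs: int = N_TFS) -> int:
    """Count unordered payloads containing 1..max_size distinct TFs."""
    return sum(comb(n_tfs, k) for k in range(1, max_size + 1))


for max_size in (3, 5, 6):
    print(f"up to {max_size} TFs: {n_payloads(max_size):.2e} payloads")
```

Either way, the count exceeds the ~100-1,000 payloads that a single experiment can evaluate by many orders of magnitude.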

NewLimit leverages similar experimental systems to discover TF payloads that restore youthful function in old cells. Given the scale of the hypothesis space, we cannot exhaustively test all possible payloads. Here, our goal is to develop in silico reprogramming models that can design payloads \(S\) that achieve a desired cell state \(y\).

In particular, we focus on the case of combinatorial predictions where the individual TFs in a payload may have been observed, but the effect of the combination is unknown. Prior work has addressed the related problem of predicting combinatorial genetic interventions, but in practice trivial baseline approaches demonstrate the best performance in this setting. These results suggest that an effective framework has yet to be invented and existing data may be too small for highly parameterized models (4, 5).

Here, we develop a probabilistic modeling approach to design TF payloads to achieve desired cell states and functions. Our approach takes advantage of transfer learning from protein foundation models (6) and learns to generate payloads given only a sparse sampling of the combinatorial TF space. We take advantage of a unique combinatorial reprogramming dataset that is an order of magnitude larger in scale than those previously available. We experimentally demonstrate that our method is superior to existing baselines for multiple cell state prediction tasks, that performance scales with the size of our training set, and that active learning can accelerate payload design in a retrospective campaign.

Our core contributions include:

Demonstration that transfer learning from Molecular Foundation Models enables performant in silico reprogramming
Introduction of a combinatorial reprogramming benchmark that is \(>10\)-fold larger than prior work
Discovery of the first data scaling laws for reprogramming and general perturbation prediction
Development of a computationally efficient reprogramming payload design algorithm that can be used in a lab-in-the-loop setting
Demonstration that in silico reprogramming accelerates discoveries in an active learning campaign

Task Construction

We consider the scenario where we are given a dataset \(\mathcal{D} = \{S_i, y_i\}_{i=1}^N\) where \(S_i = (s^{(1)}, \dots, s^{(K)})\) are TF payloads composed of \(1, \dots, K\) TFs \(s \in \Omega\), where \(\Omega\) is the set of observed individual TFs, and \(y_i\) is a scalar or vector representation of cell state.

In particular, we focus on the combinatorial prediction scenario where all of the effects for individual TFs are observed, such that all sets containing a single TF are in the training set: \(\{\{s\} \mid s \in \Omega\} \subseteq \mathcal{D}\).

We wish to perform two distinct tasks: (1) estimate the distribution \(p(y|S)\) to predict the effect of TF payloads on cell state and (2) design TF payloads to achieve a desired cell state by sampling from the posterior \(S \sim p(S|y)\).

For the first payload prediction task, we perform classic 5-fold cross-validation across TF payloads in our dataset and evaluate the performance of models on an unseen test set. Cross validation folds are constructed so that all payloads are included in exactly one test set. We measure performance using both absolute prediction error (control scaled error, CSE; Pearson correlation coefficient, PCC) and rank-based measures (AUPRC). For a lab-in-the-loop setting, it's common to interrogate the top \(M\) payloads where \(M\) is set by experimental bandwidth, so the ability to rank payloads and assign binary “hit” labels is the most realistic task.
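A minimal sketch of how such folds could be built is shown below. It assumes that single-TF payloads stay in every training split, in line with the combinatorial scenario described above, and that only multi-TF payloads are partitioned across test sets. The function name `make_folds` and the toy TF labels are hypothetical.

```python
import random
from itertools import combinations


def make_folds(payloads, n_folds=5, seed=0):
    """Split multi-TF payloads into disjoint test folds.

    Assumption: single-TF payloads are always kept in the training split.
    """
    singles = [p for p in payloads if len(p) == 1]
    combos = [p for p in payloads if len(p) > 1]
    rng = random.Random(seed)
    rng.shuffle(combos)
    folds = []
    for i in range(n_folds):
        test = combos[i::n_folds]  # every n_folds-th payload, offset by i
        held_out = set(test)
        train = singles + [p for p in combos if p not in held_out]
        folds.append((train, test))
    return folds


# Toy example: 6 hypothetical TFs and all of their pairwise combinations.
tfs = [f"TF{i}" for i in range(6)]
payloads = [frozenset([t]) for t in tfs]
payloads += [frozenset(c) for c in combinations(tfs, 2)]
folds = make_folds(payloads)
```

Each multi-TF payload lands in exactly one test set, and no payload appears in both the training and test split of the same fold.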

For the second payload design task, we construct an active learning benchmark where we perform successive experiments, each testing a group of payloads \(S\) attempting to achieve a target cell state \(y^*\). We use models at each stage to recommend the payloads \(S\) to test in the next round and measure the fraction of “hits” that achieve the desired cell state \(y^*\). The goal is to discover more hits in fewer rounds using a strong model and sampling method for \(S \sim p(S|y)\).

Ambrosia in silico reprogramming models

Here, we introduce a probabilistic in silico reprogramming model we call Ambrosia. Ambrosia leverages transfer learning from a pre-trained protein language model \(\psi: l \rightarrow s\) to generate TF representations \(s\) from protein sequences \(l\) (6).

To represent multi-TF payloads, Ambrosia aggregates TFs within a set \(S\) as \(\xi = \alpha(s^{(1)}, \dots, s^{(K)})\), where \(\alpha\) is an aggregation operation and \(\xi\) is the resulting latent vector representation of the set \(S\). This approach is inspired by work on deep set learning (7).

In practice, we implemented \(\alpha\) as a sum operation over TF representations, though in principle, any permutation invariant operator could be used, including attention mechanisms. This allows Ambrosia models to initialize from rich representations of TF biology, reducing subsequent tasks to learning a distribution of cell states conditioned on payload representations.
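The sum aggregation can be illustrated with a few lines of code. The embedding values below are made-up three-dimensional toy vectors, not outputs of a protein language model, and the function name `aggregate` is ours.

```python
def aggregate(embeddings):
    """Element-wise sum of per-TF embedding vectors (the alpha operation)."""
    return [sum(dims) for dims in zip(*embeddings)]


# Hypothetical toy embeddings; real embeddings would come from the pre-trained model.
toy_embeddings = {
    "OCT4": [0.5, -1.0, 2.0],
    "SOX2": [1.5, 0.25, 0.0],
    "KLF4": [-1.0, 0.5, 1.0],
}

xi_a = aggregate([toy_embeddings[t] for t in ("OCT4", "SOX2", "KLF4")])
xi_b = aggregate([toy_embeddings[t] for t in ("KLF4", "OCT4", "SOX2")])
```

Because addition is commutative, `xi_a` and `xi_b` are identical, which is the permutation invariance required of a set representation.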

We learn the conditional distribution:

\[ p_\theta(y | S) = f_{\theta}(S) \]

where \(f_\theta\) is implemented as a neural network with Monte Carlo dropout for uncertainty estimation (8). Practically, we implemented \(f_\theta\) using a 3-layer neural network with hidden layers of sizes \(\{512, 128\}\). Each layer is paired with a ReLU activation and a dropout layer to allow for regularization and uncertainty estimation. We optimize models with Adam to minimize a mean-squared error (MSE) loss, approximating the optimization of a Gaussian log-likelihood for \(p_\theta(y|S)\) in this setting. For uncertainty estimation, we perform 100 forward passes through the model with dropout active and parametrize \(p_\theta(y|S)\) as an empirical Gaussian distribution from these samples.
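The uncertainty estimate can be sketched in plain Python as follows. This is an illustration under simplifying assumptions rather than the implementation used here: the weights are random and untrained, bias terms are omitted, the hidden sizes (32, 16) are shrunk from \(\{512, 128\}\) to keep the example fast, and the dropout rate of 0.2 is a placeholder because the text does not state it.

```python
import random
import statistics


def make_layer(n_in, n_out, rng):
    """Random weights for one dense layer (untrained, for illustration only)."""
    scale = (2.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(x, hidden_layers, head, rng, p_drop=0.2):
    """One stochastic forward pass with dropout left active."""
    h = x
    for weights in hidden_layers:
        # Dense layer followed by ReLU.
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]
        # Inverted dropout: zero units at random and rescale the survivors.
        h = [0.0 if rng.random() < p_drop else v / (1.0 - p_drop) for v in h]
    return sum(w * v for w, v in zip(head, h))  # scalar output layer


def mc_dropout_predict(x, hidden_layers, head, n_passes=100, seed=0):
    """Summarize repeated stochastic passes as an empirical Gaussian."""
    rng = random.Random(seed)
    samples = [forward(x, hidden_layers, head, rng) for _ in range(n_passes)]
    return statistics.fmean(samples), statistics.stdev(samples), samples


init_rng = random.Random(1)
hidden_layers = [make_layer(8, 32, init_rng), make_layer(32, 16, init_rng)]
head = [init_rng.gauss(0.0, 0.25) for _ in range(16)]
xi = [0.1 * i for i in range(8)]  # stand-in for an aggregated payload embedding
mu, sigma, samples = mc_dropout_predict(xi, hidden_layers, head)
```

The pair `(mu, sigma)` parametrizes the empirical Gaussian for \(p_\theta(y|S)\), and the raw `samples` can be reused wherever draws \(y \sim p(y|S)\) are needed.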

Payload design

We employ the trained model \(p_\theta(y | S)\) to design new payloads \(S\) using two schemes. In the following, we assume \(y\) is a scalar value.

In the first constrained setting, we model the scenario where a researcher has a small and finite number of TF payloads \(S\) that they can evaluate in the next round of experiments. Here, we nominate payloads \(S^*\) to achieve a desired cell state \(y^*\) through exhaustive evaluation \(S^* = \arg \max_S p(y^* | S)\). In practice, we define \(y^*\) as a region of the cell state variable \(y > \tau\), or \(y^*_{y>\tau}\).
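As an illustration of exhaustive evaluation, the sketch below scores each candidate by \(p(y > \tau \mid S)\) under a Gaussian with the predicted mean and standard deviation, then keeps the highest scoring payloads. The prediction values, the threshold, and the function names are hypothetical.

```python
from math import erf, sqrt


def prob_above(mu, sigma, tau):
    """P(y > tau) for y ~ Normal(mu, sigma^2)."""
    return 0.5 * (1.0 - erf((tau - mu) / (sigma * sqrt(2.0))))


def nominate(predictions, tau, top_m=1):
    """Rank candidate payloads by P(y > tau | S) and keep the top_m."""
    ranked = sorted(
        predictions,
        key=lambda payload: prob_above(*predictions[payload], tau),
        reverse=True,
    )
    return ranked[:top_m]


# Hypothetical (mean, standard deviation) predictions for three payloads.
predictions = {
    ("A",): (0.2, 0.1),
    ("A", "B"): (0.9, 0.3),
    ("B", "C"): (0.6, 0.05),
}
best = nominate(predictions, tau=0.7)
```

Here `("A", "B")` is nominated because it is the only candidate whose predicted mean lies above the threshold.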

In the second unconstrained setting, we model the scenario where a researcher has an unconstrained set of TF payloads to test, or an intractably large finite set. Here, we nominate payloads \(S^*\) by sampling \(S \sim p(S|y)\) through a Markov Chain Monte Carlo (MCMC) approach using a Metropolis-Hastings procedure. We assume a simple uniform prior \(p(S)\) across the discrete set of TF payloads where \(| S | \leq 3\) (up to 3-TF combinations). This is a practical expedient as our datasets do not include any payloads with more than 3 TFs, but relaxing this assumption is trivial.

Algorithmically, we propose TF payloads by sampling \(p(S)\) and draw samples \(S \sim p(S|y^*)\) using the following Metropolis-Hastings acceptance rule:

\[ A(S \rightarrow S') = \frac{p(S' | y^*)}{p(S | y^*)} \]

where \(S\) is the last sampled payload and \(S'\) is a payload sampled at the current iteration. We estimate \(p(S|y^*) \propto p(y^* | S)\cdot p(S)\) by Bayes' rule given that we sample with a uniform prior \(p(S) \propto 1\). We use Monte Carlo dropout to sample \(y \sim p(y|S)\), allowing us to estimate \(p(y_{y>\tau}^* | S)\) empirically. This procedure is readily extensible to the case of arbitrarily large TF payloads, or the incorporation of synthetic TF sequences.
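The sampler can be sketched as follows, assuming proposals are drawn uniformly from all payloads with \(|S| \leq 3\) and that a candidate is accepted with probability \(\min(1, A)\), the usual Metropolis-Hastings convention. The likelihood function here is a hypothetical stand-in for the Monte Carlo dropout estimate of \(p(y^*|S)\), and all names are ours.

```python
import random
from math import comb


def propose(tfs, rng, max_size=3):
    """Draw a payload uniformly from all TF sets of size 1..max_size."""
    sizes = list(range(1, max_size + 1))
    # Weight each size by the number of payloads of that size.
    weights = [comb(len(tfs), k) for k in sizes]
    k = rng.choices(sizes, weights=weights)[0]
    return frozenset(rng.sample(tfs, k))


def metropolis_hastings(likelihood, tfs, n_steps=5000, seed=0):
    """Sample payloads with probability proportional to likelihood(S)."""
    rng = random.Random(seed)
    current = propose(tfs, rng)
    p_current = likelihood(current)
    samples = []
    for _ in range(n_steps):
        candidate = propose(tfs, rng)
        p_candidate = likelihood(candidate)
        # Accept with probability min(1, p(S'|y*) / p(S|y*)).
        if p_current == 0 or rng.random() < p_candidate / p_current:
            current, p_current = candidate, p_candidate
        samples.append(current)
    return samples


tfs = [f"TF{i}" for i in range(10)]  # hypothetical TF labels


def toy_likelihood(payload):
    """Hypothetical p(y*|S): payloads with both TF0 and TF1 are likely hits."""
    return 0.9 if {"TF0", "TF1"} <= payload else 0.05


samples = metropolis_hastings(toy_likelihood, tfs)
enriched = sum({"TF0", "TF1"} <= s for s in samples) / len(samples)
```

Only 9 of the 175 possible payloads in this toy space contain both TF0 and TF1, so `enriched` landing well above 9/175 indicates that the chain concentrates on high likelihood payloads.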

Baselines

We selected baseline methods based on recent benchmarking studies for cell perturbation prediction (4, 5). The Additive and Mean methods below were reported as the state-of-the-art across both studies.

Additive model: The current best performing baseline for combinatorial prediction is an additive model, simply denoted as \(f(S) = \sum_{i=1}^K y_{s_i}\) where \(y_{s_i}\) is the observed value of \(y\) for TF \(s_i\). In our probabilistic notation, this model can be expressed as \(p(y | S) = \mathbf{1}[\sum_{i=1}^K y_{s_i}]\) where \(\mathbf{1}[\cdot]\) is a Dirac delta distribution.

Mean model: A constant model that predicts the mean of the training data values for \(y\) is likewise reported as a strong baseline for predicting unseen perturbation effects, \(f_\mathrm{Mean}(S) = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} y_i\). In our probabilistic notation, this can be expressed as \(p(y | S) = \mathbf{1}[\mathbb{E}_{y \sim \mathcal{D}_\mathrm{train}}[y]]\).
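Both baselines fit in a few lines. The sketch assumes \(y\) is a scalar measured as a change relative to control, so that single-TF effects can be summed directly; the effect values and function names are hypothetical.

```python
def additive_predict(payload, single_effects):
    """Additive baseline: sum the observed single-TF effects in the payload."""
    return sum(single_effects[tf] for tf in payload)


def mean_predict(train_values):
    """Mean baseline: one constant prediction for every payload."""
    return sum(train_values) / len(train_values)


# Hypothetical single-TF effects and training values.
single_effects = {"A": 0.5, "B": -0.25, "C": 1.0}
y_additive = additive_predict(("A", "C"), single_effects)
y_mean = mean_predict([1.0, 2.0, 3.0])
```

Neither baseline has learned parameters, and the additive model can only be applied when every TF in the payload has been observed individually.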

Ablations

Ambrosia-Linear: We train a linear model to predict reprogramming effects from protein embeddings of TFs: \(f(S) = W \xi_S\) where \(W\) is a matrix of weights and \(\xi_S\) is a matrix of ESM2 embeddings. Again, in our probabilistic notation such a model can be framed as \(p(y|S) = \mathbf{1}[W\xi_S]\). This is an ablation of our Ambrosia model eliminating the non-linear logic and uncertainty estimation components.

Datasets

We trained in silico reprogramming models on two datasets.

K562: We first used a public combinatorial CRISPR inhibition screening dataset (“K562”) in K562 cells covering 236 genetic perturbations including 105 unique gene targets to offer an accessible comparison (9). To our knowledge, this is the most commonly used dataset for benchmarking perturbation prediction models (10, 11).

Proprietary dataset: We also used a proprietary dataset activating 6503 TF sets containing 580 unique TFs across 3.6M primary human T cells with single cell RNA-seq read-outs. Reprogramming payloads in this dataset were transiently activated, mimicking the “dose” of reprogramming that is typically achieved with an mRNA medicine. Data were generated in primary cells because they are a more reliable model of human biology than the immortalized cell lines that are common in public domain datasets.

This proprietary dataset is more than an order of magnitude larger than any other existing single cell perturbation dataset in primary cells, providing us an opportunity to measure the performance of perturbation prediction models in a much larger data regime (12). All TF sets in the dataset are tested across \(\geq 5\) unique human donors and represented in \(\geq 50\) cells. The dataset contains not only gene expression profiles induced by each payload, but also a functional measure of each payload's impact on T cell growth in culture.

**Payload prediction:** **(a)** Ambrosia models are able to predict Tscm scores for the proprietary dataset better than an additive baseline. **(b)** Ambrosia models provide superior predictions of combinatorial payload effects relative to ablated approaches (Ambrosia-Linear) and the top baseline method (Additive). Payloads containing the Yamanaka Factor combination OSK alongside other factors are shown as an example (****: \(p < 10^{-4}\); Mann Whitney U-test). **(c)** Ambrosia models provide higher fidelity predictions on the **cell state** task relative to the best baseline (Additive). Predictions are shown in a UMAP embedding where each point represents the predicted effect of one payload. Inset panels and color coded arrows highlight regions of perturbation space where Ambrosia models offer superior predictions.

We first measured the performance of Ambrosia and baseline methods on the payload prediction task using multiple distinct representations of cell state \( y \). In both datasets, we predicted a compressed 50-dimensional PCA representation of gene expression \( y_{E} \) and scalar “gene set scores” \( y_G \) that represent target cell states of interest. For the K562 dataset, we predicted a gene set related to cell growth (mTOR activity), and for the proprietary dataset, we predicted a stem central memory T cell score (Tscm) that has been associated with stronger T cell responses in cancer and infectious disease settings (13). Maximizing the Tscm score represents a task highly relevant to therapeutics development.

For the proprietary dataset, we also predicted a fitness score \( y_{F} \) that measures the ability of T cells to respond to stimulation and grow in culture. We constructed a rank-based metric for the gene set and function tasks by designating the top 25% of all scores as “hits” and measuring the area under the precision-recall curve (AUPRC) for each model on each score.
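The rank-based metric can be sketched as below. The text does not say how the precision-recall curve is integrated, so the sketch uses average precision, a common step-wise estimate of AUPRC. The scores and function names are hypothetical.

```python
def label_hits(scores, top_fraction=0.25):
    """Mark the top fraction of observed scores as hits."""
    n_hits = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[n_hits - 1]
    return [s >= cutoff for s in scores]


def average_precision(labels, predictions):
    """Step-wise estimate of the area under the precision-recall curve."""
    order = sorted(range(len(labels)), key=lambda i: predictions[i], reverse=True)
    hits_seen, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits_seen += 1
            total += hits_seen / rank  # precision at each recovered hit
    return total / hits_seen if hits_seen else 0.0


# Hypothetical observed scores for eight payloads.
observed = [0.1, 0.9, 0.3, 0.8, 0.2, 0.4, 0.5, 0.6]
labels = label_hits(observed)
ap_perfect = average_precision(labels, observed)
ap_reversed = average_precision(labels, [-s for s in observed])
```

A model that ranks payloads in the observed order recovers every hit first and scores 1.0, while a reversed ranking scores well below that.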

We found that Ambrosia and the ablated variants were the best performing methods in both datasets across all of the tasks reported here (Tables 1, 2, 3). As previously reported, the Additive baseline also demonstrated meaningful performance across tasks. Ambrosia models performed well across datasets and across gene expression and cell function tasks. Given that the function measurements were collected with an orthogonal measurement system, these results argue that the Ambrosia method is generally applicable. We found that Ambrosia models excelled at predicting large, non-additive effects in combinatorial payloads. For example, Ambrosia models provided significantly more accurate predictions for the effect of payloads containing three “Yamanaka Factors” (OCT4, SOX2, KLF4; OSK) than baseline methods.

**Table 1:** Performance comparison on the *cell state* task across datasets.
Model	K562 CSE [\( \downarrow \)]	K562 Cosine [\( \uparrow \)]	Proprietary CSE [\( \downarrow \)]	Proprietary Cosine [\( \uparrow \)]
Mean	0.86	0.60	1.23	0.43
Additive	0.42	0.86	0.93	0.68
Ambrosia-Linear	0.22	0.95	0.47	0.80
Ambrosia	0.21	0.92	0.37	0.79

**Table 2:** Performance comparison on the *gene set* task across datasets.
Model	K562 CSE [\( \downarrow \)]	K562 PCC [\( \uparrow \)]	K562 AUPRC [\( \uparrow \)]	Proprietary CSE [\( \downarrow \)]	Proprietary PCC [\( \uparrow \)]	Proprietary AUPRC [\( \uparrow \)]
Mean	0.44	0.00	0.27	0.84	0.00	0.27
Additive	0.31	0.80	0.87	0.86	0.78	0.82
Ambrosia-Linear	0.12	0.86	0.87	0.33	0.83	0.84
Ambrosia	0.11	0.85	0.88	0.20	0.90	0.89

**Table 3:** Performance comparison on the *function* prediction task in the proprietary dataset. The full Ambrosia model demonstrates superior performance to baselines and ablated models.
Model	CSE [\( \downarrow \)]	PCC [\( \uparrow \)]	AUPRC [\( \uparrow \)]
Mean	0.95	0.00	0.27
Additive	2.26	0.62	0.70
Ambrosia-Linear	0.56	0.71	0.73
Ambrosia	0.23	0.80	0.79

**Scaling laws:** Ambrosia's performance scales with the size of the training set. **(Left)** Test control-scaled error (CSE) decreases and **(Right)** area under the precision-recall curve (AUPRC) increases as the size of the training set increases.

Generative models have been shown to exhibit scaling laws in other data domains, including natural language and computer vision. As the amount of training data available grows, model performance tends to increase (14).

The scale of our proprietary dataset provides us one of the first opportunities to test if these laws are present in the single cell genomics perturbation prediction domain. To investigate, we trained Ambrosia models on data subsets \(D_p \subset \mathcal{D}\) where \(p \in [0, 1]\) is a proportion of the data used and measured performance on the payload prediction task. We constructed an initialization dataset \(D_I \subset \mathcal{D}\) containing all single TF payloads and joined it with each data subset \(D_p\).

We discovered that Ambrosia model performance improves as a function of data scale across multiple metrics. Performance follows a log-linear trend with high correlation (\(r > 0.8\)), mirroring behavior in other domains. We imagine that this behavior may have been overlooked in earlier studies due to the small scale of public datasets.
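A log-linear trend of this kind can be checked with an ordinary least-squares fit of the metric against the logarithm of the training set size. The numbers below are invented purely to exercise the code and are not results from this work; the function name is ours.

```python
from math import log10, sqrt


def fit_log_linear(sizes, metric):
    """Fit metric = slope * log10(size) + intercept and report Pearson r."""
    xs = [log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(metric) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in metric)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, metric))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical training set sizes and test errors (illustrative values only).
train_sizes = [100, 300, 1000, 3000, 6000]
test_error = [0.60, 0.52, 0.41, 0.33, 0.29]
slope, intercept, r = fit_log_linear(train_sizes, test_error)
```

A negative `slope` with `abs(r)` close to 1 corresponds to error falling roughly linearly in the logarithm of the training set size.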

We hypothesize that these trends will extrapolate to larger data scales for TF payloads, and likewise emerge for other types of genetic perturbations in single cell genomics data.

**Payload design:** Active learning campaigns using Ambrosia models relative to a random baseline.
**(Left)** Ambrosia models accelerate the discovery of payloads \(S^*\) that achieve a target cell state \(y^*\). All Ambrosia design strategies are superior to a random baseline. **(Right)** Ambrosia methods had significantly higher area under the curve (AUC) than the baseline (*: \(p < 0.05\), **: \(p < 0.01\); Mann Whitney U-test).

In silico reprogramming methods have the potential to accelerate payload discoveries through a lab-in-the-loop workflow. In this setting, a model is trained on a set of data \( D_t \), then used to prioritize the payloads to test in the next experimental round \( D_{t+1} \). At each iteration \( t \), the number of “hits” or desirable payloads discovered is used as a metric of success. The model \( p_\theta(y|S) \) is retrained after each round of new data is collected. This scenario is analogous to active learning or Bayesian Optimization. If successful, a lab-in-the-loop workflow will improve upon the discovery rate of a random baseline.

We deployed Ambrosia in an active learning campaign across the proprietary dataset to optimize a therapeutically relevant T stem central memory (Tscm) cell state \( y_{y > \tau}^* \). We constructed this task to represent a realistic experimental setting where the researcher must design a pool of individual TFs \( \Phi \) to test in each experimental iteration \( t \). We assume that the experimental system allows the researcher to then test all \( k \)-TF combinations containing \( s \in \Phi \). This reflects the most common experimental methods in the field where payloads are constructed using either pooled molecular cloning or pooled delivery (3, 9).

We initialized models with a dataset \( D_0 \subset \mathcal{D} \) containing all single TF perturbations and 10% of multi-TF payloads. At each iteration \( t \in [1, 5] \), we constructed a pool for the next experimental round \( \Phi_t \) by designing payloads with Ambrosia models. Ambrosia was used to estimate the top payloads that remain to be tested \( S \in \mathcal{D} \setminus D_t \) to maximize the target state \( y^* \). We then assembled the pool \( \Phi_t \) by greedily adding unique TFs within the top ranked combinations until \( \left| \Phi_t \right| = 70 \). We constructed a set of payloads \( D_{\Phi_t} \) composed of TFs in the pool, then built the training dataset for the next round as \( D_{t+1} = D_t \cup D_{\Phi_t} \). Intuitively, we add all payloads that contain only TFs in the chosen pool to the dataset for the next round. We then measured the cumulative fraction of hits in the dataset recovered by iteration \( t \). The sampling procedure for Ambrosia to estimate top payloads \( S \) was varied across two settings.
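The pool assembly step can be sketched as follows. The ranked payload list is hypothetical and the pool size is reduced from 70 to 4 for readability; the function names are ours.

```python
def build_pool(ranked_payloads, pool_size=70):
    """Greedily collect unique TFs from top ranked payloads until the pool is full."""
    pool, seen = [], set()
    for payload in ranked_payloads:
        for tf in payload:
            if tf not in seen:
                seen.add(tf)
                pool.append(tf)
                if len(pool) == pool_size:
                    return pool
    return pool


def payloads_within_pool(candidates, pool):
    """Keep only payloads whose TFs are all members of the pool."""
    allowed = set(pool)
    return [p for p in candidates if set(p) <= allowed]


# Hypothetical payloads, ordered from best to worst by the model.
ranked = [("A", "B"), ("B", "C"), ("D", "E"), ("A", "F")]
pool = build_pool(ranked, pool_size=4)
next_round = payloads_within_pool(ranked, pool)
```

In this toy case the pool fills after TF D is added, so the payload `("D", "E")` is not tested in the next round because E is outside the pool.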

We first evaluated performance in the constrained setting where we designed payloads in each cycle through exhaustive likelihood estimation across a small, finite set of possible payloads where we have ground truth data. This best represents a scenario where researchers have a limited hypothesis space of payloads to test due to experimental constraints. We designed payloads using two different acquisition strategies \( a(S) \) to rank payload candidates: (1) the maximum predicted effect (MPE; \( a_\mathrm{MPE}(S) = \mu_y(S) \)) or (2) the upper confidence bound (UCB; \( a_\mathrm{UCB}(S) = \mu_y(S) + \sigma_y(S) \)).
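The two acquisition strategies differ only in whether predictive uncertainty is rewarded. In the sketch below, the predictions are hypothetical and the `beta` weight is our addition, fixed at 1 to match the UCB definition above.

```python
def mpe(mu, sigma):
    """Maximum predicted effect: rank by the predictive mean alone."""
    return mu


def ucb(mu, sigma, beta=1.0):
    """Upper confidence bound: predictive mean plus beta standard deviations."""
    return mu + beta * sigma


# Hypothetical (mean, standard deviation) predictions for two payloads.
predictions = {"S1": (0.80, 0.05), "S2": (0.70, 0.30)}
top_mpe = max(predictions, key=lambda s: mpe(*predictions[s]))
top_ucb = max(predictions, key=lambda s: ucb(*predictions[s]))
```

MPE prefers `S1` for its higher mean, while UCB prefers `S2` because its larger uncertainty leaves more room for a strong effect.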

We found that Ambrosia models enabled active learning in the constrained setting with performance superior to a random baseline. The UCB acquisition function performed modestly better than the MPE function. In future work, we hope to explore if learning the conditional distribution \( p(y|S) \) may improve payload design performance over a simple point estimate \( \mathbb{E}[y|S] \).

We next performed active learning in the unconstrained setting where we design payloads by sampling the posterior \( S \sim p_\theta(S|y) \) with an MCMC approach. This represents the scenario where researchers have an infinite or intractably large hypothesis space, as is the case for synthetic TF design or searching payloads that contain many unique TFs. For this setting, we defined our target cell state as the top \(10\%\) of the Tscm score distribution up to that iteration (\( \tau = Q_{90\%}(y_t) \)). We restricted our MCMC procedure to only 10,000 samples to model the realistic scenario where exhaustively computing \( p(y|S) \) estimates across the entire search space is intractable. There are \( >10^6 \) payloads possible in our experimental setup (see the Payload design methods), so this represents sampling \( <1\% \) of the possible payloads.
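The sampled fraction can be checked with a short count, assuming the design space consists of all payloads of up to 3 TFs drawn from the 580 unique TFs in the proprietary dataset.

```python
from math import comb

N_TFS = 580        # unique TFs in the proprietary dataset
MAX_SIZE = 3       # largest payload size considered
N_SAMPLES = 10_000 # MCMC samples per design round

# Number of unordered payloads containing 1, 2 or 3 distinct TFs.
n_design_space = sum(comb(N_TFS, k) for k in range(1, MAX_SIZE + 1))
sampled_fraction = N_SAMPLES / n_design_space

print(f"{n_design_space:,} payloads; sampled fraction {sampled_fraction:.4%}")
```

Under this assumption the design space holds about 32.5 million payloads, so 10,000 samples cover a small fraction of one percent.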

We used a Metropolis-Hastings MCMC procedure to sample from the posterior \( p_\theta(S|y) \) and found that Ambrosia models were likewise sufficient to accelerate the discovery of hit payloads in this setting. Performance was in fact comparable to MPE ranking in the constrained setting, suggesting that our MCMC procedure is quite efficient. It's difficult to assess the quality of all samples generated by our model, as we only have ground truth data for a small fraction of the payload space. These results nonetheless suggest that our generative procedure is sufficient to accelerate biological discoveries in a lab-in-the-loop setting and many generated designs are high quality.

Here, we introduce a modeling approach (Ambrosia) for in silico reprogramming, a special case of the more general perturbation prediction problem in single cell genomics.

We demonstrate that Ambrosia models produce performant predictions of reprogramming effects on cell state and function by transfer learning from protein language models, with results superior to leading baseline methods.
Leveraging a unique single cell reprogramming dataset with a much larger scale than prior reports, we discovered that in silico reprogramming exhibits a data scaling law, similar to other emerging biological domains such as nucleic acid and protein sequence modeling.

We believe this phenomenon is likely to emerge in other cases of the perturbation prediction problem in single cell genomics as well, but has likely been difficult to observe due to the small scale of public datasets. Our results suggest that larger scale single cell perturbation datasets and transfer learning from molecular foundation models will unlock meaningful performance in perturbation prediction ("virtual cell") models (15).

The ultimate goal of building in silico reprogramming models is to design payloads that induce target cell states. We found that Ambrosia models were able to improve the rate of designing hit payloads in multiple lab-in-the-loop settings. Reprogramming payload design space is too large for exhaustive in silico ranking procedures to be used absent some a priori constraint on the space (e.g. number of unique TFs, payload size). As experimental methods improve, these constraints cease to be a laboratory requirement, motivating in silico design approaches that can address the full extent of payload opportunities. Through a generative MCMC sampling procedure, Ambrosia models accelerated payload design in this emerging unconstrained setting as well. This generative approach opens the door to the design of reprogramming payloads within intractably large spaces, and even the design of entirely synthetic TFs.

In this work, we have demonstrated only a single possible implementation of a more general approach: transfer learning from molecular foundation models to design reprogramming payloads. In the future, we hope to explore models that incorporate a diversity of molecular representations learned in foundation models across the central dogma (DNA, RNA, protein). Likewise, we plan to extend the underlying Ambrosia architecture to employ inductive biases like attention operations to build more effective models. While we have constrained our work here to designing payloads composed of pre-defined, natural TFs, our modeling approach generalizes in principle to future synthetic TF design tasks. Our results to date support the conclusion that deploying in silico reprogramming models has the potential to accelerate payload discovery, unlocking the ability to rationally engineer cell states.

In silico design of epigenetic reprogramming payloads
