Introduction

Throughout our research, we will need to address questions about the sample size required to make statistically significant observations about the Ethereum validator set. These questions are essential to all of our proposed methods and solutions. Namely:

  1. Given a validator set of size $N$, what sample size $n$ do we need to achieve a reasonably accurate (in a sense yet to be specified) picture of the validator set’s client diversity?
  2. For each of the proposed solutions, how long does it take to achieve this?

We answer the first question below, which will provide the required insights to answer the second question for each proposed solution.

Formal definition and modeling of the problem

Let $c_1, c_2, \dots, c_m$ represent the different Ethereum client implementations (we analyze the consensus and execution cases separately). For each client $c_i$, there is a proportion $0 \leq p_i \leq 1$ of the entire validator set $\mathcal{V}$, with $|\mathcal{V}| = N$, running the client $c_i$.

<aside> ℹ️ Remark: due to architectures such as multiplexers and DVT, it is not true that $\sum_i p_i = 1$. A validator running several clients contributes to several of the $p_i$, so instead we have $\sum_i p_i > 1$. This does not affect the analysis, as will be seen below.

</aside>

Goal: Determine a sample size $n\ll N$ that can be used to estimate the true proportions $p_i$ with statistical significance.

The general procedure to follow is outlined below:

  1. Given a randomly chosen sample $\mathcal{S} \subset \mathcal{V}$ with $|\mathcal{S}| = n$, query each validator for their client usage, i.e., gather the bit m-tuples $q_j = (x_{1j}, x_{2j}, \dots, x_{mj})$, where $x_{ij} \in \{0,1\}$ denotes whether client $c_i$ is used by validator $j$, for $1\leq i \leq m$, and $1\leq j \leq n$.
  2. Define the estimators $\hat{p}_i = \sum_{j=1}^{n} x_{ij}/n$.
  3. Observe that, since $\mathcal{S}$ is chosen uniformly at random, the values $x_{ij}$ (for fixed $i$ and varying $j$) correspond to independent observations, so each of the counts $n\hat{p}_i = \sum_{j=1}^{n} x_{ij}$ follows a binomial distribution. (Strictly speaking, sampling without replacement yields a hypergeometric distribution, but for $n \ll N$ the binomial is an excellent approximation.)
  4. Using the normal approximation to the binomial distribution, one can relate the required sample size $n_i$ to a margin of error $E_i$ for the estimated proportion, assuming a given confidence level (95% is a standard choice).

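Steps 1–3 above can be sketched as follows. This is a minimal simulation, not an actual query mechanism: the validator set, the three clients, and the true proportions are all hypothetical, and validators are encoded as bit m-tuples as in step 1.

```python
import random

def estimate_proportions(validators, n):
    """Estimate client-usage proportions from a uniform random sample.

    validators: list of m-tuples of 0/1 flags (one tuple per validator).
    Returns the estimators p_hat_i = sum_j x_ij / n, for 1 <= i <= m.
    """
    sample = random.sample(validators, n)  # sampling without replacement
    m = len(sample[0])
    return [sum(q[i] for q in sample) / n for i in range(m)]

# Hypothetical validator set: 3 clients with true proportions ~ (0.60, 0.30, 0.15).
# The proportions need not sum to 1, since a validator may run several clients.
random.seed(0)
true_p = (0.60, 0.30, 0.15)
validators = [tuple(int(random.random() < p) for p in true_p)
              for _ in range(100_000)]

p_hat = estimate_proportions(validators, n=1_500)
```

Note that each estimator is computed from the same sample; the per-client analysis below treats each coordinate separately.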
<aside> ℹ️ Note: the binomial distribution model is commonplace for statistical estimations of proportions, and is regularly used in various types of experiments, including surveys and clinical trials. (See, e.g., Fleiss, J. L., Levin, B., & Paik, M. C. [2013]. Statistical Methods for Rates and Proportions.)

The only additional assumption made here is a steady-state one: we assume that variability in the client diversity distribution is negligible over the period where the samples are collected. This requires the period to be reasonably short (e.g. days as opposed to months)—an assumption that we will check for consistency in each of our methods.

</aside>

Notation

For the approximations below, we define the following variables: