TL;DR

In this note, we survey known approaches for clustering validators by operator, with the goal of encouraging the study of these techniques as a means to recognize Sybils and white labels. Our search surfaced two styles of approach: the first uses performance data to build machine-learning classifiers on features such as attestation effectiveness; the second analyzes networking data from Ethereum’s consensus layer to associate public validator keys with specific beacon node IDs (or even their IP addresses). Regarding the latter, we will also see that attempts to prevent this deanonymization of validators have so far been unsuccessful, since they introduce untenable levels of latency and performance loss. We close with some remarks on interesting directions for future research.

Introduction

The ability to draw inferences from data to identify clusters of validators handled by the same operator will be an important element of fighting Sybils and white labels in the Lido permissionless validator set. As it turns out, drawing these inferences is not a new problem. For example, Rated Labs and their Network Explorer require this clustering in order to assign aggregated performance scores to the entities running the validators.

As a starting point, their documentation describes their approach. First, validator indices are aggregated by the deposit address that initiated each validator’s deposit. These deposit addresses are then further aggregated into operators:

Both on the front-end and the API layer of Rated, we are further grouping deposit addresses up to the entities that they are associated with. At the moment, there is no standard way via which we collate this information; it is a combination of Etherscan research, Ethereum transaction log queries, block graffiti and in some cases, operators coming straight to us willing to self-disclose.
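As a rough illustration, the following is a minimal sketch of this two-level aggregation, assuming deposit events have already been extracted from the deposit contract logs. The DataFrame contents and the `ENTITY_LABELS` mapping are hypothetical stand-ins for the kind of curated research Rated describes (Etherscan lookups, graffiti analysis, self-disclosure).

```python
# Minimal sketch of the two-level aggregation: validator keys grouped by
# deposit address, then deposit addresses collapsed into entities via a
# hand-curated mapping. All data here is a hypothetical placeholder.
import pandas as pd

deposits = pd.DataFrame(
    {
        "validator_pubkey": ["0xaaa...", "0xbbb...", "0xccc..."],
        "deposit_address": ["0x1111", "0x1111", "0x2222"],
    }
)

# Level 1: group validator keys by the deposit address that funded them.
by_deposit_address = deposits.groupby("deposit_address")["validator_pubkey"].apply(list)

# Level 2: collapse deposit addresses into named entities via a curated map
# (hypothetical; stands in for Etherscan research, graffiti, self-disclosure).
ENTITY_LABELS = {"0x1111": "Operator A", "0x2222": "Operator B"}
deposits["entity"] = deposits["deposit_address"].map(ENTITY_LABELS).fillna("unknown")
by_entity = deposits.groupby("entity")["validator_pubkey"].apply(list)

print(by_entity)
```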

In principle, there are challenges with using deposit addresses or even withdrawal credentials directly when attempting to cluster Lido permissionless validators: for these validators, both will point to Lido smart contracts, and therefore carry no operator-specific signal.

💡 Note: There are a few on-chain data sources that could be used to substitute the withdrawal credential in the clustering above.
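To make the problem concrete, here is a minimal sketch that reads withdrawal credentials through the standard Beacon API; the beacon node URL and validator indices are placeholders. For Lido validators, the queried credentials would all resolve to the same Lido contract, yielding no clustering signal.

```python
# Sketch: fetch withdrawal credentials for a few validators via the
# standard Beacon API. Assumes a locally reachable beacon node; the
# endpoint URL and validator indices below are placeholders.
import requests

BEACON_API = "http://localhost:5052"  # placeholder beacon node endpoint
validator_indices = [100000, 100001]  # placeholder validator indices

for index in validator_indices:
    resp = requests.get(
        f"{BEACON_API}/eth/v1/beacon/states/head/validators/{index}"
    )
    creds = resp.json()["data"]["validator"]["withdrawal_credentials"]
    # 0x01 credentials embed an execution-layer address in the last 20
    # bytes; for Lido validators this is the same contract every time.
    print(index, creds)
```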

Problem statement

Can we leverage other sources of data that allow us to cluster validators and determine whether they are likely to be run by the same entity?

Note that our research does not aim to prescribe a specific approach to these clustering techniques; rather, it aims to provide free-market incentives for external parties to create them. Still, we would like to survey the current literature and ecosystem insights on how such a task could be approached.

For our analysis, we will roughly identify the following two sources of data to generate these insights:

  1. performance data from validators’ duties on the beacon chain (e.g., attestation effectiveness), and
  2. networking data from Ethereum’s consensus layer.

We can mention a third source of data for profiling validators, namely financial data: we can identify correlations between the sources of bonds for different validators in order to trace them back to the same operator. Since this financial data can be analyzed via more standard methods (such as blockchain forensics and asset tracking), we will not dive deeply into it here.
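As a hedged sketch of this idea, the snippet below groups bond addresses that share funding sources by taking connected components over a transfer graph; the addresses and transfers are synthetic placeholders for data that standard forensics tooling would extract.

```python
# Sketch: cluster bond addresses by shared funding sources using
# connected components over a (source, destination) transfer graph.
# All addresses below are synthetic placeholders.
import networkx as nx

bond_transfers = [
    ("0xfund1", "0xbondA"),
    ("0xfund1", "0xbondB"),  # same source funds two bonds -> likely one operator
    ("0xfund2", "0xbondC"),
]

graph = nx.Graph()
graph.add_edges_from(bond_transfers)

# Each connected component groups addresses linked by funding flows.
for component in nx.connected_components(graph):
    print(sorted(component))
```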

Distinguishing between these data sources is important because the approaches to extracting insights from them differ substantially. For example, the literature provides a variety of examples of how to detect Sybils in social networks. Approaches in this context are facilitated by the interactions between the different entities, which allow us to build a social graph and analyze the relations between actors. Performance data is not amenable to such insights, but networking data might be.
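For flavor, here is an illustrative sketch of the graph-based style used in that literature, assuming such an interaction graph could be derived (for instance, from observed peer connections). The edge list is synthetic, and community detection merely stands in for the intuition behind schemes like SybilGuard, where Sybil clusters attach to the honest region through few edges.

```python
# Illustrative sketch: a synthetic interaction graph with two dense
# clusters joined by a single "attack edge". Community detection is a
# simple stand-in for social-graph Sybil-detection intuitions.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [
    ("honest1", "honest2"), ("honest2", "honest3"), ("honest3", "honest1"),
    ("sybil1", "sybil2"), ("sybil2", "sybil3"), ("sybil3", "sybil1"),
    ("honest1", "sybil1"),  # the lone edge bridging the two regions
]

graph = nx.Graph(edges)
for community in greedy_modularity_communities(graph):
    print(sorted(community))
```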

Features from performance data

To our knowledge, the state of the art in using performance data to cluster validators is described in Rated’s article “Solo Stakers: The Backbone of Ethereum”. There, the Rated team describes a set of features used to train a classifier of validators, with the purpose of identifying solo stakers on the beacon chain (a sketch of a classifier along these lines follows the list). Namely, they mention:

  1. performance metrics like effectiveness, inclusion delay, missed attestations, and more
  2. consensus client (using blockprint)
  3. number of validators associated with a deposit address and withdrawal credentials
  4. total incoming/outgoing gas for the deposit address
  5. validator index (not used in the final model, but a useful feature when building inferred features due to the way some operators do bulk validator creation).
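The following is a minimal sketch, not Rated’s actual pipeline, of how a classifier over such features might be assembled; the feature columns mirror the list above, and both the training data and labels are synthetic placeholders.

```python
# Sketch: train a classifier on performance-derived features. This is not
# Rated's actual model; the data and labels below are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=0)
n = 200

# Columns mirror the feature list: effectiveness, inclusion delay,
# missed-attestation rate, validators per deposit address, gas totals.
X = np.column_stack([
    rng.uniform(0.8, 1.0, n),   # attestation effectiveness
    rng.uniform(1.0, 3.0, n),   # average inclusion delay
    rng.uniform(0.0, 0.1, n),   # missed attestation rate
    rng.integers(1, 500, n),    # validators per deposit address
    rng.uniform(0, 1e9, n),     # total gas in/out of deposit address
])
# Placeholder labels: 1 = solo staker, 0 = professional operator.
y = rng.integers(0, 2, n)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X[:5]))
```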