FedPS: Federated data Preprocessing via aggregated Statistics
TL;DR: A unified framework for tabular data preprocessing in federated learning.
Why Federated Preprocessing?
Data preprocessing is a crucial step in machine learning pipelines, transforming raw data into a suitable format for model training. However, in federated learning, this step is often overlooked.
We introduce FedPS, a federated data preprocessing framework that uses aggregated summary statistics from clients to perform core preprocessing tasks, including scaling, encoding, transformation, discretization, and missing-value imputation, while keeping raw data decentralized.
Preprocessing Tasks
- Scaling: Normalize features to comparable ranges.
- Encoding: Convert categorical values into numerical form.
- Transformation: Apply non-linear mappings (distribution adjustments).
- Discretization: Convert continuous values into discrete form.
- Imputation: Fill in missing values (univariate or multivariate).
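In the centralized setting, these tasks map onto familiar Scikit-learn preprocessors, which FedPS mirrors in the federated setting. As a brief reminder of that baseline API (the toy arrays below are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, OrdinalEncoder,
                                   PowerTransformer, KBinsDiscretizer)
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 600.0]])
X_cat = np.array([["red"], ["blue"], ["red"]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X_num)   # imputation
X_scaled  = StandardScaler().fit_transform(X_imputed)             # scaling
X_encoded = OrdinalEncoder().fit_transform(X_cat)                 # encoding
X_power   = PowerTransformer().fit_transform(X_imputed)           # transformation
X_binned  = KBinsDiscretizer(n_bins=3, encode="ordinal",
                             strategy="uniform").fit_transform(X_imputed)  # discretization
```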
How Does FedPS Work?
The FedPS framework operates in five steps (Figure 1):
- 1 Compute local statistics;
- 2 Share and aggregate statistics;
- 3 Derive preprocessing parameters;
- 4 Broadcast parameters to clients;
- 5 Apply preprocessing locally.
For example, to implement StandardScaler, which ensures features have zero mean and unit variance (a minimal code sketch follows the steps):
- 1 Each client computes local statistics (n, c, s), where n is the number of samples, c=\sum_i x_i, and s=\sum_i x_i^2.
- 2 Server aggregates these statistics by summation to obtain (N, C, S).
- 3 Server computes the global mean and variance: \mu=C/N, \sigma^2=S/N-\mu^2.
- 4 The server broadcasts (\mu, \sigma) to all clients.
- 5 Clients apply feature scaling on local data: x'=(x-\mu)/\sigma.
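A minimal sketch of these five steps in plain NumPy; the function names and the two simulated clients are illustrative rather than part of a FedPS API, and a real deployment would move the statistics over the federation's communication channel instead of a Python list:

```python
import numpy as np

# --- Step 1: each client computes local statistics (n, c, s) ---
def local_stats(X):
    """X: (n_samples, n_features) local data matrix."""
    n = X.shape[0]
    c = X.sum(axis=0)          # per-feature sum
    s = (X ** 2).sum(axis=0)   # per-feature sum of squares
    return n, c, s

# --- Steps 2-3: server aggregates by summation and derives (mu, sigma) ---
def aggregate(client_stats):
    N = sum(n for n, _, _ in client_stats)
    C = sum(c for _, c, _ in client_stats)
    S = sum(s for _, _, s in client_stats)
    mu = C / N
    var = S / N - mu ** 2
    return mu, np.sqrt(var)

# --- Steps 4-5: clients apply the broadcast parameters locally ---
def transform(X, mu, sigma):
    return (X - mu) / sigma

# Toy usage with two simulated clients
rng = np.random.default_rng(0)
clients = [rng.normal(5, 2, size=(100, 3)), rng.normal(1, 3, size=(80, 3))]
mu, sigma = aggregate([local_stats(X) for X in clients])
scaled = [transform(X, mu, sigma) for X in clients]
```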
What Statistics Are Needed?
We examine the summary statistics required by common preprocessors in Scikit-learn. Table 1 highlights representative preprocessors and their formulations. The complete table and communication analysis appear in the paper.
| Categories | Preprocessors | Formulation | Statistics |
|---|---|---|---|
| Scaling | MinMaxScaler | (x-x_{\min})/(x_{\max}-x_{\min}) | Min, Max |
| | StandardScaler | (x-\mu)/\sigma | Mean, Variance |
| | RobustScaler | (x-Q_2)/(Q_3-Q_1) | Quantile |
| Encoding | LabelEncoder | ordinal(y) | Set Union |
| | OneHotEncoder | one-hot(x) | Set Union, Frequent items |
| | OrdinalEncoder | ordinal(x) | Set Union, Frequent items |
| Transformation | PowerTransformer | \psi(\lambda,x) | Sum, Mean, Variance |
| | QuantileTransformer | CDF(x), \Phi^{-1}(CDF(x)) | Quantile |
| Discretization | KBinsDiscretizer | j if T_j\le x<T_{j+1} | Min, Max, Quantile, Mean |
| Imputation | SimpleImputer | mean(x), median(x), freq(x) | Mean, Quantile, Frequent items |
| | KNNImputer | mean(k-NN of x) | Min, Mean, Sum |
| | IterativeImputer | RegressionModel(x) | Sum |
Basic statistics (Min, Max, Mean, Variance) are inexpensive to compute in a federated setting. In contrast, statistics such as quantiles and frequent items require substantial communication if computed exactly. For these, we use data sketching techniques (quantile sketches, frequent-items sketches) to obtain accurate approximations at low communication cost.
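To illustrate the idea, the following simplified sketch approximates a global quantile by merging per-client histograms over an agreed value range. A real deployment would use a proper mergeable quantile sketch (e.g., KLL) for tighter guarantees; the range, bin count, and function names here are assumptions made for the example:

```python
import numpy as np

BINS = 256
LO, HI = 0.0, 100.0          # assumed globally known feature range

def local_histogram(x):
    """Client side: summarize local values as an equi-width histogram."""
    counts, _ = np.histogram(x, bins=BINS, range=(LO, HI))
    return counts

def merged_quantile(histograms, q):
    """Server side: merge histograms by elementwise sum and read off quantile q."""
    total = np.sum(histograms, axis=0)
    cdf = np.cumsum(total) / total.sum()
    edges = np.linspace(LO, HI, BINS + 1)
    idx = np.searchsorted(cdf, q)        # first bin whose cumulative mass reaches q
    return edges[idx + 1]                # right edge of that bin

rng = np.random.default_rng(1)
clients = [rng.uniform(0, 100, size=500) for _ in range(4)]
hists = [local_histogram(x) for x in clients]
approx_median = merged_quantile(hists, 0.5)   # roughly 50 for uniform data
```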
Some preprocessors rely on machine learning models, for example:
- KBinsDiscretizer uses k-means clustering.
- KNNImputer uses k-nearest neighbors.
- IterativeImputer uses Bayesian linear regression.
We extend these algorithms to the federated setting, with sufficient statistics summarized in Table 1.
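As one example, a federated k-means update (as used for KBinsDiscretizer-style binning) only needs per-cluster sums and counts from each client. The sketch below is illustrative rather than the exact FedPS implementation, and the function names are assumptions:

```python
import numpy as np

def local_kmeans_stats(x, centers):
    """Client side: assign local 1-D values to the nearest center and
    return per-cluster (sum, count) -- the sufficient statistics."""
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    k = len(centers)
    sums = np.bincount(assign, weights=x, minlength=k)
    counts = np.bincount(assign, minlength=k)
    return sums, counts

def global_kmeans_update(stats, centers):
    """Server side: merge client statistics and recompute the centers."""
    sums = sum(s for s, _ in stats)
    counts = sum(c for _, c in stats)
    nonempty = counts > 0
    new_centers = centers.copy()
    new_centers[nonempty] = sums[nonempty] / counts[nonempty]
    return new_centers

# Toy run: 3 bins, three clients with different local distributions
rng = np.random.default_rng(2)
clients = [rng.normal(m, 1, size=200) for m in (0.0, 5.0, 10.0)]
centers = np.array([1.0, 4.0, 8.0])
for _ in range(10):  # a few federated rounds
    stats = [local_kmeans_stats(x, centers) for x in clients]
    centers = global_kmeans_update(stats, centers)
```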
Why Not Other Approaches?
Several preprocessing approaches are possible in federated learning:
- Centralized preprocessing: upload all data to the server.
- No preprocessing: skip preprocessing entirely.
- Transfer preprocessing: use public data or pre-trained models.
- Local preprocessing: each client processes data independently.
- Federated preprocessing: our proposed FedPS framework.
All alternatives except federated preprocessing have major limitations (see Table 2 in the paper for a side-by-side comparison).
Experiments
We evaluate StandardScaler and OrdinalEncoder under three preprocessing strategies (federated, local, and none) on the Cover dataset, training an MLP model with the FedAvg algorithm. Test accuracy results are shown in Figure 2.
We consider two types of client data distributions:
- IID setting: data is uniformly partitioned across clients.
- Non-IID setting: data is partitioned with skewed label distributions.
The results highlight three key observations:
- Preprocessing is essential, i.e., there is a large accuracy gap versus no preprocessing.
- Local preprocessing fails in non-IID settings due to inconsistent transforms.
- Federated preprocessing achieves stable and superior performance in all scenarios.
Citation
@misc{Xu2026fedps,
title={FedPS: Federated data Preprocessing via aggregated Statistics},
author={Xuefeng Xu and Graham Cormode},
year={2026},
eprint={2602.10870},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.10870},
}