FedPS: Federated data Preprocessing
via aggregated Statistics

Xuefeng Xu^1 and Graham Cormode^2
^1University of Warwick, ^2University of Oxford
Preprint 2026

TL;DR: A unified framework for tabular data preprocessing in federated learning.

Why Federated Preprocessing?

Data preprocessing is a crucial step in machine learning pipelines, transforming raw data into a suitable format for model training. However, in federated learning, this step is often overlooked.

We introduce FedPS, a federated data preprocessing framework that uses aggregated summary statistics from clients to perform core preprocessing tasks, including scaling, encoding, transformation, discretization, and missing-value imputation, while keeping raw data decentralized.

Preprocessing Tasks

  • Scaling: Normalize features to comparable ranges.
  • Encoding: Convert categorical values into numerical form.
  • Transformation: Apply non-linear mappings (distribution adjustments).
  • Discretization: Convert continuous values into discrete form.
  • Imputation: Fill in missing values (univariate or multivariate).

How Does FedPS Work?

The FedPS framework operates in five steps (Figure 1):

  1. Compute local statistics;
  2. Share and aggregate statistics;
  3. Derive preprocessing parameters;
  4. Broadcast parameters to clients;
  5. Apply preprocessing locally.

Figure 1: Overview of FedPS preprocessing framework.

For example, to implement StandardScaler, which ensures features have zero mean and unit variance:

  1. Each client computes local statistics (n, c, s), where n is the number of samples, c=\sum_i x_i, and s=\sum_i x_i^2.
  2. The server aggregates these statistics by summation to obtain (N, C, S).
  3. The server computes the global mean and variance: \mu=C/N, \sigma^2=S/N-\mu^2.
  4. The server broadcasts (\mu, \sigma) to all clients.
  5. Each client applies feature scaling to its local data: x'=(x-\mu)/\sigma.
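
The sketch below mirrors these five steps for StandardScaler in plain NumPy. The function names and the in-memory exchange of statistics are illustrative stand-ins for FedPS's actual client/server API.

import numpy as np

def local_stats(x):
    """Step 1 (client): summary statistics for one feature column."""
    x = np.asarray(x, dtype=float)
    return len(x), x.sum(), (x ** 2).sum()            # (n, c, s)

def aggregate(stats):
    """Step 2 (server): element-wise summation of client statistics."""
    n, c, s = map(sum, zip(*stats))
    return n, c, s                                     # (N, C, S)

def derive_params(N, C, S):
    """Step 3 (server): global mean and standard deviation."""
    mu = C / N
    sigma = np.sqrt(S / N - mu ** 2)
    return mu, sigma

def apply_scaling(x, mu, sigma):
    """Step 5 (client): standardize local data with the global parameters."""
    return (np.asarray(x, dtype=float) - mu) / sigma

# Example with two clients holding different slices of one feature:
clients = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])]
mu, sigma = derive_params(*aggregate([local_stats(x) for x in clients]))  # steps 1-3
scaled = [apply_scaling(x, mu, sigma) for x in clients]                   # steps 4-5

Every client ends up using the same (\mu, \sigma), so features are scaled consistently across the federation.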

What Statistics Are Needed?

We examine the summary statistics required by common preprocessors in Scikit-learn. Table 1 highlights representative preprocessors and their formulations. The complete table and communication analysis appear in the paper.

Table 1: Preprocessors and associated statistics.

Category       | Preprocessor        | Formulation                      | Statistics
Scaling        | MinMaxScaler        | (x-x_{\min})/(x_{\max}-x_{\min}) | Min, Max
Scaling        | StandardScaler      | (x-\mu)/\sigma                   | Mean, Variance
Scaling        | RobustScaler        | (x-Q_2)/(Q_3-Q_1)                | Quantile
Encoding       | LabelEncoder        | ordinal(y)                       | Set union
Encoding       | OneHotEncoder       | one-hot(x)                       | Set union, Frequent items
Encoding       | OrdinalEncoder      | ordinal(x)                       | Set union, Frequent items
Transformation | PowerTransformer    | \psi(\lambda,x)                  | Sum, Mean, Variance
Transformation | QuantileTransformer | CDF(x), \Phi^{-1}(CDF(x))        | Quantile
Discretization | KBinsDiscretizer    | j if T_j\le x<T_{j+1}            | Min, Max, Quantile, Mean
Imputation     | SimpleImputer       | mean(x), median(x), freq(x)      | Mean, Quantile, Frequent items
Imputation     | KNNImputer          | mean(k-NN of x)                  | Min, Mean, Sum
Imputation     | IterativeImputer    | RegressionModel(x)               | Sum

Basic statistics such as Min, Max, Mean, and Variance are inexpensive to compute in a federated setting. In contrast, statistics like quantiles and frequent items require substantial communication if computed exactly. For these, we use data sketching techniques (quantile sketches and frequent-items sketches), which are compact and mergeable, to approximate them.
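
To illustrate why mergeable sketches fit this setting, here is a toy Misra-Gries frequent-items summary with a merge step: each client keeps a small, bounded-size summary of its categorical column, and the server merges the client summaries. This is a simplified, hypothetical sketch; the sketch families, parameters, and error guarantees used in FedPS may differ.

from collections import Counter

class MisraGries:
    """Toy frequent-items sketch keeping at most k-1 counters."""
    def __init__(self, k):
        self.k = k
        self.counters = Counter()

    def update(self, item):
        """Client side: feed one value from the local column."""
        if item in self.counters or len(self.counters) < self.k - 1:
            self.counters[item] += 1
        else:                                   # decrement all, drop zeros
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def merge(self, other):
        """Server side: combine two client sketches into one."""
        merged = self.counters + other.counters
        if len(merged) > self.k - 1:
            # Subtract the k-th largest count and keep only positive remainders.
            kth = sorted(merged.values(), reverse=True)[self.k - 1]
            merged = Counter({key: c - kth for key, c in merged.items() if c > kth})
        out = MisraGries(self.k)
        out.counters = merged
        return out

# Each client sketches its local column; the server merges the sketches.
client_a, client_b = MisraGries(k=3), MisraGries(k=3)
for v in ["cat", "cat", "dog", "bird"]:
    client_a.update(v)
for v in ["dog", "dog", "fish"]:
    client_b.update(v)
global_sketch = client_a.merge(client_b)        # approximate global heavy hitters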

Some preprocessors rely on machine learning models, for example KNNImputer (k-nearest neighbours) and IterativeImputer (a regression model). We extend these algorithms to the federated setting, with the sufficient statistics they require summarized in Table 1.
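
As one illustration of the "Sum" entry for IterativeImputer in Table 1, the regression at the core of iterative imputation can be fit from summed sufficient statistics: each client contributes X^T X and X^T y computed on its local rows, and the server solves the aggregated normal equations. The function names and the ridge term alpha below are illustrative choices, not the paper's exact formulation.

import numpy as np

def local_suffstats(X, y):
    """Client: Gram matrix and cross-term computed on local data only."""
    return X.T @ X, X.T @ y

def fit_global_regression(stats, alpha=1e-6):
    """Server: sum per-client statistics and solve for the weights."""
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    d = XtX.shape[0]
    return np.linalg.solve(XtX + alpha * np.eye(d), Xty)

# Two clients with the same feature space but disjoint rows:
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 3)), rng.normal(size=(30, 3))
w_true = np.array([1.0, -2.0, 0.5])
y1, y2 = X1 @ w_true, X2 @ w_true
w = fit_global_regression([local_suffstats(X1, y1), local_suffstats(X2, y2)])
# w matches the model fit on the pooled data, without sharing any raw rows.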

Why Not Other Approaches?

Several preprocessing approaches are possible in federated learning:

  1. Centralized preprocessing: upload all data to the server.
  2. No preprocessing: skip preprocessing entirely.
  3. Transfer preprocessing: use public data or pre-trained models.
  4. Local preprocessing: each client processes data independently.
  5. Federated preprocessing: our proposed FedPS framework.

All alternatives except federated preprocessing have major limitations (Table 2).

Table 2: Approaches for data preprocessing in federated learning.

Approach                      | Key summary
1. Centralized preprocessing  | Pros: consistent across clients. Cons: infeasible in realistic settings.
2. No preprocessing           | Cons: data issues remain; poor model performance.
3. Transfer preprocessing     | Pros: works for easy tasks. Cons: fails on complex tasks.
4. Local preprocessing        | Pros: easy to implement and deploy. Cons: inconsistent transforms across clients.
5. Federated preprocessing    | Pros: consistent global preprocessing; better model performance.

Experiments

We evaluate StandardScaler and OrdinalEncoder under three preprocessing strategies (federated, local, and none) on the Cover dataset, training an MLP model with the FedAvg algorithm. Test accuracy results are shown in Figure 2.

We consider two types of client data distributions:

  • IID setting: data is uniformly partitioned across clients.
  • Non-IID setting: data is partitioned with skewed label distributions across clients (see the sketch below).
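
As a concrete example of a label-skewed partition, the snippet below draws per-class client proportions from a Dirichlet distribution, a common construction in federated learning experiments. The partitioning scheme and the alpha parameter are illustrative assumptions, not necessarily the ones used in the paper.

import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with Dirichlet label skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Split this class's samples across clients in Dirichlet proportions.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]

# Smaller alpha gives more heterogeneous clients, e.g.:
# parts = dirichlet_label_partition(labels, num_clients=10, alpha=0.1)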

Figure 2: Test accuracy comparison under IID (left) and non-IID (right) settings.

The results highlight three key observations:

  1. Preprocessing is essential: there is a large accuracy gap compared with no preprocessing.
  2. Local preprocessing fails in non-IID settings due to inconsistent transforms.
  3. Federated preprocessing achieves stable and superior performance in all scenarios.

Citation

@misc{Xu2026fedps,
  title={FedPS: Federated data Preprocessing via aggregated Statistics},
  author={Xuefeng Xu and Graham Cormode},
  year={2026},
  eprint={2602.10870},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10870},
}