Handling Dynamics in HFL Pipelines: The Need for Adaptive Orchestration 

Ana Petra Jukić / November 12, 2025 

 

 

Motivation for orchestration 

Hierarchical Federated Learning (HFL) offers solutions to scalability issues, excessive communication overhead, prolonged training time, and data and hardware heterogeneity [1]. However, configuring an optimal multi-layer HFL architecture is not trivial. For example, aggregator nodes must be placed strategically so that clients can be assigned to them based on network distance, hardware resources, and dataset distribution. The latter often involves non-identical distributions, such as label-distribution skew, which usually lead to poor model performance. Another issue is that clients may join or leave a running HFL pipeline: newly joined clients may contribute significantly different amounts of more or less useful training data. In such a dynamic environment, a previously optimal configuration (such as client-to-aggregator assignment, client inclusion in the training process, and aggregator placement) can quickly become suboptimal, leading to model performance degradation and unnecessary energy use and communication overhead during a running HFL training task. For example, we may observe two scenarios with different reconfiguration outcomes under a total budget constraint: 

[Figure: reconfiguration outcomes in Scenario A (left) and Scenario B (right)]

In both cases, a new node joins one of the clusters at round R_reconf. In Scenario A, adding the new low-cost, large-dataset node improves performance and budget efficiency, making the configuration more feasible, whereas in Scenario B, the high-cost, small-dataset node slows accuracy gains and exhausts the budget faster than the original setup. Clearly, HFL implementations operating in real-world, highly dynamic environments must incorporate reconfiguration validation and decide whether to revert to the original configuration, while considering that reverting also incurs extra cost. 
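To make the budget trade-off concrete, here is a minimal Go sketch with entirely hypothetical per-round costs and accuracy gains (none of these numbers come from the AIoTwin experiments). It simply checks whether a configuration can still reach a target accuracy before the remaining budget is exhausted, which is what separates Scenario A from Scenario B.

```go
package main

import (
	"fmt"
	"math"
)

// feasible reports whether a configuration can reach the target accuracy
// before the remaining budget runs out, under a deliberately simplistic model:
// a fixed communication cost per round and a fixed accuracy gain per round.
func feasible(budget, costPerRound, accGap, gainPerRound float64) (bool, int, int) {
	affordable := int(budget / costPerRound)
	needed := int(math.Ceil(accGap / gainPerRound))
	return needed <= affordable, needed, affordable
}

func main() {
	budget := 100.0 // budget remaining at round R_reconf (arbitrary units)
	accGap := 10.0  // accuracy points still missing to reach the target

	// Hypothetical numbers only: Scenario A adds a cheap node with a lot of data
	// (faster convergence); Scenario B adds an expensive node with little data.
	configs := []struct {
		name         string
		costPerRound float64
		gainPerRound float64
	}{
		{"original", 6.0, 0.50},
		{"scenario A", 6.5, 0.80},
		{"scenario B", 12.0, 0.55},
	}
	for _, c := range configs {
		ok, needed, affordable := feasible(budget, c.costPerRound, accGap, c.gainPerRound)
		fmt.Printf("%-10s feasible=%v (needs %d rounds, can afford %d)\n",
			c.name, ok, needed, affordable)
	}
}
```

Under these made-up numbers, only Scenario A can still reach the target within the budget; Scenario B runs out of rounds first, which is exactly the situation in which reverting should be considered.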

One promising solution is to employ an orchestrator, a component that monitors system and FL metrics, makes decisions, and coordinates hierarchical reconfiguration for adaptive roles, load balancing, and optimized communication. The orchestration objective is tailored to the scenario, such as maximizing model performance within a predefined budget. A solution of this type is the AIoTwin orchestration middleware, an adaptive orchestration framework for HFL under a predefined budget [2]. It detects and responds to events that trigger HFL reconfiguration based on multi-level monitoring (accuracy, resources, and costs), while estimating reconfiguration cost and impact on model accuracy. By using Kubernetes, the framework adapts to node joins, failures, and load shifts without disrupting the workflow. The rest of this post covers our design and implementation details. 

 

Architecture, Mechanisms, and Implementation of the AIoTwin orchestration middleware for HFL 

The adaptive FL orchestration architecture includes the following components (a minimal Go sketch of their roles follows the list): 

  • Orchestrator consisting of: 

    • FL Controller - manages the pipeline 
    • FL Configuration module - selects the optimal pipeline setup for the environment 
    • HTTP-based FL Orchestrator API - receives input from the ML engineer 
  • Nodes

    • Clients or aggregators running FL services through a virtualization engine. The service also includes a sidecar FL Agent that monitors the FL pipeline and reports to the FL Controller. Each node runs a single service. 
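To make the split of responsibilities concrete, the following Go sketch models the components above as a few narrow types. It is illustrative only; the names (NodeFeatures, Topology, ConfigurationModule, FLAgent, FLController, Event) are placeholders and not the types used in the fl-orchestrator codebase.

```go
package orchestration

import "context"

// NodeFeatures is what the controller learns about each node: compute
// resources, link cost to candidate aggregators, and local data distribution.
type NodeFeatures struct {
	Name        string
	CPUMillis   int64
	MemoryBytes int64
	LinkCost    map[string]float64 // cost of communicating with each aggregator
	LabelCounts map[int]int        // per-class sample counts reported by the FL Agent
}

// Topology maps nodes to FL roles and wires clients to aggregators.
type Topology struct {
	Aggregators []string
	Clients     map[string]string // client node -> assigned aggregator node
}

// ConfigurationModule selects the (near-)optimal pipeline setup for the
// current environment; any state-of-the-art algorithm can sit behind it.
type ConfigurationModule interface {
	Configure(nodes []NodeFeatures, budget float64) (Topology, error)
}

// FLAgent is the per-node sidecar that reports local data and resource usage.
type FLAgent interface {
	Report(ctx context.Context) (NodeFeatures, error)
}

// FLController manages the pipeline: it deploys a topology, monitors it,
// and triggers re-evaluation when nodes join, leave, or degrade.
type FLController interface {
	Deploy(ctx context.Context, t Topology) error
	Monitor(ctx context.Context) (<-chan Event, error)
}

// Event signals a change that may invalidate the current configuration.
type Event struct {
	Kind string // e.g. "node-joined", "node-left", "accuracy-stalled"
	Node string
}
```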

In the implementation, the components shown in yellow in the figure below are provided by Kubernetes and used by the green, FL-specific components through the Kubernetes API: 

[Figure: Kubernetes components (yellow) and FL-specific components (green)]

Integrating through the Kubernetes API does not tie the orchestrator to a full Kubernetes cluster, as lightweight edge distributions such as K3s expose the same API. The orchestrator is written in Golang (available at https://github.com/AIoTwin/fl-orchestrator). 
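As an example of how an FL-specific component can use that API, the snippet below lists the available nodes and their allocatable resources with client-go, which is the kind of information needed when collecting node features (step 2 below). This is a minimal sketch, assuming the orchestrator runs in-cluster with RBAC permission to list nodes; it is not taken from the fl-orchestrator repository.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the orchestrator runs inside the cluster with rights to list nodes.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Node availability and allocatable resources, as used for node features.
	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		cpu := n.Status.Allocatable["cpu"]
		mem := n.Status.Allocatable["memory"]
		fmt.Printf("node=%s cpu=%s memory=%s\n", n.Name, cpu.String(), mem.String())
	}
}
```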

The orchestration consists of the following steps, some of which are repeated iteratively until the objective is reached: 

[Figure: overview of the orchestration steps]

The steps are as follows: 

  1. Receive input defined by the ML engineer: the initial model, budget, FL and training parameters (batch size, epochs, learning rate), and an objective (for example, maximizing performance within a budget or minimizing cost for a target accuracy). Training cost may be defined in terms of energy or communication. Once the input is set, the pipeline runs automatically. The orchestrator API collects these parameters and cost settings, which are then injected into each node's task module within the FL service [1]. A sketch of such an input payload is shown after this list.

  2. Collect node features:  discover all nodes available for the (H)FL pipeline and gather infrastructure and FL-specific details. Infrastructure details include each node’s computing resources and its network connections to the aggregators, used to estimate communication cost. FL-specific data identifies aggregators and clients with training data. For clients, the controller queries the FL Agent for data distribution and any past training or resource-usage history. Node availability, resources, and cost are retrieved via the Kubernetes API.  

  3. Identify optimal FL configuration: the user inputs and node features are passed to the FL Configuration module to determine the optimal FL setup. The module can incorporate any state-of-the-art configuration algorithm; its output defines the FL topology, mapping clients and aggregators to nodes and defining the connections between them. The currently implemented methods rely either on the Kullback-Leibler divergence (KLD) between label distributions or on minimizing communication cost (see the KLD sketch after this list). 
  4. Deploy FL components on nodes following the optimal FL configuration. 

  5. Monitor the environment: the orchestrator monitors both the infrastructure and FL performance. Infrastructure monitoring tracks components that affect FL performance, including network conditions and node resources and states. FL performance monitoring tracks training metrics. If a node unexpectedly joins or leaves during monitoring, the system triggers a re-evaluation of the pipeline configuration. 

  6. Validate reconfiguration if a new optimal configuration is identified: before applying the new configuration, the controller evaluates whether the reconfiguration costs (computational, communication, or time-related) are feasible. If the reconfiguration is beneficial, the controller repeats the deployment of clients and aggregators (step 4) using a snapshot of the latest global model to preserve training progress. Otherwise, the controller keeps the current (possibly suboptimal) configuration. The new configuration is later further validated using the reconfiguration validation algorithm (RVA). 
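As an illustration of step 1, the input submitted to the FL Orchestrator API could look roughly like the payload produced below. The field names and JSON layout are hypothetical, chosen only to show the kind of information involved (initial model, budget, training parameters, objective); the actual API defines its own schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskInput mirrors the kind of information the ML engineer hands to the
// orchestrator API in step 1. Field names are hypothetical, not the actual schema.
type TaskInput struct {
	ModelRef     string  `json:"modelRef"`     // initial model (e.g., an artifact reference)
	Budget       float64 `json:"budget"`       // total budget, e.g., in communication-cost units
	Objective    string  `json:"objective"`    // "max-accuracy-within-budget" or "min-cost-for-target-accuracy"
	TargetAcc    float64 `json:"targetAccuracy,omitempty"`
	BatchSize    int     `json:"batchSize"`
	LocalEpochs  int     `json:"localEpochs"`
	LearningRate float64 `json:"learningRate"`
}

func main() {
	in := TaskInput{
		ModelRef:     "registry.example.org/models/cnn:initial",
		Budget:       500,
		Objective:    "max-accuracy-within-budget",
		BatchSize:    32,
		LocalEpochs:  2,
		LearningRate: 0.01,
	}
	body, _ := json.MarshalIndent(in, "", "  ")
	fmt.Println(string(body)) // payload that would be POSTed to the FL Orchestrator API
}
```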
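For step 3, one way to use KLD is to compare the label distribution a cluster would have after assigning a candidate client against the global label distribution, and pick the aggregator that keeps the cluster closest to it. The sketch below only illustrates this idea with made-up distributions; it is not the algorithm implemented in the FL Configuration module.

```go
package main

import (
	"fmt"
	"math"
)

// klDivergence computes D_KL(p || q) for two discrete label distributions.
// A small epsilon guards against zero probabilities in q.
func klDivergence(p, q []float64) float64 {
	const eps = 1e-12
	d := 0.0
	for i := range p {
		if p[i] > 0 {
			d += p[i] * math.Log(p[i]/(q[i]+eps))
		}
	}
	return d
}

func main() {
	// Global (reference) label distribution over four classes.
	global := []float64{0.25, 0.25, 0.25, 0.25}

	// Label distributions a cluster would have after adding the candidate client
	// to aggregator A or aggregator B (hypothetical numbers).
	withAggA := []float64{0.26, 0.24, 0.27, 0.23}
	withAggB := []float64{0.55, 0.30, 0.10, 0.05}

	fmt.Printf("KLD if assigned to A: %.4f\n", klDivergence(withAggA, global))
	fmt.Printf("KLD if assigned to B: %.4f\n", klDivergence(withAggB, global))
	// The client would be assigned to the aggregator yielding the lower divergence (A here).
}
```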

 

Reconfiguration validation algorithm (RVA) 

After deploying the new configuration (at round R_reconf), the orchestrator waits for a reconfiguration window of W rounds (between R_reconf and R_eval) and then uses the RVA to validate whether to keep the new configuration or revert to the old one. The RVA proceeds as follows: 

[Figure: the RVA decision steps]
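As a rough intuition for the decision the RVA makes, the sketch below keeps the new configuration only if it converges at least as efficiently (accuracy gained per unit of budget over the window of W rounds) as the old one, or if reverting would itself cost more than it recovers. The criterion and all numbers are placeholders; the formal algorithm is given in [2].

```go
package main

import "fmt"

// keepNewConfig is a placeholder for the RVA decision: after the window of W
// rounds between R_reconf and R_eval, keep the new configuration if it improves
// accuracy at least as fast per unit of budget as the old one did, or if the
// one-off cost of reverting would not pay for itself.
func keepNewConfig(oldGainPerCost, newGainPerCost, revertCost, remainingBudget float64) bool {
	if newGainPerCost >= oldGainPerCost {
		return true // the new configuration converges at least as efficiently
	}
	budgetAfterRevert := remainingBudget - revertCost
	if budgetAfterRevert <= 0 {
		return true // reverting is not affordable at all
	}
	// Keep the new configuration unless the old one, on the smaller post-revert
	// budget, is still expected to gain more accuracy than the new one would.
	return oldGainPerCost*budgetAfterRevert <= newGainPerCost*remainingBudget
}

func main() {
	// Accuracy gained per unit of communication cost over the last W rounds
	// under the old and the new configuration (hypothetical values).
	fmt.Println(keepNewConfig(0.08, 0.12, 5, 60)) // true: the new configuration is simply better
	fmt.Println(keepNewConfig(0.12, 0.05, 5, 60)) // false: revert despite the revert cost
}
```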

 

The formal description of the RVA, along with additional details on experiments with simulated, real, and cross-site infrastructures, is available in [2]. 

References 

[1] K. Vuknić, “Hierarchical Federated Learning: Distributed Intelligence Beyond the Cloud,” AIoTwin, Nov. 7, 2025. [Online]. Available: https://aiotwin.eu/aiotwin/results/blog/hierarchical_federated_learning 

[2] I. Čilić et al., “Reactive Orchestration for Hierarchical Federated Learning Under a Communication Cost Budget,” in Proc. 2025 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), IEEE, 2025.