Hierarchical Federated Learning: Distributed Intelligence Beyond the Cloud 

Katarina Vuknić / edited: November 7, 2025 

A promising approach for distributed learning in edge environments is Federated Learning (FL), first introduced in 2016 [1]. In a traditional FL pipeline, a single server and multiple clients exchange model updates instead of raw data to collaboratively learn a global model that generalizes well across all clients. This approach significantly enhances data privacy and reduces bandwidth usage.  

However, traditional FL faces two major limitations: (i) communication cost, which is driven by model size, client participation rate, and the number of training rounds; and (ii) reliance on a single aggregator node, which creates a system bottleneck and a single point of failure, limiting scalability and robustness.

To address these issues, an extension known as Hierarchical Federated Learning (HFL) has been proposed. 

Concept and Working Principles 

HFL extends the traditional "flat" FL architecture by introducing an intermediate layer of edge servers between the central cloud server and the clients (Figure 1). Edge servers aggregate updates from their local clients and, in turn, act as clients to the central cloud server, enabling a more scalable, resilient, and communication-efficient learning architecture. This multi-tier design aligns well with the hierarchical structure of Cloud-Edge-IoT environments, also known as the computing continuum.


Figure 1. Architecture of hierarchical federated learning 

The HFL learning process proceeds in the following steps (a minimal code sketch of the full training loop follows the list):

  1. Model initialization and distribution: The global model is initialized at the cloud server and distributed to all edge server nodes. Edge server nodes propagate the received model to their respective clients.

  2. Local training on the clients: Each client trains the received model on its local data for a predefined number of iterations or epochs.

  3. Model aggregation on the edge servers: Clients send updated model parameters to the corresponding edge server. The edge server aggregates all received local model updates using a chosen strategy (e.g., FedAvg) and shares the new model with all corresponding clients. Steps 2 and 3 are repeated for a predefined number of iterations or local rounds.

  4. Model aggregation on the cloud server: Edge servers send updated models to the cloud server. The cloud server aggregates all received model updates using a chosen strategy (e.g., FedAvg) and shares the new model with all edge servers. Steps 1–4 are repeated for a predefined number of iterations or global rounds, until the target model accuracy is reached or the model has converged.
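To make the control flow concrete, below is a minimal, framework-agnostic sketch of this nested loop in Python. The model, the data, and the local_train logic are toy stand-ins introduced purely for illustration (they are not taken from the AIoTwin implementation); only the loop structure and the sample-weighted FedAvg mirror the steps above.

```python
import numpy as np

# Toy stand-ins: the "model" is one parameter vector, and "training" nudges
# it toward the local data mean. A real system would use an actual model
# and optimizer here.

def init_model():
    return [np.zeros(2)]                                 # one "layer"

def local_train(model, data, epochs, lr=0.1):
    params = [layer.copy() for layer in model]
    for _ in range(epochs):                              # dummy gradient steps
        params[0] += lr * (data.mean(axis=0) - params[0])
    return params, len(data)                             # updated model, sample count

def fed_avg(models, sizes):
    """FedAvg: parameters averaged, weighted by each contributor's sample count."""
    total = float(sum(sizes))
    return [sum(layer * (n / total) for layer, n in zip(layers, sizes))
            for layers in zip(*models)]

def hfl_train(edge_groups, global_rounds=3, local_rounds=2, epochs=1):
    global_model = init_model()                          # Step 1: init at the cloud
    for _ in range(global_rounds):
        edge_models, edge_sizes = [], []
        for clients in edge_groups:                      # clients grouped per edge server
            edge_model = [layer.copy() for layer in global_model]
            for _ in range(local_rounds):                # Steps 2-3: local rounds
                results = [local_train(edge_model, data, epochs) for data in clients]
                models, sizes = zip(*results)            # Step 2: local training
                edge_model = fed_avg(models, sizes)      # Step 3: edge aggregation
            edge_models.append(edge_model)
            edge_sizes.append(sum(sizes))
        global_model = fed_avg(edge_models, edge_sizes)  # Step 4: cloud aggregation
    return global_model

rng = np.random.default_rng(0)
edge_groups = [[rng.normal(loc=i, size=(20, 2)) for _ in range(3)] for i in range(2)]
print(hfl_train(edge_groups))
```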

The roles of worker nodes (clients) and aggregators (edge servers) can be assigned based on various criteria (available resources, network proximity, data or model similarity, or performance over time), or statically configured according to the developer's deployment plan. In resource-based assignment, nodes with greater computational power, memory, and network bandwidth are selected as aggregators, while less capable nodes serve as worker nodes. When relying on network proximity, aggregators are nodes with low-latency connections to other nodes. Another approach groups clients based on data or model similarity, then selects one client per group to act as an aggregator.

More advanced systems use dynamic role assignment, where nodes adaptively switch roles based on real-time performance, availability, and other factors. Alternatively, roles may be statically defined according to the deployment plan or configuration, especially in simulations. The design of the assignment logic depends on system utilization, architecture, and constraints; a simple resource-based selection heuristic is sketched below.
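As a simple illustration of resource-based assignment, the sketch below ranks nodes by a weighted resource score and promotes the top-ranked ones to aggregators. The node attributes and scoring weights are arbitrary assumptions made for this example, not part of any particular HFL framework.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_cores: int
    memory_gb: float
    bandwidth_mbps: float

def assign_roles(nodes, num_aggregators):
    """Pick the best-provisioned nodes as aggregators; the rest become workers."""
    # Illustrative weights: CPU dominates, then memory, then bandwidth.
    score = lambda n: n.cpu_cores * 10 + n.memory_gb + n.bandwidth_mbps / 10
    ranked = sorted(nodes, key=score, reverse=True)
    return ranked[:num_aggregators], ranked[num_aggregators:]

nodes = [Node("edge-1", 8, 16, 1000), Node("pi-1", 4, 4, 100),
         Node("pi-2", 4, 2, 100), Node("edge-2", 16, 32, 1000)]
aggregators, workers = assign_roles(nodes, num_aggregators=2)
print([n.name for n in aggregators], [n.name for n in workers])
```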

Challenges and Strengths of HFL Compared to Traditional FL 

HFL offers significant advantages in distributed environments, but it introduces trade-offs compared to traditional flat FL. Its major strength is communication efficiency, while a slower convergence rate is its main challenge.

By introducing an intermediate layer of edge servers, HFL significantly reduces uplink communication costs compared to traditional flat FL. Instead of every client communicating with the cloud, local updates are first aggregated at nearby edge servers, lowering bandwidth usage and enabling parallel processing. However, communication efficiency in HFL is influenced by several factors, including the number and placement of edge servers, client-to-aggregator ratios, and synchronization protocols. Techniques such as update compression and orchestrator-based coordination help maintain communication efficiency, even under dynamic network conditions or shifting client participation. 
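As a rough, back-of-the-envelope illustration of the uplink saving (with made-up numbers, and ignoring downlink traffic, compression, and protocol overhead):

```python
# Uplink volume per global round, flat FL vs. HFL (illustrative numbers only).
model_mb     = 25      # serialized model size
num_clients  = 100
num_edges    = 5
local_rounds = 4       # client->edge rounds per one edge->cloud round

# Flat FL: every client uploads to the cloud; to match the same amount of
# local work, assume `local_rounds` cloud rounds.
flat_wan_mb = num_clients * model_mb * local_rounds

# HFL: client->edge traffic stays on the local network; only the edge
# servers upload to the cloud over the WAN.
hfl_lan_mb = num_clients * model_mb * local_rounds
hfl_wan_mb = num_edges * model_mb

print(f"Flat FL WAN uplink: {flat_wan_mb} MB")
print(f"HFL WAN uplink: {hfl_wan_mb} MB (+ {hfl_lan_mb} MB on local links)")
```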

The convergence rate in HFL is typically somewhat slower than in centralized FL due to several additional complexities. First, the multi-level aggregation structure introduces latency and can amplify gradient variance. Second, data heterogeneity, both within client groups and across regions, leads to model divergence and slower alignment. Third, stale or delayed updates caused by asynchronous communication further affect training stability. Finally, the dynamic nature of real-world deployments (reassigning clients to different edge servers, role changes between worker and aggregator nodes, fluctuating bandwidth, and device heterogeneity) requires frequent reconfiguration, which can interrupt or delay convergence. These issues are actively being addressed through techniques such as variance reduction, adaptive aggregation, and orchestrator-based coordination.

Applying HFL in Practice 

HFL can be particularly useful for applications deployed over distributed infrastructure comprising sensors and edge computing units. In such systems, sensor readings may be used locally for model training at the units that collected them or offloaded to nearby edge nodes with greater computational capacity. For example, in a smart farming system deployed across geographically distributed farms, local devices at each farm act as worker nodes, collecting data such as soil moisture, temperature, and crop health. Regional aggregator nodes, strategically placed based on network proximity or resource availability, aggregate updates from these local workers, capturing regional patterns. During global aggregation, the cloud server synthesizes the regional models into a unified global model, enabling system-wide learning while preserving data locality and reducing communication overhead. This hierarchical structure allows efficient scaling and adaptation to diverse environmental conditions across different regions. Another example is a smart traffic light control system, where HFL can be used to learn optimal traffic signal timings adapted to current traffic conditions. See [2] for more details. 

The AIoTwin HFL Solution: Extension of the Flower Framework for HFL 

To fully enable HFL in dynamic environments, the AIoTwin project developed an Extension of the Flower Framework for Hierarchical Federated Learning as a core component of its overall orchestration middleware. This component implements the FL Service used as a Learning Service within the larger ML pipeline architecture. 

This component provides an original and generic implementation of the HFL services (client, local aggregator, global aggregator) based on the Flower framework and is built in Python.

The HFL services rely on a task module specified for both the client and the global aggregator. This module defines the local/global model architecture using PyTorch, including functions for retrieving and setting model weights, training and evaluating the model, and managing the local client dataset. 
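For illustration, a minimal task module in this style might look like the sketch below. The network, the synthetic dataset, and the function names are simplified assumptions made to keep the example self-contained; they are not the actual AIoTwin code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Net(nn.Module):
    """Toy stand-in for the model architecture defined in the task module."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.fc(x)

def get_weights(net):
    """Model parameters as a list of NumPy arrays (the format Flower exchanges)."""
    return [val.cpu().numpy() for val in net.state_dict().values()]

def set_weights(net, weights):
    keys = net.state_dict().keys()
    net.load_state_dict({k: torch.tensor(v) for k, v in zip(keys, weights)})

def train(net, loader, epochs, lr=0.01):
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    net.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()

def test(net, loader):
    """Return average loss and accuracy over the local dataset."""
    loss_fn, loss, correct, n = nn.CrossEntropyLoss(), 0.0, 0, 0
    net.eval()
    with torch.no_grad():
        for x, y in loader:
            out = net(x)
            loss += loss_fn(out, y).item() * len(y)
            correct += (out.argmax(1) == y).sum().item()
            n += len(y)
    return loss / n, correct / n

def load_data(num_samples=256, batch_size=32):
    """Synthetic local dataset; a real task module would load client data here."""
    x = torch.randn(num_samples, 10)
    y = (x.sum(dim=1) > 0).long()
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)
```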

 

Each HFL service and its AIoTwin implementation are summarized below.

Client: A Python module based on the default Flower client, defining fit and evaluation methods that utilize the task module logic. Configuration, including the local aggregator address, batch size, learning rate, and number of local epochs, is read from a YAML file.

Local Aggregator: Implemented using Python threading to run both a Flower client and a Flower server in parallel. The client thread receives the global model, which is stored in shared memory. The server thread then manages the training process for its local clients for a predefined number of local rounds. After the updates are merged, the parameters are sent to the global server via the client thread.

Global Aggregator: A standard Flower server designed to treat the local aggregators as its clients. It uses the model definition from the task module to initialize the global model and reads strategy parameters (such as the minimum number of required clients and the number of global rounds) from a YAML file.
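For illustration, a minimal Flower client wired to such a task module might look like the sketch below. It assumes the hypothetical task module from earlier is saved as task.py and that a config.yaml provides the listed keys; it is a simplified stand-in, not the actual AIoTwin client.

```python
import flwr as fl
import yaml

# Assumes the task-module sketch above, saved as task.py, plus a config.yaml
# with keys such as: aggregator_address, local_epochs, learning_rate.
from task import Net, get_weights, set_weights, train, test, load_data

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

class HFLClient(fl.client.NumPyClient):
    def __init__(self):
        self.net = Net()
        self.loader = load_data()

    def get_parameters(self, config):
        return get_weights(self.net)

    def fit(self, parameters, config):
        set_weights(self.net, parameters)                 # receive aggregator model
        train(self.net, self.loader, cfg["local_epochs"],
              lr=cfg["learning_rate"])                    # local training
        return get_weights(self.net), len(self.loader.dataset), {}

    def evaluate(self, parameters, config):
        set_weights(self.net, parameters)
        loss, accuracy = test(self.net, self.loader)
        return loss, len(self.loader.dataset), {"accuracy": accuracy}

if __name__ == "__main__":
    # Connect to the (local) aggregator address from the YAML configuration.
    fl.client.start_numpy_client(
        server_address=cfg["aggregator_address"], client=HFLClient()
    )
```

A global aggregator in this style is essentially a stock Flower server; the sketch below initializes a FedAvg strategy from the task-module model (the YAML keys min_clients and global_rounds are, again, assumptions):

```python
import flwr as fl
import yaml

from task import Net, get_weights

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# FedAvg initialized from the task-module model, so the first round starts
# from a well-defined set of parameters.
strategy = fl.server.strategy.FedAvg(
    min_available_clients=cfg["min_clients"],
    initial_parameters=fl.common.ndarrays_to_parameters(get_weights(Net())),
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=cfg["global_rounds"]),
    strategy=strategy,
)
```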

This service is implemented using Python 3.10 (slim) with PyTorch 2.2.1+cpu, Torchvision 0.17.1+cpu, and Flower 1.7.0, and is licensed under the Apache License 2.0.

You can explore the full implementation on the GitHub repository: https://github.com/AIoTwin/fl-service

References 

[1] McMahan, Brendan, et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data." Artificial Intelligence and Statistics. PMLR, 2017.

[2] Rana, Omer, et al. "Hierarchical and Decentralised Federated Learning." 2022 Cloud Continuum. IEEE, 2022.