Federated learning (FL) offers a collaborative framework for training foundation models (FMs) and other AI models across distributed computing infrastructures and datasets while incorporating privacy-preserving techniques to manage privacy-sensitive datasets. This proposal addresses the challenges inherent in adapting FL to the “pre-train” and “fine-tune” paradigms of FMs with billions or trillions of parameters. These challenges include increased communication costs, computation burdens on clients, and the handling of massive model parameters and multi-modal datasets. Moreover, existing privacy-preserving techniques, such as differential privacy (DP), must be extended to address scalability issues with such large models and heterogeneous privacy requirements across clients. With synthetic data emerging as a promising alternative, new data management challenges are anticipated in the privacy-preserving FL (PPFL) framework with privacy-sensitive datasets and synthetic data.
The project develops efficient communication, memory, and energy optimization techniques for FL algorithms, particularly for large-scale FMs, while ensuring fairness and incentivizing participation. It advances DP techniques to address scalability and heterogeneity challenges, creates and manages synthetic data to preserve privacy while maintaining data utility, and integrates these efforts into a cohesive data management framework to enhance the scalability and performance of PPFL systems. Specifically, the research is structured around four main thrusts: (1) improving communication, memory, and energy efficiency; (2) addressing continual learning with incentives and fairness; (3) developing scalable and heterogeneous DP techniques; and (4) creating synthetic data as a privacy-preserving alternative. A crosscutting thrust integrates these efforts, providing efficient model and data management schemes using tools such as Mofka and ProxyStore to handle access, sharing, versioning, control, and evolution of large datasets and models.
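To make the ideas in thrusts (1) and (3) concrete, the sketch below shows one round of federated averaging in which each client clips its local update and adds Gaussian noise before the server aggregates, a common differential-privacy mechanism. This is a minimal, generic illustration only: the function names, the least-squares local objective, and the clipping and noise parameters are illustrative assumptions and do not represent the project's actual algorithms or the Mofka/ProxyStore interfaces.

```python
# Minimal sketch of one DP federated-averaging round (illustrative only).
# Each client holds its own data; names and parameters are hypothetical.
import numpy as np

def local_update(global_model, client_data, lr=0.1):
    """One local gradient step on a least-squares objective (stand-in for FM fine-tuning)."""
    X, y = client_data
    grad = X.T @ (X @ global_model - y) / len(y)
    return global_model - lr * grad

def privatize(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the client update to a fixed norm and add Gaussian noise (DP-style mechanism)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_std * clip_norm, size=update.shape)

def federated_round(global_model, clients, rng=None):
    """Average the privatized client updates (FedAvg-style aggregation)."""
    updates = []
    for data in clients:
        local_model = local_update(global_model, data)
        updates.append(privatize(local_model - global_model, rng=rng))
    return global_model + np.mean(updates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, clients = 5, []
    true_w = rng.normal(size=dim)
    for _ in range(4):                      # four synthetic clients
        X = rng.normal(size=(50, dim))
        clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))
    w = np.zeros(dim)
    for _ in range(20):                     # twenty communication rounds
        w = federated_round(w, clients, rng=rng)
    print("distance to true weights:", np.linalg.norm(w - true_w))
```

At the billion-parameter scale targeted by the project, client updates would be compressed and streamed rather than held in memory as dense arrays, but the clip, noise, and average pattern shown here is the same basic building block that the efficiency and DP thrusts aim to scale.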
This research effort significantly advances the field of PPFL by enhancing the scalability and efficiency of training large FMs, ensuring fairness and incentive structures for client participation in FL, developing scalable DP techniques that maintain model utility while ensuring privacy, and creating high-quality synthetic data as a proxy for sensitive datasets. The integration of these thrusts is demonstrated through specific scientific use cases in X-ray image science and electric grids, focusing on efficiently training large FMs with substantial data streams subject to privacy constraints. The outcomes ensure the sustainable and trustworthy training and deployment of FMs for science, benefiting a wide range of applications and advancing the state of the art in AI and FL.
Funding Sources
- Advancements in Artificial Intelligence for Science Program, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Participants
- Kibaek Kim (PI), Ravi Madduri, Todd Munson, Krishnan Raghavan, Rob Ross, and Matthieu Dorier — Argonne National Laboratory
- Tom Flynn, Ai Kagawa, and Byung-Jun Yoon — Brookhaven National Laboratory
- Olivera Kotevska and Christian Engelmann — Oak Ridge National Laboratory
- Minseok Ryu — Arizona State University
- Farzad Yousefian — Rutgers University
In the News
2024-10-15: ORNL News. New ORNL projects included in $67 million from DOE for AI in science research.