- Software Defined Networking for HPC Interconnects and Its Extension across Domains
Software Defined Networking (SDN) is an emerging networking technology that allows software innovation in network control. It has been widely deployed in campus networks and data centers. However, existing SDN technology is mainly based on Ethernet, which does not provide the low-latency, high-bandwidth communication required by many traditional High Performance Computing (HPC) applications. In this project, we will investigate the techniques and benefits of introducing OpenFlow-style SDN capability into InfiniBand, the dominant interconnection network technology for commodity HPC clusters; develop schemes that allow HPC systems and applications to exploit SDN capability; and validate our proposed techniques through modeling, simulation, and prototyping. We also plan to develop software systems that provide OpenFlow-style SDN capability in current InfiniBand networks and that interoperate with existing inter-domain SDN frameworks, such as the OSCARS system on ESnet, to provide SDN capability across multiple SDN domains.
- Applying Machine Learning Techniques in Interconnect Design
Machine learning, and in particular reinforcement learning (RL), has enabled a new family of effective optimization methods for sequential decision-making problems. Coupled with deep neural networks, which have been very successful in visual object recognition, speech recognition, and machine translation, systems trained with RL algorithms have surpassed grandmaster level in games such as heads-up no-limit Texas hold’em (DeepStack) and Go (AlphaGo Zero). These systems make better decisions than human experts in complex situations where the desired properties of a result are easy to evaluate but the decisions themselves are hard to make, which is also characteristic of interconnect design. This research seeks to apply machine learning, and in particular reinforcement learning, to the design of a specific class of networks as well as to general interconnect topology and routing.
- Enhancing Data and Communication Security in HPC Clusters
Existing security mechanisms for high-performance and distributed computing infrastructure are complex and difficult to deploy. As a result, many high-performance and distributed computing facilities do not deploy sufficient security mechanisms. This has prevented privacy-sensitive applications, such as those in the medical field, and security-sensitive applications from using such facilities. In this project, we will develop and deploy DICE (Data Insurance in the Cluster Environment) to enhance security in HPC and distributed computing clusters. DICE will consist of three major components: a container-based virtual cluster, a component to defend against side-channel attacks, and a secure execution ledger for auditing. The container-based virtual cluster will be built on Docker Linux containers. The Docker security mechanism will be enhanced by deploying an effective key management scheme for groups and by reducing the attack surface exposed to containers. Novel defense mechanisms will be developed and deployed to defend against side-channel attacks in the cluster environment by exploiting new security features in recent processors. The secure execution ledger will provide a global, holistic view of program execution across the whole system, enabling auditing of the behavior of individual users as well as user groups. DICE essentially creates a two-level security model: at the (physical) cluster level, a group of (mostly) mutually trusting users shares a single virtual cluster for its jobs; inside the virtual cluster, the group may use the existing security mechanisms of its software of choice to further refine security.
- Interconnection Network Technology for Future-generation Cloud Computing Data Centers and Supercomputers
As future supercomputers and data centers continue to grow, designing interconnection networks that satisfy performance requirements while meeting cost and power constraints becomes increasingly challenging. Even evaluating a single design choice can pose significant challenges. The goal of this project is twofold. First, we will investigate efficient evaluation methods that allow a comprehensive comparison of multiple design choices for large-scale systems with tens of thousands of nodes, using metrics that measure performance, power, and reliability. Second, we will develop novel topology, routing, and networking schemes that are more scalable, cost-effective, reliable, and power-efficient than existing proposals.