Optimizing Machine Learning

Machine Learning (ML) has become a central area of modern computing. The number of applications built on ML models has grown exponentially, and with it the demand for engineers who know how to use toolchains such as PyTorch and TensorFlow. There is, however, much more interesting knowledge in ML than just learning how to apply it to a problem. Computing an ML model graph is a computationally intensive task that relies on a number of optimization and parallelization algorithms hidden from the typical user. Understanding these algorithms is a central tool for the modern data scientist, as it differentiates the professional from the crowd of people working in this area. In this project, we aim to develop new parallelization and optimization techniques for computing ML graphs. Our team has already looked under the hood of three major ML engines: Glow, XLA (from TensorFlow), and JAX.

OpenMP Cluster (OMPC)

Parallel programming on clusters has been intensely researched, motivated by the potential impact that new ideas could have on the performance of scientific and data-analytic applications. Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). Moreover, task parallelism has been shown to be an efficient and seamless programming model for clusters. Thus, this project introduces a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP’s offloading standard to distribute annotated regions of code across the nodes of a distributed system, hiding the complexity of its efficient MPI-based data distribution and load balancing behind OpenMP task dependencies.


VArchC: Approximate Computing made easy

VArchC is a framework specially designed to represent computer architectures subjected to Approximate Computing. VArchC works as an extension to the ArchC Architecture Description Language, enabling its tools to automatically generate processor models that implement approximation techniques. Using VArchC, all a designer needs to inject approximations into a target architecture is an ArchC CPU model, specific high-level software models of the desired approximation behaviors, and a high-level configuration file that links them. This configuration file is written in the ADeLe language.

See more at 

Source Matching and Rewriting

A typical compiler flow relies on a uni-directional sequence of translation/optimization steps that lower the program abstract representation, making it hard to preserve higher-level program information across each transformation step. On the other hand, modern ISA extensions and hardware accelerators can benefit from the compiler’s ability to detect and raise program idioms to acceleration instructions or optimized library calls. Although recent works based on Multi-Level IR (MLIR) have been proposed for code raising, they rely on specialized languages, compiler recompilation, or in-depth dialect knowledge. This project introduces Source Matching and Rewriting (SMR), a user-oriented source-code-based approach for MLIR idiom matching and rewriting that does not require a compiler expert’s intervention. SMR uses a two-phase automaton-based DAG-matching algorithm inspired by early work on tree-pattern matching. First, the idiom Control-Dependency Graph (CDG) is matched against the program’s CDG to rule out code fragments that do not have a control-flow structure similar to the desired idiom. Second, candidate code fragments from the previous phase have their Data-Dependency Graphs (DDGs) constructed and matched against the idiom DDG.