Publications

(2024). AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3.

(2022). Scaling Graph Traversal to 281 Trillion Edges with 40 Million Cores. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

(2021). Encouraging Compiler Optimization Practice for Undergraduate Students through Competition. Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 1.

(2019). PLock: A Fast Lock for Architectures with Explicit Inter-Core Message Passing. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.

(2019). HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.

(2018). VSensor: Leveraging Fixed-Workload Snippets of Programs for Performance Variance Detection. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

(2018). Spindle: Informed Memory Access Monitoring. Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference.

(2018). Bridge the Gap between Neural Networks and Neuromorphic Hardware with a Neural Network Compiler. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems.

(2017). VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU. 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

(2017). SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.

(2016). Neural network transformation under hardware constraints. 2016 International Conference on Compilers, Architectures, and Synthesis of Embedded Systems (CASES).

(2016). Gemini: A Computation-Centric Distributed Graph Processing System. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation.

(2015). To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures. 2015 IEEE 23rd International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

(2015). GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference.

(2014). Optimizing Seam Carving on multi-GPU systems for real-time image resizing. 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).

(2014). CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression. SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.

(2014). Cybertron: Pushing the Limit on I/O Reduction in Data-Parallel Programs. Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications.

(2013). Shall I Use Heterogeneous Data Centers? - A Case Study on Video on Demand Systems. 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing.

(2013). Cost-effective cloud HPC resource provisioning by building Semi-Elastic virtual clusters. SC ’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.

(2013). ACIC: Automatic cloud I/O configurator for HPC applications. SC ’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.

(2011). OpenMDSP: Extending OpenMP to Program Multi-Core DSP. 2011 International Conference on Parallel Architectures and Compilation Techniques.

(2011). An SSA-Based Algorithm for Optimal Speculative Code Motion under an Execution Profile. Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation.

(2010). Taming Hardware Event Samples for FDO Compilation. Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization.

(2010). PHANTOM: Predicting Performance of Parallel Applications on Large-Scale Parallel Machines Using a Single Node. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

(2010). MapCG: Writing Parallel Program Portable between CPU and GPU. Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques.

(2010). How OpenMP Applications Get More Benefit from Many-Core Era. Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More.

(2010). Do I Use the Wrong Definition? DeFuse: Definition-Use Invariants for Detecting Concurrency and Sequential Bugs. Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications.

(2009). Cache Sharing Management for Performance Fairness in Chip Multiprocessors. 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

(2009). MPIWiz: Subgroup Reproducible Replay of MPI Applications. Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

(2009). FACT: Fast Communication Trace Collection for Parallel Applications through Program Slicing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

(2008). Exploring the Emerging Applications for Transactional Memory. 2008 Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies.

(2007). PBB: A Parallel Bioinformatics Benchmark Suite for Shared Memory Multiprocessors. Proceedings of the 2007 Asian Technology Information Program’s (ATIP’s) 3rd Workshop on High Performance Computing in China: Solution Approaches to Impediments for High Performance Computing.

(2006). VODCA: View-Oriented, Distributed, Cluster-Based Approach to Parallel Computing. Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

(2006). Tree partition based parallel frequent pattern mining on shared memory systems. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

(2006). Parallel implementation and performance characterization of MUSCLE. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

(2005). Hierarchical Parallel Simulated Annealing and Its Applications. Proceedings of the 6th International Conference on Algorithms and Architectures for Parallel Processing.

(2005). A Dynamic Energy Conservation Scheme for Clusters in Computing Centers. Proceedings of the Second International Conference on Embedded Software and Systems.
