Publications

(2024). WiseGraph: Optimizing GNN with Joint Workload Partition of Graph and Operations. Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys 2024, Athens, Greece, April 22-25, 2024.

(2024). POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024, Edinburgh, United Kingdom, March 2-6, 2024.

(2024). Optimal Kernel Orchestration for Tensor Programs with Korch. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2024, La Jolla, CA, USA, 27 April 2024 - 1 May 2024.

(2023). Unveiling the Black Box of PLMs with Semantic Anchors: Towards Interpretable Neural Semantic Parsing. Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023.

(2023). Joint Geometrical and Statistical Domain Adaptation for Cross-domain Code Vulnerability Detection. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023.

(2023). GraphSet: High Performance Graph Mining through Equivalent Set Transformations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023.

(2023). EINNET: Optimizing Tensor Programs with Derivation-Based Transformations. 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023.

(2023). Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning. 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023.

(2022). Vapro: performance variance detection and diagnosis for production-run parallel applications. PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2 - 6, 2022.

(2022). UniQ: A Unified Programming Model for Efficient Quantum Circuit Simulation. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, November 13-18, 2022.

(2022). Suppressing ZZ crosstalk of Quantum computers through pulse and scheduling co-optimization. ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022 - 4 March 2022.

(2022). PerFlow: a domain specific framework for automatic performance analysis of parallel applications. PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2 - 6, 2022.

(2022). Message from the High Performance Computing and Communications 2022 Program Chairs. 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application, HPCC/DSS/SmartCity/DependSys 2022, Hainan, China, December 18-20, 2022.

(2022). Guest Editorial. IEEE Trans. Parallel Distributed Syst.

(2022). GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022.

(2022). FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs. PLDI '22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA, USA, June 13 - 17, 2022.

(2022). FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2 - 6, 2022.

(2022). Efficiently emulating high-bitwidth computation with low-bitwidth hardware. ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28 - 30, 2022.

(2022). CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases. SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022.

(2022). AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022 - 4 March 2022.

(2021). Understanding and bridging the gaps in current GNN performance optimizations. PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27 - March 3, 2021.

(2021). Mitigating Crosstalk in Quantum Computers through Commutativity-Based Instruction Reordering. 58th ACM/IEEE Design Automation Conference, DAC 2021, San Francisco, CA, USA, December 5-9, 2021.

(2021). HyQuas: hybrid partitioner based quantum circuit simulation system on GPU. ICS '21: 2021 International Conference on Supercomputing, Virtual Event, USA, June 14-17, 2021.

(2021). Guest Editorial. IEEE Trans. Parallel Distributed Syst.

(2021). G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression. 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021.

(2021). Accelerating GPU Message Communication for Autonomous Navigation Systems. IEEE International Conference on Cluster Computing, CLUSTER 2021, Portland, OR, USA, September 7-10, 2021.

(2020). ScalAna: automating scaling loss detection with graph analysis. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020.

(2020). Payment Behavior Prediction and Statistical Analysis for Shared Parking Lots. Network and Parallel Computing - 17th IFIP WG 10.3 International Conference, NPC 2020, Zhengzhou, China, September 28-30, 2020, Revised Selected Papers.

(2020). ParSecureML: An Efficient Parallel Secure Machine Learning Framework on GPUs. ICPP 2020: 49th International Conference on Parallel Processing, Edmonton, AB, Canada, August 17-20, 2020.

(2020). Memory-Centric Communication Mechanism for Real-time Autonomous Navigation Applications. ICPP 2020: 49th International Conference on Parallel Processing, Edmonton, AB, Canada, August 17-20, 2020.

(2020). Identifying scalability bottlenecks for large-scale parallel programs with graph analysis. PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, USA, February 22-26, 2020.

(2020). GraphPi: high performance graph pattern matching through effective redundancy elimination. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020.

(2020). GOPipe: A Granularity-Oblivious Programming Framework for Pipelined Stencil Executions on GPU. PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3-7, 2020.

(2020). Enabling Efficient Random Access to Hierarchically-Compressed Data. 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020.

(2020). Elan: Towards Generic and Efficient Elastic Training for Deep Learning. 40th IEEE International Conference on Distributed Computing Systems, ICDCS 2020, Singapore, November 29 - December 1, 2020.

(2019). Statistical Analysis and Prediction of Parking Behavior. Network and Parallel Computing - 16th IFIP WG 10.3 International Conference, NPC 2019, Hohhot, China, August 23-24, 2019, Proceedings.

(2019). Spread-n-share: improving application performance and cluster throughput with resource-aware job placement. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019, Denver, Colorado, USA, November 17-19, 2019.

(2019). pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019.

(2019). HiWayLib: A Software Framework for Enabling High Performance Communications for Heterogeneous Pipeline Computations. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019.

(2019). GOPipe: a granularity-oblivious programming framework for pipelined stencil executions on GPU. Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019.

(2019). End-to-end I/O Monitoring on a Leading Supercomputer. 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, February 26-28, 2019.

(2019). Automatic, Application-Aware I/O Forwarding Resource Allocation. 17th USENIX Conference on File and Storage Technologies, FAST 2019, Boston, MA, February 25-28, 2019.

(2018). Zwift: A Programming Framework for High Performance Text Analytics on Compressed Data. Proceedings of the 32nd International Conference on Supercomputing, ICS 2018, Beijing, China, June 12-15, 2018.

(2018). vSensor: leveraging fixed-workload snippets of programs for performance variance detection. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018, Vienna, Austria, February 24-28, 2018.

(2018). Spindle: Informed Memory Access Monitoring. 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018.

(2018). CSE: Parallel Finite State Machines with Convergence Set Enumeration. 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018, Fukuoka, Japan, October 20-24, 2018.

(2018). BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU. 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Vancouver, BC, Canada, May 21-25, 2018.

(2018). A vision of post-exascale programming. Frontiers Inf. Technol. Electron. Eng.

(2017). Versapipe: a versatile programming framework for pipelined computing on GPU. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2017, Cambridge, MA, USA, October 14-18, 2017.

(2017). Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, February 4-8, 2017.

(2017). Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017.

(2017). FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures. Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017, Austin, TX, USA, February 4-8, 2017.

(2017). Efficient process mapping in geo-distributed cloud data centers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, Denver, CO, USA, November 12 - 17, 2017.

(2017). Algorithm-Directed Crash Consistence in Non-volatile Memory for HPC. 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017, Honolulu, HI, USA, September 5-8, 2017.

(2016). A Fast Tridiagonal Solver for Intel MIC Architecture. 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016.

(2015). To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures. 23rd IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2015, Atlanta, GA, USA, October 5-7, 2015.

(2015). A Power-Conserving Online Scheduling Scheme for Video Streaming Services. Algorithms and Architectures for Parallel Processing - 15th International Conference, ICA3PP 2015, Zhangjiajie, China, November 18-20, 2015, Proceedings, Part I.

(2014). Optimizing Seam Carving on multi-GPU systems for real-time image resizing. 20th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2014, Hsinchu, Taiwan, December 16-19, 2014.

(2014). CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression. International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, November 16-21, 2014.

(2013). Cost-effective cloud HPC resource provisioning by building semi-elastic virtual clusters. International Conference for High Performance Computing, Networking, Storage and Analysis, SC'13, Denver, CO, USA, November 17 - 21, 2013.

(2013). ACIC: automatic cloud I/O configurator for parallel applications. The 22nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC'13, New York, NY, USA, June 17 - 21, 2013.

(2013). ACIC: automatic cloud I/O configurator for HPC applications. International Conference for High Performance Computing, Networking, Storage and Analysis, SC'13, Denver, CO, USA, November 17 - 21, 2013.

(2012). Employing Checkpoint to Improve Job Scheduling in Large-Scale Systems. Job Scheduling Strategies for Parallel Processing, 16th International Workshop, JSSPP 2012, Shanghai, China, May 25, 2012. Revised Selected Papers.

(2011). Cloud versus in-house cluster: evaluating Amazon cluster compute instances for running MPI applications. Conference on High Performance Computing Networking, Storage and Analysis - State of the Practice Reports, SC 2011, Seattle, Washington, USA, November 12-18, 2011.

(2010). PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2010, Bangalore, India, January 9-14, 2010.

(2009). Process Mapping for MPI Collective Communications. Euro-Par 2009 Parallel Processing, 15th International Euro-Par Conference, Delft, The Netherlands, August 25-28, 2009. Proceedings.

(2009). FACT: fast communication trace collection for parallel applications through program slicing. Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2009, November 14-20, 2009, Portland, Oregon, USA.
