Publications in Selected Areas

High Performance Computing Systems

EuroSys (3), USENIX ATC (1), BigData (2), IPDPS (2), HPDC (3), ICS (5), SC (7), RTSS (2), IROS (1)

IROS P4: Pruning and Prediction-based Priority Planning (2024)
EuroSys Core Graph: Exploiting Edge Centrality to Speedup the Evaluation of Iterative Graph Queries (2024)
EuroSys Tripoline: Generalized Incremental Graph Processing via Graph Triangle Inequality (2021)
EuroSys Subway: Minimizing Data Transfer during Out-of-GPU-Memory Graph Processing (2020)
USENIX ATC Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing (2016)
BigData BEAD: Batched Evaluation of Iterative Graph-Queries with Evolving Analytics Demands (2020)
BigData MultiLyra: Scalable Distributed Evaluation of Batches of Iterative Graph Queries (2019)
IPDPS COMPI: Concolic Testing for MPI Applications (2018)
IPDPS Eliminating Intra-warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement (2016)
HPDC Efficient Processing of Large Graphs via Input Reduction (2016)
HPDC Parallel Execution Profiles (2016)
HPDC CuSha: Vertex-Centric Graph Processing on GPUs (2014)
ICS DSGEN: Concolic Testing GPU Implementations of Concurrent Dynamic Data Structures (2021)
ICS CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs (2016)
ICS PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization (2015)
ICS Address-aware Fences (2013)
ICS Load and Store Reuse Using Register File Contents (2001)
SC ParaStack: Efficient Hang Detection for MPI Programs at Large Scale (2017)
SC Fence Scoping (2014)
SC Compiled Communication for All-Optical TDM Networks (1996)
SC Techniques for Integrating Parallelizing Transformations and Compiler Based Scheduling Methods (1992)
SC Loop Displacement: An Approach for Transforming and Scheduling Loops for Parallel Execution (1990)
SC Improving Instruction Cache Performance by Reducing Cache Pollution (1990)
SC The Design of a RISC based Multiprocessor Chip (1990)
RTSS Busy-Idle Profiles and Compact Task Graphs: Compile-time Support for ... Scheduling of Real-Time Tasks (1994)
RTSS Applying Compiler Techniques to Scheduling in Real Time Systems (1990)


Programming Languages and Compilers

OOPSLA (3), PPoPP/PPEALS (5), POPL (3), PLDI/PLDI-20-Years (15), ICCL (3), CGO (6)

OOPSLA DProf: Distributed Profiler with Strong Guarantees (2019)
OOPSLA RAIVE: Runtime Assessment of Floating-Point Instability by Vectorization (2015)
OOPSLA ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM (2014)
PPoPP PANNS: Enhancing Graph-based Approximate Nearest Neighbor Search through Recency-aware Construction and Parameterized Search (2025)
PPoPP SpiceC: Scalable Parallelism via Implicit Copying and Explicit Commit (2011)
PPoPP Enhanced Speculative Parallelization Via Incremental Recovery (2011)
PPoPP Employing Register Channels for the Exploitation of Instruction Level Parallelism (1990)
PPEALS Compile-time Techniques for Efficient Utilization of Parallel Memories (1988)
POPL Bitwidth Aware Global Register Allocation (2003)
POPL Demand-Driven Computation of Interprocedural Data Flow (1995)
POPL Generalized Dominators and Post-Dominators (1992)
PLDI Effective Parallelization of Loops in the Presence of I/O Operations (2012)
PLDI Supporting Speculative Parallelization in the Presence of Dynamic Data Structures (2010)
PLDI Towards Locating Execution Omission Errors (2007)
PLDI Pruning Dynamic Slices With Confidence (2006)
PLDI Cost Effective Dynamic Program Slicing (2004)
PLDI: 20 Years Retrospective -- Complete Removal of Redundant Expressions
PLDI Timestamped Whole Program Path Representation and its Applications (2001)
PLDI ABCD: Eliminating Array Bounds Checks on Demand (2000)
PLDI Load-Reuse Analysis: Design and Evaluation (1999)
PLDI Complete Removal of Redundant Expressions (1998)
PLDI Partial Dead Code Elimination using Slicing Transformations (1997)
PLDI Interprocedural Conditional Branch Elimination (1997)
PLDI A Practical Data Flow Framework for Array Reference Analysis and its Application in Optimizations (1993)
PLDI A Fresh Look at Optimizing Array Bound Checks (1990)
PLDI Register Allocation via Clique Separators (1989)
ICCL Automatic Generation of Microarchitecture Simulators (1998)
ICCL Path Profile Guided Partial Redundancy Elimination Using Speculation (1998)
ICCL SPMD Execution of Programs with Dynamic Data Structures on Distributed Memory Machines (1992)
CGO PreFix: Optimizing the Performance of Heap-Intensive Applications (2025)
CGO White-Box Program Tuning (2019)
CGO DrDebug: Deterministic Replay based Cyclic Debugging with Dynamic Slicing (2014)
CGO Lightweight Fault Detection in Parallelized Programs (2013)
CGO Extending Path Profiling across Loop Backedges and Procedure Boundaries (2004)
CGO Hiding Program Slices for Software Security (2003)


Computer Architecture

ASPLOS (8), ISCA (2), MICRO (14), HPCA (3), PACT (12)

ASPLOS Glign: Taming Misaligned Graph Traversals in Concurrent Graph Processing (2023)
ASPLOS CommonGraph: Graph Analytics on Evolving Data (2023)
ASPLOS PnP: Pruning and Prediction for Point-To-Point Iterative Graph Analytics (2019)
ASPLOS KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations (2017)
ASPLOS CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing (2017)
ASPLOS Efficient Sequential Consistency via Conflict Ordering (2012)
ASPLOS Frequent Value Locality and Value-Centric Data Cache Design (2000)
ASPLOS The Fuzzy Barrier: A Mechanism for High-Speed Synchronization of Processors (1989)
ISCA ECMon: Exposing Cache Events for Monitoring (2009)
ISCA Value Prediction in VLIW Machines (1999)
MICRO MEGA Evolving Graph Accelerator (2023)
MICRO JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator (2021)
MICRO GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing (2020)
MICRO Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection (2015)
MICRO Copy Or Discard Execution Model For Speculative Parallelization On Multicores (2008)
MICRO Efficient Use of Invisible Registers in Thumb Code (2005)
MICRO Whole Execution Traces (2004)
MICRO Energy Efficient Frequent Value Data Cache Design (2002)
MICRO Frequent Value Compression in Data Caches (2000)
MICRO Dynamic Memory Disambiguation in the Presence of Out-of-order Store Issuing (1999)
MICRO Resource-Sensitive Profile-Directed Data Flow Analysis for Code Optimization (1997)
MICRO A Shape Matching Approach for Scheduling Fine-Grained Parallelism (1992)
MICRO Executing Loops on a Fine-Grained MIMD Architecture (1991)
MICRO A Fine-grained MIMD Architecture based upon Register Channels (1990)
HPCA SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors (2005)
HPCA Global Context-based Value Prediction (1999)
HPCA Distributed Path Reservation Algorithms for Multiplexed All-Optical Interconnection Networks (1997)
PACT Scalable SIMD-Efficient Graph Processing on GPUs (2015)
PACT Stadium Hashing: Scalable and Flexible Hashing on GPUs (2015)
PACT Shuffling: A Framework for Lock Contention Aware Thread Scheduling for Multicore Multiprocessor Systems (2014)
PACT No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs (2011)
PACT Efficient Sequential Consistency Using Conditional Fences (2010), Recipient of a PACT 2010 Best Paper Award
PACT Extended Whole Program Paths (2005)
PACT Caching and Predicting Branch Sequences for Improved Fetch Effectiveness (1999)
PACT Superscalar Execution with Direct Data Forwarding (1998)
PACT Capturing the Effects of Code Improving Transformations (1998)
PACT Path Profile Guided Partial Dead Code Elimination Using Predication (1997)
PACT Resource Spackling: A Framework for Integrating Register Allocation in Local and Global Schedulers (1994)
PACT URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures (1993)


Software Engineering

ICSE (5), ASE (1), ESEC-FSE/FSE (5), ISSTA/ISTAV (4), ICSM (10)

ICSE Dynamic Slicing for Android (2019)
ICSE Locating Faults Through Automated Predicate Switching (2006)
ICSE Effective Forward Computation of Dynamic Slices Using Reduced Ordered Binary Decision Diagrams (2004)
ICSE Precise Dynamic Slicing Algorithms (2003), Recipient of ICSE 2003 Distinguished Paper Award
ICSE A Demand-Driven Analyzer for Data Flow Testing at the Integration Level (1996)
ASE Locating Faulty Code Using Failure-Inducing Chops (2005)
FSE Dynamic Slicing Long Running Programs through Execution Fast Forwarding (2006)
ESEC-FSE Matching Execution Histories of Program Versions (2005)
ESEC-FSE Comparison Checking: An Approach to Avoid Debugging of Optimized Code (1999)
ESEC-FSE Refining Data Flow Information using Infeasible Paths (1997)
FSE Hybrid Slicing: An Approach for Refining Static Slices using Dynamic Information (1995)
ISSTA Fault Localization Using Value Replacement (2008)
ISSTA Dynamic Recognition of Synchronization Operations for Improved Data Race Detection (2008)
ISSTA Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction (2007)
ISTAV Loop Monotonic Computations: An Approach for the Efficient Run-time Detection of Races (1991)
ICSM Detecting Virus Mutations Via Dynamic Matching (2009)
ICSM Effective and Efficient Localization of Multiple Faults Using Value Replacement (2009)
ICSM Identifying the Root Causes of Memory Bugs Using Corrupted Memory Location Suppression (2008)
ICSM Dynamic Slicing of Multithreaded Programs for Race Detection (2008)
ICSM ONTRAC: A System for Efficient ONline TRACing for Debugging (2007)
ICSM Matching Control Flow of Program Versions (2007)
ICSM Priority Based Data Flow Testing (1995)
ICSM A Framework for Partial Data Flow Analysis (1994)
ICSM An Approach to Regression Testing using Slicing (1992)
ICSM A Methodology for Controlling the Size of a Test Suite (1990)


ACM Transactions

TOPLAS/LOPLAS (5), TOSEM (2), TACO (9), TECS (2), TODAES (1)

ACM TOPLAS Execution Suppression: An Automated Iterative Technique for Locating Memory Errors (2010)
ACM TOPLAS Cost and Precision Tradeoffs of Dynamic Data Slicing Algorithms (2005)
ACM TOPLAS A Practical Framework for Demand-Driven Interprocedural Data Flow Analysis (1997)
ACM TOPLAS Efficient Register Allocation Via Coloring Using Clique Separators (1994)
ACM LOPLAS Optimizing Array Bound Checks Using Flow Analysis (1994)
ACM TOSEM Hybrid Slicing: Integrating Dynamic Information with Static Analysis (1997)
ACM TOSEM A Methodology for Controlling the Size of a Test Suite (1993)
ACM TACO Synergistic Analysis of Evolving Graphs (2016)
ACM TACO Tumbler: An Effective Load Balancing Technique for MultiCPU Multicore Systems (2016)
ACM TACO ADAPT: A Framework for Coscheduling Multithreaded Programs (2013)
ACM TACO A Dynamic Self Scheduling Scheme for Heterogeneous Multiprocessor Architectures (2013)
ACM TACO PLDS: Partitioning Linked Data Structures for Parallelism (2012)
ACM TACO Thread Tranquilizer: Dynamically Reducing Performance Variation (2012)
ACM TACO Dynamic Access Distance Driven Cache Replacement (2011)
ACM TACO Unified Control Flow and Dependence Traces (2007)
ACM TACO Whole Execution Traces and their Applications (2005)
ACM TECS Dynamic Coalescing for 16-bit Instructions (2005)
ACM TECS Frequent Value Locality and its Applications (2002)
ACM TODAES Frequent Value Encoding for Low Power Data Buses (2004)