Software Testing and Verification - Fall 2024
Project Guidelines

Detailed Guidelines of Each Phase:

Backgrounds:
In software testing, paths from Control Flow Graphs (CFG) and Data Flow Graphs (DFG) are commonly used for tasks like test coverage analysis. However, constructing these graphs can be complex and time-consuming due to elements like loops, conditionals, and nested structures. For instance, loops in a CFG can introduce paths that repeat many times, leading to a combinatorial explosion of scenarios to analyze, especially when nested or combined with other conditions. To simplify the process, we will instead extract paths from the Abstract Syntax Tree (AST), which represents the code’s structure without needing to analyze complex runtime behavior. This approach reduces complexity while still capturing the essential idea of traversing a particular code representation.

Overall Description:
In this project, you will implement an AST Path and Path-context Extractor for Java grammar. You will mainly use Tree-sitter — a popular AST parser for multiple languages, and its Python binding to finish the project. The goal is to create a tool that can traverse an AST, extract AST paths, and output in the given format.

Definitions: (Ref: Section4.1)

Abstract Syntax Tree (AST). It consists of:

A set of non-terminal nodes N.
A set of terminal leaf nodes T.
A set of terminal leaf values X.
A root node in N.
A function that maps from non-terminal nodes to non-terminal leaf nodes or terminal leaf nodes.
A function that maps from terminal leaf nodes to terminal values.

AST Path.Within the scope of this project, we define an AST path as a path from one terminal leaf node to another terminal leaf node. A path can be partitioned into three parts:
- All the up-directed edges and nodes between the starting terminal leaf node (inclusive) to the lowest common prefix node (exclusive).
- Lowest common prefix node.
- All the down-directed edges and nodes between the lowest common prefix node (exclusive) to the ending terminal leaf node (inclusive).
To simplify, we include the terminal values associated with the terminal leaf nodes on both ends of the AST path defined above as the ASTPath data structure in our project.
Path Length. The number of edges between two leaf nodes of the AST path.
Path Width. The difference between the sibling nodes that are children of the lowest common prefix node of the path. Specifically, these sibling nodes must also be ancestors of the terminal nodes that form the endpoints of the path.
An example:

In this example, an AST path is highlighted in red.
- The AST path is (n1(v4)-up-n2-down-n3(v7))
- The Path length is 2. (1 up edge + 1 down edge. We print path with terminal values attached, but we do not include them in the length calculating.)
- The Path width is 2. (Distance between n1 and n3, abs(child_id(n1)-child_id(n3)))

Implementation Overview:
The project is divided into three stages:

Environment setup and terminal leaf nodes collection.
AST path construction given the two terminal leaf nodes.
AST paths extraction from every two AST nodes of every function within the given code, with feature constraints.

In Project Phase I, you will first set up the GitHub Classroom and Python Tree-sitter environment. You will implement:
- LeavesCollector for traversing collecting the leaf nodes (terminal leaf nodes) given an AST.
- NodeProperty, to maintain user-defined properties other than original AST node properties. These properties will be collected simultaneously with the traversal of LeavesCollector.
In Project Phase II, you will implement:
- AstPath data structure, which represent a single path in AST (explained in the example above).
- ExtractPath(source: Node, target: Node) function as an essential component of our goal PathExtractor.
- FunctionVisitor: A visitor similar to leaves collector implemented in stage I. But it will collect information from all method nodes during traversal. Implementation of this module build the foundation for the next phase.
In Project Phase III, you will implement:
- ASTPathSet. It encapsulates multiple ASTPaths within a given method and include the method name.
- Complete PathExtractor. It can batch process methods in the given code. It can also extract paths under certain constraints such us path length and path width we mentioned above.
A final report is required for the last phase.

For each phase, we will use GitHub Classroom to submit the code. You will need to complete the code by addressing TODOs given in the template. We will release a detailed guideline for each phase. The functionalities and intentions are explained in the template comments as well. We provide two tests for each project phase. You can modify them as needed. You can run them by executing the main function of each module or using pytest. This is a fast and simple way to check the correctness and functionalities of your implementation. After submitting, we will test your code based on our in-house test cases. The result will be shown as "pass" or "fail", and you can submit repeatedly until the deadline.

Reference and Hints:

Tree-sitter:
Use of LLMs:
- You can use LLM tools for programming within the project.
- Be sure to include references to the use of LLMs as comments in your code.
- Recommended Tools: Cursor, Copilot (free for student accounts), ChatGPT.

Grading:
Phase I: 10% if all tests are passed.
Phase II: 10% if all tests are passed.
Phase III: 10% if all tests are passed + 5% final report.
The percentage contributes to your final grade.

Software Testing and Verification - Fall 2024 Project Guidelines

Software Testing and Verification - Fall 2024
Project Guidelines