Lecture 4
Symbolic Execution
Today, we are delving into an important analysis technique known as symbolic execution. Some of you may have encountered this concept in courses like software testing, but our focus here is different. We will explore symbolic execution from a security analysis perspective, examining both source code and binary code.
Symbolic execution is a technique that originated several decades ago, in the 1970s. To illustrate how it works, let's consider a classic example using a simple C program. In this program, we have two symbolic inputs, X and Y, which represent inputs provided by the user. These inputs are utilized by the program to perform calculations and make control decisions. Our goal is to reach a specific location in the code that triggers an error under certain conditions.
In software testing, the aim is to traverse as many lines of code as possible to uncover hidden functionality or correctness issues. In security analysis, the focus shifts towards identifying malicious behaviors or vulnerabilities. By treating inputs as symbolic, we consider them abstract entities that can potentially take any arbitrary value.
The process of symbolic execution involves starting from the program's entry point and systematically analyzing instructions or statements one by one. Let's break down the example program:
- Symbolic Inputs: We have X and Y as symbolic inputs.
- Function Execution: The program first calls a function named test, which in turn calls another function called twice.
- Parameter Passing: Within twice, Y is bound to a new symbolic name, V. The operation V * 2 is performed, and the result is returned as Z.
- Condition Evaluation: The program encounters a condition that checks whether Z equals X. Since Z is 2 * Y, this condition effectively compares 2 * Y with X.
- Outcome Analysis: There are two branches based on whether the condition is true or false. Symbolic execution aims to explore all possible paths to determine which values of X and Y lead to the error.
The critical aspect of symbolic execution is exploring the conditions that lead to specific program behaviors or errors. For example, if the condition 2 * Y != X is satisfied, the program follows a particular path. By solving these symbolic conditions, we can deduce the input values that trigger specific code paths, including those leading to errors or vulnerabilities.
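For reference, here is a minimal sketch of the example program, rendered in Python for consistency with later snippets; the function names come from the discussion above, and the exact error condition (X > Y + 10) is assumed from the feasibility discussion below.

```python
def twice(v):
    return 2 * v           # z = 2 * v

def test(x, y):
    z = twice(y)
    if z == x:             # taken only when 2 * y == x
        if x > y + 10:     # deeper condition guarding the error
            raise AssertionError("error location reached")

# With x and y treated as symbolic, symbolic execution asks: which
# concrete values steer execution into the AssertionError?
```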
Execution Paths and Feasibility
- Branching in Programs: Programs often contain multiple branching conditions, leading to a variety of execution paths. For instance, if a program reaches an "else" branch, it might also need to evaluate conditions like X > Y + 10. These conditions determine which path the execution will follow.
- Feasibility of Paths: Not all paths in a program are feasible. To determine feasibility, we employ constraint solvers. These tools evaluate whether given conditions can be satisfied and produce solutions when they can. For instance, given the condition 2 * Y == X, a constraint solver might confirm its feasibility by providing specific values for X and Y.
- Eliminating Infeasible Paths: Through symbolic execution, infeasible paths are identified and disregarded, ensuring that computational resources are not wasted on paths that cannot occur in practice.
Bug Detection and Solutions
- Triggering Bugs: When a path culminates in an error or bug, symbolic execution helps not only in detecting the bug but also in finding multiple input solutions that trigger it. For example, if a certain condition results in an error, values like X = 30 and Y = 15 might be identified as one such solution (see the sketch after this list).
- Historical Context: The concept of symbolic execution dates back to the 1970s. Despite its age, it remains a powerful tool, although not without limitations and challenges.
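To make this concrete, here is a small sketch using the Z3 Python bindings (an assumed choice; the lecture does not prescribe a solver) that checks the feasibility of the error path from the example and extracts one satisfying input.

```python
from z3 import Ints, Solver, sat

x, y = Ints("x y")

s = Solver()
s.add(2 * y == x)      # path condition: the z == x branch was taken
s.add(x > y + 10)      # inner condition guarding the error location

if s.check() == sat:   # feasible, so the error is reachable
    print(s.model())   # one satisfying input; x = 30, y = 15 also works
```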
Challenges in Symbolic Execution
- Infinite Execution Paths: A significant challenge in symbolic execution arises from programs with a vast or infinite number of execution paths. This complexity often stems from loops within the program: each iteration of a loop might introduce new symbolic variables, exponentially increasing the number of paths.
- Complexity in Real-world Applications: In practical applications, especially in security research, it is crucial to evaluate whether symbolic execution can effectively handle the complexity presented by a program's implementation.
Here is a detailed breakdown:
1. Symbolic Execution and Loops
- Symbolic execution involves analyzing programs to determine what inputs cause each part of a program to execute. However, symbolic execution can encounter significant challenges with loops.
- Loops with Symbolic Conditions: If a loop's condition is symbolic, the loop might iterate an indeterminate number of times. For instance, consider a loop that walks over a symbolic string character by character, continuing as long as the current character is not zero. Because every character is symbolic, the analysis cannot rule out that the terminator never appears, so the loop can unroll indefinitely without a concrete stopping condition (see the sketch below).
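A minimal sketch of this situation (the buffer and loop are illustrative, not taken from the lecture):

```python
# A strlen-style loop over a buffer of symbolic bytes. Each evaluation
# of buf[i] != 0 is a symbolic branch, so a buffer of n symbolic bytes
# yields n + 1 distinct paths, growing without bound as n grows.
def symbolic_strlen(buf):
    i = 0
    while buf[i] != 0:   # symbolic condition: forks a state each time
        i += 1
    return i
```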
2. Function Calls and Recursion
- Function Reusability: Functions encapsulate functionality so it can be reused across different parts of a program. This means the same function can be invoked from many call sites, which complicates symbolic execution.
- Recursion: Functions that call themselves, either directly or indirectly, introduce additional complexity. Recursive calls can lead to an infinite number of execution paths, making it challenging to explore all possible program states.
3. Coverage of Program States
- Symbolic execution aims to cover all possible program states. However, this task becomes infeasible when considering infinite execution paths.
- State Definition: If a state is defined as a unique execution path, covering all states becomes impossible. Even if a state is defined as a line of code, reaching every line can be difficult due to unknown conditions that determine code execution paths.
- Branch Exploration: Thorough exploration requires evaluating as many branches as possible, which is not always feasible.
4. Unsolvable Formulas
- Satisfiability and Theorem Provers: To determine if a branch condition is feasible, symbolic execution tools often rely on theorem provers based on satisfiability theories.
- Complex Formulas: Certain formulas are inherently unsolvable. For example, imagine a function like twice performing a nonlinear computation instead of a simple doubling. While humans might find such problems intuitive, theorem provers may struggle to resolve the resulting formulas.
5. Symbolic Modeling
- Symbolic modeling involves representing the behavior of program instructions, including function and system calls.
- Complex Instruction Handling: When dealing with binary code, understanding and modeling each instruction's symbolic effect is crucial but challenging. This complexity extends beyond function calls to include various system-level interactions.
Floating-point operations are notoriously challenging to model within satisfiability theories. This difficulty arises from their inherent complexity and the precision required in computations. Floating-point arithmetic is a staple in many computational tasks, especially in scientific computing, and accurately representing these operations in formal models remains a significant challenge.
Similarly, complex instructions such as those found in the SSE family pose their own challenges. These instructions, which operate on multiple data streams with a single instruction, complicate the modeling process. The difficulty lies in creating models that are not only accurate but also efficient enough for constraint solvers to handle.
String operations represent another challenging area. Operations such as replacing, finding substrings, and formatting strings are common but hard to model accurately. Despite ongoing research and development of string solvers to approximate these operations, the field is still maturing. The complexity arises from the dynamic nature of strings and the multitude of ways they can be manipulated.
System calls introduce another layer of complexity. These calls, which interact directly with the operating system, involve operations like opening files, reading from them, setting pointers, and more. While it is possible to model some of these operations with enough time and effort, certain scenarios, particularly those involving symbolic offsets, are exceptionally difficult. For instance, if a pointer is set to a symbolic offset within a file, it introduces uncertainty about where the read or write operation will occur, making accurate modeling a daunting task. In such cases, one might have to resort to treating the entire file as symbolic, which means any operation on the file could yield any possible result, thereby complicating the analysis.
In summary, while modeling computational instructions and operations is a fundamental aspect of formal verification and software analysis, it is fraught with challenges. Each type of operation, whether it be floating-point arithmetic, SSE instructions, string manipulations, or system calls, presents unique hurdles that researchers and developers continue to address. The goal is to create models that are not only faithful representations of the operations but also manageable for automated tools to analyze and reason about.
Concolic Execution
Symbolic execution typically leads to the generation of numerous scenarios or test cases, some of which might be impossible or irrelevant in practical applications. This is particularly evident in operating systems like Unix or Windows, where the sheer number of system calls can complicate the analysis.
To address these challenges, researchers have developed an enhanced technique known as dynamic symbolic execution, or concolic execution—a portmanteau of "concrete" and "symbolic." This approach involves running the program with concrete inputs while simultaneously collecting symbolic constraints. By doing so, it combines the benefits of both concrete execution, which handles real-world complexities, and symbolic analysis, which identifies a wider range of potential issues.
Dynamic symbolic execution represents a significant advancement over purely static methods by allowing for the empirical testing of programs. It addresses the limitations of static analysis by incorporating real-world data and feedback, enabling a more comprehensive and realistic assessment of software behavior under varied conditions.
Concept Overview:
- Concolic Execution: The term "concolic" is a blend of "concrete" and "symbolic." It describes a method where a program is executed with both concrete values and symbolic representations simultaneously. This dual approach allows the analysis to gather precise information about the program's behavior and explore different execution paths efficiently.
- Initial Execution: The process begins by running the program with a specific concrete input. This input can be arbitrary, such as setting X = 22 and Y = 7. The program is executed both concretely, to observe its actual behavior, and symbolically, to collect constraints about the execution path.
- Constraint Collection: As the program runs, symbolic execution collects constraints based on the conditions encountered during execution. For example, if the program contains a conditional branch, the symbolic execution records constraints describing the conditions under which each branch is taken.
- Dynamic Instrumentation: Dynamic binary instrumentation is a critical part of concolic execution. It allows the program to be instrumented at runtime, enabling the collection of symbolic constraints without modifying the source code. This instrumentation is essential for capturing the information needed to explore alternative paths in the program.
- Concrete Execution Path: In the example provided, with inputs X = 22 and Y = 7, the program executes a specific path, say, taking the "else" branch on line 7. During this execution, the program calculates intermediate values, such as Z = 14, and checks conditions like Z < 15.
- Symbolic Constraint Solving: After the initial execution, the collected symbolic constraints are analyzed. If the current path is not the desired one (e.g., taking the "else" branch instead of the "if" branch), the corresponding constraint is negated. A constraint solver is then employed to find an alternative set of input values that satisfies the new conditions, allowing the program to explore a different path.
- Iterative Exploration: The process is iterative. After solving the constraints and finding new inputs (e.g., satisfying 2 * Y == X to explore the other branch), the program is re-executed with these inputs. This exploration continues, systematically covering different execution paths to uncover potential issues (see the sketch after this list).
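The loop below sketches one round of this process for the running example, again using the Z3 Python bindings (an assumed choice); the constraint recording is hand-rolled here, whereas a real engine would collect it through instrumentation.

```python
from z3 import Ints, Solver, Not, sat

x, y = Ints("x y")

def run_and_collect(cx, cy):
    """Run the example concretely and record, in order, the symbolic
    constraint for each branch actually taken."""
    path = []
    z = 2 * cy                        # twice(y) on concrete values
    if z == cx:
        path.append(2 * y == x)       # took the "if" branch
    else:
        path.append(Not(2 * y == x))  # took the "else" branch
    return path

# 1. Run with an arbitrary concrete input, e.g. x = 22, y = 7.
path = run_and_collect(22, 7)         # observes the "else" branch

# 2. Keep the path prefix, negate the last branch, and solve.
s = Solver()
for c in path[:-1]:
    s.add(c)
s.add(Not(path[-1]))                  # flip the final branch
if s.check() == sat:
    m = s.model()                     # a model satisfying 2 * y == x
    print("next input:", m[x], m[y])

# 3. Re-execute with the new input and repeat until paths are covered.
```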
Dynamic symbolic execution is a robust method for software testing and verification. It efficiently explores various paths through the use of symbolic constraints and concrete execution, providing deeper insights into the program's behavior and helping to identify hidden bugs or security vulnerabilities.
Benefits of the Approach:
- Solving Complex Constraints:
  - Initially, some constraints appear unsolvable due to their complexity. However, by compromising certain aspects, these constraints can become tractable.
  - This approach involves concretizing parts of the problem, that is, assigning specific concrete values to some symbolic variables. This can make the problem more manageable and solvable (a sketch follows after this list).
- Trade-offs:
  - While this method might miss certain edge cases (potentially labeling feasible solutions as infeasible), it is often better than completely failing to solve the problem.
  - By sacrificing some complex corner cases, many common cases become solvable, thereby providing a practical solution to otherwise intractable problems.
- Modeling External Library and System Calls:
  - Modeling all library and system calls is often impractical due to their sheer number and complexity.
  - Traditional approaches require modeling these calls to proceed with symbolic execution, which can halt progress if a call is too complex to model.
  - By opting for "under-constrained execution," one can bypass the need to model every call. Some symbols remain unconstrained, which enables the analysis to continue, albeit with looser constraints.
- Concrete Execution:
  - If modeling certain complex cases is not feasible, this approach lets execution follow concrete values. This ensures the analysis can continue even if not all paths are fully explored symbolically.
  - Although this might limit exploration compared to a fully symbolic approach, it ensures progress in the analysis.
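A tiny sketch of the concretization idea mentioned above, using the Z3 Python bindings; the constraint and the observed value are invented for illustration.

```python
from z3 import Ints, Solver, sat

x, y = Ints("x y")

s = Solver()
# x * y == 391 is nonlinear in two unknowns, a class of constraints
# that solvers handle far less reliably than linear arithmetic.
Y_OBSERVED = 17          # value of y seen during the concrete run
s.add(y == Y_OBSERVED)   # concretization: pin y to that value
s.add(x * y == 391)      # the constraint is now linear in x alone
if s.check() == sat:
    print(s.model())     # x = 23, y = 17
```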
Implementation Considerations:
- The high-level idea of symbolic execution is appealing due to its ability to reason about programs mathematically. However, its practical implementation faces challenges.
- By integrating dynamic analysis and using concrete values, this approach can mitigate some of these challenges.
- Ultimately, it's a balance between the theoretical appeal of pure symbolic execution and the practical feasibility gained through dynamic elements.
In summary, the discussed approach provides a pragmatic solution to the challenges posed by complex constraints in symbolic execution. By leveraging concrete execution alongside symbolic techniques, it enables the analysis to proceed in situations where purely symbolic methods might falter. This mixed approach reaffirms the importance of adaptive strategies in tackling intricate computational problems.
Symbolic Execution Engines
KLEE
KLEE is a popular tool for symbolic execution of C programs. It operates by using LLVM (Low Level Virtual Machine), a compiler framework, to convert C programs into LLVM bitcode, an intermediate representation (IR) of the program. LLVM is not just a compiler: LLVM bitcode can also be emulated, allowing instructions to be interpreted one by one without being compiled to target executable code. This emulation lets developers implement the symbolic execution logic, with values processed in an interpretive mode instead of through native execution.
KLEE focuses on modeling the environment it operates within, emulating library functions and system calls present in environments like Linux or UNIX. The tool emphasizes libc and UNIX system call interfaces, allowing it to interact with system-level functions during symbolic execution.
During symbolic execution, constraints are generated, which describe the symbolic inputs and their effects on the program. Standard constraint solvers like Z3 or others are used to find solutions to these constraints, which help in generating test cases that can reveal potential crashes or unexpected behavior in the program.
Java Pathfinder (JPF)
Similar to KLEE for C, Java Pathfinder is a tool for symbolic execution of Java programs. It operates on the same principle of exploring multiple execution paths in the code by using symbolic inputs.
The primary contribution of symbolic execution tools like KLEE and JPF is their ability to model the program's environment and effectively use constraint solvers to derive meaningful test cases. These tools help in identifying crashes or unexpected behaviors by generating test cases that cover diverse execution paths, thus improving software reliability and robustness.
State Forking: This concept involves the branching of execution paths during symbolic execution when a conditional statement is encountered. When a condition is met, the execution forks into two separate paths. Each path corresponds to one possible outcome of the condition. Unlike process forking, which involves creating a new process, state forking duplicates the symbolic states of the program, allowing simultaneous exploration of different execution paths.
State Scheduling: When branches create multiple execution paths, deciding which state to explore next is a critical decision, essential for efficiently exploring numerous paths without getting trapped in loops.
A naive method like depth-first search (DFS) is problematic because it can easily get stuck in loops; imposing limits can help, but it is not ideal. Random scheduling is less likely to get stuck in loops and can cover more diverse branches. It is not a bad choice, but it may lack efficiency.
More intelligent scheduling can involve looking ahead to determine which branches might lead to unexplored code blocks. A straightforward strategy is to prioritize branches leading to previously unvisited blocks, thereby maximizing coverage. There is significant potential for improving state scheduling algorithms by incorporating more sophisticated methods.
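As an illustration of the "prioritize unvisited blocks" idea, here is a hypothetical scheduler sketch; the state objects and their next_block attribute are invented names, not the API of any particular engine.

```python
import random

def pick_next_state(pending_states, covered_blocks):
    """Prefer states whose next basic block is still uncovered;
    otherwise fall back to a random pick to avoid loop traps."""
    fresh = [s for s in pending_states
             if s.next_block not in covered_blocks]
    chosen = random.choice(fresh if fresh else pending_states)
    pending_states.remove(chosen)
    return chosen
```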
Angr Framework
Angr is a framework widely used for both static and dynamic analysis. It shares a similar design philosophy with KLEE, converting source code into intermediate representations for analysis. Specifically, Angr converts binary code into VEX Intermediate Representation (IR) and performs symbolic execution during the interpretation phase.
The process of converting binary instructions into an intermediate representation (IR) is a foundational concept. This transformation enables various types of analysis, such as shallow value analysis, by simplifying complex constructs into smaller, more manageable components.
- Benefits: Using an IR simplifies analysis, making programs easier to interpret and transform, because the IR abstracts away the complexity of the original binary instructions.
- Implementation: The IR statements are interpreted one by one, notably in environments like Pin, a dynamic binary instrumentation tool.
- Efficiency: While Python is known for its convenience as a programming language, it is often criticized for its lack of efficiency and scalability, especially when performing low-level operations such as those required for IR interpretation.
- Complexity: Translating each IR into a function in Python can be complex and inefficient, leading to performance bottlenecks.
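For context, a minimal angr session might look like the following; the binary path and target address are placeholders, and the sketch follows angr's publicly documented API rather than anything specific from the lecture.

```python
import angr

# Load a binary and symbolically explore until a target address is hit.
proj = angr.Project("./example_bin", auto_load_libs=False)
state = proj.factory.entry_state()
simgr = proj.factory.simulation_manager(state)
simgr.explore(find=0x401234)             # placeholder "error" address

if simgr.found:
    # Concrete stdin bytes that steer execution to the target.
    print(simgr.found[0].posix.dumps(0))
```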
S2E: Selective Symbolic Execution
Selective Symbolic Execution (S2E) is an advanced form of symbolic execution that focuses on specific components within a larger system, such as a kernel driver or a user-level process. Using QEMU, S2E emulates an entire virtual machine but focuses symbolic execution efforts only on components of interest, leaving the rest of the environment to be executed concretely.
This selective execution reduces the overhead associated with fully symbolic analysis of an entire system, making it more practical for real-world applications.
The decision-making process in modeling virtual machines is crucial. One must decide which parts to model and which to omit. This is a fundamental concept in emulation and symbolic execution. Tools like QEMU provide two modes of operation: an emulation mode where it simulates the virtual machine, and a symbolic mode where instructions are interpreted one by one. This mode allows for a detailed analysis but introduces complexity in terms of implementation.
The symbolic mode requires significant technical adjustments, such as converting TCG IR (Tiny Code Generator Intermediate Representation) into LLVM IR (Low-Level Virtual Machine Intermediate Representation) for further processing. This process, while automatic, can be slow and cumbersome, impacting usability.
Despite the slow performance, these tools are indispensable, especially for tasks like kernel analysis, where few alternatives exist. Options like Angr, which allows for binary loading and partial analysis, and S2E, designed for more comprehensive analysis, are commonly used in research and practical applications.
The development of these tools has been a long journey, starting from theoretical concepts in the 1970s. Tools like S2E and angr emerged in the early 2010s, with significant contributions made during DARPA's Cyber Grand Challenge (CGC). These tools, developed over many years, highlight the time commitment required to transform theoretical ideas into functional prototypes.
In terms of usability, however, these tools still present considerable challenges. Setting them up can be complex, and utilizing them effectively requires a deep understanding of their intricacies. Despite these hurdles, their ability to analyze binaries makes them valuable in fields where detailed analysis is necessary.
Symbolic Execution: Concept and Limitations
Symbolic execution is a technique used to analyze programs by exploring many possible execution paths simultaneously, using symbolic values instead of concrete data inputs. Despite its theoretical promise, practical applications in finding vulnerabilities, such as those in the Cyber Grand Challenge (CGC), have been disappointing.
- Limitations:
- Speed: Symbolic execution tends to be slow. It cannot process large amounts of data quickly enough to be practical in many real-world scenarios.
- State Explosion: The method suffers from a state explosion problem, where the number of possible states or paths becomes unmanageable.
- Environment Modeling: Accurately modeling the complete environment for a program is complex and challenging.
Due to these limitations, symbolic execution, while conceptually appealing, often fails to perform effectively in practice. Although it is a "great idea," its real-world application "sucks," as it struggles to compete with the simplicity and efficiency of fuzzing.
QSYM
QSYM was developed by a team from Georgia Tech, motivated in part by participation in the CGC. It aims to address the inefficiencies of symbolic execution by making strategic trade-offs. The core idea is to sacrifice some accuracy for improved efficiency, which encourages a reevaluation of the traditional design of symbolic execution.
The core idea here revolves around the concept of integrating symbolic execution with fuzzing to enhance performance. Fuzzing is already acknowledged as an effective tool, but the goal isn't to replace it with symbolic execution entirely. Instead, the two should work together. By combining forces, they can achieve better results than either could alone.
Efficiency is prioritized over soundness in this integration. Soundness refers to the accuracy and completeness of the tool's operations, meaning the tool can perfectly model and explore all paths and instructions. However, achieving this can be resource-intensive and time-consuming. Many tools, due to limitations, cannot model every instruction or system call, thus restricting their ability to explore all possible paths. The proposed approach suggests that even if a tool isn't perfect, it can still be highly efficient. By operating faster, even an imperfect tool can provide numerous results, some of which may successfully tackle the problem at hand.
The idea is to leverage the strengths of fuzzing to explore paths that a symbolic execution tool might miss due to its limitations. If fuzzing can find alternative paths to reach certain branches, it can help the symbolic execution tool explore further. This cooperative strategy means the entire problem doesn't need to be solved by one tool alone, emphasizing efficiency over perfection.
The lecture also touches on work that aims to eliminate intermediate representations (IRs) in this context. IRs can slow processing down by several orders of magnitude compared to native execution. For instance, code that runs natively in 0.008 seconds might take 26 seconds or more through certain IR-based tools, and angr can take up to 500 seconds for tasks that complete in under a second natively.
This slowdown occurs because converting instructions into IRs leads to a significant increase in complexity and size, often resulting in a fivefold or more blowup in the number of operations. This inefficiency highlights the need for more streamlined approaches that minimize or eliminate the reliance on IRs to enhance performance.
The traditional approach aims to model instruction semantics perfectly, transforming each instruction into IRs and then interpreting them. However, this often results in inefficiencies. A more direct approach has been suggested, which involves extracting symbolic expressions or constraints directly from the instructions rather than converting them into IRs. This method focuses on modeling instructions directly without intermediate transformation.
While instructions in complex architectures like x86 may appear daunting, many of them are in fact straightforward, such as move or subtract instructions. Modeling every aspect of the more complex x86 instructions can be intricate, but a significant portion of them are used infrequently, and much of the work reduces to handling a few common operations such as comparisons and conditional jumps (e.g., cmp and jz).
Critics may argue that this approach could be vulnerable to exploitation by malicious code, given its simplicity. However, the sophistication of malicious code is already quite advanced, and it can exploit even the most well-protected systems. Therefore, it is impractical to focus solely on complex instructions, as this could overburden constraint solvers, leading to inefficiencies. The constraint solvers may become overwhelmed by the complexity and ultimately fail to provide solutions, negating any perceived benefits of comprehensive modeling.
The core philosophy here is to prioritize the common, simple instructions and optimize for the common case. This approach is encapsulated in the principle of "make the common case fast," which suggests that systems should be designed to handle the most frequent scenarios efficiently. This philosophy is not only a guiding principle in system design but also the title of one of my papers. The takeaway is the importance of simplicity and the willingness to sacrifice some aspects of completeness and soundness for greater efficiency. In essence, while it is essential to acknowledge the limitations of not addressing every complex instruction, focusing on optimizing the common, simple cases can lead to significant performance gains and system efficiency.
Firstly, let's look at the use of the PIN tool for symbolic analysis. PIN is a dynamic binary instrumentation framework that allows users to analyze the behavior of binaries. However, there are significant drawbacks to using PIN:
- Closed Source: PIN is proprietary, which limits modifications and customization. Users can only utilize the predefined APIs provided by PIN.
- Disassembly Challenges: The inability to cache disassembled results forces repeated disassembly of each instruction, which is inefficient.
- Architecture Limitations: PIN supports a limited range of architectures, which can be a hindrance for comprehensive analysis.
- Performance Concerns: The analysis overhead is significant, making it an expensive choice for some applications.
An example given was the tool 'libdft', which is built on top of PIN and relies on taint analysis. This kind of analysis, while efficient, often sacrifices soundness for performance. For instance, when tainting, if a change affects multiple bytes, the tool may only propagate changes to a single byte to maintain efficiency. This trade-off impacts the accuracy of the analysis.
On the contrary, another platform, KLEE, was suggested as a potentially better alternative. Although it uses a more complex intermediate representation (IR) that can be costly in terms of performance, the real cost lies in the interpretation of instructions rather than in the IR itself. Platforms like KLEE, which perform binary translation in their virtual machine, handle the IR more effectively.
Furthermore, the lecture touched on the trade-off of re-execution versus state forking in symbolic execution. State forking is a technique used in tools like S2E and angr: when a symbolic branch is encountered, the tool forks another state to explore the alternative path. While this approach seems advantageous in theory, it also implies increased computational overhead due to the exponential growth of states that must be managed.
Forking vs. Re-execution:
- Forking: This approach involves creating a snapshot of the system state, allowing you to revisit that exact state later. This is particularly useful because it lets you explore different execution paths without restarting the entire process from scratch. It requires managing and maintaining both the program state and the kernel state.
- Re-execution: On the other hand, re-execution involves starting from the beginning and running through the program again to reach the desired state. It avoids the complexities of state management that come with forking.
In systems like S2E, entire virtual machines can be forked. By taking a VM snapshot, you can explore different states efficiently. Although this may seem resource-intensive, in reality, the differences (or deltas) between VM snapshots are often small. This allows for optimizations like 'copy-on-write', minimizing the storage and processing overhead.
- Challenges:
- Managing the kernel state is more complex than managing program state. While forking the program state is relatively straightforward, the kernel state demands more sophisticated handling.
- Forking necessitates the storage of state data, which can be expensive in terms of both space and complexity.
- It requires precise environment models to replicate the kernel state accurately.
Re-execution Advantages:
- Simplified State Management: By re-executing, you avoid the burden of managing complex system states, particularly the kernel state.
- Performance Optimization: With effective optimization of the re-execution process, this method can be competitive in terms of performance, despite the absence of state management.
- Space vs. Time Trade-off: The choice between forking and re-execution ultimately boils down to a space versus time trade-off. Forking demands more storage and complexity for state management, whereas re-execution might consume more time but requires less space.
Here’s a detailed breakdown of the key points discussed:
Signal Loss and Input Modeling
- Initially, we touched on the fact that some signals are lost, though not many. This can affect the modeling of standard inputs, which is crucial for marking inputs as symbolic.
- The process involves identifying symbolic memory, file reads, and common library calls, which are essential for accurate modeling.
Handling Complex Instructions
- A significant challenge in symbolic execution is handling complex instructions. When faced with such instructions, one approach is to employ concretization.
- Concretization simplifies the analysis by treating constructs that are difficult to model symbolically with concrete values instead.
- Instructions that are often concretized include floating-point operations and certain string-related operations, which may be too complex to model symbolically.
Pointer Arithmetic and Symbolic Execution
- A specific challenge discussed was symbolic pointer arithmetic. When a memory address or index is symbolic, it complicates the execution model.
- Consider an instruction like MOV EAX, [EBX + 10] where EBX is symbolic. Different possible values of EBX lead to fetching different memory values into EAX.
- One naive solution is to compute the possible range of EBX and encode all potential memory values into the symbolic formula. However, this approach can make the formula very large and complex, posing difficulties for constraint solvers.
Concretization as a Solution
- When faced with overly complex symbolic execution challenges, such as those involving symbolic pointer arithmetic, concretization is often employed as a practical solution (see the sketch below).
- By concretizing the index, the execution model avoids the complexities of pointer-based symbolic execution. This approach, while not ideal, is often necessary to manage the complexity and limitations of current constraint-solving technologies.
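The contrast can be sketched with Z3's theory of arrays; the register, memory object, and observed value here are illustrative.

```python
from z3 import Array, BitVec, BitVecSort, Select

ebx = BitVec("ebx", 32)
memory = Array("mem", BitVecSort(32), BitVecSort(32))

# Fully symbolic load, MOV EAX, [EBX + 10]: precise, but the array
# term flows into every later constraint and can overwhelm the solver.
eax_symbolic = Select(memory, ebx + 10)

# Concretized load: pin EBX to the value observed during the concrete
# run (0x1000 here, an invented value); the load becomes an ordinary
# lookup at one fixed address.
EBX_OBSERVED = 0x1000
eax_concrete = Select(memory, EBX_OBSERVED + 10)
```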
To begin with, when discussing state scheduling, it is essential to note that, unlike in a tool such as KLEE, there is no inherent state here: each input simply follows a specific execution path. This characteristic turns scheduling into a challenge akin to input scheduling in fuzzing. The question arises when encountering a symbolic branch: should the branch be explored immediately, or is it more beneficial to defer exploration to a future state? This decision boils down to whether to "flip" the branch or not, which is crucial for exploring different execution paths.
The concept of branch flipping is straightforward. When a symbolic branch is encountered, one must decide whether to explore the alternative path. A simple method is to flip the branch if it leads to a previously unexplored code block. Alternatively, if the current block is new, one may decide to flip any branch to maximize exploration. This simplistic policy is a starting point, but more sophisticated strategies can be devised to optimize exploration.
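A sketch of that simple flipping policy; the branch object and its untaken_target field are hypothetical names.

```python
def should_flip(branch, covered_blocks):
    """Flip a symbolic branch only when its untaken side would land
    in a basic block that has never been covered."""
    return branch.untaken_target not in covered_blocks
```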
Moving to the challenges of constraint solving in symbolic execution, one of the primary issues is the complexity of constraints that can arise. Symbolic execution often involves collecting constraints from the beginning of execution to the point of interest. This accumulation can make the constraint set exceedingly complex and potentially unsolvable, particularly when the constraints span multiple functions and execution paths.
A concrete example illustrates this problem: consider a program that processes files, such as a compressed archive. At the start of execution, the program might perform intricate checks on file metadata. While the security vulnerability might lie further into the execution, exploring it requires maintaining all prior constraints. This results in a massive and intricate constraint set, which a constraint solver might struggle to handle due to its complexity.
To address this, one approach is to concretize certain parts of the execution. In this context, concretization means selectively excluding specific branches from the constraint formula, thereby simplifying the set of constraints the solver must handle. By doing so, one can focus on the critical parts of the execution path that are more likely to lead to discovering vulnerabilities without being bogged down by overly complex or irrelevant constraints.
The primary motivation behind this approach is to address scenarios where not all constraints are relevant or necessary for solving a particular problem. By focusing on a subset of conditions, we aim to efficiently solve the problem while minimizing unnecessary computational effort.
- Optimistic Subset Solving:
  - The central idea is to begin with a concrete input and selectively solve a subset of constraints related to a particular problem.
  - This approach assumes that only a small subset of input bits need alteration, while the rest remain unchanged.
  - Although this method is straightforward, it often does not yield a successful result: it is effective in less than 20% of cases, indicating its limited applicability. However, when it works, it offers a quick solution.
- Nested Branch Solving:
  - This method is more structured and involves a recursive inclusion of relevant constraints.
  - Starting from the last branch, we trace back to determine how variables are propagated from the inputs.
  - If multiple conditions or variables rely on the same input bytes, they are included in the constraint set.
  - This recursive addition of constraints continues until all relevant conditions are covered, forming a subset that is crucial for the problem at hand.
  - Despite being a subset, this method is comprehensive and has proven quite effective in practice, often resulting in a successful test case.
Both strategies aim to simplify the process of symbolic execution by reducing the complexity of conditions that need to be solved. While optimistic subset solving is a quicker but less reliable method, nested branch solving provides a more dependable solution by ensuring all critical constraints are considered.
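A minimal sketch of the nested-branch idea: starting from the target branch, recursively pull in every earlier constraint that shares input bytes with one already selected. The constraint objects and their input_bytes attribute are illustrative assumptions, not the actual implementation.

```python
def relevant_subset(target, all_constraints):
    """Collect the constraints transitively related to `target`
    through shared input bytes."""
    selected = [target]
    needed = set(target.input_bytes)
    changed = True
    while changed:
        changed = False
        for c in all_constraints:
            if c not in selected and needed & set(c.input_bytes):
                selected.append(c)            # touches bytes we depend on
                needed |= set(c.input_bytes)  # grow the dependency set
                changed = True
    return selected
```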
The insights discussed in this lecture are drawn from a paper I published in the field of security a few years ago. This paper was well-received and even won a best paper award, highlighting its significance and the impact of the research. I have gained valuable insights from this work, which continues to influence my approach to solving such problems.
In our recent discussions, we've encountered various academic papers. Occasionally, we come across papers that, while technically sound and well-written, don't offer substantial new insights or learning opportunities. However, there are those rare instances where a paper stands out, providing valuable insights and a fresh perspective that is genuinely appreciated. Such papers contribute significantly to our understanding and are worth our attention and discussion.
Looking ahead, we will continue our discussions in future sessions, delving deeper into the topics and papers that have sparked our interest. This ongoing dialogue will enhance our collective understanding and appreciation of the material.