RFC: Scalable Static Analysis Framework
Jan Korous, Static Security Tools, Apple
Summary
We have a prototype of a source code rewriting tool that uses static analysis methods to apply security hardening across whole C++ codebases. We want to complete and upstream the tool. We are starting to work on other tools that need to reason about source code across large C, Objective-C and C++ projects. We also have a long-standing goal of enhancing the Clang Static Analyzer with analyses across translation units to improve its accuracy and precision, thereby reducing false positive rates. While there is an existing effort for cross-translation-unit analysis in Clang based on ASTImporter, we don’t think it models the software build with the accuracy we need, and it won’t be able to support the scale of the projects we target. This motivates us to create a summary-based cross-translation unit static analysis framework that would cover common needs of tools we want to create.
We plan to use it immediately to develop the tools mentioned above and other tools in the future. We intend to design and implement the framework incrementally in parallel with the initial client tools to make sure we are building the right thing.
This RFC is intended to kick off the development; it presents the high level ideas. We will post follow-up RFCs for each tool as well as specific proposals for parts of the design.
Problem
We intend to develop a number of tools based on static analysis methods that will need to infer, represent, and analyze facts about program entities of large software projects composed of numerous separately built targets. The tools will need to accurately represent relations within all the input source code of such software. They will need an efficient way to infer, represent, process, aggregate, and consume metadata about large amounts of code in order to implement the specific analyses.
Proposal
We will create a new framework that static analysis tools and source code rewriting tools can use to reason about the source code of large software projects. The framework will consist of new APIs, new data formats, new tools, and possibly new features in clang.
The tools need to reason about program entities, such as functions, global variables, user-defined classes, and local variables. In this RFC, we’ll use the term “entities” as an umbrella term for these language features. We might extend the definition in the future. Note that the C and C++ language standards define “entities” differently, with only partial overlap to the definition provided above.
The framework will provide a way to identify program entities independently of their AST representation.
The framework will present an abstraction of the entities that is independent of the AST representation. The client analyses will work at the level of entities and information about entities when reasoning about programs. The abstraction will be provided as an API by the framework. The API will allow constructing program entities from AST nodes, looking up information for entities, as well as implement serialization and deserialization.
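As an illustration, here is a minimal sketch of what such an API could look like; all names are hypothetical and not a settled interface:

#include "clang/AST/Decl.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/raw_ostream.h"
#include <optional>
#include <string>

namespace framework {
// Opaque, AST-independent handle for a program entity.
class Entity {
public:
  // Construct an entity from an AST node (hypothetical factory).
  static std::optional<Entity> fromDecl(const clang::Decl &D);
  // Serialize into a stable, storable form and back.
  void serialize(llvm::raw_ostream &OS) const;
  static std::optional<Entity> deserialize(llvm::StringRef Data);

private:
  std::string Id; // e.g., USR plus compilation ID; see "Entity Representation"
};
} // namespace framework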
We will rely on build system integration to provide the framework with information about how code is built and connected.
The framework will represent data about entities in compilation units. As the code is built and connected through processes like linking, the framework will provide APIs for combining information about entities from different compilation units. The build system will invoke tools that use these APIs.
The framework will allow for complex build scenarios to be modeled. The framework will require other build tools to provide it with information on the build as necessary. A hypothetical example might be the linker delivering information on symbol aliases.
The framework will provide APIs to represent, capture, process, and consume abstractions of the code for the client analyses to use.
While being general-purpose, the framework will make certain assumptions about the client analyses. In order to achieve the goals stated below it will assume that the analyses capture and process data about the code in a compact, abstract form that we will call “summaries”. This concept is similar to ThinLTO summaries.
As an example, for function inc:
unsigned * inc(unsigned * p) {
  if (p)
    return p + 1;
  abort();
}
For a client analysis reasoning about pointer flow, a possible summary of the function could be that it flows data from its parameter p to its return value:
inc:param:p -> inc:return
That describes the function in terms relevant for the analysis and omits the unnecessary details. It can be represented in a more compact way than its source code or AST.
The framework will provide shared facilities that client analyses can reuse. Analyses that cannot fit into the proposed shape can still benefit from the abstraction of program entities and build modeling provided by the framework.
Motivating Examples
Analysis - Call Graph
We will consider a toy example of a tool that will illustrate the needs we propose the framework to address.
Let’s suppose we want to build a tool that constructs a call graph of make-based C programs. Here is an example of such a program. Each code block below represents the contents of a file (whose name is given in a comment):
// main.c
#include "foo.h"
int main(void) {
foo();
}
// foo.h
void foo(void);
// foo.c
#include <stdio.h>
void foo(void) {
  printf("Hello World!\n");
}
// Makefile
main: main.o foo.o
...
Suppose the tool receives definitions of all the inputs for all the compilation units and constructs ASTs for each of them. For each AST, it would be able to construct local components of the call graph.
in main.o main calls foo
in foo.o foo calls printf
The tool needs to be able to connect the information from each translation unit into a view of an entire program. The tool needs the following functionality, which the proposed framework intends to provide:
a way to identify program entities in individual translation units
In our example, the data in the summary of foo.c will be that foo calls printf, and the data in the summary of main.c will be that main calls foo. Entity identifiers will be used in the summaries to represent main, foo, and printf. The tool will need to serialize and deserialize them.
an abstraction of entities spanning multiple translation units and logic for merging the metadata
The tool would have to follow the build graph and merge the two summaries of the individual translation units to produce a merged summary for the executable main. It would have to understand that foo in the summary of main.c is the same entity as foo in the summary of foo.c, and it would have to combine the information about foo from both summaries to form the merged summary.
data structures to represent the code and APIs to work with the data, e.g., a representation of the call graph segments sketched above (see the sketch below)
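To make these needs concrete, the per-TU summaries and the merged result for the toy program could look like this; the notation is illustrative only:

// main.summary
calls : [ { caller : main.c:main, callee : main.c:foo } ]
// foo.summary
calls : [ { caller : foo.c:foo, callee : foo.c:printf } ]
// merged summary for the executable main
calls : [
  { caller : main, callee : foo },
  { caller : foo, callee : printf }
]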
Source Code Transformation - Unsafe Pointer Hardening
We will consider another example of a tool that does analysis followed by source code transformation. Let’s suppose we have a tool that converts relevant built-in raw pointers to std::span. The tool works by finding all places where pointer arithmetic or indexing is done on a built-in pointer and changing the type of the operand to std::span. It then propagates the change from pointer to std::span via assignment and argument passing operations to upstream pointers.
Here is an example of a program to convert.
// main.cpp
#include "impl.h"
int main(void) {
int *x = create(5);
int result = access(x, 4);
return result;
}
// impl.h
int access(int *x, unsigned offset);
int *create(unsigned int x);
// impl_a.cpp
#include "impl.h"
int access(int *x, unsigned offset) {
if (x == nullptr) return 0;
int *tmp = x;
return tmp[offset];
}
// impl_b.cpp
#include "impl.h"
int *create(unsigned size) {
int *result = new int[size];
return result;
}
// Makefile
main: main.o impl_a.o impl_b.o
...
For each translation unit (TU), the tool would construct summaries that describe:
what pointer parameters of functions defined in a TU must become a span because of local usage within the function
what pointer parameters might have to become a span if another parameter becomes a span
(and similar information for return types of functions and return type/parameter combinations)
For the example, the data in the summary of each TU would be:
for impl_a.cpp:
- Parameter `x` of `access` must become a span.
for impl_b.cpp:
- If the return type of `create` must become a span, `result` must become a span.
for main.cpp:
- If the first parameter of `access` must become a span, then `x` must become a span.
- If `x` must become a span, then the return type of `create` must become a span.
The tool needs the following functionality to determine which pointers must become std::span:
a way of representing information about the program entities. For example, a hard requirement on the type of an entity, or a conditional requirement of the form “if x then y”.
serialization and deserialization of the above for each TU.
merging information for multiple TUs.
using the merged-TU information to decide which variables must become span (in this case, you could view the information as being constraints on types of pointer declarations that must be resolved).
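For the example above, the merged constraints and their resolution could be sketched like this (illustrative notation only):

must_span : [ access:x ]
implications : [
  { if : access:x, then : main:x },            // x is passed as the first argument of access
  { if : main:x, then : create:return },       // x is assigned from create(5)
  { if : create:return, then : create:result } // result is returned by create
]

Starting from must_span and repeatedly applying the implications marks access:x, main:x, create:return, and create:result as pointers that must become std::span.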
In addition, the tool needs to actually rewrite the source code, which we will explain in “Adoption Tool for C++ Safe Buffers”.
Note: Both motivating examples share certain abstract needs, such as extracting summaries per TU and merging them. The specific logic used for these tasks might be analysis-specific, such as the data in the summaries or the summary data merging strategy.
Goals
The overall goal is to create a shared framework that will address common needs of static analysis tools and other tools based on similar methods such as source code rewriting tools.
Ease of Implementation of Cross-Translation Unit Analysis
The primary goal is that the framework will address the common low-level details of relating entities across the build so that for any specific analysis, implementing its cross-TU version won’t be substantially more difficult than implementing a single TU version.
The ideal is that the analysis authors can reason about the whole codebase as if it were all in a single translation unit. We want to factor out the common parts of cross-TU reasoning and implement them in the framework so the bar for implementing a specific analysis is much lower than today.
Interoperation with Build Tools
We need the tools to be usable in the existing workflows for building, distributing, and assembling software components. The framework should fit into existing workflows of incremental compilation, linking, library distribution, etc. Specifically, we cannot require that toolchain components be heavily modified or that the workflow for building software be drastically changed.
We imagine that changes to build systems might be required and that minor changes might be required in linkers. We want to design the framework to keep these changes to a minimum.
Support Analysis of Large Software Projects
We want to build tools that can process large codebases and aggregates of codebases, such as compilers, web browsers, or OS kernels. The framework needs to avoid imposing arbitrary constraints and needs to assume the scale of analyzed code can be very large. Specifically, it should not assume all code, ASTs or even summaries for a codebase fit into memory. We plan to construct the analysis data by incrementally combining its components, similar to how binary artifacts are assembled during software builds.
It is not a goal to lower the resource demands of complex whole-program analyses such as whole-program alias analysis. While the framework aims to be neutral with respect to its applications, it will likely be biased towards modular, scalable analyses. The framework will likely be opinionated about the kinds of abstractions and APIs it provides for reasons like performance, scalability and distributed build environment support.
Support Analysis of Complex Scenarios
The framework needs to discover, capture, and expose entity relations as accurately as practically possible. For information that cannot be inferred from the source code, such as link units or dynamic library lookup, the framework needs to provide ways for the toolchain and operating system components to communicate that information so the framework can represent it in the modeled entity relations.
Support Source Code Rewriting
Source code rewriting tools will need to solve additional challenges compared to tools that perform only analysis. These are the ones we have identified and are motivated to provide solutions for in the framework:
Code transformation tools aim to avoid breaking the build. The framework might model the relevant language rules as a set of constraints that a valid transformation must satisfy.
The tools might need to transform source code that is a product of an early compiler pass like macro expansion, template instantiation, or constexpr compile-time evaluation. The framework might model such concepts so the tools can implement the transformation.
Flexibility and Customizability
Different tools will need to use different data in the summaries. The framework needs to support different use cases by being flexible and providing the right customization points.
Support for Modeling of User-Defined Abstractions
Pointers commonly play a prominent role in security analysis of C-family source code. In C++ there are multiple standardized or popular pointer-like and buffer-like abstractions, such as std::span or boost::array. Some of our intended analyses will likely have to model these more abstract concepts of “pointer-like entities” and “buffer-like entities” and the framework should enable that.
Supported Programming Languages
The core part of the framework will support all languages implemented by clang. We will initially build analyses on top of the framework that will reason about C, C++, and C with the -fbounds-safety language extension. Support for other extensions will be developed as needed.
Non-goals
Analysis over Clang IR or LLVM IR. We envision that the framework could serve as a backbone for cross-TU analysis at the IR level by using language entities to bridge translation units, but we will focus our efforts on developing source code level analysis.
Support for binary-level analysis. All our intended uses of the framework will be at the source code level.
Abstractions to represent non-C code. Cross-language analysis will be an important application of the framework for scenarios like secure C++ and Swift interoperation. We envision a decomposition of such analysis where each part would represent code written in a single programming language. We will focus on developing abstractions that model C-family language entities. Language bridging can later be built on these foundations.
Interoperation with other C compilers. The concepts and operations defined by the framework will likely refer to the source code in terms of language concepts and will not refer to any AST details. It might be possible to generate build definitions and summaries from other compilers and use the framework to process them, but it is not an explicit goal to support such a workflow.
Proposed Architecture
The basic operations provided by the framework will be: summary extraction, summary linking and summary analysis.
Build System-Driven Workflow
Let’s take a simple build of an executable as an example:
The build is a directed graph in which build artifacts are created from the inputs (e.g., source files and compilation commands) by tools such as the compiler or the linker. The build system traverses the graph and invokes the specific tools with the inputs defined by the project’s build configuration.
The framework will provide new tools for the build system to invoke at the appropriate edges in the build graph so summary data about the codebase can be extracted and linked. For example, we might add a summary extraction feature to clang and create a summary linker tool that the build system would then invoke.
Our initial goal is to support a “single project, single-target” scenario, i.e., a build of a single user-space executable or dynamic library where nothing needs to be known about library dependencies. We will initially treat libc and libc++ headers as if they were part of the project source code. In the long term, the framework will correctly model software that is constructed from multiple codebases related through a packaging and distribution mechanism.
For the single-target scope we will rely on the build system to provide the information on how the code is built. The framework will implement basic operations related to summaries and provide these as command line tools usable by the build systems. The project’s build system will invoke these tools with the right inputs. The framework will also provide these operations in the form of APIs.
The framework will be build-system-agnostic and will avoid modeling the complexities of build systems or distributed builds.
Summary Extraction
This operation will summarize source code for a given compilation and for a given analysis. It will be invoked by the build system at the compilation edges of the build graph. The core of the implementation will likely be AST traversal to extract analysis-specific data for the entities of interest.
We might decide to add a new feature to clang so this operation can run during the actual build to amortize the cost of parsing source files.
input:
target analysis type
compilation inputs
compilation identifier (provided by the build system to identify the specific compilation edge in the build graph)
output:
- translation unit summary
TU summaries will represent entities and their data. The representation will need to support entity linking and data merging as two separate steps.
TU_Summary {
entities : [ Entity, ... ],
data : [ /* contains references to entities */ ]
}
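As a sketch of what the extraction core could look like, here is a hypothetical AST traversal for the call graph example from earlier, collecting (caller, callee) pairs keyed by USR. This is illustrative only, not the proposed implementation:

#include "clang/AST/Decl.h"
#include "clang/AST/Expr.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Index/USRGeneration.h"
#include "llvm/ADT/SmallString.h"
#include <string>
#include <utility>
#include <vector>

class CallEdgeCollector
    : public clang::RecursiveASTVisitor<CallEdgeCollector> {
public:
  // (caller USR, callee USR) pairs destined for the TU summary.
  std::vector<std::pair<std::string, std::string>> Edges;

  bool TraverseFunctionDecl(clang::FunctionDecl *FD) {
    // Simplified: tracks free functions only; no methods, lambdas, etc.
    std::string Saved = CurrentCaller;
    CurrentCaller = usrOf(FD);
    bool Result =
        RecursiveASTVisitor<CallEdgeCollector>::TraverseFunctionDecl(FD);
    CurrentCaller = Saved;
    return Result;
  }

  bool VisitCallExpr(clang::CallExpr *CE) {
    if (const clang::FunctionDecl *Callee = CE->getDirectCallee())
      if (!CurrentCaller.empty())
        Edges.emplace_back(CurrentCaller, usrOf(Callee));
    return true;
  }

private:
  std::string CurrentCaller;

  static std::string usrOf(const clang::Decl *D) {
    llvm::SmallString<128> Buf;
    if (clang::index::generateUSRForDecl(D, Buf)) // returns true on failure
      return std::string();
    return Buf.str().str();
  }
};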
Summary Linking
This operation will take summaries for multiple object files and resolve entities across individual summaries. It will be invoked by the build system at the build graph edges where the linker is invoked during the actual build.
inputs:
target analysis type
linker command
individual TU summaries
output:
- linked summaries (entities have been linked)
At this step we will not merge data in the summaries as that logic will be specific to the analysis.
Link_Unit_Summary {
linked_entities : [ LinkedEntity, ... ],
data : [ ... ]
}
Summary Analysis
Likely the first step the analysis will want to take is merging summary data: while the entities have been linked, their data has not been merged yet.
Summary Data Merging
The operation will merge analysis-specific data for identical entities from multiple TU summaries.
Summary {
linked_entities : [ LinkedEntity, ... ],
merged_data : [ ... ]
}
For some analyses this operation will be trivial, e.g., when only a single TU summary contains summary data for a given entity.
The analysis itself will work with merged summaries only, i.e., neither the original source code nor its AST will be used. It will consume the merged summary data for the whole target.
input:
target analysis type
merged summary for the entire target
output:
analysis result
[diagnostics]
Multi-pass Scenario
Some client tools might need to combine the analysis result with the rich details in the AST. The framework will support such a workflow in the sense that the analysis result of one pass (summary extraction, linking, analysis) can be used as input for the next pass (another summary extraction, …).
An example of such a tool might be a source code rewriting tool that uses the first pass to establish which entities should be transformed and then uses an additional pass to compute the source code changes that realize the transformation, as described in “Adoption Tool for C++ Safe Buffers”.
Extending scan-build
The framework will generally interact with build systems. We want to start by implementing support for make-based projects by extending the scan-build tool in clang. We believe that will allow make-based projects to use static analysis tools based on the framework with minimal setup work. It will also allow us to bring up and verify the relevant interfaces in the framework without making corresponding changes to build system projects, i.e., outside of LLVM.
Configurable Analysis
We imagine the tools built on top of the framework will need to use different kinds of analyses to obtain the required data. We expect the logic for specific applications to be located within framework code, similar to the model that we use for Clang Static Analyzer checkers. We expect some of the more general analyses (e.g., a whole-target call graph) will likely be used by multiple client tools.
At a source code level, we aim to provide APIs that allow the analyses to be reused. At the data level we imagine that within a single workflow multiple analyses might be extracting and merging their summary data, some of which may overlap. We imagine that we would create a way to share summary data gathering for multiple analyses within a single workflow.
High-Level Design
The data in summaries will refer to entities as defined in the “Proposal”.
Entity Representation
The build system will provide a unique compilation ID when a summary gets extracted. We will use the compilation ID together with clang USR (“unified symbol resolution” identifier) to implement entity identifiers.
Within a single build target, distinct entities visible outside their object file (such as global variables or user-defined classes) will have distinct USRs.
USRs of distinct entities that are not externally visible (such as local variables) can potentially conflict if they are part of different compilation units. We will rely on compilation ID to distinguish them.
For targets linked against other shared libraries we will combine the entity identifiers with the “Entity Namespaces” mechanism to relate and distinguish entities.
EntityId {
USR : string,
CompilationId : string
}
We will consider diagnosing ODR violations to detect invalid programs for which the entity data would be corrupted. If we need to model other entities we may extend the identifier data.
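A minimal sketch of constructing such an identifier from a declaration using clang’s existing USR generation API; the helper name is hypothetical:

#include "clang/AST/Decl.h"
#include "clang/Index/USRGeneration.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"
#include <optional>
#include <string>

struct EntityId {
  std::string USR;           // clang unified symbol resolution identifier
  std::string CompilationId; // opaque ID provided by the build system
};

// Hypothetical helper: derive an EntityId for a declaration, or nullopt
// if clang cannot generate a USR for it.
std::optional<EntityId> makeEntityId(const clang::Decl *D,
                                     llvm::StringRef CompilationId) {
  llvm::SmallString<128> Buf;
  if (clang::index::generateUSRForDecl(D, Buf)) // returns true on failure
    return std::nullopt;
  return EntityId{Buf.str().str(), CompilationId.str()};
}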
Entity Linking
We need to accurately relate entities across the build. This operation will be executed during the “Summary Linking” stage, and it will work with the data in the TU summaries. For that purpose we will supplement entity identifiers with linkage metadata. The specifics are TBD, but we expect these are the types of data we will need to gather at the “Summary Extraction” stage:
EntityLinkage {
kind : class|function|global_var|local_var,
linkage : extern|static|N/A,
symbol_visibility : default|hidden,
weak_symbol : bool,
defined_in_SDK_header : bool,
inline : bool,
}
The entity linking needs to cover a superset of the entities that are recognized by the linker as part of the normal build. For example, the framework needs to link user-defined class entities or static inline functions in SDK headers, neither of which is handled by the linker. Each kind of entity can have its own strategy for how its TU-local fragments are to be linked.
We will initially model only the basic linker functionality and will cover advanced linker features such as weak symbols or hidden visibility when we implement support for multi-library scenarios. One possibility is to rely on linkers providing information through features like linker maps.
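As an illustration of one possible default rule (hypothetical, before any advanced linker features are modeled): externally visible entities unify across TUs by USR alone, while TU-local entities stay distinguished by their compilation ID.

#include <string>

enum class EntityLinkageKind { Extern, Static, NotApplicable };

// Hypothetical key under which entities are unified during summary linking.
std::string linkingKey(const std::string &USR,
                       const std::string &CompilationId,
                       EntityLinkageKind Linkage) {
  if (Linkage == EntityLinkageKind::Extern)
    return USR;                     // visible across the link unit: unify by USR
  return CompilationId + ":" + USR; // TU-local entity: keep TU-distinct
}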
Entity Namespaces
Our goal is for the framework to eventually enable analysis of large software projects that are constructed from individual components and dependencies. Such projects typically use some form of packaging, distribution, installation, and dynamic linking of their dependencies.
In such scenarios, link-time symbol resolution and dynamic loading are relevant to understanding how the source code is connected. Without externally provided information, the framework is unable to infer accurate information on how software is constructed from its components, like shared libraries.
We don’t think the framework is the right place to model all possible complexities of software projects. We propose an abstraction that factors these complexities out of the framework while allowing toolchain components that have access to this information, such as build systems, to pass the relevant data to the framework.
We propose to introduce the notion of entity namespaces. A namespace is a set of program entities (imagine a shared library). Each namespace has a distinct identity, and operations such as summary merging would take namespaces into account, in combination with properties of program entities. The framework would provide some default rules for using namespaces in various operations, but allow those rules to be overridden or replaced so that the behavior for a specific system can be modeled. The framework will provide interfaces through which other tools can override namespace information in the summaries.
We expect this abstraction to be powerful enough to allow modeling of complex scenarios.
Summaries
Summaries will contain data about the entities.
Conceptually the framework needs to provide at least these kinds of operations for summaries:
produce summary from AST for a single TU
- analysis-specific summaries
- possibly producing summaries for multiple different analyses at the same time
combine summaries of multiple TUs
- analysis-specific logic
provide lookup of data for TU-local AST nodes
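A hypothetical shape for an analysis-specific summary provider covering the extraction and merging operations above; all names are illustrative only:

#include "clang/AST/ASTContext.h"
#include <memory>
#include <vector>

// Opaque, analysis-specific summary payload.
class AnalysisSummary {
public:
  virtual ~AnalysisSummary() = default;
};

class SummaryProvider {
public:
  virtual ~SummaryProvider() = default;
  // Produce a summary from the AST of a single TU.
  virtual std::unique_ptr<AnalysisSummary> extract(clang::ASTContext &Ctx) = 0;
  // Combine summaries of multiple TUs with analysis-specific merge logic.
  virtual std::unique_ptr<AnalysisSummary>
  merge(std::vector<std::unique_ptr<AnalysisSummary>> Parts) = 0;
};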
Scalability
One of our important goals is to support analysis of large software projects, and the data volume of the summaries will be important to reach that goal. We see two guiding principles to follow when building scalable analysis tools.
First, we should carefully consider what kinds of information internal to each TU to put in the summaries. For example, not every analysis needs to summarize local variables.
Second, we should also carefully consider what specific entity instances should be included in the summaries. For example, for a particular TU, a function declaration that neither has a definition nor is used in any way relevant to the specific analysis should generally not be included in the summary of that TU.
The framework will aim to make building scalable tools easy without being too prescriptive and without imposing artificial constraints. We also imagine that in the future we will create guidelines for designing scalable summaries.
Well-Defined Lookup
We want to avoid answering ill-defined questions at the summary consumption stage. For example, it is legal to define the same struct with different definitions within a single program as long as all but one of the definitions have no ODR-use in their translation unit. As a consequence, relating an arbitrary declaration to other declarations in the program sources is not always well-defined. However, relating a “use” of an entity to its definition should be well-defined and would fit the summary extraction strategy described above.
Constraints
For the source code rewriting tools that we intend to develop we imagine there will be a need to represent constraints over entities or entity metadata (such as type). We also might use constraints to represent language rules and would use them when searching the transformation space to avoid build regressions.
The precise details of the constraints layer will be worked out later as we start bringing up the specific tools but it seems clear that the constraints layer will depend on the notion of entities and will be contained in the summaries.
Implementation Strategy
We propose adding the framework to the clang codebase and have it controlled by a build configuration option.
The specific tools driven by the build system can be separate tools in the llvm-project/clang-tools-extra subproject, although for the summary extraction phase, having the functionality directly in clang will likely be highly desirable.
File Formats
We imagine that in most scenarios, client tools will store summary information on the filesystem. We expect that for scalability reasons the framework will have to provide an efficient file format—likely either binary or compressed. However, for initial bringup and debugging, having a human-readable format like JSON or YAML would be beneficial.
Scenarios
We have several initial applications of the framework in mind to guide us through the initial development:
a new static analysis tool to find global variables modified by each function and its callees
a new source code rewriting tool for:
automatic project-wide adoption of C++ Safe Buffers programming model
automatic project-wide adoption of -fbounds-safety programming model
a new static analysis tool to enforce safe interoperation between C++ and Swift code
Clang Static Analyzer functionality to improve result quality
We outline the design of two of the client tools of the framework below.
Adoption Tool for C++ Safe Buffers
Suppose we want to automate source code rewriting to replace unsafe raw pointers (as identified by -Wunsafe-buffer-usage) with hardened std::span, which tracks the upper bound of its pointer and uses it for runtime bounds checks in arithmetic and array subscript operations.
When changing pointers to std::span we need to correctly set the upper bound for span instances. The strategy we chose is to take the existing pointer, find the operations that propagate the memory address to it, and then change the declarations involved in these operations so they propagate not only the memory address but also the bound, by changing all these raw pointers to std::span instances. We expect that code paths that propagate memory addresses begin at locations where both the memory address and its bound are known and are represented by a code pattern that can be automatically analyzed (e.g., a heap allocation through operator new).
Conceptually, for an unsafe pointer, we recursively build the group of pointer declarations that propagate memory addresses to the unsafe pointer operation and then change their types to std::span.
We will use the framework and summaries to find sets of pointer declarations related by memory address propagation.
Let’s take this codebase example:
// cast.cpp
char * cast(int * int_ptr) { return (char*) int_ptr; }
// foo.cpp
char * cast(int * int_ptr);
struct Foo { char* buff; };
void bar(int * ptr, Foo& f) {
int * local_ptr = ptr + 1;
f.buff = cast(local_ptr);
f.buff[1024] = 'C'; // warning: Foo::buff ... [-Wunsafe-buffer-usage]
}
We see that:
in function bar, the memory address from ptr is propagated to local_ptr (we ignore that it is modified by the arithmetic expression because std::span implements an analogous operation)
the memory address in local_ptr is propagated to the int_ptr parameter of function cast
in function cast, the memory address in the int_ptr parameter is propagated to the return value (we ignore the type cast because we can implement a safe cast between different std::span<T> instances)
in function bar, the memory address returned by function cast is propagated to the field Foo::buff
(While we write the list in the order that is easy to follow for the reader, the analysis does not consider the order.)
The data that we want to infer is that Foo::buff is an unsafe pointer and the following group of pointers is related (memory address flows through these to Foo::buff):
Foo::buff
int_ptr parameter of function cast
return type of function cast
local_ptr
ptr parameter of function bar
(We do not have a common word for these entities, but except for the function return types, they are all some kind of “declaration”.)
Another way to describe this problem can be “backward taint analysis on pointer types”.
We will design the per-TU summaries as follows. We will abstract pointer declarations as “entities” since for this part of the analysis it is irrelevant if a particular pointer is a function return type or a struct field. Let’s assume for now that we have a naming scheme for entities that covers all the kinds of pointers in the example.
Summary Extraction
unsafe_pointers : [ entity_id, ... ]
pointer_flow : [
{ from : pointer_id, to : pointer_id },
...
]
For our example we would extract these per-TU summaries. The entity naming scheme is for illustration purposes only; it is not what we intend to use.
// cast.summary
unsafe_pointers : []
pointer_flow : [
{ from : cast.cpp:cast:int_ptr, to : cast.cpp:cast:return },
]
// foo.summary
unsafe_pointers : [ foo.cpp:Foo:buff ]
pointer_flow : [
{ from : foo.cpp:bar:ptr, to : foo.cpp:bar:local_ptr },
{ from : foo.cpp:bar:local_ptr, to : foo.cpp:cast:int_ptr },
{ from : foo.cpp:cast:return, to : foo.cpp:Foo:buff }
]
Summary Linking
We will link entities from different TU summaries. (Represented below by removing the .cpp file prefix from the resolved identifiers.)
// cast.summary
unsafe_pointers : []
pointer_flow : [
{ from : cast:int_ptr, to : cast:return },
]
// foo.summary
unsafe_pointers : [ Foo:buff ]
pointer_flow : [
{ from : bar:ptr, to : bar:local_ptr },
{ from : bar:local_ptr, to : cast:int_ptr },
{ from : cast:return, to : Foo:buff }
]
Summary Analysis
Once we know the relations between entities, we can merge the associated data—in our case, we will simply collect all unsafe pointers and all pointer flow edges.
unsafe_pointers : [ Foo:buff ]
pointer_flow : [
{ from : cast:int_ptr, to : cast:return },
{ from : bar:ptr, to : bar:local_ptr },
{ from : bar:local_ptr, to : cast:int_ptr },
{ from : cast:return, to : Foo:buff }
]
We obtained a program-wide list of unsafe pointers and a graph of relations between pointers. In this graph, connected components identify groups of pointers that would need to be transformed together. By combining these two pieces of information, we can determine which pointers need to be changed to std::span and which groups of pointers should be transformed together.
With summaries for all the source code we will identify groups of pointers to be changed to std::span by recursively aggregating pointer sources per unsafe pointer. In this example there’s only one such group:
make_spans : [ [Foo:buff, cast:int_ptr, cast:return, bar:local_ptr, bar:ptr] ]
We have found the transformation.
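A minimal sketch of this aggregation step, assuming hypothetical types: starting from each unsafe pointer, walk the pointer_flow edges backwards and collect every declaration that feeds it.

#include <set>
#include <string>
#include <utility>
#include <vector>

using Entity = std::string; // stands in for a linked entity identifier

// Collect the group of pointers that must change together with Unsafe:
// the transitive sources of memory addresses flowing into it.
std::set<Entity>
collectSpanGroup(const Entity &Unsafe,
                 const std::vector<std::pair<Entity, Entity>> &PointerFlow) {
  std::set<Entity> Group{Unsafe};
  std::vector<Entity> Worklist{Unsafe};
  while (!Worklist.empty()) {
    Entity Current = Worklist.back();
    Worklist.pop_back();
    for (const auto &[From, To] : PointerFlow) // edges are From -> To
      if (To == Current && Group.insert(From).second)
        Worklist.push_back(From);
  }
  return Group;
}

Applied to Foo:buff with the merged pointer_flow above, this yields exactly the make_spans group.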
Second Pass
For each entity to be changed to std::span, we will have to calculate the actual source code changes required. This mainly involves changing all its declarations, but we will also need to check all of its uses in the sources because std::span is not a drop-in replacement for pointers. For example, if a pointer is freed by delete[], we will have to change the operand to a call to the std::span::data() method. In other words, we will have to retrofit the new type to code that expects raw pointers.
We will feed the analysis result from the first pass into the summary extraction step of the second pass, create the source code changes, merge and deduplicate them, and apply them.
Pointer Escape Analysis
Safe interoperation between C or C++ and memory safe languages like Swift is an area where having the ability to infer which parameters escape versus which don’t is generally useful. Both compile-time analysis and a source code rewriting tool that would correctly add attributes to function parameters on C/C++ APIs might be valuable.
Suppose we want to implement escape analysis for pointer function parameters. Here’s how that would work with the proposed framework.
To keep the length of this scenario appropriate for the RFC, we’ll limit its scope to:
C language only
model the first dimension of each pointer only
assume no two pointers ever alias
This makes it only a toy-level example; the real analysis will need to understand more than this limited scope. We will also focus only on the aspects relevant to the RFC and will not elaborate on the operations where the pointer escapes.
First, we will summarize per function parameter of pointer type:
escape operations on the parameter that are not function calls, such as a store to a global variable, a return statement, etc. (or an expression involving the pointer, like a cast or pointer arithmetic)
function calls passing the pointer (or an expression like above)
Then we will link entities and the last step will be summary analysis.
Let’s take this code base as an example:
// foo.c
void set(int* ptr);
void log(int* ptr);
void foo(int* ptr) {
set(ptr);
log(ptr);
}
// set.c
int* global_ptr = NULL;
void set(int* ptr) {
global_ptr = ptr;
}
// log.c
#include <stdio.h>
void log(int* ptr) {
if (ptr)
printf("Value: %d\n", *ptr);
}
// Makefile
...
foo_app: foo.o set.o log.o main.o
clang foo.o set.o log.o main.o -o foo_app
...
Summary Extraction
We will extract summaries for each TU with this data:
// foo.summ
definitions : [
{ func : foo.c:foo,
ptr_params : [ foo.c:foo:ptr ]
}
]
escapes : [ ]
argument_passing : [
{ arg : foo.c:foo:ptr, param : foo.c:set:ptr },
{ arg : foo.c:foo:ptr, param : foo.c:log:ptr }
]
// set.summ
definitions : [
{ func : set.c:set,
ptr_params : [ set.c:set:ptr ]
}
]
escapes : [ set.c:set:ptr ]
argument_passing : [ ]
// log.summ
definitions : [
{ func : log.c:log,
ptr_params : [ log.c:log:ptr ]
}
]
escapes : [ ]
argument_passing : [ ]
Summary Linking
We will link entities from different TU summaries. Note: summary linking only relates the entities; summary data is left untouched.
// foo.summ
definitions : [
{ func : foo,
ptr_params : [ foo:ptr ]
}
]
escapes : [ ]
argument_passing : [
{ arg : foo:ptr, param : set:ptr },
{ arg : foo:ptr, param : log:ptr }
]
// set.summ
definitions : [
{ func : set,
ptr_params : [ set:ptr ]
}
]
escapes : [ set:ptr ]
argument_passing : [ ]
// log.summ
definitions : [
{ func : log,
ptr_params : [ log:ptr ]
}
]
escapes : [ ]
argument_passing : [ ]
Summary Analysis
The analysis would start with merging summary data:
definitions : [
{ func : foo,
ptr_params : [ foo:ptr ]
},
{ func : set,
ptr_params : [ set:ptr ]
},
{ func : log,
ptr_params : [ log:ptr ]
}
]
escapes : [ set:ptr ]
argument_passing : [
{ arg : foo:ptr, param : set:ptr },
{ arg : foo:ptr, param : log:ptr }
]
We will use enum {escapes, doesnt_escape, unknown} to model pointer parameters.
We make a list of pointer parameters across summaries and initialize their associated data to unknown:
foo:ptr : unknown,
set:ptr : unknown,
log:ptr : unknown
Then we will repeatedly iterate over all the individual summaries and update the data for parameters until we reach a fixed point. We will learn during the first pass that:
set:ptr escapes
log:ptr is neither passed to another function nor escapes
foo:ptr : unknown,
set:ptr : escapes,
log:ptr : doesnt_escape
We will continue the iteration. We now see that set:ptr escapes, and we know that foo:ptr is passed to it, which means it escapes too.
foo:ptr : escapes,
set:ptr : escapes,
log:ptr : doesnt_escape
After this iteration we’ve reached the fixed point and have the analysis result.
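A minimal sketch of this fixed-point computation, with hypothetical types; it assumes every parameter a pointer is passed to is itself in the analyzed set:

#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class EscapeState { Escapes, DoesntEscape, Unknown };
using Param = std::string; // stands in for a linked entity identifier

// A parameter escapes if it escapes directly or is passed to an escaping
// parameter; it doesn't escape if it neither escapes directly nor is
// passed to any parameter that might escape.
std::map<Param, EscapeState>
solve(const std::set<Param> &Params, const std::set<Param> &DirectEscapes,
      const std::vector<std::pair<Param, Param>> &ArgPassing) {
  std::map<Param, EscapeState> State;
  for (const Param &P : Params)
    State[P] = EscapeState::Unknown;

  bool Changed = true;
  while (Changed) { // iterate until a fixed point is reached
    Changed = false;
    for (const Param &P : Params) {
      if (State[P] != EscapeState::Unknown)
        continue;
      EscapeState New = DirectEscapes.count(P) ? EscapeState::Escapes
                                               : EscapeState::DoesntEscape;
      for (const auto &[Arg, Callee] : ArgPassing) {
        if (Arg != P || New == EscapeState::Escapes)
          continue; // only passings of P matter; escapes is final
        if (State[Callee] == EscapeState::Escapes)
          New = EscapeState::Escapes;
        else if (State[Callee] == EscapeState::Unknown)
          New = EscapeState::Unknown;
      }
      if (New != EscapeState::Unknown) {
        State[P] = New;
        Changed = true;
      }
    }
  }
  return State;
}

On the example above, this converges to the result shown after two passes over the summaries.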
Alternatives considered
Merging ASTs with ASTImporter
The existing facilities for cross-TU analysis in clang rely on importing and combining AST data for all analyzed code in memory using the ASTImporter. For the framework, we are envisioning higher level abstractions than ASTs. Static analysis involves analyzing programs at the semantic level and is not usually concerned with maintaining a precise representation of the as-written source code of a program, which includes details such as re-declarations of functions and classes.
We considered using the ASTImporter because it also addresses the problem of representing a program across translation units. However, the clang AST has a large memory footprint, and the serialized AST also has a large disk space footprint. We think the requirement to load all ASTs into memory will be prohibitive for the analysis of large projects and whole systems. We think the disk space footprint of serialized ASTs will be prohibitive too.
The ASTImporter also does not provide ways to model important aspects of how software is built. The ASTImporter relies on simple heuristics to model linking and build artifact propagation. These heuristics do not model some aspects of linking, such as the ability to rename exported symbols or to have multiple exported names.
We see the ability to accurately resolve entity identities across the build as necessary for analyses that require precision. To achieve that, we imagine the post-compilation stages of the software build need to be represented with higher accuracy.
Compilation Database
While a lot of source code processing tools successfully use a compilation database, we find the level of precision insufficient for the use cases we have in mind. A compilation database doesn’t accurately model either linking or dynamic loading. Generally, the negative effects of that are negligible for analysis of small codebases but prohibitive for large complex multi-component projects. We decided to design the framework in a way that it can model software builds accurately.
Build Modeling in the Framework
We explored ideas that could allow the framework to be more loosely coupled with build systems. For example, we were considering using file hashes of binary build artifacts so the framework would be able to connect object files with libraries and executables to understand how source code is related through post-compilation build phases. After considering some more complex scenarios such as debug symbol stripping, code signing or universal binaries we decided that modeling software build outside of the build system is too complex and the potential advantage is not worth the effort. We decided to delegate the responsibility to define and provide identifiers of compilation instances, etc. to the external tools that will be driving the analysis tools built on top of the framework, such as build systems.
Implementing Entity Linking in Linkers
Since we eventually want to model more advanced linker features like weak symbols, symbol visibility, etc., it might appear appealing to claim that linkers should be responsible for entity linking. However, we also need to “link” kinds of entities that are out of scope for linkers. We decided to implement entity linking in a standalone tool.
However, once we start supporting more advanced linking scenarios it might become inevitable to consume some information from linkers, and it is possible that the framework might need some kind of data that linkers currently do not expose.