NeverSawUs

static analysis of javascript programs using a stack machine

In this post, I explore the problem of statically analyzing Node.js applications, discuss a method for approach the problem, and introduce a tool, [estoc][], for performing fast static analysis.

Working on the io.js project has given me a fresh perspective on the need for this class of tools. When evaluating changes to be merged into the project, despite the extensive test suite it remains difficult to predict whether a given change will break code in the npm ecosystem. Similarly, it is hard to approach deprecating a given API with confidence, because usage metrics are not readily available on a per-method basis. This "fog of war" makes it hard to execute semver-based releases with confidence, since downstream breakage can mean the difference between a major version bump and a patch bump.

The brute force method of running every proposed change against every package in npm is not viable, and becomes less viable every day as the registry grows. To that end, we need to be able to search for API method usage and patterns of usage easily, in order to limit the number of tests that need to be run against a given changeset.

Ideally, we would collect the following information:

  • External object access: var url = require('url')
  • External object property lookup: url.parse
  • External object property deletion: delete url.parse
  • External object method invocation: url.parse(values)
    • Invocation should include the target of any thrown exception,
    • the types of arguments passed,
    • and the return value of the invocation should itself be tracked.

This information should be sufficient allow construction of a searchable index of API usage.

Definitions

We will define a "package" to be a set of files and directories, including one top-level "package.json" file, published to npm. A "module" is an individual JavaScript file that may export at most one value, and import other modules via require. The relationship that a require() call represents between the calling module and the target is called a "dependency." An internal dependency is one between two modules in the same package. An external dependency is one between a module in one package and a target in a different package.

Packages may have one or more "entry points": these are modules that are likely to be executed when the package is run as a command, or is the direct target of a dependency. These are determined by the "bin" and "main" keys in package.json, defaulting to the only available .js file in the package if those keys are not present. Note that any modules in a package may be the target of a require statement – this is called a "deep requirement".

This article assumes the reader is familiar with Abstract Syntax Trees ("ASTs"), and methods for obtaining them from source text.

Method

A control flow graph can be obtained by visiting individual nodes of an AST and connecting them together in the order of operations defined by the language. Each node represents an operation, and each directed edge represents the flow of control from that operation. That graph can then be simplified – straightline runs of operations can be turned into single blocks. That is, operations with only one exit point or one entry point may be absorbed into neighboring blocks until only blocks with N≠1 incoming or outgoing edges remain. For the dual purposes of analysis and illustration, it is ideal to reduce the total number of blocks when possible. That is, when building the CFG it is desirable to avoid unneccessarily splitting blocks where possible. In JavaScript, this means a CFG constructor should attempt to avoid adding exception edges that will never be taken during the course of the program.

However, in JavaScript, most operations may throw exceptions when presented with certain types of data. There are many "safe" combinations of types and operations as well; in practice, these safe combinations often form the normal, expected behavior of a program. It is thus desirable to track state so that types of values are known when processing a given operation. If the input types to a given operation are known, then it may be possible to safely omit an exception edge. Where that information is unavailable, a conservative approach should be taken and any viable exception edges should be added. See Fig. 1 for an illustration of this.

A side effect of building the control flow graph generator in this fashion is that it also functions as an execution simulator. In the following subsections I will touch on specific problems and solutions for this approach; as well as how this data is exposed as a analyzer for packages.

Unknown Values

Some values may not be known given the source text alone. These values are pushed onto the stack as Unknown instances. Operations accepting an Unknown value should treat it conservatively – if the operation can throw an exception, it should. If the nature of an operation allows for it, assumptions can be made about Unknown values after the first exception has been thrown.

As Fig. 2 illustrates: anUnknownValue.x = 3; anUnknownValue.y should only throw a single exception edge. The member lookup of .x receives an Unknown and draws an exception edge, and resume operation. When control reaches anUnknownValue.y, anUnknownValue is known to be defined, and non-null. Note that values obtained via lookup will initially be new Unknown instances – .y will be a new Unknown instances, while .x will be 3.

Similarly, anUnknownValue(); anUnknownValue() only adds a single exception edge. If control reaches the second invocation, the first invocation succeeded.

Unknown values may also be produced by dynamic lookup operations where the key is unknown, or by runtime functions when not provided adequate data to return a known value.

Object Values, Scopes, & Tracking Classes

Object values are of prime concern when analyzing JavaScript code. The presence or absence of a key in a given object can greatly affect the outcome of a program.

These values are treated as collections of Names, that may have a [[Prototype]] link, and a "hidden class ID." Name's point at other Value instances, and are accessed internally via valueObject.getprop("attribute").

Names track the number of times they have been assigned, the "current source object" from which they were obtained, as well as the value they are currently point at. All values track the set of names by which they are reachable.

Operations that require an LValue, like update expressions (++x, x++), assignment expressions (x = 2, x += 13), and invocations (x.y()) may induce subsequent operations to push a Name onto the stack instead of the Value to which they point. objValue.getprop takes an optional parameter denoting whether or not it should look up the prototype chain for an attribute Name.

Scopes are a specialization of Object values – parent scopes are referred to by child scopes via the [[Prototype]] link. Function values capture a reference to scope object values when their definition is visited. Two object instances are held by the CFG generator – builtins and global. global represents both the global scope and whatever global object may be exposed into the runtime (for example, window.) builtins is not accessible directly from JavaScript code and holds special pointers to runtime-required values (e.g., [[ArgumentsProto]]).

The hidden class id (HCID) of object values is globally tracked: all object values start as one of a set of "known" hidden class IDs representing the builtin types of the language. As properties are added to and deleted from an object value, their hidden class ID is advanced. The order of these operations determines the class ID – two objects that add the same attributes in the same order will have the same HCID. This HCID can be used to track rich type information. Fig. 3 illustrates the HCID recording process.

Simulating the execution of a JavaScript program in this fashion entails building and retaining an simulation of the object graph of the running program. For the purposes of static analysis, it is sufficient to simply return the appropriate types from the call and add the appropriate exception edges to the call. When building a intermediate representation, the runtime should emit expected operations (or intrinsics), which is a more involved process.

function invocation

Because the object graph is simulated, it is possible that a real function value will be available when visiting a call site. When this occurs, the control flow graph generator will simulate the call and visit the target function body. Subsequent visits may re-visit the function body. The overall effect is similar to aggressive inlining.

In order to support recursion, we must simulate a call-stack. If the function to be called already exists on the stack, a new edge is drawn to the entry point of the function and CFG generator continues past the call-site, avoiding potentially infinite loops due to recursion.

The runtime defines Function.prototype.{apply,call,bind}, so those operations are forwarded appropriately. Unfortunately in cases where the array has been modified the calling arguments are replaced by Unknown values (up to the arity of the function being called.)

branches and their effect of values

All logical branches are simulated. This poses a problem, illustrated in Fig. 4. A value that is mutated in a branch does not have a "determined" value after the branch – it could either type A or type B.

To allow for this, when a branch is entered, a "membrane" is inserted into the scope chain. Name lookups that pass through the membrane will result in a "wrapped" value or name being returned. Non-mutative operations on that wrapped value act exactly the same as if operating on a non-wrapped value. The wrapped value will return wrapped names. Wrapped names proxy their unwrapped counterparts and ensure that the membrane wrapping applies to all objects accessible via the originally wrapped value. Wraps have a branch ID, which corresponds to a call to branchOpen. When branchClose(id) is called, all wrapped values and names in the stack are "unwrapped" – replaced by the original value or name.

If a mutative operation occurs on a wrap, all extant external references to the target object will be made to point at a new composite value, an "Either." References internal to the branch will replace the wrapped value with the new value, whatever that may be. The outer "Either" will contain the original value (prior to the opening of the branch), and the new value (after the mutative operation). Operations performed on the Either value will produce new Either objects, based on the result of performing the operation on each of the constituent values.

The values contained by an Either are called outcomes. Property lookups on Either values will return either a Name, if the property is shared between all outcomes, or an EitherName. An EitherName may be assigned – outcomes missing the property will grow the property. An EitherName's value before assignment is an Either containing all potential outcomes of the lookup.

Calls to Either values will attempt to call all consituent outcomes, and return an "Either" value as a result. These calls will automatically happen through a Membrane, as the function call should not mutate outer values directly. Fig. 5 illustrates this behavior.

Simulation & Spies

By augmenting the runtime with the ability to require other packages and modules, it is possible to simulate, roughly, the actions a given package will take when first required. However, this does not prove sufficient for full analysis – the initial transfer of control to a package often only serves to create values to return to the requiring package. Since we have chosen to deal with packages on an individual basis we are attempting to ignore any potential use of the current package by requiring packages, instead focusing on usage of other packages within the package-at-hand.

We can gain this information by simulating the usage of the package-at-hand. During the CFG construction phase, the analyzer stores a queue of function values as they are defined, removing them as they are visited by the CFG. Once the CFG phase is completed, only un-visited functions remain in the queue. The first un-executed function in the queue may be popped off the queue and a call simulated, before restarting the CFG construction. Simulated calls pass N+1 Unknown arguments – one for "this", and one for every named function argument. The queue is retained and updated as visited calls remove functions and new function definitions are added.

Our require implementation will transfer control between individual CFG builders as necessary for files within the package. The queue of unexecuted functions is shared between all files and their CFG builders. Once the queue is emptied and the control flow graph construction ends, the analysis will have visited all functions defined in the package that are reachable from the package's entry points.

To get usage information from this system, we inject spy values into the CFG builder's stack machine when external packages are required via our require implementation. When properties are accessed from a spy, a new spy is returned, augmented by the path by which it was accessed and the source location at which the access occurred. When a spy is called or instantiated, that action is reported and a new spy is inserted onto the stack, representing the return value.

Many node packages operate via inversion of control, where a function is passed from the client program to be called by the package upon completion of the task. When such a function is passed to a call to a spy value, it is marked with the source spy value. When a call to that inversion-of-control ("ioc") function is simulated, instead of unknown arguments and context, the context and arguments are now spy values that descend from the spy value that was called. The usage of ioc arguments is tracked in this fashion.

Thus, these spies report the following usage:

  • The names to which they are assigned,
  • When they are loaded from scope,
  • When their properties are accessed,
  • When a call to them is simulated,
  • The value arguments that call was made with, which may include "rich" values with HCIDS,
  • The use of their return values,
  • The source location to which they would throw an exception,
  • The use of arguments passed to inversion-of-control functions,
  • And the source locations, consisting of filename and line number, of all of the above.

analysis & results

Using [estoc][], I analyzed two popular packages on npm. One is [request][], and the other is [eslint][]. The results of the analysis are [here][]. I'd like to call out some usage that I feel exemplifies the power of the approach this tool uses.

concerns

Where does it fall down? looping throws it off, calls based on looping give reduced information (but still useful) branching current operates branch-by-branch, no capture at branch outset es2015 literals – a bit hard to keep up with! a work in progress. eval / Function * computed lookup