This design borrows some features from a previous prototype of a new pylint, which in this paper I'll call **new-lint**. New-lint used a data-driven algorithm (I called it a sudoku-like algorithm) to do lint-like checking. Many features of this data-driven algorithm will reappear below.
New-lint was an “interesting” failure. It showed that a data-driven approach to lint-like checking is feasible. Alas, its performance was comparable to that of pylint. This is strong evidence, imo, that the performance of pylint cannot be significantly improved without a major change in strategy.
To get significantly better performance, an **incremental** approach must be used. Such an algorithm computes diffs between old and new versions of files and generates the minimum needed additional checks based on those diffs. My intuition is that inc-lint could be 10 to 100 times faster than pylint in many situations. Inc-lint should be fast enough so that it can be run any time a program changes.
As an extreme example of an incremental approach, inserting a comment into a program should require *no* additional analysis at all. The only work would be to notice that the ast's (parse trees) of the changed file have not changed. More commonly, changes that do not alter the data defined by a module can have no global effects on the program. Inc-lint would check only the changed file. But these checks will happen in the presence of cached data about all other parts of the program, so we can expect such checks to be much faster than pylint's checks.
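As a concrete illustration of the comment case (hypothetical code, not part of inc-lint), Python's own parser discards comments, so two sources that differ only in comments produce identical parse trees:

```python
import ast

OLD = "x = 1\ny = x + 2\n"
NEW = "x = 1\n# a new comment\ny = x + 2\n"

# Python's parser discards comments, so the dumps of the two trees match.
# ast.dump excludes line/column attributes by default, so even the shifted
# line numbers in NEW do not matter.
same = ast.dump(ast.parse(OLD)) == ast.dump(ast.parse(NEW))
print(same)  # True
```

This is exactly the "no additional analysis" case: the changed file's tree is unchanged, so nothing downstream needs to be redone.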
Inc-lint seemed impossible
It was far from obvious that inc-lint was feasible. Indeed, the difficulties seemed overwhelming. Aside from adding or deleting comments, any change to a python file can have ripple effects throughout an entire program. What kind of bookkeeping could possibly keep track of all such changes? For example, diffs based on ast's could not possibly work: the number of cases to consider would be too large. Building incremental features into pylint also seemed hopeless. The present pylint algorithms are extremely complex—adding more complexity into pylint would be a recipe for failure. In spite of these difficulties, a new design gradually emerged.
Global and Local analysis
The essential first step was to accept the fact that some checks must be repeated every time a file changes. These checks include checks that depend on the exact order of statements in a file. For example, the check that a variable is used before being defined is such a check. The check that a 'break' statement appears in an appropriate context is another such check. Otoh, many other checks, including *all* data-flow checks do *not* depend on the order in which definitions appear in files.
The distinction between order-dependent and order-independent checks is the key organizing principle of the design. It led almost immediately to the design's fundamental division: local analysis and global analysis.
**Local analysis** depends on the order of statements in a Python file. Inc-lint completely redoes local analysis for a file any time that file changes. Local analysis performs all checks that depend on the exact form of the parse (ast) trees. As we shall see, the output of local analysis is data that do *not* depend on the order of statements in the parse tree.
**Global analysis** uses the order-independent data produced by local analysis. Global analysis uses a data-driven algorithm: only the *existence* of the data matters, how the data is defined is irrelevant.
This distinction makes an incremental design possible. We don't recompute global checks based on diffs to parse trees. That would be an impossible task. Instead, we recompute global checks based on diffs of order-independent data. This is already an important optimization: program changes that leave order-independent data unchanged will not generate new lint checks.
Contexts and symbol tables
A **context** is a module, class or function. The **contents** of a context are all the (top-level) variables, classes and functions of that context. For example, the contents of a module context are all the top-level variables, classes and functions of that module. The top-level classes and functions of a module are also contexts: contexts may contain **sub-contexts**.
**Symbol tables** are the internal representation of a context. Contexts may contain sub-contexts, so symbol tables can contain **inner symbol tables**. In other words, symbol tables are recursive structures. The exact form of symbol tables does not matter except for one essential requirement—it must be possible to compare two symbol tables easily and to compute their diffs: the list of symbols (including inner contexts) that appear in one symbol table but not the other.
Local analysis produces the **module symbol table** for the file being analyzed. The module symbol table and all its inner tables describe every symbol defined anywhere in the module. Local analysis is run (non-incrementally) whenever a file changes, so the module symbol table is recreated “from scratch” each time.
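To make the discussion concrete, here is one possible shape for such a recursive symbol table. This is purely illustrative: the class and field names are my own invention, not inc-lint's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SymbolTable:
    """A context's symbols; entries in `inner` represent sub-contexts."""
    name: str
    kind: str                                    # 'module', 'class', or 'function'
    symbols: dict = field(default_factory=dict)  # name -> info (types, etc.)
    inner: dict = field(default_factory=dict)    # name -> SymbolTable

# A module defining a top-level variable x and function f:
mod = SymbolTable('m', 'module', symbols={'x': 'int', 'f': 'function'})
mod.inner['f'] = SymbolTable('f', 'function', symbols={'arg': None})
```

Because the structure is just nested dictionaries of names, two such tables can be compared key by key, which is the "essential requirement" mentioned above.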
Deductions and the data-driven algorithm
The second output of local analysis is a list of **deductions**, the data that drive the data-driven algorithm performed by global analysis. Deductions arise from assignment statements and other statements. You can think of deductions as being the data-flow representation of such statements.
Important: deductions *use* the data in symbol tables, and deductions also *set* the data in symbol tables. The data-driven algorithm is inherently an iterative process.
For example, an assignment `a = b` implies that the set of types that symbol 'a' can have is a superset of the set of types that symbol 'b' can have. One kind of deduction “completes” the type in the symbol table for 'a' when all types on the right-hand side (RHS) of every assignment to 'a' have been deduced. This deduction **fires** only when all the right-hand sides of assignments to 'a' are known. Naturally, 'a' itself may appear in the RHS of an assignment to another variable 'c'. Once the possible types of 'a' are known, it may be possible to deduce the type of 'c'.
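The spirit of this deduction can be sketched as a simple fixpoint iteration. The representation of assignments and types below is invented for illustration; real deductions would be richer, and the "fires only when complete" mechanism is simplified here to monotone propagation:

```python
# target -> list of RHS items: either a literal type or another symbol.
assignments = {
    'a': [('sym', 'b')],
    'b': [('lit', 'int'), ('lit', 'str')],
    'c': [('sym', 'a'), ('lit', 'float')],
}
types = {name: set() for name in assignments}

changed = True
while changed:                       # iterate: one deduction may enable others
    changed = False
    for target, rhs in assignments.items():
        new = set(types[target])
        for kind, value in rhs:
            if kind == 'lit':
                new.add(value)
            else:                    # target's types are a superset of value's
                new |= types.get(value, set())
        if new != types[target]:
            types[target] = new
            changed = True

print(sorted(types['c']))  # ['float', 'int', 'str']
```

Note how 'c' cannot be completed until 'a' is, and 'a' cannot be completed until 'b' is: the iterative, data-driven character mentioned above.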
Another kind of deduction checks that operands have compatible types. For example, the expression 'x' + 'y' is valid only if some '+' operator may be applied to 'x' and 'y'. This is a non-trivial check: the meaning of '+' may depend on an __add__ function, which in turn depends on the types of 'x' and 'y'. In any case, these kinds of deductions result in various kinds of lint checks.
Global analysis attempts to satisfy deductions using the information in symbol tables. As in new-lint, the data-driven algorithm will start by triggering **base deductions**, deductions that depend on no other deductions. Satisfied deductions may trigger other deductions. When all possible deductions have been made, the remaining unsatisfied deductions generate error messages.
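A minimal sketch of such a worklist pass follows. The deduction format and all names are hypothetical; the point is only the control flow: base deductions seed the queue, satisfied deductions may satisfy others, and whatever remains unsatisfied becomes an error message.

```python
from collections import deque

# Each deduction lists the facts it needs and the fact it establishes.
deductions = [
    {'needs': set(),                  'gives': 'type(x)'},   # base deduction
    {'needs': {'type(x)'},            'gives': 'type(y)'},
    {'needs': {'type(y)', 'type(z)'}, 'gives': 'x + z ok'},  # never fires
]

known = set()
queue = deque(d for d in deductions if not d['needs'])  # base deductions
pending = [d for d in deductions if d['needs']]

while queue:
    known.add(queue.popleft()['gives'])   # a deduction fires
    still = []
    for d in pending:
        if d['needs'] <= known:
            queue.append(d)               # satisfied: may trigger others
        else:
            still.append(d)
    pending = still

errors = [d['gives'] for d in pending]    # unsatisfied -> lint messages
print(errors)  # ['x + z ok']
```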
Large programs will contain many thousands of deductions. We cannot afford to rerun all those deductions every time a change is made to a program. Instead, we must compute the (smallest) set of deductions that must be re-verified.
To compute the new deductions, we need a way of comparing the data contained in the changed source files. Comparing (diffing) parse trees will not work. Instead, inc-lint will compare symbol tables.
Happily, comparing symbol tables is easy. Any two source files that define the same contexts will be equivalent (isomorphic), regardless of how those contexts were defined. The diff algorithm will be recursive, mirroring the recursive structure of symbol tables. We expect the diff algorithm to be simple and fast.
The output of the diff will be a list of created and destroyed symbols for any context. Changing a name (in any particular context) is equivalent to deleting the old name and creating a new name.
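Assuming contexts are represented as nested dicts of name-to-definition (an assumption made only for this sketch), the recursive diff might look like:

```python
def diff_context(old, new, path=''):
    """Return (created, destroyed) symbol lists, recursing into sub-contexts."""
    created = [path + n for n in new if n not in old]
    destroyed = [path + n for n in old if n not in new]
    for name in old.keys() & new.keys():
        o, n = old[name], new[name]
        if isinstance(o, dict) and isinstance(n, dict):  # a sub-context
            c, d = diff_context(o, n, path + name + '.')
            created += c
            destroyed += d
    return created, destroyed

old = {'x': 'int', 'f': {'a': 'int'}}
new = {'y': 'int', 'f': {'a': 'int', 'b': 'str'}}
created, destroyed = diff_context(old, new)
print(created, destroyed)  # ['y', 'f.b'] ['x']
```

Renaming 'x' to 'y' shows up as one destroyed symbol and one created symbol, exactly as described above.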
Caching and updating
Inc-lint will cache symbol tables and deductions for all files. This allows us to skip the local analysis phase for all unchanged files. However, changes made in the local analysis of a file may affect deductions in many *unchanged* files.
The update phase requires that we be able to find the “users” (referrers) of all data that might change during local analysis. Thus, we expect symbol tables and deductions to use doubly (or even multiply) linked lists. It should be straightforward (and fast!) to update these links during the update phase.
Diffing symbol tables will result in a list of changes. When applying those changes, we want to update the *old* (cached) copy of each symbol table. This will allow references to unchanged items in the symbol table to remain valid. Of course, references to *changed* items will have to be deleted to avoid “dangling pointers”. By taking care to update links we can use typical Python references (pointers) to symbol table entries and deductions. This avoids having to relink pointers to new symbol tables.
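One way to maintain such back-links is sketched below. Again, the class names are invented; the point is that each entry records its referrers, so a change invalidates exactly those deductions and no pointer relinking is needed.

```python
class Entry:
    """A symbol table entry with back-links to the deductions that read it."""
    def __init__(self, name):
        self.name = name
        self.referrers = set()

class Deduction:
    """A deduction registers itself on every entry it uses."""
    def __init__(self, label, uses):
        self.label = label
        for entry in uses:
            entry.referrers.add(self)   # the doubly linked structure

x = Entry('x')
y = Entry('y')
d1 = Deduction('type(x) deduced', [x])
d2 = Deduction('x + y ok', [x, y])

# Changing 'x' tells us precisely which deductions must be re-verified:
stale = {d.label for d in x.referrers}
print(sorted(stale))  # ['type(x) deduced', 'x + y ok']
```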
Here are the essential features of the design:
1. Inc-lint performs local analysis for all changed files in a project. This phase does all lint checks that depend on the order of statements in a Python program. The output of local analysis is a new symbol table for each changed file, and a list of deductions that must be proved for the changed file.
2. A diff phase compares the old (cached) and new versions of the symbol table. This diff will be straightforward and fast because symbol tables will be designed to be easily diffed. As an optimization, we can bypass the diff if the old and new parse trees are “isomorphic”. For example, files that differ only in whitespace or comments will have isomorphic ast trees.
3. An update phase inserts and deletes cached (global) deductions. Changes to a symbol table may result in changes to deductions in arbitrarily many files of the project. Thus, all symbol table entries and deductions will be heavily linked.
4. After all symbol table entries and deductions have been updated, a data-driven algorithm will attempt to satisfy all unsatisfied deductions, that is, deductions that must be proven (again) because of changes to one or more symbol tables. These deductions correspond to the type-checking methods in pylint. At the end of this phase, still-unsatisfied deductions will result in error messages.
This design looks like the simplest thing that could possibly work. Indeed, it looks like the *only* reasonable design. For simplicity's sake, local analysis *must* be done afresh for all changed files. In contrast, global analysis depends only on deductions and symbol tables, neither of which depends on program order. Thus, we can easily imagine that deductions that depend on unchanged symbol table entries (symbols) will not need to be rechecked.
This design consists of largely independent parts or phases. Difficulties with one part will not cause difficulties or complexity elsewhere. This separation into independent phases and parts is the primary strength of the design. Like Leo's core modules, this design should remain valid even if various parts change significantly.
This design seeks to minimize the risks to the project. I believe it has accomplished this goal. It should be possible to demonstrate the design with relatively simple prototype code.
All comments are welcome.
Edward K. Ream
July 27, 2010