Tuesday, July 27, 2010

A design for inc-lint, an incremental pylint

This paper discusses the essential features of the design of an incremental pylint, or **inc-lint** for short. It covers only those aspects that are essential to the success of the project; that is, it is the highest-level design.

This design borrows some features from a previous prototype of a new pylint, which in this paper I'll call **new-lint**. New-lint used a data-driven algorithm (I called it a sudoku-like algorithm) to do lint-like checking. Many features of this data-driven algorithm will reappear below.

New-lint was an “interesting” failure. It showed that a data-driven approach to lint-like checking is feasible. Alas, its performance was comparable to that of pylint. This is strong evidence, in my opinion, that pylint's performance cannot be significantly improved without a major change in strategy.

To get significantly better performance, an **incremental** approach must be used. Such an algorithm computes diffs between old and new versions of files and generates the minimum needed additional checks based on those diffs. My intuition is that inc-lint could be 10 to 100 times faster than pylint in many situations. Inc-lint should be fast enough so that it can be run any time a program changes.

As an extreme example of an incremental approach, inserting a comment into a program should require *no* additional analysis at all. The only work would be to notice that the ast (parse tree) of the changed file has not changed. More commonly, changes that do not alter the data defined by a module can have no global effects on the program. Inc-lint would then check only the changed file. But these checks will happen in the presence of cached data about all other parts of the program, so we can expect them to be much faster than pylint's checks.
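
To make this concrete, here is a minimal sketch (an illustration only, not inc-lint code) of how inc-lint might notice that a change is invisible to the parser. Comments never appear in Python's ast, so the two dumps compare equal:

```python
import ast

def asts_match(old_source, new_source):
    """Return True if the two sources produce identical parse trees."""
    old_tree = ast.parse(old_source)
    new_tree = ast.parse(new_source)
    # ast.dump omits line and column attributes by default, so sources that
    # differ only in comments or blank lines dump to the same string.
    return ast.dump(old_tree) == ast.dump(new_tree)

if __name__ == '__main__':
    before = "x = 1\ny = x + 2\n"
    after = "x = 1  # a new comment\n\ny = x + 2\n"
    print(asts_match(before, after))  # True: no further analysis is needed
```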

Inc-lint seemed impossible


It was far from obvious that inc-lint was feasible. Indeed, the difficulties seemed overwhelming. Aside from adding or deleting comments, any change to a Python file can have ripple effects throughout an entire program. What kind of bookkeeping could possibly keep track of all such changes? For example, diffs based on asts could not possibly work: the number of cases to consider would be too large. Building incremental features into pylint also seemed hopeless. The present pylint algorithms are extremely complex; adding more complexity to pylint would be a recipe for failure. In spite of these difficulties, a new design gradually emerged.

Global and local analysis


The essential first step was to accept the fact that some checks must be repeated every time a file changes. These are the checks that depend on the exact order of statements in a file. For example, the check that a variable is used before being defined is such a check. The check that a 'break' statement appears in an appropriate context is another. On the other hand, many other checks, including *all* data-flow checks, do *not* depend on the order in which definitions appear in files.

The distinction between order-dependent and order-independent checks is the key organizing principle of the design. This led almost immediately to the fundamental distinction of the design: local analysis and global analysis.

**Local analysis** depends on the order of statements in a Python file. Inc-lint completely redoes local analysis for a file any time that file changes. Local analysis performs all checks that depend on the exact form of the parse (ast) trees. As we shall see, the outputs of local analysis are data that do *not* depend on the order of statements in the parse tree.

**Global analysis** uses the order-independent data produced by local analysis. Global analysis uses a data-driven algorithm: only the *existence* of the data matters; how the data is defined is irrelevant.

This distinction makes an incremental design possible. We don't recompute global checks based on diffs to parse trees. That would be an impossible task. Instead, we recompute global checks based on diffs of order-independent data. This is already an important optimization: program changes that leave order-independent data unchanged will not generate new lint checks.

Contexts and symbol tables


A **context** is a module, class or function. The **contents** of a context are all the (top-level) variables, classes and functions of that context. For example, the contents of a module context are all the top-level variables, classes and functions of that module. The top-level classes and functions of a module are also contexts: contexts may contain **sub-contexts**.

**Symbol tables** are the internal representation of a context. Contexts may contain sub-contexts, so symbol tables can contain **inner symbol tables**. In other words, symbol tables are recursive structures. The exact form of symbol tables does not matter, except for one essential requirement: it must be possible to compare two symbol tables easily and to compute their diff, that is, the list of symbols (including inner contexts) that appear in one symbol table but not the other.
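
Here is a minimal sketch of such a recursive structure, assuming a set of names per context plus a mapping of inner contexts. The concrete representation in inc-lint may well differ; the point is only that two tables can be compared cheaply:

```python
class SymbolTable:
    """Symbols defined directly in one context, plus its sub-contexts."""

    def __init__(self, name):
        self.name = name
        self.symbols = set()  # names defined at this level
        self.inner = {}       # context name -> SymbolTable

    def define(self, name):
        self.symbols.add(name)

    def add_context(self, name):
        # The context's own name is also a symbol of the enclosing context.
        self.symbols.add(name)
        table = self.inner[name] = SymbolTable(name)
        return table

    def __eq__(self, other):
        # Recursive comparison: dict equality compares inner tables with ==.
        return self.symbols == other.symbols and self.inner == other.inner

    def __ne__(self, other):
        return not self == other

if __name__ == '__main__':
    old = SymbolTable('m')
    old.define('x')
    new = SymbolTable('m')
    new.define('x')
    new.define('y')
    print(old == new)  # False: 'y' was added
```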

For each changed file, local analysis produces the **module symbol table** for that file. The module symbol table and all its inner tables describe every symbol defined anywhere in the module. Local analysis is run (non-incrementally) every time a file changes, so the module symbol table is recreated “from scratch” each time.
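
For illustration only, here is a rough sketch of the kind of walk over the parse tree that collects the top-level names of a module. Real local analysis would also record sub-contexts, emit deductions, and perform the order-dependent checks; the function name module_symbols is hypothetical:

```python
import ast

def module_symbols(source):
    """Return the set of top-level names defined by `source`."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)            # a sub-context's name
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)    # a top-level variable
    return names

if __name__ == '__main__':
    print(module_symbols("x = 1\ndef f(): pass\nclass C: pass\n"))
    # The result contains 'x', 'f' and 'C'.
```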

Deductions and the data-driven algorithm


The second output of local analysis is a list of **deductions**, the data that drive the data-driven algorithm performed by global analysis. Deductions arise from assignments and other statements. You can think of deductions as the data-flow representation of such statements.

Important: deductions *use* the data in symbol tables, and deductions also *set* the data in symbol tables. Because firing one deduction can supply the data that another deduction needs, the data-driven algorithm is inherently an iterative process.

For example, an assignment a = b implies that the set of types that symbol 'a' can have is a superset of the set of types that symbol 'b' can have. One kind of deduction “completes” the type in the symbol table for 'a' when the types of all right-hand sides (RHS) of assignments to 'a' have been deduced. That is, this deduction **fires** only when all the right-hand sides of assignments to 'a' are known. Naturally, 'a' itself may appear in the RHS of an assignment to another variable 'c'. Once the possible types of 'a' are known, it may be possible to deduce the type of 'c'.
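
A small, hedged sketch of this kind of deduction, with hypothetical names: known_types maps each symbol to its deduced set of possible types, or None if the symbol has not yet been completed:

```python
def try_complete_assignment(target, rhs_symbols, known_types):
    """Fire the deduction for `target` once every RHS symbol is known.

    Returns True if the deduction fired and known_types was updated."""
    if any(known_types.get(s) is None for s in rhs_symbols):
        return False  # not all RHS types are known yet, so do not fire
    types = set()
    for s in rhs_symbols:
        types |= known_types[s]  # target's types are a superset of each RHS
    known_types[target] = types
    return True

if __name__ == '__main__':
    known = {'b': set(['int']), 'a': None, 'c': None}
    # a = b fires first; c = a can then fire because a's types are now known.
    try_complete_assignment('a', ['b'], known)
    try_complete_assignment('c', ['a'], known)
    print(known)  # both 'a' and 'c' are now deduced to be {'int'}
```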

Another kind of deduction checks that operands have compatible types. For example, the expression 'x' + 'y' is valid only if some '+' operator may be applied to 'x' and 'y'. This is a non-trivial check: the meaning of '+' may depend on an __add__ function, which in turn depends on the types of 'x' and 'y'. In any case, these kinds of deductions result in various kinds of lint checks.

Global analysis attempts to satisfy deductions using the information in symbol tables. As in new-lint, the data-driven algorithm will start by triggering **base deductions**, deductions that depend on no other deductions. Satisfied deductions may trigger other deductions. When all possible deductions have been made, the remaining unsatisfied deductions generate error messages.
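
The loop itself can be sketched as a simple worklist algorithm. This is an illustration only, under the assumption that each deduction exposes a try_fire method that returns True once its prerequisites are available in the symbol tables:

```python
def run_deductions(deductions):
    """Repeatedly fire deductions until no further progress is possible.

    Returns the deductions that could never be satisfied; these become
    error messages."""
    pending = list(deductions)
    progress = True
    while progress:
        progress = False
        still_pending = []
        for d in pending:
            if d.try_fire():      # may read and update symbol-table data
                progress = True   # firing one deduction may enable others
            else:
                still_pending.append(d)
        pending = still_pending
    return pending                # unsatisfied deductions: report as errors

if __name__ == '__main__':
    class Base:
        """A base deduction: it depends on nothing and fires at once."""
        fired = False
        def try_fire(self):
            self.fired = True
            return True

    class Needs:
        """Fires only after another deduction has fired."""
        def __init__(self, other):
            self.other = other
        def try_fire(self):
            return self.other.fired

    class Never:
        """Can never be satisfied; it will be reported as an error."""
        def try_fire(self):
            return False

    base = Base()
    failures = run_deductions([Needs(base), base, Never()])
    print(len(failures))  # 1: only the Never deduction remains
```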

Diffs


Large programs will contain many thousands of deductions. We cannot afford to rerun all those deductions every time a change is made to a program. Instead, we must compute the (smallest) set of deductions that must be re-verified.

To compute the new deductions, we need a way of comparing the data contained in the changed source files. Comparing (diffing) parse trees will not work. Instead, inc-lint will compare symbol tables.

Happily, comparing symbol tables is easy. Any two source files that define the same contexts will have equivalent (isomorphic) symbol tables, regardless of how those contexts were defined. The diff algorithm will be recursive, mirroring the recursive structure of symbol tables. We expect the diff algorithm to be simple and fast.

The output of the diff will be a list of created and destroyed symbols for any context. Changing a name (in any particular context) is equivalent to deleting the old name and creating a new name.
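
For illustration, here is a sketch of such a recursive diff, using plain nested dicts as a stand-in for symbol tables: each context maps a name either to None (an ordinary symbol) or to another dict (a sub-context). Note how renaming a symbol shows up as a delete plus a create:

```python
def diff_contexts(old, new, prefix=''):
    """Yield ('created' | 'destroyed', qualified_name) pairs."""
    for name in new:
        if name not in old:
            yield 'created', prefix + name
    for name in old:
        if name not in new:
            yield 'destroyed', prefix + name
    # Recurse into contexts that are present in both versions.
    for name in old:
        if (name in new and
                isinstance(old[name], dict) and isinstance(new[name], dict)):
            for change in diff_contexts(old[name], new[name],
                                        prefix + name + '.'):
                yield change

if __name__ == '__main__':
    old = {'spam': None, 'C': {'old_method': None}}
    new = {'spam': None, 'C': {'new_method': None}}
    print(sorted(diff_contexts(old, new)))
    # [('created', 'C.new_method'), ('destroyed', 'C.old_method')]
```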

Caching and updating


Inc-lint will cache symbol tables and deductions for all files. This allows us to avoid the local analysis phase for all unchanged files. However, the changes found by the local analysis of one file may affect deductions in many *unchanged* files.

The update phase requires that we be able to find the “users” (referrers) of all data that might change during local analysis. Thus, we expect symbol tables and deductions to use doubly (or even multiply) linked lists. It should be straightforward (and fast!) to update these links during the update phase.
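
A small sketch of the kind of links involved, with hypothetical class names: each symbol-table entry records the deductions that refer to it, so a change to the entry can locate exactly the deductions that must be re-verified, even when they live in unchanged files:

```python
class Entry:
    """One symbol-table entry, with back-links to its users."""
    def __init__(self, name):
        self.name = name
        self.referrers = set()  # deductions that read this entry

class Deduction:
    """A deduction that uses one or more symbol-table entries."""
    def __init__(self, description, uses):
        self.description = description
        self.uses = list(uses)
        for entry in uses:
            entry.referrers.add(self)  # forward link plus back link

def invalidate(entry):
    """Return the deductions that must be re-verified when `entry` changes."""
    return set(entry.referrers)

if __name__ == '__main__':
    a = Entry('a')
    d = Deduction("the type of 'c' depends on 'a'", [a])
    print([x.description for x in invalidate(a)])  # the one deduction above
```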

Preserving pointers


Diffing symbol tables will result in a list of changes. When applying those changes, we want to update the *old* (cached) copy of each symbol table. This will allow references to unchanged items in the symbol table to remain valid. Of course, references to *changed* items will have to be deleted to avoid “dangling pointers”. By taking care to update links we can use typical Python references (pointers) to symbol table entries and deductions. This avoids having to relink pointers to new symbol tables.

Recap


Here are the essential features of the design:

1. Inc-lint performs local analysis for all changed files in a project. This phase does all lint checks that depend on the order of statements in a Python program. The output of local analysis is a new symbol table for each changed file, and a list of deductions that must be proved for the changed file.

2. A diff phase compares the old (cached) and new versions of the symbol table. This diff will be straightforward and fast because symbol tables will be designed to be easily diffed. As an optimization, we can bypass the diff if the old and new parse trees are “isomorphic”. For example, files that differ only in whitespace or comments will have isomorphic ast trees.

3. An update phase inserts and deletes cached (global) deductions. Changes to a symbol table may result in changes to deductions in arbitrarily many files of the project. Thus, all symbol table entries and deductions will be heavily linked.

4. After all symbol table entries and deductions have been updated, a data-driven algorithm will attempt to satisfy all unsatisfied deductions, that is, deductions that must be proven (again) because of changes to one or more symbol tables. These deductions correspond to the type-checking methods in pylint. At the end of this phase, still-unsatisfied deductions will result in error messages.

Conclusions


This design looks like the simplest thing that could possibly work. Indeed, it looks like the *only* reasonable design. For simplicity's sake, local analysis *must* be done afresh for all changed files. In contrast, global analysis depends only on deductions and symbol tables, neither of which depends on program order. Thus, we can easily imagine that deductions that depend on unchanged symbol table entries (symbols) will not need to be rechecked.

This design consists of largely independent parts or phases. Difficulties with one part will not cause difficulties or complexity elsewhere. This separation into independent phases and parts is the primary strength of the design. Like Leo's core modules, this design should remain valid even if various parts change significantly.

This design seeks to minimize the risks to the project. I believe it has accomplished this goal. It should be possible to demonstrate the design with relatively simple prototype code.

All comments are welcome.

Edward K. Ream
July 27, 2010

Friday, July 9, 2010

Federal judge rules against gay marriage ban

Joseph Louis Tauro, a federal judge on the United States District Court for the District of Massachusetts, has just ruled the Defense of Marriage Act (DOMA) unconstitutional.

There are two rulings involved.

http://www.glad.org/uploads/docs/cases/2010-07-08-gill-district-court-decision.pdf

http://www.mass.gov/Cago/docs/civilrights/DOMA%20Decision.pdf

Here are some excerpts from the first.

Equal Protection of the Laws.

QQQ
[T]he Constitution ‘neither knows nor tolerates classes among citizens.’” It is with this fundamental principle in mind that equal protection jurisprudence takes on “governmental classifications that ‘affect some groups of citizens differently than others.’” And it is because of this “commitment to the law’s neutrality where the rights of persons are at stake” that legislative provisions which arbitrarily or irrationally create discrete classes cannot withstand constitutional scrutiny.

To say that all citizens are entitled to equal protection of the laws is “essentially a direction [to the government] that all persons similarly situated should be treated alike.” But courts remain cognizant of the fact that “the promise that no person shall be denied the equal protection of the laws must coexist with the practical necessity that most legislation classifies for one purpose or another, with resulting disadvantage to various groups or persons.” And so, in an attempt to reconcile the promise of equal protection with the reality of lawmaking, courts apply strict scrutiny, the most searching of constitutional inquiries, only to those laws that burden a fundamental right or target a suspect class.
...
What remains, therefore, is the possibility that Congress sought to deny recognition to same-sex marriages in order to make heterosexual marriage appear more valuable or desirable. But to the extent that this was the goal, Congress has achieved it “only by punishing same-sex couples who exercise their rights under state law.” And this the Constitution does not permit. “For if the constitutional conception of ‘equal protection of the laws’ means anything, it must at the very least mean” that the Constitution will not abide such “a bare congressional desire to harm a politically unpopular group.”
...
This court simply “cannot say that DOMA is directed to any identifiable legitimate purpose or discrete objective. It is a status-based enactment divorced from any factual context which [this court] could discern a relationship to legitimate [government] interests.” Indeed, Congress undertook this classification for the one purpose that lies entirely outside of legislative bounds, to disadvantage a group of which it disapproves. And such a classification, the Constitution clearly will not permit. In the wake of DOMA, it is only sexual orientation that differentiates a married couple entitled to federal marriage-based benefits from one not so entitled. And this court can conceive of no way in which such a difference might be relevant to the provision of the benefits at issue. By premising eligibility for these benefits on marital status in the first instance, the federal government signals to this court that the relevant distinction to be drawn is between married individuals and unmarried individuals. To further divide the class of married individuals into those with spouses of the same sex and those with spouses of the opposite sex is to create a distinction without meaning. And where, as here, “there is no reason to believe that the disadvantaged class is different, in relevant respects” from a similarly situated class, this court may conclude that it is only irrational prejudice that motivates the challenged classification. As irrational prejudice plainly never constitutes a legitimate government interest, this court must hold that Section 3 of DOMA as applied to Plaintiffs violates the equal protection principles embodied in the Fifth Amendment to the United States Constitution.
QQQ