Log 2013-01-24

(Note that I was on vacation from 01/07 to 01/13, then had a lot of work related to TAing CS229T from 01/14 to 01/18, and was on vacation again from 01/19 to 01/21. So I’ve only actually missed a few real work days of logs.)

What I did today:

I’m now starting a new project, working with Noah Goodman and Andreas Stuhlmüller. The goal of the project is to figure out when it is possible to avoid computing some portion of a program — instead replacing it by a default value — and obtain a result that is close to the “true” output of the program. (Note that the default value could be a probability distribution instead of a single number.) We call this “very lazy evaluation” because it is a more aggressive form of lazy evaluation that only asks for the output to be approximately the same instead of exactly the same.

Something that is intimately tied to this goal is the ability to approximately evaluate the impact of changing an input on the output of a program (we would ideally like to do this without actually running the entire program, because that may be expensive). Today I worked on something related to this latter goal.

Let’s forget about the structure of the program for a second and just assume that we have two collections of variables, X and Y. We are actually dealing with randomized programs, and furthermore we allow for the possibility of conditioning on boolean functions of the output. Dealing with all of this is a bit complicated, but we can abstract all of this away to just say that we have some probability distribution over X and Y with (unnormalized) probability density \psi(X,Y). Let’s assume further that X and Y have some substructure such that we can write \psi(X,Y) as \phi_1(X) \times \phi_2(Y) \times \phi_{12}(X,Y). We can of course always do this by letting \phi_{12}(X,Y) = \frac{\phi(X,Y)}{\phi_1(X)\phi_2(Y)}; what we really want is for \phi_1, \phi_2, and \phi_{12} to be “nice”, in a sense which we will define more formally in a bit.

The first thing I did today was to realize that the factorization of \psi into \phi_1, \phi_2, and \phi_{12} is a useful way to think of decomposing a collection of variables into two sets X and Y.

The second thing I did was to think about the following question:

Suppose that what we really care about is the marginal distribution over X (i.e. \bar{\psi}(X) := \sum_{Y} \psi(X,Y)), and that:

\phi_1(X) = \sum_{i} \alpha_i \mathbb{I}[X \in U_i]
\phi_2(Y) = \sum_{j} \beta_j \mathbb{I}[Y \in V_j]
\phi_{12}(X,Y) = \mathbb{I}[(X,Y) \in T]

where \mathbb{I} denotes the indicator function for an event. Note that these equations say that \phi_1 and \phi_2 are piecewise constant and that \phi_{12} only takes on values of 0 or 1. This last constraint isn’t that unreasonable; it is basically asking that \phi_{12} simply be a function that says whether a given X and Y “fit together” (for instance, think of X and Y being subtrees of a syntactic parse and Y checking whether combining X and Y results in something grammatical). In practice this is actually a bit too stringent, but we’ll work with it for now.

The question I wanted to answer is how to compute (or approximate) \bar{\psi}(X) efficiently given the above data. Algebraically, it is pretty easy to compute \bar{\psi}:

\bar{\psi}(X) = \sum_{i,j} \alpha_i\beta_j \mathbb{I}[X \in U_i] |V_j \cap T[X,:]|,

where T[X,:] = \{Y \mid (X,Y) \in T\}. Importantly, this sum is also computationally easy to compute and represent as long as |V_j \cap T[X,:]| is a function of i and j; or, in other words, |V_j \cap T[X,:]| takes on the same value for all X \in U_i. This is true in the following two cases:

  1. For a given i and j, either U_i \times V_j is contained in T or it is disjoint from T. Intuitively, this means that the U_i and V_j contain all information necessary to figure out whether X and Y are compatible (for instance, in parsing, knowing the heads of the two subtrees is typically enough).
  2. A bit more generally, for any X \in U_i, the intersection of T[X,:] with V_j has the same set of elements. This would correspond, for instance, to subtrees where the head is not known but partial information is known (for instance, we might know that the head of Y is a verb but the compatibility constraints require it to be a transitive verb; in this case, if V_j was the set of all trees whose head was a verb, we would still be fine because we could just count how many of those verbs were transitive).

Note that both of the examples I gave were from parsing; that’s because that was the motivating problem I was thinking about while developing the above. The next step is to make sure that this intuition carries over to other problems, as well. In general, I think of the U_i and V_j as abstract representations of some set of possibilities, where we assign constant probability to every concretization of a given abstract representation.

Other things I did:

  • hold CS229T office hours
  • attend the NLP lunch
  • think about linear programming duality with Roy (I still haven’t quite grokked this)
  • publicize the Aaron Swartz Memorial Hackathon

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s