Log 01/31/2013

What I did today:

1. Send out e-mails to the other SPARC instructors about what classes they’ll teach
2. Set up mailing lists for Vannevar
3. Help Arun solve an interesting combinatorics problem
4. Attend CS229T lecture
5. Attend NLP Reading Group
6. Work more on the very lazy evaluation problem.

Sorry for not a longer description of #6, I’ll have a more detailed write-up in the next couple days.


Log 2013-01-24

(Note that I was on vacation from 01/07 to 01/13, then had a lot of work related to TAing CS229T from 01/14 to 01/18, and was on vacation again from 01/19 to 01/21. So I’ve only actually missed a few real work days of logs.)

What I did today:

I’m now starting a new project, working with Noah Goodman and Andreas Stuhlmüller. The goal of the project is to figure out when it is possible to avoid computing some portion of a program — instead replacing it by a default value — and obtain a result that is close to the “true” output of the program. (Note that the default value could be a probability distribution instead of a single number.) We call this “very lazy evaluation” because it is a more aggressive form of lazy evaluation that only asks for the output to be approximately the same instead of exactly the same.

Something that is intimately tied to this goal is the ability to approximately evaluate the impact of changing an input on the output of a program (we would ideally like to do this without actually running the entire program, because that may be expensive). Today I worked on something related to this latter goal.

Let’s forget about the structure of the program for a second and just assume that we have two collections of variables, X and Y. We are actually dealing with randomized programs, and furthermore we allow for the possibility of conditioning on boolean functions of the output. Dealing with all of this is a bit complicated, but we can abstract all of this away to just say that we have some probability distribution over X and Y with (unnormalized) probability density \psi(X,Y). Let’s assume further that X and Y have some substructure such that we can write \psi(X,Y) as \phi_1(X) \times \phi_2(Y) \times \phi_{12}(X,Y). We can of course always do this by letting \phi_{12}(X,Y) = \frac{\phi(X,Y)}{\phi_1(X)\phi_2(Y)}; what we really want is for \phi_1, \phi_2, and \phi_{12} to be “nice”, in a sense which we will define more formally in a bit.

The first thing I did today was to realize that the factorization of \psi into \phi_1, \phi_2, and \phi_{12} is a useful way to think of decomposing a collection of variables into two sets X and Y.

The second thing I did was to think about the following question:

Suppose that what we really care about is the marginal distribution over X (i.e. \bar{\psi}(X) := \sum_{Y} \psi(X,Y)), and that:

\phi_1(X) = \sum_{i} \alpha_i \mathbb{I}[X \in U_i]
\phi_2(Y) = \sum_{j} \beta_j \mathbb{I}[Y \in V_j]
\phi_{12}(X,Y) = \mathbb{I}[(X,Y) \in T]

where \mathbb{I} denotes the indicator function for an event. Note that these equations say that \phi_1 and \phi_2 are piecewise constant and that \phi_{12} only takes on values of 0 or 1. This last constraint isn’t that unreasonable; it is basically asking that \phi_{12} simply be a function that says whether a given X and Y “fit together” (for instance, think of X and Y being subtrees of a syntactic parse and Y checking whether combining X and Y results in something grammatical). In practice this is actually a bit too stringent, but we’ll work with it for now.

The question I wanted to answer is how to compute (or approximate) \bar{\psi}(X) efficiently given the above data. Algebraically, it is pretty easy to compute \bar{\psi}:

\bar{\psi}(X) = \sum_{i,j} \alpha_i\beta_j \mathbb{I}[X \in U_i] |V_j \cap T[X,:]|,

where T[X,:] = \{Y \mid (X,Y) \in T\}. Importantly, this sum is also computationally easy to compute and represent as long as |V_j \cap T[X,:]| is a function of i and j; or, in other words, |V_j \cap T[X,:]| takes on the same value for all X \in U_i. This is true in the following two cases:

  1. For a given i and j, either U_i \times V_j is contained in T or it is disjoint from T. Intuitively, this means that the U_i and V_j contain all information necessary to figure out whether X and Y are compatible (for instance, in parsing, knowing the heads of the two subtrees is typically enough).
  2. A bit more generally, for any X \in U_i, the intersection of T[X,:] with V_j has the same set of elements. This would correspond, for instance, to subtrees where the head is not known but partial information is known (for instance, we might know that the head of Y is a verb but the compatibility constraints require it to be a transitive verb; in this case, if V_j was the set of all trees whose head was a verb, we would still be fine because we could just count how many of those verbs were transitive).

Note that both of the examples I gave were from parsing; that’s because that was the motivating problem I was thinking about while developing the above. The next step is to make sure that this intuition carries over to other problems, as well. In general, I think of the U_i and V_j as abstract representations of some set of possibilities, where we assign constant probability to every concretization of a given abstract representation.

Other things I did:

  • hold CS229T office hours
  • attend the NLP lunch
  • think about linear programming duality with Roy (I still haven’t quite grokked this)
  • publicize the Aaron Swartz Memorial Hackathon

Log 2013-01-04

What I did today:

1. Reviewed AISTATS papers (referee reports are due tonight)
2. Debugged code output from yesterday, concluded that issues came from:
-inability to handle superlatives well
-lingering non-convexities in the cost function
-examples where we found an incorrect logical form that nevertheless has a correct denotation end up pushing hard against the gradient
3. To solve the superlative issue, I hard-baked some morphology into the model (changing “largest” to “larg-” and “-est”).
4. Every so many iterations, we drop all the examples that we are having trouble with and recompute the parameters, then add those examples back in and use beam search with the new parameters
5. Alter the cost function to be convex (this also required a bit of math to figure out something suitable).

I also finally got around to updating my academic web page; it’s now at http://cs.stanford.edu/~jsteinhardt/

Log 01-03-2013

Oops looks like I was lame again. One excuse is I was sick for a bit. Anyways, here’s what I did since last time:

1. Realized that the loss function was converging to a crappy local optimum, modified the algorithm using an “EM-style” trick to make each local step at least be convex. This substantially improved the learning curves.

2. Examined individual test cases that we were getting wrong (based on how much they contributed to the gradient). Realized that part of the problem is that our model was unable to express certain sentences, such as “people in boulder”, that required a single word to lexicalize to multiple predicates.

3. Realized that my cost function was completely flat (i.e. Hessian = 0, gradient = 0) in some regimes and that this was causing L-BFGS to get stuck and not move (but only due to being on a plateau rather than the actual global optimum). I fixed this and things improved by a lot.

4. Also switched to an unconstrained version of L-BFGS that is actually truly convex (unlike the previous one which only had convex level sets), and surprisingly found that this didn’t mess up the results much (as opposed to earlier, when some variables would become very large if left unconstrained and mess up the model).

5. Removed a bunch of examples that the current model is unable to get to investigate whether this improves things.

6. Augmented the model to be able to get a substantial portion of those examples, then re-ran the algorithm on the full training set; I now get 65% training-set accuracy and 50% hold-out accuracy, which is a substantial improvement from before! I realized there were still a few issues with what I was doing, so I am now re-running with a slightly different model and hoping that I get even higher accuracy. After that I will examine what I am still getting wrong and see what the right next step is.

In the meantime, I’ll also be doing a bit of prep work for CS299T, the class that I’m TA’ing next quarter.