Log 03-21-13

What I did today:

1. Considered what happens if you have a polytope of distributions and take their marginal. It turns out you still get a polytope, although so far I’m not sure it has a nice representation. However, you can still at least formulate the maximum-entropy marginal distribution nicely by considering the program

maximize H(q)
subject to p \in \mathcal{P}, q = \mathrm{Marginal}[p]

where \mathcal{P} is the polytope and H is the entropy. Since marginalization is linear, this program is tractable (and if p is measurable with respect to a small \Sigma-algebra, then the program will also be tractable).

2. I also tried to apply the polytopal perspective to the problem that Noah was interested in (considering the optimal set of variables to use when doing inference about a subset of variables in a graphical model). In this case I think I run into the same problems as when I condition (that the worst case is much too bad), although Carlos Guestrin seems to get away with worst-case analysis in one of his papers, so maybe this is okay.

3. Read through some more papers. Highlights:

-“Smooth Interpretation” [Chaudhuri and Solar-Lezama, 2010]. This is another cool paper that tries to analyze the output distribution of a program when the input is a Gaussian random variable. They use tools from program analysis to do so, and even use some of the pruning techniques that I’ve been thinking about; although they seem to not really prove any bounds, which makes me hopeful because I think I have ways to prove bounds in at least some cases. They reference a large literature on verification of probabilistic programs, which I should probably read.

-“Node Splitting: A Scheme for Generating Upper Bounds in Bayesian Networks” [Choi, Chavira, Darwich, 2007]. These guys come up with a pretty interesting idea for computing the maximum-probability assignment of a collection of random variable in a Bayes net (conditioned on some values). The idea is to split a node into two copies, essentially removing some dependencies, which gives one an optimization problem that is an upper bound on the original optimization problem; whenever the two copies are consistent with each other, the bound is exact, so they can basically do branch-and-bound to try to compute a provably optimal solution (by getting an upper bound where all the variables are consistent). The algorithm is exponential only in the number of splits they perform, so for networks that can be split into a small number of networks with low tree-width, this algorithm should work quite well.

4. I also revisited the relevance criterion from the Guestrin paper from yesterday. Basically, they compute some upper bound on the partial derivative between a message and the probability assigned to a target variable; they seem to upper bound the partial derivative across all possible assignments of all other messages ┬áin the network, which seems intuitively like it should give a very conservative estimate, so I’m a bit confused as to how his algorithm works well anyways.

5. Several meetings: interesting NLP lunch talk by Sonal, meeting with Percy and Sida about 229T, and meeting with Emma and Jacob T. before they leave for Spring break.

6. Made Qiaochu try to teach me LaSerre duality, although I’m still a bit confused about what’s going on.


Log 03-20-13

What I did today:

1. Tried to deal with the fact that my current way of conditioning (based on max-ent) leads to likelihoods that are pathologically nonsensical (numbers like 2^{-10^4}). After conditioning in what seemed like a similar way, I went from that to likelihoods that were always 0 or 1, which seemed odd but far less pathological. After thinking some more, I realized that the issue was with assumptions on the joint distribution between the size of a denotation and of its abstraction; the pathological likelihood came because I had implicitly made the assumption that these sizes were necessarily very close to each other; if I assume they are independent, I get the 0-1 likelihood, and interpolating between these gives likelihoods of varying degrees of sharpness. My current plan is to go with the 0-1 likelihood for now, but try to eventually learn the join distribution between these two sizes from data (which should be easy because the data can be obtained through simulation).

I’m not sure yet whether the 0-1 likelihood will work well on data, but am just crossing my fingers for now…also, hopefully Percy and I can find some easier warm-up applications than semantic parsing to test my ideas on.

2. Read through several papers in varying degrees of detail. Some highlights:

-“Robust Filtering, Prediction, Smoothing, and Observability of Uncertain Systems” [Moheimani, Savkin, and Peterson, 1998]. This deals with hidden Markov models where the system evolves according to linear dynamics perturbed by some noise, where the noise is adversarial but constrained to lie in a given ellipse (for instance, by bounding the sum of the squared L2 norms of the noise across time). I was interested in this to see if the approach could be modified to provide more robust algorithms for logistic regression. I haven’t worked through the details of this yet but one possible issue is that it seems like logistic regression has half-plane constraints (as opposed to just ellipsoidal constraints).

-“Quantifying Causal Influences” [Janzing, Balduzzi, Grosse-Wentrup, Scholkopf 2012]. The authors define a notion of causal influence between two variables X_i and X_j based on what would happen if instead of letting X_j depend on X_i, it depended on a copy X_i' of X_i that was drawn independently from the marginal distribution of X_i. The corresponding relative entropy is very similar to a measure of importance that I’ve been developing with Noah and Andreas, although with a different application in mind.

-“Decayed MCMC Filtering” [Marthi, Pasula, Russell, Peres]. This is a pretty clever idea that tries to replace particle filters in HMMs by an MCMC algorithm where states are resampled according to a distribution that decays quadratically as a function of how far away in time a state is from the current state. They show that this algorithm actually satisfies universal convergence guarantees independent of time, which is pretty cool.

-“Optimal Value of Information in Graphical Models” [Krause and Guestrin, 2009]. The authors solve the value of information problems (where you have to pay a certain cost to observe the value of a state) for Markov models. They also show that it is computationally hard (NP^{PP}, which is kind of ridiculous) to solve optimal VOI for polytrees. I didn’t really read through this much, since it wasn’t related to what I was looking for, but it seems pretty cool.

-And another cool one out of Carlos Guestrin’s lab: “Focused Belief Propagation for Query-Specific Inference” [Chechetka and Guestrin, 2010]. They consider the problem of trying to get loopy BP to converge faster if we care primarily about the marginal distribution over some subset of variables (which is again very related to what Noah, Andreas, and I are working on). They modify loopy BP to always send the “most important” message next, and develop a principled notion of what most important means. Apparently, one previous such notion was the message that changed the most from its previous instantiation, but this doesn’t give you any degree of query-directedness. They instead compute an approximate derivative of the marginal distribution over the query variables with respect to each variable, and somehow use that to determine which message to send. I haven’t gone through the details of how exactly they are doing this, but it seems both really cool and really relevant to what I’m trying to do, so I plan to read it in more detail tomorrow.

Log 03-19-13

What I did today:

1. Realized that the worst case when you condition on something is arbitrarily bad (which I think I figured out a while ago, so feel stupid for trying this again)

2. Tried to salvage this by using the fact that the polytope is something that you have control over

3. Have what I think is a decent approach to this, but it may not matter because it seems like another issue is the relaxation you get from considering a valid abstraction of your likelihood instead of the likelihood itself — it can lead to really extreme assignments of probabilities.

4. Thought a bit about how to fix this, considering just trying to learn the correct likelihood from data, although it seems possible that there’s something better to be done.

Also, I explained kernel methods and concentration inequalities to Qiaochu and met with Percy for a couple hours.


Log 03-18-2013

What I did today:
1. Tried briefly to understand convex duality and relationships with minimax theorem
2. Tried to understand how to deal with conditioning for abstract representations by treating conditioning as a single stage in a game against nature

I apparently need to learn more statistical mechanics.


Log 03-14-2013

What I did today:

1. Tried to modify the maximum-entropy method to use p(x|y) instead of p(y|x), but it turns out this probably won’t work because we inherently need to consider counterfactual values of y to get good performance.

2. Also tried an approach where you trade off badness-of-fit against p(y) [with the intuition that it’s okay to fit models that are unlikely under your model class], but couldn’t get it to work.

3. Did some experiments with Kalman filters for the p(y|x) approach, plus worked out how things should work out for semantic parsing, and convinced myself that the approach is at least reasonable, even if I can’t prove that it works well.

4. Came up with a somewhat sketchy formalization of which variables one ought to add for relevance-directed inference on a subset of variables in a graphical model (basically, add the one where adding it has the highest KL-divergence relative to not adding it).

5. Worked out a few possible ways to deal with merging of abstract states in undirected graphical models (where there might be clique potentials where some of the variables in the potential are not represented due to the abstraction). One way involves dynamic programming and the other involves breaking dependencies (which is justified under the maximum entropy formalism).

6. Thought a little bit about how to adopt these ideas to sequential Monte Carlo, and how to add in refinements.

7. Write up post on other blog describing the probabilistic abstractions formalism. [read it here]

I felt much more productive today than yesterday. Yay!


Log 03-13-2013

What I did today:

1. Watched final project presentations for 229T
2. Attended NLP Reading Group
3. Tried to work out how to choose an optimal set of random variables if we only care about the KL divergence to some subset of the variables, but realized there were some issues with trying to do this
4. Worked through conditioning with an abstract evaluation function in the special case of a linear-Gaussian system with an output non-linearity. In this case we basically get assumed density filtering (a special case of EP for Gaussian HMMs) and there are all sorts of non-convexities which are ugly. Tomorrow I plan to try to work out a different way of conditioning based on caring more about situations with high mass under the prior. Another possibility is to try conditioning in a different order that allows more pooling of information.

I felt like I wasn’t super-productive today, not sure exactly why; was a bit hard to motivate myself to think about this stuff, maybe because I’m back in the creative stage vs. follow-through stage but haven’t quite internalized that yet.


Log 03-12-2013

What I did today:
1. Held office hours
2. Noah’s lab meeting
3. Realized that I need to add extra constraints to the convex optimization problem I currently care about, or else the objective function is really poor (and have some idea what those constraints are, basically constraints on unions of sets)
4. Wrote up pseudocode for beam search with probabilistic abstractions
5. Thought some about how conditioning ties in with abstractions, still kind of confused by this but think that passing from the constraint framework to a single distribution via maxent might be legitimate in some scenarios
6. Frisbee practice
7. Vannevar meeting
8. Wrote blog post on other blog