# Log 03-20-13

What I did today:

1. Tried to deal with the fact that my current way of conditioning (based on max-ent) leads to likelihoods that are pathologically nonsensical (numbers like $2^{-10^4}$). After conditioning in what seemed like a similar way, I went from that to likelihoods that were always $0$ or $1$, which seemed odd but far less pathological. After thinking some more, I realized that the issue was with assumptions on the joint distribution between the size of a denotation and of its abstraction; the pathological likelihood came because I had implicitly made the assumption that these sizes were necessarily very close to each other; if I assume they are independent, I get the 0-1 likelihood, and interpolating between these gives likelihoods of varying degrees of sharpness. My current plan is to go with the 0-1 likelihood for now, but try to eventually learn the join distribution between these two sizes from data (which should be easy because the data can be obtained through simulation).

I’m not sure yet whether the 0-1 likelihood will work well on data, but am just crossing my fingers for now…also, hopefully Percy and I can find some easier warm-up applications than semantic parsing to test my ideas on.

2. Read through several papers in varying degrees of detail. Some highlights:

-“Robust Filtering, Prediction, Smoothing, and Observability of Uncertain Systems” [Moheimani, Savkin, and Peterson, 1998]. This deals with hidden Markov models where the system evolves according to linear dynamics perturbed by some noise, where the noise is adversarial but constrained to lie in a given ellipse (for instance, by bounding the sum of the squared L2 norms of the noise across time). I was interested in this to see if the approach could be modified to provide more robust algorithms for logistic regression. I haven’t worked through the details of this yet but one possible issue is that it seems like logistic regression has half-plane constraints (as opposed to just ellipsoidal constraints).

-“Quantifying Causal Influences” [Janzing, Balduzzi, Grosse-Wentrup, Scholkopf 2012]. The authors define a notion of causal influence between two variables $X_i$ and $X_j$ based on what would happen if instead of letting $X_j$ depend on $X_i$, it depended on a copy $X_i'$ of $X_i$ that was drawn independently from the marginal distribution of $X_i$. The corresponding relative entropy is very similar to a measure of importance that I’ve been developing with Noah and Andreas, although with a different application in mind.

-“Decayed MCMC Filtering” [Marthi, Pasula, Russell, Peres]. This is a pretty clever idea that tries to replace particle filters in HMMs by an MCMC algorithm where states are resampled according to a distribution that decays quadratically as a function of how far away in time a state is from the current state. They show that this algorithm actually satisfies universal convergence guarantees independent of time, which is pretty cool.

-“Optimal Value of Information in Graphical Models” [Krause and Guestrin, 2009]. The authors solve the value of information problems (where you have to pay a certain cost to observe the value of a state) for Markov models. They also show that it is computationally hard ($NP^{PP}$, which is kind of ridiculous) to solve optimal VOI for polytrees. I didn’t really read through this much, since it wasn’t related to what I was looking for, but it seems pretty cool.

-And another cool one out of Carlos Guestrin’s lab: “Focused Belief Propagation for Query-Specific Inference” [Chechetka and Guestrin, 2010]. They consider the problem of trying to get loopy BP to converge faster if we care primarily about the marginal distribution over some subset of variables (which is again very related to what Noah, Andreas, and I are working on). They modify loopy BP to always send the “most important” message next, and develop a principled notion of what most important means. Apparently, one previous such notion was the message that changed the most from its previous instantiation, but this doesn’t give you any degree of query-directedness. They instead compute an approximate derivative of the marginal distribution over the query variables with respect to each variable, and somehow use that to determine which message to send. I haven’t gone through the details of how exactly they are doing this, but it seems both really cool and really relevant to what I’m trying to do, so I plan to read it in more detail tomorrow.