I’ve been a bit lame about logging stuff for the past few days (well, I was only really here on 12-21 and 12-29, to be fair; the rest was either vacation or very short days).
Either way, here’s what’s happened since 12-20:
-implemented cold restarts
-added a line search with partial re-evaluation of the beam between the old point and the new point in L-BFGS (to avoid overfitting to the surrogate loss)
-printed utterances that become wrong after updates [although I haven’t used this extra output yet]
-wrote script to generate graphs to display learning curves
-printed separate train/test statistics
-printed out how much each example contributes to the gradient
-used this output to find a bug in my hashing
-currently waiting for new runs (without the hashing bug) to finish, which will take a couple hours
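For the line search item above, here is a minimal sketch of the idea rather than the actual implementation. L-BFGS proposes a new point by minimizing the surrogate loss, and a backtracking search then checks points on the segment between the old and new parameters against a re-evaluated objective. The names `line_search` and `true_loss`, and the step-halving schedule, are assumptions made for illustration.

```python
from typing import Callable, List


def line_search(old: List[float], new: List[float],
                true_loss: Callable[[List[float]], float],
                max_halvings: int = 5) -> List[float]:
    """Backtrack from `new` toward `old` until the re-evaluated loss improves.

    `true_loss` stands in for re-evaluating (part of) the beam at a candidate
    point; how that is done in the real code is not specified in the post.
    """
    base = true_loss(old)
    step = 1.0
    for _ in range(max_halvings):
        # Candidate on the segment between the old and new parameters.
        candidate = [o + step * (n - o) for o, n in zip(old, new)]
        if true_loss(candidate) < base:
            return candidate
        step /= 2.0
    return list(old)  # no improving point found; keep the old parameters
```

If the full step already improves the re-evaluated loss, the new point is accepted unchanged; otherwise the step is halved up to `max_halvings` times.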
Oh, and here are some learning curves (some of the info in the titles is inaccurate because of a line search bug I had for a while, which meant the line search was essentially never performed): aggregates
The “before” curve is training set accuracy before the surrogate loss (i.e. beam) gets updated, the “after” curve is training set accuracy after it gets updated, and “test” is test set accuracy.
Stuff I did today:
1. Optimized my code to run about 8x faster by cutting out memory- and time-intensive computation that didn’t affect the answer, e.g. training examples whose answer our model is unable to express, as well as top-down information that was unnecessary because it had already been incorporated in previous iterations. This was only supposed to lead to a 3x speedup, but I guess lower memory usage => less garbage collection => much faster code. This has the added bonus of letting me run 3 threads at once on my machine before I run out of memory, as opposed to the 2 I had before.
2. Analyzed results of runs from yesterday (which all died prematurely due to exceeding RAM, but there was enough saved data to be useful). Early stopping appears to dramatically accelerate the learning curve, but it has the side effect of making the curve level out sooner than it does without early stopping, i.e. by the end of the run the model is not fit as well. More experimentation is needed here.
3. It also appears that even with a small amount of L1 regularization (lambda = 0.01), the model is still overfitting. In retrospect it’s obvious that the parameter needs to be higher, since lambda=0.01 roughly means that the model should be willing to change 100 parameters to fit one additional training example (and it in fact does this, in fairly unintuitive ways). So, I’m currently doing runs with lambda=0.4 and lambda=1.2 to see if this decreases overfitting.
4. Finally, L-BFGS still converges slowly, even with an objective function that is convex. I therefore conclude that either there is a bug in my code or a bug in L-BFGS, but am unsure which it is. In addition, parameters and beams still shift quite rapidly between iterations even with early stopping (although I could try even earlier stopping).
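To make the lambda arithmetic in item 3 concrete, here is a small sketch. It assumes that fitting one more training example is worth roughly one unit of loss and that "changing a parameter" means moving it by about one unit; both are simplifications for illustration, not facts about the runs above. The function names are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def max_unit_changes(lam, gain_per_example=1.0):
    """Number of unit-sized weight changes whose L1 cost equals the gain
    from fitting one additional example (assumed to be `gain_per_example`)."""
    return gain_per_example / lam


for lam in (0.01, 0.4, 1.2):
    print(f"lambda={lam}: about {max_unit_changes(lam):.2f} unit changes per example")
```

Under these assumptions, lambda=0.01 leaves room for about 100 unit changes per fitted example, lambda=0.4 for 2.5, and lambda=1.2 for fewer than one.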
I’d like to experiment with the following:
-don’t stop early for the last few iterations
-stop even earlier (possibly for just the first few iterations)
-do cold restarts on the last few iterations
-ignore all beams that we have trouble fitting (again just on the last few iterations, although it actually might be a good idea to do this intermittently as well)
A bit lower priority (but perhaps higher-priority in the future) is to understand why L-BFGS still converges slowly.
Stuff I did today:
1. Added an option to only use the best logical form (instead of a weighted combination of logical forms) during optimization. This is useful because it forces the optimization problem to be convex (probably at the loss of accuracy, but it helps for debugging L-BFGS).
2. Fixed a permissions error that was making it hard to track programs as they ran.
3. Figured out how to set up a local scratch directory so that I can save large output files.
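A sketch of why item 1 makes the problem convex, assuming a standard log-linear model over logical forms. The notation is mine and may not match the actual code: theta is the weight vector, phi(x, z) the feature vector of logical form z for utterance x, Z(x) the set of logical forms on the beam, C(x) the subset that gives the correct answer, and z^* the single best correct logical form.

```latex
% Weighted combination over all correct logical forms on the beam:
\ell_{\text{weighted}}(\theta)
  = -\log \sum_{z \in C(x)} \exp\big(\theta^\top \phi(x, z)\big)
    + \log \sum_{z \in Z(x)} \exp\big(\theta^\top \phi(x, z)\big)

% Best logical form z^* held fixed during optimization:
\ell_{\text{best}}(\theta)
  = -\theta^\top \phi(x, z^*)
    + \log \sum_{z \in Z(x)} \exp\big(\theta^\top \phi(x, z)\big)
```

Log-sum-exp is convex in theta, so the second loss is a linear term plus a convex term and is therefore convex. The first loss is a difference of two convex terms, which is not convex in general.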
Code is still running…will debug preliminary results tomorrow, e.g.:
1. Does convexity make L-BFGS converge more quickly?
2. Does smoothness make L-BFGS converge more quickly?
3. Does early stopping prevent overfitting?
4. Do L1 penalties prevent overfitting? Does increasing regularization help?
5. How many iterations are necessary for things to effectively converge?
And a question I was hoping to answer but somewhat failed to because I didn’t log enough info:
6. How important is top-down information at later stages of the search? (This is mainly useful for speeding things up because if it’s not useful we can stop computing it.)
Stuff I did today:
1. Checked output of the runs I sent to the NLP machines (except it turns out I had a bug and had to re-run them)
2. Wrote a debugging suite that allows me to easily query information about different features / sets of features and identify utterances in which they are important.
3. Did some debugging and realized that I had some bugs in how I was regularizing the model.
4. Fixed the regularization bug, smoothed out the L1 loss slightly to make the convex optimizer more happy, and added a bunch more logging.
5. Added several run-time options: the smoothing parameter for L1, the coefficient on the L1 penalty, an early stopping parameter for L-BFGS, and the number of iterations to run the algorithm for.
6. Sent 7 jobs off to the NLP machines, which will hopefully finish sometime tomorrow.
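One common way to smooth the L1 penalty from item 4 is to replace |w| with sqrt(w^2 + eps^2) - eps, which is differentiable at zero. This is an assumption for illustration; the post does not say which smoothing was actually used. The function names are hypothetical, `eps` stands in for the smoothing parameter, and `lam` for the coefficient on the L1 penalty.

```python
import math


def smooth_l1(weights, lam, eps):
    """Smoothed L1 penalty; approaches lam * sum(|w|) as eps -> 0 (eps > 0)."""
    return lam * sum(math.sqrt(w * w + eps * eps) - eps for w in weights)


def smooth_l1_grad(weights, lam, eps):
    """Gradient of smooth_l1; well defined at w = 0, unlike plain L1."""
    return [lam * w / math.sqrt(w * w + eps * eps) for w in weights]
```

A larger `eps` gives a smoother function near zero, which a gradient-based optimizer such as L-BFGS tends to handle better, at the cost of a weaker push toward exactly-zero weights.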
If the Mayans are right, this will be the 4th to last (or perhaps 3rd to last) post on this blog. Hopefully they aren’t right.
Stuff I did today:
1. Examined the output of the scripts I ran over the weekend; a reasonable number of the weights for word->predicate features look correct, although a lot are also wrong; the weights on predicate<->predicate features look completely wrong. It’s clear that the model is overfitting most of the time. I probably need to regularize, which I currently do in the final iterations but not in the early ones. This is because regularizing made the model not explore enough, but I think the solution to this is to have a separate model specifically for exploration.
2. Tried to use Java’s built-in serialization so that I could save/load computation state in a more fine-grained way than I currently do. It turns out that this was a terrible idea, and I ended up just writing my own serialization (using the Fig parser to do a lot of the heavy lifting).
3. Set up my code on the NLP machines and ran a bunch of jobs in parallel with different early stopping parameters for L-BFGS (which was dominating the runtime before, for reasons that we aren’t sure of yet).
4. Met with Percy to try to debug things. Current issues are: incorrect feature weights, slow convergence of L-BFGS, and overfitting to the current beam. Tomorrow I hope to better understand what is causing each of these (for instance, is L-BFGS just slow because the problem is non-convex, or for some other reason?).
Including this because it showed up during the NIPS retrospective (when I reviewed the CPRL paper):
For those not in NLP:
L-BFGS is a limited-memory quasi-Newton optimizer designed for large-scale smooth optimization problems; it is typically used on convex objectives
NIPS (Neural Information Processing Systems) is an annual machine learning conference
A logical form is a symbolic representation of a sentence, such as those in Figure 4 of this paper
Beam search is a modification of breadth-first search (or some similar algorithm) in which you keep only the most promising candidates at each step instead of expanding everything (due to time / memory constraints). In this context, we are building up parse trees for sentences, and we only keep track of the 30 best subtrees for each sub-span of the sentence (as opposed to all of the subtrees, a set whose size would be exponential in the length of the sentence).
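As a toy illustration of the per-span beam described above, here is a minimal sketch of a chart that keeps only the `beam_size` best subtrees for each span. The scoring rule (penalizing unbalanced splits) is made up so the example is self-contained; a real parser would score subtrees with learned feature weights. The name `build_chart` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Item = Tuple[float, str]  # (score, bracketed subtree)


def build_chart(words: List[str],
                beam_size: int = 30) -> Dict[Tuple[int, int], List[Item]]:
    """Build binary trees bottom-up, pruning each span to `beam_size` items."""
    n = len(words)
    chart: Dict[Tuple[int, int], List[Item]] = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = [(0.0, w)]
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            candidates: List[Item] = []
            for split in range(start + 1, end):
                # Toy scoring rule (an assumption): penalize unbalanced splits.
                penalty = abs((split - start) - (end - split))
                for ls, lt in chart[(start, split)]:
                    for rs, rt in chart[(split, end)]:
                        candidates.append((ls + rs - penalty, f"({lt} {rt})"))
            # The beam: keep only the best `beam_size` subtrees for this span.
            chart[(start, end)] = heapq.nlargest(beam_size, candidates)
    return chart
```

Because larger spans are built only from the subtrees that survived pruning on smaller spans, the work per span is bounded by the beam size rather than by the full number of possible trees.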