Stuff I did today:
1. Optimized my code to run about 8x faster by cutting out memory- and time-intensive computation that didn’t affect the answer, e.g. training examples whose answer our model is unable to express, and top-down information that was unnecessary because it had already been incorporated in previous iterations. I only expected a 3x speedup from this, but I guess lower memory usage => less garbage collection => much faster code. As an added bonus, I can now run 3 threads at once on my machine before running out of memory, as opposed to the 2 I could run before.
2. Analyzed results of runs from yesterday (which all died prematurely from exceeding RAM, but saved enough data to be useful). Early stopping appears to dramatically accelerate the early part of the learning curve, but it also makes the curve level out sooner than it does without early stopping, i.e. the final model does not fit as well. More experimentation is needed here.
3. It also appears that even with a small amount of L1 regularization (lambda=0.01), the model is still overfitting. In retrospect it’s obvious that the parameter needs to be higher, since lambda=0.01 roughly means that the model is willing to change 100 parameters to fit one additional training example (and it does in fact do this, in fairly unintuitive ways). So I’m currently doing runs with lambda=0.4 and lambda=1.2 to see whether this reduces overfitting.
4. Finally, L-BFGS still converges slowly, even with a convex objective function. I therefore conclude that there is a bug either in my code or in L-BFGS, but I am unsure which. In addition, parameters and beams still shift quite rapidly between iterations even with early stopping (although I could try even earlier stopping).
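The lambda arithmetic in item 3 can be written out as a rough sketch. It assumes the objective has the form loss + lambda * (sum of absolute parameter values), that fitting one additional training example lowers the loss by about 1 unit, and that each parameter change has magnitude about 1. These assumptions and the function name are illustrative, not taken from the actual code; they are simply the ones that reproduce the "100 parameters" figure.

```python
def break_even_param_changes(lam, gain_per_example=1.0, change_size=1.0):
    """Number of parameter changes whose total L1 cost equals the loss
    reduction from fitting one more training example.

    Assumed (hypothetical) setup: each change has magnitude `change_size`
    and costs lam * change_size; fitting one example gains `gain_per_example`.
    """
    return gain_per_example / (lam * change_size)


# Values of lambda mentioned in the log.
for lam in (0.01, 0.4, 1.2):
    n = break_even_param_changes(lam)
    print(f"lambda={lam}: break-even at about {n:.2f} unit-size parameter changes")
```

Under these assumptions, lambda=0.4 and lambda=1.2 put the break-even point at only a few parameter changes or fewer, which is why they are plausible candidates for reducing overfitting.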
I’d like to experiment with the following:
-don’t stop early for the last few iterations
-stop even earlier (possibly for just the first few iterations)
-do cold restarts on the last few iterations
-ignore all beams that we have trouble fitting (again just on the last few iterations, although it actually might be a good idea to do this intermittently as well)
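The first two experiments could be driven by a single schedule, sketched below. This assumes that "early stopping" means capping the number of optimizer steps within each outer training iteration; the function name, parameters, and all default numbers are hypothetical placeholders rather than values from my runs.

```python
def inner_iteration_cap(outer_iter, n_outer, early_cap=5, full_cap=100,
                        n_final=3, n_initial=0, initial_cap=2):
    """Maximum optimizer steps allowed in outer iteration `outer_iter` (0-based).

    - The last `n_final` outer iterations get `full_cap` (no early stopping).
    - The first `n_initial` outer iterations get `initial_cap` (even earlier stopping).
    - Everything in between gets `early_cap` (ordinary early stopping).
    """
    if outer_iter >= n_outer - n_final:
        return full_cap
    if outer_iter < n_initial:
        return initial_cap
    return early_cap


# Example: 10 outer iterations, extra-early stopping for the first 2,
# no early stopping for the last 3.
caps = [inner_iteration_cap(t, 10, n_initial=2) for t in range(10)]
print(caps)
```

Keeping the schedule in one function makes it easy to run the variants side by side by changing only `n_final` and `n_initial`.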
A bit lower priority (but perhaps higher priority in the future) is understanding why L-BFGS still converges slowly.
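One cheap way to start separating "bug in my code" from "bug in L-BFGS" is a finite-difference gradient check: if the gradient handed to the optimizer disagrees with the objective, slow convergence is expected regardless of the optimizer. The sketch below uses a stand-in convex objective (logistic loss plus an L2 penalty on toy data), not my actual model, and every name in it is hypothetical. Note that the check is only meaningful where the objective is differentiable, so with an L1 term it should be run at points where no parameter is exactly zero.

```python
import math
import random


def objective(w, data, l2=0.1):
    """Toy convex objective: logistic loss plus an L2 penalty (labels are +1/-1)."""
    total = 0.5 * l2 * sum(wi * wi for wi in w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += math.log1p(math.exp(-margin))
    return total


def gradient(w, data, l2=0.1):
    """Analytic gradient of `objective`."""
    grad = [l2 * wi for wi in w]
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = -y / (1.0 + math.exp(margin))  # d/dmargin of log(1 + exp(-margin)), times y
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    return grad


def max_gradient_error(f, g, w, eps=1e-6):
    """Largest absolute gap between g(w) and a central-difference estimate of f's gradient."""
    analytic = g(w)
    worst = 0.0
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        numeric = (f(w_plus) - f(w_minus)) / (2 * eps)
        worst = max(worst, abs(numeric - analytic[i]))
    return worst


rng = random.Random(0)
data = [([rng.uniform(-1, 1) for _ in range(3)], rng.choice([-1, 1]))
        for _ in range(20)]
w0 = [rng.uniform(-1, 1) for _ in range(3)]

err = max_gradient_error(lambda w: objective(w, data),
                         lambda w: gradient(w, data), w0)
print("max gradient error:", err)
```

If the error is tiny at several random points, the gradient is probably fine and the next suspects are the optimizer settings or the optimizer itself; a large error points back at my own code.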