ICML 2017 Thoughts

ICML 2017 has just ended. While Sydney is remote for those in Europe and North America, the conference center is a wonderful venue (with good coffee!), and the city is a lot of fun. Everything went smoothly and the organizers did a great job.

You can get a list of papers that I liked from my Twitter feed, so instead I’d like to discuss some broad themes I sensed.

Multitask regularization to mitigate sample complexity in RL. Both in video games and in dialog, it is useful to add extra (auxiliary) tasks in order to accelerate learning.

Leveraging knowledge and memory. Our current models are powerful function approximators, but in NLP especially we need to go beyond “the current example” in order to exhibit competence.

Gradient descent as inference. Whether it’s inpainting with a GAN or BLEU score maximization with an RNN, gradient descent is an…
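To make the “gradient descent as inference” theme concrete, here is a minimal sketch of GAN inpainting under my own assumptions (the `generator` module and its `latent_dim` attribute are placeholders, not anything from the post): the generator’s weights stay frozen, gradient descent runs over the latent code, and only the observed pixels contribute to the loss.

```python
import torch

def inpaint_by_latent_descent(generator, observed, mask, steps=200, lr=0.05):
    """Fill in missing pixels by optimizing the generator's latent input.

    generator: a pretrained, frozen module mapping a latent vector to an image
               (assumed to expose a `latent_dim` attribute)
    observed:  image tensor; values at masked-out positions are ignored
    mask:      1.0 where pixels are observed, 0.0 where they must be filled in
    """
    z = torch.randn(1, generator.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        candidate = generator(z)
        # only the observed pixels constrain the latent code
        loss = ((candidate - observed) ** 2 * mask).mean()
        loss.backward()
        opt.step()
    return generator(z).detach()
```

Inference happens by optimization rather than by a forward pass: the model is fixed and the “input” is what gets trained.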
Original Post: ICML 2017 Thoughts

Rational Inexuberance

Recently Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav’s solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (cf. reward shaping).

I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQuAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et al. on a paragraph completion task which humans perform with high skill and for which all…
Original Post: Rational Inexuberance

Tiered Architectures, Counterfactual Learning, and Sample Complexity

I’m on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates, which are passed to an “L2” model that reranks and filters them, which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BiDAF model is more challenging. However, I think there are potential sample complexity benefits as well.

I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results in the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially…
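For flavor, here is a minimal sketch of the tiered pattern (the scorer names and cutoffs are mine, purely illustrative): a cheap, index-friendly L1 scorer prunes the corpus so the expensive L2 scorer only ever sees a short candidate list.

```python
from typing import Callable, List, Tuple

def tiered_rank(
    query: str,
    corpus: List[str],
    l1_score: Callable[[str, str], float],  # cheap, index-friendly (DSSM-style) scorer
    l2_score: Callable[[str, str], float],  # expensive, higher-quality (BiDAF-style) scorer
    l1_keep: int = 1000,
    l2_keep: int = 10,
) -> List[Tuple[str, float]]:
    """Candidate generation followed by reranking."""
    # L1: score everything cheaply and keep the top candidates
    candidates = sorted(corpus, key=lambda doc: l1_score(query, doc), reverse=True)[:l1_keep]
    # L2: spend the expensive model only on the survivors
    reranked = sorted(
        ((doc, l2_score(query, doc)) for doc in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:l2_keep]
```

In practice the L1 pass would go through an index rather than a full scan, but the contract is the same: each tier narrows the candidate set for the next.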
Original Post: Tiered Architectures, Counterfactual Learning, and Sample Complexity

Why now is the time for dialog

I’m working on a task-oriented dialog product and things are going surprisingly well from a business standpoint. It turns out that existing techniques are sufficient to shift some portion of commercial dialog interactions from human-mediated to machine-mediated, with tremendous associated cost savings which exceed the cost of developing the automatic systems. Here’s the thing that is puzzling: the surplus is so large that, as far as I can tell, it would have been viable to do this 10 years ago with then-current techniques. All the new fancy AI stuff helps, but only to improve the margins. So how come these businesses didn’t appear 10 years ago?

I suspect the answer is that a format shift has occurred away from physical transactions and voice-mediated interactions to digital transactions and chat-mediated interactions.

The movement away from voice is very important: if…
Original Post: Why now is the time for dialog

Software Engineering vs Machine Learning Concepts

Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I’ve noticed.

Divide and Conquer
A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.

Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition, apparently paradoxical situations can arise where a component is…
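A toy calculation (my numbers, not from the post) shows how quickly compounding bites: if each stage of a pipeline is right 90% of the time and errors are roughly independent, a three-stage pipeline is already down to about 73% end to end.

```python
def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy of a pipeline whose stages fail independently."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

print(pipeline_accuracy([0.9, 0.9, 0.9]))  # ~0.729: three decent stages, one mediocre system
```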
Original Post: Software Engineering vs Machine Learning Concepts

Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

Abstractions matter
An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I’ve tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event handlers are running in the same address space), this style is sensitive to any event handler taking too long to execute…
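As an illustration of the style being described (my own sketch, not anything from the post), here is a bare-bones event-driven echo server: one loop watches sockets and dispatches registered handlers, and any slow handler stalls every connection.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.send(data)  # any blocking work here freezes the whole loop
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("localhost", 8080))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    # the core event loop: wait for readiness, then dispatch the stored handler
    for key, _ in sel.select():
        key.data(key.fileobj)
```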
Original Post: Reinforcement Learning and Language Support

Reinforcement Learning as a Service

I’ve been integrating reinforcement learning into an actual product for the last 6 months, and therefore I’m developing an appreciation for what are likely to be common problems. In particular, I’m now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).

Service, not algorithm
Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn’t necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult to do observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data…
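One standard way to make the contrast concrete (a sketch of inverse propensity scoring in general, not a claim about how the decision service implements it): if the system logs the probability with which it chose each action, a new policy can be evaluated offline from that log.

```python
def ips_value(logged, target_policy):
    """Inverse-propensity-scored estimate of a new policy's average reward.

    logged: iterable of (context, action, reward, propensity) tuples, where
            propensity = P(action | context) under the policy that actually ran.
    target_policy: maps a context to a dict of action -> probability.
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logged:
        # reweight logged rewards by how much more (or less) often the
        # target policy would have taken the same action
        weight = target_policy(context).get(action, 0.0) / propensity
        total += weight * reward
        n += 1
    return total / n if n else 0.0
```

The key point is that the propensity has to be recorded at decision time; purely observational data, with unknown propensities, does not support this estimator.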
Original Post: Reinforcement Learning as a Service

Generating Text via Adversarial Training

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make a couple of unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)

They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g. Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.

They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn’t so remarkable in isolation.

They use a pure…
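The flavor of the smoothing trick is roughly the following (my own sketch; the paper’s exact formulation may differ): instead of sampling a discrete token, which blocks gradients, feed the discriminator the softmax-weighted mixture of token embeddings.

```python
import torch
import torch.nn.functional as F

def soft_token_embedding(logits, embedding_matrix, temperature=1.0):
    """Differentiable stand-in for sampling a discrete token.

    logits:           (batch, vocab) unnormalized scores from the generator
    embedding_matrix: (vocab, dim) token embedding table
    Returns a (batch, dim) "expected embedding": a softmax-weighted mixture of
    token embeddings. As temperature -> 0 it approaches the argmax token's
    embedding, but gradients still flow through the softmax.
    """
    probs = F.softmax(logits / temperature, dim=-1)
    return probs @ embedding_matrix
```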
Original Post: Generating Text via Adversarial Training

On the Sustainability of Open Industrial Research

I’m glad OpenAI exists: the more science, the better! Having said that, there was a strange happenstance at NIPS this year. OpenAI released OpenAI Universe, which is their second big release of a platform for measuring and training counterfactual learning algorithms. This is the kind of behaviour you would expect from an organization which is promoting the general advancement of AI without consideration of financial gain. At the same time, Google, Facebook, and Microsoft all announced analogous platforms. Nobody batted an eyelash at the fact that three for-profit organizations were tripping over themselves to give away basic research technologies.

A naive train of thought says that basic research is a public good, subject to the free-rider problem, and therefore will be underfunded by for-profit organizations. If you think this is a strawman position, you haven’t heard of the Cisco model for…
Original Post: On the Sustainability of Open Industrial Research

Dialogue Workshop Recap

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up on.

Evaluation
There is no magic bullet, but check out Helen’s slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:

- Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
- Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
- Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric (sketched below).
- Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
- Jason Williams used imitation metrics (from hand-labeled dialogs) as well as simulation.

As Helen points out, computing…
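Next utterance classification, mentioned in the list above, is easy to state concretely (a sketch under my own assumptions about names and data layout): given a dialog context, the model must rank the true next utterance above randomly drawn distractors, and the metric is the fraction of contexts where it succeeds.

```python
import random

def next_utterance_recall_at_1(model_score, dialogs, num_distractors=9):
    """Fraction of contexts where the true next utterance outranks the distractors.

    model_score: scores a (context, candidate) pair; higher means more likely.
    dialogs:     list of (context, true_next_utterance) pairs.
    """
    pool = [utterance for _, utterance in dialogs]
    correct = 0
    for context, truth in dialogs:
        distractors = random.sample([u for u in pool if u != truth], num_distractors)
        candidates = distractors + [truth]
        best = max(candidates, key=lambda c: model_score(context, c))
        correct += best == truth
    return correct / len(dialogs)
```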
Original Post: Dialogue Workshop Recap