The TTIC Young Researcher Seminar Series features talks by Ph.D. students and postdocs whose research is of broad interest to the computer science community. The series provides an opportunity for early-career researchers to present recent work to and meet with students and faculty at TTIC and nearby universities.

The seminars are typically held on Wednesdays at 10:00am in TTIC Room 530.

To receive announcements regarding the seminar series, please subscribe to the mailing list.

For additional information, please contact David McAllester(mcallester@ttic.edu)

Title:Efficient Convex Optimization Requires Superlinear Memory

Date:May 18, 2022

Speaker:Annie Marsden, Stanford University

Host:Nathan Srebro (nati@ttic.edu)

Abstract:How much memory is necessary to efficiently optimize convex functions? In this talk I will present a result which establishes that superlinear memory is required to be competitive with the best known quadratic memory algorithms. Specifically, any first-order algorithm must use either O(d^{1.25 – delta}) memory or make at least Omega(d^{1 + (4/3) delta}) queries. This resolves (a part of) a COLT 2019 open problem of Woodworth and Srebro.

Title:Hypercontractivity and Small-Set Expansion on High Dimensional Expanders

Date:April 27, 2022

Speaker:Max Hopkins, UCSD

Host:Madhur Tulsiani (madhurt@ttic.edu)

Abstract:Hypercontractivity is one of the most powerful tools in Boolean function analysis. Traditionally studied on the Boolean cube, recent years have seen a number of exciting applications of hypercontractivity on extended domains, most famously including the resolution of Khot’s 2-2 games conjecture. Unfortunately, beyond a few known examples our general understanding of hypercontractivity actually remains remarkably poor, severely limiting further avenues of application.

In this talk, we discuss the first steps towards a unified theory of hypercontractivity based on high dimensional expanders (HDX), a broad class of hypergraphs that have recently seen a series of breakthrough applications in coding theory and approximate sampling. Throughout the talk, we’ll pay special attention to the motivating application of characterizing small-set expansion in graphs, and briefly discuss how the line of work could lead to new insights towards resolving the unique games conjecture.

Based on joint work with Mitali Bafna, Tali Kaufman, and Shachar Lovett to appear at STOC 2022.

Title:Optimization Landscapes in Deep Learning

Date:November 6, 2019

Speaker:Kenji Kawaguchi, Massachusetts Institute of Technology

Host:Nathan Srebro (nati@ttic.edu)

Abstract:Deep learning has provided high-impact data-driven methods in various applications. However, theoretical guarantees in deep learning tend to provide too pessimistic insights with a gap from practical observations, often because of hidden special properties. Identifying such special properties can provide novel theoretical insights, and is potentially helpful for understanding and designing practical methods. In this talk, I will discuss special properties on nonconvex optimization landscapes of deep neural networks, as well as their implications on gradient descent methods and a few results on real-world applications.

Title:Exploring Surrogate Losses for Learning Neural Networks

Date:December 4, 2019

Speaker:Surbhi Goel, University of Texas

Host:Nathan Srebro (nati@ttic.edu)

Abstract:Developing provably efficient algorithms for learning commonly used neural network architectures continues to be a core challenge in machine learning. The underlying difficulty arises from the highly non-convex nature of the optimization problems posed by neural networks. In this talk, I will discuss the power of convex surrogate losses for tackling this underlying non-convexity. I will focus on the setting of ReLU regression and show how the convex surrogate allows us to get approximate guarantees in the challenging agnostic model. I will further show how these techniques give positive results for simple convolutional and fully connected architectures.

Title:Making contextual bandits as easy as regression

Date:February 19, 2020

Speaker:Dylan Foster, Massachusetts Institute of Technology

Abstract:A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond a well-specified model, and works even when contexts are chosen adversarially.

Title:Learning From Sub-Optimal Data

Date:May 1, 2019

Speaker:Bradly Statie, University of Toronto

Host:Gregory Shakhnarovich (gregory@ttic.edu)

Abstract:Learning algorithms typically assume their input data is good natured. If one takes this input data and trains an agent with it, then the agent should, given enough time and compute, eventually learn how to solve the intended task. But this is not always a realistic expectation. Sometimes, the data given to an agent is flawed or fails to fully convey the correct problem. In other words, the input data is sub-optimal. In this talk, we will discuss two recent advances for overcoming sub-optimal data.

First, we consider the problem of imitation learning from sub-optimal demonstrations. In this setting, a robot receives failed or flawed demonstrations of a task. It must learn to infer, and subsequently complete, the intended task from only these failed demonstrations. Results are presented on a variety of robotics problems such as door opening and pick and place.

Second, we consider the problem of learning from sub-optimal reward functions. Often, the reward functions provided to reinforcement learning agents are derived by combining low level primitives such as agent position and velocity. For example, the reward for a robot learning to walk might be its forward velocity plus the position of its head. These reward functions are first and foremost intended for human consumption, not the consumption of an RL algorithm. Consequently, it might be possible to learn a better intrinsic reward function that it is easier for the RL algorithm to optimize against. We provide a new algorithm for learning such intrinsic reward functions. Optimizing against these learned intrinsic rewards leads to better overall agent performance than optimizing against the raw hand-designed reward function. Crucially, these reward functions can be learned on the fly without significant extra computational costs. Results are presented on a variety of MuJoCo tasks and some hard robotics problems such as block stacking.

Title:Non-Interactive Agreement and Dimension Reduction for Polynomials

Date:May 8, 2019

Speaker:Pritish Kamath, Massachusetts Institute of Technology

Host:Nathan Srebro (nati@ttic.edu)

Abstract:The ability to harness and work with randomness has been at the heart of information theory and theoretical computer science. Of particular interest to this talk is the use of randomness by a set of parties that are spread out geographically. For example, randomness can enable such parties to generate and share secrets.

The “Non-interactive Agreement Distillation” problem, specified by a joint distribution P(x,y) and a target alphabet size k, is defined as follows: Two players observe sequences (X_1, … , X_n) and (Y_1, … , Y_n) respectively where {(X_i, Y_i)} are drawn i.i.d. from P(x,y). Both players look at their share of randomness, and output an element of [k]. Their goal is to maximize the probability that their outputs are the same, while ensuring that their outputs are marginally uniform.

Given P(x,y) and k, what is the largest correlation (agreement probability) that the players can achieve? It turns out that this value is not well understood, even in some well motivated special cases. A priori, this value is not even “computable”, as there is no immediate upper bound on the number of samples the players need to draw in order to achieve the best possible correlation!

This talk will describe a recent line of work that obtains an explicit bound on the number of samples needed to get epsilon-close to the maximum achievable correlation. At the heart of our result is a new technique that we call “Dimension Reduction for Polynomials”. It can be viewed as a generalization of the well-known Johnson-Lindenstrauss lemma, which in this context, corresponds to the special case of degree-1 polynomials. We believe this technique could be of broader interest.

This talk will discuss the motivational aspects of the problem, its moral and technical connections to other problems in theoretical computer science and will mainly focus on the technique of “dimension reduction for polynomials”.

Based on joint works with Badih Ghazi, Prasad Raghavendra and Madhu Sudan. [arXiv:1607.04322, arXiv:1708.03808]

Title:TBD

Date:May 13, 2019

Speaker:Richard Zhang, Adobe

Host:Gregory Shakhnarovich (gregory@ttic.edu)

Abstract:In recent years, deep convolutional networks have proven to be extremely adept at discriminative labeling tasks. Not only do networks solve the direct task, they also learn an effective, general representation of the visual world. We explore the use of deep networks for image generation, or synthesis. Generation is challenging, as it is difficult to characterize the perceptual quality of an image, and often times there is more than one “correct” answer. However, we show that networks can indeed perform the graphics task of image generation, and in doing so, learn a representation of the visual world, even without the need for hand-curated labels.

We propose BicycleGAN, a general system for image-to-image translation problems, with the specific aim of capturing the multimodal nature of the output space. We further study image colorization and develop automatic and user-guided approaches. Moreover, colorization, as well as general cross-channel prediction, is a simple but powerful pretext task for self-supervised representation learning. We demonstrate strong transfer to high-level semantic tasks, such as image classification, and to low-level human perceptual similarity judgments. For the latter, we collect a large-scale dataset of human judgments and find that our method outperforms traditional metrics such as PSNR and SSIM. We also discover that many unsupervised and self-supervised representations transfer strongly, even comparable to fully-supervised methods. Despite their strong transfer performance, deep convolutional representations surprisingly lack a basic low-level property – shift-invariance. We propose to incorporate a classic but overlooked signal processing technique, low-pass filtering, into modern deep network architectures.

Title:Insights on Deep Representations with Applications to Healthcare

Date:October 17, 2018

Speaker:Maithra Raghu, Cornell

Host:Nathan Srebro (nati@ttic.edu)

Abstract:The past few years have seen enormous successes in applying deep neural networks in a variety of domains, from speech recognition to healthcare. However, as these models become more complex and are simultaneously used for more sensitive applications, it becomes increasingly important to better understand and intuit the properties of deep representations. In this talk I overview our development of Canonical Correlation Analysis (SVCCA) as a tool to study and compare latent representations learned by deep neural networks. The results provide insights on learning dynamics, help identify where different concepts are learned, and enable measuring the similarity of generalizing and memorizing networks. I demonstrate the importance of representational considerations in model design via a task on predicting doctor disagreements in a healthcare setting.

Title:Learning Sight from Sound

Date:March 28, 2018

Speaker:Andrew Owens

Host:Gregory Shakhnarovich (gregory@ttic.edu)

Abstract:From the clink of a mug placed onto a saucer to the bustle of a busy cafe, our days are filled with visual experiences that are accompanied by distinctive sounds. In this talk, we show that these sounds can provide a rich training signal for learning visual models. First, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We demonstrate this idea by training an algorithm to produce plausible soundtracks for videos in which people hit and scratch objects with a drumstick. Second, we show that ambient audio – e.g., crashing waves, people speaking in a crowd – can also be used to learn visual models. We train a convolutional neural network to predict a statistical summary of the sounds that occur within a scene, and we demonstrate that the learned visual representation conveys information about objects and scenes. Finally, we present an unsupervised learning method for training multi-modal networks that fuse audio and visual data, and apply the learned representation to a number of audio-visual learning tasks.

Title:Generalization and Self-Supervision in Deep Robotic Learning

Date:February 28, 2018

Speaker:Chelsea Finn

Host:Matthew Walter (mwalter@ttic.edu)

Abstract:Machine learning algorithms excel primarily in settings where an engineer can first reduce the problem to a particular function (e.g. an image classifier), and then collect a substantial amount of labeled input-output pairs for that function. In drastic contrast, humans are capable of learning from streams of raw sensory data with minimal external instruction. In this talk, I will argue that, in order to build intelligent systems that are as capable as humans, machine learning models should not be trained in the context of one particular application. Instead, we should be designing systems that can be versatile, can learn in unstructured settings without detailed human-provided labels, and can accomplish many tasks, all while processing high-dimensional sensory inputs. To do so, these systems must be able to actively explore and experiment, collecting data themselves rather than relying on detailed human labels.

My talk will focus on two key aspects of this goal: versatility and self-supervision. I will first show how we can move away from hand-designed, task-specific representations of a robot’s environment by enabling the robot to learn high-capacity models, such as deep networks, for representing complex skills from raw pixels. I will also present an algorithm that learns deep models that can be rapidly adapted to different objects, new visual concepts, or varying environments, leading to versatile behaviors. Beyond versatility, a hallmark of human intelligence is self-supervised learning. I will discuss how we can allow a robot to learn by playing with objects in the environment without any human supervision. From this experience, the robot can acquire a visual predictive model of the physical world that can be used for maneuvering many different objects to varying goals. In all settings, our experiments on simulated and real robot platforms demonstrate the ability to scale to complex, vision-based skills with novel objects.

Title:Look, Listen, and Speak: Vision Systems that Communicate with Natural Language

Date:February 14, 2018

Speaker:Lisa Anne Hendricks

Abstract:Powered by deep convolutional networks and large scale visual datasets, modern visual systems are capable of accurately recognizing thousands of visual categories. However, images and videos contain so much more than categorical labels: they contain information about where objects are located (in a forest or in a kitchen?), what attributes an object has (red or blue?), and how objects interact with other objects in a scene (is the child sitting on a sofa, or running in a field?). Natural language provides an efficient and intuitive way for vision systems to convey important information about a visual scene. In this talk, I will begin by discussing vision systems that “look and speak”. In particular, I will consider how to create more scalable image captioning systems by integrating external data sources. I will then discuss vision systems that “listen and look”. I will introduce the task of moment localization in videos with natural language and detail my effort to collect a large scale dataset for this task. Finally, natural language provides a way for vision systems to not only discuss what is in a scene, but how objects and attributes in an image support a decision, such as a classification decision. I will discuss vision systems that go beyond naming objects in a scene and can generate visual explanations which justify neural network decisions.

Title:Learning Single-view 3D Reconstruction of Objects and Scenes

Date:January 24, 2018

Speaker:Shubham Tulsiani

Host:Gregory Shakhnarovich (gregory@ttic.edu)

Abstract:In this talk, I will discuss the task of inferring 3D structure underlying an image, in particular focusing on two questions - a) how we can plausibly obtain supervisory signal for this task, and b) what forms of representation should we pursue. I will first show that we can leverage image-based supervision to learn single-view 3D prediction, by using geometry as a bridge between the learning systems and the available indirect supervision. We will see that this approach enables learning 3D structure across diverse setups e.g. predicting deformable models or volumetric 3D for objects, or inferring layered-depth images for scenes. I will then advocate the case for inferring interpretable and compositional 3D representations. I will present a method that discovers the coherent compositional structure across objects in a unsupervised manner by attempting to assemble shapes using volumetric primitives, and then demonstrate the advantages of predicting similar factored 3D representations for complex scenes.

Title:A Dynamical View of Optimization Algorithms

Date:October 25, 2017

Speaker:Ashia Wilson

Host:Nathan Srebro (nati@ttic.edu)

Abstract:Optimization is a core primitive in statistics, machine learning, and data analytics. Within these fields, the rapid growth in the scale and complexity of modern datasets has led to a focus on two classes of algorithms: gradient methods and momentum methods (also referred to as accelerated methods). Momentum methods, first proposed by Nesterov in 1983, achieve faster convergence rates than gradient methods. However, unlike gradient methods, they are not descent methods and providing robust performance guarantees remains a challenge. In the Euclidean setting, momentum methods can be understood as modeling the dynamics of a damped harmonic oscillator; making this intuition precise however, and generalizing it to other geometries has been difficult. Furthermore, derivations of momentum methods do not ow from a single underlying principle, but tend to rely on case-specific algebra using a technique - considered by many to be esoteric - called estimate sequences

The first part of our work introduces a variational, continuous-time framework for understanding momentum methods. We show that there is a family of Lagrangian functionals, that we call Bregman Lagrangians, which generate dynamics corresponding to momentum methods in continuous time. In particular, momentum methods can be understood as arising from various discretization techniques applied to these continuous time dynamics. The second part of our work strengthens this connection. We demonstrate how to derive families of Lyapunov functions which can certify rates of convergence for the continuous time momentum dynamics. We further demonstrate how the proofs of convergence of momentum methods can be understood as bounding discretization errors of the Lyapunov function when moving to discrete time. Along the way, we prove an equivalence between these family of Lyapunov functions and the technique of estimate sequences. The following is joint work with Andre Wibisono, Stephen Tu, Shivaram Venkataraman, Alex Gittens,Benjamin Recht, and Michael I. Jordan.

Title:Towards Hierarchical Multiscale Recurrent Neural Networks and their Applications

Date:June 7, 2017

Speaker:Junyoung Chung

Abstract:The recent resurgence of recurrent neural networks has led to remarkable advances in various applications including machine translation, speech recognition, speech synthesis and caption generation. However, learning both hierarchical and temporal representation has been among the longstanding challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence.

In this talk, I will talk about my previous works on multiscale recurrent neural networks. I will show how a deep recurrent neural network with extra gating units can update its layers with different timescales and discover the underlying hierarchical structure of the sequences. The hierarchical multiscale recurrent neural networks have a potential to remove inherent problems of standard recurrent neural networks. In addition, the learned hierarchical structures can be useful information to many other downstream tasks such as extracting story segments for video understanding, adaptively compressing sequences for speech recognition and extracting sub-task structures in hierarchical reinforcement learning.

Title:Sample-Optimal Inference, the Method of Moments, and Community Detection

Date:May 17, 2017

Speaker:Samuel Hopkins

Host:Madhur Tulsiani (madhurt@ttic.edu)

Abstract:We propose a simple and efficient meta-algorithm for Bayesian estimation problems (i.e. hidden variable, latent variable, or planted problems). Our algorithm uses low-degree polynomials together with new and highly robust tensor decomposition methods. We focus on the question: for a given estimation problem, precisely how many samples (up to low-order additive terms) do polynomial-time algorithms require to obtain good estimates of hidden variables? Our meta-algorithm is broadly applicable, and achieves statistical or conjectured computational sample-complexity thresholds for many well-studied problems, including many for which previous algorithms were highly problem-specific.

As a running example we employ the stochastic block model – a widely studied family of random graph models which contain latent community structure. We recover and unify the proofs of the best-known sample complexity bounds for the partial recovery problem in this model. We also give the first provable guarantees for partial recovery of community structure in constant-degree graphs where nodes may participate in many communities simultaneously. This model is known to exhibit a sharp sample complexity threshold – with fewer than a very specific number of samples, recovering community structure becomes impossible. While previous explanations for this phenomenon appeal to sophisticated ideas from statistical mechanics, we give a new and simple explanation based on properties of low-degree polynomials.

Title:Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks

Date:November 30, 2016

Speaker:Emily Denton

Host:Greg Shakhnarovich (gregory@ttic.edu)

Abstract:The main focus of this talk will be on a new simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.

Title:A Task-Oriented Neural Dialogue System

Date:November 9, 2016

Speaker:Tsung-Hsien (Shawn) Wen

Host:Karen Livescu (klivescu@ttic.edu)

Abstract:Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collecting dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.

Title:Conditional Quadratic Hardness for Data Analysis Problems

Date:October 26, 2016

Speaker:Arturs Backurs

Host:Yury Makarychev (yury@ttic.edu)

Abstract:The theory of NP-hardness has been remarkably successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial-time algorithms, but large exponents in their runtime bounds can make them inefficient in practice. For example, quadratic-time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although for many data analysis problems no sub-quadratic time algorithms are known, any evidence of quadratic-time hardness has remained elusive.

In this talk I will give an overview of recent research that aims to remedy this situation. In particular, I will outline conditional hardness results for two problems: computing the edit distance between two strings, and solving the optimization problems defined by Support Vector Machines (with Gaussian kernel) up to high accuracy. Specifically, we show that, if either of these two problems can be solved in time O(n^{2-delta}) for some constant delta>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-epsilon)N} for a constant epsilon>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.

Title:Learning Spatial Priors for Text to 3D Scene Generation

Date:October 19, 2016

Speaker:Angel Chang

Host:Kevin Gimpel (kgimpel@ttic.edu)

Abstract:The ability to form a visual interpretation of the world from natural language is pivotal to human communication. Being able to map descriptions of scenes to 3D geometric representations can be useful in many applications such as robotics and conversational assistants. In this talk, I will present the task of text to 3D scene generation, where a scene description in natural language is automatically converted into a plausible 3D scene interpretation. For example, the sentence “a living room with a red couch and TV” should generate a realistic living room arrangement with the TV in front of the couch and supported by a TV stand. This task lies at the intersection of NLP and computer graphics, and requires techniques from both.

A key challenge in this task is that the space of geometric interpretations is large while natural language text is typically under-specified, omitting shared, common-sense facts about the world. I will describe how we can learn a set of spatial priors from virtual environments, and use them to infer plausible arrangements of objects given a natural language description. I will show that a parallel corpus of virtual 3D scenes and natural language descriptions can be leveraged to extract likely couplings between references and concrete 3D objects (e.g., an “L-shaped red couch”, and the virtual geometric representation of that object). Finally, I will discuss a few exciting directions for future work at the intersection of NLP, graphics, and more broadly AI.

Title:Memory and Communication in Neural Networks

Date:October 5, 2016at 10:00 am

Speaker:Sainbayar Sukhbaatar

Host:David McAllester (mcallester@ttic.edu)

Abstract:In this talk, I will talk about my two recent works on equipping neural networks with an external memory and communication. In the first part, I will explain about how an external memory can be attached to neural networks. The memory can store variable number of items, which then can be can accessed by a soft-attention mechanism. This whole model can be trained end-to-end with simple back-propagation. Since the model can choose which part of the input to read, it makes it suitable for tasks with out-of-order structure, such as question answering after reading a short story. In the second part, I will talk about how multiple neural networks can learn to communicate with each other to solve a common task. Instead of using discrete symbols for communication, we allow networks to exchange continuous vectors so it can be trained with back-propagation. I will demonstrate the model on multi-agent reinforcement learning tasks.

Title:Understanding scenes over time with unlabeled video

Date:September 21, 2016

Speaker:Carl Vondrick

Host:Gregory Shakhnarovich (gregory@ttic.edu)

Abstract:Learning models with a rich understanding of time is a key problem in machine perception. However, the need for large, labeled video datasets is a major obstacle because video is often expensive and ambiguous to annotate. In this talk, we capitalize on years of unlabeled, in-the-wild videos to train models for two temporal tasks, visual anticipation and sound recognition. First, we present a deep convolutional network for multi-modal regression, allowing us to model uncertainty and more accurately anticipate human actions. To support pixel-level predictions, we introduce a layered generative video model, facilitating dense extrapolation of videos. Second, we utilize videos’ natural synchronization between vision and sound to learn deep sound representations, which our experiments suggest learn some high-level semantics. We believe unlabeled video is a valuable resource for perception, and can impact many applications in robotics, recognition, and forecasting.

Title:Modular neural architectures for grounded language learning

Date:September 28, 2016

Speaker:Jacob Andreas

Host:Kevin Gimpel (kgimpel@ttic.edu)

Abstract:Language understanding depends on two abilities: an ability to translate between natural language utterances and abstract representations of meaning, and an ability to relate such meaning representations to the perceived world. In the natural language processing literature, these tasks are respectively known as “semantic parsing” and “grounding”, and have been treated as essentially independent problems. In this talk, I will present two modular neural architectures for jointly learning to ground language in the world and reason about it compositionally.

I will first describe a technique that uses syntactic information to dynamically construct neural networks from composable primitives. The resulting structures, called “neural module networks”, can be used to achieve state-of-the-art results on a variety of grounded question answering tasks. Next, I will present a model for contextual referring expression generation, in which contrastive behavior results from a combination of learned semantics and inference-driven pragmatics. This model is again backed by modular neural components—in this case elementary listener and speaker representations. It is able to successfully complete a challenging referring expression generation task, exhibiting pragmatic behavior without ever observing such behavior at training time. I will conclude by outlining possible applications of this framework to control and planning problems.

Title:Scene Understanding from RGB-D Images

Date:May 11, 2016

Speaker:Saurabh Gupta, Berkeley

Host:Greg Shakhnarovich (gregory@ttic.edu)

Abstract:In this talk, I will talk about detailed scene understanding from RGB-D images. We approach this problem by studying central computer vision problems like bottom-up grouping, object detection, instance segmentation, pose estimation in context of RGB-D images, and finally aligning CAD models to objects in the scene. This results in a detailed output which goes beyond what most current computer vision algorithms produce, and is useful for real world applications like perceptual robotics, and augmented reality. A central question in this work is how to learn good features for depth images in view of the fact that labeled RGB-D datasets are much smaller than labeled RGB datasets (such as ImageNet) typically used for feature learning. To this end I will describe our technique called “cross-modal distillation” which allows us to leverage easily available annotations on RGB images to learn representations on depth images. In addition, I will also briefly talk about some work on vision and language that I did on an internship at Microsoft Research.

Title:Neural Dialogue Generation

Date:April 20, 2016

Speaker:Jiwei Li (Stanford University)

Host:Kevin Gimpel (kgimpel@ttic.edu)

Abstract:Recent neural generation models present both new opportunities and new challenges for developing conversational agents. In this talk, I will describe how we have advanced this line of research by addressing four different issues in neural dialogue generation: (1) overcoming the overwhelming prevalence of dull responses (e.g., “I don’t know”) generated from neural models; (2) enforcing speaker consistency; (3) leveraging information from conversational history; and (4) applying reinforcement learning to foster sustained dialogue interactions.

Title:Initialization and Dual Expressivity of Neural Networks

Date:March 9, 2016

Speaker:Roy Frostig (Stanford)

Host:Nati Srebro (nati@ttic.edu)

Abstract:Neural network learning is seeing wide empirical success as an applied machine learning tool, yet we have only a nascent theoretical understanding of its recent advances, and of the design choices made in achieving them. In turn, the tools in use are often without guarantees and without useful formalisms to guide their development.

In this talk, I will present recent work that establishes a duality between neural networks and a certain notion of compositional kernels. The connection clarifies the effective modeling capacity of networks due to their architecture. We show that the data representation induced by networks under a common random initialization scheme is rich enough to express all functions in their dual kernel space. Indeed, in this space, easily learnable functions (/img.e. those of low norm) are expressive according to a succinct graph structure underlying the network architecture. An immediate upshot is that, although the network training objective is hard to optimize in the worst case, the initial weights form a good starting point from a modeling perspective (i.e. in inducing features).

The talk is based on a paper (arXiv:1602.05897) from work joint with Amit Daniely and Yoram Singer.

Title:Using Motion to Understand Objects in the Real World

Date:March 2, 2016

Speaker:David Held (Stanford)

Host:David McAllester (mcallester@ttic.edu)

Abstract:Many robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If we model how the world changes over time, we can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I show how we can use this idea to improve performance on three different tasks: velocity estimation, segmentation, and tracking with neural networks. By modeling the causes of appearance changes over time, we can make our methods more robust to a variety of challenging situations that commonly occur in the real-world, thus enabling robots to come out of the factory and into our lives.

Title:Machine Learning for Observational Studies

Date:February 17, 2016

Speaker:Uri Shalit (NYU)

Host:Nati Srebro (nati@ttic.edu)

Abstract:The proliferation of data collection in the healthcare, educational, and economic spheres, brings with it opportunities for extracting new knowledge with concrete policy implications. Examples include identifying best medical practices from electronic healthcare records, or understanding the implications of different teaching techniques from board of education surveys. Such policy decisions often hinge on understanding causal links - does medication A cause better outcomes than medication B? However, unlike randomized studies where a treatment is assigned to a random subpopulation, learning causal links from these so-called “observational studies” is difficult, because confounding factors might obscure the true causal links underlying the observed data.

In this talk I will discuss observational studies from a machine learning perspective. I will then show how we use machine learning techniques to try and learn causal relationships from these studies. Specifically we will discuss two methods: one based on using integral probability metrics to optimally re-weigh the data, and the other based on learning a balanced representation, using ideas from domain adaptation.

Title:Locality-Sensitive Hashing and Beyond

Date:February 10, 2016

Speaker:Ilya Razenshteyn (MIT CSAIL)

Host:Yury Makarychev (yury@ttic.edu)

Abstract:Locality-Sensitive Hashing (LSH) is a powerful technique for the approximate nearest neighbor search (ANN) in high dimensions.

In this talk I will present two recent results.

1) I will show a data structure for ANN for the Euclidean distance that provably outperforms the best possible LSH-based data structure. We proceed via designing a gooddata-dependenthash family.

2) I will show a practical and optimal LSH family for the cosine similarity (a.k.a. Euclidean distance on a sphere). It substantially outperforms the celebrated Hyperplane LSH family. Along the way, I will try to debunk two popular myths about LSH:

* LSH-based data structures consume too much memory and are thus impractical;

* Optimal LSH constructions are too complicated to be made practical.

The talk is based on two papers: arXiv:1501.01062 (joint with Alexandr Andoni, STOC 2015) and arXiv:1509.02897 (joint with Alexandr Andoni, Piotr Indyk, Thijs Laarhoven and Ludwig Schmidt, NIPS 2015).

Title:End-to-End Speech Recognition using Deep LSTMs, CTC Training and WFST Decoding

Date:February 3, 2016

Speaker:Yajie Miao (CMU)

Host:Karen Livescu (klivescu@ttic.edu)

Abstract:Deep learning has tremendously improved the performance of automatic speech recognition (ASR). Despite this progress, developing ASR systems remains a challenging task, requiring various resources, multiple training stages and significant expertise. This talk will present Eesen, an end-to-end ASR framework which drastically simplifies the existing speech recognition paradigm. Acoustic modeling in Eesen involves learning a single deep Long Short-Term Memory (LSTM) network predicting context-independent phonemes or characters. We adopt the connectionist temporal classification (CTC) objective function to learn the alignments between speech and label sequences. A nice property of Eesen is a generalized decoding method based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. With experiments on various datasets and languages, we will see that our end-to-end systems achieve comparable recognition accuracy to the state-of-the-art hybrid approach. In addition, we will present empirical analysis to shed light on how CTC training behaves under different conditions.