The TTIC Young Researcher Seminar Series features talks by Ph.D. students and postdocs whose research is of broad interest to the computer science community. The series provides an opportunity for early-career researchers to present recent work to and meet with students and faculty at TTIC and nearby universities.

The seminars are typically held on Wednesdays at 10:30am in TTIC Room 526.

To receive announcements regarding the seminar series, please subscribe to the mailing list.

For additional information, please contact Matthew Walter (mwalter@ttic.edu)

Title: Insights on Deep Representations with Applications to Healthcare

Date: October 17, 2018

Speaker: Maithra Raghu, Cornell

Host: Nathan Srebro (nati@ttic.edu)

Abstract: The past few years have seen enormous successes in applying deep neural networks in a variety of domains, from speech recognition to healthcare. However, as these models become more complex and are simultaneously used for more sensitive applications, it becomes increasingly important to better understand and intuit the properties of deep representations. In this talk I overview our development of Canonical Correlation Analysis (SVCCA) as a tool to study and compare latent representations learned by deep neural networks. The results provide insights on learning dynamics, help identify where different concepts are learned, and enable measuring the similarity of generalizing and memorizing networks. I demonstrate the importance of representational considerations in model design via a task on predicting doctor disagreements in a healthcare setting.


Title: Learning Sight from Sound

Date: March 28, 2018

Speaker: Andrew Owens

Host: Gregory Shakhnarovich (gregory@ttic.edu)

Abstract: From the clink of a mug placed onto a saucer to the bustle of a busy cafe, our days are filled with visual experiences that are accompanied by distinctive sounds. In this talk, we show that these sounds can provide a rich training signal for learning visual models. First, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We demonstrate this idea by training an algorithm to produce plausible soundtracks for videos in which people hit and scratch objects with a drumstick. Second, we show that ambient audio – e.g., crashing waves, people speaking in a crowd – can also be used to learn visual models. We train a convolutional neural network to predict a statistical summary of the sounds that occur within a scene, and we demonstrate that the learned visual representation conveys information about objects and scenes. Finally, we present an unsupervised learning method for training multi-modal networks that fuse audio and visual data, and apply the learned representation to a number of audio-visual learning tasks.


Title: Generalization and Self-Supervision in Deep Robotic Learning

Date: February 28, 2018

Speaker: Chelsea Finn

Host: Matthew Walter (mwalter@ttic.edu)

Abstract: Machine learning algorithms excel primarily in settings where an engineer can first reduce the problem to a particular function (e.g. an image classifier), and then collect a substantial amount of labeled input-output pairs for that function. In drastic contrast, humans are capable of learning from streams of raw sensory data with minimal external instruction. In this talk, I will argue that, in order to build intelligent systems that are as capable as humans, machine learning models should not be trained in the context of one particular application. Instead, we should be designing systems that can be versatile, can learn in unstructured settings without detailed human-provided labels, and can accomplish many tasks, all while processing high-dimensional sensory inputs. To do so, these systems must be able to actively explore and experiment, collecting data themselves rather than relying on detailed human labels.

My talk will focus on two key aspects of this goal: versatility and self-supervision. I will first show how we can move away from hand-designed, task-specific representations of a robot’s environment by enabling the robot to learn high-capacity models, such as deep networks, for representing complex skills from raw pixels. I will also present an algorithm that learns deep models that can be rapidly adapted to different objects, new visual concepts, or varying environments, leading to versatile behaviors. Beyond versatility, a hallmark of human intelligence is self-supervised learning. I will discuss how we can allow a robot to learn by playing with objects in the environment without any human supervision. From this experience, the robot can acquire a visual predictive model of the physical world that can be used for maneuvering many different objects to varying goals. In all settings, our experiments on simulated and real robot platforms demonstrate the ability to scale to complex, vision-based skills with novel objects.


Title: Look, Listen, and Speak: Vision Systems that Communicate with Natural Language

Date: February 14, 2018

Speaker: Lisa Anne Hendricks

Abstract: Powered by deep convolutional networks and large scale visual datasets, modern visual systems are capable of accurately recognizing thousands of visual categories. However, images and videos contain so much more than categorical labels: they contain information about where objects are located (in a forest or in a kitchen?), what attributes an object has (red or blue?), and how objects interact with other objects in a scene (is the child sitting on a sofa, or running in a field?). Natural language provides an efficient and intuitive way for vision systems to convey important information about a visual scene. In this talk, I will begin by discussing vision systems that “look and speak”. In particular, I will consider how to create more scalable image captioning systems by integrating external data sources. I will then discuss vision systems that “listen and look”. I will introduce the task of moment localization in videos with natural language and detail my effort to collect a large scale dataset for this task. Finally, natural language provides a way for vision systems to not only discuss what is in a scene, but how objects and attributes in an image support a decision, such as a classification decision. I will discuss vision systems that go beyond naming objects in a scene and can generate visual explanations which justify neural network decisions.


Title: Learning Single-view 3D Reconstruction of Objects and Scenes

Date: January 24, 2018

Speaker: Shubham Tulsiani

Host: Gregory Shakhnarovich (gregory@ttic.edu)

Abstract: In this talk, I will discuss the task of inferring 3D structure underlying an image, in particular focusing on two questions - a) how we can plausibly obtain supervisory signal for this task, and b) what forms of representation should we pursue. I will first show that we can leverage image-based supervision to learn single-view 3D prediction, by using geometry as a bridge between the learning systems and the available indirect supervision. We will see that this approach enables learning 3D structure across diverse setups e.g. predicting deformable models or volumetric 3D for objects, or inferring layered-depth images for scenes. I will then advocate the case for inferring interpretable and compositional 3D representations. I will present a method that discovers the coherent compositional structure across objects in a unsupervised manner by attempting to assemble shapes using volumetric primitives, and then demonstrate the advantages of predicting similar factored 3D representations for complex scenes.


Title: A Dynamical View of Optimization Algorithms

Date: October 25, 2017

Speaker: Ashia Wilson

Host: Nathan Srebro (nati@ttic.edu)

Abstract: Optimization is a core primitive in statistics, machine learning, and data analytics. Within these fields, the rapid growth in the scale and complexity of modern datasets has led to a focus on two classes of algorithms: gradient methods and momentum methods (also referred to as accelerated methods). Momentum methods, first proposed by Nesterov in 1983, achieve faster convergence rates than gradient methods. However, unlike gradient methods, they are not descent methods and providing robust performance guarantees remains a challenge. In the Euclidean setting, momentum methods can be understood as modeling the dynamics of a damped harmonic oscillator; making this intuition precise however, and generalizing it to other geometries has been difficult. Furthermore, derivations of momentum methods do not ow from a single underlying principle, but tend to rely on case-specific algebra using a technique - considered by many to be esoteric - called estimate sequences

The first part of our work introduces a variational, continuous-time framework for understanding momentum methods. We show that there is a family of Lagrangian functionals, that we call Bregman Lagrangians, which generate dynamics corresponding to momentum methods in continuous time. In particular, momentum methods can be understood as arising from various discretization techniques applied to these continuous time dynamics. The second part of our work strengthens this connection. We demonstrate how to derive families of Lyapunov functions which can certify rates of convergence for the continuous time momentum dynamics. We further demonstrate how the proofs of convergence of momentum methods can be understood as bounding discretization errors of the Lyapunov function when moving to discrete time. Along the way, we prove an equivalence between these family of Lyapunov functions and the technique of estimate sequences. The following is joint work with Andre Wibisono, Stephen Tu, Shivaram Venkataraman, Alex Gittens,Benjamin Recht, and Michael I. Jordan.


Title: Towards Hierarchical Multiscale Recurrent Neural Networks and their Applications

Date: June 7, 2017

Speaker: Junyoung Chung

Abstract: The recent resurgence of recurrent neural networks has led to remarkable advances in various applications including machine translation, speech recognition, speech synthesis and caption generation. However, learning both hierarchical and temporal representation has been among the longstanding challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence.

In this talk, I will talk about my previous works on multiscale recurrent neural networks. I will show how a deep recurrent neural network with extra gating units can update its layers with different timescales and discover the underlying hierarchical structure of the sequences. The hierarchical multiscale recurrent neural networks have a potential to remove inherent problems of standard recurrent neural networks. In addition, the learned hierarchical structures can be useful information to many other downstream tasks such as extracting story segments for video understanding, adaptively compressing sequences for speech recognition and extracting sub-task structures in hierarchical reinforcement learning.


Title: Sample-Optimal Inference, the Method of Moments, and Community Detection

Date: May 17, 2017

Speaker: Samuel Hopkins

Host: Madhur Tulsiani (madhurt@ttic.edu)

Abstract: We propose a simple and efficient meta-algorithm for Bayesian estimation problems (i.e. hidden variable, latent variable, or planted problems). Our algorithm uses low-degree polynomials together with new and highly robust tensor decomposition methods. We focus on the question: for a given estimation problem, precisely how many samples (up to low-order additive terms) do polynomial-time algorithms require to obtain good estimates of hidden variables? Our meta-algorithm is broadly applicable, and achieves statistical or conjectured computational sample-complexity thresholds for many well-studied problems, including many for which previous algorithms were highly problem-specific.

As a running example we employ the stochastic block model – a widely studied family of random graph models which contain latent community structure. We recover and unify the proofs of the best-known sample complexity bounds for the partial recovery problem in this model. We also give the first provable guarantees for partial recovery of community structure in constant-degree graphs where nodes may participate in many communities simultaneously. This model is known to exhibit a sharp sample complexity threshold – with fewer than a very specific number of samples, recovering community structure becomes impossible. While previous explanations for this phenomenon appeal to sophisticated ideas from statistical mechanics, we give a new and simple explanation based on properties of low-degree polynomials.


Title: Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks

Date: November 30, 2016

Speaker: Emily Denton

Host: Greg Shakhnarovich (gregory@ttic.edu)

Abstract: The main focus of this talk will be on a new simple semi-supervised learning approach for images based on in-painting using an adversarial loss. Images with random patches removed are presented to a generator whose task is to fill in the hole, based on the surrounding pixels. The in-painted images are then presented to a discriminator network that judges if they are real (unaltered training images) or not. This task acts as a regularizer for standard supervised training of the discriminator. Using our approach we are able to directly train large VGG-style networks in a semi-supervised fashion. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance comparable or superior to existing methods.


Title: A Task-Oriented Neural Dialogue System

Date: November 9, 2016

Speaker: Tsung-Hsien (Shawn) Wen

Host: Karen Livescu (klivescu@ttic.edu)

Abstract: Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collecting dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.


Title: Conditional Quadratic Hardness for Data Analysis Problems

Date: October 26, 2016

Speaker: Arturs Backurs

Host: Yury Makarychev (yury@ttic.edu)

Abstract: The theory of NP-hardness has been remarkably successful in identifying problems that are unlikely to be solvable in polynomial time. However, many other important problems do have polynomial-time algorithms, but large exponents in their runtime bounds can make them inefficient in practice. For example, quadratic-time algorithms, although practical on moderately sized inputs, can become inefficient on big data problems that involve gigabytes or more of data. Although for many data analysis problems no sub-quadratic time algorithms are known, any evidence of quadratic-time hardness has remained elusive.

In this talk I will give an overview of recent research that aims to remedy this situation. In particular, I will outline conditional hardness results for two problems: computing the edit distance between two strings, and solving the optimization problems defined by Support Vector Machines (with Gaussian kernel) up to high accuracy. Specifically, we show that, if either of these two problems can be solved in time O(n^{2-delta}) for some constant delta>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} 2^{(1-epsilon)N} for a constant epsilon>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.


Title: Learning Spatial Priors for Text to 3D Scene Generation

Date: October 19, 2016

Speaker: Angel Chang

Host: Kevin Gimpel (kgimpel@ttic.edu)

Abstract: The ability to form a visual interpretation of the world from natural language is pivotal to human communication. Being able to map descriptions of scenes to 3D geometric representations can be useful in many applications such as robotics and conversational assistants. In this talk, I will present the task of text to 3D scene generation, where a scene description in natural language is automatically converted into a plausible 3D scene interpretation. For example, the sentence “a living room with a red couch and TV” should generate a realistic living room arrangement with the TV in front of the couch and supported by a TV stand. This task lies at the intersection of NLP and computer graphics, and requires techniques from both.

A key challenge in this task is that the space of geometric interpretations is large while natural language text is typically under-specified, omitting shared, common-sense facts about the world. I will describe how we can learn a set of spatial priors from virtual environments, and use them to infer plausible arrangements of objects given a natural language description. I will show that a parallel corpus of virtual 3D scenes and natural language descriptions can be leveraged to extract likely couplings between references and concrete 3D objects (e.g., an “L-shaped red couch”, and the virtual geometric representation of that object). Finally, I will discuss a few exciting directions for future work at the intersection of NLP, graphics, and more broadly AI.


Title: Memory and Communication in Neural Networks

Date: October 5, 2016 at 10:00 am

Speaker: Sainbayar Sukhbaatar

Host: David McAllester (mcallester@ttic.edu)

Abstract: In this talk, I will talk about my two recent works on equipping neural networks with an external memory and communication. In the first part, I will explain about how an external memory can be attached to neural networks. The memory can store variable number of items, which then can be can accessed by a soft-attention mechanism. This whole model can be trained end-to-end with simple back-propagation. Since the model can choose which part of the input to read, it makes it suitable for tasks with out-of-order structure, such as question answering after reading a short story. In the second part, I will talk about how multiple neural networks can learn to communicate with each other to solve a common task. Instead of using discrete symbols for communication, we allow networks to exchange continuous vectors so it can be trained with back-propagation. I will demonstrate the model on multi-agent reinforcement learning tasks.


Title: Understanding scenes over time with unlabeled video

Date: September 21, 2016

Speaker: Carl Vondrick

Host: Gregory Shakhnarovich (gregory@ttic.edu)

Abstract: Learning models with a rich understanding of time is a key problem in machine perception. However, the need for large, labeled video datasets is a major obstacle because video is often expensive and ambiguous to annotate. In this talk, we capitalize on years of unlabeled, in-the-wild videos to train models for two temporal tasks, visual anticipation and sound recognition. First, we present a deep convolutional network for multi-modal regression, allowing us to model uncertainty and more accurately anticipate human actions. To support pixel-level predictions, we introduce a layered generative video model, facilitating dense extrapolation of videos. Second, we utilize videos’ natural synchronization between vision and sound to learn deep sound representations, which our experiments suggest learn some high-level semantics. We believe unlabeled video is a valuable resource for perception, and can impact many applications in robotics, recognition, and forecasting.


Title: Modular neural architectures for grounded language learning

Date: September 28, 2016

Speaker: Jacob Andreas

Host: Kevin Gimpel (kgimpel@ttic.edu)

Abstract: Language understanding depends on two abilities: an ability to translate between natural language utterances and abstract representations of meaning, and an ability to relate such meaning representations to the perceived world. In the natural language processing literature, these tasks are respectively known as “semantic parsing” and “grounding”, and have been treated as essentially independent problems. In this talk, I will present two modular neural architectures for jointly learning to ground language in the world and reason about it compositionally.

I will first describe a technique that uses syntactic information to dynamically construct neural networks from composable primitives. The resulting structures, called “neural module networks”, can be used to achieve state-of-the-art results on a variety of grounded question answering tasks. Next, I will present a model for contextual referring expression generation, in which contrastive behavior results from a combination of learned semantics and inference-driven pragmatics. This model is again backed by modular neural components—in this case elementary listener and speaker representations. It is able to successfully complete a challenging referring expression generation task, exhibiting pragmatic behavior without ever observing such behavior at training time. I will conclude by outlining possible applications of this framework to control and planning problems.


Title: Scene Understanding from RGB-D Images

Date: May 11, 2016

Speaker: Saurabh Gupta, Berkeley

Host: Greg Shakhnarovich (gregory@ttic.edu)

Abstract: In this talk, I will talk about detailed scene understanding from RGB-D images. We approach this problem by studying central computer vision problems like bottom-up grouping, object detection, instance segmentation, pose estimation in context of RGB-D images, and finally aligning CAD models to objects in the scene. This results in a detailed output which goes beyond what most current computer vision algorithms produce, and is useful for real world applications like perceptual robotics, and augmented reality. A central question in this work is how to learn good features for depth images in view of the fact that labeled RGB-D datasets are much smaller than labeled RGB datasets (such as ImageNet) typically used for feature learning. To this end I will describe our technique called “cross-modal distillation” which allows us to leverage easily available annotations on RGB images to learn representations on depth images. In addition, I will also briefly talk about some work on vision and language that I did on an internship at Microsoft Research.


Title: Neural Dialogue Generation

Date: April 20, 2016

Speaker: Jiwei Li (Stanford University)

Host: Kevin Gimpel (kgimpel@ttic.edu)

Abstract: Recent neural generation models present both new opportunities and new challenges for developing conversational agents. In this talk, I will describe how we have advanced this line of research by addressing four different issues in neural dialogue generation: (1) overcoming the overwhelming prevalence of dull responses (e.g., “I don’t know”) generated from neural models; (2) enforcing speaker consistency; (3) leveraging information from conversational history; and (4) applying reinforcement learning to foster sustained dialogue interactions.


Title: Initialization and Dual Expressivity of Neural Networks

Date: March 9, 2016

Speaker: Roy Frostig (Stanford)

Host: Nati Srebro (nati@ttic.edu)

Abstract: Neural network learning is seeing wide empirical success as an applied machine learning tool, yet we have only a nascent theoretical understanding of its recent advances, and of the design choices made in achieving them. In turn, the tools in use are often without guarantees and without useful formalisms to guide their development.

In this talk, I will present recent work that establishes a duality between neural networks and a certain notion of compositional kernels. The connection clarifies the effective modeling capacity of networks due to their architecture. We show that the data representation induced by networks under a common random initialization scheme is rich enough to express all functions in their dual kernel space. Indeed, in this space, easily learnable functions (/img.e. those of low norm) are expressive according to a succinct graph structure underlying the network architecture. An immediate upshot is that, although the network training objective is hard to optimize in the worst case, the initial weights form a good starting point from a modeling perspective (i.e. in inducing features).

The talk is based on a paper (arXiv:1602.05897) from work joint with Amit Daniely and Yoram Singer.


Title: Using Motion to Understand Objects in the Real World

Date: March 2, 2016

Speaker: David Held (Stanford)

Host: David McAllester (mcallester@ttic.edu)

Abstract: Many robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If we model how the world changes over time, we can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I show how we can use this idea to improve performance on three different tasks: velocity estimation, segmentation, and tracking with neural networks. By modeling the causes of appearance changes over time, we can make our methods more robust to a variety of challenging situations that commonly occur in the real-world, thus enabling robots to come out of the factory and into our lives.


Title: Machine Learning for Observational Studies

Date: February 17, 2016

Speaker: Uri Shalit (NYU)

Host: Nati Srebro (nati@ttic.edu)

Abstract: The proliferation of data collection in the healthcare, educational, and economic spheres, brings with it opportunities for extracting new knowledge with concrete policy implications. Examples include identifying best medical practices from electronic healthcare records, or understanding the implications of different teaching techniques from board of education surveys. Such policy decisions often hinge on understanding causal links - does medication A cause better outcomes than medication B? However, unlike randomized studies where a treatment is assigned to a random subpopulation, learning causal links from these so-called “observational studies” is difficult, because confounding factors might obscure the true causal links underlying the observed data.

In this talk I will discuss observational studies from a machine learning perspective. I will then show how we use machine learning techniques to try and learn causal relationships from these studies. Specifically we will discuss two methods: one based on using integral probability metrics to optimally re-weigh the data, and the other based on learning a balanced representation, using ideas from domain adaptation.


Title: Locality-Sensitive Hashing and Beyond

Date: February 10, 2016

Speaker: Ilya Razenshteyn (MIT CSAIL)

Host: Yury Makarychev (yury@ttic.edu)

Abstract: Locality-Sensitive Hashing (LSH) is a powerful technique for the approximate nearest neighbor search (ANN) in high dimensions.

In this talk I will present two recent results.

1) I will show a data structure for ANN for the Euclidean distance that provably outperforms the best possible LSH-based data structure. We proceed via designing a good data-dependent hash family.

2) I will show a practical and optimal LSH family for the cosine similarity (a.k.a. Euclidean distance on a sphere). It substantially outperforms the celebrated Hyperplane LSH family. Along the way, I will try to debunk two popular myths about LSH:

* LSH-based data structures consume too much memory and are thus impractical;

* Optimal LSH constructions are too complicated to be made practical.

The talk is based on two papers: arXiv:1501.01062 (joint with Alexandr Andoni, STOC 2015) and arXiv:1509.02897 (joint with Alexandr Andoni, Piotr Indyk, Thijs Laarhoven and Ludwig Schmidt, NIPS 2015).


Title: End-to-End Speech Recognition using Deep LSTMs, CTC Training and WFST Decoding

Date: February 3, 2016

Speaker: Yajie Miao (CMU)

Host: Karen Livescu (klivescu@ttic.edu)

Abstract: Deep learning has tremendously improved the performance of automatic speech recognition (ASR). Despite this progress, developing ASR systems remains a challenging task, requiring various resources, multiple training stages and significant expertise. This talk will present Eesen, an end-to-end ASR framework which drastically simplifies the existing speech recognition paradigm. Acoustic modeling in Eesen involves learning a single deep Long Short-Term Memory (LSTM) network predicting context-independent phonemes or characters. We adopt the connectionist temporal classification (CTC) objective function to learn the alignments between speech and label sequences. A nice property of Eesen is a generalized decoding method based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. With experiments on various datasets and languages, we will see that our end-to-end systems achieve comparable recognition accuracy to the state-of-the-art hybrid approach. In addition, we will present empirical analysis to shed light on how CTC training behaves under different conditions.