The Machine Learning Seminar Series features weekly talks on recent research in machine learning.

The seminar is typically held Fridays 10-11am at TTIC

To receive announcements regarding the talks please subscribe to the mailing list..

Date: November 17, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Jialei Wang, TTI Chicago

Abstract: The fast growing size of modern data sets and the nature of data acquisition motivate new algorithms and systems that are able to learn from distributed data efficiently. In this talk, I will present two recent work that explore the theoretical foundations of distributed machine learning. First, I will present novel algorithms to learn sparse linear predictors from distributed data. To match centralized optimal predictor, the number of rounds required to communicate only scales logarithmically with the number of machines. Second, I will discuss how to parallelize the stochastic gradient descent (SGD) with a more sophisticated minibatch prox scheme, which improves over minibatch SGD by allowing larger minibatch size without slowing down the convergence.

This is joint work with Mladen Kolar, Nathan Srebro, Weiran Wang and Tong Zhang.

Date: October 13, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Song Liu, University of Bristol

Abstract: Density ratio estimation has become a versatile tool in machine learning community recently. However, due to its unbounded nature, density ratio estimation is vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator.

Date: October 5, 2017 11am-noon

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: We consider the fundamental problem of learning from expert advice, a.k.a online no-regret learning, where we have access to an offline optimization oracle that can be used to compute, in constant time, the best performing expert at any point in time. We consider the design of no-regret algorithms that are computationally efficient using such an oracle. We present structural properties under which we show oracle-efficient no-regret algorithms exist, even when the set of experts is exponentially large in a natural representation of the problem. Our algorithm is a generalization of the Follow-The-Perturbed-Leader algorithm of Kalai and Vempala that at every step follows the best-performing expert subject to some perturbations. Our design uses a shared source of randomness across all experts that can be efficiently implemented by using an oracle on a random modification of the history of the play at every time step.

Our second main contribution is showing that the structural properties required for our oracle-efficient online algorithm are present in a large class problems. As examples, we discuss applications of our oracle-efficient learning results to the adaptive optimization of large classes of auctions, including (1) VCG auctions with bidder-specific reserves in single-parameter settings, (2) envy-free item pricing in multi-item auctions, and (3) Myerson-like auctions for single-item settings.

This talk is based on joint work with Miro Dudik, Haipeng Luo, Rob Schapire, Vasilis Syrgkanis, and Jenn Wortman Vaughan.

Date: September 29, 2017 11am-noon

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: In this talk we will consider linear structural equation models (SEMs) with non-Gaussian errors. In particular, these models assume that each observed variable is a linear functions of the other variables, plus some error term. The first half of the talk will consider SEMs in which the error terms may be dependent; these models correspond to mixed graphs. We propose empirical likelihood procedures for inference and estimation, and suggest several modifications–including a profile likelihood–in order to improve tractability and performance. We show that when the error distributions are non-Gaussian, the use of EL may increase statistical efficiency and improve assessment of significance. In the second half of the talk, we will consider SEMs which correspond to directed acyclic graphs (DAGs). It has been previously shown for DAGs, when the error terms in a SEM are non-Gaussian, the exact causal structure–not simply a larger equivalence class–can be identified. We show that for suitably sparse graphs, when the error terms follow a log-concave distribution (but are non-Gaussian), the graph can also be identified in the high dimensional setting where the number of variables may exceed the number of observations.

Date: June 2, 2017 11am-noon

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are assumed to degrade statistical performance relative to exact methods. In applied mathematics, where much of the current theoretical work on approximation resides, the inputs are considered to be observed exactly. The prevailing philosophy is that while the exact problem is, regrettably, unsolvable, any approximation should be as small as possible. However, from a statistical perspective, an approximate or regularized solution may be preferable to the exact one. Regularization formalizes a trade-off between fidelity to the data and adherence to prior knowledge about the data-generating process such as smoothness or sparsity. The resulting estimator tends to be more useful, interpretable, and suitable as an input to other methods.

In this work, we propose new methodology for estimation and prediction under a linear model borrowing insights from the approximation literature. We explore these procedures from a statistical perspective and find that in many cases they improve both computational and statistical performance.

Date: May 12, 2017 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: In the probabilistic topic models, the quantity of interest—a low- rank matrix consisting of topic vectors—is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However, the connection between this low-rank matrix and the singular vectors of the text corpus matrix are usually complicated and hard to spell out, so how to use SVD for learning topic models faces challenges.

We overcome the challenge by revealing a surprising insight: there is a low-dimensional simplex structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix, and which allows us to conveniently reconstruct the former using the latter. Such an insight motivates a new SVD- based approach to learning topic models.

For asymptotic analysis, we show that under the popular probabilistic topic model (Hofmann, 1999), the convergence rate of the l1-error of our method matches that of the minimax lower bound, up to a multi-logarithmic term. In showing these results, we have derived new element-wise bounds on the singular vectors and several large- deviation bounds for weakly dependent multinomial data. Our results on the convergence rate and asymptotical minimaxity are new.

We have applied our method to two data sets, Associated Process (AP) and Statistics Literature Abstract (SLA), with encouraging results. In particular, there is a clear simplex structure associated with the SVD of the data matrices, which largely validates our discovery.

Date: March 31, 2017 10-11am

Venue: Room 526, TTIC. 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems involve time-delays, nonlinearity, and strong autocorrelations, which present major challenges for causal discovery techniques such as Granger causality leading to low detection power. Through large-scale numerical experiments and theoretical analyses we further identify strong detection biases of common causal discovery methods: The detection power for individual links may depend not only on their causal strength, but also on autocorrelation and other dependencies. Here we introduce a reliable method for large-scale linear and nonlinear causal discovery that provides more detection power than current methods and largely overcomes detection biases, allowing to more accurately rank associations in large-scale analyses by their causal strength. The method is demonstrated on a global surface pressure dataset representing atmospheric dynamics.

Title: Combinatorial Inference

Date: March 17, 2017 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Junwei Lin, Princeton University

Abstract: We propose the combinatorial inference to explore the topological structures of graphical models. The combinatorial inference can conduct the hypothesis tests on many graph properties including connectivity, hub detection, perfect matching, etc. In particular, our methods can be applied to any graph property which is invariant under the addition of edges. On the other side, we also propose a shortest self-returning path lemma to prove the general optimality of our testing procedures for various graph properties. The combinatorial inference is also generalized to the time-varying graphical models and we can infer the dynamic topological structures for graphs. Our methods are applied to the neuroscience by discovering hub voxels contributing to visual memories.

Date: March 10, 2017 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Stochastic gradient descent procedures have gained popularity for iterative parameter estimation from large datasets because they are simpler and faster than classical optimization procedures. However, their statistical properties are not well understood in theory. And in practice, avoiding numerical instability requires careful tuning of key parameters. In this talk, we will focus on implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates, and the amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed. Thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides a full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency, which enables principled statistical inference on large datasets. Part of ongoing work focuses on a crucial aspect of such inference with stochastic approximation procedures, which is to know when the procedure has reached the asymptotic regime.

Date: November 18, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Weiran Wang, TTIC

Abstract: We study the sample and time complexity of canonical correlation analysis (CCA) in the population setting. With mild assumptions on the data distribution, we show that in order to achieve $\epsilon$-suboptimality in a properly defined measure of alignment between the estimated canonical directions and the population solution, we can solve the empirical objective exactly with $O(\frac{1}{\epsilon^? \Delta^2 \gamma^2}$ samples, where $\Delta$ is the singular value gap of the whitened cross-covariance matrix and $\frac{1}{\gamma}$ is an upper bound of the condition numbers of the auto-covariance matrices. Moreover, we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with a stochastic optimization algorithm; this algorithm is based on the shift-and-invert power iterations and only needs to process the dataset for $O(log(1/\epsilon)$ passes. Finally, we show that, given an estimate of the canonical correlation, the streaming version of shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity.

This is joint work with Jialei Wang, Dan Garber, and Nathan Srebro.

Date: November 11, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Mark Hallen, TTIC

Abstract: Computational protein design algorithms search for modifications in protein systems that will optimize a desired function. Usually, this function involves chemical binding of a protein to another protein or to a drug. Binding is exquisitely sensitive to the 3-D geometry adopted by the system, with low-energy geometries preferred, so to optimize binding we must be able to optimize energy with respect to molecular geometry. Thus, our optimization—both over chemical modifications and over 3-D geometries—must take as input an “energy function,” which maps atomic coordinates to energy. In this work, we present methods to represent the energy function that are amenable to more efficient and more accurate protein design calculations than were previously possible. We pose the calculation of this representation as a machine learning problem. Two relatively simple models are found to be helpful: a continuous polynomial representation and a discrete expansion in local terms. These models open the possibility of modeling binding much more realistically in protein design calculations, while performing optimization efficiently.

Date: November 4, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Eva Dyer Northwestern University

Abstract: As the size and complexity of neural datasets continue to grow, there is an increasing need for scalable and black-box approaches to extract knowledge from these data. In the first half of this talk, I will discuss my recent work in developing methods to uncover the structure and function of neural circuits in large-scale neural recordings. My aim is to provide an introduction to modern neural datasets for a ML audience, and discuss some of the challenges that we now face in developing learning algorithms for these data. In the second half of the talk, I will discuss my recent work in global black-box optimization (at UAI 2016). The idea behind our approach is to learn a convex relaxation of a (possibly) non-convex function by estimating its convex envelope (i.e., tight convex lower bound). We show that our approach can lead to provable convergence to the global optimum for classes of smooth functions.

This is work with Mohammad Gheshlaghi Azar (Google DeepMind), Konrad Kording (Northwestern), and Bobby Kasthuri (University of Chicago, Argonne National Laboratory).

Date: October 28, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Abstract: Extracting knowledge and providing insights into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Networks are an example of simple, yet powerful tools for capturing relationships among entities over time. For example, in social media, networks represent connections between different individuals and the type of interaction that two individuals have. In systems biology, networks can represent the complex regulatory circuitry that controls cell behavior. Unfortunately the relationships between entities are not always observable and need to be inferred from nodal measurements.

I will present a line of work that deals with the estimation and inference in high-dimensional semi-parametric elliptical copula models. I will explain why these models are useful in exploring complex systems, how to efficiently estimate parameters of the model, and also provide theoretical guarantees that justify usage of the models in scenarios where more rigid Gaussian graphical models are commonly used.

Joint work with Rina Foygel Barber, Junwei Lu, and Han Liu.

Date: October 21, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Srinadh Bhojanapalli, TTIC

Abstract: In this work, we show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent from random initialization.

(Joint work with Behnam Neyshabur, Nathan Srebro)

Date: October 14, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Blake Woodworth, TTIC

Abstract: We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smooth- ing AGD that improve over methods using just gradient accesses.

Date: October 7, 2016 10-11am

Venue: Room 530, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Nathan Srebro, TTIC

Abstract: As machine learning increasingly replaces human judgment in decisions protected by anti discrimination law, the problem of algorithmicly measuring and ensuring fairness in machine learning is pressing. What does it mean for a predictor to not discriminate with respect to protected group (e.g. according to race, gender, etc)?

We propose a notion of non-discrimination that can be measured statistically, used algorithmicly, and avoids many of the pitfalls of previous definitions. We further study what type of discrimination and non-discrimination can be identified with oblivious tests, which treat the predictor as an opaque black-box, and what different oblivious tests tell us about possible discrimination.

(Joint work with Mortiz Hardt and Eric Price)

Date: September 30, 2016 10-11am

Venue: Room 526, 6045 S Kenwood Ave, Chicago, IL 60637

Speaker: Hao Tang, TTIC

Abstract: Recent work on discriminative segmental models has shown that they can achieve competitive speech recognition performance, using features based on deep neural frame classifiers. However, segmental models can be more challenging to train than standard frame-based approaches. While some segmental models have been successfully trained end-to-end, there is a lack of understanding of their training under different settings and with different losses.

We investigate a model class based on recent successful approaches, consisting of a linear model that combines segmental features based on an LSTM frame classifier. Similarly to hybrid HMM-neural network models, segmental models of this class can be trained in two stages (frame classifier training followed by linear segmental model weight training), end-toend (joint training of both frame classifier and linear weights), or with end-to-end fine-tuning after two-stage training.

In this work we present several findings. First, models in this class can be trained with various kinds of losses. We present segmental models trained end-to-end with hinge loss, log loss, latent hinge loss, and marginal log loss. We compare several losses for the case where training alignments are available as well as where they are not. We find that in general, marginal log loss provides the most consistent strong performance without requiring ground-truth alignments. We also find that training with dropout is very important in obtaining good performance with end-to-end training. Finally, the best results are often obtained by a combination of two-stage training and fine-tuning.