Given by Param Vir Singh at 2019 INFORMS Annual Meeting in Seattle, WA.
Deep learning models have succeeded at a variety of human intelligence tasks and are already being used at commercial scale. These models largely rely on standard gradient-descent optimization of parameters W, which map an input X to an output ŷ = f(X; W). The optimization procedure minimizes the loss (difference) between the model output ŷ and the actual output y. As an example, in a cancer detection setting, X is an MRI image, while y is the presence or absence of cancer. Three key ingredients hint at the reason behind deep learning's power: (1) deep architectures that break complex functions down into a composition of simpler abstract parts; (2) standard gradient descent methods that attain local minima of a nonconvex loss Loss(y, ŷ) that are close enough to the global minimum; and (3) learning algorithms that can be executed on parallel computing hardware (e.g., GPUs), making optimization viable over hundreds of millions of observations (X, y). Computer vision tasks, where the input X is a high-dimensional image or video, are particularly well suited to deep learning. Recent advances in deep architectures, e.g., inception modules, attention networks, adversarial networks, and deep reinforcement learning, have opened up completely new applications that were previously unexplored. However, the breakneck progress in replacing human tasks with deep learning comes with caveats. These deep models tend to evade interpretation, lack causal relationships between input X and output y, and may inadvertently mimic not just human actions but also human biases and stereotypes. In this tutorial, we provide an intuitive explanation of deep learning methods in computer vision as well as their limitations in practice.
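The setup ŷ = f(X; W), with gradient descent minimizing the loss between ŷ and y, can be sketched in a few lines. The sketch below is illustrative only and is not taken from the tutorial: it fits a hypothetical linear model (rather than a deep network) to synthetic data by repeatedly stepping W against the gradient of a mean squared loss; all names and values are assumptions made for the example.

```python
# Minimal sketch of gradient-descent parameter optimization.
# Hypothetical example: a linear model y_hat = f(X; W) = X @ W fit to
# synthetic data by minimizing the mean squared loss between y_hat and y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # inputs (100 observations, 3 features)
true_W = np.array([2.0, -1.0, 0.5])  # parameters that generated the data
y = X @ true_W                       # actual outputs

W = np.zeros(3)                      # parameters to learn
lr = 0.1                             # learning rate
for _ in range(200):
    y_hat = X @ W                             # model output f(X; W)
    grad = 2 * X.T @ (y_hat - y) / len(y)     # gradient of the mean squared loss
    W -= lr * grad                            # gradient-descent step

loss = np.mean((X @ W - y) ** 2)     # Loss(y, y_hat) after training
```

For a deep model, f would be a composition of many such parameterized layers and the loss surface would be nonconvex, but the update rule, step W in the direction that decreases the loss, is the same.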