RandNet: Deep Learning with Compressed Measurements of Images

Abstract

Principal component analysis, dictionary learning, and auto-encoders are all unsupervised methods for learning representations from a large amount of training data. In all these methods, the higher the dimensions of the input data, the longer it takes to learn. We introduce a class of neural networks, termed RandNet, for learning representations using compressed random measurements of data of interest, such as images. RandNet extends the convolutional recurrent sparse auto-encoder architecture to dense networks and, more importantly, to the case when the input data are compressed random measurements of the original data. Compressing the input data makes it possible to fit a larger number of batches in memory during training. Moreover, in the case of sparse measurements, training is more efficient computationally. We demonstrate that, in unsupervised settings, RandNet performs dictionary learning using compressed data. In supervised settings, we show that RandNet can classify MNIST images with minimal loss in accuracy, despite being trained with random projections of the images that result in a $50\%$ reduction in size. Overall, our results provide a general principled framework for training neural networks using compressed data.

Index Terms— Random Neural Networks, Dictionary Learning, Sparse Representation, Compressed Classification.

1 Introduction

Representation learning has become an important problem in recent years both in the signal processing and machine learning communities. In signal processing, dictionary learning (DL) [1] is the de facto method for learning adaptive data representations. In machine learning, deep learning is the method of choice to learn representations that are adapted to data. In several tasks such as image classification, the early success of DL, and more recently deep learning, can be attributed largely to their ability to learn representations tailored to the task of interest.

Traditionally, DL has been restricted to the shallow case, where only one set of weights is learned from data. The works in [2, 3] have shown a one-to-one correspondence between DL and neural networks (NNs) and, more specifically, how to train an auto-encoder to perform unsupervised DL. Recent work [4, 5] has shown the connections between deep DL and deep learning, emphasizing a new perspective on deep NNs as efficient algorithms for solving deep DL problems. The primary advantages of using deep NNs for DL are the widespread GPU-based infrastructure that supports them (e.g. TensorFlow), and the ease with which they can be deployed. However, in a number of applications of both DL and NNs, such as video processing, the size of datasets required for learning is a computational bottleneck.

This motivates the need for a framework that can learn from reduced-size data. In [6, 7], the authors propose two different methods for learning representations from random projections of data. The first, [6], performs PCA using compressed measurements. The second, called CK-SVD [7], is an optimization-based alternating-minimization algorithm [1] for DL from sparse random measurements of data. The benefits of these approaches are memory and computational efficiency [6], and the ability to train using reduced-size data [7].

This paper brings together the idea of training NNs for DL, and that of using compressed random measurements of data to solve DL tasks. Specifically, we introduce a framework to train NNs to perform unsupervised DL or a supervised task, e.g. classification, using only compressed projections of the data. The resulting class of architectures, which we call RandNet, is more efficient in terms of memory and computation than conventional ones. In the unsupervised setting, we demonstrate that RandNet performs DL. The architecture is a variant of the constrained recurrent sparse auto-encoder (CRsAE) [3] in which the encoder uses randomly projected data to obtain a sparse representation. For supervised tasks, RandNet uses the sparse representation produced at the output of the encoder as features. We demonstrate that the performance of RandNet in the classification of MNIST handwritten digits [8] comes close to the state of the art, even though the network is trained on compressed data.

The remainder of the paper is organized as follows. We introduce the DL problem in Sec. 2. We introduce RandNet in Sec. 3, and apply it to DL and classification of MNIST digits in Sec. 4. We conclude in Sec. 5.

2 Dictionary Learning

Fig. 1: RandNet architecture, where $g(\cdot)=(\mathbf{I}-\frac{1}{L}\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}\mathbf{\Phi}\mathbf{A})(\mathbf{x}_{t}+\frac{s_{t}-1}{s_{t+1}}(\mathbf{x}_{t}-\mathbf{x}_{t-1}))$ and $\mathbf{r}=\mathbf{\Phi}\mathbf{y}$. The unsupervised block is a CRsAE architecture (Sec. 2.2) when $\mathbf{\Phi}=\mathbf{I}$.

Let $\mathbf{x}^{j}\in{\mathbb{R}}^{p}$ be a sparse vector and $\mathbf{y}^{j}\in{\mathbb{R}}^{N}$ be the vector obtained as the sum of the sparse linear combination of columns of a dictionary $\mathbf{A}\in{\mathbb{R}}^{N\times p}$ and additive noise

\mathbf{y}^{j}=\mathbf{A}\mathbf{x}^{j}+\mathbf{v}^{j},\quad j=1,\cdots,J,

(1)

where $J$ is the total number of vectors and $\mathbf{v}^{j}$ is i.i.d. noise. In what follows, we refer to each $\mathbf{y}^{j}$ as an example, each $\mathbf{x}^{j}$ as a sparse code, and assume the entries of $\mathbf{v}^{j}$ have variance $\sigma^{2}$. The goal of DL is to learn a dictionary $\mathbf{A}$ such that each example $\mathbf{y}^{j}$ can be approximated as the sparse linear combination of columns from $\mathbf{A}$ using the sparse vector $\mathbf{x}^{j}$. From an optimization perspective, the DL problem aims to solve

\min_{\begin{subarray}{c}(\mathbf{x}^{j})_{j=1}^{J}\\ \mathbf{A}\end{subarray}}\ \sum_{j=1}^{J}\frac{1}{2}\|\mathbf{y}^{j}-\mathbf{A}\mathbf{x}^{j}\|^{2}+\lambda\|\mathbf{x}^{j}\|_{1}\ \text{ s.t. }\|\mathbf{a}_{i}\|_{2}=1

(2)

for $i=1,\cdots,p$ , where $\lambda>0$ is a sparsity enforcing parameter that depends on the statistics of the noise $\mathbf{v}^{j}$ . The vector $\mathbf{a}_{i}$ denotes the $i^{th}$ column of the matrix $\mathbf{A}$ , and the norm constraint is to avoid scaling ambiguity. The objective is jointly non-convex in $\mathbf{A}$ and $\{\mathbf{x}^{j}\}_{j=1}^{J}$ .

2.1 Classical DL

A classical method to circumvent the non-convexity in Eq. 2 is the alternating-minimization algorithm [1], which alternates between a sparse coding step and a dictionary update step. Given an estimate of the dictionary, the sparse coding step estimates each sparse code $\mathbf{x}^{j}$ by solving

\min_{\mathbf{x}^{j}}\ \frac{1}{2}\|\mathbf{y}^{j}-\mathbf{A}\mathbf{x}^{j}\|^{2}+\lambda\|\mathbf{x}^{j}\|_{1}.

(3)

The dictionary update step uses the newly-estimated codes to update the dictionary as the solution of

\min_{\mathbf{A}}\ \sum_{j=1}^{J}\frac{1}{2}\|\mathbf{y}^{j}-\mathbf{A}\mathbf{x}^{j}\|^{2}\ \text{ s.t. }\|\mathbf{a}_{i}\|_{2}=1.

(4)
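The alternation between Eq. 3 and Eq. 4 can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the algorithm analyzed in [1]: the sparse coding step runs a fixed number of ISTA iterations, and the dictionary update is an unconstrained least-squares fit followed by column normalization, which only approximates the constrained problem in Eq. 4. The function names and default iteration counts are hypothetical.

```python
import numpy as np

def soft_threshold(z, eps):
    """Element-wise shrinkage: (|z| - eps)_+ * sgn(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - eps, 0.0)

def sparse_coding(Y, A, lam, n_iter=100):
    """Approximately solve Eq. 3 for every column of Y with ISTA."""
    L = np.linalg.norm(A, 2) ** 2  # largest eigenvalue of A^T A
    X = np.zeros((A.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        X = soft_threshold(X + A.T @ (Y - A @ X) / L, lam / L)
    return X

def dictionary_update(Y, X):
    """Least-squares fit of A, then column renormalization (heuristic for Eq. 4)."""
    A = Y @ np.linalg.pinv(X)
    norms = np.linalg.norm(A, axis=0)
    return A / np.maximum(norms, 1e-12)

def alternating_minimization(Y, A0, lam, n_outer=10):
    """Alternate between the sparse coding and dictionary update steps."""
    A = A0.copy()
    for _ in range(n_outer):
        X = sparse_coding(Y, A, lam)
        A = dictionary_update(Y, X)
    return A, X
```

Here `Y` holds one example per column and `A0` is an initial dictionary with unit-norm columns. The returned codes `X` are those computed before the final dictionary update.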

2.2 Constrained Auto-Encoders

A recent line of work [2, 3] has developed an auto-encoder, named CRsAE, to solve Eq. 2 when $\mathbf{A}$ is a convolutional operator. The encoder in CRsAE imitates the sparse coding step: it uses $\mathbf{A}$ , $\mathbf{A}^{\text{T}}$ , and the ReLU activation function to map the data to a sparse code. The decoder is linear and uses $\mathbf{A}$ , constrained to the matrix used in the encoder, to map the output of the encoder to a reconstruction of the data. To learn $\mathbf{A}$ , CRsAE minimizes the squared-error reconstruction loss by backpropagation. The unsupervised block of Fig. 1, when the matrix $\mathbf{\Phi}$ is the identity, illustrates the CRsAE architecture. We extend the work in [2, 3] and show, for the first time, that we can train CRsAE to learn a dense dictionary. In addition, through a connection with compressive DL [7, 9], we show that we can train a modified CRsAE, termed RandNet, to perform DL using randomly-compressed versions of the data $\mathbf{y}^{j}$ .

2.3 DL from Compressed Data

Compressive DL is the problem of learning a dictionary from compressed data, where we assume the compressed data are obtained by projecting the original data into a lower dimensional subspace [7, 9]. Compressive DL [9] uses a version of alternating-minimization based on the KSVD algorithm and an $\ell_{0}$ constraint to enforce sparsity. Using the $\ell_{1}$ -norm, the problem becomes

\min_{\begin{subarray}{c}(\mathbf{x}^{j})_{j=1}^{J}\\ \mathbf{A}\end{subarray}}\ \sum_{j=1}^{J}\frac{1}{2}\|\mathbf{r}^{j}-\mathbf{\Phi}\mathbf{A}\mathbf{x}^{j}\|^{2}+\lambda\|\mathbf{x}^{j}\|_{1}\ \text{ s.t. }\|\mathbf{a}_{i}\|_{2}=1,

(5)

where $\mathbf{r}^{j}=\mathbf{\Phi}\mathbf{y}^{j}$ is the compressed version (random projection) of the example $\mathbf{y}^{j}$ and $\mathbf{\Phi}\in{\mathbb{R}}^{M\times N}$ is a known random measurement matrix such that $M<N$ . The work in [9] shows that it is indeed possible to learn the dictionary $\mathbf{A}$ when $M<N$ and $\mathbf{\Phi}$ is a random Gaussian matrix. For the rest of the paper, we drop the superscript $j$ to simplify notation.
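To make the notation of Eq. 5 concrete, the sketch below forms compressed examples $\mathbf{r}^{j}=\mathbf{\Phi}\mathbf{y}^{j}$ with a Gaussian $\mathbf{\Phi}$ and evaluates the objective of Eq. 5 for given codes. The toy sizes, the $1/\sqrt{M}$ scaling of $\mathbf{\Phi}$, and the function name are assumptions made for illustration only.

```python
import numpy as np

def compressive_objective(R, Phi, A, X, lam):
    """Objective of Eq. 5; columns of R and X are r^j and x^j."""
    residual = R - Phi @ A @ X
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(X))

rng = np.random.default_rng(0)
N, p, M, J = 100, 20, 50, 8           # toy sizes with M < N

A = rng.standard_normal((N, p))
A /= np.linalg.norm(A, axis=0)        # unit-norm columns, as in the constraint

X = np.zeros((p, J))
X[rng.integers(0, p, size=J), np.arange(J)] = 1.0  # one non-zero per code

Y = A @ X                             # noiseless examples (Eq. 1 with v = 0)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # assumed scaling
R = Phi @ Y                           # compressed examples, one per column
```

With the true dictionary and codes the residual vanishes, so only the $\ell_{1}$ penalty contributes to the objective.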

3 RandNet

We introduce a class of NNs, which we call RandNet, for learning representations from compressed data. First, we introduce the architecture in unsupervised and supervised settings. In unsupervised settings, RandNet is a variant of the CRsAE architecture (Sec. 2.2) to solve Eq. 5. In supervised cases, RandNet uses the output of the encoder as features for the supervised task. Then, we explain why RandNet is memory and computationally efficient, both when the measurement matrix is Gaussian and when it is row sparse.

3.1 Unsupervised NNs with Compressed Inputs

RandNet in unsupervised settings is an auto-encoder. Given $\mathbf{r}$ , $\mathbf{\Phi}$ and $\mathbf{A}$ , the encoder solves the sparse coding step in Eq. 5. The encoder implements the FISTA algorithm [10], which generates a sequence $\mathbf{x}_{t}$ , indexed by iteration number $t$ , that converges to a sparse code $\mathbf{x}_{T}$ after $T$ iterations. Similar to [2], we define a state vector $\mathbf{z}_{t}=\begin{bmatrix}\mathbf{z}_{t}^{(1)}\ \mathbf{z}_{t}^{(2)}\end{bmatrix}^{\text{T}}=\begin{bmatrix}\mathbf{x}_{t}\quad\mathbf{x}_{t-1}\end{bmatrix}^{\text{T}}$ . Algorithm 1 details the steps of the encoder, where the two-sided ReLU non-linearity $\eta_{\epsilon}:{\mathbb{R}}^{p}\to{\mathbb{R}}^{p}$ is defined element-wise as $(\eta_{\epsilon}(\mathbf{z}))_{n}=(|z_{n}|-\epsilon)_{+}\textrm{sgn}(z_{n})$ . The decoder applies $\mathbf{A}$ and then $\mathbf{\Phi}$ to $\mathbf{x}_{T}$ to obtain $\mathbf{c}_{T+1}=\hat{\mathbf{r}}=\mathbf{\Phi}\mathbf{A}\mathbf{x}_{T}$ . Similar to CRsAE, the parameters of the encoder and decoder are tied. The unsupervised block in Fig. 1 shows this forward pass. The loss function associated with the architecture is $\mathcal{L}_{\mathbf{A}}({\mathbf{r},\hat{\mathbf{r}})}={\textstyle\frac{1}{2}}\|\mathbf{r}-\hat{\mathbf{r}}\|_{2}^{2}$ .

Input: $\mathbf{r},\mathbf{A},\mathbf{\Phi},\lambda,L\geq\sigma_{\text{max}}(\mathbf{A}^{\text{T}}\mathbf{A})$

Output: $\mathbf{x}_{T}$

1: $\mathbf{z}_{0}=\mathbf{0},s_{0}=0$

2: for $t=1$ to $T$ do

3: $s_{t}=\frac{1+\sqrt{1+4s_{t-1}^{2}}}{2}$

4: $\mathbf{w}_{t}=\begin{bmatrix}\left(1+\frac{s_{t-1}-1}{s_{t}}\right)\mathbf{I}_{p}\ |\ -\frac{s_{t-1}-1}{s_{t}}\mathbf{I}_{p}\end{bmatrix}\mathbf{z}_{t-1}$

5: $\mathbf{c}_{t}=\mathbf{w}_{t}+\frac{1}{L}\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}(\mathbf{r}-\mathbf{\Phi}\mathbf{A}\mathbf{w}_{t})$

6: $\mathbf{z}_{t}=\begin{bmatrix}\mathbf{x}_{t}\quad\mathbf{x}_{t-1}\end{bmatrix}^{\text{T}}=\begin{bmatrix}\eta_{\frac{\lambda}{L}}(\mathbf{c}_{t})\quad\mathbf{z}_{t-1}^{(1)}\end{bmatrix}^{\text{T}}$

end for

Algorithm 1 Encoder of RandNet.
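The steps of Algorithm 1 can be sketched in NumPy for a single compressed example. This is an illustrative forward pass only (the paper's implementation is in PyTorch), and the function names are hypothetical. The state $\mathbf{z}_{t}$ is kept as the pair `(x, x_prev)` rather than as a stacked vector.

```python
import numpy as np

def two_sided_relu(z, eps):
    """eta_eps(z) = (|z| - eps)_+ * sgn(z), applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - eps, 0.0)

def randnet_encoder(r, A, Phi, lam, L, T):
    """Run T FISTA iterations on the sparse coding step of Eq. 5."""
    PA = Phi @ A                       # effective dictionary Phi A
    p = A.shape[1]
    x = np.zeros(p)                    # x_{t-1}, first half of z_{t-1}
    x_prev = np.zeros(p)               # x_{t-2}, second half of z_{t-1}
    s_prev = 0.0                       # s_0 = 0
    for _ in range(T):
        s = (1.0 + np.sqrt(1.0 + 4.0 * s_prev ** 2)) / 2.0
        mom = (s_prev - 1.0) / s
        w = (1.0 + mom) * x - mom * x_prev           # line 4
        c = w + PA.T @ (r - PA @ w) / L              # line 5
        x_prev, x = x, two_sided_relu(c, lam / L)    # line 6
        s_prev = s
    return x
```

The decoder output is then `Phi @ A @ x`. The sketch assumes `L` is at least the largest eigenvalue of $\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}\mathbf{\Phi}\mathbf{A}$, as required in Sec. 4.2.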

3.2 Supervised NNs with Compressed Inputs

In supervised settings, e.g. classification, we extend RandNet following a similar approach to the architecture in [11]. Specifically, we show how a network should be designed for a classification task to be trained on compressed data. For $K$ -class classification, we define a $K$ -dimensional label vector $\mathbf{u}\in\{0,1\}^{K}$ such that for each class $c$

u_{k}=\begin{cases}1,&\text{if $k=c$};\\ 0,&\text{otherwise}.\end{cases}

(6)

The goal is to learn a mapping from the sparse representation $\mathbf{x}_{T}$ of the data produced at the output of the encoder (see Sec. 3.1) to an estimated vector of probabilities $\hat{\mathbf{u}}=\frac{e^{\mathbf{q}}}{\sum_{i}e^{\mathbf{q}_{i}}}$ , where $\mathbf{q}=\mathbf{C}\mathbf{x}_{T}+\mathbf{d}$ . The supervised block in Fig. 1 shows the forward pass of the resulting architecture. We define the categorical cross-entropy loss

\mathcal{L}_{\mathbf{C},\mathbf{d}}{(\mathbf{x}_{T},\mathbf{u},\mathbf{C},\mathbf{d})}=-\mathbf{u}^{\text{T}}\log{\left(\frac{e^{\mathbf{C}\mathbf{x}_{T}+\mathbf{d}}}{\sum_{i}e^{(\mathbf{C}\mathbf{x}_{T}+\mathbf{d})_{i}}}\right)}.

(7)
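The classifier head and the loss of Eq. 7 amount to a softmax layer on top of the sparse code. A minimal NumPy sketch for one example follows. The function names are hypothetical, and the max-shift inside the softmax is a standard numerical-stability device that leaves the output unchanged.

```python
import numpy as np

def softmax(q):
    """Softmax of a vector q, shifted by max(q) for numerical stability."""
    e = np.exp(q - np.max(q))
    return e / np.sum(e)

def classify(x_T, C, d):
    """Estimated class probabilities u_hat from the sparse code x_T."""
    return softmax(C @ x_T + d)

def cross_entropy(u, u_hat):
    """Categorical cross-entropy of Eq. 7 for a one-hot label vector u."""
    return -float(u @ np.log(u_hat))
```

For a one-hot `u`, the loss reduces to the negative log-probability assigned to the true class.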

3.3 RandNet Backpropagation Algorithm

In the unsupervised case, we minimize $\mathcal{L}_{\mathbf{A}}({\mathbf{r},\hat{\mathbf{r}})}$ with the constraint $\|\mathbf{a}_{i}\|_{2}=1$ to avoid scaling ambiguity between the dictionary and sparse codes. For the supervised case, we proceed in two stages to optimize the parameters of interest. First, we train the unsupervised network to learn $\mathbf{x}_{T}$ that approximates the compressed data when multiplied by $\mathbf{\Phi}\mathbf{A}$ . Then, given $\mathbf{x}_{T}$ , we learn the matrix $\mathbf{C}$ and the bias $\mathbf{d}$ that minimize the loss $\mathcal{L}_{\mathbf{C},\mathbf{d}}{(\mathbf{x}_{T},\mathbf{u},\mathbf{C},\mathbf{d})}$ .

Algorithm 2 is the backpropagation algorithm for computing the gradient, denoted $\delta\cdot$ , of the loss functions with respect to the parameters of interest. The vectors $\mathbf{a}=[\mathbf{a}_{1}^{\text{T}}\ \mathbf{a}_{2}^{\text{T}}\ \cdots\mathbf{a}_{p}^{\text{T}}]^{\text{T}}$ and $\tilde{\mathbf{c}}=[\mathbf{c}_{1}^{\text{T}}\mathbf{c}_{2}^{\text{T}}\ \cdots\mathbf{c}_{K}^{\text{T}}]^{\text{T}}$ are the vectors of stacked columns from $\mathbf{A}$ and $\mathbf{C}$ , respectively. In practice, we compute the gradients through PyTorch’s autograd. We omit the derivation.

Input: $\mathbf{r},\mathbf{u},\lambda,L,\mathbf{A},\mathbf{\Phi},\mathbf{C},\mathbf{d}$, Variables $s_{t}$, $\mathbf{w}_{t}$, $\mathbf{c}_{t}$, $\mathbf{z}_{T}$, $\mathbf{x}_{T}$ from RandNet encoder, and $\hat{\mathbf{r}},\mathbf{c}_{T+1}$

Output: $\delta\mathbf{a},\delta\tilde{\mathbf{c}},\delta\mathbf{d}$

$\delta\hat{\mathbf{r}}=\hat{\mathbf{r}}-\mathbf{r},\ (\delta\hat{\mathbf{u}})_{k}=-\frac{u_{k}}{\hat{u}_{k}},\ \delta\mathbf{a}=\mathbf{0}_{Np}$

$\delta\mathbf{q}=\frac{\partial\hat{\mathbf{u}}}{\partial\mathbf{q}}\delta\hat{\mathbf{u}}$

$\delta\mathbf{d}=\frac{\partial\mathbf{q}}{\partial\mathbf{d}}\delta\mathbf{q}$

$\delta\tilde{\mathbf{c}}=\frac{\partial\mathbf{q}}{\partial\tilde{\mathbf{c}}}\delta\mathbf{q}$

$\delta\mathbf{c}_{T+1}=\frac{\partial\hat{\mathbf{r}}}{\partial\mathbf{c}_{T+1}}\delta\hat{\mathbf{r}}$

$\delta\mathbf{a}=\delta\mathbf{a}+\frac{\partial\mathbf{c}_{T+1}}{\partial\mathbf{a}}\delta\mathbf{c}_{T+1}$

$\delta\mathbf{z}_{T}=\frac{\partial\mathbf{c}_{T+1}}{\partial\mathbf{z}_{T}}\delta\mathbf{c}_{T+1}$

for $t=T$ to $1$ do

$\delta\mathbf{c}_{t}=\frac{\partial\mathbf{z}_{t}}{\partial\mathbf{c}_{t}}\delta\mathbf{z}_{t}$

$\delta\mathbf{a}=\delta\mathbf{a}+\frac{\partial\mathbf{c}_{t}}{\partial\mathbf{a}}\delta\mathbf{c}_{t}$

$\delta\mathbf{z}_{t-1}=\frac{\partial\mathbf{c}_{t}}{\partial\mathbf{z}_{t-1}}\delta\mathbf{c}_{t}$

end for

Algorithm 2 RandNet backprop for $\mathbf{A}$, $\mathbf{C}$, and $\mathbf{d}$.
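As a worked example, the classifier-side steps of Algorithm 2 can be combined in closed form. The derivation below is a sketch that uses only the softmax definition from Sec. 3.2 and the fact that the one-hot label satisfies $\sum_{k}u_{k}=1$. The indicator $\mathbb{1}[\cdot]$ is notation introduced here to avoid a clash with the gradient symbol $\delta$.

```latex
% Softmax Jacobian
\frac{\partial \hat{u}_{k}}{\partial q_{i}}
  = \hat{u}_{k}\left(\mathbb{1}[k=i]-\hat{u}_{i}\right)

% Chain rule with (\delta\hat{\mathbf{u}})_{k} = -u_{k}/\hat{u}_{k}
(\delta\mathbf{q})_{i}
  = \sum_{k=1}^{K}\frac{\partial \hat{u}_{k}}{\partial q_{i}}\,(\delta\hat{\mathbf{u}})_{k}
  = -\sum_{k=1}^{K}u_{k}\left(\mathbb{1}[k=i]-\hat{u}_{i}\right)
  = \hat{u}_{i}\sum_{k=1}^{K}u_{k}-u_{i}
  = \hat{u}_{i}-u_{i}

% Since \mathbf{q}=\mathbf{C}\mathbf{x}_{T}+\mathbf{d}
\delta\mathbf{d}=\delta\mathbf{q}=\hat{\mathbf{u}}-\mathbf{u},
\qquad
\frac{\partial\mathcal{L}_{\mathbf{C},\mathbf{d}}}{\partial\mathbf{C}}
  =(\hat{\mathbf{u}}-\mathbf{u})\,\mathbf{x}_{T}^{\text{T}}
```

The gradient with respect to the classifier parameters therefore depends on the encoder only through $\mathbf{x}_{T}$, which is consistent with the two-stage training described above.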

Cost	Operator	(a) Gaussian	(b) Sparse	(c) Identity
Memory Storage		$O(\beta N)$	$\cdot$	$O(N)$
Memory Access		$\cdot$	$O(\gamma N)$	$O(N)$
Matrix Operation	$\mathbf{\Phi}^{\text{T}}$	$O(MN)$	$O(\gamma N)$	$\cdot$
Matrix Operation	$\mathbf{A}^{\text{T}}$	$O(pN)$	$O(\gamma pN)$	$O(pN)$

Table 1: Memory and computational efficiency of RandNet when using (a) Gaussian measurements, (b) sparse random measurements to compress images, and (c) no compression.

Benefits of random projections: One advantage of RandNet compared to classical NNs is that a larger amount of data can fit in GPU memory, because the network is trained with the compressed data. Let $\beta\overset{\Delta}{=}\frac{M}{N}<1$ denote the measurement ratio [9]. The memory storage cost of RandNet is $O(\beta N)$, as opposed to $O(N)$ for a network without compression. For Gaussian $\mathbf{\Phi}$, similar to [9], RandNet takes advantage of this reduction in memory storage cost by storing the compressed data instead of the original data. The benefits of RandNet are even more pronounced when $\mathbf{\Phi}$ is row sparse. In this case, we assume each row of $\mathbf{\Phi}$ is $s$-sparse ($s\geq 1$) with non-zero entries chosen uniformly at random and set to $\{-1,+1\}$ with equal probability. Let $\gamma\overset{\Delta}{=}\beta s<1$ denote the compression factor [6, 7]. The cost for the GPU of accessing the data during training is $O(\gamma N)$, compared to the cost $O(N)$ of accessing the data in its original dimension [6]. Therefore, in the sparse case and for small enough $\gamma$, the cost of accessing the data is even lower than storing and accessing it in compressed form. In addition to storage/access benefits, a row-sparse $\mathbf{\Phi}$ makes RandNet computationally efficient. In RandNet, the cost of applying $\mathbf{A}^{\text{T}}$ is $O(\gamma pN)$ compared to $O(pN)$ without compression. This is because $\mathbf{A}^{\text{T}}$ operates on the sparse vector produced by the column-sparse operator $\mathbf{\Phi}^{\text{T}}$ (Algorithm 1, line $5$). Table 1 summarizes these benefits.
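The computational saving for row-sparse $\mathbf{\Phi}$ can be illustrated directly. The vector $\mathbf{\Phi}^{\text{T}}\mathbf{v}$ has at most $sM=\gamma N$ non-zero entries, so $\mathbf{A}^{\text{T}}$ only needs the corresponding rows of $\mathbf{A}$. The sketch below is a dense NumPy illustration with hypothetical function names. It assumes the $s$ non-zero positions in each row are drawn without replacement, and a practical implementation would use a sparse matrix format.

```python
import numpy as np

def row_sparse_phi(M, N, s, rng):
    """M x N matrix with s non-zeros per row, each -1 or +1 with equal probability."""
    Phi = np.zeros((M, N))
    for m in range(M):
        cols = rng.choice(N, size=s, replace=False)
        Phi[m, cols] = rng.choice([-1.0, 1.0], size=s)
    return Phi

def apply_At_sparse(A, v):
    """Compute A^T v using only the rows of A where v is non-zero."""
    support = np.flatnonzero(v)
    return A[support].T @ v[support]
```

With $N=500$, $M=250$, and $s=1$, for example, `apply_At_sparse` touches at most $250$ of the $500$ rows of $\mathbf{A}$, in line with the factor $\gamma=0.5$.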

Shared projection operators: In practice, similar to [7], we divide the data into $B$ blocks. Examples with indices $j\in\left\{\frac{J}{B}(b-1)+1,\cdots,b\frac{J}{B}\right\}$ belong to block $b$ and share the measurement matrix $\mathbf{\Phi}_{b}$, $b=1,\cdots,B$. We denote an example from block $b$ as $\mathbf{y}^{b}$ and its projection as $\mathbf{r}^{b}$. We note that we do not need to store the $\mathbf{\Phi}_{b}$ matrices. Instead, we store the fixed random seed associated with each matrix. For the rest of the paper, we drop the index $b$ to simplify notation.
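The block assignment and the seed-based regeneration of $\mathbf{\Phi}_{b}$ can be sketched as follows. The function names, the choice of NumPy's `default_rng` as the generator, and the $1/\sqrt{M}$ scaling are assumptions for illustration, and the sketch assumes $B$ divides $J$.

```python
import numpy as np

def block_index(j, J, B):
    """Block b in {1, ..., B} containing example j in {1, ..., J}."""
    return (j - 1) // (J // B) + 1

def phi_from_seed(seed, M, N):
    """Regenerate a Gaussian measurement matrix from its stored seed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((M, N)) / np.sqrt(M)  # assumed scaling
```

Because the generator is deterministic given the seed, storing one integer per block is enough to reproduce every $\mathbf{\Phi}_{b}$ on demand.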

4 Experiments

We train RandNet on a simulated dataset and on MNIST. In the simulated case, we train a) a CRsAE architecture to learn the dictionary underlying data simulated according to Eq. 1, and b) an unsupervised RandNet architecture to learn the same dictionary from random projections of the data when $\mathbf{\Phi}$ is Gaussian and also when it is row sparse. As a benchmark, we implement the CK-SVD algorithm [7]. We use MNIST to demonstrate the use of RandNet in a classification task with randomly-projected data (digits) as inputs.

4.1 Datasets

4.1.1 Simulation

We simulated a dataset of $J=4{,}250$ examples $\mathbf{y}^{j}\in{\mathbb{R}}^{500}$ from the generative model of Eq. 1 with $\mathbf{v}^{j}=0$. The elements of $\mathbf{A}\in{\mathbb{R}}^{500\times 20}$ are i.i.d. and drawn according to a $\mathcal{N}(0,\frac{1}{500})$ distribution, followed by the normalization of each column to have unit length. Each sparse code $\mathbf{x}^{j}$ is $3$-sparse with i.i.d. nonzero entries following a Uniform distribution on the interval $[-5,-4]\cup[4,5]$. We generated $B=40$ measurement matrices $\{\mathbf{\Phi}_{b}\in{\mathbb{R}}^{M\times 500}\}_{b=1}^{B}$. We used either Gaussian matrices or sparse ones with $1$-sparse rows $(s=1)$. To study the effects of compression, we considered $M=250,150,50$, i.e. $\beta=0.5,0.3,0.1$, respectively. The corresponding compression factors are $\gamma=0.5,0.3,0.1$, respectively. For Gaussian $\mathbf{\Phi}$, RandNet reduces the cost of storage by a factor of $\beta$. For row-sparse $\mathbf{\Phi}$, both the memory access cost and operation cost of $\mathbf{A}^{\text{T}}$ are reduced by a factor of $\gamma$.
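The simulated dataset described above can be generated along the following lines. This is a sketch, not the authors' script: the function name and seed are hypothetical, and the support of each sparse code is assumed to be drawn uniformly at random, which the text does not specify.

```python
import numpy as np

def simulate(J=4250, N=500, p=20, k=3, seed=0):
    """Noiseless examples from Eq. 1 with k-sparse codes."""
    rng = np.random.default_rng(seed)
    # Dictionary with i.i.d. N(0, 1/N) entries, then unit-norm columns.
    A = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, p))
    A /= np.linalg.norm(A, axis=0)
    X = np.zeros((p, J))
    for j in range(J):
        support = rng.choice(p, size=k, replace=False)  # assumed uniform support
        magnitude = rng.uniform(4.0, 5.0, size=k)
        sign = rng.choice([-1.0, 1.0], size=k)
        X[support, j] = sign * magnitude                # uniform on [-5,-4] U [4,5]
    Y = A @ X                                           # v^j = 0
    return A, X, Y
```

Each column of `Y` is one example $\mathbf{y}^{j}\in{\mathbb{R}}^{500}$, which is then compressed by multiplying with the measurement matrix of its block.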

4.1.2 MNIST

We considered the MNIST dataset comprising $70{,}000$ $28\times 28$ grayscale handwritten digits, split into a training set with $60{,}000$ images and a test set with $10{,}000$ images. We vectorized the images, resulting in $\{\mathbf{y}^{j}\in{\mathbb{R}}^{784}\}_{j=1}^{J}$ , $J=70{,}000$ . We generated $B=1{,}000$ random measurement matrices $\{\mathbf{\Phi}_{b}\in{\mathbb{R}}^{M\times 784}\}_{b=1}^{B}$ , for $M=392$ ( $\beta=\gamma=0.5$ ). The goal is to learn a dictionary $\mathbf{A}\in{\mathbb{R}}^{784\times 784}$ that yields a $784$ -dimensional sparse representation of the images that is useful for classifying each image into one of $K=10$ categories. Hence, the classification matrix $\mathbf{C}\in{\mathbb{R}}^{10\times 784}$ .

4.2 Training

We implemented RandNet in PyTorch and trained it by backpropagation on a GPU (Tesla V100-SXM2), using mini-batch gradient descent with the Adam optimizer.

4.2.1 Simulation

We divided the data into a training set of size $4{,}000$ and a test set of size $250$. The number of trainable parameters is $500\times 20=10{,}000$. We set the number of FISTA iterations to $T=400$. A large number of iterations is crucial for producing sparse codes, which facilitates DL [1, 3]. We set $\lambda=\sigma\sqrt{2\log{p}}$ [12] and tuned $\sigma$ by grid search over the interval $[0.01,0.2]$. We set $L=5$, $12$, and $2$ for CRsAE, Gaussian, and sparse RandNet, respectively. We picked each of these values so that it is greater than the maximum eigenvalue of $\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}\mathbf{\Phi}\mathbf{A}$ [10], where $\mathbf{\Phi}$ depends on the architecture. We used a batch size of $64$, a learning rate of $0.001$, and trained for $10$ epochs.
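The two hyperparameter rules in this paragraph are short enough to write out. The sketch below computes $\lambda=\sigma\sqrt{2\log{p}}$ and the smallest admissible $L$, i.e. the maximum eigenvalue of $\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}\mathbf{\Phi}\mathbf{A}$, obtained as the squared spectral norm of $\mathbf{\Phi}\mathbf{A}$. The function names are hypothetical.

```python
import numpy as np

def lam_from_sigma(sigma, p):
    """Regularization parameter lambda = sigma * sqrt(2 log p)."""
    return sigma * np.sqrt(2.0 * np.log(p))

def min_step_constant(A, Phi=None):
    """Largest eigenvalue of A^T Phi^T Phi A (Phi = identity if omitted)."""
    PA = A if Phi is None else Phi @ A
    return np.linalg.norm(PA, 2) ** 2   # squared largest singular value
```

For $p=20$ and $\sigma=0.1$, this gives $\lambda\approx 0.245$. Any $L$ at or above the value returned by `min_step_constant` satisfies the requirement stated above.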

4.2.2 MNIST

We first trained the RandNet auto-encoder for $20$ epochs, and then the classifier for another $20$ epochs. We hypothesized that the sparse code $\mathbf{x}_{T}$ obtained at the output of the encoder would be a useful representation for classification. The number of trainable parameters is $784\times 784=614{,}656$ for the auto-encoder, and $10\times 784+10$ for the classifier, a total of $622{,}506$ parameters. We set the number of FISTA iterations to $T=60$. We tuned $\lambda$ by grid search and used $\lambda=2.2$ and $\lambda=2$ for the Gaussian and sparse cases, respectively. We used $L=50$, which we estimated to be greater than the maximum eigenvalue of $\mathbf{A}^{\text{T}}\mathbf{\Phi}^{\text{T}}\mathbf{\Phi}\mathbf{A}$ when $\mathbf{A}$ is a random Gaussian matrix. We used a batch size of $16$ and a learning rate of $0.005$. We initialized the estimated dictionary $\hat{\mathbf{A}}$ with i.i.d. $\mathcal{N}(0,\frac{1}{784})$ entries and normalized its columns to have unit length. We initialized $\hat{\mathbf{C}}$ and $\hat{\mathbf{d}}$ with i.i.d. Uniform entries in the interval $[\frac{-1}{\sqrt{784}},\frac{1}{\sqrt{784}}]$.
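The initialization and the parameter count stated above can be checked with a few lines of NumPy. This is a sketch with a hypothetical function name and seed, and the paper's PyTorch implementation may initialize its tensors differently in detail.

```python
import numpy as np

def init_mnist_parameters(N=784, p=784, K=10, seed=0):
    """Initial dictionary, classification matrix, and bias for MNIST."""
    rng = np.random.default_rng(seed)
    # i.i.d. N(0, 1/784) entries, then unit-norm columns.
    A = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, p))
    A /= np.linalg.norm(A, axis=0)
    # i.i.d. Uniform entries in [-1/sqrt(784), 1/sqrt(784)].
    bound = 1.0 / np.sqrt(p)
    C = rng.uniform(-bound, bound, size=(K, p))
    d = rng.uniform(-bound, bound, size=K)
    return A, C, d

A_hat, C_hat, d_hat = init_mnist_parameters()
n_params = A_hat.size + C_hat.size + d_hat.size  # 614656 + 7840 + 10
```

The total matches the $622{,}506$ trainable parameters reported in the text.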

Fig. 2: Error $\text{err}(\mathbf{A},\hat{\mathbf{A}})$ as a function of epoch for CRsAE (green $\circ$) and RandNet. “G 0.1” stands for RandNet with Gaussian $\mathbf{\Phi}$ and $\beta=0.1$. “S 0.5” stands for RandNet with sparse $\mathbf{\Phi}$ and $\beta=0.5$. “CK” stands for CK-SVD.

4.3 Results

4.3.1 Simulation

We use the following measure [1] to quantify the distance between the learned dictionary and that used to simulate the data

\text{err}(\mathbf{A},\hat{\mathbf{A}})=\max_{i}\left(\sqrt{1-\frac{\langle\mathbf{a}_{i},\hat{\mathbf{a}}_{i}\rangle^{2}}{\|{\mathbf{a}}_{i}\|_{2}^{2}\|\hat{\mathbf{a}}_{i}\|_{2}^{2}}}\right).

(8)

The lower this measure, which ranges from $0$ to $1$, the closer the estimated dictionary $\hat{\mathbf{A}}$ is to the true dictionary $\mathbf{A}$. We initialized $\hat{\mathbf{A}}$ by randomly perturbing $\mathbf{A}$ so that $\text{err}(\mathbf{A},\hat{\mathbf{A}})\approx 0.5$ [1]. Fig. 2 shows $\text{err}(\mathbf{A},\hat{\mathbf{A}})$ as a function of epoch for CRsAE (green $\circ$), as well as RandNet with Gaussian $\mathbf{\Phi}$ (G) and row-sparse $\mathbf{\Phi}$ (S), for $\beta=0.1$, $0.3$, and $0.5$. The figure highlights the ability of CRsAE to learn the underlying dictionary when the dictionary is dense, and more importantly, the ability of RandNet to learn the dictionary from randomly-projected data that live in a subspace whose dimension is a factor $\beta$ lower than the original dimension. The figure shows that for the case of $\beta=0.1,0.3,0.5$ and Gaussian $\mathbf{\Phi}$, RandNet successfully learns the dictionary. When $\mathbf{\Phi}$ is row sparse, we are able to learn the dictionary using RandNet for $\beta=0.3,0.5$, and perform better than CK-SVD (CK) for $\beta=0.1$.
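The error measure of Eq. 8 is straightforward to compute. The sketch below (hypothetical function name) compares column $i$ of $\mathbf{A}$ with column $i$ of $\hat{\mathbf{A}}$, so it assumes the columns of the two dictionaries are already aligned. The clipping only guards against small negative values caused by floating-point rounding.

```python
import numpy as np

def dictionary_error(A, A_hat):
    """err(A, A_hat) of Eq. 8: worst-case sine of the angle between matched columns."""
    inner = np.sum(A * A_hat, axis=0)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(A_hat, axis=0)
    cos2 = (inner / norms) ** 2
    return float(np.max(np.sqrt(np.clip(1.0 - cos2, 0.0, None))))
```

The measure is invariant to the sign and scale of each column, and it equals $1$ as soon as one estimated column is orthogonal to its true counterpart.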

4.3.2 MNIST

Unsupervised: Fig. 3(a) shows the $60$ columns of the learned matrix $\hat{\mathbf{A}}$ that are used the most to approximate $\mathbf{r}$ using $\mathbf{x}_{T}$. RandNet has learned features similar to the digits in the original data space even though it was trained using randomly-projected data that visually look like noise. Fig. 3(b) is an example of how RandNet processes its inputs.

Supervised: We benchmarked RandNet against CK-SVD, the discriminative recurrent sparse auto-encoder (DrSAE) [11] architecture, and the supervised DL (SDL) algorithm from [13]. For CK-SVD, we set the sparsity level to $15$ and learned $\mathbf{A}_{c}\in{\mathbb{R}}^{78}$ independently for each class $c$ over 180 iterations. We assign a randomly-projected image to the class with minimum reconstruction error $\|\mathbf{r}-\mathbf{\Phi}\mathbf{A}_{c}\mathbf{x}\|^{2}$. DrSAE implements a variant of the ISTA algorithm [10] and aims to reconstruct the input data, as well as achieve high classification accuracy. SDL jointly solves a DL problem and a classification problem. Table 2 shows that RandNet with $\beta=0.5$ achieves test error rates of $1.56\%$ and $3.16\%$ using Gaussian and row-sparse $\mathbf{\Phi}$, respectively. The corresponding rates for CK-SVD are $3.72\%$ and $5.20\%$. DrSAE and SDL reach error rates of $1.08\%$ and $1.05\%$, respectively. For Gaussian $\mathbf{\Phi}$, the fact that RandNet achieves such a low error using a dataset half the size of the original one is both surprising and impressive. We can interpret the higher error rate in the sparse case as the cost of computational efficiency. The accuracy for more efficient networks (lower $\beta$) was lower and hence not reported. Below, we explain the higher error rate in the sparse case from the perspective of compressed sensing.

Error Rate [$\%$]	RandNet	CK-SVD	DrSAE	SDL
(G)	$1.56$	$3.72$	$1.08$	$1.05$
(S)	$3.16$	$5.20$	$1.08$	$1.05$

Table 2: MNIST classification error [$\%$] on test dataset. (G) stands for Gaussian and (S) for row-sparse projection.

Why does RandNet work? Suppose an oracle handed us a dictionary $\mathbf{A}$ such that each MNIST image admits a sparse representation in the dictionary. The theory of compressed sensing would then guarantee that the sparse representation can be estimated from very few random projections of the original MNIST images [14]. The close-to-state-of-the-art performance of RandNet on MNIST is evidence that there indeed exists a dictionary in which MNIST images have sparse representations and, more importantly, that the sparse representations in this dictionary are useful for classification. In the absence of such a dictionary, classification of MNIST images from random compressed measurements would likely fail.

To further highlight this intuition, we analyzed the impact of the sparsity of the representation $\mathbf{x}_{T}$ on classification accuracy by sweeping the value of the regularization parameter $\lambda$ from $0.5$ to $4$ (the larger this value, the sparser the representation) when $\beta=0.5$. We found that the classification error reaches its minimum at $\lambda=2.2$ in the Gaussian case (black $\circ$) and $\lambda=2$ in the sparse case (red $\star$), indicating that there is an optimal amount of sparsity in the learned dictionary that gives good reconstruction and classification accuracies. Fig. 4 shows these results. The compressed-sensing perspective also helps to explain the higher error rate obtained with sparse measurement matrices. For measurement matrices of the same dimension, the restricted isometry property (RIP) constants of sparse matrices are worse than those of dense ones [5]. This suggests that, for a sparse measurement matrix, one would need more measurements to achieve the same error rate as for a Gaussian measurement matrix. We hypothesize that decreasing the sparsity of the rows of the measurement matrix will improve classification accuracy. We will explore this in future work.

5 Conclusion

We proposed a general framework to train NNs from compressed measurements. Specifically, we introduced RandNet, a class of networks that, in the unsupervised setting, performs dictionary learning from random projections of the original data. In the supervised setting, we demonstrated that RandNet can classify MNIST digits when the network is trained using random projections of the images that live in a subspace of smaller dimension than the original one. RandNet reached an error rate of $1.56\%$ when the measurement matrix was Gaussian and $3.16\%$ when using row-sparse measurements, the case where RandNet yields the most significant benefits in terms of memory access and computational efficiency. Overall, relative to DrSAE and SDL, RandNet gave up about $0.5$ percentage points of accuracy with Gaussian measurements and about $2.1$ with row-sparse measurements, in exchange for increased efficiency in terms of computation and memory.

6 Acknowledgments

This work is partially supported by the Quantitative Biology Initiative at Harvard University.

References

[1] A Agarwal, A Anandkumar, P Jain, P Netrapalli, and R Tandon, “Learning sparsely used overcomplete dictionaries via alternating minimization,” SIAM Journal on Optimization, vol. 26, pp. 2775–2799, 2016.
[2] B Tolooshams, S Day, and D Ba, “Scalable convolutional dictionary learning with constrained recurrent sparse auto-encoders,” in Proc. of 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing, Sept. 2018, pp. 1–6.
[3] B Tolooshams, S Day, and D Ba, “Deep residual auto-encoders for expectation maximization-based dictionary learning,” 2019, arXiv:1904.08827.
[4] V Papyan, Y Romano, and M Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” Journal of Machine Learning Research, vol. 18, pp. 1–52, 2017.
[5] D Ba, “Deeply-sparse signal representations ( $\text{D}\text{S}^{2}\text{P}$ ),” 2018, arXiv:1807.01958.
[6] F Pourkamali Anaraki and Sh Hughes, “Memory and computation efficient PCA via very sparse random projections,” in Proc. of the 31st International Conference on Machine Learning, Beijing, China, 22–24 Jun 2014, vol. 32, pp. 1341–1349.
[7] F Pourkamali-Anaraki, S Becker, and Sh M Hughes, “Efficient dictionary learning via very sparse random projections,” in Proc. of 2015 International Conference on Sampling Theory and Applications, 2015, pp. 478–482.
[8] Y LeCun, L Bottou, Y Bengio, and P Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[9] F Pourkamali Anaraki and S M Hughes, “Compressive K-SVD,” in Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 5469–5473.
[10] A Beck and M Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[11] J T Rolfe and Y LeCun, “Discriminative recurrent sparse auto-encoders,” in Proc. of International Conference on Learning Representations, 2013, pp. 1–15.
[12] S S Chen, D L Donoho, and M A Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, pp. 129–159, 2001.
[13] J Mairal, J Ponce, G Sapiro, A Zisserman, and F R Bach, “Supervised dictionary learning,” in Proc. of Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., pp. 1033–1040. Curran Associates, Inc., 2009.
[14] E J Candes, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.