RandNet: Deep Learning with Compressed Measurements of Images
Abstract
Principal component analysis, dictionary learning, and auto-encoders are all unsupervised methods for learning representations from a large amount of training data. In all these methods, the higher the dimensions of the input data, the longer it takes to learn. We introduce a class of neural networks, termed RandNet, for learning representations using compressed random measurements of data of interest, such as images. RandNet extends the convolutional recurrent sparse auto-encoder architecture to dense networks and, more importantly, to the case when the input data are compressed random measurements of the original data. Compressing the input data makes it possible to fit a larger number of batches in memory during training. Moreover, in the case of sparse measurements, training is more efficient computationally. We demonstrate that, in unsupervised settings, RandNet performs dictionary learning using compressed data. In supervised settings, we show that RandNet can classify MNIST images with minimal loss in accuracy, despite being trained with random projections of the images that result in a reduction in size. Overall, our results provide a general principled framework for training neural networks using compressed data.
Index Terms— Random Neural Networks, Dictionary Learning, Sparse Representation, Compressed Classification.
1 Introduction
Representation learning has become an important problem in recent years both in the signal processing and machine learning communities. In signal processing, dictionary learning (DL) [1] is the de facto method for learning adaptive data representations. In machine learning, deep learning is the method of choice to learn representations that are adapted to data. In several tasks such as image classification, the early success of DL, and more recently deep learning, can be attributed largely to their ability to learn representations tailored to the task of interest.
Traditionally, DL has been restricted to the shallow case, where only one set of weights is learned from data. The works in [2, 3] have shown a one-to-one correspondence between DL and neural networks (NNs) and, more specifically, how to train an auto-encoder to perform unsupervised DL. Recent work [4, 5] has shown the connections between deep DL and deep learning, emphasizing a new perspective on deep NNs as efficient algorithms for solving deep DL problems. The primary advantages of using deep NNs for DL are the widespread GPU-based infrastructure that supports them (e.g. TensorFlow), and the ease with which they can be deployed. However, in a number of applications of both DL and NNs, such as video processing, the size of datasets required for learning is a computational bottleneck.
This motivates the need for a framework that can learn from reduced-size data. In [6, 7], the authors propose two different methods for learning representations from random projections of data. The first, [6], performs PCA using compressed measurements. The second, called CK-SVD [7], is an optimization-based alternating-minimization algorithm [1] for DL from sparse random measurements of data. The benefits of these approaches are memory and computational efficiency [6], and the ability to train using reduced-size data [7].
This paper brings together the idea of training NNs for DL, and that of using compressed random measurements of data to solve DL tasks. Specifically, we introduce a framework to train NNs to perform unsupervised DL or a supervised task, e.g. classification, using only compressed projections of the data. The resulting class of architectures, which we call RandNet, is more efficient in terms of memory and computation than conventional ones. In the unsupervised setting, we demonstrate that RandNet performs DL. The architecture is a variant of the constrained recurrent sparse auto-encoder (CRsAE) [3] in which the encoder uses randomly projected data to obtain a sparse representation. For supervised tasks, RandNet uses the sparse representation produced at the output of the encoder as features. We demonstrate that the performance of RandNet in the classification of MNIST handwritten digits [8] rivals the state-of-the-art.
2 Dictionary Learning
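Let $x^j \in \mathbb{R}^p$ be a sparse vector and $y^j \in \mathbb{R}^n$ be the vector obtained as the sum of a sparse linear combination of the columns of a dictionary $A \in \mathbb{R}^{n \times p}$ and additive noise $v^j \in \mathbb{R}^n$
$$y^j = A x^j + v^j, \quad j = 1, \ldots, J, \tag{1}$$
where $J$ is the total number of vectors and $v^j$ is i.i.d. noise. In what follows, we refer to each $y^j$ as an example, each $x^j$ as a sparse code, and assume the entries of $v^j$ have variance $\sigma^2$. The goal of DL is to learn a dictionary $A$ such that each example $y^j$ can be approximated as a sparse linear combination of columns from $A$ using the sparse vector $x^j$. From an optimization perspective, the DL problem aims to solve
$$\min_{A, \{x^j\}_{j=1}^{J}} \; \sum_{j=1}^{J} \frac{1}{2} \| y^j - A x^j \|_2^2 + \lambda \| x^j \|_1 \quad \text{s.t.} \quad \| a_i \|_2 = 1, \tag{2}$$
for $i = 1, \ldots, p$, where $\lambda$ is a sparsity-enforcing parameter that depends on the statistics of the noise $v^j$. The vector $a_i$ denotes the $i$-th column of the matrix $A$, and the norm constraint avoids scaling ambiguity. The objective is jointly non-convex in $A$ and $\{x^j\}_{j=1}^{J}$.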
2.1 Classical DL
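A classical method to circumvent the non-convexity in Eq. 2 is the alternating-minimization algorithm [1], which alternates between a sparse coding step and a dictionary update step. Given an estimate $\hat{A}$ of the dictionary, the sparse coding step estimates each sparse code by solving
$$\hat{x}^j = \arg\min_{x^j} \; \frac{1}{2} \| y^j - \hat{A} x^j \|_2^2 + \lambda \| x^j \|_1. \tag{3}$$
The dictionary update step uses the newly-estimated codes to update the dictionary as the solution of
$$\hat{A} = \arg\min_{A} \; \sum_{j=1}^{J} \frac{1}{2} \| y^j - A \hat{x}^j \|_2^2 \quad \text{s.t.} \quad \| a_i \|_2 = 1, \; i = 1, \ldots, p. \tag{4}$$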
2.2 Constrained Auto-Encoders
A recent line of work [2, 3] has developed an auto-encoder, named CRsAE, to solve Eq. 2 when $A$ is a convolutional operator. The encoder in CRsAE imitates the sparse coding step: it uses $A$, $A^{\mathsf{T}}$, and the ReLU activation function to map the data $y$ to a sparse code. The decoder is linear and uses $A$, constrained to be the matrix used in the encoder, to map the output of the encoder to a reconstruction of the data. To learn $A$, CRsAE minimizes the squared-error reconstruction loss by backpropagation. The unsupervised block of Fig. 1, when the measurement matrix $\Phi$ is the identity, illustrates the CRsAE architecture. We extend the work in [2, 3] and show, for the first time, that we can train CRsAE to learn a dense dictionary. In addition, through a connection with compressive DL [7, 9], we show that we can train a modified CRsAE, termed RandNet, to perform DL using randomly-compressed versions of the data $y$.
2.3 DL from Compressed Data
Compressive DL is the problem of learning a dictionary from compressed data, where we assume the compressed data are obtained by projecting the original data onto a lower-dimensional subspace [7, 9]. Compressive DL [9] uses a version of alternating minimization based on the K-SVD algorithm and an $\ell_0$ constraint to enforce sparsity. Using the $\ell_1$-norm, the problem becomes
$$\min_{A, \{x^j\}_{j=1}^{J}} \; \sum_{j=1}^{J} \frac{1}{2} \| z^j - \Phi^j A x^j \|_2^2 + \lambda \| x^j \|_1 \quad \text{s.t.} \quad \| a_i \|_2 = 1, \tag{5}$$
where $z^j = \Phi^j y^j \in \mathbb{R}^m$ is the compressed version (random projection) of the example $y^j$ and $\Phi^j \in \mathbb{R}^{m \times n}$ is a known random measurement matrix with $m < n$. The work in [9] shows that it is indeed possible to learn the dictionary $A$ when $\Phi^j$ is a random Gaussian matrix. For the rest of the paper, we drop the superscript $j$ to simplify notation.
3 RandNet
We introduce a class of NNs, which we call RandNet, for learning representations from compressed data. First, we introduce the architecture in unsupervised and supervised settings. In unsupervised settings, RandNet is a variant of the CRsAE architecture (Sec. 2.2) to solve Eq. 5. In supervised cases, RandNet uses the output of the encoder as features for the supervised task. Then, we explain why RandNet is memory and computationally efficient, both when the measurement matrix is Gaussian and when it is row sparse.
3.1 Unsupervised NNs with Compressed Inputs
RandNet in unsupervised settings is an auto-encoder. Given , and , the encoder solves the sparse coding step in Eq. 5. The encoder implements the FISTA algorithm [10], which generates a sequence , indexed by iteration number , that converges to a sparse code after iterations. Similar to [2], we define a state vector . Algorithm 1 details the steps of the encoder, where the two-sided ReLU non-linearity is defined element-wise as . The decoder applies and then to to obtain . Similar to CRsAE, the parameters of the encoder and decoder are tied. The unsupervised block in Fig. 1 shows this forward pass. The loss function associated with the architecture is .
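The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`fista_encoder`, `two_sided_relu`, `decoder`), the zero initialization, and the use of nested lists instead of PyTorch tensors are assumptions made to keep the sketch dependency-free. The two-sided ReLU is assumed to be the usual shrinkage operator, ReLU(u - b) - ReLU(-u - b), with threshold b equal to the regularization parameter divided by the step constant, as in standard FISTA [10].

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    Qt = transpose(Q)
    return [[sum(a * b for a, b in zip(row, col)) for col in Qt] for row in P]


def two_sided_relu(u, b):
    """Element-wise shrinkage ReLU(u - b) - ReLU(-u - b) (assumed form)."""
    return [max(u_i - b, 0.0) - max(-u_i - b, 0.0) for u_i in u]


def fista_encoder(z, Phi, A, lam, L, T):
    """Run T FISTA iterations on 0.5 * ||z - Phi A x||^2 + lam * ||x||_1."""
    B = matmul(Phi, A)               # effective dictionary seen by the encoder
    Bt = transpose(B)
    p = len(A[0])
    x_prev = [0.0] * p               # previous iterate
    w = [0.0] * p                    # extrapolated (momentum) point
    s_prev = 1.0                     # FISTA momentum scalar
    for _ in range(T):
        residual = [z_i - r_i for z_i, r_i in zip(z, matvec(B, w))]
        step = [w_i + g_i / L for w_i, g_i in zip(w, matvec(Bt, residual))]
        x = two_sided_relu(step, lam / L)
        s = (1.0 + (1.0 + 4.0 * s_prev ** 2) ** 0.5) / 2.0
        w = [x_i + (s_prev - 1.0) / s * (x_i - xp_i)
             for x_i, xp_i in zip(x, x_prev)]
        x_prev, s_prev = x, s
    return x_prev


def decoder(x, Phi, A):
    """Tied linear decoder: apply the dictionary, then the measurement matrix."""
    return matvec(Phi, matvec(A, x))
```

Because the decoder reuses the same `A` and `Phi` as the encoder, the only trainable quantity in this sketch is the dictionary, which mirrors the weight tying described above.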
3.2 Supervised NNs with Compressed Inputs
In supervised settings, e.g. classification, we extend RandNet following a similar approach to the architecture in [11]. Specifically, we show how a network should be designed for a classification task to be trained on compressed data. For $C$-class classification, we define a $C$-dimensional label vector $q$ such that for each class $c = 1, \ldots, C$
$$q_c = \begin{cases} 1 & \text{if the example belongs to class } c, \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$
The goal is to learn a mapping from the sparse representation $x$ of the data produced at the output of the encoder (see Sec. 3.1) to an estimated vector of probabilities $\hat{q}$, where $\hat{q} = \mathrm{softmax}(W x + b)$. The supervised block in Fig. 1 shows the forward pass of the resulting architecture. We define the categorical cross-entropy loss
$$\mathcal{L}_{\mathrm{CE}} = - \sum_{c=1}^{C} q_c \log \hat{q}_c. \tag{7}$$
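As a small illustration of the supervised block, the sketch below maps a sparse code to class probabilities and evaluates the categorical cross-entropy against a one-hot label. It assumes the mapping is an affine layer followed by a softmax; the names `classify`, `one_hot`, and `cross_entropy`, and the use of plain lists, are hypothetical choices for this example only.

```python
import math


def softmax(u):
    """Numerically stable softmax of a list of logits."""
    top = max(u)
    exps = [math.exp(u_i - top) for u_i in u]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x, W, b):
    """Affine map of the sparse code x followed by a softmax (assumed form)."""
    logits = [sum(w * x_j for w, x_j in zip(row, x)) + b_c
              for row, b_c in zip(W, b)]
    return softmax(logits)


def one_hot(c, num_classes):
    """Label vector with a 1 at class index c and 0 elsewhere."""
    return [1.0 if k == c else 0.0 for k in range(num_classes)]


def cross_entropy(q, q_hat):
    """Categorical cross-entropy between a one-hot label and probabilities."""
    return -sum(q_c * math.log(p_c) for q_c, p_c in zip(q, q_hat) if q_c > 0.0)
```

With all-zero weights the classifier is uninformative, so a two-class problem yields probabilities of one half each and a loss of log 2, which is a convenient sanity check before training.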
3.3 RandNet Backpropagation Algorithm
In the unsupervised case, we minimize the reconstruction loss subject to a unit-norm constraint on the columns of the dictionary, which avoids scaling ambiguity between the dictionary and the sparse codes. For the supervised case, we proceed in two stages to optimize the parameters of interest. First, we train the unsupervised network to learn a dictionary that approximates the compressed data when multiplied by the measurement matrix. Then, given the learned dictionary, we learn the classification matrix and the bias that minimize the cross-entropy loss.
Algorithm 2 is the backpropagation algorithm for computing the gradient, denoted , of the loss functions with respect to the parameters of interest. The vectors and are the vectors of stacked columns from and , respectively. In practice, we compute the gradients through PyTorch’s autograd. We omit the derivation.
(a) Gaussian | (b) Sparse | (c) Identity | ||
---|---|---|---|---|
Memory Storage | ||||
Memory Access | ||||
Matrix Operation | ||||
Benefits of random projections: One advantage of RandNet compared to classical NNs is that a larger amount of data can fit on GPU memory, because the network is trained with the compressed data. Let denote the measurement ratio [9]. The memory storage cost of RandNet is , as opposed to for a network without compression. For Gaussian , similar to [9], RandNet takes advantage of this reduction in memory storage cost by storing the compressed data instead of the original data. The benefits of RandNet are even more pronounced when is row sparse. In this case, we assume each row of is -sparse () with non-zero entries chosen uniformly at random and set to with equal probability. Let denote the compression factor [6, 7]. The cost for the GPU of accessing the data during training is , compared to the cost of accessing the data in its original dimension [6]. Therefore, in the sparse case and for small enough , the cost of accessing the data is even lower than storing and accessing it in compressed form. In addition to storage/access benefits, row-sparse make RandNet efficient computationally. In RandNet, the cost of applying is compared to without compression. This is because operates on the sparse vector produced by the column-sparse operator (Algorithm 1, line ). Table 1 summarizes these benefits.
Shared projection operators: In practice, similar to [7], we divide the data into blocks. Examples with indices belong to block and share the measurement matrix , . We denote an example from block as and its projection . We note that we do not need to store the matrices. Instead, we store the fixed random seed associated with each matrix. For the rest of the paper, we drop the notation to simplify the notation.
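The idea of storing a seed instead of the measurement matrix can be sketched as follows. This is an illustrative example, not the authors' code: the helper names, the mapping from example index to block (integer division by a fixed block size), the seed offset `base_seed`, and the choice of non-zero values of plus or minus one are all assumptions.

```python
import random


def row_sparse_matrix(m, n, s, seed):
    """Regenerate an m-by-n matrix with s non-zeros per row from a seed."""
    rng = random.Random(seed)            # private generator, reproducible
    Phi = []
    for _ in range(m):
        row = [0.0] * n
        for idx in rng.sample(range(n), s):      # s distinct positions
            row[idx] = rng.choice((-1.0, 1.0))   # assumed +/-1 values
        Phi.append(row)
    return Phi


def block_index(example_index, block_size):
    """Hypothetical block assignment: consecutive examples share a block."""
    return example_index // block_size


def project(y, example_index, block_size, m, s, base_seed=0):
    """Compress y with the matrix of its block, rebuilt from the block seed."""
    seed = base_seed + block_index(example_index, block_size)
    Phi = row_sparse_matrix(m, len(y), s, seed)
    return [sum(p * v for p, v in zip(row, y)) for row in Phi]
```

Only one integer per block has to be kept, since the same seed always regenerates the same matrix; each projected entry touches just `s` entries of the example, which is where the access and computation savings of the row-sparse case come from.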
4 Experiments
We train RandNet on a simulated dataset and on MNIST. In the simulated case, we train a) a CRsAE architecture to learn the dictionary underlying data simulated according to Eq. 1, and b) an unsupervised RandNet architecture to learn the same dictionary from random projections of the data when is Gaussian and also when it is row sparse. As a benchmark, we implement the CK-SVD algorithm [7]. We use MNIST to demonstrate the use of RandNet in a classification task with randomly-projected data (digits) as inputs.
4.1 Datasets
4.1.1 Simulation
We simulated a dataset of examples from the generative model of Eq. 1 with . The elements of are i.i.d. and drawn according to a distribution, followed by the normalization of each column to have unit length. Each sparse code is -sparse with i.i.d. nonzero entries following a Uniform distribution on the interval . We generated measurement matrices . We used either Gaussian matrices or sparse ones with -sparse rows . To study the effects of compression, we considered , i.e. , respectively. The corresponding compression factors are , respectively. For Gaussian , RandNet reduces the cost of storage by a factor of . For row-sparse , both the memory access cost and operation cost of are reduced by a factor of .
4.1.2 MNIST
We considered the MNIST dataset comprising grayscale handwritten digits, split into a training set with images and a test set with images. We vectorized the images, resulting in , . We generated random measurement matrices , for (). The goal is to learn a dictionary that yields a -dimensional sparse representation of the images that is useful for classifying each image into one of categories. Hence, the classification matrix .
4.2 Training
We implemented RandNet in PyTorch and trained it with backpropagation on a GPU (Tesla V100-SXM2) using mini-batch gradient descent with the ADAM optimizer.
4.2.1 Simulation
We divided the data into a training set of size and a test set of size . The number of trainable parameters is . We set the number of FISTA iterations to . This is crucial in producing sparse codes, which facilitates DL [1, 3]. We set [12] and tuned by grid search over the interval . We set , , and for CRsAE, Gaussian, and sparse RandNet, respectively. We picked each of these values so that it is greater than the maximum eigenvalue of [10], where depends on the architecture. We used a batch size of , a learning rate of , and trained for epochs.
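One way to pick a step constant that exceeds the largest eigenvalue is power iteration, sketched below. The sketch assumes the relevant matrix is the Gram matrix of the effective dictionary `B` (the dictionary itself for CRsAE, or the measurement matrix times the dictionary for RandNet), which is the Lipschitz constant FISTA requires [10]. The function names and the safety margin are hypothetical.

```python
def max_eigenvalue_gram(B, iters=500):
    """Estimate the largest eigenvalue of B^T B by power iteration."""
    p = len(B[0])
    v = [1.0 / p ** 0.5] * p                 # unit-norm starting vector
    eig = 0.0
    for _ in range(iters):
        Bv = [sum(b * x for b, x in zip(row, v)) for row in B]
        eig = sum(t * t for t in Bv)         # Rayleigh quotient, since ||v|| = 1
        w = [sum(row[j] * t for row, t in zip(B, Bv)) for j in range(p)]
        norm = sum(t * t for t in w) ** 0.5
        if norm == 0.0:                      # B v = 0: nothing left to iterate on
            break
        v = [t / norm for t in w]
    return eig


def step_constant(B, margin=1.05):
    """Return a constant slightly above the estimated largest eigenvalue."""
    return margin * max_eigenvalue_gram(B)
```

The Rayleigh quotient never overestimates the largest eigenvalue, so a margin above one keeps the constant on the safe side. A uniform starting vector can fail if it happens to be orthogonal to the leading eigenvector; a random start avoids that corner case.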
4.2.2 MNIST
We first trained the RandNet auto-encoder for epochs, and then the classifier for another epochs. We hypothesized that the sparse code obtained at the output of the encoder would be a useful representation for classification. The number of trainable parameters is for the auto-encoder, and for the classifier, a total of parameters. We set the number of FISTA iterations to . We tuned by grid search and used and for the Gaussian and sparse cases, respectively. We used and estimated it to be greater than the maximum eigenvalue of when is a random Gaussian matrix. We used a batch size of and a learning rate of . We initialized the estimated dictionary with i.i.d. entries and normalized its columns to have unit length. We initialized and with i.i.d. Uniform entries in the interval .
4.3 Results
4.3.1 Simulation
We use the following measure [1] to quantify the distance between the learned dictionary and that used to simulate the data
(8) |
The lower this measure, which ranges from to , the closer the estimated dictionary to the true dictionary . We initialized by randomly perturbing so that [1]. Fig. 2 shows as a function of epochs for CRsAE (green ), as well as RandNet with Gaussian (G) and row-sparse (S), for , , and . The figure highlights the ability of CRsAE to learn the underlying dictionary when the dictionary is dense, and more importantly, the ability of RandNet to learn the dictionary from randomly-projected data that live in a subspace whose dimension is a factor lower than the original dimension. The figure shows that for the case of and Gaussian , RandNet successfully learns the dictionary. When is row sparse, we are able to learn the dictionary using RandNet for , and perform better than CK-SVD (CK) for .
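A distance of this kind can be computed column by column. The sketch below assumes the measure is the largest, over columns, of the square root of one minus the squared cosine between a true column and its estimate; this choice lies between 0 and 1 and is insensitive to the sign and scale of each column, but the exact form of Eq. 8 is an assumption here, as is the function name.

```python
def dictionary_distance(A, A_hat):
    """Worst-case column mismatch between two dictionaries (lists of rows)."""
    worst = 0.0
    for a, a_hat in zip(zip(*A), zip(*A_hat)):       # iterate over columns
        inner = sum(u * v for u, v in zip(a, a_hat))
        norm_sq = sum(u * u for u in a) * sum(v * v for v in a_hat)
        cos_sq = inner ** 2 / norm_sq
        # Clamp at zero to guard against small negative rounding errors.
        worst = max(worst, max(0.0, 1.0 - cos_sq) ** 0.5)
    return worst
```

Identical dictionaries, or dictionaries whose columns differ only by sign, give a distance of 0, while a pair with at least one orthogonal column gives 1.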
4.3.2 MNIST
Unsupervised Fig. 3(a) shows the columns of the learned matrix that are used the most to approximate using . RandNet has learned features similar to the digits in the original data space even though it was trained using randomly-projected data that visually look like noise. Fig. 3(b) is an example of how RandNet processes its inputs.
Supervised We benchmarked RandNet against CK-SVD, the discriminative recurrent sparse auto-encoder (DrSAE) architecture [11], and the supervised DL (SDL) algorithm from [13]. For CK-SVD, we fixed the sparsity level and learned a dictionary independently for each class over 180 iterations. We assign a randomly-projected image to the class with the minimum reconstruction error. DrSAE implements a variant of the ISTA algorithm [10] and aims to reconstruct the input data, as well as achieve high classification accuracy. SDL jointly solves a DL problem and a classification problem. Table 2 shows that RandNet achieves test error rates of 1.56% and 3.16% using Gaussian and row-sparse measurement matrices, respectively. Table 2 also reports the corresponding rates for CK-SVD, DrSAE, and SDL. For Gaussian measurement matrices, the fact that RandNet achieves such a low error using a dataset half the size of the original one is both surprising and impressive. We can interpret the higher error rate in the sparse case as the cost of computational efficiency. The accuracy of more efficient networks (stronger compression) was lower and hence not reported. Below, we explain the higher error rate in the sparse case from the perspective of compressed sensing.
| | RandNet | CK-SVD | DrSAE | SDL |
|---|---|---|---|---|
| Error Rate [%] | (G) 1.56 | (G) | | |
| | (S) 3.16 | (S) | | |
Why does RandNet work? Suppose an oracle handed us a dictionary such that each MNIST image admits a sparse representation in the dictionary. The theory of compressed sensing would then guarantee that the sparse representation can be estimated from very few random projections of the original MNIST images [14]. The close-to-state-of-the-art performance of RandNet on MNIST is evidence that there indeed exists a dictionary in which MNIST images have sparse representations and, more importantly, that the sparse representations in this dictionary are useful for classification. In the absence of such a dictionary, classification of MNIST images from random compressed measurements would likely fail.
To further highlight this intuition, we analyzed the impact of the sparsity of the representation on classification accuracy by sweeping the value of the regularization parameter $\lambda$ (the larger this value, the sparser the representation). We found that the classification error reaches its minimum at a particular value of $\lambda$ in the Gaussian case (black) and in the sparse case (red), indicating that there is an optimal amount of sparsity in the learned dictionary that gives good reconstruction and classification accuracies. Fig. 4 shows these results. The compressed-sensing perspective also helps to explain the higher error rate obtained with sparse measurement matrices. For measurement matrices of the same dimension, the RIP constants of sparse matrices are worse than those of dense ones [5]. This suggests that, for a sparse measurement matrix, one would need more measurements to achieve the same error rate as for a Gaussian measurement matrix. We hypothesize that decreasing the sparsity of the rows of the measurement matrix will improve classification accuracy. We will explore this in future work.
5 Conclusion
We proposed a general framework to train NNs from compressed measurements. Specifically, we introduced RandNet, a class of networks that, in the unsupervised setting, performs dictionary learning from random projections of the original data. In the supervised setting, we highlighted the ability of RandNet to classify MNIST digits when the network is trained using random projections of the images that live in a subspace of smaller dimension than the original one. RandNet reached an error rate of 1.56% when the measurement matrix was Gaussian and 3.16% when using row-sparse measurements, the case where RandNet yields the most significant benefits in terms of memory access and computational efficiency. Overall, RandNet achieved a minimal loss in accuracy considering the increased efficiency in terms of computation and memory.
6 Acknowledgments
This work is partially supported by the Quantitative Biology Initiative at Harvard University.
References
- [1] A Agarwal, A Anandkumar, P Jain, P Netrapalli, and R Tandon, “Learning sparsely used overcomplete dictionaries via alternating minimization,” SIAM Journal on Optimization, vol. 26, pp. 2775–2799, 2016.
- [2] B Tolooshams, S Day, and D Ba, “Scalable convolutional dictionary learning with constrained recurrent sparse auto-encoders,” in Proc. of 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing, Sept. 2018, pp. 1–6.
- [3] B Tolooshams, S Day, and D Ba, “Deep residual auto-encoders for expectation maximization-based dictionary learning,” 2019, arXiv:1904.08827.
- [4] V Papyan, Y Romano, and M Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” Journal of Machine Learning Research, vol. 18, pp. 1–52, 2017.
- [5] D Ba, “Deeply-sparse signal representations,” 2018, arXiv:1807.01958.
- [6] F Pourkamali-Anaraki and Sh Hughes, “Memory and computation efficient PCA via very sparse random projections,” in Proc. of the 31st International Conference on Machine Learning, Beijing, China, 22–24 Jun 2014, vol. 32, pp. 1341–1349.
- [7] F Pourkamali-Anaraki, S Becker, and Sh M Hughes, “Efficient dictionary learning via very sparse random projections,” in Proc. of 2015 International Conference on Sampling Theory and Applications, 2015, pp. 478–482.
- [8] Y LeCun, L Bottou, Y Bengio, and P Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
- [9] F Pourkamali-Anaraki and S M Hughes, “Compressive K-SVD,” in Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 5469–5473.
- [10] A Beck and M Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
- [11] J T Rolfe and Y LeCun, “Discriminative recurrent sparse auto-encoders,” in Proc. of International Conference on Learning Representations, 2013, pp. 1–15.
- [12] S S Chen, D L Donoho, and M A Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, pp. 129–159, 2001.
- [13] J Mairal, J Ponce, G Sapiro, A Zisserman, and F R Bach, “Supervised dictionary learning,” in Proc. of Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., pp. 1033–1040. Curran Associates, Inc., 2009.
- [14] E J Candes, “The restricted isometry property and its implications for compressed sensing,” Comptes rendus mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.