We discuss some open questions in Capsule Networks and present potential modifications that can bring their architectures closer to those of mammalian sensory systems. We expect such modifications to improve the efficacy of these networks on perceptual tasks.
Perception includes identification, recognition, and tracking of objects within a scene; it may also involve the integration of information across modalities.² The mechanics of perception reduce ambiguity and can be viewed as minimization of entropy (Hinton, Sejnowski, 1983; Harth, Pandya, Unnikrishnan, 1986). Sensory areas of the mammalian brain³ are hierarchically organized, and the complexity of analysis increases up the hierarchy. But a unique feature of the brain’s organization is the presence of massive amounts of feedback⁴ traversing the hierarchy (DeYoe, Van Essen, 1988). The continuous flow of information through the feed-forward and feedback pathways results in effortless perception of scenes.
There has been a recent resurgence in the use of deep learning architectures and algorithms for perceptual tasks such as detection and recognition of visual objects, and even in domains as diverse as drug discovery and genomics (LeCun, Bengio, Hinton, 2015). Sejnowski, in his recent book (Sejnowski, 2018), details how Convolutional Neural Networks (CNNs) benefit from utilizing hierarchical architectures similar to those found in the mammalian visual system. Deep Learning systems with memory (Hochreiter, Schmidhuber, 1997) allow perception of sequential data such as text and speech. Attention mechanisms allow much more efficient processing in CNN and Long Short-Term Memory (LSTM) architectures (Mnih et al, 2014). But many of the currently popular Deep Learning architectures are feed-forward and do not benefit from feedback between layers, which is integral to real-time computations in mammalian sensory systems. Recently, some authors have realized the usefulness of feedback for object identification in CNN-like structures (Cao et al, 2018; Hu, Ramanan, 2016).
The recently introduced Capsule Network formalism (Sabour, Frosst, Hinton, 2017; Sabour, Frosst, Hinton, 2018) has many inherent advantages for perception. For example, the structure of capsules allows the structure of the natural world to be represented efficiently, and the routing coefficients implement part-whole relationships by selecting the lower-level capsules to be linked to those at higher levels. Recent modifications to the Capsule Network formalism have demonstrated improved accuracies on benchmark datasets (Zhao et al, 2019a; Zhao et al, 2019b); these have also made training of deep Capsule Networks possible (LaLonde, Bagci, 2018; Zhao, Unnikrishnan, 2019).
In mammalian sensory systems, activities at higher levels use feedback to control the information flow from and between lower levels (Sherman and Koch, 1986; Mumford, 1992; Singer, 1977; Tsotsos et al, 1995). Since the representations get more complex up the hierarchy, Harth and his collaborators proposed that an Inversion of Sensory Processing (ISP) may be a signature of the perceptual process in hierarchical systems with feedback (Harth, 1976; Harth, Unnikrishnan, 1985; Harth, Unnikrishnan, Pandya, 1987; Sastry, et al, 1999). The Capsule Network formalism performs ISP through reconstruction of the input image from the recognition unit outputs, and a reconstruction error⁵ is part of the total loss function. But in the current architectures, sensory inversion is utilized only for learning, not for recognition. Also, all architectures to date are feed-forward.
The routing mechanism in Deep Capsule Networks is well suited for feedback from higher levels to control information flow at lower levels, and we are developing a new formalism for this purpose. Such dynamic, deep Capsule Networks will be able to identify, recognize, and track multiple objects with high accuracy, allowing perception of cluttered scenes in real time and allowing autonomous vehicles to see, hear, think, and do.
OPEN QUESTIONS ON NETWORK STRUCTURE, ROUTING COEFFICIENTS, AND LOSS FUNCTION
Q01. What is the optimal way to create capsules from neurons?
A capsule can be viewed as a collection of neurons. So, how does one optimally group these neurons to create capsules that can efficiently solve the task at hand? Currently there are several ways in which neurons are grouped to create capsules. Examples include those presented in (Sabour, et al, 2017; Sabour, et al, 2018; Phaye, et al, 2018; Zhao et al, 2019c). What are the architectural choices for grouping neurons into capsules? Should neuron groupings satisfy some consistency requirements across multiple hidden layers?
Q02a. What is the optimal way to bring spatial structure?
Many of the problems efficiently solved by Capsules involve data with spatial structure in the 3D world. To solve such problems, it would be advantageous to have some spatial anchoring of capsules. For example, should the interaction between capsules in the same layer, whose receptive fields overlap, be different from the interactions between capsules whose receptive fields do not overlap? Should the competitive interactions between capsules in different layers depend on the extent of overlap in their receptive fields?
Q02b. How do we achieve localization in space and in filters?
The connections in CNNs are local, and this allows detection of features in small regions of the image. So, in Capsule Network architectures, if it is important to detect all features at one image point, the overall process should be capable of doing it. This leads to local connectivity and weight sharing in space. To achieve detection of multiple features, the network needs many filter planes. Each filter plane gives a representation of the image in terms of that filter. When these features are combined in a layered network, the connections are still local in space, but full in the filter domain.
Capsules say that each complex feature needs to look at only those simple features that are in tune with it, so that there is locality of connections in the feature domain. However, the features themselves are learnt, so the relevant features, and hence the locality of connections in the feature domain, must also be learnt. This is more about relationships between different filters, and hence may be more general than learning specific object categories. The idea of routing coefficients can be stated more generally as having local connections in the filter space too. So, to make the overall architecture compact and efficient, should the connectivity between layers in the filter domain also be made local, as in the space domain? But this locality in the filter domain has to be learned, and so, what should be the information on the basis of which the routing coefficients cij are adapted?
Q03. Are the routing coefficients doing their purported job?
The original motivation for the routing coefficients cij in Capsule Networks is to group lower-level features to preferentially support higher level features or objects. In current architectures, are the routing coefficients properly contributing towards this objective? How does one assess this?
Q04. What information should drive the segregation of routing coefficients?
The cij should be viewed as transient and modulatory; i.e., they create sophisticated Hebb assemblies. In the current version, Σj cij = 1.0; i.e., the total synaptic strength from a capsule i to all its output connections j1, j2, j3, etc. is constrained to be 1.0. With appropriate non-linearities for this operation, this will force each lower-level capsule i to send its output to one higher-level capsule j.
With the reverse normalization, Σi cij = 1.0; i.e., the total input (in terms of the routing coefficients) a capsule j receives from all its input connections i1, i2, i3, etc. is constrained to be 1.0. When this new constraint is achieved through appropriate non-linearities, this will force each upper-level capsule j to receive its input from one lower-level capsule i. Are there advantages if both normalizations are present? If they are present, how should they interact?
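The two constraints can be contrasted in a minimal sketch. It assumes, as in Sabour et al (2017), that the cij are obtained from routing logits (named b here) through a Softmax; the function names and the toy logits are hypothetical and serve only as an illustration.

```python
import math

def softmax(xs):
    """Numerically stable Softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize_over_outputs(b):
    """Current scheme: for each lower-level capsule i, the c[i][j] sum to 1 over j."""
    return [softmax(row) for row in b]

def normalize_over_inputs(b):
    """Reverse scheme: for each higher-level capsule j, the c[i][j] sum to 1 over i."""
    columns = [softmax(list(col)) for col in zip(*b)]
    # Transpose back so that the result is indexed as c[i][j].
    return [list(row) for row in zip(*columns)]

# Hypothetical logits b[i][j]: 3 lower-level capsules, 2 higher-level capsules.
b = [[2.0, 0.0],
     [0.5, 0.5],
     [0.0, 3.0]]

c_out = normalize_over_outputs(b)
c_in = normalize_over_inputs(b)

row_sums_out = [sum(row) for row in c_out]
col_sums_out = [sum(c_out[i][j] for i in range(3)) for j in range(2)]
col_sums_in = [sum(c_in[i][j] for i in range(3)) for j in range(2)]

print("sum over j (current scheme):", row_sums_out)
print("sum over i (current scheme):", col_sums_out)
print("sum over i (reverse scheme):", col_sums_in)
```

Under the current scheme only the sums over j are fixed at 1.0; the total coefficient mass arriving at a higher-level capsule is unconstrained, which is what the reverse normalization would change.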
Q05a. Should the routing coefficients be in a narrow range?
Current routing algorithms create cij with a narrow range in their magnitude. Is this desirable? If the routing coefficients are implementing what was originally envisioned, then after adaptation they should be close to their maximum or minimum values (i.e., 0 or 1). Should one have mechanisms to spread out cij?
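One possible way to quantify how far the cij are from the binary-like values described above is the normalized entropy of each lower-level capsule's coefficients. This diagnostic is an assumption introduced here for illustration, not a measure taken from the cited papers, and the coefficient rows are hypothetical.

```python
import math

def normalized_entropy(c_row):
    """Entropy of one capsule's routing coefficients, scaled to [0, 1].

    0.0 means all coefficient mass sits on a single higher-level capsule;
    1.0 means the coefficients are uniform.
    """
    n = len(c_row)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in c_row if p > 0.0)
    return h / math.log(n)

# Hypothetical rows of cij for three lower-level capsules.
c = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: no routing preference
    [0.30, 0.28, 0.22, 0.20],  # narrow range around the uniform value
    [0.97, 0.01, 0.01, 0.01],  # nearly binary
]

scores = [normalized_entropy(row) for row in c]
print(scores)
```

A score that stays close to 1.0 after routing would indicate that the coefficients have barely moved away from their uniform starting point.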
Q05b. Is Softmax contributing to the narrow range?
The current cij updating algorithms use a Softmax function to normalize them. Softmax is not scale-invariant. For example, when the inputs to the Softmax function are in the range 0.0 to 1.0, its outputs tend to stay close to their mean, which is also the uniform initial value. This may be one of the causes for the narrow dynamic range of cij in current implementations. Should one explore other normalization mechanisms, for example, max-min (see Zhao et al, 2019a)? Would this be effective in spreading cij? What type of nonlinearity is desirable for this normalization?
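A small numerical sketch illustrates the point. The logits are hypothetical values in the range 0.0 to 1.0, and the max-min function below is a generic rescaling to [0, 1] written for illustration; it is not claimed to reproduce the exact procedure of Zhao et al (2019a).

```python
import math

def softmax(xs):
    """Numerically stable Softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_min(xs, eps=1e-12):
    """Generic max-min rescaling to [0, 1]; eps guards against a zero range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + eps) for x in xs]

# Hypothetical routing logits for 10 higher-level capsules, all within [0, 1].
logits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
uniform = 1.0 / len(logits)

c_soft = softmax(logits)
c_mm = max_min(logits)
# Same logits multiplied by 10: Softmax output depends on the input scale.
c_soft_scaled = softmax([10.0 * x for x in logits])

print("uniform value:        ", uniform)
print("softmax range:        ", min(c_soft), max(c_soft))
print("max-min range:        ", min(c_mm), max(c_mm))
print("softmax(10 * x) max:  ", max(c_soft_scaled))
```

With these inputs the Softmax outputs stay within roughly 0.06 to 0.16, close to the uniform value of 0.1, whereas the max-min rescaling spans the full interval; scaling the same logits by 10 lets Softmax concentrate most of the mass on one capsule.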
Q06. Should the routing coefficients be normalized?
The current update rule (the dot product between ûj|i and vj) can only increase the routing coefficients at each iteration; in other words, the update only moves the coefficients in one direction. When i and j’ are more closely related than i and j, such a mechanism will increase cij’ more than it increases cij, and hence it may be acceptable. But since the coefficients ultimately multiply the signals to control signal flow, they need to be normalized. So, normalization is an integral part of the routing-coefficient update. In the current versions of Capsule Networks, the normalization is done using the Softmax function. As we have pointed out, Softmax may be keeping the cij in a narrow range. If the adaptation algorithms can self-normalize, there is no additional need to artificially normalize the routing coefficients after each iteration. In the nervous system, there is growing evidence that most of the adaptation/learning is through STDP⁶-type mechanisms that correlate changes in activities, rather than the traditional Hebb mechanisms that correlate the activities themselves (Sejnowski, 1999). In models of sensory development, we have shown that the use of STDP-type algorithms makes re-normalizations unnecessary (Unnikrishnan and Nine, 1997; Unnikrishnan and Nine, 2018). So, would the computations in Capsule Networks be more efficient if self-normalizing update rules are used for routing coefficients?
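The update-then-normalize cycle discussed above can be sketched as follows, assuming the dynamic-routing procedure of Sabour et al (2017), in which the dot-product agreement is accumulated into logits (named b here) and a Softmax turns those logits into cij. The function names, the nested-list data layout, and the toy prediction vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable Softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    """Squash nonlinearity: keeps the direction of s, maps its length into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for higher capsule j.

    Returns the coefficients c used in the last iteration and the outputs v they produced.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # logits start at zero, so c starts uniform
    for _ in range(iterations):
        c = [softmax(row) for row in b]       # normalization step: sum over j equals 1
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):                 # update step: add the dot product to b
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return c, v

# Toy predictions: capsules 0 and 1 agree about higher capsule 0 and disagree about capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[0.9, 0.1], [0.0, 1.0]],
]
c, v = route(u_hat)
print(c)
print(v)
```

In this sketch the Softmax is the only normalization; a self-normalizing rule of the kind asked about above would replace the separate `softmax` call inside the loop.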
Q07a. How can we say that the routing coefficients have converged to optimal values?
A qualitative way of measuring effectiveness of cij is to ask how well the routing is contributing towards stabilization of the final decision. For example, if cij are grouping well, then multiple iterations to make the grouping more binary-like would only stabilize the final decision, not change it. In this case, should cij be updated using a separate ‘spread loss’ (see Sabour et al, 2018) type of function? In the current formalism, it is not explicit what the cij update is optimizing.
Q07b. Should learning of weights and adaptation of routing coefficients have different objectives?
Weight updates in Capsule Networks use a loss function, and the learning algorithm maximizes the correctness of the network’s final decision. We can view the routing coefficients as contributing to the stabilization of this final decision. In current formalisms, the objective of weight-learning is explicit, while that of routing-coefficient adaptation is implicit. Wang and Liu (2018) have analyzed the routing as an optimization process, and Zhang et al (2018) have attempted to “learn” cij. Should we have an explicit objective function for cij adaptation?
Q08. Should the adaptation of cij be at a slower rate compared to the learning of wij?
What is the interplay between learning weights wij and adapting the routing coefficients cij? As we discussed above, should the evolution of each be driven by different error signals?
Until the lower-level capsules or neurons learn some features that help the network make decisions, it is difficult to envision how the routing coefficients could route these features to stabilize the decision. This raises the question: should the adaptation of cij proceed on a slower time-scale compared to the learning of wij?
Q09. Should the magnitude and direction be treated separately in the loss function?
In the formalism presented in (Sabour et al, 2017), a capsule is represented by a vector. The magnitude and direction of these vectors fulfill different purposes. Hence, care should be taken in implementation to preserve this abstraction and to ensure that it is not lost in the usual tensor operations for forward and backward passes and for weight updates. This issue becomes acute in Deep Capsule Networks.
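The separation between magnitude and direction can be made explicit in code. The sketch below assumes the squash nonlinearity of Sabour et al (2017), which rescales a capsule's length into [0, 1) while leaving its direction unchanged; the helper name `split` and the example vector are hypothetical.

```python
import math

def split(v):
    """Return (length, unit direction) of a capsule vector."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        return 0.0, [0.0] * len(v)
    return length, [x / length for x in v]

def squash(s):
    """Rescale only the length: |v| = |s|^2 / (1 + |s|^2); the direction is kept."""
    length, direction = split(s)
    new_length = length * length / (1.0 + length * length)
    return [new_length * d for d in direction]

s = [3.0, 4.0]            # hypothetical pre-activation of one capsule
v = squash(s)

len_s, dir_s = split(s)
len_v, dir_v = split(v)
print("length before and after:", len_s, len_v)
print("direction before and after:", dir_s, dir_v)
```

Keeping the two quantities as separate values, as `split` does, is one way to let a loss term act on the length alone without disturbing the direction.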
Q10. What should be the role of the reconstruction loss, and inversion of sensory processing?
The loss function for training weights has a recognition component and a reconstruction component. In the recognition component, can anything other than squared error have advantages; for example, mutual entropy?
The reconstruction error, in the current Capsule Network formalisms, can be analyzed from the point of view of Inversion of Sensory Processing (ISP) and feedback. As a first step, the relevance and utility of the reconstruction error should be systematically explored. The reconstruction part of the network is currently under-utilized. The reconstruction error essentially brings some amount of structure to the feed-forward recognition network. But reconstruction is not used during recognition. How do we make image reconstruction an integral part of recognition? As we mentioned in the Introduction section, in mammalian sensory systems the forward and feedback pathways intimately interact during perception.
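For reference, the two loss components can be sketched as follows, assuming the margin loss of Sabour et al (2017), a squared hinge on capsule lengths, plus a down-weighted squared reconstruction error. The constants are those reported in that paper (m+ = 0.9, m- = 0.1, down-weighting 0.5 for absent classes, reconstruction weight 0.0005); the function names and toy inputs are hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Recognition component: squared hinge on the length of each class capsule."""
    loss = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The capsule of the present class should be longer than m_plus.
            loss += max(0.0, m_plus - length) ** 2
        else:
            # Capsules of absent classes should be shorter than m_minus.
            loss += lam * max(0.0, length - m_minus) ** 2
    return loss

def reconstruction_loss(image, reconstruction):
    """Reconstruction component: sum of squared pixel differences."""
    return sum((x - y) ** 2 for x, y in zip(image, reconstruction))

def total_loss(lengths, target, image, reconstruction, recon_weight=0.0005):
    """Recognition loss plus the scaled reconstruction loss."""
    return (margin_loss(lengths, target)
            + recon_weight * reconstruction_loss(image, reconstruction))

# Toy example: three class capsules, the true class is 0, and a two-pixel "image".
lengths = [0.95, 0.05, 0.30]
image = [0.0, 1.0]
reconstruction = [0.0, 0.5]

print(margin_loss(lengths, 0))
print(total_loss(lengths, 0, image, reconstruction))
```

With the small reconstruction weight, the reconstruction term shapes the representation during training but contributes little to the total, and it plays no part when the network makes its decision.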
OPEN QUESTIONS ON HIERARCHY AND FEEDBACK
In addressing the following questions, we present some promising directions to introduce hierarchy, feedback, and Inversion of Sensory Processing (ISP) into the Capsule Network formalism. We show why these directions are relevant, why they introduce new perspectives compared to what is happening in CNNs, and why they can also lead to exciting developments elsewhere. Our motivation comes from the fact that deep CNN structures are well understood and several groups have successfully introduced feedback (Cao et al, 2018; Hu, Ramanan, 2016) and attention (Mnih, et al, 2014) into CNN structures. Our proposal is that, through the use of architectures that mimic mammalian sensory systems, we can create similar capabilities in Capsule Networks. We especially suggest creating Capsule Networks with explicit ISP. The sections below are written in a discussion-like format, and hence should be treated as such.
Q11. Should we have new update rules for routing coefficients in Deep Capsule Networks?
There have been several attempts to build Deep Capsule Networks (Sabour et al, 2018; LaLonde, Bagci, 2018; Rajasegaran et al, 2019), but many important hurdles remain before deep CapsNets become as widespread as deep CNNs. We have trained Deep Capsule Networks (with 6 layers) by modifying the formalism in (Sabour et al, 2017) and have obtained competitive results on CIFAR10 and CIFAR100 (Zhao, Unnikrishnan, 2019). The current cij updates do not perform well in multilayer Capsule Networks, and their role in the capsule formalism needs to be carefully evaluated.
As discussed, if we assume that wij is responsible for the final decision (i.e., recognition) and cij is responsible for the stabilization of this final decision, then we can develop methods to a) see if cij is converging appropriately, and then b) control the rate of convergence of cij vis-a-vis the rate of convergence of wij. This issue becomes acute in deep architectures, as the features need to develop first before cij can route them.
Q12. What should be the nature of feedback in Deep Capsule Networks?
Feedback from higher levels can be used for updating cij at a given level, following the concepts laid out in (Harth, Unnikrishnan, Pandya, 1987; Sastry, et al, 1999). The architectures in these publications have no learning. In this section, we propose how to bring the central ideas of ISP to Deep Capsule Networks. As we have noted earlier, the cij are well-suited to invert sensory processing.
The current Capsule Network architecture implements ISP, but the structure of the forward (recognition) network and that of the backward (reconstruction) network are very different. For example, in Sabour et al (2017), a multi-layer fully-connected neural network is used for image reconstruction, while the recognition is through a capsule network. In the mammalian brain, the same neurons participate in forward and backward processing. The two networks are the same, although the two pathways are distinct. If we make the structures of the forward and backward networks similar or identical, then it would be easy to bring “true ISP” to Deep Learning.
For example, an architecture can use the same network structure for forward and backward computations, but totally different weights in the forward and feedback pathways. A further mechanism for these two computations to be more closely interlinked at each level of the hierarchy can then be created. In these networks, as the learning proceeds, at each layer and at each capsule, we know what the input-driven (e.g., image-driven) activities are and what the output-driven (i.e., recognition-layer-driven) activities are. We can compare the weights in the forward and backward pathways to apply appropriate regularizations.
Q13. How can feedback help group the features?
Grouping of features should critically depend on feedback from higher layers to lower layers. Such feedback is not present in the current models. How should such feedback be organized?
We can think of three strategic ways: 1) Inversion during learning, but not during recognition (as in Sabour et al, 2017); 2) Inversion during recognition, but not during learning (as in Cao et al, 2018); and 3) ISP during learning and recognition. Zhang et al, 2018, and Cao et al, 2018, describe methods to find the structure of a trained CNN. These models identify sub-regions in the image relevant for different output nodes, and similar methods can be applied to Capsule Networks to get sufficient information about the structure of a trained feed-forward network. Harth et al, 1987, and Sastry et al, 1999, show how to put feedback into hand-crafted networks. The methods described in these two sets of papers can be combined to introduce feedback into Capsule Networks, in a “true” ISP-sense.
1. Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 448-453). IEEE New York.
2. Harth, E., Pandya, A. S., & Unnikrishnan, K. P. (1986). Perception as an optimization process. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
3. DeYoe, E. A., & Van Essen, D. C. (1988). Concurrent processing streams in monkey visual cortex. Trends in neurosciences, 11(5), 219-226.
4. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
5. Sejnowski, T. J. (2018). The deep learning revolution. MIT Press.
6. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
7. Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).
8. Cao, C., Huang, Y., Yang, Y., Wang, L., Wang, Z., & Tan, T. (2018). Feedback convolutional neural network for visual localization and segmentation. IEEE transactions on PAMI, 41(7), 1627-1640.
9. Hu, P., & Ramanan, D. (2016). Bottom-up and top-down reasoning with hierarchical rectified gaussians. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5600-5609).
10. Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in neural information processing systems (pp. 3856-3866).
11. Sabour, S., Frosst, N., & Hinton, G. (2018). Matrix capsules with EM routing. In 6th International Conference on Learning Representations, ICLR.
12. Zhao, Z., Kleinhans, A., Sandhu, G., Patel, I., & Unnikrishnan, K. P. (2019a). Capsule Networks with Max-Min Normalization. arXiv preprint arXiv:1903.09662.
13. Zhao, Z., Kleinhans, A., Sandhu, G., Patel, I., & Unnikrishnan, K. P. (2019b). Fast Inference in Capsule Networks Using Accumulated Routing Coefficients. arXiv preprint arXiv:1904.07304.
14. LaLonde, R., & Bagci, U. (2018). Capsules for object segmentation. arXiv preprint arXiv:1804.04241.
15. Zhao, Z., & Unnikrishnan, K. P. (2019). Training Deep Capsule Networks. Manuscript in preparation.
16. Sherman, S. M., & Koch, C. (1986). The control of retinogeniculate transmission in the mammalian lateral geniculate nucleus. Experimental Brain Research, 63(1), 1-20.
17. Mumford, D. (1992). Computational architecture of the neocortex. Biological cybernetics, 66(3), 241-251.
18. Singer, W. (1977). Control of thalamic transmission by corticofugal and ascending reticular pathways in the visual system. Physiological reviews, 57(3), 386-420.
19. Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial intelligence, 78(1-2), 507-545.
20. Harth, E. (1976). Visual perception: A dynamic theory. Biological Cybernetics, 22(3), 169-180.
21. Harth, E., & Unnikrishnan, K. P. (1985). Brainstem control of sensory information: A mechanism for perception. International journal of psychophysiology, 3(2), 101-119.
22. Harth, E., Unnikrishnan, K. P., & Pandya, A. S. (1987). The inversion of sensory processing by feedback pathways: A model of visual cognitive functions. Science, 237(4811), 184-187.
23. Sastry, P. S., Shah, S., Singh, S., & Unnikrishnan, K. P. (1999). Role of feedback in mammalian vision: a new hypothesis and a computational model. Vision Research, 39(1), 131-148.
24. Phaye, S. S. R., Sikka, A., Dhall, A., & Bathula, D. R. (2018). Multi-level dense capsule networks. In Asian Conference on Computer Vision (pp. 577-592). Springer, Cham.
25. Zhao, Y., Birdal, T., Deng, H., & Tombari, F. (2019c). 3D Point Capsule Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1009-1018).
26. Sejnowski, T. J. (1999). The book of Hebb. Neuron, 24(4), 773-776.
27. Unnikrishnan, K. P., & Nine, H. S. (1997). An engineering principle used by mother nature: use of feedback for robust columnar development. In Computational Neuroscience (pp. 533-542). Springer, Boston, MA.
28. Unnikrishnan, K. P., & Nine, H. S. (2018). Feedback Circuits for Robust Columnar Development. DOI: 10.13140/RG.2.2.31145.65128. At researchgate.net
29. Wang, D., & Liu, Q. (2018). An optimization view on dynamic routing between capsules. At openreview.net
30. Zhang, L., Edraki, M., & Qi, G. J. (2018). CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. In Advances in Neural Information Processing Systems (pp. 5814-5823).
31. Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., & Rodrigo, R. (2019). DeepCaps: Going Deeper with Capsule Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 10725-10733).
1 I would like to acknowledge many useful discussions with Prof. PS Sastry from the Indian Institute of Science, Bangalore.
2 For example, we may combine visual with olfactory information to recognize a rose with a distinct color and smell.
3 Visual cortex, auditory cortex, somatosensory cortex, etc.
4 Feedback is between different layers in a hierarchical system while recurrence is between units in the same layer.
5 We use “error” and “loss” interchangeably.
6 STDP – Spike Time Dependent Plasticity