Date of Award


Document Type


Degree Name

Master of Science (MS)


Electrical and Computer Engineering (Holcomb Dept. of)

Committee Member

Dr. Melissa C. Smith, Committee Chair

Committee Member

Dr. Robert J. Schalkoff

Committee Member

Dr. Adam W. Hoover

Committee Member

Dr. Harlan B. Russell


Deep Convolutional Neural Networks (CNNs) have emerged as the dominant approach for solving many problems in computer vision and perception. Most state-of-the-art approaches to these problems are designed to operate on RGB color images as input. However, CNNs have been deployed in several settings where more information about the state of the environment is available. Systems that require real-time perception, such as self-driving cars or drones, are typically equipped with sensors that offer several complementary representations of the environment. CNNs designed to take advantage of features extracted from multiple sensor modalities have the potential to increase the perception capability of autonomous systems. The work in this thesis extends the real-time CNN segmentation model ENet [39] to learn from a multimodal representation of the environment. Specifically, we investigate learning from disparity images generated by SGM [20] in conjunction with RGB color images from the Cityscapes dataset [10]. To do this, we create a network architecture called MM-ENet, composed of two symmetric feature extraction branches followed by a modality fusion network. We avoid depth encoding strategies such as HHA [15] because of their computational cost and instead operate on raw disparity images. We also constrain the resolution of training and testing images to be relatively small, downsampled by a factor of four with respect to the original Cityscapes resolution. This is because a deployed version of this system must also run SGM in real time, which could become a computational bottleneck at higher image resolutions. To design the best model for this task, we train several architectures that differ in the operation used to combine features: channel-wise concatenation, element-wise multiplication, and element-wise addition.
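The three combination operations named above can be illustrated with a minimal sketch. It is not the MM-ENet implementation: the function name `fuse`, the `FeatureMap` alias, and the nested-list layout `[channel][row][col]` are assumptions made for illustration, and a real network would apply these operations to tensors in a deep learning framework.

```python
from typing import List

# Hypothetical layout for illustration: feature_map[channel][row][col]
FeatureMap = List[List[List[float]]]


def fuse(rgb: FeatureMap, disp: FeatureMap, mode: str) -> FeatureMap:
    """Combine feature maps from the RGB and disparity branches.

    mode is one of "concat", "mul", or "add".
    """
    if mode == "concat":
        # Channel-wise concatenation: channel counts add up,
        # spatial size is unchanged.
        return list(rgb) + list(disp)

    if mode not in ("mul", "add"):
        raise ValueError(f"unknown fusion mode: {mode!r}")
    if len(rgb) != len(disp):
        # Element-wise operations need matching channel counts.
        raise ValueError("element-wise fusion needs equal channel counts")

    if mode == "mul":
        op = lambda a, b: a * b
    else:
        op = lambda a, b: a + b

    # Apply the operation position by position; output shape equals input shape.
    return [
        [[op(a, b) for a, b in zip(row_r, row_d)]
         for row_r, row_d in zip(ch_r, ch_d)]
        for ch_r, ch_d in zip(rgb, disp)
    ]


# Two tiny 2-channel, 1x2 feature maps.
rgb_feat = [[[1.0, 2.0]], [[3.0, 4.0]]]
disp_feat = [[[0.5, 0.5]], [[2.0, 0.0]]]

concat_out = fuse(rgb_feat, disp_feat, "concat")  # 4 channels
mul_out = fuse(rgb_feat, disp_feat, "mul")        # 2 channels
add_out = fuse(rgb_feat, disp_feat, "add")        # 2 channels
```

Concatenation doubles the channel count, so the layer after it must accept the wider input, while multiplication and addition keep the channel count unchanged.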
We evaluate all models using Intersection-over-Union (IoU) as the primary performance metric and instance-level IoU (iIoU) as a secondary metric. Compared to the baseline ENet model, we achieve comparable segmentation performance and are also able to take advantage of features that cannot be extracted from RGB images alone. The results show that, at this particular fusion location, element-wise multiplication is the best overall modality combination method. By observing feature activations at different points in the network, we show that depth information helps the network reason about object edges and boundaries that are not as salient in color space, particularly for spatially small object classes such as persons. We also present results suggesting that, even though each branch learned to extract unique features, these features can have complementary properties. By extending the ENet model to learn from multimodal data, we provide it with a richer representation of the environment. Because this extension simply duplicates layers in the encoder to create two symmetric feature extraction branches, the network also maintains real-time inference performance. Because the network is trained on lower-resolution images to remain within the constraints of an embedded system, its overall performance is competitive with, but below, that of the state-of-the-art models reported on the Cityscapes leaderboard. When deployed in a high-performance system, such as an autonomous vehicle that can generate disparity maps in real time at high resolution, MM-ENet can take advantage of otherwise unused data modalities to improve overall performance on semantic segmentation.
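For readers unfamiliar with the primary metric, per-class IoU is TP / (TP + FP + FN), counted over pixels. The sketch below is a minimal illustration of that formula. It is not the evaluation code used in this work. The function name `class_iou`, the flat label lists, and the ignore label value of 255 are assumptions made for the example. The instance-level variant (iIoU) additionally weights pixels by instance size and is not shown.

```python
import math
from typing import Sequence


def class_iou(pred: Sequence[int], target: Sequence[int],
              class_id: int, ignore_id: int = 255) -> float:
    """Pixel-level IoU for one class: TP / (TP + FP + FN).

    pred and target are flattened label maps of equal length.
    Pixels whose ground-truth label equals ignore_id (an assumed
    convention here) are skipped.
    """
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if t == ignore_id:
            continue
        if p == class_id and t == class_id:
            tp += 1   # predicted the class, and it is correct
        elif p == class_id:
            fp += 1   # predicted the class, but ground truth differs
        elif t == class_id:
            fn += 1   # missed a pixel that belongs to the class
    denom = tp + fp + fn
    # Undefined when the class appears in neither prediction nor ground truth.
    return tp / denom if denom else math.nan


pred_labels = [0, 0, 1, 1, 1, 2]
true_labels = [0, 1, 1, 1, 2, 255]

iou_class1 = class_iou(pred_labels, true_labels, class_id=1)
```

A mean IoU over a dataset would accumulate TP, FP, and FN per class across all images before dividing, rather than averaging per-image scores.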