Problem Context


Convolutional Neural Networks (CNNs) are an industry-standard AI architecture for image processing. This investigation analysed the effects of single-pixel, local-pixel and global-pixel feature engineering as input to three different CNN architectures, with the aim of increasing segmentation accuracy on drone-captured images of trees provided by Aerobotics.


Project Goal


The goal of this project was to determine whether using engineered features as inputs to CNNs can improve their segmentation performance for tree data.

The space of possible features was divided into the following categories:
• Per-pixel transforms.
• Small-width feature extractors.
• Large-width feature extractors.

For the investigation, we tested this approach with the following architecture classes:
• U-Nets
• Fully Convolutional Networks
• Atrous Convolutional Networks


Data Used


The data provided by Aerobotics was captured with multispectral cameras mounted on drones. These drones surveyed a large variety of tree crops, planted and arranged in various patterns. This system allowed Aerobotics to capture a variety of image bands, each representing a single aspect of the scene being imaged.

• RGB: A digital representation of a real-world scene, achieved by using three primary colours to represent any composite colour.

RGB


• DEM: Digital Elevation Map, which describes the height of each point with reference to the location on which that point lies.


DEM


• NDVI: Normalized Difference Vegetation Index, a spectral index used to highlight vegetation. It helps discriminate between vegetation that appears the same colour in RGB.


NDVI
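As a concrete illustration, NDVI is computed per pixel from the red and near-infrared bands as (NIR − Red) / (NIR + Red). The sketch below uses plain NumPy with hypothetical band arrays, not the project's actual data pipeline:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against divide-by-zero

# Vegetation reflects strongly in NIR, so healthy canopy pixels score near +1,
# while bare soil or shadow scores near 0.
nir = np.array([[0.8, 0.1]])
red = np.array([[0.1, 0.1]])
print(ndvi(nir, red))  # canopy pixel ~0.78, non-vegetation pixel ~0.0
```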


• NIR: The near-infrared band, which measures reflectance in the near-infrared portion of the electromagnetic spectrum. Again, this helps discriminate between vegetation that appears the same colour in RGB.


NIR


• MASK: An image layer which represents the ground truth. It is used to train our CNNs by backpropagating the error in their predictions.


MASK
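The training signal from the mask can be sketched as a pixel-wise binary cross-entropy loss, a common choice for binary segmentation. This is a minimal NumPy illustration, not necessarily the exact loss used in the project:

```python
import numpy as np

def pixelwise_bce(pred: np.ndarray, mask: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted canopy probabilities
    and the ground-truth mask (1 = canopy, 0 = background)."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(mask * np.log(pred) + (1 - mask) * np.log(1 - pred)))

# A confident correct prediction yields a low loss; a confident wrong one a high loss.
mask = np.array([[1.0, 0.0]])
good = np.array([[0.9, 0.1]])
bad  = np.array([[0.1, 0.9]])
print(pixelwise_bce(good, mask) < pixelwise_bce(bad, mask))  # True
```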

Convolutional Neural Network Architectures



fcnn
Fully Convolutional Neural Network

A Fully Convolutional Neural Network (FCN) extends the basic CNN model so that it can take images of arbitrary size as input. The main characteristic that differentiates an FCN from other CNNs is that the last fully connected layer is substituted by another convolutional layer with a large receptive field. One drawback of FCNs is that, because the receptive field has a fixed size, an object substantially larger or smaller than it can be mislabelled or fragmented. Another drawback is that the feature map is downsampled by the successive convolution and pooling layers the data passes through. This leads to low-resolution predictions, which in turn leads to fuzzy object edges.



u-net
U-Net

The U-Net architecture is a variation of the Convolutional Neural Network that uses skip connections between corresponding pairs of convolutional layers to achieve state-of-the-art accuracy with datasets as small as 300 images. Originally developed for segmenting cells in medical imagery, the architecture transfers well to segmenting tree canopies.



atrous
Atrous Convolutional Neural Network

The Atrous architecture is another variation of a simple CNN. It uses atrous convolutions in place of normal convolutions in the hidden layers. This method of performing convolutions allows us to adjust the effective field of view of the convolutions more efficiently by padding the kernel with "holes". This helps preserve spatial accuracy through convolutions.
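The widened field of view comes from the dilation rate: a k×k kernel applied at rate d covers an effective window of k + (k − 1)(d − 1) pixels while keeping the same number of parameters. A minimal NumPy sketch of this kernel "padding with holes":

```python
import numpy as np

def dilate_kernel(kernel: np.ndarray, rate: int) -> np.ndarray:
    """Insert (rate - 1) zero rows/columns between kernel taps, widening
    the effective field of view without adding learnable weights."""
    k = kernel.shape[0]
    k_eff = k + (k - 1) * (rate - 1)   # effective kernel size
    out = np.zeros((k_eff, k_eff), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel       # original taps, spaced by the rate
    return out

k = np.ones((3, 3))
print(dilate_kernel(k, 2).shape)  # (5, 5): 3x3 taps now span a 5x5 window
```

In practice, deep learning frameworks expose this directly as a dilation-rate argument on their convolution layers rather than materialising the padded kernel.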



Engineered Features



Per-Pixel Transforms

Manipulating the value of each pixel of an image with the aim of enhancing values that correspond to tree canopy.

• Color Space Transformations: Lab, HSI
• Principal Component Analysis

per-pixel transforms
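As an illustration of a per-pixel transform, PCA can project each pixel's band vector onto a few principal components. This is a minimal NumPy sketch on toy data, not the project's actual pipeline:

```python
import numpy as np

def pixel_pca(image: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project each pixel's band vector onto its top principal components.
    `image` has shape (H, W, bands); output has shape (H, W, n_components)."""
    h, w, b = image.shape
    flat = image.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                 # centre each band
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:n_components].T         # per-pixel projection
    return proj.reshape(h, w, n_components)

rng = np.random.default_rng(0)
img = rng.random((4, 4, 5))                   # toy 5-band image
print(pixel_pca(img, 2).shape)  # (4, 4, 2)
```

Because each output pixel depends only on that pixel's own band values, this is a per-pixel transform rather than a neighbourhood operation.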


Small-Width Feature Extractors

These are modifications dedicated to transforming a small neighbourhood of pixels. This type of transformation focuses on smaller objects in an image, or on parts of an image.

• Mean Shift
• Canny Edge Detector
• Morphological Closing

small-width feature extractor
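For example, morphological closing (a dilation followed by an erosion) fills small gaps in a mask without growing its outline overall. A sketch using SciPy's `binary_closing` on a toy mask:

```python
import numpy as np
from scipy.ndimage import binary_closing

# A 7x7 canopy mask with a one-pixel hole in the middle.
mask = np.ones((7, 7), dtype=bool)
mask[3, 3] = False

# Closing with a 3x3 structuring element fills the hole.
closed = binary_closing(mask, structure=np.ones((3, 3)))
print(closed[3, 3])  # True: the hole is filled
```

On canopy masks, this kind of small-neighbourhood operation smooths speckle left by per-pixel classification.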


Large-Width Feature Extractors

These are modifications dedicated to transforming a large neighbourhood of pixels. Segmenting a neighbourhood of pixels typically follows one or more preprocessing steps that isolate key information, although these strategies can sometimes be applied directly to the source image, depending on the application.

• Thresholding
• Greyscale Histogram Equalization
• Independent Component Analysis

large-width feature extractor
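Two of these operations are easy to sketch in plain NumPy: greyscale histogram equalization followed by a global threshold. The values below are toy data, not the project's actual parameters:

```python
import numpy as np

def equalize(gray: np.ndarray) -> np.ndarray:
    """Greyscale histogram equalization: remap intensities so the
    cumulative distribution becomes roughly uniform over 0..255."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum() / gray.size           # cumulative distribution
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[gray]                          # apply lookup table per pixel

def threshold(gray: np.ndarray, t: int) -> np.ndarray:
    """Global threshold: keep pixels brighter than t."""
    return gray > t

img = np.array([[10, 10, 200], [10, 200, 200]], dtype=np.uint8)
eq = equalize(img)
print(threshold(eq, 128).sum())  # number of pixels above the threshold
```

Both operations depend on statistics of the whole image (the histogram, a global cutoff), which is what makes them large-width rather than local transforms.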



Findings & Conclusions



Improved Segmentation in Challenging Cases

(Determined by visual analysis)


Baseline RGB Result



Result with Feature Extraction


Large-Width Feature Extractors Removed Image Noise and Performed Better

• Accurately highlighted tree boundaries.

• Removed tree shadows.

• Excluded other ground vegetation.


This allowed the Convolutional Neural Networks to segment input images more accurately.


Improvements in Segmentation Metrics

(% increase from base RGB accuracy)


U-Net
• LAB: -1.67%
• Histogram Equalization: +1.14%
• Morphological Closing: +0.73%
• Histogram Equalization + Morphological Closing + Aerobotics bands (NIR, DEM, etc.): +4.27%

FCNN
• LAB: +1.63%
• Histogram Equalization: +0.17%
• Mean Shift: +0.60%
• ICA: +1.75%

Atrous
• LAB: +0.12%
• Histogram Equalization: -2.00%
• Morphological Closing: -2.01%
• Morphological Closing + ICA + Histogram Equalization: +1.97%