Before diving into the development and training of computer vision models, it is important to learn about the machine learning workflow. Topics such as Data Acquisition, Analysis, and Visualization must be considered before working on a model. Cross Validation, the TensorFlow data format (TFRecord), and camera calibration techniques are discussed soon after. Linear and Logistic Regression are covered as an introduction to Neural Networks, and lessons on Gradient Descent and Backpropagation soon follow.
Convolutional layers are neither the only nor the most important layers in a convolutional network. Adding layers such as Pooling, Dropout, and Batch Normalization can improve the accuracy of the model. Other techniques such as Transfer Learning and Data Augmentation can reduce training time and help the model generalize to data it has never seen, so with augmentation you are essentially getting free data!
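As a rough illustration (not the project's actual network), here is a minimal Keras sketch showing where pooling, dropout, batch normalization, and augmentation layers typically fit; the layer sizes and augmentation choices are assumptions for the example only.

```python
import tensorflow as tf

# Minimal sketch of a small CNN with augmentation, pooling, batch norm and
# dropout. Shapes and hyper-parameters are illustrative, not the project's.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(128, 128, 3)),
    # Data augmentation: random flips/rotations applied only during training.
    tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal"),
    tf.keras.layers.experimental.preprocessing.RandomRotation(0.1),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # stabilizes and speeds up training
    tf.keras.layers.MaxPooling2D(),         # downsamples the feature maps
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),           # regularization against overfitting
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```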
Many models build upon these basic layers. ResNet adds skip (shortcut) connections that let gradients bypass layers, mitigating the vanishing gradient problem of deeper networks. R-CNN uses Selective Search to generate bounding box proposals around objects in addition to classifying them. SPPnet adds a spatial pyramid pooling layer so features can be pooled from the feature map of the whole input image, improving performance. The increased accuracy of R-CNN and SPPnet does come at a cost: they are both two-stage networks and cannot produce results in real time. YOLO is a single-stage network that trades a small drop in accuracy for a massive increase in speed; running at greater than 40 frames per second, it is a standard choice for real-time detection.
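To make the skip connection idea concrete, here is a minimal residual block sketch in Keras; it assumes the input already has the desired number of channels, so the 1x1 projection shortcut used in the full ResNet is omitted.

```python
import tensorflow as tf

def residual_block(x, filters):
    """Simplified ResNet-style block: the input is added back to the block's
    output, giving gradients a direct path through deep networks. Assumes the
    input already has `filters` channels (no projection shortcut)."""
    shortcut = x
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Add()([x, shortcut])   # the skip connection
    return tf.keras.layers.ReLU()(x)
```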
The outputs of many of these models contain too many bounding boxes, which often overlap. Non-Max Suppression is used to reduce the outputs to only the most confident predictions. After precision and recall are calculated, Average Precision is found as the area under the precision vs. recall curve, and Mean Average Precision (mAP) averages that value across classes.
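A greedy Non-Max Suppression pass can be sketched in a few lines of NumPy; the box format and IoU threshold below are assumptions for the example.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS sketch. `boxes` is (N, 4) as [x1, y1, x2, y2], `scores` is (N,).
    Keep the highest-scoring box, drop remaining boxes that overlap it by more
    than `iou_threshold`, then repeat with what is left."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box against the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```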
The Computer Vision project implements the previous lessons in a Jupyter notebook running Python 3.7 and TensorFlow 2.4.1. The data used for the project was a collection of 97 TFRecords from the Waymo Open Dataset, each containing 10 images. The 970 images were split roughly 90/10 into training and validation sets, so 870 images were used for training. A few images with their bounding boxes and classes are shown to the left.
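The split itself is straightforward; a rough sketch at the TFRecord-file level might look like the following (the glob pattern and directory layout are illustrative, not the project's actual paths).

```python
import glob
import random

# Illustrative 90/10 split over TFRecord files rather than individual images.
records = sorted(glob.glob("data/*.tfrecord"))
random.seed(42)          # make the split reproducible
random.shuffle(records)

split = int(0.9 * len(records))
train_records, val_records = records[:split], records[split:]
print(len(train_records), "training files,", len(val_records), "validation files")
```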
An SSD ResNet-50 model pre-trained on ImageNet is used in this project. To make tuning hyper-parameters easier, the TensorFlow Object Detection API is used. This API relies on a config file that contains the model parameters and training settings such as data augmentation and learning rate decay.
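For example, the API's `config_util` helpers can edit the learning rate schedule programmatically; the field names below assume the momentum optimizer with cosine decay used in the stock SSD ResNet-50 config, and the file paths and values are placeholders.

```python
from object_detection.utils import config_util

# Load the pipeline config into editable protobuf objects (path is illustrative).
configs = config_util.get_configs_from_pipeline_file("pipeline.config")

# Adjust the cosine decay learning rate schedule; these fields exist in the
# default SSD ResNet-50 FPN config, and the values here are only examples.
lr = configs["train_config"].optimizer.momentum_optimizer.learning_rate.cosine_decay_learning_rate
lr.learning_rate_base = 0.01
lr.warmup_learning_rate = 0.001

# Write the modified pipeline back out for training.
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "training/")
```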
Tuning hyper-parameters was the focus of the project, so the results of adjusting learning rate decay functions and optimizers can be seen in the graphs to the left. The orange line represents the original, untuned model, and the other colors are the various changes I made.
To learn more about the project and view the code, check out my GitHub repo for the Computer Vision Project.
Sensor fusion combines data from multiple sensors, such as LiDAR and cameras, to track objects, typically with a Kalman filter.
LiDAR emits light to measure an object's distance from the sensor and how reflective the object is. These sensors vary in the horizontal range they can scan, the number of beams emitted in the vertical direction, and how they capture the scene, such as flash, scanning, or frequency-modulated scanning.
The data produced by LiDAR is often represented as a 3D point cloud containing the x, y, z coordinates and intensity value of each point. The data can also be represented as a range image, which is closer to how pixels are stored in an ordinary image. Instead of RGB color values, however, each pixel of a range image stores the range and intensity of the scan at that beam angle. This is how LiDAR data is represented in the Waymo dataset frames. To process LiDAR data with a YOLO model, the input is converted from the range image in the Waymo TFRecord frame to a 3D point cloud, and then to a Bird's Eye View perspective.
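The range-image-to-point-cloud step is essentially a spherical-to-Cartesian conversion; a generic sketch (ignoring the Waymo-specific extrinsics, and assuming beam-angle arrays are available) looks like this.

```python
import numpy as np

def range_image_to_point_cloud(range_img, inclinations, azimuths):
    """Convert an (H, W) range image to an (N, 3) point cloud.

    Hypothetical helper: assumes `inclinations` holds the vertical beam angles
    (H,) and `azimuths` the horizontal angles (W,), both in radians, and that
    invalid returns are encoded as range <= 0."""
    incl = inclinations[:, None]          # (H, 1)
    az = azimuths[None, :]                # (1, W)
    x = range_img * np.cos(incl) * np.cos(az)
    y = range_img * np.cos(incl) * np.sin(az)
    z = range_img * np.sin(incl)
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[range_img.reshape(-1) > 0]   # drop invalid returns
```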
Now that objects have been detected, they need to be tracked. Tracking provides context about where an object is and helps predict where it will be with high confidence. The flow is to take the first measurement, predict the object's position in the next frame, then use the next measurement to update that prediction and predict again.
The Kalman filter models the probability that an object is at a given x, y, z coordinate as a Gaussian distribution. Predicting an object's next position reduces confidence (the uncertainty grows), while updating with a measurement increases it. In addition, updating with measurements from multiple sensors further improves confidence and reduces error.
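A minimal linear Kalman filter sketch makes the predict/update cycle concrete; the constant-velocity model, 1D measurement, and noise values below are illustrative assumptions, not the project's actual tuning.

```python
import numpy as np

class SimpleKalmanFilter:
    """Minimal 1D constant-velocity Kalman filter sketch."""

    def __init__(self, dt=0.1):
        self.x = np.zeros((2, 1))                # state: [position, velocity]
        self.P = np.eye(2) * 1000.0              # large initial uncertainty
        self.F = np.array([[1, dt], [0, 1]])     # constant-velocity motion model
        self.H = np.array([[1, 0]])              # only position is measured
        self.Q = np.eye(2) * 0.1                 # process noise (illustrative)
        self.R = np.array([[1.0]])               # measurement noise (illustrative)

    def predict(self):
        # Prediction: uncertainty grows as P picks up the process noise Q.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, z):
        # Update: a measurement shrinks the uncertainty again.
        y = z - self.H @ self.x                          # residual
        S = self.H @ self.P @ self.H.T + self.R          # residual covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
```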
Once multiple objects are being tracked, it becomes harder to keep track of each one. Real life is filled with uncertainty and noise that must be accounted for. To reduce the chance of a false positive being registered as a vehicle, each object is given a score that increases when it appears in multiple consecutive frames. Once the score crosses a threshold, the object is confirmed as a tracked object.
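The confirmation logic can be sketched as a score over a sliding window of recent frames; the window size, threshold, and state names below are assumptions for illustration.

```python
class Track:
    """Sketch of track confirmation: the score rises while the track keeps
    being detected and falls when it is missed."""

    def __init__(self, window=6, confirm_threshold=0.8):
        self.hits = []                   # 1 if detected in a frame, else 0
        self.window = window
        self.confirm_threshold = confirm_threshold
        self.state = "tentative"

    def register_frame(self, detected):
        self.hits.append(1 if detected else 0)
        score = sum(self.hits[-self.window:]) / self.window
        if score >= self.confirm_threshold:
            self.state = "confirmed"
        elif score == 0 and len(self.hits) >= self.window:
            self.state = "deleted"       # likely a ghost track
        return score
```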
Complex scenarios often contain previously tracked objects, new objects, and false positives, also known as ghost tracks. These objects can also be close enough together that it is not obvious which measurement corresponds to which tracked vehicle. To solve this, the Mahalanobis distance between every track and every measurement is calculated. Gating removes the most unlikely associations so that the smallest remaining distances can be assumed to be correct associations.
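A sketch of the distance and gating computations, assuming generic measurement matrices and a chi-square gate (the gate probability is an illustrative choice):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_distance(track_x, track_P, z, H, R):
    """Squared Mahalanobis distance between a track's predicted measurement
    and an actual measurement z, weighted by the residual covariance S."""
    gamma = z - H @ track_x                  # residual
    S = H @ track_P @ H.T + R                # residual covariance
    return float(gamma.T @ np.linalg.inv(S) @ gamma)

def inside_gate(dist_squared, dof, gate_prob=0.995):
    """Gating: keep only associations whose squared distance falls below a
    chi-square threshold for the measurement's degrees of freedom."""
    return dist_squared <= chi2.ppf(gate_prob, df=dof)
```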
The first part of the Sensor Fusion Project focuses on the flow of LiDAR object detection and is written in Python 3.7. The range images are loaded from the Waymo dataset frames, then converted to point clouds. The Bird's Eye View (BEV) perspective projects a grid onto the x-y plane of the point cloud (x length forward, y width sideways) and computes the intensity, highest point, and density of the point-cloud points in each grid cell. It is no coincidence that the data in the BEV looks similar to the data in an RGB image from a camera. In fact, it was designed this way so that LiDAR data, represented in BEV space, can be used to train a YOLO model. More specifically, the BEV images are fed into a Feature Pyramid Network (FPN) via the Super Fast and Accurate 3D Object Detection (SFA3D) package, pre-trained on the KITTI dataset.
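A simplified sketch of the BEV map construction, with assumed detection area, grid size, and channel definitions (intensity, maximum height, point density):

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0, 50), y_range=(-25, 25), bev_size=608):
    """Discretize the x-y plane into a grid and fill three channels per cell.
    `points` is (N, 4) as [x, y, z, intensity]; the ranges, grid size and
    normalization are illustrative, not the project's exact configuration."""
    bev = np.zeros((bev_size, bev_size, 3), dtype=np.float32)
    # keep only points inside the chosen detection area
    mask = (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) & \
           (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    pts = points[mask]
    # map metric x/y coordinates to integer grid indices
    xi = ((pts[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * bev_size).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * bev_size).astype(int)
    for x, y, p in zip(xi, yi, pts):
        bev[x, y, 0] = max(bev[x, y, 0], p[3])   # intensity channel
        bev[x, y, 1] = max(bev[x, y, 1], p[2])   # height channel (highest point)
        bev[x, y, 2] += 1                        # raw point count
    # simple log normalization of the density channel
    bev[:, :, 2] = np.log1p(bev[:, :, 2]) / np.log(64)
    return bev
```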
All objects in the output predictions are in BEV space and must be converted into metric coordinates in vehicle space before being passed further down the pipeline in the second part of the project. In the image above, you can see the bounding box around the vehicle in BEV space (bottom), taken from the output of the network, and the overlay image in the transformed vehicle space (top).
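The back-conversion is just the inverse of the grid discretization; a sketch using the same illustrative grid parameters as above:

```python
def bev_to_vehicle_coords(row, col, x_range=(0, 50), y_range=(-25, 25), bev_size=608):
    """Convert a detection's BEV grid coordinates back into metric vehicle-space
    x/y. Assumes the same illustrative grid extents used when building the BEV map."""
    x = row / bev_size * (x_range[1] - x_range[0]) + x_range[0]
    y = col / bev_size * (y_range[1] - y_range[0]) + y_range[0]
    return x, y
```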
To measure performance, the intersection over union (IoU), precision, and recall were calculated and analyzed.
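These evaluation pieces are small; here is a sketch for axis-aligned boxes (the BEV detections in the project are rotated boxes, so this is a simplification) along with the precision and recall formulas.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(true_positives, false_positives, false_negatives):
    # precision: fraction of detections that were correct
    # recall: fraction of ground-truth objects that were found
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```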
To learn more, be sure to check out my writeup for Part 1 of the Sensor Fusion Project.
The second part of the Sensor Fusion Project builds upon the Python code from Part 1 and is split into four steps: Filter, Track Management, Association, and Camera Fusion. To complete the filter, the prediction and update algorithms must be implemented: the prediction estimates where the track will be in the next frame, and the update compares that estimate with the actual position measured by the LiDAR or camera sensors. The Track Management step was tested in a scenario where a single vehicle entered and exited the frame; the code confirmed that the vehicle was in fact a vehicle and not a ghost track, and removed the track after the vehicle left the frame. The Association step scaled up track management to multiple vehicles entering and leaving the frame at various points; the association matrix of gated Mahalanobis distances was calculated in this step.
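The assignment itself can be sketched as a simple nearest-neighbor loop over the gated association matrix (entries that fail gating are assumed to have been set to infinity).

```python
import numpy as np

def associate(association_matrix, track_ids, meas_ids):
    """Repeatedly pick the smallest gated Mahalanobis distance in the matrix,
    pair that track with that measurement, and remove both from further
    consideration. A sketch of simple nearest-neighbor association."""
    A = association_matrix.copy()
    pairs = []
    while np.isfinite(A).any():
        i, j = np.unravel_index(np.argmin(A), A.shape)
        pairs.append((track_ids[i], meas_ids[j]))
        A[i, :] = np.inf    # this track has been assigned
        A[:, j] = np.inf    # this measurement has been assigned
    return pairs
```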
Up to this point, only LiDAR data has been used to track objects; however, a benefit of Kalman filters is that more data improves confidence. After the prediction of where the track will be in the next frame has been calculated, measurements from both LiDAR and camera are used to update it. The update step for the camera is different because camera measurements are nonlinear and only two-dimensional: the vehicle predictions must be transformed from vehicle space into image pixel coordinates to be compared with the camera object detection outputs, and the resulting updates are then applied back in vehicle space.
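The camera measurement function is the nonlinear piece. Here is a sketch, assuming the state's first three components are the object's position expressed in the camera frame with x along the optical axis, and with illustrative intrinsic parameters.

```python
import numpy as np

def camera_measurement_function(x, f_x, f_y, c_x, c_y):
    """Project a 3D position into pixel coordinates using focal lengths and the
    principal point. Because of the division by depth, h(x) is nonlinear, so an
    extended Kalman filter linearizes it with a Jacobian instead of a fixed H."""
    px, py, pz = float(x[0]), float(x[1]), float(x[2])
    u = c_x - f_x * py / px     # horizontal pixel coordinate
    v = c_y - f_y * pz / px     # vertical pixel coordinate
    return np.array([u, v])
```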
To learn more and see my code, check out my writeup for Part 2 of the Sensor Fusion Project.