3D Object Localization in 3D CT Volumes

Lohit Kapoor
8 min read · Apr 19, 2021

Introduction

This project was inspired by the paper by Criminisi et al. from Microsoft Research, in which they apply random regression forests to predict the location of anatomical structures in CT scans. They use a data set of 100 torso CT scans that vary considerably in several respects (e.g. scanner type, resolution, organ size and pose). For these scans, the 3D bounding boxes of the organs of interest are computed first to generate the ground truth. Each bounding box is parametrized by the positions of its front upper left corner and its back lower right corner.

During the training phase, the individual regression trees are built by recursively splitting the voxels of a node into two subsets so that the information gain is maximized. The binary splitting criterion compares a visual feature of each voxel against an optimal interval defined by two scalars. The growing of a tree stops when a maximal depth is reached. In the leaves, the parameters of the regression distribution are stored. The regression output is assumed to be normally distributed, so the mean and the covariance matrix are calculated and saved. This calculation uses the offsets between the bounding box walls and the voxels that reach a particular leaf.

After the forest has been trained, it can be tested on an unseen CT scan. During the testing phase each voxel of the scan is pushed through all the trees until it reaches a leaf node. Every single voxel votes for a particular bounding box location. Only the most confident votes are considered for the prediction, which is achieved by selecting the subset of leaves with the lowest uncertainty. The final prediction is the expectation over all these votes. As a by-product, the voxels from the most confident leaves reveal the spatial relations between anatomical structures and regions in the scan.

Criminisi et al. train their forests on 55 scans and test them on the other 45. Even though there are large differences between the scans, their aggregated mean error is about 1.7 cm and the aggregated median error about 1.1 cm, which they consider low. The error is measured as the absolute difference between the predicted and the actual wall positions of the bounding boxes.


Regression Forests

A random regression forest consists of a collection of decision trees. At each node of a tree, the incoming data points are split according to a randomly selected feature and a set of random split thresholds on that feature; among these candidates, the best split is kept. Decision trees are therefore binary trees that recursively partition the input space into two subsets, which breaks a complex problem down into smaller, simpler problems. Every resulting subset contains a model to predict the output. For regression trees this is a regression function, which returns a real-valued prediction.

To introduce regression trees, a 1D example from the paper is presented below. The goal is to predict a value y for an input x. Since this is a supervised learning method, training pairs (x, y) are available, where y is known for each x. The trees consist of leaf nodes and internal nodes that define a test function ξ > f(x) > τ, where ξ and τ are scalars. If the function evaluates to true, the data point is passed to the left child, otherwise to the right child. Determining the structure of the tree means finding the optimal split, i.e. finding the ξ and τ that best divide the input into two subsets. In the example, a split is optimal if the geometric fit error

e(S, η_S) = Σ_{j∈S} (y_j − ŷ(x_j; η_S))²

is maximally reduced, where the error reduction is defined as:

Δe = e(S, η_S) − Σ_{i∈{L,R}} w_i · e(S_i, η_{S_i})

Here, S is the set of points at a specific node, w_i = |S_i|/|S| is the fraction of points that go to the i-th child, and η_S are the two parameters of the line fitted to all points in S. Aside from maximizing the error reduction, there are other possible measures for finding the best split, such as the information gain.
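
To make this concrete, here is a minimal NumPy sketch of the least-squares fit error and of the error reduction for one candidate interval split in the 1D example (the helper names fit_line, fit_error and error_reduction are my own and purely illustrative):

```python
import numpy as np

def fit_line(points):
    """Least-squares line fit; returns the two line parameters η = (slope, intercept)."""
    if len(points) < 2:
        return np.array([0.0, points[:, 1].mean() if len(points) else 0.0])
    return np.polyfit(points[:, 0], points[:, 1], 1)

def fit_error(points):
    """Sum of squared residuals e(S, η_S) of the line fitted to the points in S."""
    if len(points) < 2:
        return 0.0
    eta = fit_line(points)
    return float(np.sum((points[:, 1] - np.polyval(eta, points[:, 0])) ** 2))

def error_reduction(points, tau, xi):
    """Error reduction achieved by the interval split ξ > x > τ at this node."""
    in_split = (points[:, 0] > tau) & (points[:, 0] < xi)
    left, right = points[in_split], points[~in_split]
    w_l, w_r = len(left) / len(points), len(right) / len(points)
    return fit_error(points) - (w_l * fit_error(left) + w_r * fit_error(right))
```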

The tree is grown recursively, starting at the root with the complete input space. This results in a greedy optimization, since at each node the optimal split for the contained set of data points is chosen. The tree growing stops if the error reduction falls below a certain threshold. Additionally, the growing stops if the maximal tree depth is reached or if the resulting subsets become too small.
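
Building on the previous sketch, the greedy growing with these stopping criteria might look roughly like this (again only an illustrative sketch; it reuses the hypothetical fit_line and error_reduction helpers from above):

```python
import numpy as np

def best_split_1d(points, n_trials=10):
    """Sample a few random (τ, ξ) intervals and keep the one with the largest
    error reduction (uses error_reduction from the previous sketch)."""
    best = (0.0, 0.0, -np.inf)
    lo, hi = points[:, 0].min(), points[:, 0].max()
    for _ in range(n_trials):
        tau, xi = np.sort(np.random.uniform(lo, hi, 2))
        gain = error_reduction(points, tau, xi)
        if gain > best[2]:
            best = (tau, xi, gain)
    return best

def grow(points, depth=0, max_depth=8, min_size=10, min_gain=1e-3):
    """Greedy recursive growing with the stopping criteria described above."""
    if depth >= max_depth or len(points) <= min_size:
        return {"leaf": True, "eta": fit_line(points)}      # store line parameters η_S
    tau, xi, gain = best_split_1d(points)
    if gain < min_gain:                                      # error reduction too small
        return {"leaf": True, "eta": fit_line(points)}
    in_split = (points[:, 0] > tau) & (points[:, 0] < xi)
    return {"leaf": False, "tau": tau, "xi": xi,
            "left": grow(points[in_split], depth + 1, max_depth, min_size, min_gain),
            "right": grow(points[~in_split], depth + 1, max_depth, min_size, min_gain)}
```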


Preprocessing

Since the goal is to find the axis-aligned bounding box of the femur, a parametrization is chosen that fully describes the bounding box. For each bounding box, a vector

b_c = (b_c^L, b_c^R, b_c^A, b_c^P, b_c^H, b_c^F) ∈ ℝ⁶

is chosen, where the subscripts are defined as: H = head, F = foot, L = left, R = right, A = anterior and P = posterior. In other words, this amounts to finding the position of the front upper left corner and the back lower right corner of the bounding box. The individual components of the vector b_c are the distances from the image border to the corresponding wall of the bounding box, measured in pixels.
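
A small sketch of this parametrization and of the offset (displacement) vector of a voxel, assuming a (z, y, x) volume layout and a binary femur mask as ground-truth input (the helper names and the mask-based extraction are my own assumptions for illustration):

```python
import numpy as np

def bbox_vector(femur_mask):
    """Ground-truth 6-vector b_c = (b_L, b_R, b_A, b_P, b_H, b_F) from a binary
    femur mask, assuming a (z, y, x) voxel layout (the axis mapping is an assumption)."""
    zs, ys, xs = np.nonzero(femur_mask)
    return np.array([xs.min(), xs.max(),     # left / right walls
                     ys.min(), ys.max(),     # anterior / posterior walls
                     zs.min(), zs.max()],    # head / foot walls
                    dtype=np.float32)

def displacement(voxel, b):
    """Offset d(v) of voxel v = (x, y, z) to the six bounding-box walls;
    these offsets are the regression targets."""
    x, y, z = voxel
    return np.array([x, x, y, y, z, z], dtype=np.float32) - b
```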

Subsampling:

To improve the runtime, the amount of data may be reduced in advance; this is done both at training and at testing time. Because the resolution of the scans varies along the different coordinate axes, the subsampling factor is defined separately for each direction.
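
As a rough sketch, this per-axis subsampling can be done with simple strided slicing (the factors shown are placeholders, not the ones actually used):

```python
def subsample(volume, steps=(2, 4, 4)):
    """Keep every steps[i]-th voxel along axis i. The per-axis factors are
    illustrative; they should reflect the (anisotropic) scan resolution."""
    sz, sy, sx = steps
    return volume[::sz, ::sy, ::sx]
```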

Feature computation:

In order to save time during training and testing, the feature responses of each voxel and its displacement vector from the femur bounding box are computed beforehand for all CT volumes. As suggested by Criminisi et al., each feature response is calculated as the difference of the mean intensities of two cuboidal regions.

f(v; θ) = (1/|F₁|) Σ_{q∈F₁} I(q) − (1/|F₂|) Σ_{q∈F₂} I(q)

Here, the two cuboidal regions F₁ and F₂ are selected randomly, with random offsets from the reference voxel and random extents, and I(q) is the intensity at voxel q. For this project, about 10 such feature responses are computed beforehand for each voxel in all CT volumes. During training, one of these feature responses is randomly selected at each node, and about 10 random splits on that feature response are considered; the one with the highest information gain becomes the node's splitting criterion. The feature index and the threshold value are stored in the node and used during testing to route incoming data points to the children.
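
An illustrative sketch of one such feature response, assuming each feature θ is a pair of (offset, half-extent) cuboid descriptors placed relative to the reference voxel (the names and data layout are my own, not from the repo):

```python
import numpy as np

def mean_box_intensity(volume, center, offset, extent):
    """Mean intensity of a cuboid centred at center + offset with the given
    half-extent, clipped to the volume borders."""
    c = np.asarray(center) + np.asarray(offset)
    lo = np.maximum(c - extent, 0).astype(int)
    hi = np.minimum(c + extent + 1, volume.shape).astype(int)
    box = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    return float(box.mean()) if box.size else 0.0

def feature_response(volume, voxel, theta):
    """Feature θ = ((offset1, extent1), (offset2, extent2)): difference of the
    mean intensities of the two cuboids placed relative to the voxel."""
    (off1, ext1), (off2, ext2) = theta
    return (mean_box_intensity(volume, voxel, off1, ext1)
            - mean_box_intensity(volume, voxel, off2, ext2))
```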

Training

The training phase of the regression forest starts at the root node of each tree. The nodes of the regression trees are split recursively by dividing all the voxels contained in a node into two disjoint sets. The voxels at a given node are separated by the test function ξ > f(v; θ_j): each voxel is passed to the left child if the test evaluates to true and to the right child otherwise. The function f is the feature response of a particular voxel v for a feature θ_j. The quality of a split is measured by the information gain, which is defined via the entropy H(S) as

IG = H(S) − Σ_{i∈{L,R}} w_i H(S_i)

where w_i = |S_i|/|S| is the fraction of the node's points that go to the i-th child.

The covariance matrix Λ_c is calculated from the displacement vectors d(v) of all voxels contained in the particular node, where d(v) is the offset of voxel v to the walls of the femur bounding box. Following the derivation in the paper, the information gain becomes:

IG = log|Λ_c(S)| − Σ_{i∈{L,R}} w_i log|Λ_c(S_i)|

For each feature, ten randomly sampled thresholds between the minimal and the maximal feature response are tested, and the threshold that gives the maximal information gain is chosen. The optimal feature θ_j and the corresponding threshold ξ_j are stored in the node, and the node is split by passing the voxels to the children according to the comparison with the threshold. If the information gain is too small, i.e. if IG ≤ 0, the node is not split any further and is marked as a leaf node. In the leaves, the parameters of the normal distribution are stored: the mean offset d̄_c and the covariance matrix Λ_c, which are calculated per class over the offsets of all voxels that reach that particular leaf.
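
A condensed sketch of this split search at a single node, using the log-determinant of the covariance of the displacement vectors as the Gaussian entropy term (the data layout and function names are assumptions, not the exact code from my repo):

```python
import numpy as np

def log_det_cov(d):
    """Log-determinant of the covariance of the displacement vectors d (N x 6),
    with a small ridge added so the determinant stays well defined."""
    cov = np.cov(d, rowvar=False) + 1e-6 * np.eye(d.shape[1])
    return np.linalg.slogdet(cov)[1]

def best_split(responses, d, n_thresholds=10):
    """responses: (N, F) precomputed feature responses of the voxels at this node,
    d: (N, 6) their displacement vectors. Returns (feature index, threshold, gain)."""
    best = (None, None, -np.inf)
    parent = log_det_cov(d)
    j = np.random.randint(responses.shape[1])           # randomly selected feature θ_j
    f = responses[:, j]
    for xi in np.random.uniform(f.min(), f.max(), n_thresholds):
        left = f < xi                                    # test function: ξ > f(v; θ_j)
        if left.sum() < 6 or (~left).sum() < 6:          # need enough voxels per child
            continue
        w_l = left.mean()
        gain = parent - w_l * log_det_cov(d[left]) - (1 - w_l) * log_det_cov(d[~left])
        if gain > best[2]:
            best = (j, xi, gain)
    return best                                          # a non-positive gain means: make a leaf
```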

Testing

After the regression forest has been trained, it is tested on a new, unseen scan. All the voxels of the new scan, along with their feature responses and locations, are pushed through each tree of the forest. At each node a voxel is passed to the left or to the right child, based on the comparison of its feature response with the threshold stored at the node during training. The voxels are sent down the trees recursively until each of them reaches a leaf.

To predict the location, only the most confident leaves of the whole forest are considered. The leaves are therefore sorted by their uncertainty, which is measured by the trace tr(Λ_c), i.e. the sum of the diagonal elements of the covariance matrix. Starting with the leaf with the smallest trace, leaves are added to a set L until 1 percent of all test voxels are contained in L. The voxels in L then vote for the bounding box parameters: for each such voxel, the mean displacement vector stored in its leaf is subtracted from the voxel's location, and the resulting bounding box parameters are recorded. The mean of these vectors is the final prediction for the given test volume.
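
A sketch of this aggregation step, assuming that for every leaf we have collected its trace, its stored mean offset and the coordinates of the test voxels that reached it (the names and data layout are illustrative only):

```python
import numpy as np

def predict_bbox(leaf_stats, n_voxels, confident_fraction=0.01):
    """Aggregate the votes of the most confident leaves.
    leaf_stats: list of (trace, mean_offset (6,), voxel_coords (n, 3)) per leaf,
    collected after pushing every test voxel through the forest."""
    leaf_stats.sort(key=lambda s: s[0])              # lowest uncertainty first
    votes, collected = [], 0
    for trace, mean_d, coords in leaf_stats:
        # expand (x, y, z) to (x, x, y, y, z, z) to match the 6 wall parameters,
        # then subtract the stored offset to get each voxel's vote
        walls = coords[:, [0, 0, 1, 1, 2, 2]] - mean_d
        votes.append(walls)
        collected += len(coords)
        if collected >= confident_fraction * n_voxels:
            break
    return np.vstack(votes).mean(axis=0)             # final bounding box prediction
```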

Conclusion

The number of trees in the forest, the number of feature responses and the number of trial splits per feature response are the hyperparameters that can be tuned to find the best configuration for a given task. I could only obtain about 15 CT volumes containing the femur by visiting hospitals in my city, out of which I tested on only 1 volume. I used 12 trees, 10 feature responses and 10 candidate splits per feature response as my hyperparameters. The results were astonishingly good given so few training cases.

Femur detected as per the algorithm

bounding_box_ground_truth = (328, 222, 148, 216, 164, 24)

bounding_box_prediction = (327.66, 222.659, 137.924, 227.47, 167.14, 16.5152)

Thank you for your interest in this blog post. Please leave comments, feedback and suggestions if you have any.

Full code on my GitHub repo here.
