Fusing visual and inertial sensors with semantics for 3D human pose estimation
Andrew Gilbert [1], Matthew Trumble [1], Charles Malleson [1], Adrian Hilton [1], John Collomosse [1,2]
[1] University of Surrey, [2] Adobe Research
In International Journal of Computer Vision, 127 (4), 381-397, 2019
Our two-stream network fuses IMU data with volumetric (PVH) data derived from multi-viewpoint video (MVV) to learn an embedding for 3D joint locations (human pose).
Abstract
We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset, the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
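To make the discretised PVH concrete, here is a minimal NumPy sketch, not the authors' implementation: the camera projection matrices, grid extents, and soft foreground mattes are assumed inputs, and per-view foreground probabilities are combined multiplicatively under an independence assumption.

```python
# A minimal sketch (not the paper's code) of building a discretised
# probabilistic visual hull (PVH) from multi-viewpoint soft mattes.
import numpy as np

def build_pvh(mattes, projections, grid_min, grid_max, res=32):
    """mattes: list of HxW float arrays in [0,1] (per-view foreground prob).
    projections: list of 3x4 camera projection matrices.
    Returns a res^3 volume of per-voxel occupancy probabilities."""
    # Voxel centres on a regular grid over the capture volume.
    axes = [np.linspace(grid_min[d], grid_max[d], res) for d in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    prob = np.ones(pts.shape[0])
    for matte, P in zip(mattes, projections):
        uvw = pts @ P.T                        # project voxel centres
        uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, matte.shape[1] - 1)
        v = np.clip(np.round(uv[:, 1]).astype(int), 0, matte.shape[0] - 1)
        # Combine per-view foreground probabilities multiplicatively.
        prob *= matte[v, u]
    return prob.reshape(res, res, res)
```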
Network architecture comprising two streams: a 3D ConvNet for MVV pose embedding, and a kinematic solve from IMUs. Both streams pass through an LSTM before fusion of the concatenated estimates in a further FC layer.
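As an illustration of the architecture in the caption above, the following PyTorch sketch wires the two streams together. It is not the published implementation: the channel counts, layer sizes, joint count and the lazy bottleneck layer are illustrative assumptions.

```python
# A minimal PyTorch sketch of the two-stream fusion network (illustrative
# only; hyperparameters are assumptions, not the paper's values).
import torch
import torch.nn as nn

class TwoStreamPoseNet(nn.Module):
    def __init__(self, n_joints=21, vox_channels=2, embed_dim=256):
        super().__init__()
        # Stream 1: 3D ConvNet over the multi-channel PVH volume
        # (visual occupancy + semantic 2D-pose channels).
        self.conv3d = nn.Sequential(
            nn.Conv3d(vox_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )
        self.lstm_vis = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        # Stream 2: joint positions from a forward-kinematic solve of the
        # IMU orientations, flattened to n_joints * 3 values per frame.
        self.lstm_imu = nn.LSTM(n_joints * 3, embed_dim, batch_first=True)
        # Fusion: concatenate the temporal embeddings, regress 3D joints
        # in a final fully connected layer.
        self.fuse = nn.Linear(2 * embed_dim, n_joints * 3)

    def forward(self, pvh_seq, imu_seq):
        # pvh_seq: (B, T, C, D, H, W); imu_seq: (B, T, n_joints*3)
        B, T = pvh_seq.shape[:2]
        v = self.conv3d(pvh_seq.flatten(0, 1)).view(B, T, -1)
        v, _ = self.lstm_vis(v)
        i, _ = self.lstm_imu(imu_seq)
        return self.fuse(torch.cat([v, i], dim=-1))  # (B, T, n_joints*3)
```

Concatenation followed by an FC layer lets the network weight each modality per joint, so visual ambiguities (occlusion) and inertial ambiguities (drift) can compensate for one another.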
Paper
Fusing visual and inertial sensors with semantics for 3D human pose estimation. A. Gilbert, M. Trumble, C. Malleson, A. Hilton, J. Collomosse. International Journal of Computer Vision, 127 (4), 381-397, 2019.
Videos
Citation
@article{Gilbert:IJCV:2019,
AUTHOR = "Gilbert, Andrew and Trumble, Matthew and Malleson, Charles and Hilton, Adrian and Collomosse, John",
TITLE = "Fusing visual and inertial sensors with semantics for 3D human pose estimation",
JOURNAL = "International Journal of Computer Vision (IJCV)",
VOLUME = "127",
NUMBER = "4",
PAGES = "381--397",
YEAR = "2019",
}