Abstract: A single shot of a light field camera simultaneously records both the intensity and the direction of incoming light. Compared with an RGB camera, it better reveals the three-dimensional structure and geometric characteristics of a scene, giving it unique advantages for object 6D pose estimation. To address the low detection accuracy and poor robustness of existing RGB-based pose estimation methods in complex scenes, this paper proposes, for the first time, an end-to-end convolutional neural network for object pose estimation based on light field images. First, a dual-channel EPI encoding module processes the high-dimensional light field data: by reconstructing the light field EPI image stacks and introducing horizontal and vertical EPI convolution operators, it improves the modeling of the association between the spatial and angular information of the light field, and a two-branch siamese network extracts shallow features from the light field images. Second, a feature aggregation module with skip connections performs global context aggregation on the concatenated shallow EPI features in the horizontal and vertical directions, so that the network can effectively combine global and local feature cues when predicting per-pixel keypoint positions. To address the shortage of light field data, we use a Lytro Illum light field camera to capture real scenes and construct a rich and complex light field pose dataset, LF-6Dpose. Experiments on LF-6Dpose show that the proposed method achieves average pose detection accuracies of 57.61% and 91.97% under the ADD-S and 2D Projection metrics, respectively, surpassing other state-of-the-art RGB-based methods and better solving the 6D pose estimation problem in complex scenes.
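The horizontal and vertical EPI stacks mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the standard rearrangement of a 4D light field L(u, v, x, y) into epipolar-plane images, where a horizontal EPI is a (u, x) slice at fixed (v, y) and a vertical EPI is a (v, y) slice at fixed (u, x). The function name, the choice of central views, and the axis layout are illustrative assumptions.

```python
import numpy as np

def epi_stacks(lf):
    """Rearrange a light field into horizontal and vertical EPI stacks (illustrative).

    lf : ndarray, shape (U, V, X, Y, C)
         angular axes (u, v), spatial axes (x, y), colour channel c.

    Returns
    -------
    h : ndarray, shape (Y, U, X, C)
        one horizontal EPI (a u-x slice) per image row y,
        taken at the central vertical view v = V // 2.
    v : ndarray, shape (X, V, Y, C)
        one vertical EPI (a v-y slice) per image column x,
        taken at the central horizontal view u = U // 2.
    """
    U, V, X, Y, C = lf.shape
    # Fix the central v-view: (U, X, Y, C) -> stack of EPIs indexed by y.
    h = lf[:, V // 2].transpose(2, 0, 1, 3)
    # Fix the central u-view: (V, X, Y, C) -> stack of EPIs indexed by x.
    v = lf[U // 2].transpose(1, 0, 2, 3)
    return h, v
```

Each EPI in these stacks contains oriented line structures whose slope encodes scene depth, which is what directional (horizontal/vertical) EPI convolutions are designed to exploit.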