Abstract: Two-person interaction recognition that combines joint data with a convolutional neural network (CNN) expresses interaction information insufficiently when the joint data are converted into images and models time-series relations ineffectively. Combining joint data with a recurrent neural network instead focuses on representing temporal information but neglects the spatial structure of the two-person interaction. To address both problems, a novel model named the CNN attention-bidirectional long short-term memory (CNN-ABLSTM) network is proposed. First, the joints of each person are arranged according to a traversal tree structure, and an interaction matrix is constructed for each frame of the video. Each value in the matrix is the Euclidean distance between a pair of arranged joint coordinates, one from each person. The matrix is encoded as a grayscale image, and the images are fed sequentially to the CNN, which extracts deep-level features to form a feature sequence. The feature sequence is then passed to the ABLSTM network for time-series modeling and finally to a Softmax classifier to obtain the recognition result. The new model is applied to the 11 types of two-person interaction in the NTU RGB+D dataset and achieves an accuracy of 90%, which is higher than that of current two-person interaction recognition algorithms. The results verify the effectiveness and good generalization performance of the new model.
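The per-frame interaction matrix described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `interaction_matrix` and `to_grayscale` are hypothetical, the joints are assumed to be already arranged in traversal order as arrays of shape (J, 3), and the scaling of distances to 8-bit gray levels by the per-frame maximum is an assumption, since the abstract does not specify the encoding.

```python
import numpy as np


def interaction_matrix(joints_a, joints_b):
    """Pairwise Euclidean distances between two persons' joints in one frame.

    joints_a, joints_b: arrays of shape (J, 3), assumed to be already
    arranged in the traversal order (the ordering itself is not shown here).
    Returns a (J, J) matrix whose entry (i, j) is the distance between
    joint i of person A and joint j of person B.
    """
    diff = joints_a[:, None, :] - joints_b[None, :, :]  # shape (J, J, 3)
    return np.linalg.norm(diff, axis=-1)


def to_grayscale(matrix):
    """Encode a distance matrix as an 8-bit grayscale image.

    Assumption: distances are scaled by the frame's maximum distance,
    so the largest distance maps to 255 and zero distance maps to 0.
    """
    peak = matrix.max()
    if peak == 0:
        return np.zeros(matrix.shape, dtype=np.uint8)
    return np.rint(255.0 * matrix / peak).astype(np.uint8)


# Toy frame with two joints per person (real skeletons have more joints).
person_a = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
person_b = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
frame_image = to_grayscale(interaction_matrix(person_a, person_b))
```

Applying these two steps to every frame yields the image sequence that would be passed to the CNN; a different normalization (for example, a fixed maximum distance shared across all frames) would change the gray levels but not the structure of the matrix.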