Humans can infer approximate interaction force between objects from only vision information because we already have learned it through experiences. Based on this idea, we propose a recurrent convolutional-neural-network-based method using sequential images for inferring interaction force without using a haptic sensor. For training and validating deep learning methods, we collected a large number of (e.g. 359,413) images and corresponding interaction forces through an electronic motor-based device. To concentrate on changing shapes of a target object by the external force in images, we propose a sequential image-based attention module, which learns a salient model from temporal dynamics. The proposed sequential image-based attention module consists of a sequential spatial attention module and a sequential channel attention module, which are extended to exploit multiple sequential images. For gaining better accuracy, we also created a weighted average pooling layer for both spatial and channel attention modules. The extensive experimental results verified that the proposed method successfully infers interaction forces under the various conditions, such as different target materials, illumination changes, and external force directions.