Natural human-robot interaction requires robots to learn new tasks autonomously and to link the learned actions to their corresponding words through grounding. Previous studies have addressed either action learning or grounding, but not both. In this paper, we address this gap by introducing a framework that uses reinforcement learning to learn actions and cross-situational learning to ground actions, object shapes and colors, and prepositions. The proposed framework is evaluated in a simulated interaction experiment between a human tutor and a robot. The results show that the proposed framework can be used for simultaneous action learning and grounding.