In this paper, a grounding framework is proposed that combines unsupervised and supervised grounding by extending an unsupervised grounding model with a mechanism for learning from explicit human teaching. To investigate whether explicit teaching improves the sample efficiency of the original model, both models are evaluated in an interaction experiment between a human tutor and a robot, in which synonymous shape, color, and action words are grounded using geometric object characteristics, color histograms, and kinematic joint features, respectively. The results show that explicit teaching improves the sample efficiency of the unsupervised baseline model.