In this paper, we investigate how multiple conversational behaviors can be detected by automatically analyzing facial expressions in video recordings of users talking to a dialog system. To this end, we recorded a video corpus of human-machine interactions that contains, for each recorded video frame, distances between facial landmarks as well as a manually annotated behavior label. We examine the difficulty of defining unambiguous conversational behaviors and train a deep neural network that, after facial landmarks of the detected persons have been extracted, predicts conversational behaviors on a frame-by-frame basis with an F1-score of up to 0.86.
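The feature extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a 68-point facial landmark scheme (the paper's exact landmark set is not specified here) and turns the landmarks of one frame into the pairwise-distance features that a frame-wise classifier would consume.

```python
import numpy as np
from itertools import combinations

def distance_features(landmarks: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between the facial landmarks of one frame.

    landmarks: (L, 2) array of (x, y) positions, e.g. L = 68 for the
    common 68-point landmark scheme (an assumption for this sketch).
    Returns a feature vector of length L * (L - 1) / 2.
    """
    pairs = combinations(range(len(landmarks)), 2)
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in pairs])

# One synthetic frame of 68 landmarks; each frame yields one feature
# vector that a frame-wise DNN would map to a behavior label.
frame = np.random.rand(68, 2)
features = distance_features(frame)
print(features.shape)  # (2278,) distances for 68 landmarks
```

Computing all pairwise distances makes the features invariant to translation of the face in the image, which is one common motivation for distance-based rather than raw-coordinate features.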