Keywords: Human-Computer Interaction, CSCW, Multimodal Machine Learning, Multiple Kernel Learning for Multimodal Fusion
This was my 10-701 Introduction to Machine Learning course project at CMU. The course was taught by Prof. Tom Mitchell.
Negotiation is a complex interaction in which two or more parties confer to settle some matter, such as resolving a conflict or sharing common resources. The parties often have non-identical preferences and goals that they try to reach. Sometimes they simply try to shift a situation in their favor by haggling over price; in other cases, there can be a more complex trade-off between issues. Not all humans are naturally good negotiators, so developing technological interventions that help people negotiate better can be very useful. Predicting negotiation outcomes from nonverbal behavior is the first step toward any such intervention.
Our dataset consists of audiovisual recordings of dyads engaged in a negotiation task via video-conferencing, as shown in the figure above. OpenFace is used to detect facial expressions in the visual data, which are then translated into individualistic, mimicry-based, and synchrony-based features; openSMILE is used to extract speech-related features. More details about these features can be found in the project report.

For the baseline models, we trained three linear Support Vector Machines (SVMs) on audio-only, video-only, and audio+video features, which gave accuracies of 48.2%, 48.3%, and 53.33%, respectively. The limitation of these baselines is that they ignore the inherent differences between audio and visual features by fitting the same function to both. To overcome this limitation, we implemented Multiple Kernel Learning (MKL) for the linear SVM. MKL lets us learn a different kernel (similarity function) for the audio and the visual features and combine them. The linear SVM with MKL trained on audio+video features gave an accuracy of 63.1%, an improvement of around 10 percentage points over the multimodal baseline and around 15 percentage points over the unimodal baselines. Thus, it is important not only to combine modalities, but also to research new techniques that optimally fuse modalities and encode their interactions with each other.
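To make the fusion idea concrete, here is a minimal sketch of MKL-style kernel combination using scikit-learn. It combines a per-modality linear kernel as a convex combination K = β·K_audio + (1−β)·K_video and picks β by cross-validation. The synthetic features, labels, and the simple grid search over β are all illustrative assumptions; the project's actual features and MKL optimization are described in the report.

```python
# Illustrative MKL-style fusion: convex combination of per-modality
# linear kernels, with the mixing weight chosen by cross-validation.
# All data here is synthetic; this is a sketch, not the project's code.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120
X_audio = rng.normal(size=(n, 20))   # hypothetical audio feature matrix
X_video = rng.normal(size=(n, 35))   # hypothetical visual feature matrix
# Synthetic labels that depend on both modalities.
y = (X_audio[:, 0] + X_video[:, 0] > 0).astype(int)

def linear_kernel(X):
    """Gram matrix of the linear kernel, K[i, j] = <x_i, x_j>."""
    return X @ X.T

K_audio = linear_kernel(X_audio)
K_video = linear_kernel(X_video)

# Search over the convex combination K = beta*K_audio + (1-beta)*K_video.
best_beta, best_score = None, -np.inf
for beta in np.linspace(0.0, 1.0, 11):
    K = beta * K_audio + (1 - beta) * K_video
    # SVC with a precomputed kernel; sklearn's CV slices the Gram
    # matrix on both axes for pairwise estimators.
    score = cross_val_score(SVC(kernel="precomputed"), K, y, cv=5).mean()
    if score > best_score:
        best_beta, best_score = beta, score

print(f"best beta={best_beta:.1f}, CV accuracy={best_score:.3f}")
```

The endpoints β=0 and β=1 recover the unimodal kernels, so the fused model can only match or beat whichever single modality the search would otherwise select on validation accuracy.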
For more details, view the report here: http://prernac.com/reports/mlnego.pdf