<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sahay-etal-2020-low">
    <titleInfo>
        <title>Low Rank Fusion based Transformers for Multimodal Sequences</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Saurav</namePart>
        <namePart type="family">Sahay</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Eda</namePart>
        <namePart type="family">Okur</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">shachi</namePart>
        <namePart type="family">H Kumar</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Lama</namePart>
        <namePart type="family">Nachman</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)</title>
        </titleInfo>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Seattle, USA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa, expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion among the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of (CITATION) and (CITATION), we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for Multimodal Sentiment and Emotion Recognition on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have fewer parameters, train faster, and perform comparably to many larger fusion-based architectures.</abstract>
    <identifier type="citekey">sahay-etal-2020-low</identifier>
    <location>
        <url>https://www.aclweb.org/anthology/2020.challengehml-1.4</url>
    </location>
    <part>
        <date>2020-07</date>
        <extent unit="page">
            <start>29</start>
            <end>34</end>
        </extent>
    </part>
</mods>
</modsCollection>
