
THE PROJECT

When engaging in a video call, we often have a limited sense of each other's surrounding context, which in turn affects the conversation. To address this, we built a videoconferencing module that visualises the user's aural environment to enhance awareness between interlocutors. Through user studies, we found that our visualisation system was about 50% effective at eliciting emotional perceptions similar to those elicited by the environmental sound it replaced.

TEAM SIZE 
4

MY ROLE

User Research

Study Design

Qualitative Data (Thematic Coding)

Data Analysis

INTRODUCTION


Conversational quality often declines in teleconferencing due to the loss of contextual awareness about the interlocutor's environment.
Contributing factors include:

  • Restricted field of view of webcams

  • Masking of visual environment

  • Compression or reduction of video quality

  • Use of avatars or background augmentation

A lack of contextual awareness can reduce the feeling of being together in the same space and can also influence the way emotions are interpreted. For example, a speaker may be perceived as anxious when a dog is barking loudly behind them, whereas the same speaker in the context of a peaceful forest may be perceived as more relaxed.

RESEARCH QUESTION

We chose to improve contextual understanding by visualising the auditory environment of the user.
A primary objective of our system's design is to transform background audio into a visualisation that evokes an emotional reaction similar to the original audio source. This led to our primary research question: can a visualisation of the background audio evoke emotional responses similar to those evoked by the audio itself?


RELATED WORK


The concept of visualising audio has been employed by Visiphone, which informs remote callers about their conversation through volume levels and conversational rhythms. In co-located scenarios, the Conversation Clock illustrates audio patterns and offers insights into the culture and status of the participants.

When it comes to offering ambient emotional awareness, SmartHeliosity establishes an emotional feedback mechanism with users by generating coloured light based on their facial expressions, while BioCrystal utilises physiological feedback to generate coloured light.

OUR METHOD


We built a working prototype of a system that enhances emotional awareness while teleconferencing. We then evaluated that system, including the emotional effects of cross-modal representations of the ambient environment. Lastly, we conducted a preliminary analysis of which audio events during teleconferencing are most relevant to users.

SYSTEM DESIGN


The system presents a particle animation that is influenced by both low-level acoustic characteristics and high-level semantic aspects of the user's auditory surroundings.
Semantic details influence the colour and shape of the particles, while acoustic factors impact their size, speed of movement, and path. This particle animation is overlaid on the video feed, and the combined output is shared with all users.


The spectral properties of the source audio were used to adjust the rate, movement, and dimensions of the particles. We calculated a Fourier transform of the signal and took its spectral magnitude. Particles move at a faster pace across the screen when generated from higher-frequency bins. The size of the particles corresponds to the spectral magnitude of the bin.
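As a rough illustration of this acoustic mapping (not the project's actual implementation), the sketch below uses NumPy to compute per-bin spectral magnitudes from an audio frame and derive hypothetical particle speed and size values.

import numpy as np

# A minimal sketch, assuming a mono audio frame as a NumPy array and
# simple normalised outputs; the real system's scaling differs.
def acoustic_particle_params(frame: np.ndarray, sample_rate: int = 16000):
    """Map an audio frame's spectrum to per-bin particle speed and size."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    magnitude = np.abs(spectrum)                                # spectral magnitude per bin
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    speed = freqs / freqs.max()                                 # higher-frequency bins -> faster particles
    size = magnitude / (magnitude.max() + 1e-9)                 # larger spectral magnitude -> larger particles
    return speed, size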

We made use of Google's YAMNet, a pre-trained neural network designed to categorise audio events using a taxonomy created from YouTube audio data. YAMNet is capable of predicting 521 different sound classes.

Using YAMNet, we thematically organised the 521 classes into six coarse semantic features, roughly aligned with high-level soundscape features. We kept the semantic features as coarse-grained as possible, as we didn't want to assume which specific sound events were most relevant to videoconferencing users without first running a user study.

Scores for each semantic feature were derived from the CNN and then converted into continuous, adjustable characteristics for the particles.
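For reference, the sketch below shows how such scores could be obtained with the publicly released YAMNet model on TensorFlow Hub; the keyword-based grouping into two features is purely illustrative and does not reproduce our six-feature ontology.

import csv
import numpy as np
import tensorflow_hub as hub

# Load the public YAMNet release and its 521-class name map.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
class_map_path = yamnet.class_map_path().numpy().decode("utf-8")
with open(class_map_path) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

# Hypothetical keyword-based grouping into coarse semantic features.
GROUPS = {
    "natural":    ("bird", "wind", "rain", "water", "insect"),
    "artificial": ("vehicle", "engine", "siren", "machine", "traffic"),
}

def semantic_scores(waveform: np.ndarray) -> dict:
    """Return one coarse score per semantic feature for a 16 kHz mono waveform."""
    scores, _embeddings, _spectrogram = yamnet(np.asarray(waveform, dtype=np.float32))
    mean_scores = scores.numpy().mean(axis=0)                   # average class scores over time frames
    features = {}
    for name, keywords in GROUPS.items():
        idx = [i for i, cls in enumerate(class_names)
               if any(k in cls.lower() for k in keywords)]
        features[name] = float(mean_scores[idx].max()) if idx else 0.0
    return features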

The three characteristics are colour, shape, and spatial distribution; a rough sketch of this mapping follows the list below.

  • Colour — particles have a blue base, but red is mixed in additively for artificial sounds and green for natural sounds.

  • Shape — interior sounds make the particles look more square, while exterior sounds make them look more round.

  • Spatial distribution — particles are positioned close together for foreground sounds and spread further apart for background sounds.
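A hedged sketch of how these coarse scores could drive the three characteristics is shown below; the feature names, blending weights, and value ranges are assumptions for illustration, not the tuned values used in the prototype.

from dataclasses import dataclass

@dataclass
class ParticleStyle:
    colour: tuple        # (r, g, b) in the range 0..1
    squareness: float    # 0 = round, 1 = square
    spread: float        # 0 = clustered, 1 = dispersed

# A minimal sketch, assuming semantic scores in 0..1 for hypothetical
# features named "artificial", "natural", "interior", and "foreground".
def style_from_semantics(scores: dict) -> ParticleStyle:
    red = min(1.0, scores.get("artificial", 0.0))              # artificial sounds mix in red
    green = min(1.0, scores.get("natural", 0.0))               # natural sounds mix in green
    colour = (red, green, 1.0)                                 # additive mix over a blue base

    squareness = scores.get("interior", 0.0)                   # interior -> square, exterior -> round
    spread = 1.0 - scores.get("foreground", 0.0)               # foreground -> particles cluster together
    return ParticleStyle(colour, squareness, spread)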

USER STUDY


We conducted the user study in two parts. First, in the Rating Task, participants evaluated the perceived emotions in pre-recorded videos depicting two individuals engaged in conversation within various auditory and visual settings. Then, through Qualitative Analysis, data was collected via a textbox prompt to investigate what the visualisations meant to the participants and which background sounds they considered most relevant based on their daily experience.
 

RATING TASK

This experiment investigates whether we can convey similar emotional contexts between modalities.

Participants were presented with videos of two-person discussions and tasked with providing emotional state ratings for the speaker for 10 validated emotion words sourced from the Positive and Negative Affect Schedule (PANAS).

We chose four brief video clips, each lasting between 10 and 20 seconds, from a single conversation. Participants viewed these videos in randomised order, each combined with different elements based on three factors: the background audio, the visualisation output generated by our system, and the environment featuring different sounds. Emotion ratings were given on 1-5 Likert scales, with 5 indicating the emotion was strongly present.

We used four environment sounds: construction, dogs, a cafe, and a forest. Crossing the presence of background audio and of the visualisation with these four environments gave a total of 16 video variations for a single trial.
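As a quick check on the factorial design, the snippet below enumerates the 2 x 2 x 4 combinations of background audio, visualisation, and environment that yield the 16 variations; the labels are placeholders.

from itertools import product

BGA = (False, True)                                  # background audio off/on
VIZ = (False, True)                                  # visualisation off/on
ENV = ("construction", "dogs", "cafe", "forest")     # the four environments

conditions = list(product(BGA, VIZ, ENV))
assert len(conditions) == 16                         # 2 x 2 x 4 variations per trial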


HYPOTHESIS 1

“Do emotion ratings vary depending on different environments?”

3-way ART ANOVA

H1₀ : 𝜇𝐵𝐺𝐴 = 𝜇𝑣𝑖𝑧 = 𝜇𝑒𝑛𝑣 ; H1₁ : ¬(𝜇𝐵𝐺𝐴 = 𝜇𝑣𝑖𝑧 = 𝜇𝑒𝑛𝑣)


Hypothesis 1 posits that emotion ratings are affected by background audio, visualisations, environments, or combinations thereof. To test this, we conducted a 3-way ART ANOVA to compare these effects on each of the emotion ratings.
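For illustration, the sketch below shows the alignment-and-rank step of the aligned rank transform for a single main effect on a long-format ratings table; the column names are assumptions, and the full analysis repeats this for every main and interaction effect before running a standard factorial ANOVA on the ranks.

import pandas as pd

FACTORS = ("bga", "viz", "env")

# A minimal sketch, assuming a long-format DataFrame with hypothetical
# columns: participant, bga, viz, env, rating.
def align_and_rank(df: pd.DataFrame, effect: str, dv: str = "rating") -> pd.Series:
    """Align the ratings for one main effect, then rank them (average ties)."""
    grand_mean = df[dv].mean()
    cell_mean = df.groupby(list(FACTORS))[dv].transform("mean")
    effect_estimate = df.groupby(effect)[dv].transform("mean") - grand_mean
    aligned = (df[dv] - cell_mean) + effect_estimate        # residual + effect of interest
    return aligned.rank(method="average")

# df["rank_env"] = align_and_rank(df, "env")
# A standard 3-way ANOVA is then run on rank_env, interpreting only the
# environment effect (and likewise for the other aligned ranks).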

There was a main effect of environment on the emotion words distressed, guilty, scared, and hostile, and a main effect of visualisation for guilty, scared, and hostile. There was also an interaction effect between environment and visualisation for the same emotions. From these findings, we can infer that the visualisations do induce a perceived change of context.

HYPOTHESIS 2

“Do the visuals evoke emotions similar to the audio?”

The magnitude and direction of emotion ratings in visualisation-only conditions are similar to those in BGA-only conditions (tested with a Spearman rank correlation).

H2₀: ρ = 0; H2₁: ρ > 0


Do the visualisations produce changes in emotion ratings comparable to the impact of the background audio? To address this question, we posed an additional hypothesis: that the magnitude and direction of emotion ratings in visualisation-only conditions closely resemble those in background-audio-only conditions.

If the visuals conveyed contexts similar to the background audio, we would expect to see significantly strong correlations on the diagonal of the cross-modal correlation matrix. For the corrected matrix, there were just three significant cross-modal correlations, all on the diagonal, for the emotions “strong”, “scared”, and “hostile”. From our previous test, there were four emotions that were affected by changing environments: “distressed”, “guilty”, “scared”, and “hostile”. Of those emotions, 50% were significantly correlated cross-modally.
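A minimal sketch of this comparison using SciPy's Spearman correlation is shown below; the inputs stand in for per-emotion rating shifts in each condition, and the names are assumptions.

import numpy as np
from scipy.stats import spearmanr

# A minimal sketch, assuming two equal-length arrays of per-emotion rating
# shifts (relative to a no-audio, no-visualisation baseline).
def cross_modal_correlation(viz_only_shift: np.ndarray,
                            bga_only_shift: np.ndarray):
    """Spearman correlation between viz-only and BGA-only rating shifts."""
    rho, p_value = spearmanr(viz_only_shift, bga_only_shift)
    return rho, p_value

# H2 is supported for an emotion set when rho is significantly greater
# than zero (directional test against H2_0: rho = 0).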

QUALITATIVE ANALYSIS

At the conclusion of the study, participants were given a textbox prompt that asked two questions. The first asked what the visualisation meant to the participants. The second asked which background sounds were most relevant to them while videoconferencing.


CONTENT ANALYSIS OF TEXT RESPONSES

Using qualitative content analysis, we coded the responses according to their literal meaning, then clustered them by affinity into categories.


Emotions: 40% of participants believed that the particles inherently conveyed emotional significance but could not establish any direct connection between the particles and the sounds. The characteristics of the particles held diverse emotional meanings for the participants. For a few, colours influenced the perception of emotions, while others associated them with feelings of relaxation, excitement, or happiness.

Lack of Meaning: Approximately 31% of the participants did not attribute any clear meaning to the particle animation. Some participants perceived the particles as obstructive, while others mentioned that they found the visuals distracting, especially when they obscured the face of the speaker.

Sound: Participants identified the relationship between particle size and the loudness of the background audio, though their understanding of this relationship appeared to be somewhat vague. One participant said simply that the particles “represent the type of noise” in the background. No participants explicitly identified how the particles reacted to the frequency or semantic aspects of the sound.

RELEVANT SOUNDS

We reviewed the participant responses, thematically clustered them, and adjusted the categories accordingly. The list of relevant sounds is hierarchically organised, with each leaf being a participant response.

We found three overarching categories of sounds: living creatures, including human and non-human sounds; outdoors, including urban and transportation sounds; and indoors, consisting of household, appliance, and other miscellaneous sounds. These newly established higher-level categories allow for a more concise selection of semantic features based on actual user data, and they can be used as semantic features in a future iteration.


KEY TAKEAWAYS

CROSS MODAL CORRELATION
The emotions “distressed”, “guilty”, “scared”, “strong”, and “hostile” changed between different environments. The visualisation system was able to capture that change in a way that was significantly correlated with the background audio 50% of the time, so the system conveys 50% of these emotions between modalities.
Words that were not correlated between modalities may indicate that the visualisation did not convey their meaning, or that the emotions were simply not present.

VISUALISATION MEANING

We learnt that participants associated the particle animation with emotions and loudness levels, but they were not explicitly able to identify how the particles reacted to the frequency or semantic aspects of the sound. Furthermore, the new overarching categories enable a more parsimonious set of semantic features informed by real user data. For example, a future version of this system could replace our preliminary semantic features with simply: living creatures, outdoor, and indoor.

FUTURE SCOPE

SEMANTIC ONTOLOGY

There is no single universal semantic ontology that generalises to the needs of all participants. For example, a user in a day care may have a different set of relevant contextual needs compared to one in an industrial shop. As such, the ability to customise and define their own semantic feature ontology would be a necessary step to align the system with users' needs.

COMPOSITIONAL LANDSCAPE

A way to further improve the visualisations would be to preserve the “compositional” nature of the soundscape. Sounds can be perceptually decomposed into multiple coherent textures; for example, a listener can decompose the sounds of a city into the noise of cars, people walking by, or rain falling. The system currently analyses the semantics of the sound as a single “audio event”. A future version of the system could better reflect the compositional nature of sound.
