MozArt

Multimodal 3D UI for Conceptual Modeling

Anirudh Sharma, Sriganesh Madhvanath, Ankit Shekhawat, and Mark Billinghurst. 2011. MozArt: a multimodal interface for conceptual 3D modeling. In Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI '11). ACM, New York, NY. [PDF]
Figure: System architecture of the MozArt multimodal interaction framework, illustrating the mapping between heterogeneous input modalities (voice, touch, textual, laser, mouse) and augmented reality output channels (3D CAD, 2D drawing, annotations). The layered AR representation shows the fusion of real-world context with computer-generated modeling information.

According to a survey, 80 percent of people would like to use their computers to model or create in order to visualize what they imagine, but the difficult user interfaces of such tools prevent them from doing so. Even to model something simple, the user has to navigate an obtrusive set of icons, toolbars, and rarely used features. Current 3D CAD software demands expert-level proficiency, creating a barrier between conceptual intent and digital expression for the vast majority of potential users.

Figure: A user operating the MozArt prototype, employing direct touch gestures on an inclined multitouch surface to manipulate 3D geometric primitives rendered in isometric projection. The headset-mounted microphone enables concurrent speech input for multimodal command fusion during the conceptual modeling workflow.

We propose a modeling interface that brings 3D visualization to lay users who want to quickly give form to what they imagine. The work is motivated by the goal of supporting natural expression with as few restrictions as possible, freeing CAD users from tedious command buttons and menu items. We explored both the hardware and software aspects of the interface, specifically the use of intuitive speech commands and multitouch gestures on an inclined interactive surface.

The initial TUIO/OSC touch integration was developed during Google Summer of Code. Touch + speech multimodal fusion was subsequently implemented with Sriganesh Madhvanath at HP Labs, combining simultaneous gesture and voice input into a unified command stream for 3D object manipulation.
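To make the fusion step concrete, below is a minimal Python sketch of time-window fusion, in the spirit of the description above rather than the actual MozArt code: a touch listener (for example, a TUIO/OSC handler) and a speech recognizer both feed a small engine that pairs a spoken verb with the most recent touch point. The class names, the command vocabulary, and the 0.3 s window are illustrative assumptions, not details from the paper.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class TouchEvent:
    x: float           # normalized surface coordinates, as a TUIO tracker reports them
    y: float
    timestamp: float


@dataclass
class SpeechCommand:
    verb: str          # e.g. "create cube", "extrude" -- hypothetical vocabulary
    timestamp: float


class FusionEngine:
    """Pairs a recognized speech command with a recent touch event."""

    def __init__(self, window_s=0.3):
        self.window_s = window_s              # fusion window (assumption, not from the paper)
        self.recent_touches = deque(maxlen=32)

    def on_touch(self, event):
        """Called by the touch listener (e.g. a TUIO/OSC handler)."""
        self.recent_touches.append(event)

    def on_speech(self, command):
        """Called by the speech recognizer; returns a unified command or None."""
        candidates = [t for t in self.recent_touches
                      if abs(t.timestamp - command.timestamp) <= self.window_s]
        if not candidates:
            return None                       # speech-only commands could be handled globally
        anchor = min(candidates, key=lambda t: abs(t.timestamp - command.timestamp))
        # The verb comes from speech, the location from touch.
        return {"action": command.verb, "x": anchor.x, "y": anchor.y}


if __name__ == "__main__":
    engine = FusionEngine()
    now = time.time()
    engine.on_touch(TouchEvent(x=0.42, y=0.58, timestamp=now))
    fused = engine.on_speech(SpeechCommand(verb="create cube", timestamp=now + 0.1))
    print(fused)    # {'action': 'create cube', 'x': 0.42, 'y': 0.58}
```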

Team: Anirudh Sharma, Sriganesh Madhvanath, Ankit Shekhawat, Mark Billinghurst.

A within-subjects user study was conducted to compare the multimodal (MM) interface — combining speech and multitouch — against a multitouch-only (MT) baseline across two 3D modeling tasks of increasing complexity.

Participants

12 participants (8 male, 4 female, ages 20–29) with no prior experience in 3D modeling software. All participants completed both conditions in counterbalanced order.

Task Completion Time

Condition comparison: Multitouch-only (MT) vs. Multimodal (MM)
Statistical test: One-way ANOVA
Result: No significant difference (p > 0.05)

Task completion times were comparable across both conditions, indicating that the addition of speech input did not slow users down despite introducing a new modality.
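For readers who want to reproduce this kind of analysis, a one-way ANOVA on per-participant completion times could be run with SciPy as sketched below. The arrays are placeholder values for illustration only, not the study's measurements.

```python
# Illustrative only: one-way ANOVA on completion times with SciPy.
# The numbers below are placeholders, NOT data from the MozArt study.
from scipy import stats

mt_times = [210, 195, 240, 225, 200, 230, 215, 205, 220, 235, 198, 212]  # seconds, hypothetical
mm_times = [205, 200, 235, 230, 195, 225, 210, 208, 218, 232, 202, 209]  # seconds, hypothetical

f_stat, p_value = stats.f_oneway(mt_times, mm_times)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p > 0.05 would mirror the reported outcome
```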

Error Rate (Undo Count)

Task 2 (complex modeling): significantly fewer errors in the MM condition
Statistical test: Paired t-test, t(11) = 3.07, p = 0.005

In the more complex modeling task, participants made significantly fewer errors (measured by undo count) when using the multimodal interface, suggesting that speech commands reduced accidental or imprecise touch inputs.
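Because the study is within-subjects, the comparison uses a paired t-test: each participant's undo count under MT is matched with their own count under MM. A hedged SciPy sketch of that computation follows; the values are placeholders, not the study data.

```python
# Illustrative only: paired (within-subjects) t-test on per-participant undo counts.
# Values are placeholders, NOT the MozArt study's data.
from scipy import stats

mt_undos = [6, 4, 7, 5, 8, 6, 5, 7, 6, 9, 4, 5]   # hypothetical undo counts, MT condition
mm_undos = [3, 2, 4, 3, 5, 4, 2, 4, 3, 5, 2, 3]   # hypothetical undo counts, MM condition

t_stat, p_value = stats.ttest_rel(mt_undos, mm_undos)   # paired: same 12 participants
print(f"t(11) = {t_stat:.2f}, p = {p_value:.3f}")
```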

Subjective Workload (NASA TLX)

Frustration: higher with MT, lower with MM
Physical demand: higher with MT, lower with MM

NASA TLX ratings revealed that participants experienced higher frustration and physical demand in the multitouch-only condition. The multimodal interface reduced both dimensions by offloading selection and parameterization commands to speech.

User Preference

9 of 12 participants (75%) preferred the multimodal interface over multitouch-only, citing more natural interaction and reduced reliance on on-screen menus as primary reasons.