Feasibility of different automated content analysis techniques in film sound classification: A look at internal validity
Pedro Silva1
psilva@sfsu.edu
18 December 2006
1 Introduction
1.1 Background
Film sound is an often overlooked field, be it in general film theory or in actual production practice. There is a lack of serious writing on the subject, even considering the works of [4], [2] and [27], on the theoretical side, and [28], [11] and [19] as practitioners of the medium. The area seems to be divided into:
- Humanistic approaches, coming chiefly from film studies departments in Fine Arts and Humanities schools and focusing on the philosophy of sound as a support for film, albeit with some exceptions, such as the ones mentioned above. Methodologies are argumentative in nature, thereby lacking the potential validity of a scientific method.
- Professional approaches, which are usually technical (with the welcome exception of Walter Murch). These usually deal with the application of technology and its incorporation into specific techniques. While these types of contributions are very important to the development of the practice, they add considerably less to the body of higher knowledge than their humanistic counterparts.
- The socio-scientific approach seems to be neglected. Although there are a number of studies using content analysis in film, these mostly are concerned with violence, drug use, et cetera, in film.
1.2 Context
I am focusing on understanding whether computational media aesthetics (CMA), as a field and a theory, can be used in my long-term research, a longitudinal study of the evolution of the film sound track in Hollywood, from the dominance of the sound film starting in the early 1930s through today, specifically looking at two sets of variables: the sound track elements of dialogue, music, noise, and silence, and the auditory scene, independent of the visual scene. The methodology will be automated content analysis; due to its speed and efficiency, this is the underlying catalyst to such a comprehensive study. However, while research on the capabilities of audio-specific data mining has been conducted, particularly in the fields of music information retrieval (MIR) and speech processing, no applications have resulted in film sound investigation. I thus propose to research the possibility of automatically analyzing content in the described setting effectively. This implies that the technology used for this effect must be able to extract and cluster low-level audio features, or units of analysis, and organize them into semantic units, or classes, before any statistically meaningful conclusions may be drawn. There are, then, two components to this methodology: the technological and the technical. The technological component is related to the state-of-the-art in computational data mining; the technical aspect derives from the field of media aesthetics; it is the study of semantic significance in production practice.
2 Research question
2.1 Operational definitions
- Audio classification: Algorithmic process of extracting low-level audio features and using these to categorize a signal.
- Sound track: Ensemble of sound elements in film: speech, music, noise, and silence.
- Auditory scene: A semantic unit in film; the aural equivalent of a visual scene, although not necessarily coinciding with it.
- Semantic gap: The difference between the low-level features that machines perceive and the high-level attributes and meaning that humans require.
- Film sound theory: Academic literature produced mainly by film studies departments in the Humanities.
- Film sound practice: The professional field of production and/or post-production sound for moving pictures.
- Socio-scientific research: In the proposed setting, a systematic investigative effort using quantitative and qualitative tools that attempts to:
  - provide a scientific basis to humanistic assertions made in film sound theory and
  - attribute socio-cultural and aesthetic meaning to professional practices in film sound.
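To make the definition of audio classification concrete, the following is a minimal Python sketch of extracting two low-level features per frame (root-mean-square energy and zero-crossing rate) and using them to categorize the signal. The frame size, the thresholds, and the label names ("silence", "noise", "tonal") are illustrative assumptions of mine and are not drawn from the literature reviewed here; separating speech from music would require richer descriptors than these two.

```python
import math

def frame_features(samples, frame_size=1024):
    """Yield (rms_energy, zero_crossing_rate) for each non-overlapping frame."""
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        yield rms, crossings / (frame_size - 1)

def label_frame(rms, zcr, silence_rms=0.01, noisy_zcr=0.3):
    """Map one frame's features to a coarse label (thresholds are hypothetical)."""
    if rms < silence_rms:
        return "silence"
    return "noise" if zcr > noisy_zcr else "tonal"
```

Applied to a signal normalized to the range -1 to 1, `[label_frame(*f) for f in frame_features(samples)]` gives one label per frame; any real study would need to calibrate the thresholds against manually coded material.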
2.2 Formal statement
I propose to study the feasibility of using automated content analysis techniques to gather accurate longitudinal data on:
- the evolution of the sound track elements -speech, music, noise, and silence- (i.e. proportion, change over time) and
- the correlation between auditory and visual scene parameters.
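The first item of the statement (proportion and change over time) can be sketched as follows, assuming each film has already been reduced to a list of labeled segments. The input layout `(start_sec, end_sec, label)` and both function names are hypothetical choices made for illustration only; the second item, the auditory versus visual comparison, is not covered by this sketch.

```python
from collections import defaultdict

def class_proportions(segments):
    """segments: iterable of (start_sec, end_sec, label), assumed non-empty.
    Returns a dict mapping each label to its share of the total coded duration."""
    durations = defaultdict(float)
    for start, end, label in segments:
        durations[label] += end - start
    total = sum(durations.values())
    return {label: d / total for label, d in durations.items()}

def proportions_by_decade(films):
    """films: iterable of (release_year, segments).
    Pools the segments of all films released in the same decade."""
    pooled = defaultdict(list)
    for year, segments in films:
        pooled[year // 10 * 10].extend(segments)
    return {decade: class_proportions(segs) for decade, segs in sorted(pooled.items())}
```

Pooling durations within a decade weights longer films more heavily; averaging per-film proportions instead would be an equally defensible design choice.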
3 Literature review
3.1 Computational media aesthetics
Media semantics: Who needs it and why? [6] combines six panel statements taken from the tenth International Conference on Multimedia in 2002. They "represent the whole multimedia value chain" (p. 580), and assert authority on specific fields, conferred by their respective authors. Chitra Dorai has written extensively on the subject of computational media semantics [7,8,20]; Thomas Sikora is one of the leading sources in MPEG-7 audio [13]; Herbert Zettl is most definitely an authority in applied media aesthetics [29]. Their goal is to stimulate discussion in "bridging the semantic gap": that is, how to move away from mere analytical to contextual implementations of computational technology in relation to the ever-increasing amounts of media material available. They specifically address the following questions:
- usability of automatic indexing and classification systems;
- human effort as a bottleneck in dealing with content;
- the kind of ontology to use in multimedia as a framework for accessibility, citing the Motion Picture Experts Group's MPEG-7 and the World Wide Web Consortium's Semantic Web standards as current implementations;
- the possibility of a 'Media-Google' as a result of such an ontology, and the importance of machine-learning research in achieving that end;
- the importance -and so far, failure- of media scholars in taking the initiative of better understanding the aesthetics of their field, emphasizing synthesis over analytical efforts
3.2 Automated content analysis in the social sciences
Teaching computers to watch television: Content based image retrieval for content analysis [9] is relevant to those wanting to research in the social sciences in general, and communications and media in particular. It
- reviews the state-of-the-art in automated content analysis systems,
- examines a number of exemplary systems,
- identifies emerging technologies and standards,
- and considers the role of human coding and intelligence in their effectiveness.
However, the article has several shortcomings:
- it is image-centric, disregarding even its own conclusions about the effectiveness of such an approach;
- it fails to consider content analysis of audio independently from video;
- emerging technologies, arguably fundamental even in Evans's own perspective, are not treated in depth;
- it places too much emphasis on human-assisted coding and analysis, again defeating its own purpose.

Three conclusions follow:
- an updated review of the state-of-the-art in automated content analysis systems is in order;
- since Evans's conclusions regarding the effectiveness of multi-modal analysis systems were undoubtedly right, there was no reason to designate the object of his study Content Based Image Retrieval (CBIR);
- human-assisted, semi-automated content analysis defeats the initial objective of arguing for the use of automated media content analysis systems for television and film research.
3.3 Automated audio classification
[25] reports a three-part audio scene segmentation framework:
- definition of a scene
- multiple feature models that characterize the dominant sources and
- a simple, causal listener model. (p. 2441)
4 Working framework
4.1 Description
Computational media aesthetics contends that "we must understand compositional and aesthetic media principles to guide [automated] content analysis" [20,p. 10]. It is defined as the "algorithmic study of a variety of image and aural elements in media (based on film grammar). It is also the computational analysis of the principles that have emerged underlying their manipulation in the creative art of clarifying, intensifying, and interpreting an event for an audience" ([20,p. 11]). It attempts to address a problem raised by multimedia content management (MCM): the semantic gap [1,p. 18]. The semantic gap is "the gulf between the rich meaning and interpretation that users expect systems to associate with their queries for searching and browsing media and the shallow, low-level features (content descriptions) that the systems actually compute" [8,p. 15]. [20] identify two sources for analyzing and interpreting media: first, structuralism is used in film studies as an analytical tool. It consists of segmenting content -film, in this case-, and analyzing and interpreting the resulting sections, usually based upon a semiotic approach. Second, film grammar, they consider, is a far richer grounding for the automated content analysis of media. Film grammar constitutes an effective ontology of production knowledge, in that it fairly represents a "worldwide use [of] accepted rules and techniques to solve problems in transforming a story from a written script to a captivating visual and aural narration" [3], cited in [20,p. 10]. These rules and others, covering production practices in other fields of the media, combine to form the general field of media aesthetics. This field's researchers, Zettl argues in [6,p. 582], are responsible for addressing the problem of the semantic gap. What should result from this description of CMA is its underlying interdisciplinary nature, as the diverse contributing panel in [6] exemplifies. 
Although computer science and media aesthetics are its main areas, these further subdivide into speech processing, MIR, content-based image retrieval (CBIR), audiovisual segmentation, classification and indexing, film aesthetics, and sound aesthetics.
4.2 Critique
The distinction that [20] draw between structuralism and film ontology, as pertaining to two different levels of analysis, is supported by previous research. Citing [23], [1,p. 18] summarizes the three types of indexes that are generally required [in voice-on-demand applications by end-users], of which two are of interest:
- Structural (for example, segments, scenes, and shots), and
- Content (for example, objects and actors in scenes).

Structuralism does not address the semantic gap problem. However, the use of a film ontology in computed analysis must necessarily build upon the results of segmentation and clustering of low-level description units into high-level semantic units. This means one builds on the other. I argue that this hierarchical structure is essential, because it accounts for the development of technology and its specific techniques. While the semantic gap is not resolved, one may implement structuralist approaches in insightful research. [1] reviews in detail past implementations of CMA. There have been multiple attempts at extracting high-level meaning from multimedia content, starting with structuralist approaches during the mid-to-late nineties [21,14,10] through current research in content-based semantic analysis [26,5,18]. Most of these investigative efforts have used film-specific terminology to guide their efforts. One striking example is found in [7], in which film grammar is used in semantic construct extraction of tone, shot rhythm, and pace in film; this is done using a "primitive feature extraction" (p. 96), which is simply a structural projection of Zettl's lighting, color, time-motion and sound, although they also extract simpler features such as shot length and type. Moreover, media aesthetics, and specifically Zettl's Sight Sound Motion [29], is the framework most widely used for such a terminology. This shows that automated content analysis has taken the basic CMA assumptions into consideration. Therefore, one may say such efforts have been adequately implemented.
5 Methodology
5.1 Relevance
To understand the relevance of CMA to my research, its premises should be analyzed in detail. These, I've shown, are as follows:
- Automated content analysis entails an understanding of the basic compositional and aesthetic media principles
- Such understanding must be projected into algorithmic procedures
- The computational execution should be informed by the relevant grammar -film, as proposed by [20]
- Low-level structuralist descriptions are built upon to form higher levels of meaning
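The last premise, that low-level descriptions are built upon to form higher-level units, can be illustrated at its simplest, purely structural level: consecutive frames carrying the same label are merged into timed segments. This is a sketch under my own assumptions (one label per fixed-length frame, a hypothetical function name); it says nothing about how semantic units such as auditory scenes would be derived from those segments.

```python
from itertools import groupby

def frames_to_segments(frame_labels, frame_seconds):
    """Collapse runs of identical frame labels into (start_sec, end_sec, label) segments."""
    segments, position = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))  # number of consecutive frames sharing this label
        segments.append((position * frame_seconds, (position + length) * frame_seconds, label))
        position += length
    return segments
```

For example, six half-second frames labeled speech, speech, music, music, music, speech collapse into three segments; a practical system would probably also smooth out very short runs before merging.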
5.2 Tool
Content analysis in mass communication [16] reviews the state-of-the-art in content analysis. The authors report the results of a content analysis of 200 studies in the literature published between 1994 and 1998. Of these, only 69% reported any kind of intercoder reliability; of those that did, many did not include the size of the reliability sample, its coders, specific reliability variables, amount of training, or how discrepancies between coders were resolved. They go through a detailed overview of measuring reliability, offering percent agreement, Holsti's method, Scott's Pi (π), Cohen's Kappa (κ), and Krippendorff's Alpha (α) as possible indicators. They conclude by proposing a number of recommendations to social scientists involved in content analysis:
- Calculate and report intercoder reliability,
- select one or more appropriate indices,
- obtain the necessary tools to calculate the index or indices selected,
- select an appropriate minimum acceptable level of reliability for the index or indices to be used,
- assess reliability informally during coder training,
- assess reliability formally in a pilot test,
- assess reliability formally during coding of the full sample,
- select and follow an appropriate procedure for incorporating the coding of the reliability sample into the coding of the full sample
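To show why the choice of index matters, the sketch below computes two of the indicators named above for a pair of coders: simple percent agreement and Cohen's Kappa, κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each coder's label frequencies. The function names and the input layout (two equal-length lists of labels, one entry per coded unit) are assumptions made for illustration.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of units on which the two coders assigned the same label."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on nominal labels."""
    n = len(coder_a)
    p_o = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of the two coders' marginal proportions, summed over labels.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1 (a single label throughout)
```

With ten units on which the coders disagree once, percent agreement can be 0.9 while κ is noticeably lower, because part of that agreement is attributable to chance.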
5.3 Technique
I will use intercoder reliability indices, not percent agreement, to calculate the accuracy of a variety of different classification systems when applied to film sound tracks. I will begin by manually coding 10% of the sound track universe, with multiple human coders, ideally between three and five, with the following categories of analysis:
- Speech
- Music
- Noise
- Silence
- Speech with music background
- Speech with noise background
- Music with noise background
- Auditory scene
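Because the coding involves three to five human coders, an index that handles more than two coders is needed; Krippendorff's Alpha, named in section 5.2, is one such index. The following is a minimal sketch for nominal categories, computed from the coincidence matrix as α = 1 - D_o / D_e (observed over expected disagreement). The function name and input layout are hypothetical, and treating an automated classifier as one additional coder is my assumption about how the comparison might be set up.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one sequence per coded unit, holding the labels given by the coders who rated it."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded only once carries no agreement information
        # Every ordered pair of labels from two different coders, weighted by 1/(m - 1).
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n * (n - 1))
    return 1 - observed / expected  # undefined if only one category was ever used
```

Each unit here would be one coded stretch of sound track, for example `["speech", "speech", "music"]` for three coders; units may have different numbers of coders, which is one practical reason for preferring this index.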
5.4 Sampling
The sampling universe includes the top-grossing American feature films from 1970 to 2006 as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, and thus excludes independent productions. Feature film is defined as a production at least 40 minutes long released for the theatrical market. I will randomly select three films for each decade (1970-1979, 1980-1989, 1990-1999, and 2000-2006) for analysis. This is in line with recent research employing content analysis methods in film [17]. The choice of American feature films of no specific genre is arbitrary, as the goal is to test the techniques. The choice of top-grossing box office revenue is meant to frame the Hollywood industry specifically, and exclude independent efforts which would necessarily clutter the analysis -one needs to focus somewhere-, while at the same time making sure a quality parameter is not being used: it could be argued that box office figures don't show a strong correlation to aesthetic quality, although, to my knowledge, no research specifically supports this. Stratified sampling ensures that no particular time-frame is over-represented. Three movies per decade will probably minimize chance combinations of aberrant results, if [17] is considered.
6 Limitations
This study measures mainly internal validity, by ensuring I am indeed studying what I propose to. I am not concerned with external validity at this point: eventually, I will conduct a pilot study employing a meta-analysis of sampling sizes, frequencies, and techniques. There are limitations to the methodology: the obvious one is that the questions that I ask are necessarily bound by the limits of the technology. Until even higher-level semantics are computable from a medium such as audio or video, one cannot investigate more complex production practices. Because of that, I am chiefly concerned with appraising what my methodological limitations are. Hence the preliminary study (the object of my proposal and research question), which is meant to determine whether the theoretical considerations set out in this paper are indeed correct. However, there are two different ways of looking at such limitations: on the one hand, the research deliberately looks at internal validity only, to keep its scope manageable; the missing external component will be examined at another time. On the other hand, the expected output from such a study is a new way of using computational techniques to do the work of multiple persons in typical content analysis.
7 Future perspectives
The proposed study is a part of a larger long-term project: the longitudinal study of the evolution of the film sound track since its inception. Another part is a second pilot study, mentioned above, intended to test the external validity of the methodology through a meta-analysis of sampling sizes, frequencies, and techniques in film sound content analysis. This is important in that it will optimize sampling procedures in film sound, arguably a field in need of such research (there are many such studies for content analysis of newspapers and magazines, specifically). The ultimate goal is to eventually provide a repository of film sound-specific information for researchers. An unintended consequence might be the closing of a small part of the semantic gap, to follow Herbert Zettl's call.
References
- [1] B. Adams. Where does computational media aesthetics fit? IEEE MultiMedia, 10(2):18-27, 2003.
- [2] Rick Altman, editor. Sound Theory Sound Practice. Routledge, New York, 1992.
- [3] D. Arijon. Grammar of the film language. Focal Press, New York, 1976.
- [4] Michel Chion. Audio-Vision. Columbia University Press, New York, 1994.
- [5] M. Davis. Editing out video editing. IEEE MultiMedia, 10(2):54-64, 2003.
- [6] C. Dorai, A. Mauthe, F. Nack, L. Rutledge, T. Sikora, and H. Zettl. Media semantics: who needs it and why? Proceedings of the tenth ACM international conference on Multimedia, pages 580-583, 2002.
- [7] C. Dorai and S. Venkatesh. Bridging the semantic gap in content management systems: Computational media aesthetics. Proceedings of the First Conference on Computational Semiotics for Games and New Media-COSIGN, pages 94-99, 2001.
- [8] C. Dorai and S. Venkatesh. Bridging the semantic gap with computational media aesthetics. IEEE MultiMedia, 10(2):15-17, 2003.
- [9] William Evans. Teaching computers to watch television: Content based image retrieval for content analysis. Social Science Computer Review, 18(3):246-257, 2000.
- [10] Nuno Guimaraes, Nuno Correia, Ines Oliveira, and Joao Martins. Designing computer support for content analysis: A situated use of video parsing and analysis techniques. Multimedia Tools and Applications, (7):159-180, 1998.
- [11] Tomlinson Holman. Sound for Film and Television. Focal Press, Burlington, 2nd edition, 2001.
- [12] G. Hripcsak and A.S. Rothschild. Agreement, the f-measure, and reliability in information retrieval, 2005.
- [13] Hyoung-Gook Kim, Nicolas Moreau, and Thomas Sikora. Audio classification based on the MPEG-7 spectral basis representations. IEEE Transactions on Circuits and Systems for Video Technology, 14(5):716-725, May 2004.
- [14] Rainer Lienhart, Silvia Pfeiffer, and Wolfgang Effelsberg. The MoCA Workbench: support for creativity in movie content analysis. Technical Report TR-95-034, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1995.
- [15] Ruei-Shiang Lin and Ling-Hwei Chen. A new approach for classification of generic audio data. International Journal of Pattern Recognition and Artificial Intelligence, 19(1):63-78, 2005.
- [16] Matthew Lombard, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4):587-604, October 2002.
- [17] Elizabeth Monk-Turner, Peter Ciba, Matthew Cunningham, P. Gregory McIntire, Mark Pollard, and Rebecca Turner. A content analysis of violence in American war movies. Analyses of Social Issues and Public Policy, 4(1):1-11, 2004.
- [18] P. Mulhem, M.S. Kankanhalli, J. Yi, and H. Hassan. Pivot vector space approach for audio-video mixing. IEEE MultiMedia, 10(2):28-40, 2003.
- [19] Walter Murch. In the Blink of an Eye. Silman-James Press, Los Angeles, 2nd edition, 2001.
- [20] F. Nack, C. Dorai, and S. Venkatesh. Computational media aesthetics: Finding meaning beautiful. IEEE MultiMedia, 8(4):10-12, 2001.
- [21] Silvia Pfeiffer, Stephan Fischer, and Wolfgang Effelsberg. Automatic audio content analysis. Technical Report TR-96-008, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1996.
- [22] Srividya Ramasubramanian. A content analysis of the portrayal of India in films produced in the West. The Howard Journal of Communications, 16(4):243-265, 2005.
- [23] L. Rowe, J. Boreczky, and C. Eads. Indexes for user access to large video databases. Proc. Storage and Retrieval for Image and Video Databases, pages 150-161, 1994.
- [24] Amjad Samour and Hyoung-Gook Kim. MPEG-7 audio analyzer: Low level descriptors extractor. Retrieved 17 December 2006, from http://mpeg7lld.nue.tu-berlin.de/, March 2004.
- [25] Hari Sundaram and Shih-Fu Chang. Audio scene segmentation using multiple features, models and timescales. In Acoustics, Speech, and Signal Processing. Proceedings. ICASSP'00. IEEE International Conference on, volume 6, pages 2441-2444, June 2000.
- [26] B.T. Truong, S. Venkatesh, and C. Dorai. Application of computational media aesthetics methodology to extracting color semantics in film. Proceedings of the tenth ACM international conference on Multimedia, pages 339-342, 2002.
- [27] Elizabeth Weis and John Belton, editors. Film Sound: Theory and Practice. Columbia University Press, New York, 1985.
- [28] David Lewis Yewdall. Practical Art of Motion Picture Sound. Focal Press, Burlington, 3rd edition, 2003.
- [29] Herbert Zettl. Sight Sound Motion: applied media aesthetics. Wadsworth Publishing, Belmont, 3rd edition, 1999.
Footnotes:
1 Broadcast and Electronic Arts Department, San Francisco State University
