Feasibility of different automated content analysis techniques in film sound classification: A look at internal validity

Pedro Silva1
psilva@sfsu.edu

18 December 2006

1  Introduction

1.1  Background

Film sound is an often overlooked field, be it in general film theory or in actual production practice. There is a lack of serious writing on the subject, even considering the works of [4], [2] and [27], on the theoretical side, and [28], [11] and [19] as practitioners of the medium.
The area seems to be divided into:
  1. Humanistic approaches, coming chiefly from film studies departments in Fine Arts and Humanities schools and focusing on the philosophy of sound as a support for film, albeit with some exceptions, such as the ones mentioned above. Methodologies are argumentative in nature, thereby lacking the potential validity of a scientific method.
  2. Professional approaches, which are usually technical (with the welcome exception of Walter Murch). These usually deal with the application of technology and its incorporation into specific techniques. While contributions of this type are very important to the development of the practice, they add considerably less to the body of higher knowledge than their humanistic counterparts.
  3. The socio-scientific approach seems to be neglected. Although there are a number of studies using content analysis in film, these mostly are concerned with violence, drug use, et cetera, in film.

1.2  Context

I am focusing on understanding whether computational media aesthetics (CMA), as a field and a theory, can be used in my long-term research, a longitudinal study of the evolution of the film sound track in Hollywood, from the dominance of the sound film starting in the early 1930s through today, specifically looking at two sets of variables: the sound track elements of dialogue, music, noise, and silence, and the auditory scene, independent of the visual scene. The methodology will be automated content analysis; due to its speed and efficiency, this is the underlying catalyst to such a comprehensive study. However, while research on the capabilities of audio-specific data mining has been conducted, particularly in the fields of music information retrieval (MIR) and speech processing, no applications have resulted in film sound investigation. I thus propose to research the possibility of automatically analyzing content in the described setting effectively. This implies that the technology used for this effect must be able to extract and cluster low-level audio features, or units of analysis, and organize them into semantic units, or classes, before any statistically meaningful conclusions may be drawn. There are, then, two components to this methodology: the technological and the technical. The technological component is related to the state-of-the-art in computational data mining; the technical aspect derives from the field of media aesthetics; it is the study of semantic significance in production practice.

2  Research question

2.1  Operational definitions

Audio classification
Algorithmic process of extracting low-level audio features, and using these to categorize a signal.
Sound track
Ensemble of sound elements in film; speech, music, noise, and silence.
Auditory scene
A semantic unit in film; the aural equivalent of a visual scene, although not necessarily the same.
Semantic gap
The difference between the low-level features that machines perceive and the high-level attributes and meaning that humans require.
Film sound theory
Film sound theory is the academic literature produced mainly by film studies departments in the Humanities.
Film sound practice
Film sound practice is the professional field of production and/or post-production sound for moving pictures.
Socio-scientific research
In the proposed setting, a systematic investigative effort using quantitative and qualitative tools that attempts to:
  • provide a scientific basis to humanistic assertions made in film sound theory and
  • attribute socio-cultural and aesthetic meaning to professional practices in film sound.
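
The operational definition of audio classification given above rests on low-level features computed frame by frame. As a minimal sketch of what such features can look like, the block below computes two common ones, short-time energy and zero-crossing rate. The choice of these two features, the frame length, the sample rate and the synthetic test signals are all assumptions made for illustration; they are not taken from any of the systems reviewed in this paper.

```python
import math


def frame_features(samples, frame_len=1024):
    """Return a (short_time_energy, zero_crossing_rate) pair per full frame.

    `samples` is a sequence of floats in [-1, 1]; `frame_len` is a
    hypothetical analysis window length in samples.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1)
        features.append((energy, zcr))
    return features


# Synthetic one-second signals at an assumed 8 kHz sample rate.
RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]
silence = [0.0] * RATE

tone_features = frame_features(tone)
silence_features = frame_features(silence)
```

A classifier would then map sequences of such feature pairs onto the sound track classes (speech, music, noise, silence); that mapping is where the semantic gap appears.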

2.2  Formal statement

I propose to study the feasibility of using automated content analysis techniques to gather accurate longitudinal data on sound track elements and auditory scenes, such as their length and relative position (e.g. coincidental, asynchronous, et cetera), over an extended period of time.

3  Literature review

3.1  Computational media aesthetics

Media semantics: Who needs it and why? [6] combines six panel statements taken from the tenth International Conference on Multimedia in 2002. They "represent the whole multimedia value chain" (p. 580), and assert authority on specific fields, conferred by their respective authors. Chitra Dorai has written extensively on the subject of computational media semantics [7,8,20]; Thomas Sikora is one of the leading sources in MPEG-7 audio [13]; Herbert Zettl is most definitely an authority in applied media aesthetics [29]. Their goal is to stimulate discussion in "bridging the semantic gap": that is, how to move away from mere analytical to contextual implementations of computational technology in relation to the ever-increasing amounts of media material available. They specifically address the following questions: . In other words, Zettl (in this case) attributes the responsibility of 'bridging the semantic gap' to media researchers, as opposed to computer scientists.
[6] help clarify some of the broader possibilities of my long-term investigation. They support the idea that inter-disciplinary efforts should and must engage in the ongoing research. The concept of film grammar-based automatic content analysis is touched upon here; I expect to develop it into a methodology of content analysis that is based on film sound grammar. The semantic web, MPEG-7 and related conceptualizations of knowledge domains suggest that a future extension of my work could be the development and implementation of a film sound ontology for researchers: in other words, a system for analyzing sound tracks which can be accessed via the web, for example. Some of the tools I will use in this research can be easily implemented in that domain (and a few already are: it is possible to upload an audio file to a web server, and have it come back processed as an MPEG-7 descriptor file containing information about that file's properties [24]. It is possible to implement another system which takes that same MPEG-7 descriptor file and processes it into something semantically meaningful).

3.2  Automated content analysis in the social sciences

Teaching computers to watch television: Content based image retrieval for content analysis [9] is relevant to those wanting to research in the social sciences in general, and communications and media in particular. It Nevertheless, the paper is not without its shortcomings. It The first part briefly describes the general structure of Evans's study. The second section argues that
Evans considers current applications of CBIR in social science to be scarce, although the technology is available. He provides a number of examples of content analysis research employing CBIR, namely [10]. Still, he considers the preprocessing of video the most likely application of CBIR to social science currently and in the near future, as a way of enabling better human coding practice (p. 252). Because current CBIR demands a high level of expertise and carries high technology costs, social researchers are advised to at least seek computerized assistance in content analysis. Despite its shortcomings, Teaching computers to watch television is important in more than one respect. First, its timing was significant: the MPEG-7 standard was ready in 2001, so its publication just before may have alerted many social scientists to the ready benefits of automated content analysis. Second, it is a comprehensive review of the technical and social literature, describing the inner workings, techniques and tools of the technology in more than superficial detail. It is thus a good starting point for someone new to the field; it provides a solid, balanced overview of both the technology and its social science application. This is important in that most research tends to fall into the purely technical [10,21,14,13], or the purely social [17,22].
[10] report an early implementation of an automated content analysis system adapted to the social sciences. The underlying research was inter-disciplinary, involving a computer science research group and a group of social scientists. The paper discusses multimedia information parsing and matching algorithms more or less generally, as well as video segmentation and the role of information models in maximizing efficiency in computational content analysis (pp. 162-163). It very specifically reports the researchers' experience in implementing the system in 1995 during the Portuguese parliamentary election campaign. The system was used in determining the content of selected news segments as they reported the campaign. They conclude with technical and conceptual recommendations for future research; [10] find the system technically feasible and its output flexible to different needs other than specific content analysis in communications research (p. 177). Moreover, they believe they have been able to bridge the "semantic gap" typically encountered in abstract signal analysis.
This is, perhaps, the earliest and most explicit suggestion that a metaphor can be constructed between content analysis and signal processing. While their terminology is somewhat different from that used more recently, and their specific focus is on the audiovisual stream, this research is conceptually akin to mine. Because it is inter-disciplinary, it is also concerned with usability issues, for example. It bridges two of my topic areas, content analysis in mass communications and signal processing in computer science, thereby giving me some perspective on some of the possible approaches to the problem. More importantly, it serves to demonstrate the "feasibility of designing Content Analysis processes under much faster and much less man powered conditions". This is crucial in opening a new range of questions to be asked, by enabling in-depth longitudinal studies of how the media has evolved, or in my case, of how film sound has changed, and how film theory has accompanied, modified, or been modified by, this change.

3.3  Automated audio classification

[25] report a three-part audio scene segmentation framework. They propose that an auditory scene can be characterized by its sound sources, and the way their dominance over one another varies. This can be analyzed through psycho-acoustic models of the human auditory system, which they propose is defined best by two parameters: memory and attention span. Thus they can vary their segmentation accuracy results as they vary these parameters. This obviously points toward the possibility of manipulation of results. However, flexibility is something to be gained from their method. In any case, [25] report a 97% accuracy in identifying scene changes, with an error probability of 10%, which puts the study in the same range as the figures others are achieving.
This work supports another set of possible variables to be looked at in my study: the relation between auditory and visual scenes. Furthermore, a longitudinal investigation of the sound track could see emerge a trend in how film editing and sound design have evolved over the years. The results reported imply that the state-of-the-art is at the point of implementation in a content analysis study of film sound. Not only that, scene segmentation followed by classification is essentially a hierarchical system, which others have shown to produce the best results.
[15] propose a "real-time classification method to classify audio signals into several basic audio types such as pure speech, music, song, speech with music background, and speech with environmental noise background" (p. 63). Their approach is "generic and model free [and thus their] method can be applied to many applications" (p. 77). Reviewing the literature, they conclude that most approaches are developed toward specific scenarios and, consequently, are of little use in actual implementation to researchers in areas other than those such approaches are optimized toward. They offer a hierarchical system, which classifies sounds into classes coarsely first, and then moves into progressively finer detail. This can be equated to human-coded content analysis, where cues are given to a tester as a way of improving results. The result is a higher than 96% accuracy rate.
There are a number of distinguishing features in this study, when compared to other recent research. [15] offer a flexible model, which is therefore directly applicable toward my research. Their method is computationally inexpensive, due to the nature of the calculations involved. As a result, the system is able to process a file in approximately 1/20th of its duration. This is extremely important when dealing with feature-length audio files. In fact, it may enable an in-depth longitudinal study of sound tracks in film across a number of decades, which easily involves thousands of hours of audio, considering a stratified random sample. While audio features are described mathematically, their specific algorithms are notated in a step-by-step flow diagram. This is certainly easier for laymen to understand and implement through high-level programming environments. Summarizing, their model is remarkably accurate, simple, fast, and adaptable. It is therefore ideal for my project.
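
The coarse-to-fine idea described above can be illustrated with a small decision cascade. The following is only a sketch of the hierarchical principle, not the method of [15]: the three input features (energy, zero-crossing-rate variance, and a harmonic ratio), the class labels and every threshold value are hypothetical placeholders that a real system would have to derive from training data.

```python
def classify_segment(energy, zcr_variance, harmonic_ratio,
                     silence_energy=1e-4,
                     speech_zcr_var=0.01,
                     music_harmonic=0.6):
    """Assign a coarse-to-fine label to one audio segment.

    All three features and all three thresholds are illustrative
    assumptions, not values reported in the literature.
    """
    # Stage 1 (coarsest): separate silence from everything else.
    if energy < silence_energy:
        return "silence"

    # Stage 2: speech-like segments show strongly varying zero-crossing rates.
    if zcr_variance >= speech_zcr_var:
        # Stage 3 (finer): is there a harmonic (musical) background?
        if harmonic_ratio >= music_harmonic:
            return "speech with music"
        return "speech"

    # Stage 2, other branch: steady segments are either music or noise.
    if harmonic_ratio >= music_harmonic:
        return "music"
    return "noise"
```

Each stage only has to make a simple decision, which is one reason such cascades can be cheap to compute. At the reported rate of roughly 1/20th of a file's duration, 1,000 hours of audio would take on the order of 50 hours to process.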

4  Working framework

4.1  Description

Computational media aesthetics contends that "we must understand compositional and aesthetic media principles to guide [automated] content analysis" [20,p. 10]. It is defined as the "algorithmic study of a variety of image and aural elements in media (based on film grammar). It is also the computational analysis of the principles that have emerged underlying their manipulation in the creative art of clarifying, intensifying, and interpreting an event for an audience" [20,p. 11]. It attempts to address a problem raised by multimedia content management (MCM): the semantic gap [1,p. 18]. The semantic gap is "the gulf between the rich meaning and interpretation that users expect systems to associate with their queries for searching and browsing media and the shallow, low-level features (content descriptions) that the systems actually compute" [8,p. 15]. [20] identify two sources for analyzing and interpreting media: first, structuralism is used in film studies as an analytical tool. It consists of segmenting content (film, in this case), and analyzing and interpreting the resulting sections, usually based upon a semiotic approach. Second, film grammar, they consider, is a far richer grounding for the automated content analysis of media. Film grammar constitutes an effective ontology of production knowledge, in that it fairly represents a "worldwide use [of] accepted rules and techniques to solve problems in transforming a story from a written script to a captivating visual and aural narration" [3], cited in [20,p. 10]. These rules and others, covering production practices in other fields of the media, combine to form the general field of media aesthetics. This field's researchers, Zettl argues in [6,p. 582], are responsible for addressing the problem of the semantic gap. What should result from this description of CMA is its underlying interdisciplinary nature, as the diverse contributing panel in [6] exemplifies.
Although computer science and media aesthetics are its main areas, these further subdivide into speech processing, MIR, content-based image retrieval (CBIR), audiovisual segmentation, classification and indexing, film aesthetics, and sound aesthetics.

4.2  Critique

[20] (2001) have their concepts of structuralism and film ontology as pertaining to two different levels of analysis supported by previous research: citing [23], [1,p. 18] summarizes:
The three types of indexes that are generally required [in voice-on-demand applications by end-users], of which two are of interest:
  • Structural (for example, segments, scenes, and shots), and
  • Content (for example, objects and actors in scenes).
Structuralism does not address the semantic gap problem. However, the use of a film ontology in computed analysis must necessarily build upon the results of segmentation and clustering of low-level description units into high-level semantic units. This means one builds on the other. I argue that this hierarchical structure is essential, because it accounts for the development of technology and its specific techniques. While the semantic gap is not resolved, one may implement structuralist approaches in insightful research. [1] reviews in detail past implementations of CMA. There have been multiple attempts at extracting high-level meaning from multimedia content, starting with structuralist approaches during the mid-late nineties [21,14,10] through current research in content-based semantic analysis [26,5,18]. Most of these investigative efforts have used film-specific terminology to guide their efforts. One striking example is found in [7], in which film grammar is used in semantic construct extraction of tone, shot rhythm, and pace in film; this is done using a "primitive feature extraction" (p. 96), which is simply a structural projection of Zettl's lighting, color, time-motion and sound, although they also extract simpler features such as shot length and type. Moreover, media aesthetics, and specifically Zettl's Sight Sound Motion [29], is the framework most widely used for such a terminology. This evidences how automated content analysis has taken into consideration the basic CMA assumptions. Therefore, one may say such efforts have been adequately implemented.

5  Methodology

5.1  Relevance

To understand the relevance of CMA to my research, its premises should be analyzed in detail. These, I've shown, are as follows:
Ideally, the long-term project will span the entire existence of the sound film in a single place (e.g. United States, France), industry (e.g. Hollywood, Bollywood), school of practice (e.g. French modernist, Russian formalist), or genre (e.g. horror, comedy). These requirements pose constraints on the methodology: obviously, the system must be able to accurately classify the sound track elements in a structuralist approach. Additionally, there is a semantic requirement, and that is that the system be able to classify, through whatever low-level features necessary, when an auditory scene begins and ends, and what some of its characteristics are. A less obvious requirement is that this process be fast enough for a timely classification of an adequate sample size of a universe that most likely will span seven decades of film production.
An assessment of the relevant literature shows that these three main requirements are probably achievable with current technology and techniques. Conceptually, my methodology for doing so is computational media aesthetics. Again, this theory describes how to automate the analysis of content, starting with a simple structuralist approach in dividing media elements (whatever they may be), clustering these in categories, and classifying such categories. It combines a multitude of structuralist approaches with production knowledge, or grammar, or terminology (aesthetic principles, simply put) to close in on the semantic gap, thereby creating a system capable of inferring higher-level meaning from simple descriptors. Computational efficiency is achievable currently, and this supports a longitudinal approach to the research. Treating CMA as a methodological framework is defensible because the research it supports is strictly intended to test a set of tools and techniques.

5.2  Tool

Content analysis in mass communication reviews the state-of-the-art in content analysis. [16] report the results of a content analysis of 200 other studies in the literature published between 1994 and 1998. Of these, only 69% report any kind of intercoder reliability; of those that did, many did not include either the size of the reliability sample, its coders, specific reliability variables, amount of training, or how discrepancies between coders were resolved. They go through a detailed overview of measuring reliability, offering percent agreement, Holsti's method, Scott's Pi (π), Cohen's Kappa (κ), and Krippendorff's Alpha (α) as possible indicators. They conclude by proposing a number of recommendations to social scientists involved in content analysis. [16] recommend coefficients of .90 or greater for most indices; .80 or greater is adequate most of the time; .70 or greater may suffice for exploratory or pilot studies. These are the values I use as guidelines in assessing the readiness of automated content analysis methods. Since such studies "often quantify system performance as precision, recall, and F-measure, or as agreement" [12,p. 296], the specific accuracy rates to be looked for will depend substantially on the algorithm used and the statistical method employed to measure its coding reliability.
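
To make the indices named above concrete, the sketch below computes percent agreement, Cohen's Kappa and Scott's Pi for two coders, together with precision, recall and F-measure for a single class. The formulas are the standard two-coder definitions; the function names and the ten example labels are invented for illustration and are not data from any study.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of units on which the two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement using each coder's own marginals."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    # Undefined (division by zero) when expected agreement equals 1.
    return (p_o - p_e) / (1 - p_e)


def scott_pi(a, b):
    """Chance-corrected agreement using the pooled marginals of both coders."""
    n = len(a)
    p_o = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)


def precision_recall_f(reference, predicted, label):
    """Precision, recall and F-measure of `predicted` for one class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented labels: a human coder versus a hypothetical automated classifier.
human = ["speech", "speech", "music", "music", "noise",
         "silence", "speech", "music", "noise", "silence"]
machine = ["speech", "speech", "music", "noise", "noise",
           "silence", "speech", "music", "music", "silence"]

agreement = percent_agreement(human, machine)
kappa = cohen_kappa(human, machine)
pi = scott_pi(human, machine)
music_scores = precision_recall_f(human, machine, "music")
```

In this invented example the raw agreement is noticeably higher than either chance-corrected index, which illustrates why percent agreement alone is a weak basis for the thresholds above.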

5.3  Technique

I will use intercoder reliability indices, not percent agreement, to calculate the accuracy of a variety of different classification systems when applied to film sound tracks. I will begin by manually coding 10% of the sound track universe, with multiple human coders, ideally between three and five, with the following categories of analysis:
This will be done by subjectively discriminating against changes in volume proportions, poor audio quality, et cetera. I cannot clearly state units of analysis due to the nature of the medium under study; these would be pitch, loudness, timbre, formants, and other subjective audio-specific attributes. The auditory scene class is defined as a coherent collection of sound sources, dominated by a few of these sources [25,p. 2441]. Still resorting to the same model, "a scene change is said to occur when the majority of the few dominant sources in the sound change" (ibid). The results of this step provide a baseline suitable for benchmarking the classification accuracies of the computational classification techniques. By contrasting intercoder reliability between different methods, including manual coding, and the several automated content analysis algorithms available in the literature, I will be able to choose the one, if any, best suited to a content analysis of film sound tracks.
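
The scene-change rule quoted above can be expressed as a simple set comparison. The sketch below is one possible reading of that rule, in which "the majority" is taken to mean more than half of the sources that were dominant in the previous analysis window; the window representation, the source names and the function names are hypothetical, and identifying the dominant sources in the first place is the hard part that this sketch does not attempt.

```python
def scene_changed(previous_sources, current_sources):
    """True when more than half of the previously dominant sources are gone."""
    previous = set(previous_sources)
    if not previous:
        return False
    lost = previous - set(current_sources)
    return len(lost) > len(previous) / 2


def scene_boundaries(dominant_by_window):
    """Indices of windows at which a new auditory scene is taken to begin."""
    return [
        i for i in range(1, len(dominant_by_window))
        if scene_changed(dominant_by_window[i - 1], dominant_by_window[i])
    ]


# Invented example: dominant sources per consecutive analysis window.
windows = [
    {"dialogue", "traffic"},
    {"dialogue", "traffic", "horn"},
    {"music", "rain"},
    {"music", "rain", "thunder"},
    {"music", "rain"},
]
boundaries = scene_boundaries(windows)
```

Human coders could apply the same rule when marking auditory scenes, so that manual and automated segmentations are compared on equal terms.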

5.4  Sampling

The sampling universe includes the top-grossing American feature films from 1970 to 2006 as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, and thus excludes independent productions. Feature film is defined as a production at least 40 minutes long released for the theatrical market. I will randomly select three films for each decade (1970-1979, 1980-1989, 1990-1999, and 2000-2006) for analysis. This is in line with recent research employing content analysis methods in film [17]. The choice of American feature films of no specific genre is arbitrary, as the goal is to test the techniques. The choice of top-grossing box office revenue is meant to frame the Hollywood industry specifically, and exclude independent efforts which would necessarily clutter the analysis (one needs to focus somewhere), while at the same time making sure a quality parameter is not being used: it could be argued that box office figures don't show a strong correlation to aesthetic quality, although, to my knowledge, no research specifically supports this. Stratified sampling ensures that no particular time-frame is over-represented. Three movies per decade will probably minimize chance combinations of aberrant results, if [17] is considered.
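
The stratified selection described above, three films drawn at random from each decade, can be sketched as follows. The film list here is a placeholder generated for illustration; the real universe would be the top-250 list itself, and the function name, the seed and the data layout are assumptions.

```python
import random

# Decade strata as defined in the sampling plan (inclusive year ranges).
STRATA = [(1970, 1979), (1980, 1989), (1990, 1999), (2000, 2006)]


def stratified_sample(films, strata, per_stratum=3, seed=0):
    """Draw `per_stratum` titles at random from each stratum.

    `films` is a list of (title, release_year) pairs. A fixed seed is
    used only so that the draw can be reproduced.
    """
    rng = random.Random(seed)
    sample = {}
    for first_year, last_year in strata:
        pool = sorted(
            title for title, year in films if first_year <= year <= last_year
        )
        if len(pool) < per_stratum:
            raise ValueError(f"not enough films in {first_year}-{last_year}")
        sample[(first_year, last_year)] = rng.sample(pool, per_stratum)
    return sample


# Placeholder universe: two invented titles per release year.
films = [
    (f"Film {year}-{k}", year) for year in range(1970, 2007) for k in range(2)
]
sample = stratified_sample(films, STRATA)
```

Because each stratum contributes the same number of films, no decade can dominate the sample, whatever the distribution of top-grossing titles across years.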

6  Limitations

This study measures mainly internal validity, by ensuring I am indeed studying what I propose to. I am not concerned with external validity at this point: eventually, I will conduct a pilot study employing a meta-analysis of sampling sizes, frequencies, and techniques.
There are limitations to the methodology: the obvious one is that the questions that I ask are necessarily bound by the limits of the technology. Until even higher-level semantics are computable from a medium such as audio or video, one cannot investigate more complex production practices. Because of that, I am chiefly concerned with appraising what my methodological limitations are. Hence a preliminary study (the object of my proposal and research question) to determine whether the theoretical considerations exposed in this paper are indeed correct.
However, there are two different ways of looking at such limitations: on the one hand, the research deliberately looks at internal validity only, to keep its scope manageable; the missing external component will be addressed at another time. On the other hand, the expected output from such a study is a new way of using computational techniques to do the work of multiple persons in typical content analysis.

7  Future perspectives

The proposed study is a part of a larger long-term project: the longitudinal study of the evolution of the film sound track since its inception. Another such part is the pilot study mentioned before, intended to test the external validity of the methodology through a meta-analysis of sampling sizes, frequencies, and techniques in film sound content analysis. This is important in that it will optimize sampling procedures in film sound, arguably a field in need of such research (there are many studies for content analysis of newspapers and magazines, specifically). The ultimate goal is to eventually provide a repository of film sound-specific information for researchers. An unintended consequence might be the closing of a small part of the semantic gap, to follow Herbert Zettl's call.

References

[1]
B. Adams. Where does computational media aesthetics fit? IEEE MultiMedia, 10(2):18-27, 2003.
[2]
Rick Altman, editor. Sound Theory Sound Practice. Routledge, New York, 1992.
[3]
D. Arijon. Grammar of the film language. Focal Press, New York, 1976.
[4]
Michel Chion. Audio-Vision. Columbia University Press, New York, 1994.
[5]
M. Davis. Editing out video editing. IEEE MultiMedia, 10(2):54-64, 2003.
[6]
C. Dorai, A. Mauthe, F. Nack, L. Rutledge, T. Sikora, and H. Zettl. Media semantics: who needs it and why? Proceedings of the tenth ACM international conference on Multimedia, pages 580-583, 2002.
[7]
C. Dorai and S. Venkatesh. Bridging the semantic gap in content management systems: Computational media aesthetics. Proceedings of the First Conference on Computational Semiotics for Games and New Media-COSIGN, pages 94-99, 2001.
[8]
C. Dorai and S. Venkatesh. Bridging the semantic gap with computational media aesthetics. IEEE MultiMedia, 10(2):15-17, 2003.
[9]
William Evans. Teaching computers to watch television: Content based image retrieval for content analysis. Social Science Computer Review, 18(3):246-257, 2000.
[10]
Nuno Guimaraes, Nuno Correia, Ines Oliveira, and Joao Martins. Designing computer support for content analysis: A situated use of video parsing and analysis techniques. Multimedia Tools and Applications, (7):159-180, 1998.
[11]
Tomlinson Holman. Sound for Film and Television. Focal Press, Burlington, 2nd edition, 2001.
[12]
G. Hripcsak and A.S. Rothschild. Agreement, the f-measure, and reliability in information retrieval, 2005.
[13]
Hyoung-Gook Kim, Nicolas Moreau, and Thomas Sikora. Audio classification based on the MPEG-7 spectral basis representations. IEEE Transactions on Circuits and Systems for Video Technology, 14(5):716-725, May 2004.
[14]
Rainer Lienhart, Silvia Pfeiffer, and Wolfgang Effelsberg. The MoCA Workbench: support for creativity in movie content analysis. Technical Report TR-95-034, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1995.
[15]
Ruei-Shiang Lin and Ling-Hwei Chen. A new approach for classification of generic audio data. International Journal of Pattern Recognition and Artificial Intelligence, 19(1):63-78, 2005.
[16]
Matthew Lombard, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4):587-604, October 2002.
[17]
Elizabeth Monk-Turner, Peter Ciba, Matthew Cunningham, P. Gregory McIntire, Mark Pollard, and Rebecca Turner. A content analysis of violence in American war movies. Analyses of Social Issues and Public Policy, 4(1):1-11, 2004.
[18]
P. Mulhem, M.S. Kankanhalli, J. Yi, and H. Hassan. Pivot vector space approach for audio-video mixing. IEEE MultiMedia, 10(2):28-40, 2003.
[19]
Walter Murch. In the Blink of an Eye. Silman-James Press, Los Angeles, 2nd edition, 2001.
[20]
F. Nack, C. Dorai, and S. Venkatesh. Computational media aesthetics: Finding meaning beautiful. IEEE MultiMedia, 8(4):10-12, 2001.
[21]
Silvia Pfeiffer, Stephan Fischer, and Wolfgang Effelsberg. Automatic audio content analysis. Technical Report TR-96-008, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1996.
[22]
Srividya Ramasubramanian. A content analysis of the portrayal of India in films produced in the West. The Howard Journal of Communications, 16(4):243-265, 2005.
[23]
L. Rowe, J. Boreczky, and C. Eads. Indexes for user access to large video databases. Proc. Storage and Retrieval for Image and Video Databases, pages 150-161, 1994.
[24]
Amjad Samour and Hyoung-Gook Kim. MPEG-7 audio analyzer: Low level descriptors extractor. Retrieved 17 December 2006, from http://mpeg7lld.nue.tu-berlin.de/, March 2004.
[25]
Hari Sundaram and Shih-Fu Chang. Audio scene segmentation using multiple features, models and timescales. In Acoustics, Speech, and Signal Processing. Proceedings. ICASSP'00. IEEE International Conference on, volume 6, pages 2441-2444, June 2000.
[26]
B.T. Truong, S. Venkatesh, and C. Dorai. Application of computational media aesthetics methodology to extracting color semantics in film. Proceedings of the tenth ACM international conference on Multimedia, pages 339-342, 2002.
[27]
Elizabeth Weis and John Belton, editors. Film Sound: Theory and Practice. Columbia University Press, New York, 1985.
[28]
David Lewis Yewdall. Practical Art of Motion Picture Sound. Focal Press, Burlington, 3rd edition, 2003.
[29]
Herbert Zettl. Sight Sound Motion: applied media aesthetics. Wadsworth Publishing, Belmont, 3rd edition, 1999.

Footnotes:

1Broadcast and Electronic Arts Department, San Francisco State University


This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.