Sampling, feature extraction, and classification techniques for automated content analysis of film sound tracks: A pilot study

Pedro Silva
psilva@sfsu.edu

21 December 2006

1  Introduction

1.1  Background

Film sound is an often overlooked field, be it in general film theory or in actual production practice. There is a lack of serious writing on the subject, even considering the works of [4], [2], and [41] on the theoretical side, and [42], [12], and [23] as practitioners of the medium.
The area seems to be divided into:
  1. Humanistic approaches, coming chiefly from film studies departments in Fine Arts and Humanities schools and focusing on the philosophy of sound as a support for film, albeit with some exceptions, such as the ones mentioned above. Methodologies are argumentative in nature, thereby lacking the potential validity of a scientific method.
  2. Professional approaches, which are usually technical (with the welcome exception of Walter Murch). These usually deal with the application of technology and its incorporation into specific techniques. While these types of contributions are very important to the development of the practice, they add considerably less to the body of higher knowledge than their humanistic counterparts.
  3. The social scientific approach seems to be neglected. Although there are a number of studies using content analysis in film, these mostly are concerned with violence, drug use, et cetera, in film.

1.2  Context

I am focusing on assessing the external validity of automated content analysis in the medium of film sound, through a meta-analysis of different sampling sizes, frequencies, and techniques. It is also a pilot study, exploratory in nature, to my long-term research project, a longitudinal study of the evolution of the film sound track in Hollywood, from the dominance of the sound film starting in the early 1930s through today, specifically looking at two sets of variables: the sound track elements of dialogue, music, noise, and silence, and the auditory scene, independent of the visual scene. The methodology will be automated content analysis; due to its speed and efficiency, this is the underlying catalyst to such a comprehensive study. However, while research on the capabilities of audio-specific data mining has been conducted, particularly in the fields of music information retrieval (MIR) and speech processing, no applications have resulted in film sound investigation. I thus propose to research the possibility of automatically analyzing content in the described setting effectively. This implies that the technology used for this effect must be able to extract and cluster low-level audio features, or units of analysis, and organize them into semantic units, or classes, before any statistically meaningful conclusions may be drawn. There are, then, two components to this methodology: the technological and the technical. The technological component is related to the state-of-the-art in computational data mining; the technical aspect derives from the field of media aesthetics; it is the study of semantic significance in production practice.

1.3  A note on structure

I do not state formalized research questions or hypotheses because of the exploratory nature of the study. In any case, I expect that the above section clarifies my intent sufficiently. The methodology section further discriminates the various steps involved; the relevance section will emphasize some of the reasons to do so.
The literature review is divided into topics, for clarity's sake. However, I feel it is appropriate to address some relevant research in other sections. Specifically, this means additional literature in computational media aesthetics, of which I present a discussion in the Theoretical framework section, and intercoder reliability in content analysis, analyzed in the methodology section.
I am not concerned with any theoretical aspects of film sound in this study. My sole goal is to attempt a social scientific approach to analyzing the medium, through quantitative techniques. There is no section dedicated to research problems; in the sense that these are what must be overcome to conclude the research, such problems are quite simple: to find a way of computationally analyzing audio content with internal validity, and to do so as efficiently as possible, through externally valid sampling methods.

2  Literature review

2.1  Film sound theory

[35] proposes a theory of musique concrete, which strongly builds on his own work at the French radio-television organization starting in the 1940s. Such music uses natural or environmental sounds -concrete sounds- as its building blocks. The idea implies abstracting sound from its biological (source) and semantic (meaning) characteristics. What remains is its objective features. [20] developed this concept with their aural objects:
Ideologically, the aural source is an object, the sound itself a "characteristic." Like any characteristic, it is linked to the object, and that is why identification of the latter suffices to evoke the sound, whereas the inverse is not true. "To understand" a perceptual event is not to describe it exhaustively but to be able to classify and categorize it [...] (pp. 26-7)
A disciple of Schaeffer, [4] presents a theoretical model for audio-visual relations in film. He offers his insight on projections of sound on image through value added by sound and the influence of sound in visual perception (pp. 3-24); he proposes three listening modes: the causal, which we use in tracking sources in everyday life, the semantic, through which we extract meaning from sound, and the reduced, which is used when abstracting listening into sound itself (pp. 25-34). This is certainly a Schaefferian construct in the line of Metz's aural objects. More interestingly, these works have opened other avenues of research, particularly in algorithmic music composition, and sound object programming environments. And although this is not mentioned in any literature, it is possible to link aural objects and the idea of objective, machine perceptible features versus subjective, human-perceptible attributes, to recent research into speech processing, automatic audio indexing and classification, and artificial intelligence. Even more recently, computational media aesthetics has risen as an inter-disciplinary field concerned with the feature-attribute dichotomy, or what is called the semantic gap.
In Vertical perspectives on audio-visual relations, the author addresses the problem of asynchronization and synchronization of film sound. He continues by proposing the formal non-existence of the sound track as a single entity independent from the images (pp. 35-65); [4] develops the concepts of auditory scene as a semantic unit (pp. 66-94) and phantom audio-vision through the acousmatic -offscreen diegetic sound- and acousmetre -offscreen diegetic sound source (pp. 123-137); perhaps more importantly, he outlines an introduction to audiovisual analysis, thereby setting the way for a relatively standardized approach to content analysis of film sound (pp. 185-213).
Asynchronization versus synchronization of sound in film is a much debated topic on the part of theorists, and one I shall address through content analysis, as that is one area where computer science has successfully produced results. On the same methodological note, I am also interested in understanding how auditory scenes can be analyzed automatically, and current literature in the field suggests adequate segmentation efficacy. Although my methodology for analysis of film sound content is different, Chion's methods of observation and general outline for audiovisual analysis undoubtedly inform my research. Finally, a film sound-specific terminology is expanded in this work, one that has been increasingly used in later years by other authors, and one I shall use in my own work, as discussed in the computational media aesthetics (CMA) review. Such terminology is at the root of computational media aesthetics theory, which contends that any media-classification algorithmic process must use a suitable media ontology.

2.2  Computational media aesthetics

Media semantics: Who needs it and why? [6] combines six panel statements that "represent the whole multimedia value chain" (p. 580), and assert authority on specific fields, conferred by their respective authors. Chitra Dorai has written extensively on the subject of computational media semantics [7,8,24]. In fact, [24] first proposed, to my knowledge, the CMA approach. Thomas Sikora is an important source in MPEG-7 audio [14]; Herbert Zettl is most definitely an authority in applied media aesthetics [43]. Their goal is to stimulate discussion in "bridging the semantic gap": that is, how to move away from mere analytical to contextual implementations of computational technology in relation to the ever-increasing amounts of media material available. They specifically address the following questions: . In other words, Zettl (in this case) attributes the responsibility of 'bridging the semantic gap' to media researchers, as opposed to computer scientists.
[6] help understand some of the broader possibilities of my long-term investigation. They support the idea that inter-disciplinary efforts should and must engage in the ongoing research. The concept of film grammar-based automatic content analysis is touched upon here; I expect to develop it into a methodology of content analysis that is based on film sound grammar. The semantic web, MPEG-7 and related conceptualizations of knowledge domains suggest that a future extension of my work could be the development and implementation of a film sound ontology for researchers: in other words, a system for analyzing sound tracks that can be accessed via the web, for example. Some of the tools I will use in this research can be easily implemented in that domain, and a few already are: it is possible to upload an audio file to a web server, and have it come back processed as an MPEG-7 descriptor file containing information about that file's properties [33]. It is possible to implement another system which takes that same MPEG-7 descriptor file and processes it into something semantically meaningful.
Since [24], there has been other research similarly concerned: [7] approach CMA from the content management system's perspective. Specifically in film, [40] look at the extraction of meaning from color. This is in striking resemblance to other research on light variables in film, albeit with sophisticated methods, by comparison [36]; nevertheless, I consider Report on a study of light variables measured as a function of time in the cinema analogous to my proposed study. It is a precursor to modern automated content analysis, done at the time optically-electronically through the use of an apparatus specifically built for it. The output of such a system was a record of light variation and density across time in a number of films. In that sense, the units of analysis were light variation and light density, categorized into what the authors called light rhythm. The sampling methodology was not specified; other than a reference to films from "some of our greatest directors", nothing else is said. One assumes that this was not a concern to the authors. [36] claim that "our 'great' films, such as The Passion of St. Joan, Battleship Potemkin, Earth, et cetera, do show a variation of light intensity which sets them apart from 'nondescript' and nonheralded films" (p. 47). The need for some kind of statistics is obvious in their work. It was probably difficult, at the time, to use either descriptive or analytical statistics on a magneto-optical recording. That would mean converting, by hand, the continuously-varying values their apparatus produced into discrete values, like an analog-to-digital converter. As a result, they have no analytical instrument at hand, and so their interpretation of the findings is necessarily limited.
I find many tangents between [36]'s work and my research. Theirs is a content analysis deviating from traditional approaches because of the medium and units of analysis under observation. Both studies look at an aesthetic function of film. I believe an appropriate sampling methodology is necessary, however, if the study is to have any long-lasting impact on the field. Case studies are interesting, but rarely possess any external validity, thereby lacking generalization potential. Also, I do not think [36] viewed their experimental set-up as anything more than an exploratory look into what might be possible to study -certainly they did not consider what they were doing in terms of content analysis. This would explain the lack of coding specifications and operational definitions. The work also lacks an established theory from which to deduce insight. Conceivably, an exploratory study could -and should- propose its own theory, into which the observed phenomena would fit. As it was, their conclusions were somewhat vague.

2.3  Automated audio classification

Teaching computers to watch television: Content based image retrieval for content analysis [9] Nevertheless, it The first part briefly describes the general structure of Evans's study. The second section argues that
Evans considers current applications of CBIR in social science to be scarce, although the technology is available. He provides a number of examples of content analysis research employing CBIR, namely [10]. Still, he considers the preprocessing of video as the most likely application of CBIR to social science currently and in the near future, as a way of enabling a better human coding practice (p. 252). Asserting a high-level of expertise and high technology costs in current CBIR, social researchers are advised to at least seek computerized assistance in content analysis. Despite its shortcomings, Teaching computers to watch television is important, in more than one aspect: the year it was published was an essential one, as the MPEG-7 standard was ready in 2001. Thus, its publication just before may have alerted many social scientists to the ready benefits of automated content analysis; it is a comprehensive review of the technical and social literature, describing with more than mere superficial detail its inner workings, techniques and tools. It is thus a good starting point for someone starting out in the field; it provides a solid, balanced overview of both technology and its social science application. This is important in that most research tends to fall into the purely technical [10,26,16,14], or the purely social [21,27].
Likewise, [10] report an early implementation of an automated content analysis system adapted to the social sciences. The underlying research was inter-disciplinary, involving a computer science research group and a group of social scientists. The paper discusses multimedia information parsing and matching algorithms more or less generally, as well as video segmentation and the role of information models in maximizing efficiency in computational content analysis (pp. 162-163). It very specifically reports the researchers' experience in implementing the system in 1995 during the Portuguese parliamentary election campaign. The system was used in determining the content of selected news segments as they reported the campaign. They conclude with technical and conceptual recommendations for future research; [10] find the system technically feasible and its output flexible to different needs other than specific content analysis in communications research (p. 177). Moreover, they believe they have been able to bridge the "semantic gap" typically encountered in abstract signal analysis.
This is, perhaps, the earliest and most explicit suggestion that a metaphor can be constructed between content analysis and signal processing. While their terminology is somewhat different than that used more recently, and their specific focus is on the audiovisual stream, this research is conceptually akin to mine. It is also concerned with usability issues, for example, because it is inter-disciplinary. It bridges two of my topic areas: Content analysis in mass communications and signal processing in computer science, thereby giving me some perspective on some of the possible approaches to the problem. More importantly, it serves to demonstrate the "feasibility of designing Content Analysis processes under much faster and much less man powered conditions". This is crucial in opening a new range of questions to be asked, by enabling in-depth longitudinal studies of how the media has evolved -or in my case, of how film sound has changed, and how film theory has accompanied, modified, or been modified by, this change.
[39] report a three-part audio scene segmentation framework: . They propose that an auditory scene can be characterized by its sound sources, and the way their dominance over one another varies. This can be analyzed through psycho-acoustic models of the human auditory system, which they propose is defined best by two parameters: memory and attention span. Thus they can vary their segmentation accuracy results as they vary these parameters. This obviously points toward the possibility of manipulation of results. However, flexibility is something to be gained from their method. In any case, [39] report a 97% accuracy in identifying scene changes, with an error probability of 10%, which puts the study in the same estimate of figures others are achieving.
This work supports another set of possible variables to be looked at in my study: the relation between auditory and visual scenes. Furthermore, a longitudinal investigation of the sound track could see emerge a trend in how film editing and sound design have evolved over the years. The results reported imply that the state-of-the-art is at the point of implementation in a content analysis study of film sound. Not only that, scene segmentation followed by classification is essentially a hierarchical system, which others have shown to produce the best results.
[17] propose a "real-time classification method to classify audio signals into several basic audio types such as pure speech, music, song, speech with music background, and speech with environmental noise background" (p. 63). Their approach is "generic and model free [and thus their] method can be applied to many applications" (p. 77). Reviewing the literature, they conclude that most approaches are developed toward specific scenarios and, consequently, are of little use in actual implementation to researchers in areas other than those such approaches are optimized toward. They offer a hierarchical system, which classifies sounds into classes coarsely first, and then moves into progressively finer detail. This can be equated to human-coded content analysis, where cues are given to a tester as a way of improving results. The result is a higher than 96% accuracy rate.
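The coarse-to-fine idea described above can be illustrated with a minimal sketch. It is not the algorithm of [17]: the `Segment` fields stand in for the outputs of hypothetical upstream detectors, the three thresholds are arbitrary placeholders, and the class labels are borrowed from the categories used later in this proposal. Only the shape of the decision process is the point: a cheap coarse decision first, then progressively finer ones.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of sound track, described by hypothetical scores in [0, 1]."""
    energy: float        # overall level
    speech_score: float  # how speech-like the segment is
    music_score: float   # how music-like the segment is
    noise_score: float   # how noise-like the segment is


def classify(seg, silence_floor=0.02, dominant=0.5, background=0.2):
    """Classify a segment coarsely first, then in finer detail.

    All thresholds are illustrative assumptions, not published values.
    """
    # Stage 1 (coarsest): is there any sound at all?
    if seg.energy < silence_floor:
        return "silence"

    # Stage 2: is speech the dominant element?
    if seg.speech_score >= dominant:
        # Stage 3: is the speech accompanied by something else?
        if max(seg.music_score, seg.noise_score) >= background:
            return "speech with music or noise background"
        return "speech"

    # Stage 2 (other branch): is music the dominant element?
    if seg.music_score >= dominant:
        # Stage 3: is the music accompanied by noise?
        if seg.noise_score >= background:
            return "music with noise background"
        return "music"

    # Whatever remains is treated as noise.
    return "noise"


if __name__ == "__main__":
    print(classify(Segment(energy=0.4, speech_score=0.8,
                           music_score=0.3, noise_score=0.0)))
```

Because each stage only runs when the previous one has narrowed the options, most segments are resolved after one or two comparisons, which is one plausible reason a hierarchical design can be computationally cheap.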
There are a number of distinguishing features in this study, when compared to other recent research. [17] offer a flexible model, which is therefore directly applicable toward my research. Their method is computationally inexpensive, due to the nature of the calculations involved. As a result, the system is able to process a file in approximately 1/20th of its duration. This is extremely important when dealing with feature-length audio files. In fact, it may enable an in-depth longitudinal study of sound tracks in film across a number of decades, which easily involves thousands of hours of audio, considering a stratified random sample. While audio features are described mathematically, their specific algorithms are notated in a step-by-step flow diagram. This is certainly easier for laymen to understand and implement through high-level programming environments. Summarizing, their model is remarkably accurate, simple, fast, and adaptable. It is therefore ideal for my project.
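To make the scale argument concrete, the following sketch draws a stratified random sample by decade and estimates processing time at the 1/20th-of-duration rate reported above. The catalogue, the sample size per decade, and the two-hour running time are all invented for illustration; only the speed-up factor of 20 comes from the text.

```python
import random


def stratified_sample(films_by_decade, per_decade, seed=0):
    """Draw up to `per_decade` films at random from each decade (stratum)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for decade, films in sorted(films_by_decade.items()):
        k = min(per_decade, len(films))
        sample[decade] = rng.sample(films, k)
    return sample


def processing_hours(audio_hours, speedup=20):
    """Machine time needed if a file is processed in 1/speedup of its length."""
    return audio_hours / speedup


if __name__ == "__main__":
    # Hypothetical catalogue: 500 placeholder titles for each of eight decades.
    decades = range(1930, 2010, 10)
    catalogue = {d: [f"{d}s-film-{i}" for i in range(500)] for d in decades}

    sample = stratified_sample(catalogue, per_decade=125)
    n_films = sum(len(films) for films in sample.values())

    assumed_runtime_hours = 2.0  # assumption: average feature length
    audio_hours = n_films * assumed_runtime_hours
    print(n_films, "films,", audio_hours, "hours of audio,",
          processing_hours(audio_hours), "hours of processing")
```

Under these assumed numbers, 1,000 films amount to 2,000 hours of audio and roughly 100 hours of machine time, which is what makes a sample of this size tractable where human coding would not be.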

3  Relevance

3.1  Contemporary research on the topic

While it is tempting to point to general film sound theoretical and technical writings as contemporary research in the field, these truly are primary sources. My research topic strives to study the body of knowledge in the theory and practice fields and arrive at some correlation that will, perhaps, help make some more sense of it. So, if film sound theory is one of my objects of study, along with film sound practice, what exactly constitutes analogous research? Content analysis literature comes to mind.
  1. Effects measurement research: the main research topic employing content measurement in electronic media seems to focus heavily on effects measurement in the audience [21,27,34,38]. These all have in common the methodology, which is content analysis, applied to motion pictures, in the context of social science, communications or media research. They present useful information on sampling strategies, from horizontal strategies such as case studies to vertical approaches, in the case of longitudinal studies.
  2. Automated content analysis: preliminary investigation identified two separate projects researching automatic audio content analysis: the Movie Content Analysis Project (MoCA) has dealt with audio track segmentation and classification, as reported in [26,16]: "[A] first classification should distinguish music, speech, silence, and other sound sequences, because handling of content is fundamentally different for each of these classes." ([26,p.4]). The question has to be, as pointed out above, whether this segmentation, in the context of content analysis, may contribute to answering some of the questions raised. On the other hand, is it even viable, technologically, to segment complex soundtracks? Other authors seem to think so, as they've begun using classifier systems in the categorizing of broadcast news [25]. Likewise, the MPEG-7 generic multimedia content description standard, described in [19], has provisions for audio-specific segmentation in the form of a number of low-level descriptors (the nuclear unit of analysis in a hypothetical study employing the standard). [14] also consider audio classification to be possible with low-level descriptors. Of course, none of this is relevant if a practical tool is not made available. H. Crysandt at the Institute of Communications Engineering, Aachen University, has developed an MPEG-7-based low-level audio descriptor classifier which is freely available to the public.
  3. Automated content analysis in social scientific studies: as technology advances toward precision and efficiency in computational generation, processing and reception of multimedia content, it is inevitable that communications in general and media studies in particular will tend to use it in, one hopes, insightful quantitative research. At this point, however, the scenario is frankly poor. In fact, I haven't been able to find much more than the work of [9] in the use of automated video analysis for the social scientific study of film and television content.
If the much more developed field of content-based image retrieval is not being much used in media studies, then audio-specific applications in film are in much worse shape. I propose to use quantitative research to amass a significant amount of data on film sound's use of different parameters. These can then be used in a diachronic analysis which may help establish a hypothesis on how film sound theory and practice correlate with, if not influence, one another. I realize my approach may be totally irrelevant to film sound theorists. However, it is - apparently - no more distanced from professional practice than post-modern film theory is, in the western film industry.

3.2  The semantic gap

Zettl considers the closing of the semantic gap a responsibility of media researchers [6,p. 582]. He argues that, computational methods being available, what is needed is a higher involvement in media aesthetics. The film sound literature presented here, I argue, does address the problem of film sound aesthetics. And it may be that film theory indeed fills the semantic gap, through semiotics, for example. However, it has striking limitations, which have been pushed far beyond its boundaries: an essentially argumentative approach is of little use to broad research on any medium; the quantitative method film studies uses (content analysis) is incapable of a large-scale longitudinal investigation. On the other hand, I contend that automated audio classification is mature enough to be used in such research, from a communications and media benchmarking stance. The corollary is that computational media aesthetics, and, specifically, its focus on closing the semantic gap, is achievable by today's technological and technical standards.

3.3  The breakdown of the studio system

Not only is this research inter-disciplinary in nature, involving computer and social science, humanistic theory of film, and the professional practice of film sound production and post-production; it is relevant to the professional industry in that it may generate a better understanding of its practices. In fact, it could even benefit a struggling film sound community, which has been severely impacted by the collapse of the Hollywood studio system. Production techniques, set protocol and post-production practices are followed without purpose, being thinly rooted on a vanishing tradition. Film schools do not, generally, educate specialized sound professionals, and trade schools don't even offer film sound-specific programs. An effort which brings non-engineering audio research to the spotlight - any effort - is welcome. Bridging the gap between theory and practice is essential for improving the state of film sound practice.

4  Theoretical framework

4.1  Description

Computational media aesthetics contends that "we must understand compositional and aesthetic media principles to guide [automated] content analysis" [24,p. 10]. It is defined as the "algorithmic study of a variety of image and aural elements in media (based on film grammar). It is also the computational analysis of the principles that have emerged underlying their manipulation in the creative art of clarifying, intensifying, and interpreting an event for an audience" ([24,p. 11]). It attempts to address a problem raised by multimedia content management (MCM): the semantic gap [1,p. 18]. The semantic gap is "the gulf between the rich meaning and interpretation that users expect systems to associate with their queries for searching and browsing media and the shallow, low-level features (content descriptions) that the systems actually compute" [8,p. 15]. [24] identify two sources for analyzing and interpreting media: first, structuralism is used in film studies as an analytical tool. It consists of segmenting content (film, in this case), and analyzing and interpreting the resulting sections, usually based upon a semiotic approach. Second, film grammar, they consider, is a far richer grounding for the automated content analysis of media. Film grammar constitutes an effective ontology of production knowledge, in that it fairly represents a "worldwide use [of] accepted rules and techniques to solve problems in transforming a story from a written script to a captivating visual and aural narration" ([3], cited in [24,p. 10]). These rules and others, covering production practices in other fields of the media, combine to form the general field of media aesthetics. This field's researchers, Zettl argues in [6,p. 582], are responsible for addressing the problem of the semantic gap. What should result from this description of CMA is its underlying interdisciplinary nature, as the diverse contributing panel in [6] exemplifies.
Although computer science and media aesthetics are its main areas, these further subdivide into speech processing, MIR, content-based image retrieval (CBIR), audiovisual segmentation, classification and indexing, film aesthetics, and sound aesthetics.

4.2  Critique

[24] (2001) have their concepts of structuralism and film ontology as pertaining to two different levels of analysis supported by previous research: citing [32], [1,p. 18] summarizes:
The three types of indexes that are generally required [in voice-on-demand applications by end-users], of which two are of interest:
  • Structural (for example, segments, scenes, and shots), and
  • Content (for example, objects and actors in scenes).
Structuralism does not address the semantic gap problem. However, the use of a film ontology in computed analysis must necessarily build upon the results of segmentation and clustering of low-level description units into high-level semantic units. This means one builds on the other. I argue that this hierarchical structure is essential, because it accounts for the development of technology and its specific techniques. While the semantic gap is not resolved, one may implement structuralist approaches in insightful research. [1] reviews in detail past implementations of CMA. There have been multiple attempts at extracting high-level meaning from multimedia content, starting with structuralist approaches during the mid-late nineties [26,16,10] through current research in content-based semantic analysis [40,5,22]. Most of these investigative efforts have used film-specific terminology to guide their efforts. One striking example is found in [7], in which film grammar is used in semantic construct extraction of tone, shot rhythm, and pace in film; this is done using a "primitive feature extraction" (p. 96), which is simply a structural projection of Zettl's lighting, color, time-motion and sound, although they also extract simpler features such as shot length and type. Moreover, media aesthetics, and specifically Sight Sound Motion [43], is the framework most widely used for such a terminology. This evidences how automated content analysis has taken into consideration the basic CMA assumptions. Therefore, one may say such efforts have been adequately implemented.

5  Methodology

5.1  Relevance

To understand the relevance of CMA to my research, its premises should be analyzed in detail. These, I've shown, are as follows:
Ideally, the long-term project will span the entire history of the sound film in a single place (e.g. United States, France), industry (e.g. Hollywood, Bollywood), school of practice (e.g. French modernist, Russian formalist), or genre (e.g. horror, comedy). These requirements pose constraints on the methodology: obviously, the system must be able to accurately classify the sound track elements in a structuralist approach. Additionally, there is a semantic requirement, and that is that the system be able to classify, through whatever low-level features necessary, when an auditory scene begins and ends, and what some of its characteristics are. A less obvious requirement is that this process be fast enough for a timely classification of an adequate sample size of a universe that most likely will encompass seven decades of film production.
An assessment of the relevant literature shows that these three main requirements are probably achievable with current technology and techniques. Conceptually, my methodology for doing so is computational media aesthetics. Again, this theory describes how to automate the analysis of content, starting with a simple structuralist approach in dividing media elements (whatever they may be), clustering these in categories, and classifying such categories. It combines a multitude of structuralist approaches with production knowledge, or grammar, or terminology -aesthetic principles, simply put- to close in on the semantic gap, thereby creating a system capable of inferring higher-level meaning from simple descriptors. Computational efficiency is achievable currently, and this supports a longitudinal approach to the research. Treating CMA as a methodological framework is defensible because the research it supports is strictly intended to test a set of tools and techniques.

5.2  Tools

Content analysis in mass communication reviews the state of the art in content analysis. [18] report the results of a content analysis of 200 other studies in the literature published between 1994 and 1998. Of these, only 69% report any kind of intercoder reliability; of those that did, many did not include the size of the reliability sample, its coders, specific reliability variables, amount of training, or how discrepancies between coders were resolved. They go through a detailed overview of measuring reliability, offering percent agreement, Holsti's method, Scott's pi (π), Cohen's kappa (κ), and Krippendorff's alpha (α) as possible indicators. They conclude by proposing a number of recommendations to social scientists involved in content analysis. [18] recommend coefficients of .90 or greater for most indices; .80 or greater is adequate most of the time; and .70 or greater is appropriate for exploratory or pilot studies. These are the values I use as guidelines in assessing the readiness of automated content analysis methods. Since such studies "often quantify system performance as precision, recall, and F-measure, or as agreement" [13, p. 296], the specific accuracy rates to be looked for will depend substantially on the algorithm used and the statistical method employed to measure its coding reliability.
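To make the indices above concrete, the following is a minimal sketch of percent agreement, Cohen's kappa, and Scott's pi for two coders labelling the same units. The function names, the helper `readiness_label`, and the toy label sequences are illustrative assumptions rather than part of any cited method; the thresholds are the .90/.80/.70 guidelines quoted above.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of units that both coders assigned to the same category."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement; chance uses each coder's own marginals."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    # Undefined (division by zero) if both coders use a single category.
    return (observed - expected) / (1 - expected)


def scott_pi(coder_a, coder_b):
    """Chance-corrected agreement; chance uses the pooled marginals."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    pooled = Counter(coder_a) + Counter(coder_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)


def readiness_label(coefficient):
    """Map a coefficient onto the guideline thresholds quoted in the text."""
    if coefficient >= 0.90:
        return "acceptable for most indices"
    if coefficient >= 0.80:
        return "adequate most of the time"
    if coefficient >= 0.70:
        return "exploratory or pilot studies only"
    return "below the guidelines"


# Toy data (made up for illustration): ten coded units, two disagreements.
coder_a = ["speech"] * 4 + ["music"] * 3 + ["noise"] * 2 + ["silence"]
coder_b = ["speech"] * 3 + ["music"] * 3 + ["noise"] * 3 + ["silence"]

for name, index in [("agreement", percent_agreement),
                    ("kappa", cohen_kappa),
                    ("pi", scott_pi)]:
    value = index(coder_a, coder_b)
    print(f"{name}: {value:.3f} ({readiness_label(value)})")
```

In this toy case raw agreement is higher than either chance-corrected index, which is why the guidelines are best applied to the corrected coefficients rather than to percent agreement alone.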

5.3  Techniques

5.3.1  Preliminary internal validity check

My BECA 703 research paper, Feasibility of different automated content analysis techniques in film sound classification: A look at internal validity, proposes the use of intercoder reliability indices to calculate the accuracy of a variety of different classification systems when applied to film sound tracks. I will begin by manually coding 10% of the sound track universe, with multiple human coders, ideally between three and five, with the following categories of analysis:
This will be done by subjectively discriminating against changes in volume proportions, poor audio quality, et cetera. I cannot clearly state units of analysis due to the nature of the medium under study; these would be pitch, loudness, timbre, formants, and other subjective audio-specific attributes. The auditory scene class is defined as a coherent collection of sound sources, dominated by some of these sources [39, p. 2441]. Still resorting to the same model, "a scene change is said to occur when the majority of the few dominant sources in the sound change" (ibid). Subjectively, speech is defined as a portion of the sound track in which dialogue is perceptually unaccompanied by music, environmental sounds, noise, or other transient sounds. Computationally, the definition is left to different algorithms (for example, it might rely on the zero-crossing rate, a low-level feature that measures transitions of polarity in an audio signal). Likewise, music is defined as a portion of the sound track in which a harmonic composition with a varying degree of subjective dissonance (according to Western standards) is perceptually unaccompanied by speech, noise, or other environmental sounds. Noise encompasses various elements, including foley (e.g., clothes rustling, footsteps), environmental sounds (e.g., wind blowing, traffic), and noise in the strictest sense (e.g., white noise and other types of random or mathematically sequenced inharmonic sounds). Silence means the perceptual absence of any sound. Speech with music or noise background is defined as speech that is perceptually dominant over music or noise. Music with noise background is a portion of the sound track in which music is perceptually dominant over noise. Again, the computational interpretation of these events is left to specific algorithms that are discussed elsewhere.
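As an illustration of the kind of low-level feature mentioned above, the following sketch computes the zero-crossing rate of a single frame, that is, the fraction of adjacent sample pairs whose polarity differs. It uses only the standard library; the function names and the synthetic sine tones are assumptions made for the example, and no particular classifier from the literature is implied.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in the frame whose signs differ."""
    crossings = sum(
        1
        for previous, current in zip(frame, frame[1:])
        if (previous >= 0) != (current >= 0)
    )
    return crossings / (len(frame) - 1)


def sine_tone(frequency_hz, sample_rate_hz, n_samples):
    """Synthetic test signal (hypothetical helper, not real film audio)."""
    return [
        math.sin(2 * math.pi * frequency_hz * i / sample_rate_hz)
        for i in range(n_samples)
    ]


# 100 ms frames at the CD sampling rate of 44.1 kHz.
low_tone = sine_tone(100, 44100, 4410)
high_tone = sine_tone(3000, 44100, 4410)

print("ZCR, 100 Hz tone:", zero_crossing_rate(low_tone))
print("ZCR, 3000 Hz tone:", zero_crossing_rate(high_tone))
```

A higher-frequency signal changes polarity more often per frame, so its zero-crossing rate is higher; how such a value maps onto the speech, music, and noise classes is left to the specific algorithm.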
The results of this step provide a baseline suitable for benchmarking the classification accuracies of the computational classification techniques. By contrasting intercoder reliability across different methods, including manual coding and the several automated content analysis algorithms available in the literature, I will be able to choose the one, if any, best suited to a content analysis of film sound tracks. This preliminary step will then lead to the procedures proposed in the following section.
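Benchmarking an automated classifier against the manual baseline can be sketched as a per-class computation of precision, recall, and F-measure. This is a minimal illustration under stated assumptions: the function name and the two label sequences are hypothetical, the manual codes are treated as the reference, and the balanced F-measure (harmonic mean of precision and recall) is used.

```python
def precision_recall_f(reference, predicted, label):
    """Precision, recall, and balanced F-measure for one class.

    `reference` holds the manual (baseline) codes and `predicted` the codes
    produced by an automated classifier for the same units.
    """
    pairs = list(zip(reference, predicted))
    true_pos = sum(r == label and p == label for r, p in pairs)
    false_pos = sum(r != label and p == label for r, p in pairs)
    false_neg = sum(r == label and p != label for r, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration): ten units coded manually and by machine.
manual = ["speech"] * 4 + ["music"] * 3 + ["noise"] * 2 + ["silence"]
automated = ["speech"] * 3 + ["music"] * 3 + ["noise"] * 3 + ["silence"]

for label in ["speech", "music", "noise", "silence"]:
    p, r, f = precision_recall_f(manual, automated, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Reporting the measures per class makes it visible when a classifier performs well on one sound track element, such as speech, while confusing others, such as music and noise.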

5.3.2  External validity of sampling methods: size and frequency

Sampling rationale   The sampling universe includes the top-grossing American feature films from 1970 to 2006 as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, and thus excludes independent productions. Feature film is defined as a production at least 40 minutes long released for the theatrical market. I will randomly select four films for each decade (1970-1979, 1980-1989, 1990-1999, and 2000-2006) for analysis. This is in line with recent research employing content analysis methods in film [21]. Given that my research is intended as a pilot study for a longitudinal investigation, this methodology is adequate: the choice of American feature films of no specific genre is arbitrary, as the goal is to test the techniques. The choice of top-grossing box office revenue is meant to frame the Hollywood industry specifically and to exclude independent efforts, which would necessarily clutter the analysis (one needs to focus somewhere), while at the same time making sure a quality parameter is not being used: it could be argued that box office figures do not show a strong correlation with aesthetic quality, although, to my knowledge, no research specifically supports this. Stratified sampling ensures that no particular time frame is over-represented. Four movies per decade will minimize chance combinations of aberrant results.
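The stratified random selection described above can be sketched as follows. The decade boundaries and the four-films-per-decade figure come from the text; the function name, the `seed` parameter, and the short `FILMS` excerpt (twenty titles taken from the filmography in Appendix A) are assumptions made for the example.

```python
import random

# Strata as defined in the sampling rationale.
DECADES = [(1970, 1979), (1980, 1989), (1990, 1999), (2000, 2006)]

# Short excerpt of the filmography, as (title, year) pairs.
FILMS = [
    ("Patton", 1970), ("The Godfather", 1972), ("Jaws", 1975),
    ("Taxi Driver", 1976), ("Star Wars", 1977),
    ("Raging Bull", 1980), ("Blade Runner", 1982), ("Amadeus", 1984),
    ("Aliens", 1986), ("Die Hard", 1988),
    ("Goodfellas", 1990), ("Unforgiven", 1992), ("Pulp Fiction", 1994),
    ("Fargo", 1996), ("The Matrix", 1999),
    ("Gladiator", 2000), ("Shrek", 2001), ("Big Fish", 2003),
    ("Crash", 2004), ("The Departed", 2006),
]


def stratified_sample(films, per_stratum=4, seed=None):
    """Draw `per_stratum` films at random, without replacement, per decade.

    Raises ValueError if a decade holds fewer films than requested.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    sample = {}
    for start, end in DECADES:
        stratum = [film for film in films if start <= film[1] <= end]
        sample[(start, end)] = rng.sample(stratum, per_stratum)
    return sample


for decade, chosen in stratified_sample(FILMS, per_stratum=4, seed=2006).items():
    print(decade, [title for title, _ in chosen])
```

Recording the seed alongside the results would let the same sample be reconstructed later, which matters when progressively smaller samples are compared against the same baseline.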
Meta-analysis   Others have written on the subject of achieving an appropriate balance between sample size, length, stratification, and validity [37,29,15,30,28]. However, I am not aware of such studies in content analysis of film. Hence, an important part of the research will be the design of a cumulative meta-analysis of sampling methods. The technique's starting point is stratified random sampling covering the period from 1970 to 2006. Starting with a baseline of four full-length items per decade, I will construct progressively less representative samples. This will entail fewer items per decade (four, three, two, and one film per decade), variations in item length (full-length, three-quarters, one-half, and one-quarter of the total length), and variations in the length of the unit of analysis for coding (starting with typical times for film content analysis, five to ten minutes, then moving to typical coding periods in automated content analysis, 512 samples or 100 milliseconds at CD resolution, and finally looking at human perception-related values of 1,323 samples or 30 milliseconds, as proposed by [11]). A simple analysis of mean differences in the studies' classification accuracy or intercoder reliability coefficients will provide an understanding of the relative validity of each technique. This is essential, because the long-term objective is to conduct a longitudinal study spanning more than seven decades, which necessarily means that a high volume of material will have to be processed and coded, albeit computationally. The use of a technique which reduces computational times while preserving internal and external validity is therefore required.
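The design described above amounts to a grid of sampling conditions, each compared against the baseline by a difference in mean accuracy or reliability. The sketch below enumerates that grid and defines the comparison; the variable names, the unit-length labels, and the function names are assumptions made for illustration, and no accuracy figures are implied.

```python
from itertools import product
from statistics import mean

# Levels taken from the meta-analysis description.
FILMS_PER_DECADE = [4, 3, 2, 1]
ITEM_FRACTIONS = [1.0, 0.75, 0.5, 0.25]
UNIT_LENGTHS = [
    "film content analysis (5-10 min)",
    "automated coding period",
    "perceptual window (30 ms)",
]


def sampling_conditions():
    """Every combination of sample size, item length, and coding unit."""
    return [
        {"films_per_decade": films, "item_fraction": fraction, "unit": unit}
        for films, fraction, unit in product(
            FILMS_PER_DECADE, ITEM_FRACTIONS, UNIT_LENGTHS
        )
    ]


def mean_difference(baseline_scores, condition_scores):
    """Mean accuracy (or reliability) of a condition minus the baseline's."""
    return mean(condition_scores) - mean(baseline_scores)


conditions = sampling_conditions()
baseline = conditions[0]  # four full-length films per decade, longest unit
print(len(conditions), "conditions; baseline:", baseline)
```

A mean difference close to zero for a reduced condition would suggest that the cheaper sample preserves the baseline's results, which is the property the long-term study needs.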

6  Conclusion

I've argued that film sound theory, on its own, is insufficient to answer questions of correlation of ideology, social significance, cultural impact, and aesthetic understanding with the actual practice of producing sound for film. There are neither definitive answers nor supporting statistics to evidence what is theorized. On the same note, content analysis, on its own, is unable to achieve meaning or understand causes. Automated content analysis, by its nature, is particularly blind to varying shades of significance. Computational media aesthetics, however, proposes to solve this feature-attribute, objective-subjective, deductive-inductive dichotomy. I propose to optimize a small part of the process: to understand how sampling size, frequency, and technique impact the results of an automated content analysis of film sound. In the process, I hope to explore the possibilities of research for a long-term longitudinal study of the evolution of the sound track. Specifically, I will be looking for interesting correlations between theory and practice, such as those proposed in general form in diffusion of innovations theory [31]. The focus of the study is on assessing external validity (sampling), and it will build on a preliminary investigation of its internal validity (classification).

References

[1]
B. Adams. Where does computational media aesthetics fit? IEEE Multimedia, 10(2):18-27, 2003.
[2]
Rick Altman, editor. Sound Theory/Sound Practice. Routledge, New York, 1992.
[3]
D. Arijon. Grammar of the film language. Focal Press, New York, 1976.
[4]
Michel Chion. Audio-Vision. Columbia University Press, New York, 1994.
[5]
M. Davis. Editing out video editing. IEEE MultiMedia, 10(2):54-64, 2003.
[6]
C. Dorai, A. Mauthe, F. Nack, L. Rutledge, T. Sikora, and H. Zettl. Media semantics: who needs it and why? Proceedings of the tenth ACM international conference on Multimedia, pages 580-583, 2002.
[7]
C. Dorai and S. Venkatesh. Bridging the semantic gap in content management systems: Computational media aesthetics. Proceedings of the First Conference on Computational Semiotics for Games and New Media-COSIGN, pages 94-99, 2001.
[8]
C. Dorai and S. Venkatesh. Bridging the semantic gap with computational media aesthetics. IEEE MultiMedia, 10(2):15-17, 2003.
[9]
William Evans. Teaching computers to watch television: Content-based image retrieval for content analysis. Social Science Computer Review, 18(3):246-257, 2000.
[10]
Nuno Guimaraes, Nuno Correia, Ines Oliveira, and Joao Martins. Designing computer support for content analysis: A situated use of video parsing and analysis techniques. Multimedia Tools and Applications, (7):159-180, 1998.
[11]
H. Haas. Über den Einfluss eines Einfachechos auf die Hörsamkeit von Sprache. Acustica, 1951.
[12]
Tomlinson Holman. Sound for Film and Television. Focal Press, Burlington, 2nd edition, 2001.
[13]
G. Hripcsak and A.S. Rothschild. Agreement, the f-measure, and reliability in information retrieval, 2005.
[14]
Hyoung-Gook Kim, Nicolas Moreau, and Thomas Sikora. Audio classification based on MPEG-7 spectral basis representations. IEEE transactions on circuits and systems for video technology, 14(5):716-725, May 2004.
[15]
S. Lacy. Sample size in content analysis of weekly newspapers. Journalism and Mass Communication Quarterly, pages 336-345, 1995.
[16]
Rainer Lienhart, Silvia Pfeiffer, and Wolfgang Effelsberg. The MoCA Workbench: support for creativity in movie content analysis. Technical Report TR-95-034, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1995.
[17]
Ruei-Shiang Lin and Ling-Hwei Chen. A new approach for classification of generic audio data. International Journal of Pattern Recognition and Artificial Intelligence, 19(1):63-78, 2005.
[18]
Matthew Lombard, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4):587-604, October 2002.
[19]
José M. Martinez, Rob Koenen, and Fernando Pereira. MPEG-7: the generic multimedia content description standard, part 1. IEEE Multimedia, 9(2):78-87, April-June 2002.
[20]
C. Metz and G. Gurrieri. Aural objects. Yale French Studies, (60):24-32, 1980.
[21]
Elizabeth Monk-Turner, Peter Ciba, Matthew Cunningham, P. Gregory McIntire, Mark Pollard, and Rebecca Turner. A content analysis of violence in American war movies. Analyses of Social Issues and Public Policy, 4(1):1-11, 2004.
[22]
P. Mulhem, M.S. Kankanhalli, J. Yi, and H. Hassan. Pivot vector space approach for audio-video mixing. IEEE MultiMedia, 10(2):28-40, 2003.
[23]
Walter Murch. In the Blink of an Eye. Silman-James Press, Los Angeles, 2nd edition, 2001.
[24]
F. Nack, C. Dorai, and S. Venkatesh. Computational media aesthetics: Finding meaning beautiful. IEEE Multimedia, 8(4):10-12, 2001.
[25]
Tin Lay Nwe and Haizhou Li. Broadcast news segmentation by audio type analysis. In Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05)., volume 2, pages 1065-1068. IEEE International Conference, March 2005.
[26]
Silvia Pfeiffer, Stephan Fischer, and Wolfgang Effelsberg. Automatic audio content analysis. Technical Report TR-96-008, University of Mannheim, Department for Mathematics and Computer Science, Mannheim, 1996.
[27]
Srividya Ramasubramanian. A content analysis of the portrayal of India in films produced in the West. The Howard Journal of Communications, 16(4):243-265, 2005.
[28]
D. Riffe. The effectiveness of simple and stratified random sampling in broadcast news content analysis. Journalism and Mass Communication Quarterly, pages 159-168, 1996.
[29]
D. Riffe, C. Aust, and S. Lacy. The effectiveness of random, consecutive day and constructed week sampling in newspaper content analysis. Journalism Quarterly, pages 133-139, 1993.
[30]
D. Riffe, S. Lacy, and M.W. Drager. Sample size in content analysis of weekly news magazines. Journalism And Mass Communication Quarterly, 73:635-644, 1996.
[31]
E.M. Rogers. Diffusion of innovations. Free Press, New York, 1995.
[32]
L. Rowe, J. Boreczky, and C. Eads. Indexes for user access to large video databases. Proc. Storage and Retrieval for Image and Video Databases, pages 150-161, 1994.
[33]
Amjad Samour and Hyoung-Gook Kim. MPEG-7 audio analyzer: Low level descriptors extractor. Retrieved 17 December 2006, from http://mpeg7lld.nue.tu-berlin.de/, March 2004.
[34]
Barry S. Sapolsky, Fred Molitor, and Sarah Luque. Sex and violence in slasher films: Re-examining the assumptions. Journalism and Mass Communication Quarterly, 80(1):28-38, 2003.
[35]
P. Schaeffer. Musique concrète. Candide, 1968.
[36]
Robert Steele and Robert Logan. Report on a study of light variables measured as a function of time in the cinema. The Journal of the Society of Cinematologists, 4:37-54, 1964-1965.
[37]
G.H. Stempel. Sample size for classifying subject matter in dailies. Journalism Quarterly, 29:333-334, 1952.
[38]
Susannah R. Stern. Messages from teens on the big screen: smoking, drinking, and drug use in teen-centered films. Journal of Health Communication, 10(4):331-346, 2005.
[39]
Hari Sundaram and Shih-Fu Chang. Audio scene segmentation using multiple features, models and timescales. In Acoustics, Speech, and Signal Processing. Proceedings. ICASSP'00. IEEE International Conference on, volume 6, pages 2441-2444, June 2000.
[40]
B.T. Truong, S. Venkatesh, and C. Dorai. Application of computational media aesthetics methodology to extracting color semantics in film. Proceedings of the tenth ACM international conference on Multimedia, pages 339-342, 2002.
[41]
Elizabeth Weis and John Belton, editors. Film Sound: Theory and Practice. Columbia University Press, New York, 1985.
[42]
David Lewis Yewdall. Practical Art of Motion Picture Sound. Focal Press, Burlington, 3rd edition, 2003.
[43]
Herbert Zettl. Sight Sound Motion: applied media aesthetics. Wadsworth Publishing, Belmont, 3rd edition, 1999.

A  Filmography

The following list enumerates the top-grossing American feature films from 1970 to 2006 as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, and thus excludes independent productions. Feature film is defined as a production at least 40 minutes long released primarily for the theatrical market.
  1. Patton (1970)
  2. Harold and Maude (1971)
  3. The Godfather (1972)
  4. The Exorcist (1973)
  5. The Sting (1973)
  6. Chinatown (1974)
  7. The Conversation (1974)
  8. The Godfather: Part II (1974)
  9. Young Frankenstein (1974)
  10. Dog Day Afternoon (1975)
  11. Jaws (1975)
  12. One Flew Over the Cuckoo's Nest (1975)
  13. Taxi Driver (1976)
  14. Annie Hall (1977)
  15. Star Wars (1977)
  16. The Deer Hunter (1978)
  17. Apocalypse Now (1979)
  18. Manhattan (1979)
  19. The Elephant Man (1980)
  20. Raging Bull (1980)
  21. The Shining (1980)
  22. Star Wars: Episode V - The Empire Strikes Back (1980)
  23. Raiders of the Lost Ark (1981)
  24. Blade Runner (1982)
  25. The Thing (1982)
  26. A Christmas Story (1983)
  27. Scarface (1983)
  28. Star Wars: Episode VI - Return of the Jedi (1983)
  29. Amadeus (1984)
  30. Once Upon a Time in America (1984)
  31. The Terminator (1984)
  32. Back to the Future (1985)
  33. Aliens (1986)
  34. Platoon (1986)
  35. Stand by Me (1986)
  36. Full Metal Jacket (1987)
  37. The Princess Bride (1987)
  38. Die Hard (1988)
  39. Glory (1989)
  40. Indiana Jones and the Last Crusade (1989)
  41. Goodfellas (1990)
  42. The Silence of the Lambs (1991)
  43. Terminator 2: Judgment Day (1991)
  44. Reservoir Dogs (1992)
  45. Unforgiven (1992)
  46. Groundhog Day (1993)
  47. Schindler's List (1993)
  48. Ed Wood (1994)
  49. Forrest Gump (1994)
  50. Pulp Fiction (1994)
  51. The Shawshank Redemption (1994)
  52. Braveheart (1995)
  53. Heat (1995)
  54. Se7en (1995)
  55. Toy Story (1995)
  56. Twelve Monkeys (1995)
  57. The Usual Suspects (1995)
  58. Fargo (1996)
  59. Sling Blade (1996)
  60. L.A. Confidential (1997)
  61. American History X (1998)
  62. The Big Lebowski (1998)
  63. Saving Private Ryan (1998)
  64. American Beauty (1999)
  65. Fight Club (1999)
  66. The Green Mile (1999)
  67. Magnolia (1999)
  68. The Matrix (1999)
  69. The Sixth Sense (1999)
  70. The Straight Story (1999)
  71. Toy Story 2 (1999)
  72. Gladiator (2000)
  73. Memento (2000)
  74. Requiem for a Dream (2000)
  75. Snatch. (2000)
  76. Wo hu cang long (2000)
  77. Donnie Darko (2001)
  78. The Lord of the Rings: The Fellowship of the Ring (2001)
  79. Monsters, Inc. (2001)
  80. Shrek (2001)
  81. Cidade de Deus (2002)
  82. The Lord of the Rings: The Two Towers (2002)
  83. Big Fish (2003)
  84. Finding Nemo (2003)
  85. Kill Bill: Vol. 1 (2003)
  86. The Lord of the Rings: The Return of the King (2003)
  87. Mystic River (2003)
  88. Pirates of the Caribbean: The Curse of the Black Pearl (2003)
  89. Before Sunset (2004)
  90. Crash (2004/I)
  91. Eternal Sunshine of the Spotless Mind (2004)
  92. Finding Neverland (2004)
  93. Hotel Rwanda (2004)
  94. The Incredibles (2004)
  95. Kill Bill: Vol. 2 (2004)
  96. Million Dollar Baby (2004)
  97. Batman Begins (2005)
  98. Cinderella Man (2005)
  99. Sin City (2005)
  100. V for Vendetta (2005)
  101. Walk the Line (2005)
  102. Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan (2006)
  103. The Departed (2006)
  104. Little Miss Sunshine (2006)

Footnotes:

1Broadcast and Electronic Arts Department, San Francisco State University




This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.