Research
Automatic film sound track classification
MA Thesis proposal accepted
Here is my proposal on Computational media aesthetics methods for automated content analysis of film sound tracks. And here is a detailed paper on temporal decomposition as an audio segmentation method, based on Atal and Ghaemmaghami's work.
Software tools
I have developed a tool for aiding manual sound track classification, an essential but tedious process for establishing baseline results. It works by splitting a sound track into segments of a pre-defined length and semi-automating the process of having the user label each segment with a given class. It runs under the Bourne-Again Shell (Bash) on any UNIX-type OS. It has two dependencies, mp3splt and mpg321, both of which are included in the tarball.
Files in software directory:
- casa-bash-0.2.tar.gz (304KB)
- gpl-3.0.txt (34KB)
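The segment-and-label workflow described above can be illustrated with a minimal Python sketch. This is not the code of the Bash tool itself; the function names `segment_bounds` and `label_segments` are hypothetical, and the `choose_label` callback stands in for the step where the user listens to a segment and picks a class.

```python
def segment_bounds(total_seconds, segment_seconds):
    """Return (start, end) pairs that cover the track in fixed-length steps.

    The last segment is shorter when the track length is not an exact
    multiple of the segment length.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    total = float(total_seconds)
    bounds = []
    start = 0.0
    while start < total:
        end = min(start + segment_seconds, total)
        bounds.append((start, end))
        start = end
    return bounds


def label_segments(bounds, choose_label):
    """Attach to each segment the label returned by choose_label(start, end).

    In an interactive tool, choose_label would play the segment and
    prompt the user; here it is any callable (an assumption of this sketch).
    """
    return [(start, end, choose_label(start, end)) for start, end in bounds]
```

For example, `segment_bounds(25, 10)` yields two 10-second segments followed by one 5-second segment.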
(Old) Summary
My project attempts a computational quantification of certain sound aesthetic variables in key film sound tracks produced by major Hollywood studios from 1970 through 2006. It is split into three stages: internal validity, where computational algorithms are tested on a sample universe; external validity, where sampling frequencies and techniques are tested on varying samples; and a final stage that attempts to sample the full universe as efficiently as possible while retaining validity and reliability.
More detailed proposals for the first and second phases can be found in my papers for BECA 703, "Feasibility of different automated content analysis techniques in film sound classification: A look at internal validity", and BECA 700, "Sampling, feature extraction, and classification techniques for automated content analysis of film sound tracks: A pilot study" (see footnote 1).
Background
I am interested in the encoding (production or synthesis) and decoding (criticism or analysis) processes in film sound; in other words, in how film sound theory and film sound practice relate. I plan to study this through automated analysis of auditory scene characteristics and sound track elements. These two break down into the following units of analysis:
- Auditory scene characteristics:
  - Aural scene length
  - Aural and visual scene correlation
- Sound track elements:
  - Speech
  - Music
  - Noise
  - Silence
  - Speech with music background
  - Speech with noise background
  - Music with noise background
Specifically, I will be looking at how all these variables interrelate and whether they are correlated. An auditory scene is defined as a coherent collection of sound sources, a few of which dominate. Under the same model, a scene change is said to occur when the majority of those dominant sources change.
Subjectively, speech is defined as a portion of the sound track in which dialogue is perceptually unaccompanied by music, environmental sounds, noise, or other transient sounds. Computationally, the definition is left to different algorithms (for example, it might rely on the zero-crossing rate, a low-level feature that measures transitions of polarity in an audio signal). Likewise, music is defined as a portion of the sound track in which a harmonic composition with a varying degree of subjective dissonance (according to Western standards) is perceptually unaccompanied by speech, noise, or other environmental sounds.
Noise encompasses various elements, including foley (e.g., clothes rustling, footsteps), environmental sounds (e.g., wind blowing, traffic), and noise in the strictest sense (e.g., white noise and other types of random or mathematically sequenced inharmonic sounds). Silence means the perceptual absence of any sound. Speech with music or noise background is defined as speech that is perceptually dominant over music or noise. Music with noise background is a portion of the sound track in which music is perceptually dominant over noise.
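To make the zero-crossing rate mentioned above concrete, here is a minimal Python sketch. It assumes a frame is a plain sequence of signed sample values and, by convention, treats a sample of exactly zero as positive. The function name is illustrative and not taken from any particular library.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose polarity differs.

    Assumption of this sketch: a sample equal to zero counts as positive.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)
```

A frame that alternates in sign on every sample gives a rate of 1.0, while a frame that never changes sign gives 0.0.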
Sampling
The sampling universe includes the top-grossing American feature films from 1970 to 2006 as reported by the Internet Movie Database as of 19 November 2006. Top-grossing is defined as being in the top-250 list of all-time U.S. box office revenue, without adjustment for inflation. American means that the production was financed by at least one major American production company, which excludes independent productions. Feature film is defined as a production at least 40 minutes long released for the theatrical market. I will randomly select four films from each period (1970-1979, 1980-1989, 1990-1999, and 2000-2006) for analysis, giving a stratified sample of 16 films (about 15% of the 104 films in the universe).
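The stratified selection described above can be sketched in Python as follows. The universe used here is a placeholder: the even split of 26 films per period and the film titles are invented for illustration and do not reflect the actual distribution of the 104 films. The function name `stratified_sample` is likewise hypothetical.

```python
import random


def stratified_sample(films_by_period, per_period, seed=None):
    """Draw per_period films at random, without replacement, from each period."""
    rng = random.Random(seed)
    sample = {}
    for period, films in films_by_period.items():
        if len(films) < per_period:
            raise ValueError(f"not enough films in period {period}")
        sample[period] = rng.sample(films, per_period)
    return sample


# Placeholder universe: 104 films spread evenly over the four periods.
# Both the even split and the titles are assumptions of this sketch.
PERIODS = ("1970-1979", "1980-1989", "1990-1999", "2000-2006")
universe = {p: [f"{p} film {i + 1}" for i in range(26)] for p in PERIODS}

chosen = stratified_sample(universe, per_period=4, seed=2006)
total = sum(len(films) for films in chosen.values())
fraction = total / sum(len(films) for films in universe.values())
```

With four films per period, `total` is 16 and `fraction` is 16/104, or roughly 15%.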
Methodology
There are two stages involved in computational content analysis of sound tracks: audio segmentation and audio classification. Before a sound track is classified, it needs to be broken down into semantically coherent segments, or auditory scenes, as defined above. This is both a technical requirement in computational techniques and a fortunate coincidence for me since the auditory scene is also of interest to the project.
Audio segmentation
Segmentation is typically performed after a feature extraction step, using an unsupervised clustering process (c-means, k-means, listener model) to find boundary thresholds. These thresholds are then used as audio cuts, which delimit the scenes to be classified into different categories (in my case, the sound track elements itemized above). By storing the number and timestamps of the audio cuts, we can derive auditory scene lengths and synchronicity. The best results seem to have been achieved by [2], using a listener model of memory and attention span.
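Once audio cuts are available, scene lengths follow directly from their timestamps. The Python sketch below assumes cuts are given in seconds from the start of the track. The synchronicity measure shown (the share of audio cuts that fall within a tolerance of some visual cut) is only one possible definition, chosen here for illustration; the names and the 0.5-second default tolerance are assumptions.

```python
def scene_lengths(cut_times, total_duration):
    """Lengths of the scenes delimited by the given cut timestamps (seconds)."""
    edges = [0.0] + sorted(cut_times) + [total_duration]
    return [end - start for start, end in zip(edges, edges[1:])]


def cut_synchronicity(audio_cuts, visual_cuts, tolerance=0.5):
    """Share of audio cuts lying within `tolerance` seconds of a visual cut.

    This is a hypothetical measure, not one prescribed by the literature
    cited in this document.
    """
    if not audio_cuts:
        return 0.0
    matched = sum(
        1
        for a in audio_cuts
        if any(abs(a - v) <= tolerance for v in visual_cuts)
    )
    return matched / len(audio_cuts)
```

For instance, cuts at 10 s and 25 s in a 40-second excerpt give scene lengths of 10, 15, and 15 seconds.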
Audio classification
While some researchers also use clustering algorithms for this step (in fact, in some cases segmentation and classification are performed as a single stage), classification is most often done via a supervised process (nearest-neighbour, Gaussian mixture model, support vector machines). After feature extraction, a model is constructed via instance-based learning, numeric prediction, etc. Each instance inside the audio cut boundaries is compared to known, categorized instances. Euclidean distance (or a more advanced measure) can be used for determining classes. [1] report excellent results using a hierarchical Bayesian decision function model.
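The simplest of the supervised approaches listed above, nearest-neighbour classification with Euclidean distance, can be sketched in a few lines of Python. The two-dimensional feature vectors and their values are invented for illustration; real features would come from the extraction step.

```python
import math


def nearest_neighbour(instance, labelled):
    """Return the label of the labelled instance closest to `instance`.

    `labelled` is a sequence of (feature_vector, label) pairs; distance
    is plain Euclidean distance between feature vectors.
    """
    best_label, best_distance = None, math.inf
    for features, label in labelled:
        distance = math.dist(instance, features)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Hypothetical training instances: (zero-crossing rate, energy) pairs.
training = [
    ((0.10, 0.80), "music"),
    ((0.45, 0.30), "speech"),
    ((0.02, 0.01), "silence"),
]
predicted = nearest_neighbour((0.40, 0.35), training)
```

Here the query point lies closest to the second training instance, so `predicted` is `"speech"`.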
Baselines, training and testing sets
Assuming an unsupervised (clustering) classification model, the process I plan to follow is to perform manual coding/classification on a 50/50 split of the sample. These 8 films will then be treated as a holdout set for a validation procedure. The other 8 films will be used as a training set; appropriate features need to be extracted, and a learner model built. This model will then be used to classify the holdout set. These results can then be compared to the baseline results and intercoder reliability (using, for example, the kappa statistic) computed.
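The kappa statistic mentioned above compares the observed agreement between two sets of labels with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch of Cohen's kappa for two label sequences follows; treating the manual baseline and the automatic classifier as the two "coders" is an assumption of this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of positions where both coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single, identical category
    return (observed - expected) / (1 - expected)


manual = ["speech", "speech", "music", "music"]
automatic = ["speech", "music", "music", "music"]
kappa = cohen_kappa(manual, automatic)
```

In this toy case the two sequences agree on three of four segments, chance agreement is 0.5, and kappa comes out at 0.5.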
It is not clear from the literature how best to combine audio segmentation and classification in my project. Time series tend to defy supervised classification because it is infeasible to manually classify each training instance (most of the time we are dealing with 512-sample windows). A way around this is to use the output of the segmentation method above as the input to a subsequent classification procedure.
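A rough count shows why labelling every window by hand is impractical. The Python sketch below assumes a 44.1 kHz sampling rate, non-overlapping windows, and a two-hour film; none of these figures come from the project itself.

```python
def window_count(duration_seconds, sample_rate=44100, window_size=512):
    """Number of full, non-overlapping windows in a track.

    The 44.1 kHz default rate and the absence of overlap are
    assumptions of this sketch.
    """
    return int(duration_seconds * sample_rate) // window_size


# A hypothetical two-hour film (7200 seconds).
windows_per_film = window_count(2 * 60 * 60)
```

Under these assumptions a single film contains about 620,000 windows, so a training set of eight films would require on the order of five million manual labels.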
References
- [1] Ruei-Shiang Lin and Ling-Hwei Chen. A new approach for classification of generic audio data. International Journal of Pattern Recognition and Artificial Intelligence, 19(1):63-78, 2005.
- [2] Hari Sundaram and Shih-Fu Chang. Audio scene segmentation using multiple features, models and timescales. In Acoustics, Speech, and Signal Processing. Proceedings. ICASSP'00. IEEE International Conference on, volume 6, pages 2441-2444, June 2000.
- [3] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2nd edition, 2000.
Footnotes
1. A brief summary of the project follows. This particular version was prepared for faculty in the Computer Science department. Please visit http://thecity.sfsu.edu/~psilva/research.php for the current status of the project.
