Image-based facial animation system for interactive services



Mouth Animation:


    1. Analysis

        In the analysis stage, a database of images of the deformable facial parts of the human subject is created. The input to the visual analysis is a video sequence and a face model of the recorded subject. To position the face model on the subject in the initial frame, facial features such as eye corners and nostrils have to be localized. These features are chosen because they are unaffected by local deformations such as a moving jaw or a blinking eye. Furthermore, the camera is calibrated, so the intrinsic camera parameters, such as the focal length, are known. Thus, only the position and orientation of the model in the initial image must be reconstructed.

        Afterwards, the pose of the head is estimated in each frame by a gradient-based motion estimation algorithm, which determines the three rotation and three translation parameters. Accurate pose estimation is required to avoid a jerky animation. Once the motion parameters have been calculated for each frame, the mouth samples are normalized, that is, compensated for head pose variations, and stored in a database.

        Each mouth sample is characterized by audio and visual information. The audio information is obtained by phoneme labeling, which assigns a phoneme to each frame. The visual information consists of geometrical features (mouth height and width) and pixel information. Since describing an image directly by its pixels yields a very high-dimensional feature vector, the normalized mouth images are processed by Principal Component Analysis (PCA). Each image can then be characterized sufficiently by a few PCA parameters.

        Hence, each frame is characterized by its phonetic content, geometrical features and PCA parameters. Our database consists of approximately 20000 images, which corresponds to about 10 minutes of recording time.
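
        The PCA step can be illustrated with a minimal sketch. It is not the implementation used in the project: it assumes each normalized mouth image has been flattened into a list of pixel values, and it finds the leading principal components by plain power iteration. The names pca_fit and reconstruct, and the toy four-pixel "images", are hypothetical.

```python
import math
import random


def pca_fit(samples, n_components, n_iter=200, seed=0):
    """Return (mean, components, coefficients) for equal-length pixel vectors.

    Assumption: the leading components are found by power iteration on the
    covariance matrix, which is applied implicitly as X^T (X v).
    If the data has fewer than n_components directions of variance,
    the surplus components come out as zero vectors.
    """
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    centered = [[s[j] - mean[j] for j in range(dim)] for s in samples]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(n_iter):
            # Remove the directions that were already extracted.
            for c in components:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            # Multiply by the (unnormalized) covariance matrix.
            proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
            v = [sum(proj[i] * centered[i][j] for i in range(n))
                 for j in range(dim)]
            norm = math.sqrt(sum(a * a for a in v))
            if norm == 0.0:
                break
            v = [a / norm for a in v]
        components.append(v)
    # The PCA parameters of an image are its projections on the components.
    coefficients = [[sum(a * b for a, b in zip(row, c)) for c in components]
                    for row in centered]
    return mean, components, coefficients


def reconstruct(mean, components, coefficient):
    """Approximate an image from its few PCA parameters."""
    return [m + sum(w * c[j] for w, c in zip(coefficient, components))
            for j, m in enumerate(mean)]


# Toy data: four 4-pixel "images" that vary along a single direction.
images = [[0.0, 0.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0],
          [2.0, 2.0, 0.0, 0.0],
          [3.0, 3.0, 0.0, 0.0]]
mean, components, coefficients = pca_fit(images, n_components=1)
approx = reconstruct(mean, components, coefficients[3])
```

        In this toy case one PCA parameter per image is enough to recover all four pixel values, which is the effect the database relies on at a much larger scale.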



    2. Synthesis

        In order to synthesize photo-realistic facial animations, the appropriate mouth samples for the spoken output need to be selected. Synchronization between lip movements and the audio is essential here. One difficulty is coarticulation: for instance, the lips start moving before the sound is produced.

        The text-to-speech (TTS) system provides the unit selection system with phonemes and their durations, which are transformed into a sequence of target feature vectors T_0, ..., T_i, ..., T_N. The goal of the unit selection algorithm is to find the most appropriate mouth sample in the database for each target feature vector T_i by minimizing a cost function. Two types of cost are assigned to each candidate sample in the database. To achieve appropriate mouth movements for the spoken output, target costs (TC) are computed, which take the phonetic context of the image into account. However, not only lip synchronization but also a smooth transition between consecutive frames is important. Therefore, concatenation costs (CC), which consider the visual differences between consecutive samples, are computed. The well-known Viterbi algorithm is used to find the lowest-cost path.

        Image rendering then overlays the selected mouth samples, synchronized with the generated speech, on a background video sequence. Background sequences are recorded video sequences of the human subject with typical short head movements. In order to conceal illumination differences between an image of the background video and the current mouth sample, the mouth samples are blended into the background sequence using alpha blending.
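
        The blending step can be sketched as follows. This is an illustration under stated assumptions, not the project's renderer: images are grayscale and stored as nested lists of floats, and the alpha mask simply falls off linearly towards the border of the mouth patch. The function names and the border parameter are hypothetical.

```python
def feathered_mask(height, width, border):
    """Alpha mask that is 0 at the patch edge and reaches 1 after `border` pixels.

    Assumption: a linear fall-off; the source does not specify the mask shape.
    """
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distance (in pixels) to the nearest edge of the patch.
            d = min(y, x, height - 1 - y, width - 1 - x)
            row.append(min(1.0, d / border))
        mask.append(row)
    return mask


def alpha_blend(background, mouth, alpha, top, left):
    """Return a copy of `background` with `mouth` blended in at (top, left).

    Each output pixel is alpha * mouth + (1 - alpha) * background.
    """
    out = [row[:] for row in background]
    for y, (mouth_row, alpha_row) in enumerate(zip(mouth, alpha)):
        for x, (m, a) in enumerate(zip(mouth_row, alpha_row)):
            b = background[top + y][left + x]
            out[top + y][left + x] = a * m + (1.0 - a) * b
    return out


# Toy example: a brighter 3x3 mouth patch placed on a 5x5 background frame.
background = [[100.0] * 5 for _ in range(5)]
mouth = [[200.0] * 3 for _ in range(3)]
mask = feathered_mask(3, 3, border=1)
frame = alpha_blend(background, mouth, mask, top=1, left=1)
```

        Because the mask is zero at the patch border, the background pixel values are kept there and the brightness difference between the two sources does not produce a visible seam.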

        The figure illustrates the unit selection algorithm for the input text "hello". Audio and phonemes are produced by the TTS. Many candidate images are available for each frame. Using the defined cost functions, the optimal path can be determined to produce the desired animation. The selected mouth sequence consists of several segments, where a segment is a chunk of consecutively recorded mouth images.
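
        The path search can be sketched with a small Viterbi implementation. The cost functions below are deliberately simplified assumptions, not the ones used in the project: the target cost only checks whether the phoneme matches, and the concatenation cost is zero for consecutively recorded frames and otherwise the difference in mouth height. MouthSample, select_units and the toy database are hypothetical.

```python
from collections import namedtuple

# Hypothetical record: recorded frame number, phoneme label, mouth height.
MouthSample = namedtuple("MouthSample", "frame phoneme mouth_height")


def target_cost(target_phoneme, sample):
    """Assumed TC: 0 if the sample's phoneme matches the target, else 1."""
    return 0.0 if sample.phoneme == target_phoneme else 1.0


def concatenation_cost(prev, cur):
    """Assumed CC: free inside a recorded segment, else the visual difference."""
    if cur.frame == prev.frame + 1:
        return 0.0
    return abs(cur.mouth_height - prev.mouth_height)


def select_units(targets, candidates, tc, cc):
    """Viterbi search: return (lowest-cost sample path, total cost)."""
    # best[i][k]: lowest cost of any path that ends in candidate k of frame i.
    best = [[tc(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, pointers = [], []
        for cur in candidates[i]:
            costs = [best[i - 1][j] + cc(prev, cur)
                     for j, prev in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + tc(targets[i], cur))
            pointers.append(j_min)
        best.append(row)
        back.append(pointers)
    # Trace the cheapest path back from the last frame.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path, total


database = [MouthSample(10, "h", 2.0), MouthSample(11, "e", 5.0),
            MouthSample(12, "l", 3.0), MouthSample(40, "e", 5.5),
            MouthSample(41, "l", 3.2), MouthSample(70, "h", 2.1)]
targets = ["h", "e", "l"]  # one target phoneme per output frame
path, total = select_units(targets, [database] * len(targets),
                           target_cost, concatenation_cost)
print([s.frame for s in path], total)
```

        With these toy costs the search prefers samples from one recorded segment, because staying inside a segment incurs no concatenation cost.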




    3. Demos

      1. Motion estimation

        1. normalized mouth images (one sentence)

        2. original and normalized sequences (recorded video, normalized mouth samples)

      2. AAM-based feature detection

        1. head-and-shoulder sequences (real sequence)

        2. mouth samples in the database (mouth contour of database A, mouth contour of database B, mouth feature points)

      3. Blending vs. Morphing

        1. Animation using blending (blending)

        2. Animation using morphing (morphing)

      4. BBC news reader

        1. News demo (News #1, News #2)

        2. real video vs. animation (real, animation, animation2)

      5. Facial animations

        1. text-driven animation (demo_db1.mpg, demo_db2.mpg)

        2. speech-driven animation (robec, leibniz)

      6. Web-based dialog system

        1. Auto insurance website (English carinsurance)

        2. Bundestag website (German Bundestag)

      7. Facial animation plugin test website at TNT (new)

      8. Demo for 3DTV



    4. Contact

      Prof. Dr.-Ing. Jörn Ostermann

      M.Sc. Kang Liu

      Leibniz Universität Hannover

      Institut für Informationsverarbeitung

      Appelstr. 9A, D-30167 Hannover

      Tel. +49(0)511 7625316

      Fax +49(0)511 7625333

      Email: kang@tnt.uni-hannover.de



    Updated 18/02/2009
    Webpage design M.Sc. Kang Liu