ABSTRACT - This paper presents a system that automatically recognizes key video frames during the broadcast of a typical news program. Such frames usually represent images of anchorpersons, guest speakers, commercials, and key news images. The system also prioritizes the news images within each news story, in an attempt to generate an attractive and easy-to-read pictorial transcript. The system focuses on selecting the proper key-frames from every news clip while avoiding images from commercials. It also recognizes the anchorperson image and distinguishes it from images of guest speakers. The outcome of this system is a pictorial transcript that is friendly and easy to read, index, and understand, especially as news images are marked by their importance within each news story.
Key Words: Multimedia, Pictorial Transcript, Camera Cuts, News Programs, Video Processing
It is important to know what is going on in your neighborhood, city, country, or even the world at large. Everywhere you go, you see people watching news on TV. TV news is still preferred over, and more convincing than, newspapers as a news source because of its condensed information, almost always accompanied by video clips. Viewers, news analysts, and researchers usually face the problem of choosing which channel to watch at a given news time, against the desire to watch them all simultaneously so as not to miss any detail. Fortunately, they have the option of replaying stored news from tapes at a later time. Such tapes, however, usually fade over time, and may frustrate searchers trying to locate a particular news clip in the absence of proper indexing information. Although news is now stored digitally, the problem of capacity will always exist; for example, storing 10 minutes of news may require at least 10 MBytes of disk space. Pictorial transcripts, on the other hand, are handy, easy to index, and contain only a few key-images per story; therefore, they require little storage capacity. Automated algorithms are needed to create these transcripts, as manual creation is a time-consuming process. However, selecting key-images from a video stream that may contain all sorts of images, such as commercials, news clips, and speakers including anchorpersons and guests, is not an easy task. Furthermore, making this process satisfy real-time requirements is another challenge altogether.
This paper presents an algorithm to automatically generate a pictorial transcript from a digital video feed coming from a regular TV channel that may include commercials between its news clips. The algorithm, namely the Automatic Pictorial Transcript Generator (APTG), focuses on selecting the proper key-frames from every news clip while avoiding frames from commercials. It also recognizes the anchorperson image in order to place it properly in the transcript, avoiding its selection as a representative of news clips. The transcript text is captured from the program's closed-caption transmission and converted to HTML format.
In what follows, the paper gives a brief overview of the field in question, as well as some related work, including phase one of the proposed algorithm. Section 3 explains the various components of a typical news program. In Section 4, the algorithm as a whole is presented and discussed. Finally, the paper presents the results of experiments that used the algorithm to generate pictorial transcripts, and concludes with some suggestions for future work.
The idea of an automated pictorial transcript is not that common. There are many reasons for this, one of which is that such algorithms assume the video feed includes a closed-caption signal, yet not many TV channels include such information in their transmission. Besides, TV channels are moving towards making their news transcripts publicly available on the web. Consequently, the demand by viewers for automatically generated transcripts has decreased. On the other hand, such demand has increased among the news agencies themselves, in order to reduce the effort and time spent generating news transcripts manually, especially when quick updates to their web sites matter.
This idea can probably be traced back to Bell Labs a decade ago, when they announced that they were about to build software capable of distinguishing significant still images from a television program and storing them in parallel with a text transcript of the TV program. The plan was also to add hypertext links to the generated transcript to make the document searchable and linked to related information on the web. It was also claimed that major broadcasters would announce partnerships with Bell Labs. Yet they stopped short of any further announcements soon after, although Shahraray et al. published a couple of papers on the same issue [1,2]. Their work was based on content-based sampling of the video program. However, their methods did not pay as much attention to video programs with interrupting commercials as they did to refining the text linguistically, converting it to lower case, and handling the synchronization between key-images and text.
Other work, such as that of Raaijmakers et al. [3], described a model for topic segmentation and classification. It presented an automated sequential feedback model for video analysis, where linguistic analysis was combined with visual information for the purposes of both segmentation and classification. However, their work focused on Dutch news video.
The video feed, including Closed-Caption (CC) signals, is time-stamped once received. Our proposed algorithm uses this timestamp information for synchronization. Furthermore, the proposed algorithm relies on scene detection algorithms to filter scenes down to key-frames, which are later filtered to the key-images that will be placed in the transcript.
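The timestamp-based synchronization described above can be sketched as follows. This is our illustration, not the paper's code; the data layout (key-frames as time-stamped records, captions as start/end intervals) is a hypothetical assumption.

```python
# Sketch: pairing time-stamped key-frames with the closed-caption segment
# active at each frame's timestamp. Data layout here is an assumption,
# not the paper's actual representation.
from bisect import bisect_right

def align_keyframes(keyframes, captions):
    """keyframes: list of (timestamp_sec, frame_id), sorted by time.
    captions: list of (start_sec, end_sec, text), sorted by start time.
    Returns (frame_id, text) pairs; text is "" if no caption is active."""
    starts = [c[0] for c in captions]
    aligned = []
    for ts, frame_id in keyframes:
        i = bisect_right(starts, ts) - 1           # last caption starting <= ts
        if i >= 0 and ts <= captions[i][1]:        # timestamp inside the segment
            aligned.append((frame_id, captions[i][2]))
        else:
            aligned.append((frame_id, ""))
    return aligned

frames = [(2.0, "f1"), (7.5, "f2"), (12.0, "f3")]
caps = [(0.0, 5.0, "Good evening."), (5.0, 10.0, "Top story tonight...")]
print(align_keyframes(frames, caps))
# -> [('f1', 'Good evening.'), ('f2', 'Top story tonight...'), ('f3', '')]
```

A binary search over caption start times keeps the pairing fast enough for live feeds, where frames and captions arrive continuously.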
Selecting key-frames, instead of full video, to transmit over the network, save on a hard disk, or use for browsing certainly reduces bandwidth, capacity, and time. However, video segmentation is still a difficult process when various types of camera breaks and operations are considered. A typical simple camera-cut detection algorithm may detect false cuts or miss true ones. False cuts may result from certain camera operations, object movements, or flashes within a video clip, while missed cuts may result from gradual scene changes.
Video parameters to be considered by such algorithms may include intensity, red-green-blue (RGB), hue-value-chroma (HVC), and motion vectors. A basic approach to detecting cuts is to compare one or more of these parameters, such as the intensity of corresponding pixels, in a pair of consecutive frames. In simple words, if the number of pixels whose intensity values have changed from one frame to the next exceeds a certain threshold, a cut is detected. Although this solution is quite simple, it does not usually yield high detection rates.
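The basic pairwise-comparison approach above can be sketched in a few lines. This is our illustration of the generic technique, not the paper's implementation; the two thresholds are hypothetical values, not the tuned parameters reported for APTG.

```python
# Sketch of pairwise pixel-intensity cut detection (thresholds are
# illustrative assumptions, not the paper's tuned values).
import numpy as np

def detect_cuts(frames, pixel_thresh=30, change_ratio=0.5):
    """frames: list of 2-D uint8 grayscale (intensity) arrays.
    A cut is declared between frames i-1 and i when the fraction of
    pixels whose intensity changed by more than pixel_thresh exceeds
    change_ratio."""
    cuts = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(int) - frames[i - 1].astype(int))
        changed = np.mean(diff > pixel_thresh)   # fraction of changed pixels
        if changed > change_ratio:
            cuts.append(i)
    return cuts

# Two identical dark frames followed by a bright frame: one cut at index 2.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 200, dtype=np.uint8)
print(detect_cuts([a, a, b]))   # -> [2]
```

As the text notes, such a detector is easily fooled: a camera flash flips most pixels for one frame (a false cut), while a slow dissolve changes few pixels between any two consecutive frames (a missed cut).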
RESULTS AND ANALYSIS
The APTG has been applied to both stored and live news to generate pictorial transcripts. The following table shows the parameters that give the algorithm the flexibility to adapt and be tuned to different news programs, along with their suggested values for CNN news programs.
CONCLUSION AND FUTURE WORK
The results showed very good performance in distinguishing anchorpersons from guest speakers, commercial breaks from news clips, and important key-images from less significant ones within each story. The algorithm is best applied in integration with a database management system or indexing techniques. It satisfies real-time requirements, so pictorial transcripts can be generated on the fly from live channels and in the background.
To offer a comprehensive system that not only generates transcripts but also indexes them for future retrieval, one may link this algorithm with an indexing algorithm, which could range from simple keyword extraction to something more complicated such as facial and object recognition. With such a system, searches become more efficient for researchers.
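The simple end of that spectrum, keyword extraction, can be sketched as follows. This is our illustration under stated assumptions: the stopword list is a tiny placeholder, and indexing each story by its most frequent content words is only one of many possible schemes.

```python
# Sketch of keyword-based story indexing (stopword list and ranking
# scheme are illustrative assumptions, not part of the paper).
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "are", "for"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in a story's
    closed-caption text, for use as index keys."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_n)]

story = ("The storm hit the coast overnight. Storm damage along the coast "
         "is still being assessed, and coast guard crews are searching.")
print(extract_keywords(story, top_n=2))   # -> ['coast', 'storm']
```

Each story's key-images could then be stored under these terms in a database, so a later query such as "storm" retrieves the relevant transcript pages directly.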
- Huang, Q., Liu, Z., Rosenberg, A., Gibbon, D. and Shahraray, B., "Automated Generation of News Content Hierarchy by Integrating Audio, Video and Text Information," ICASSP, Vol. VI, Phoenix, May 1999, pp. 3025-3028.
- Shahraray, B. and Gibbon, D., "Automatic Generation of Pictorial Transcripts," Proc. SPIE Conf. Multimedia Computing and Networking 1995, SPIE 2417, San Jose, CA, Feb. 1995, pp. 512-518.
- Raaijmakers, S., den Hartog, J. and Baan, J., "Multimodal Topic Segmentation and Classification of News Video," ICME, Aug. 26-29, 2002, Vol. 2, pp. 33-36.
- Hirzalla, N. and Karmouch, A., "Detecting Scene Boundaries for Video Indexing," Advanced Digital Library Forum '95, Washington, D.C., May 15-18, 1995.
- United States Patent #6,415,000, "Method of Processing a Video Stream," by Hirzalla, Nael; Streatch, Paul; MacLean, Roger; and Menard, Rob, 2002.
- Hirzalla, N., "A Pictorial Transcript Generation Algorithm for a News Program Containing Guest Speakers and Commercials," International Multiconference on Computer Science & Information Technology, Amman, Jordan, April 2006.
- Hirzallah, N. and Karmouch, A., "Detecting Cuts by Understanding Camera Operations," Journal of Visual Languages and Computing, Vol. 6, No. 4, pp. 385-404, 1996.
- Hirzallah, N. and Karmouch, A., "Automatic Cut and Camera Operation Detection for Video," International Conference on Consumer Electronics '95, Chicago, Illinois, June 7-9, 1995.