Article - Issue 7, February 2001
Video coding: past, present and future
Mohammed Ghanbari FIEEE
What is the point of video coding? The first thing to understand is that even storing a picture digitally takes a lot of memory. This introduction is about 320 words long and contains about 1850 characters. Using 8 bits of information (8 on/off decisions) to code each character (the standard) makes a total of about 15,000 bits of information. A television picture (in the UK) is 625 lines with nearly a thousand spots on each line. Even if each spot (pixel) is only on or off that requires about 600,000 bits of information for each picture. (It is considerably more if you want to code each spot with a true colour: each pixel then requires 24 bits.) Video consists of 25 pictures per second. It is not difficult to see how a wire that copes very happily with a few pages of text a second would be overwhelmed by video transmission. Hence the need for video coding and compression: squashing the information in the video pictures into fewer bits gives the wire a chance. Or it gives a ‘super’ wire (an optic fibre) a chance to carry more video signals simultaneously. This is important for the delivery of the multitude of new digital TV programs by cable and for newer services such as video on demand.
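The back-of-envelope figures above can be checked with a quick calculation (the numbers are the article's approximations, not exact raster parameters):

```python
# Rough storage figures quoted in the text (all values approximate).
chars = 1850                          # characters in the introduction
text_bits = chars * 8                 # 8 bits per character -> ~15,000 bits

lines, pixels_per_line = 625, 1000    # UK television raster (approx.)
mono_bits = lines * pixels_per_line   # 1 bit per pixel -> ~600,000 bits
colour_bits = mono_bits * 24          # 24 bits per pixel for true colour

frames_per_second = 25
colour_bits_per_second = colour_bits * frames_per_second

print(text_bits)                # 14800 -> "about 15,000 bits"
print(mono_bits)                # 625000 -> "about 600,000 bits"
print(colour_bits_per_second)   # 375000000 -> 375 Mbit/s, uncompressed
```

Uncompressed true-colour television thus needs hundreds of megabits per second: roughly 25,000 times the few pages of text a second that an ordinary wire handles comfortably.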
Squashing up the information also means that it takes less space to store it, that is to record it: video compression now allows an entire movie to be squashed onto a CD, a much less bulky and more robust object than a VCR tape.
Besides conventional wires and optical cables the other growing conduit for information, including demand for video, is in mobile communications. Here the bandwidth (the size of the ‘wire’) can be very restricted, making compression technology absolutely vital.
It is important that everyone agrees to use the same compression technology. Not only does this eliminate the need to build translating boxes, it also gives rise to economies of scale in the design and manufacture of the chips needed to accomplish the compression and decompression. Hence international groups of experts meet to agree standards and to design new ones. This article looks at the progress of video coding technology in the past, now and what may happen in the future.
In the 1960s researchers developed and tried out an analogue videophone system. However, it required a wide bandwidth and the postcard-size black-and-white pictures produced did not add appreciably to the voice communication. In the 1970s, the business community realised that being able to see the other speakers could substantially improve a multiparty discussion, and so the possibility of videoconference services came under consideration. Interest in the idea increased with improvements in picture quality and digital coding. By the 1980s, with the available technology, the COST211 video codec – the earliest coding system for digital video – was standardised by the international body known as CCITT*, under the H.120 standard. The coding principle of H.120 was 'conditional replenishment': only the moving parts of the picture were differentially pulse code modulated (DPCM), while the unchanged parts were copied from the previous picture frame. The DPCM for a particular pixel was based on simple element prediction from the previous pixel on the same or the previous line. The codec's target bit rate was 2 Mbit/s for Europe and 1.544 Mbit/s for North America, suitable for their respective first levels of digital hierarchy. However, although the image quality had very good spatial resolution (due to the nature of DPCM, which works on a pixel-by-pixel basis), it had very poor temporal quality: the picture was sharp but jumpy. It was soon realised that, in order to improve the image quality without exceeding the target bit rate, less than one bit should be used to code each pixel. This was only possible if a group of pixels were coded together, so that the 'bit per pixel' count is fractional. This led to the design of so-called block-based codecs.
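The element-prediction DPCM idea can be sketched in a few lines. This is an illustrative toy, not the H.120 specification: each pixel is coded as its difference from the previous pixel on the line, and those differences are typically small and so cheap to code.

```python
def dpcm_encode(samples):
    """Code each pixel as the difference from the previous pixel (element
    prediction): small differences need fewer bits on average."""
    prev, diffs = 0, []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the pixels by accumulating the transmitted differences."""
    prev, out = 0, []
    for d in diffs:
        prev += d
        out.append(prev)
    return out

line = [100, 102, 103, 103, 180, 181]          # one scan line of pixel values
print(dpcm_encode(line))                       # [100, 2, 1, 0, 77, 1]
assert dpcm_decode(dpcm_encode(line)) == line  # lossless round trip
```

Note how a smooth run of pixels produces near-zero differences; only the edge (103 to 180) costs a large value.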
A late-1980s study of the 15 block-based codecs proposed for videoconferencing and submitted to the ITU-T (the former CCITT) showed that 14 of them were based on the Discrete Cosine Transform (DCT) and only one on Vector Quantisation (VQ). In the DCT-based codecs, every 8×8 block of interframe error pixels is transformed into the frequency domain, so that only a small fraction of the total number of values needs to be coded. In the VQ method, normally a group of 4×4 pixels (or error pixels) is represented by a 16-dimensional vector with predefined components. The subjective quality of video sequences presented to the panel showed hardly any significant difference between the two coding techniques.
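Why does the DCT let only "a small fraction" of the values be coded? Because it packs the energy of smooth picture material into a few low-frequency coefficients. A minimal one-dimensional sketch (real codecs use a normalised two-dimensional 8×8 DCT; this naive version only demonstrates the energy compaction):

```python
import math

def dct_1d(x):
    """Naive (unnormalised) N-point DCT-II of a list of samples."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

# A smooth ramp, typical of natural-image content within a block.
block = [10, 12, 14, 16, 18, 20, 22, 24]
coeffs = dct_1d(block)
energy = [c * c for c in coeffs]
low, total = sum(energy[:2]), sum(energy)

print(low / total)   # close to 1.0: 2 of 8 coefficients carry almost all the energy
```

The remaining six coefficients can be quantised coarsely or dropped, which is where the compression comes from.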
In parallel to ITU-T's investigation during 1984–8, the Joint Photographic Experts Group (JPEG) was also interested in the compression of static images. They chose DCT as the main unit of compression, mainly because it allows the possibility of progressive image transmission (an image can be transmitted in several bursts of data rather than one continuous stream of data; this is the way data on the internet is transmitted). JPEG’s decision undoubtedly influenced the ITU-T to favour DCT over VQ. By now there was worldwide activity in implementing DCT in chips and on DSPs.
By the late 1980s it was clear that the recommended ITU-T videoconferencing codec would use a combination of interframe DPCM, to achieve minimum coding delay, and the DCT: the difference between consecutive frames is DCT coded. Here DPCM reduces the temporal redundancy of the images and the DCT reduces the spatial redundancy within the interframe pictures. The codec showed greatly improved picture quality over H.120 of the early 1980s. In fact, the image quality for videoconferencing applications was found reasonable at 384 kbit/s or higher, and good quality was possible at significantly higher bit rates of around 1 Mbit/s (still below the 2 Mbit/s target of H.120). This effort, although originally directed at video coding at 384 kbit/s, was later extended to systems based on multiples of 64 kbit/s (p × 64 kbit/s, where p can take values from 1 to 30). The definition of the standard was completed in late 1989 and it is officially called the H.261 standard (although the coding method is often referred to as 'p × 64').
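This hybrid structure can be illustrated in a few lines (a conceptual sketch, not the H.261 syntax): first take the difference between consecutive frames (temporal DPCM), then transform the residual (spatial DCT).

```python
import math

def dct_1d(x):
    """Naive (unnormalised) N-point DCT-II of a list of samples."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

prev_frame = [100, 101, 102, 103, 104, 105, 106, 107]
curr_frame = [101, 102, 103, 104, 105, 106, 107, 108]  # scene brightened slightly

residual = [c - p for c, p in zip(curr_frame, prev_frame)]  # temporal DPCM
coeffs = dct_1d(residual)                                   # spatial DCT

# A near-static scene leaves a near-constant residual, so almost all the
# energy collapses into the DC coefficient -> very few bits to transmit.
print(residual)   # [1, 1, 1, 1, 1, 1, 1, 1]
print(coeffs[0])  # 8.0 (DC); the remaining coefficients are ~0
```

The temporal step strips away what the previous frame already carries; the spatial step compacts what is left.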
The success of H.261 was a milestone for the low bit-rate coding of video at reasonable quality. In the early 1990s, the Moving Picture Experts Group (MPEG) started investigating coding techniques for the storage of video on media such as CD-ROMs. Their aim was to develop a video codec capable of compressing highly active video such as movies (where the entire field of a picture changes rapidly) on hard discs, with a performance comparable to that of VHS home video cassette recorders (VCRs). The first generation of MPEG, called the MPEG-1 standard, was capable of accomplishing this task at 1.5 Mbit/s. However, the picture artefacts in analogue VCRs and in digitally coded MPEG-1 can be very different. For example, due to the bandwidth constraints of analogue VCRs, picture resolution is reduced and pictures appear blurred, while MPEG-1 coded video appears blocky. Figure 2 shows two single frames of MPEG-1 coded video at 384 kbit/s and 1.4 Mbit/s, where the picture blockiness in the lower bit-rate picture is evident.
In the design of the MPEG-1 codec the basic framework of the existing H.261 standard was used as a starting point. However, because the requirements differ, there are significant differences between them. First of all, because of editing requirements and the need to be able to play MPEG-1 coded video starting from any part of the sequence, not all pictures can be interframe coded, as is the case in H.261. In MPEG-1, some pictures (usually every 12 frames in the European and every 15 frames in the North American standards), called I-pictures, are intraframe coded, like JPEG pictures, to serve as reference frames. The second difference is that, rather than interframe coding all the remaining pictures after each I-picture, only every second, third or fourth picture is interframe coded in the manner of H.261: these are called P-pictures. The remaining pictures are then interframe coded with predictions from previous and/or future I- or P-coded pictures; these are called bi-directionally predicted, or B-pictures. Figure 3 shows a Group of Pictures (GoP) with the directions of the predictions for coding of P- and B-pictures. The figure also shows that the encoding (transmission) order of the pictures is different from their display order. Hence for this type of coding and decoding, several picture frames need to be temporarily stored (buffered) prior to transmission and again temporarily stored at the decoder to put them in the right order. However, since the main use of MPEG-1 is for storage (recording), this extra delay is of no concern.
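The reordering of frames for transmission can be sketched briefly. The frame indices below are display order, and the 'IBBPBBPBB…' pattern is a typical European-style 12-frame GoP (an assumption for illustration; the standard does not mandate one pattern):

```python
def transmission_order(types):
    """Reorder display-order frames for transmission: each B-picture can only
    be decoded after the future I- or P-picture (anchor) it predicts from."""
    out, pending_b = [], []
    for idx, t in enumerate(types):
        if t == 'B':
            pending_b.append(idx)   # hold Bs until their future anchor is sent
        else:
            out.append(idx)         # send the anchor first...
            out.extend(pending_b)   # ...then the Bs that depend on it
            pending_b = []
    return out + pending_b          # trailing Bs wait for the next GoP's anchor

gop = list('IBBPBBPBBPBB')          # typical 12-frame GoP, display order
print(transmission_order(gop))      # [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 10, 11]
```

The displaced indices are exactly the buffering the text describes: P3 is sent before B1 and B2 so the decoder has both anchors when it decodes them.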
It is ironic that in the development of H.261, motion compensation was thought to be optional, since it was believed that after motion compensation little was left to be decorrelated by the DCT. However, later research showed that efficient motion compensation can reduce the bit rate required for transmission. For example, it is difficult to compensate for the background in a picture, unless one looks ahead at the movement of the objects over the background. This is in fact what is done in the B-pictures in MPEG-1, and it has proved to be very effective.
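Motion compensation itself reduces to a search: for each block of the current frame, find the best-matching block in the reference frame and code only the displacement plus the (hopefully small) residual. A toy one-dimensional full search using the sum of absolute differences (an illustrative sketch, not any standard's prescribed algorithm; real codecs search two-dimensionally over 16×16 macroblocks):

```python
def best_displacement(ref, block, centre, search_range):
    """Full-search motion estimation: try every displacement in the window
    and keep the one minimising the sum of absolute differences (SAD)."""
    best_d, best_sad = 0, float('inf')
    for d in range(-search_range, search_range + 1):
        start = centre + d
        if start < 0 or start + len(block) > len(ref):
            continue                # candidate block falls outside the frame
        sad = sum(abs(block[i] - ref[start + i]) for i in range(len(block)))
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d, best_sad

reference = [0, 0, 10, 20, 30, 20, 10, 0, 0, 0]
current_block = [10, 20, 30, 20]    # the same edge, shifted in the current frame
d, sad = best_displacement(reference, current_block, centre=4, search_range=3)
print(d, sad)   # -2 0: a perfect match two samples to the left in the reference
```

With a perfect match the residual is zero, so only the displacement vector need be sent: this is the bit-rate saving the later research demonstrated.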
These days, MPEG-1 decoders/players are becoming commonplace for multimedia on computers. MPEG-1 decoder plug-in hardware boards (e.g. MPEG magic cards) have been around for a few years, and now software MPEG-1 decoders are available with the release of new operating systems or multimedia extensions for PC and Mac platforms. Since in all standard video codecs, only the decoders have to comply with proper syntax, software-based coding has added extra flexibility that might even improve the performance of MPEG-1 in the future.
Although MPEG-1 was optimised for typical applications using non-interlaced video of 25 frames/s (in the European format) or 30 frames/s (in North America) at bit rates in the range of 1.2 to 1.5 Mbit/s (for image quality comparable to home VCRs), it can certainly be used at higher bit rates and resolutions. Early versions of MPEG-1 for interlaced video, such as those used in broadcast, were called MPEG-1+. Broadcasters, who were initially reluctant to use any compression on video, fairly soon adopted a new generation of MPEG, called MPEG-2, for coding of interlaced video at bit rates of 4–9 Mbit/s. However, interlacing means that motion estimation can now be carried out from the previous frame, the previous field, or a combination of the two, which makes MPEG-2 more complex than MPEG-1 (see Further Reading for more details).
MPEG-2 is now well on its way to making a significant impact in a range of applications such as digital terrestrial broadcasting, digital satellite TV, digital cable TV and the digital versatile disc (DVD). In the UK in November 1998, OnDigital started terrestrial broadcasting of BBC and ITV programmes in MPEG-2 coded digital form, and at almost the same time several satellite operators such as SkyDigital launched MPEG-2 coded television pictures direct to homes. Since in MPEG-2 the number of bi-directionally predicted pictures (B-pictures) is at the discretion of the encoder, this number may be chosen to give an acceptable coding delay, so the technique can also be used in telecommunication systems. For this reason the ITU-T has also adopted MPEG-2, under the name H.262, for telecommunications. H.262/MPEG-2, apart from coding high-resolution and higher bit-rate video, also has the interesting property of scalability: from a single MPEG-2 bit-stream, two or more video images at various spatial, temporal or quality resolutions can be extracted. This scalability is very important for video networking applications. For example, in applications such as video on demand and multicasting, the client may wish to receive video at a quality of his/her own choosing; in video networking applications, during network congestion the less essential parts of the bit-stream can be discarded without undue loss of quality in the received video pictures.
Following the MPEG-2 standard, coding of High Definition Television (HDTV) was identified as the next requirement. This became known as MPEG-3. However, the versatility of MPEG-2 (being able to code video of any resolution) left no place for MPEG-3 and hence it was abandoned. Although Europe has been slow in deciding whether to use HDTV, in the USA transmission of HDTV with MPEG-2 compression has already started. It is foreseen that by the year 2014, the existing transmission of analogue video will cease in the USA, and all broadcasting will be at HDTV quality with MPEG-2 compression.
After so much development on MPEG-1 and MPEG-2, one might wonder what is next. Certainly we have not yet addressed the question of sending video at very low bit rates, such as 64 kbit/s or less. This of course depends on the demand for such services. However, there are signs that in the very near future such demands may arise. For example, a new generation of modems now allows bit rates of 56 kbit/s or so over the Public Switched Telephone Network (ordinary telephone lines), creating a need for videophones that work at these low bit rates. In the near future there will be demand for sending higher quality video over mobile networks, where channel capacity is very scarce – for example over the forthcoming Universal Mobile Telecommunications System (UMTS), carrying video, voice and data.
To fulfil this goal, the MPEG group started working on a very low bit-rate video codec, under the name of MPEG-4. Before they could achieve acceptable image quality at such bit rates, new demands arose. These were mainly caused by the requirements of multimedia, where there is a considerable demand for coding of multiviewpoint scenes, graphics, and synthetic as well as natural scenes. Applications such as virtual studios (where real and synthetic images can be mixed) and interactive video have been the main driving forces. Some critics say that since MPEG-4 could not deliver a low bit-rate video codec, as had been promised, the goal posts were moved.
Work on very low bit-rate systems, to fulfil the requirement of telephone networks and mobile applications, has been carried out by ITU-T. A new video codec named H.263 has been devised to fulfil the goal of MPEG-4. This codec, which is an extension of H.261 (but which uses lessons learned from MPEG developments), is sophisticated enough to code small-size video pictures at low frame rates within 10–64 kbit/s. Due to a very effective coding strategy used in this codec, the recommendation even defines the application of this codec for very high resolution images such as HDTV, albeit at higher bit rates.
Before leaving the topic of MPEG-4, I should add that today’s effort on MPEG-4 is directed towards functionality, since this is what makes MPEG-4 distinct from other codecs. In MPEG-4, images are coded as objects and the generated bit-stream is scalable. This provides the users with the ability to interact with video, choosing the parts which are of interest to them. Moreover, natural images can be mixed with synthetic video, in what is called a virtual studio.
MPEG-4 defines a new, model-based coding method for synthetic objects. It also uses the wavelet transform for the coding of still images. In this coding technique, the signal spectrum is partitioned into several frequency bands, and each band is coded and transmitted independently. This is particularly suited to image coding, for several reasons. First, natural images tend to have a non-uniform frequency spectrum, with most of the energy concentrated in the lower-frequency band. Secondly, in the human visual system noise visibility tends to fall off at both high and low frequencies, and this enables the designer to adjust the compression distortion according to perceptual criteria. Thirdly, since images are processed in their entirety and not in artificial blocks, there is no block-structure distortion in the coded picture, as occurs in transform-based image encoders.
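The band-splitting idea can be shown with a single-level Haar decomposition, the simplest wavelet (a minimal sketch; MPEG-4 and JPEG-2000 use longer filters and several decomposition levels):

```python
def haar_split(x):
    """One level of Haar analysis: average each pair of samples into a low
    band (coarse signal) and difference each pair into a high band (detail)."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low, high

def haar_merge(low, high):
    """Perfect reconstruction: each pair is recovered as (l + h, l - h)."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out

signal = [100, 102, 104, 104, 110, 112, 108, 106]
low, high = haar_split(signal)
assert haar_merge(low, high) == signal

# Smooth signals put most of their energy in the low band, so the high band
# can be coded coarsely (or largely discarded) with little visible loss.
print(high)   # [-1.0, 0.0, -1.0, 1.0] - small detail values
```

Because the split acts on the whole signal rather than on 8×8 tiles, heavy quantisation blurs detail gradually instead of producing block edges.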
This mode of MPEG-4 is to be compatible with the new method of coding of still images under JPEG-2000. The main coding in JPEG-2000 is also wavelet-transform based, which not only allows images to be coded more efficiently, but also enables one to code various parts of the image at a quality of one’s own choosing. However MPEG-4, as part of its functionality for coding of natural images, uses a similar technique to H.263; hence it is now equally capable of coding video at very low bit rates. The study for the first phase of MPEG-4 is now complete and it is expected that the standard for the second phase will be published in the fairly near future.
Work on MPEG-4 has brought new requirements, in particular in image databases. Currently a working group under MPEG-7 has undertaken to study these requirements. The MPEG-7 standard builds on the other standards: its main function is to define a search engine for multimedia databases to look for specific image/video clips, using image characteristics such as colour, texture and information about the shape of objects. These pictures may be coded by either of the standard video codecs, or even in analogue forms.
Finally, MPEG has set up a new working group to look at the requirements for interoperable multimedia content delivery services. It is known as Multimedia Framework, under the generic name of MPEG-21. Issues such as copyright, multimedia networking and watermarking (‘marking’ electronic information so that illegal copying can be detected) will be amongst the most important items to be looked at by MPEG-21.
Further reading
M. Ghanbari, 'Video Coding: an Introduction to Standard Codecs', The Institution of Electrical Engineers, London, August 1999.
CCITT The International Telephone and Telegraph Consultative Committee, founded in 1956 by the amalgamation of earlier bodies to respond to new requirements generated by telephone/telegraph development.
DCT Discrete Cosine Transformation: a method of compressing images.
DPCM Differential Pulse Code Modulation: coding the difference between signal samples.
DSP Digital Signal Processing (chips): these chips can speed up the repetitive integer arithmetic of the sort used in coding mechanisms.
frequency domain A signal can be a function of time or of frequency; thus it can be presented in the time domain or the frequency domain.
H.120 standard The first recommended international standard for video coding. (The 'H' prefix is used for audio and video standards.)
interframe coding Coding a frame using information from the previous frame.
interframe DPCM DPCM coding of the differences between frames.
(8×8) interframe error pixels A block of pixels to be coded with the DCT.
interlaced video Video in which each frame is scanned more than once and the scanned pictures are interleaved.
intraframe coding Coding a frame using only the information in the frame itself.
ITU The International Telecommunication Union, an umbrella organisation for the coordination of global telecom networks and services, based in Geneva.
ITU-T The Telecommunication Standardisation Sector of the International Telecommunication Union: studies technical and operational questions and provides recommendations for worldwide standards.
JPEG The group of experts nominated by national standards bodies and major companies to produce standards for coding still images. Also refers to the image coding mechanism itself: JPEG can provide a 20:1 compression of full-colour data with no visible loss of quality.
motion compensation Compensating for the movement of objects from one frame to another.
MPEG Moving Picture Experts Group: the moving-picture equivalent of JPEG.
non-interlaced video Video in which each frame is scanned only once.
picture artefacts Distortions introduced into pictures by compression.
VQ Vector Quantisation: a method of representing a group of values using only a subset of the complete set of possible values.
wavelet transform A method of compressing images; the basis for the new generation of image-compression standards in MPEG-4 and JPEG-2000.