AVT                                                      Christian Hoene
Internet Draft                                   University of Tuebingen
Intended status: Informational                           August 17, 2009
Expires: February 2010

Requirements of an Audio Communication System (ACS)
draft-hoene-avt-acs-requirements-00.txt

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on February 17, 2010.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info).

Hoene                  Expires February 17, 2010                [Page 1]
Internet-Draft            Requirements of ACS                August 2009

Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

This document describes the requirements of an audio communication system (ACS) for acoustic content, especially speech and music.
The ACS consists of all components above the IP layer and below a digital PCM audio interface. These include the codec, the jitter buffer, and the transport. The goal of the ACS is to provide bidirectional acoustic communication between any two Internet hosts at good quality, constrained only by the available resources at the hosts and the characteristics of the transmission path between both hosts. The intention of the document is to provide the requirements for a codec that is solely intended for the Internet, to provide the requirements for the codec's payload specification, and to define the requirements on the transport protocol.

Table of Contents

1. Introduction
   1.1. Basic Architectural Guidelines of the Public Internet
   1.2. Problem Statement
2. Usage Scenarios
   2.1. Scenario 1: Person-to-person calls (VoIP)
   2.2. Scenario 2: High quality interactive audio transmissions (AoIP)
   2.3. Scenario 3: Ensembles performing over a network (MMoIP)
   2.4. Scenario 4: Push-to-talk like service (PTT)
3. High-Level Requirements
   3.1. Low cost and licensing free
   3.2. Reliable on the Internet
   3.3. Quality
4. Technical Requirements
   4.1. Audio content
   4.2. Quality
   4.3. Reliability and congestion control
   4.4. Coding bit rate
   4.5. Sampling rate
   4.6. Complexity
   4.7. Latency
   4.8. Packet rate
   4.9. Packet loss resilience
   4.10. Frame erasure concealment
   4.11. Jitter compensation and playout buffer
   4.12. Playout adjustments
   4.13. Concealment of mode switches
   4.14. Extrapolation
   4.15. Interpolation
   4.16. DTX
   4.17. Testing
   4.18. Licensing and source code
   4.19. Versioning and software updates
   4.20. RFC Type
   4.21. Side channel
   4.22. Layered coding
   4.23. Interoperability with PSTN
   4.24. Conferencing and speech recognition
   4.25. Self-testing support
   4.26. Self-awareness
5. Out of scope
   5.1. Multichannel
   5.2. Repacketization
   5.3. Support for circuit-switched transmissions
   5.4. Support of packet networks other than the Internet
   5.5. Support of streaming
   5.6. Random packet losses
   5.7. Packet loss differentiation
   5.8. Robustness against bit errors
   5.9. IRS and other kind of bandwidth filters
   5.10. Support of voice band data, fax and DTMF
   5.11. Idle noise
   5.12. Tandem coding
   5.13. FEC
6. Security Considerations
7. IANA Considerations
8. References
   8.1. Normative References
   8.2. Informative References
9. Acknowledgments

1. Introduction

This document is based mainly on the discussions on the Codec BOF mailing list, which took place in 2009. It is also based on the internal requirement documents of ITU-T G.718 [SG16 314-WP3], on the ITU-T G.719 standard, on the 3GPP document [TS26.114-830], and on existing IETF codec drafts.

It is intended as the basis of a requirements document that should lead to the design of an audio codec for the Internet. However, this document addresses the requirements of the entire system, not only of a single component, because we want to ensure that the system as a whole works well, not only some parts of it.

We introduce the term audio communication system to describe the parts of an IP based telephone which take care of the bidirectional transmission of acoustic content between two Internet hosts. These include the encoder, the payload encapsulation, guidelines on how to use transport protocols (RTP, UDP, TCP, DCCP), the playout buffer, the decoder, the concealment of packet loss, time adjustments, changes of encoding parameters, and various mechanisms to manage, control and monitor the acoustic transmission.
The ACS is intended mainly for use on the public Internet and should be as easily distributable as most other Internet protocols, which run on virtually all kinds of devices and on all kinds of communication links. Also, the ACS shall be affordable by all humans that have Internet access. If possible, it should be royalty free and available as open-source software. If these requirements are met, then the ACS can fulfill its goal of providing acoustic transmission between _any_ two Internet hosts.

1.1. Basic Architectural Guidelines of the Public Internet

The ACS is intended for the public Internet and follows architectural design guidelines similar to those that apply to other Internet protocols. These include:

o End-to-end semantics, saying that transport protocol units are transmitted from one end (an Internet host) to the other end without any intermediate changes.

o Network neutrality.

o Best effort service that tries to transmit packets as well as possible but that cannot guarantee any minimal transmission bandwidth or maximal transmission delay. Instead, one has to cope with whatever end-to-end transmission quality is provided.

o Congestion control to prevent congestion collapse of the Internet (such as in TCP or DCCP). Typically, TCP controls the number of packets that are sent during periods of congestion. Thus, one has to consider that the number of packets per second might be an important constraint, not only the bits per second.

o Internet protocols are scalable to a wide degree. They work on links having a very low bandwidth (on the order of bits per second) and on links with very high bandwidth (on the order of gigabits per second). The transmission latency can range from microseconds up to seconds. Also, the Internet hosts might have very low processing and memory capabilities (such as an 8-bit microcontroller).
However, even then they can communicate with any other host. Flow control (such as in TCP) is used to cope with hosts that have limited resources.

o Functions that help to monitor the communication (such as the features provided by ICMP).

o The most important Internet protocols can be used without paying royalties.

o The public Internet allows global communication between any two hosts connected to the public Internet. Typically, the user only has to pay for getting access to the public Internet, not for the distance that the IP packets have to travel.

o Internet standards should be as simple as possible (but no simpler).

1.2. Problem Statement

The ACS should enable acoustic communication between any two Internet hosts considering the features of the Internet as described above. We see the need for designing the ACS because we see the following weaknesses in the existing codec and VoIP designs.

o Many standardized speech and audio codecs require the payment of royalty fees. Only codecs such as G.711, G.722, G.722.1, and G.722.1C, which have mediocre performance, can be used license free. Thus, one cannot ensure that a good codec can be afforded by all owners of all Internet hosts.

o All known codecs have a small operational range, and they do not adapt to a wide range of bandwidth. For example, AMR supports bit rates between 4.75 and 12.2 kbps and ITU-T G.719 supports rates between 32 kbps and 128 kbps.

o Acoustic communication at superb transmission quality is not supported. In particular, if the latency is very low and the bandwidth is very high, we do not have a standardized codec that supports hifi quality at ultra-low delays. Only the SBC audio codec standardized by the Bluetooth SIG [A2DPV10] can be considered for this usage scenario. Ultra-low delay transmissions at hifi quality are especially useful for distributed ensemble performances or distributed choruses.
o Similarly, if the transmission quality is very bad, no standardized audio codec supports graceful degradation. If the loss rate becomes too high, then all speech and audio codecs become useless. However, in those cases one can use a half-duplex, push-to-talk like transmission of short audio segments that would still allow a very slow communication at very low bitrates.

o Frequently, a PSTN call needs to be transcoded. Transcoding reduces the speech quality and increases latency. Thus, most codecs are designed to work well in conditions of transcoding. However, in the case of end-to-end IP transmission, the need for transcoding vanishes. It might only be needed for teleconferencing applications or for connecting to the PSTN.

o The quality of a PSTN call has hardly increased during the last decade. Often, it is even worse because of IP based interconnections and support of cellular networks. Even though wideband speech transmission systems have been developed, the unwillingness of users to pay more has limited the introduction of wideband speech. Also on the Internet we do not expect users to pay more for high quality phone calls. However, we believe that they will be delighted if they can communicate at nearly perfect quality.

o No standardized codec or its RTP payload RFC specifies how to cope with time-varying bandwidth and latency, nor is this considered a required feature. This hinders the widespread use of adaptive coding mode selection and thus reduces the quality of many Internet phone calls.

o Not a single standardized codec supports varying complexities to support devices with low resources.

o Standardized codecs do not support any functionality for self-observation and self-monitoring.
Also, they do not provide information about how well they encoded and decoded the audio content under a given set of coding parameters and packet loss rates. However, this information is important for the transport in order to adapt the codec's transmission parameters correctly.

o Packet losses occur in the Internet, the transmission time of packets and the playout time vary, and the coding mode is changed in response to changes in the available transmission bandwidth. All these things cause the audio stream to be temporally distorted. The codec shall support concealment algorithms to limit the perceptual distortion. However, no existing standardized codec supports the concealment of playout time adjustments. Also, standardized packet loss concealment (PLC) algorithms work on extrapolation of previous audio segments and do not support interpolation. Lastly, one often cannot distinguish between a delayed playout time and a packet loss because the missing packet might still arrive. Thus, an algorithm that uses the same extrapolation for packet loss concealment and time stretching might be beneficial.

o None of the standardized interactive speech and audio codecs supports mechanisms to decrease the packet rate. Usually, packet rates are reduced by putting multiple speech frames into one RTP packet. However, the codecs do not take advantage of the high algorithmic delay that can be utilized then. Thus, they work less efficiently in situations of congestion.

2. Usage Scenarios

The ACS should be optimized towards real-time communications over the Internet. It should support applications like collaborative network music performance, high-quality teleconferencing, wireless audio equipment, low-delay links for broadcast applications, network sound servers for using multimedia applications remotely, telepresence (enterprise), the digital living room (consumer), and others. The ACS shall be general enough to support multiple and quite diverse network conditions.
For example, if network latency is low and bandwidth is plentiful, it can be used for quasi-simultaneous music transmissions allowing distributed ensemble performances. It is also applicable to interactive hifi quality audio transmission. If the network connection worsens, the transmission quality degrades to (wide-band) interactive speech transmission. As a last resort, it emulates a high-delay, half-duplex push-to-talk like communication service.

In the following, we list four main scenarios and describe their quality requirements.

2.1. Scenario 1: Person-to-person calls (VoIP)

The classic scenario is that of phone usage, to which we will refer in this document as Voice over IP (VoIP). Human speech is transmitted interactively between two Internet hosts. Typically, besides speech some background noise is present, too.

The quality of a telephone call is traditionally judged with subjective tests such as those described in [ITU-T P.800]. The ACR scale used in MOS-LQS might sometimes not be very suitable for high quality. In that case, the MUSHRA rating [ITU-T BS.1534-1] can be applied, for example.

A telephone call is considered good if it has a maximal mouth-to-ear delay of 150 ms [ITU-T G.107] and a speech quality of MOS-LQS 4 or above. However, interhuman communication is still possible if the delay is much larger.

This scenario does not include the use case of using a VoIP-PSTN gateway to connect to legacy telephone systems. In those cases, the gateway would make an audio conversion from broadband Internet voice to the frugal 1930's 3.1 kHz audio bandwidth. Interconnections to the PSTN will most likely stick with its legacy codecs to avoid transcoding.

2.2. Scenario 2: High quality interactive audio transmissions (AoIP)

In this scenario we consider a telephone call having a very good audio quality at modest acoustic one-way latencies ranging from 50 to 150 ms [ITU-T G.107], so that music can be listened to over the telephone while two persons talk interactively.

The Absolute Category Rating (ACR) [ITU-T P.800] can be used, too. However, it might be more efficient to measure quality with the MUSHRA tests given in [ITU-T BS.1534-1], which are intended for intermediate audio qualities.

Also, for today's teleconferencing and videoconferencing systems there is a strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 kHz. This rising demand for high quality audio is due to the following:

o Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects which occupy a wider audio bandwidth than that of speech. For example, Web conferences such as WebEx, GoToMeeting, and Adobe Acrobat Connect are based on IP transmission and benefit from an IP-optimized ACS.

o The new "Telepresence" video conferencing systems (such as from Cisco), providing High Definition video and audio quality to the user, give the experience of being in the same room by introducing high quality media delivery.

o The emerging Digital Living Rooms will likely be interconnected and might require a constant acoustic transmission at high quality.

2.3. Scenario 3: Ensembles performing over a network (MMoIP)

In some usage scenarios, users want to act simultaneously and not just interactively. For example, if persons sing in a chorus, if musicians jam, or if e-sportsmen play computer games together in a team, they need to communicate acoustically. We call it the Make Music Over IP (MMoIP) scenario.
In this scenario, the latency requirements are much stricter than for interactive usage. For example, if two musicians are placed more than 10 meters apart, they can hardly keep synchronized (sound needs about 29 ms to travel 10 meters in air). Empirical studies [Gurevich2004] have shown that for ensembles playing over networks, the optimal acoustic latency is around 11.5 ms, with a targeted range from 10 to 25 ms.

In addition to the MUSHRA tests, the recommendation [ITU-R BS.1116] can be used for audio transmissions that have just minor impairments.

2.4. Scenario 4: Push-to-talk like service (PTT)

In spite of the development of broadband access (xDSL), a lot of users would only have service access via PSTN modems or mobile links. Also, on these links the available bandwidth might be shared among multiple flows and is subject to congestion. Then, even low coding rates of about 8 kbps are too high.

If transmission capacity hardly exists, one can still degrade the quality of a telephone call to something like a push-to-talk (PTT) service having very high latencies. Technically, this scenario takes advantage of bandwidth gains due to discontinuous transmission (DTX) modes and very large packets containing multiple speech frames, causing a very low packetization overhead.

The quality requirements of a push-to-talk like service have hardly been studied. The OMA lists as a requirement of a Push To Talk over Cellular service a transmission delay of 1.6 s and a MOS value above 3.0 that should typically be kept [OMAPoCReq]. However, as long as an understandable transmission of speech is possible, the delay can be even higher. For example, [OMAPoCReq] allows a delay of typically up to 4 s for the first talk-burst.

Also, [OMAPoCReq] describes a maximum duration of speaking. If a speaking participant reaches the time limit, the participant's right-to-speak shall be automatically revoked.
If the quality of a telephone call is very low, then instead of listening-only speech quality the degree of understandability can be chosen as the performance metric. For example, objective tests of the understandability use automatic speech recognition (ASR) systems and measure the number of correctly detected words.

In any case, the participant shall be informed about the quality of the connection, the presence of high delays, the half-duplex style of communication, and their (limited) right-to-speak. For example, this can be achieved by a simulated talker echo.

3. High-Level Requirements

Based on the four scenarios, we list the following high-level requirements that the ACS should fulfill.

3.1. Low cost and licensing free

The codec shall be affordable by all humans having Internet access. Thus, one of the key requirements is patent/licensing free technology. However, it cannot be seen as a "legally binding requirement" but rather as a desired working goal. Typically, one cannot verify 100% whether a codec is totally free of unknown IPRs. Some patents may be overlooked.

It can also be assured that the known IPRs are "license-free" and "free from the need to sign licensing agreement(s) before use" (that is, any user is able to get the codec and use it without signing any paperwork).

If one is practicing potentially patented technologies, there is no real mechanism to protect oneself from a patent troll that claims license fees for a standardized ACS. We have to assume that there is a certain probability that the designed ACS is covered by patents that the IETF is not aware of. Thus, one has to define proper procedures on how to cope with IPR claims even if the ACS is already standardized.

Because of the lack of financial income, the codec's design, testing, and standardization process must be cost effective, too.
A cheap approach is needed to characterize the ACS, which might include tests with volunteer participants. For example, codecs can be provided to thousands of users in public to test them. Also, potential performance comparisons need not be as precise or proven beyond any doubt, because nobody wins or loses IPR fees if one solution wins or fails.

3.2. Reliable on the Internet

The ACS must be optimized towards acoustic real-time communications over the Internet, and must have the flexibility to adjust to the environment it operates in. Based on the quality of the end-to-end speech packet transmission, the codec should adapt its quality and delay to achieve an optimal benefit for the user.

Like most Internet transports, it should be usable under a wide range of conditions, allowing high reliability regardless of the network conditions. The reliability of the audio transmission should be high, even in cases of low and varying bandwidth.

This implies that the codec is used on top of a transport protocol that implements a congestion control algorithm and that the ACS adapts to changes of available bandwidth. For example, if the available transmission bandwidth is too low to allow the codec to transmit audio at a high quality, the application can lower the sampling, bit, or frame rate of the stream at the cost of higher algorithmic delay or a degraded audio quality.

3.3. Quality

The ACS must provide a quality/bitrate trade-off that is competitive with other state-of-the-art codecs. Also, the codec must have a very low algorithmic delay so that it can support the typical requirements of its users.

The speech and audio quality of the ACS should not be significantly worse than that of existing standardized codecs, if measured on the ACR scale.

4. Technical Requirements

4.1. Audio content

At all bitrates the ACS must deliver speech in any language at good quality.
The ACS must be tested with different speakers and with at least two languages, and should support tonal languages as well.

Frequently, speech needs to be transmitted not only without background noise but also under conditions including car, office, and street noise. Background signals shall be considered not as noise but as a part of the signal that conveys information. Background signals can include background music at an SNR of 25 dB, office noise at an SNR of 20 dB, car noise at an SNR of 15 dB, babble noise at an SNR of 25 dB, an interfering talker at an SNR of 15 dB, and street noise at an SNR of 20 dB.

At high bitrates the quality must be excellent for any audio signal, especially music. Stereo is considered a must.

Also, for high quality audio conferencing, reverberant input signals should be considered for testing the modes.

The speech and audio signals might have varying loudness. The transmission shall support a wide range of dynamics. The nominal input levels are -36 dB, -26 dB, and -16 dB with respect to the overload (OVL) point (-20 dBm0).

4.2. Quality

At a given operational mode, the ACS need not have perfect quality and need not perform better than every other standardized codec. However, considering the most common network conditions, the ACS shall perform better than any combination of existing codecs most of the time.

4.3. Reliability and congestion control

The acoustic transmission should be reliable and robust. The ACS shall be robust not only against packet losses but also against periods of low bandwidth. The mean availability of the audio transmissions, calculated over all users, might be one of the metrics for assessing the performance of an Internet audio codec.

The ACS should adapt to the current network situation.
Also, the codecs of the ACS themselves must be adaptable, because switching among multiple codecs is difficult to negotiate and unlikely to work well in situations of inter-operation.

Responding to congestion is a more complex issue and out of the scope of this document. However, it shall be defined how to use existing congestion control protocols like DCCP and TCP. The ACS shall provide the mechanisms that congestion control requires from the codec (i.e. bitrate/framerate adaptability).

Because of the interactive nature of the acoustic transmission, the bidirectional transmission of audio content can be used for transmitting the required feedback and implementing a control loop. As such, it can be considered a requirement that the acoustic transmission should always be bidirectional, even if the backward channel just sends "compressed silence".

4.4. Coding bit rate

The ACS must be capable of running at bitrates below 10 kbps. At low bitrates it must deliver good quality for clean, noisy, or hands-free speech in any language. At high bitrates the quality must be excellent for any audio signal, including music. The bitrate can go up to 128 kbit/s per channel or more. The bitrate must be adjustable in real-time and at a fine granularity. Variable bit rates depending on the content should be supported.

4.5. Sampling rate

The codec must support multiple sampling rates, ranging from 8 kHz to full band. Switching between sampling rates must be carried out in real-time.

4.6. Complexity

The ACS should have a complexity that is adjustable in real-time, where a higher complexity setting improves the quality/bitrate trade-off.

As a lower limit, the ACS shall run on hosts that are common in developing countries.
These may include OLPC XO-1s or other low-end (refurbished) computers (refer to Computer Aid International) and smart phones like those based on the Texas Instruments Open Multimedia Application Platform (OMAP), which include both a host ARM CPU and one or more DSPs.

On those devices, the ACS need not be capable of running at the highest quality but must run at least at an 8 kHz sampling rate.

4.7. Latency

To maintain a good quality of services requiring interactivity, it is necessary to keep the overall delay as low as possible. But the delay requirement tends to have less importance in applications involving VoIP, possibly combined with other media and/or in heterogeneous network environments. A trade-off must be found between low delays and flexibility (scalability, ability to operate in various conditions with many types of signals, etc.).

In interactive scenarios, the codec should be capable of running with an algorithmic delay of no more than 30 milliseconds.

For the making music scenario, the algorithmic delay must be between 3 and 9 ms. Still, given the speed of light as the fundamental limit on the speed of information exchange, distributed ensembles can perform only regionally if a latency budget of 25 ms must be kept. Typically, an optical fiber has a refractive index of 1.46, so light in the fiber propagates at roughly 205,000 km/s and bits travel about 5,130 km one-way in 25 ms.

The total codec delay consists of the algorithmic delay and the processing delay. Algorithmic delay includes the frame size delay plus any other delays inherent in the algorithm (look-ahead, noise suppression and error correcting codes for algorithm purposes, and any algorithmic decoding delay). Processing delay is the additional delay caused by implementation on a finite speed processor.

4.8. Packet rate

The ACS must support a variable and dynamically changeable packet rate.
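
To make the effect of the packet rate concrete, the following sketch computes the packet rate and the gross bit rate at the IP layer for a given codec bit rate and number of frames per packet. It is only an illustration under stated assumptions: uncompressed IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes) headers, no layer 2 overhead, and example values of 8 kbps and 20 ms frames that are not requirements of this document. The function name is hypothetical.

```python
# Illustrative sketch only; header sizes assume IPv4 + UDP + RTP without
# header compression, and layer 2 overhead is ignored.
IPV4_HEADER = 20  # bytes, without options
UDP_HEADER = 8    # bytes
RTP_HEADER = 12   # bytes, without CSRC list or extensions


def gross_bitrate_kbps(codec_kbps, frame_ms, frames_per_packet):
    """Return (packets per second, gross bit rate in kbps) at the IP layer."""
    packet_interval_s = frame_ms * frames_per_packet / 1000.0
    packets_per_second = 1.0 / packet_interval_s
    # Codec payload carried by one packet, in bytes.
    payload_bytes = codec_kbps * 1000.0 / 8.0 * packet_interval_s
    packet_bytes = payload_bytes + IPV4_HEADER + UDP_HEADER + RTP_HEADER
    gross_kbps = packet_bytes * 8.0 * packets_per_second / 1000.0
    return packets_per_second, gross_kbps


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        pps, kbps = gross_bitrate_kbps(codec_kbps=8.0, frame_ms=20.0,
                                       frames_per_packet=n)
        print(f"{n:2d} frames/packet: {pps:5.1f} packets/s, {kbps:5.1f} kbps gross")
```

Under these assumptions, an 8 kbps stream costs 24 kbps at one 20 ms frame per packet (50 packets per second) but only 9.6 kbps at ten frames per packet (5 packets per second), at the price of 200 ms of packetization delay.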
Putting several frames into one packet is useful for packet grouping, which in turn is very useful for bandwidth adaptation and network usage efficiency. This is because a lot of bandwidth is used for protocol headers like those of Ethernet, IP, UDP, and RTP, and for overhead at the MAC layer. Even if IP header compression is applied, many layer 2 protocols still introduce additional overhead that is not compressed [Hoene2005].

Classically, the packing of frames into packets is specified in the RTP payload specification, not in the codec specification itself. In general, a codec can take advantage of a larger frame size. This is especially true for a transform codec, where a larger frame means better frequency resolution. The gain is somewhat smaller for a time-domain codec, especially for frames longer than 20 ms. However, in larger packets the inter-frame dependencies can be adjusted on the fly to choose a trade-off between bitrate and amount of error propagation. It may even be possible to just make use of more inter-frame correlation for frames 2...N in a packet of N frames and get most of the benefits the codec would get from a larger frame size.

Thus, the ACS codec should support large frame sizes (up to an MTU).

4.9. Packet loss resilience

The codec must be capable of running with little error propagation, meaning that the decoded signal after one or more packet losses is close to the decoded signal without packet losses after no more than two additional packets. The codec must have a packet loss resilience that is adjustable in real-time, where a lower packet loss resilience setting improves the quality/bitrate trade-off. Also, the codec may add inter-frame redundancies to achieve better loss robustness.

4.10. Frame erasure concealment

The ACS must have a packet loss concealment (PLC) algorithm.
The PLC must be standardized so that it is known how well the decoder can cope with packet losses when the transmission parameters must be adjusted. However, the ACS may implement a PLC that performs better than the standardized PLC. The purpose of standardizing the PLC (and the other concealment algorithms) is to guarantee a certain quality level over a range of conditions.

For good results, a PLC operates on decoder-internal parameters and states, which requires tight algorithmic integration. The PLC is therefore as much a part of a decoder as any other decoder module.

The above also applies to time compression/stretching methods for handling network jitter and to other kinds of concealment algorithms (as mentioned below).

4.11. Jitter compensation and playout buffer

The ACS must cope with jitter. It must be able to receive out-of-order de-packetized frames and present them in order for decoder consumption. It must be able to receive duplicate speech frames and present only unique speech frames to the decoder. It must be able to handle clock drift between the encoding and decoding end-points.

The playout buffer should minimize the buffering time at all times while still conforming to the minimum performance requirements. If the limit on jitter-induced concealment operations cannot be met, it is always preferable to increase the buffering time in order to avoid a growing number of jitter-induced concealment operations.

4.12. Playout adjustments

The ACS should support time-scale modifications, such as time stretching and time shrinking, especially for jitter compensation, because on the Internet jitter is the norm, not a special case. Because the operations in time-scale modification algorithms are similar to those of the PLC, these operations should be combined into a single algorithm.

Also, the ACS shall be able to determine a desired length of a time-scale modification (so it can e.g.
leave out or add one or more pitch periods), to keep a 'backup' decoder state of the previous frame, or to add one more frame length of decoding latency. Otherwise, the voice of the previous packet cannot be compressed, and stretching it is suboptimal. In general, the use of a high-quality time-scaling algorithm is recommended. The amount of scaling should be as low as possible, scaling should be applied as infrequently as possible, and oscillating behavior is not allowed.

4.13. Concealment of mode switches

The ACS should also support the concealment of distortions caused by switching coding modes [Hoene2005]. Also, the negative effect of switching the coding mode shall be low. For example, the transmission and coding mode might change several times per second (at up to 5 Hz) after feedback from the decoder is received.

4.14. Extrapolation

Sometimes it is not possible to distinguish between a packet that arrives too late and a packet that is lost and needs to be concealed. The decision on whether to conceal the loss or to conduct time stretching cannot yet be made. Thus, the ACS should support a general extrapolation of the audio signal, which allows a late decision on whether to play out a delayed packet or to use a loss concealment operation.

4.15. Interpolation

If a packet n has not arrived but the previous packet n-1 and the following packet n+1 have, then packet n shall be interpolated using the frames of the previous and following packets.

4.16. DTX

The codec must be capable of using Discontinuous Transmission (DTX), where packets are sent at a reduced rate when the input signal contains only background noise.

4.17. Testing

The testing of the ACS and the quality characterization shall be performed with real network profiles, such as those of [TIA-921] or those given in the appendix of [TS26.114-830], not with a fixed set of "average distributed errors and losses".
The latter do not clearly reflect the nature of the Internet. Also, test vectors might be provided to check the correctness of implementations.

4.18. Licensing and source code

The usage of the ACS should not require paying royalties or signing an NDA. At the time of standardization it should be available on royalty-free (RF) terms and on reasonable and non-discriminatory (RAND) terms.

The codec should be available as open source, allowing implementation under BSD, LGPL, and/or GPL licenses.

The codec specification and implementation shall be based on bit-exact, fixed-point, modular ANSI C code using the basic operator set provided in the ITU-T Software Tool Library. In addition, an interoperable floating-point implementation can be provided.

The source code shall be normative for a number of reasons. One is ease of implementation (either using the reference code directly or using it to validate ported code). Another is that it ensures that the characterization tests actually measure the standard's performance. Even if it is not officially normative, readily available reference code becomes de facto normative, since most implementers will simply use the code and ignore the text in the RFC.

4.19. Versioning and software updates

In order to cope with changes in the bitstream format, which might be required due to errors in the specification or, more importantly, due to newly claimed IPR, it must be possible to update the ACS online. Also, it must be indicated which bitstream format is going to be used.

4.20. RFC Type

It should become a standard, not an experimental RFC.

4.21. Side channel

Congestion control should be mandatory for all Internet applications, including the ACS. [RFC3550] suggests in Section 10 that the RTP profile should take care of rate adaptation.
Thus, the ACS should take advantage of a feedback loop for variable coding parameter control in order to allow a wide range of operation and to adapt to the currently available bandwidth and processing power.

Congestion control per se is outside the purview of this group, but providing the hooks for a congestion-control mechanism to interact with the codec is quite important. For example, when this codec runs on a TFRC-enabled or DCCP RTP stream, TFRC and DCCP need to be able to adjust (via the application) the bitrate of the codec in order to implement congestion control, and perhaps to adjust packetization periods and packet rates.

A side channel for adaptation can be added. This would make sense because in the usage scenarios audio is always transmitted in both directions. Adding a control channel would give a real advantage over existing codec designs. Alternatively, such a side channel can be realized by other means, such as handling that communication in SIP/SDP and in RTP/RTCP.

4.22. Layered coding

The ACS can support layered encoding as in G.729.1 and G.718. Layered coding can be seen as a method for computationally efficient transcoding. Layered coding makes sense in a conferencing environment, as the stripping of layers should be done at the sender after encoding. Then, the encoding has to be done only once for all receivers. However, bidirectional transmissions do not need layered encoding: as most codecs are now VBR, it is enough to adapt the codec (at the source) to the bandwidth. Also, layered coding comes at an additional cost (about 10% of the coding rate).

4.23. Interoperability with PSTN

The ACS might be developed to be interoperable with existing PSTN systems. Interoperability with 2G and 3G mobile radio systems is especially desirable. Also, interoperability with G.722.2 at 12.65 kb/s and with G.722 (for DECT devices) is of particular interest.

4.24. Conferencing and speech recognition

A teleconference server should be able to mix the audio signals at lower complexity than decoding plus encoding. The ACS shall be capable of supporting automatic speech recognition.

4.25. Self-testing support

The ACS should support means of testing the quality of a connection by feedback loops and quality feedback.

4.26. Self-awareness

The ACS should be aware of how well it can transmit acoustic content at various coding parameters and packet loss rates.

5. Out of scope

5.1. Multichannel

5.1 surround sound is worth supporting, but that would most likely be done through multiple independent channels/pairs, so it is probably not much of an issue.

5.2. Repacketization

The ACS need not support repacketization in the network because this would violate the end-to-end semantics of the Internet.

5.3. Support for circuit-switched transmissions

The ACS need not support circuit-switched transmission.

5.4. Support of packet networks other than the Internet

The ACS need not support packet networks other than the Internet (VoATM, private networks).

5.5. Support of streaming

The ACS need not support multimedia streaming (e.g. video + audio involving bit-rate trade-offs), multicast content distribution (offline/online), or message retrieval systems.

5.6. Random packet losses

The use of random packet losses to measure concealment performance is meaningless because it does not reflect the nature of the Internet. Thus, the codec need not be optimized or tested using these criteria. Instead, real packet loss and delay traces should be considered. Also, short and long bursts of packet losses, which occur due to handoffs, fast fading, congestion events, and route changes, should be considered.

5.7. Packet loss differentiation

The ACS cannot assume that the quality of packet transmission changes on a per-packet basis.
For example, in layered coding the core layers cannot be expected to be less subject to packet losses than the enhancement layers.

5.8. Robustness against bit errors

The ACS need not be robust against bit errors because they are quite rare on top of Ethernet. This is especially true as long as UDP-Lite is not widely supported.

5.9. IRS and other kinds of bandwidth filters

The ACS need not consider bandwidth filters like the IRS because they are based on the traditions of circuit-switched connections.

5.10. Support of voice band data, fax and DTMF

The ACS need not support voice band data such as fax or DTMF. Instead, alternative ways of communication or other RTP payload formats should be considered.

5.11. Idle noise

The generation of idle channel noise should not be used to indicate that the call is still active. Instead, in case of transmission problems an acoustic notification can be given.

5.12. Tandem coding

The ACS need not be optimized for tandem coding conditions because one can assume an end-to-end transmission of IP packets. Tandem coding might only be used for PSTN gateways and for conference bridges.

5.13. FEC

RTP support of Forward Error Correction (FEC) need not be considered. Also, support for adding "redundant speech frames", which have been transmitted in preceding packets, to an RTP packet is not required. Instead, redundancy can be added by the encoder, which does this in a more efficient way.

6. Security Considerations

To do.

7. IANA Considerations

To do.

8. References

8.1. Normative References

[ITU-T BS.1534-1] "BS.1534 : Method for the subjective assessment of intermediate quality levels of coding systems", ITU-R Recommendation BS.1534-1 (01/03).

[ITU-T G.107] "G.107 : The E-model, a computational model for use in transmission planning", ITU-T Recommendation G.107 (04/09).

[ITU-T P.800] "P.800 : Methods for subjective determination of transmission quality", ITU-T Recommendation P.800 (08/96).

[ITU-R BS.1116] "BS.1116 : Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems", ITU-R Recommendation BS.1116 (10/97).

[OMAPoCReq] "Push to talk over Cellular Requirements", Open Mobile Alliance, Approved Version 1.0, 09 Jun 2006, OMA-RD-PoC-V1_0-20060609-A.pdf.

[TIA-921] Telecommunications Industry Association, "Network Model for Evaluating Multimedia Transmission Performance Over Internet Protocol", TIA-921-A, June 18, 2008.

[TS26.114-830] 3GPP TS 26.114 V8.3.0, "IP Multimedia Subsystem (IMS); Multimedia telephony; Media handling and interaction", Rapporteur: Per Froejdh, Version 8.3.0, 2009-06-12, RTS/TSGS-0426114v830.

8.2. Informative References

[A2DPV10] Bluetooth SIG, "Advanced Audio Distribution Profile", Audio Video WG, adopted specification, revision V1.0, May 22, 2003.

[celt-draft] J-M. Valin, T. Terriberry, G. Maxwell, C. Montgomery, "Constrained-Energy Lapped Transform (CELT) Codec", Internet draft, draft-valin-celt-codec-01, work in progress, July 13, 2009.

[Gurevich2004] Gurevich, M., Chafe, C., Leslie, G., and Tyan, S., "Simulation of Networked Ensemble Performance with Varying Time Delays: Characterization of Ensemble Accuracy", Proceedings of the 2004 International Computer Music Conference, Miami, USA, 2004.

[Hoene2005] Hoene, C., Karl, H., and Wolisz, A., "A perceptual quality model intended for adaptive VoIP applications", International Journal of Communication Systems, Wiley, August 2005.

[SG16 314-WP3] ITU-T SG16, "Agenda and list of documents for Q9/16", Temporary Document 314-WP3, Received on 2008-04-22 From Rapporteur Q9/16.

[silk-draft] K. Vos, S. Jensen, K.
Soerensen, "SILK Speech Codec", Internet draft, draft-vos-silk-00.txt, work in progress, July 6, 2009.

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

9. Acknowledgments

The author would like to thank the various contributors who took part in the discussion on the Codec BOF mailing list in the period till September 2009. Also, this document is based on the SILK [silk-draft] and CELT [celt-draft] drafts, the internal requirement documents of ITU-T G.718 [SG16 314-WP3], and the 3GPP document [TS26.114-830]. The author would like to thank Henry Sinnreich for his valuable feedback and support.

Funding for this draft has been provided by the University of Tuebingen within the "Projektfoerderung fuer Nachwuchswissenschaftler".

This document was prepared using 2-Word-v2.0.template.dot.

Author's Address

Christian Hoene
University of Tuebingen
WSI-RI
Sand 13
72076 Tuebingen
Germany

Phone: +49 7071 2970532
Email: hoene@ieee.org