Guidelines for ToBI Labelling (version 3, March 1997) copyright (1993) The Ohio State University Research Foundation By Mary E. Beckman & Gayle Ayers Elam 0. Preface 0.1. What are the "Guidelines for ToBI Labelling"? ToBI (for Tones and Break Indices) is a system for transcribing the intonation patterns and other aspects of the prosody of English utterances. It was devised by a group of speech scientists from various different disciplines (electrical engineering, psychology, linguistics, etc.) who wanted a common standard for transcribing an agreed-upon set of prosodic elements, in order to be able to share prosodically transcribed databases across research sites in the pursuit of diverse research purposes and varied technological goals. Silverman et al. (1992) and Pitrelli et al. (1994) describe the motivation for and development of the ToBI system. If you ask for this handbook in hard copy, those papers will be appended as Appendix B. Appendix A (which is included both in the hard copy and in the ASCII file version of this labelling guide) is "The ToBI Annotation Conventions", the definitive summary statement of the symbols and marks used in ToBI transcriptions, and of the conventions that we have agreed upon for their use. The rest of this labelling guide is a more detailed description of the system, with reference to accompanying utterances of two types: example utterances to illustrate points made in the text and exercise utterances to give labellers practice on the points made in the text. These utterances are set off in the text of the labelling guide using the following typographic conventions. EXAMPLE <>: orthographic transcription tonal transcription and/or break index values EXERCISE <>: orthographic transcription Each example utterance is also referred to in the text by its basename within pairs of angle brackets -- e.g., the first example utterance is <>. 
We have chosen the examples and arranged the exercises with the aim of leading new users through the system in a self-taught training course, trying to choose utterances in each of the six practice sets that show only phenomena that have been introduced up to that point. The utterances that accompany this labelling guide can be obtained in two formats: as digitized computer files with electronic record of the f0 contour from the Ohio State University web and ftp distribution site (see section 0.2) or as an audio tape with paper record of the f0 contour (see section 0.3). 0.1.1. Notice of copyright and restrictions on use The "Guidelines for ToBI Labelling" document and associated material are copyrighted. The text cannot be copied or distributed in any format unless this paragraph is included. The utterances accompanying the guide are available to any interested user, but only for non-commercial use. The National Science Foundation and the Ohio State University make no warranty and accept no liability associated with the use of these materials. These materials may be obtained only as described in Sections 0.2 and 0.3, and are not to be redistributed by other user sites. Users may not redistribute these materials from their own sites, but should instead tell interested people how to obtain their own copy from the distribution site. 0.1.2. Acknowledgements The "Guidelines for ToBI Labelling" and the accompanying utterances were developed in the Ohio State University Linguistics Laboratory with partial support from the National Science Foundation, and the Ohio State University continues to support the labelling guide by providing a distribution site for the electronic records (described in Section 0.2). Colin Wightman generously provided the distribution site for the electronic records for version 2.0 of the labelling guide in his lab at the New Mexico Institute of Mining and Technology. 
Jennifer Venditti provided LaTeXing and various other editing expertise for this earlier version, which we have relied on in producing this new one. Kim Silverman and John Pitrelli developed the original transcriber script, on which we based the primary shell scripts for viewing the examples and doing the exercises. David Talkin helped in innumerable ways, such as by developing the scripts for the cardinal examples. Harald Singer developed an alternative electronic format for version 2.0, and Stefanie Jannedy set up the web page for it and for the ftp site. 0.2. Getting and using the digitized utterances and f0 tracks If you have waves(tm) (an Entropic Research Laboratory product) or a similar computer display system, obtain the speech files, electronic record of the f0 contour, and label files by ftp from the Ohio State University distribution site. 0.2.1. Getting the digitized utterances and f0 tracks There are two options for obtaining the ToBI materials depending upon how much disk space users have available. For those with sufficient disk space there is a single large tarfile for convenience. This option requires having about 40 MB available during the installation process; the full materials occupy about 20 MB once the installation is completed and the tarfile is removed. If you do not have enough space to have both the complete tarfile and all the installed files at the same time, use the second option. There are three smaller tarfiles which together contain all the materials contained in the single large tarfile. That is, they contain the speech files, f0 records, and label files divided into three parts by order of occurrence in the Guidelines. In addition to the single large or three smaller tarfiles, you will need to get the "essentials" tarfile, which is about 2.5 MB and contains an ASCII version of "The Guidelines for ToBI Labelling", and the scripts and tools for displaying the f0 tracks and labels. 
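The choice between the two download options described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name "tobi_download_option" is hypothetical and is not part of the distributed tools, and the threshold simply adds the approximate figures quoted above (about 40 MB needed while installing the single large tarfile, plus about 2.5 MB for the "essentials" tarfile).

```shell
#!/bin/sh
# Hypothetical helper, not part of the ToBI distribution.
# Given the free disk space in kilobytes, suggest a download option:
#   "complete" - the single large tarfile
#   "parts"    - the three smaller tarfiles, fetched as space allows
# The threshold is an assumption built from the approximate figures in
# the text: ~40 MB peak for the large tarfile + ~2.5 MB for "essentials".
tobi_download_option() {
    free_kb=$1
    needed_kb=$(( 40 * 1024 + 2560 ))   # 40 MB + 2.5 MB, expressed in KB
    if [ "$free_kb" -ge "$needed_kb" ]; then
        echo "complete"
    else
        echo "parts"
    fi
}
```

One way to call it, assuming a POSIX "df", is to pass the space available in the current directory: tobi_download_option "$(df -Pk . | awk 'NR==2 {print $4}')". Since the sizes in the text are approximate, the answer should be treated as a hint rather than a guarantee.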
If you are reading this page over the WWW, click here to access the tarfiles. Download the README-file first for descriptions of the tar files and of the directory structure that they will set up on your home system.

Otherwise use ftp to get these tarfiles. On your home system, enter the command: "ftp portia.ling.ohio-state.edu" (or use the internet address for portia, which is 128.146.172.225) from the directory where you would like to have the materials (the installation process will create a directory called TOBI-TRAINING where all the files will be put). When prompted, type the login name "ftp" and your user name on your home system as a password. Change directory to the TOBI directory ("cd pub/TOBI"). If you now list the files available ("ls"), you will see several directories as listed below. Users should feel free to explore these directories.

DOCS - contains documentation files such as this labelling guide.
TARFILES - contains compressed sets of files for easy retrieval.
TOOLS - contains shellfiles and tools for transcription.

Further descriptions of the ToBI ftp site at portia.ling.ohio-state.edu and some guidelines can be found in the README file. If you run into serious problems send email to the ToBI site managers at tobi@ling.ohio-state.edu.

The "Guidelines for ToBI Labelling", and the scripts and other tools that you will need to look at the examples are stored in a compressed tar file called essentials_tobi_release_3.tar.Z. To get this file, change to the TARFILES directory ("cd TARFILES"). Since tarfiles are binary files, enter the command "binary". Now type "get essentials_tobi_release_3.tar.Z".

All the utterance files for the training materials have been placed in one large tarfile and in three smaller files as described above. These also are stored in the TARFILES directory. You can transfer the complete set of speech materials by typing "get complete_tobi_release_3.tar.Z".
When the transfer is complete, return to your home system with the command "quit".

Alternatively, to get the smaller files with part of the training materials only, type "get essentials_tobi_release_3.tar.Z". This will give you the Guidelines for ToBI Labelling, and the scripts and tools that you will need. Then to get the actual speech files, f0 contours, and label files for the examples and exercises, get one or more of the "part" files. You can take as much as you have room for and then delete things after you are through with them to make room for the next set of examples and exercises to work through. Part 1 has the files for Section 1 to Practice 2, Part 2 has the files for Section 2.6 to Practice 4, and Part 3 has the files for Section 3.2 to Practice 6. Transfer any of these files by typing one of the commands "get part1_tobi_release_3.tar.Z", "get part2_tobi_release_3.tar.Z", or "get part3_tobi_release_3.tar.Z" respectively. When you have all the files you want, return to your home system with the command "quit".

Back on your home machine, make sure the tarfiles are in whatever directory you would like the ToBI files to reside under. The installation process will create a directory underneath the one in which you put the tarfiles and all the ToBI material will go in the new directory. To install the files, first enter the commands "uncompress essentials_tobi_release_3.tar.Z" and "uncompress complete_tobi_release_3.tar.Z" (or uncompress the relevant "part" file). Once the files are uncompressed (it will take a few minutes), enter the commands "tar -xvf essentials_tobi_release_3.tar" and "tar -xvf complete_tobi_release_3.tar" (or "tar -xvf" the relevant "part" .tar file) to extract all of the subdirectories and files. Don't forget to delete a tarfile once you're happy that all of its contents got installed correctly. Summary instructions are listed below.
1) ftp portia.ling.ohio-state.edu (or ftp 128.146.172.225)
   Name: ftp
   Password: username on home system
2) binary
3) cd pub/TOBI/TARFILES
4) get essentials_tobi_release_3.tar.Z (Guidelines for ToBI Labelling, scripts, tools)
4a) get complete_tobi_release_3.tar.Z
    or, if disk space is a concern, choose a relevant subset of
4b) get part1_tobi_release_3.tar.Z (Section 1 to Practice 2)
    get part2_tobi_release_3.tar.Z (Section 2.6 to Practice 4)
    get part3_tobi_release_3.tar.Z (Section 3.2 to Practice 6)
5) quit
6) uncompress essentials_tobi_release_3.tar.Z
6a) uncompress complete_tobi_release_3.tar.Z
    or relevant subset of
6b) uncompress part1_tobi_release_3.tar.Z
    uncompress part2_tobi_release_3.tar.Z
    uncompress part3_tobi_release_3.tar.Z
7) tar -xvf essentials_tobi_release_3.tar
7a) tar -xvf complete_tobi_release_3.tar
    or
7b) tar -xvf (relevant parts)_tobi_release_3.tar
8) rm essentials_tobi_release_3.tar
8a) rm complete_tobi_release_3.tar
    or
8b) rm (relevant parts)_tobi_release_3.tar

The directory structure that you should recover by untarring all the files is described below. The top level directory is TOBI-TRAINING. It includes an ASCII version of this labelling guide (called "labelling_guide_v3.ASCII"), two subdirectories called EXAMPLES and PRACTICE (which hold the speech, f0, and label files), and two script files called "examples" and "exercises". The last two are waves(tm) scripts for displaying the examples and exercises and for labelling the exercises. These two scripts assume this directory structure. Non-waves(tm) transcriptions of the examples are given in the ASCII file Nonwaves-transcriptions, and more detailed instructions about how to use the scripts and useful shortcuts for the mechanics of labelling using the waves(tm) scripts are given in README-transcriber.
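As a rough check that untarring produced the layout just described, something like the following shell function could be run from the directory that holds TOBI-TRAINING. This is a minimal sketch, not one of the distributed tools: the function name "check_tobi_layout" is hypothetical, and it tests only for the two subdirectories, the two scripts, and the two ASCII files named in the paragraph above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the ToBI distribution.
# Reports any entry named in the text that is missing from an installed
# TOBI-TRAINING directory. Returns 0 if all are present, 1 otherwise.
check_tobi_layout() {
    top=${1:-TOBI-TRAINING}   # default to the directory name used in the text
    rc=0
    # Subdirectories that hold the speech, f0, and label files.
    for d in EXAMPLES PRACTICE; do
        if [ ! -d "$top/$d" ]; then
            echo "missing directory: $top/$d"
            rc=1
        fi
    done
    # The two waves(tm) scripts and the two ASCII files named in the text.
    for f in examples exercises Nonwaves-transcriptions README-transcriber; do
        if [ ! -f "$top/$f" ]; then
            echo "missing file: $top/$f"
            rc=1
        fi
    done
    return $rc
}
```

Calling "check_tobi_layout TOBI-TRAINING" prints one line per missing entry and prints nothing when the layout is complete. It does not look inside EXAMPLES or PRACTICE, so it cannot tell whether every utterance file arrived intact.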
TOBI-TRAINING
    EXAMPLES (where speech, f0, and "answers" are kept)
        AND1.breaks (break index label file)
        AND1.d (speech)
        AND1.f0 (f0)
        AND1.misc (misc label file)
        AND1.tones (tones label file)
        AND1.words (words label file)
        . . .
    PRACTICE (where user transcriptions are kept)
        I-mean.breaks
        I-mean.words
        I-mean.tones
        . . .
    examples (script for displaying examples and "answers")
    exercises (script for practice labelling)
    labelling_guide_v3.ASCII (ASCII version of Guidelines and Conventions)
    Nonwaves-transcriptions (ASCII version of non-waves(tm) labelling)
    README-transcriber (instructions and shortcuts for scripts)

After you have untarred the files, you will have to type a few commands to use the utterances as intended. To make the scripts executable and to protect the speech files and "answer" label files which are kept in the directory EXAMPLES from being overwritten by mistake, type the following three commands at the unix command line from within the TOBI-TRAINING directory.

chmod +x examples
chmod +x exercises
chmod -w EXAMPLES

There are also a few other tools included in the "essentials" file. These are compressed tarfiles which must be uncompressed and untarred (as above) to be installed. They are not strictly necessary for working through the Guidelines for ToBI Labelling, but are helpful tools to have. "cardinals.tar.Z" contains the files necessary for displaying cardinal examples of ToBI label categories, "transcriber.tar.Z" contains the files necessary for transcribers to transcribe their own data, and "checker.tar.Z" contains the files necessary to invoke John Pitrelli's checking program, which checks transcriptions and reports errors.

"cardinals.tar.Z" allows the user to display and play cardinal examples of ToBI label categories by pushing buttons in a menu display.
If this tool is installed, the button menu with these examples is called up automatically each time the "examples", "exercises", or "transcriber" (see next tool description) scripts are invoked. Read the README-transcriber file for more information. Install by uncompressing and untarring as described above. cardinals.tar.Z (for displaying cardinal examples of ToBI) When uncompressed and untarred yields: README-transcriber aux_examples/ (additional examples of ToBI labelling) cardinals docard Labellers who wish to label their own data should use the script "transcriber" which is included in "transcriber.tar.Z". This uses the same format as the training materials. Cardinal examples are available if they are installed. Read the README-transcriber file for more information. Install by uncompressing and untarring as described above. transcriber.tar.Z (for transcribing your own examples) When uncompressed and untarred yields: breakindexmenu miscmenu tonemenu transcriber (script for doing transcriptions) wordsmenu Labellers should check their transcriptions of their own data to see that they have a "legal" ToBI transcription. The script "check-transcription" checks the label files for consistency and adherence to the ToBI conventions for labelling. Read the README-checker file for more information. Install by uncompressing and untarring as described above. checker.tar.Z (for checking transcriptions) When uncompressed and untarred yields: README-checker check-and-behead-breaks.awk check-and-behead-misc.awk check-and-behead-tones.awk check-and-behead-words.awk check-transcription (script for checking transcriptions) check-transcription.awk 0.2.2. Using the digitized utterances and f0 tracks Now you are ready to use the two waves(tm) scripts. Both of them take as their argument(s) the basename(s) of the utterance(s) you want to display. 
For example, as you are reading about example utterance <jam1> in the labelling guide, you can listen to the speech and look at the associated transcription by typing:

examples jam1

This will call up xwaves with two data windows to display the speech waveform and f0 trace, and a third window for the ToBI labels of our "answer" transcriptions. This script is set up to just display the information and play the speech; it does not allow the user to change the labels. In order to practice transcribing one of the exercise utterances (say the first exercise in PRACTICE ONE), type:

exercises amelia-p2

This will display the speech waveform and f0 with only the word labels and placeholders for the break index labels in the labelling window. You can then use the labelling menus to add the tones and substitute break index values for the placeholders in the break index section of the label window.

Both of these scripts can be used to display several utterances in a series. For example, to queue up the first four examples in the Guidelines, type:

examples jam1 cough made1 made2

After you have finished looking at <jam1>, push the CONTINUE button in the waves(tm) control panel. The example <cough> will then be displayed.

A note on where the label files are stored: The "answer" label files (basename.tones, basename.breaks, basename.misc) are kept in the directory EXAMPLES along with the speech, f0, and word label files (basename.d, basename.f0, basename.words). The label files that the user creates when labelling with the script "exercises" are stored in the directory PRACTICE.

A note on multi-transcriber sites: Setting up a site to be a multi-transcriber site is fairly straightforward. The main idea is that instead of all labellers working with the "exercises" script and having their labels saved in the directory PRACTICE, each labeller will have a personal "exercises" script and directory where the labels will be stored.
Take the name USER as our demonstration (where any name can substitute for USER). For each user, make a separate directory within the top level directory TOBI-TRAINING, copy the "exercises" script to a personal copy for the user, edit the "user" script, and then invoke the script "user" exactly as one would invoke the "exercises" script (make sure the "user" script is executable -- "chmod +x user" if necessary.) Additionally, one may want to copy the break index placeholders into the USER directory.

1) mkdir USER
2) cp exercises user
3) Change line in script "user" which says:
       PRACTICEDIR=PRACTICE
   to say:
       PRACTICEDIR=USER
4) cp PRACTICE/*.breaks USER/
5) chmod +x user (if necessary)
6) E.g., start Practice 1 by typing: user amelia-p2

0.2.3. A less interactive electronic version that can be used on a Mac

Version 2.0 of the labelling guide has been converted to another electronic format that can be fetched to a Mac for perusal and playback. The conversion to this format was done by Harald Singer, and it is available on the Ohio State University Linguistics Laboratory web site. Go to: http://www.ling.ohio-state.edu/Phonetics/E_ToBI/singer.tobi.html

0.3. Getting and using the audio recording and paper f0 records

To get an audio tape of the utterances and a printed paper copy of the labelling guide and f0 tracks, send your request along with a check for $25.00 made out to The Ohio State University to:

ToBI Labelling Guide, c/o Mary Beckman
Ohio State University, Linguistics Dept.
222 Oxley Hall, 1712 Neil Ave.
Columbus, OH 43210-1298
USA

(The $25.00 just barely covers the cost of making copies of the tape and booklet and of the mailing to a North American location.)

On the audio tape each utterance is played twice in a row. The utterances occur in the order that they are listed in the Labelling Guide text, with two more repetitions if an utterance is mentioned again later in another section.
However, it is far easier to use the utterances if you can play each one as many times as you want and if you can zero in on some section of an utterance at will, and we recommend that you find some way of doing so. For example, you might use a tape recorder with a recordable tape loop device (and loops of several lengths). Or if you have a Kay DSP 5500 or some other kind of computer system with fast A/D capabilities, digitize each utterance from the tape into a buffer where you can play it repeatedly while looking at the paper record of the f0 contour and the accompanying labels.

To transcribe the example utterances, you will need to use the non-waves(tm) conventions described in Section 9 of The ToBI Annotation Conventions. The last section of the booklet is a listing of an ASCII file containing the non-waves(tm) format labelling of each example and exercise utterance corresponding to the waves(tm)-format labelling displayed on the sheet with the f0 contour. They are given in alphabetical order by basename.

This ASCII file and another ASCII file containing the orthographic labels and field placeholders of all of the exercise utterances can be obtained by anonymous ftp from the Ohio State University Linguistics Laboratory by doing the following: On your home system, enter the command: "ftp portia.ling.ohio-state.edu" (or use the internet address for portia, which is 128.146.172.225). When prompted, type the login name "ftp" and your user name on your home system as a password. Change directory to the TOBI directory ("cd pub/TOBI"). If you now list the files available ("ls"), you will see several directories:

DOCS - contains documentation files such as this labelling guide.
TARFILES - contains compressed sets of files for easy retrieval.
TOOLS - contains shellfiles and tools for transcription.
To get them, change to the DOCS directory ("cd DOCS") and transfer the files by entering the commands "get Nonwaves-transcriptions" and "get Nonwaves-exercises-templates". When the transfers are complete, return to your home system with the command "quit".

0.4. Future editions and a disclaimer

If you have comments on this Labelling Guide -- particularly, if you have suggestions for improvements or better example utterances you would like to give to us -- we would be very grateful if you would direct the comments to us at:

e-mail: tobi@ling.ohio-state.edu

other-mail:
ToBI Labelling Guide, c/o Mary Beckman
Ohio State University, Linguistics Dept.
222 Oxley Hall, 1712 Neil Ave.
Columbus, OH 43210-1298
USA

This e-mail address is also the place to send us your e-mail address if you want to be added to our list of "subscribers" to be notified of any future editions of the Labelling Guide.

The ToBI labelling system was originally developed to cover the three most widely used varieties of spoken English -- namely, general American, standard Australian, and southern British English. We do not claim to cover other varieties. Indeed, we have already determined that ToBI proper does not adequately cover many other British varieties such as the Glasgow dialect, and modified variants need to be developed by users who want to use it in transcribing utterances in these other dialects. By the same token, we must stress that ToBI was not intended to cover any language other than English, although we endorse the adoption of the basic principles in developing transcription systems for other languages, particularly languages that are typologically similar to English.
More general comments about using the ToBI system for other dialects of English or about adapting ToBI labelling principles to develop comparable systems for the transcription of other languages may also be addressed to the tobi e-mail address listed above for forwarding to appropriate interested members of the larger ToBI group. 1. Overview and some basics 1.1. The basic parts of ToBI A ToBI transcription for an utterance consists minimally of a recording of the speech, an associated electronic or paper record of the fundamental frequency contour, and (the transcription proper) symbolic labels for events arranged in four parallel tiers. (Other tiers can be added for the needs of particular sites -- see Section 4.) The four tiers of labels, arranged in the order that they appear in the default labels window for the examples and exercises programs, are: (1) a tone tier (2) an orthographic tier (3) a break-index tier (4) a miscellaneous tier The tone and break-index tiers represent the core prosodic analysis. The tone tier is the part of the transcription that corresponds most closely to a phonological analysis of the utterance's intonation pattern. It consists of labels for distinctive pitch events, transcribed as a sequence of high (H) and low (L) tones marked with diacritics indicating their intonational function as parts of pitch accents or as phrase tones marking the edges of two types of intonationally marked prosodic units. The inventory of pitch events and their definitions are based on autosegmental analyses, in particular the analysis of Pierrehumbert and her colleagues (see Pierrehumbert & Hirschberg, 1990, and the references cited in it) with some modifications toward such alternative analyses as that of Ladd (1983). In example utterance <>, there is a production of the question "Will you have marmalade, or jam?" with two pitch accents (the L* tones), two phrase accents (the H- tones), and a H% boundary tone. EXAMPLE <>: Will you have marmalade, or jam? 
L* H- L* H-H% The break-index tier marks the prosodic grouping of the words in an utterance by labelling the end of each word for the subjective strength of its association with the next word, on a scale from 0 (for the strongest perceived conjoining) to 4 (for the most disjoint). These categories of association strength, or `break indices' are based on work by Mari Ostendorf, Patti Price, Stefanie Shattuck-Hufnagel, and their associates (see, e.g., Price et al., 1991). We equate the two highest break indices with prosodic groupings that are marked intonationally. For example, break index 3 after the word "marmalade" in utterance <> corresponds to the end of the intermediate phrase indicated by the H- phrase accent. EXAMPLE <>: Will you have marmalade, or jam? 1 1 1 3 1 4 The orthographic tier is arguably not part of any core prosodic analysis, except inasmuch as the labels on this tier can be used to interface the transcription to dictionary entries which do indicate such things as which syllable is likely to be most stressed in each word, prosodic information which is not otherwise included in the ToBI system. The orthographic tier is a straightforward transcription of all of the words in the utterance, in ordinary English orthography. When using waves(tm) and a transcriber script, or any similar computer labelling system, the convention is to align each orthographic label to the end of the word. The miscellaneous tier, like the orthographic tier, can include many events that are arguably not part of prosody per se. However, many events that are typically marked on this tier are important for interpreting the analyses on the tone tier and break-index tier, because they disrupt the smooth rhythm of the utterance or interrupt the intonation contour. This tier is essentially a `comment' tier that can be used to mark events such as the cough in example utterance <>. 
With very few exceptions (most notably, the label `disfl' often stands alone to flag the occurrence of a perceived disfluency of some type), labels on this tier come in pairs, to mark the beginning and end of each event interval. If it were not for the disruption of the cough labelled on the miscellaneous tier here, the tone transcription would have to be parsed as either unfinished or ill-formed.

EXAMPLE <>: Will you have marmalade ...
L* L*
1 1 1 1p
cough< cough>

1.2. Guiding principles

As should be obvious from the preceding examples, ToBI does not try to transcribe all aspects of prosody, or even all aspects that are amenable to symbolic transcription. In deciding what to include and what to leave out, we were guided by three principles. First, we wanted to be able to distinguish in our transcription all of the categorically distinct intonation patterns and prosodic units of the language (or rather of the three intonationally similar dialects that we claim to cover -- see Section 0.4 above). Second, we felt we should not transcribe aspects of prosody which are more amenable to quantitative measures than to the categorical divisions of a symbolic transcription. Finally, we did not want to squander the user's energies in transcribing even categorical aspects of prosody which are predictable from other parts of the transcription or from auxiliary tools such as dictionaries.

The categorical aspects of prosody which we try to capture completely (by the first principle) are of two types. The first is the prosodic structure -- the rhythm of more and less stressed words alternating with each other, and the grouping of words into prosodic constituents of various sizes -- and the second is the intonation pattern -- the sequence of contrastive pitch events that we call pitch accents, phrase accents, and boundary tones.
An example of the noncategorical aspects of prosody which we leave out (in accordance with the second principle) is the local tempo of each word in the utterance, which we feel could be more accurately and directly captured by some quantitative measure such as normalized segment duration (e.g., Campbell, 1992) than by any symbolic transcription such as an arbitrary division into, say, categories `1', `2', and `3' (for `slow', `medium', and `fast' tempi). An exception to this principle is the marking for each phrase of the point of highest fundamental frequency associated with an accent (HiF0), which we use as a measure of pitch range in order to facilitate research on the relationship between pitch range and discourse structure (see, e.g., Grosz & Hirschberg, 1992, and references therein). We anticipate being able to do away with this marking when we have developed automatic tools for detecting accent-related peaks directly from the fundamental frequency contour in conjunction with the tone tier transcription.

A categorical aspect of prosody which we leave out (in accordance with the third principle) because it should be fairly predictable is the marking of the stressed and unstressed syllables within each word. By this level of stress we mean the word-internal alternation between more and less stressed syllables where the relative prominence of any pair of syllables is fairly fixed and can be thought of as inherent to the word's dictionary entry. For example, if the first and third syllables in the word "marmalade" are not pronounced with more prominence than the second, native speakers will judge the vowels in these two syllables to be mispronounced. (That is, the first and third syllables should not have reduced vowels, whereas the second one should.) Since such word-internal rhythms are thus a fixed part of the word's pronunciation, we leave this specification out.
That is, for example, in the transcription of utterances <> and <>, we have not marked the first and third syllables as relatively more stressed than the second syllable, since this aspect of the prosodic structure would be marked in any dictionary entry for the word, so that users of ToBI-transcribed databases could interface the orthographic tier with an online dictionary to fill in this information.

1.3. The marking of stress -- Pitch accents and prominence

If the stress patterns within words are largely predictable from the dictionary entries for the word, what about other levels of stress? It has been recognized for some time now (e.g. Bolinger, 1972) that other aspects of the stress pattern cannot be predicted from the grammar with anything like the confidence with which we can predict the more stressed syllables within a word. Indeed, which factors predict the prominence of a word relative to other words in the same sentence is a matter of much current debate (see e.g. Hirschberg, 1993), and is one of the issues which we hope ToBI-transcribed databases will be most useful in helping to resolve.

Example utterance <> illustrates the unpredictability of prominences above the word, with three different productions of the same sentence -- "Marianna made the marmalade" -- each of which has a different stress pattern. In the first production, there are two syllables that are relatively more prominent than any other, the accented syllables in the words "Marianna" and "marmalade". In the second production of the sentence, on the other hand, there is only the one relatively more prominent syllable in "Marianna", and "marmalade" has been `deaccented'. This level of stress is marked in the ToBI system by directly transcribing the pitch accent on the tone tier.
Thus, in the transcription of the first production in the example, there are H* accents marked for both "Marianna" and "marmalade", whereas in the second production there is only the L+H* accent marked on "Marianna". (The third production, like the first, also has accents on "Marianna" and "marmalade", but it has a different stress pattern because both of these accents are nuclear stresses, whereas in the first production only "marmalade" has a nuclear accent. We will describe this higher level of stress in more detail in the next subsection.)

EXAMPLE <>: Marianna made the marmalade.
in three productions
1) H* H* L-L%
2) L+H* L-L%
3) L+H*L-H% L* H* L-L%

Note that there is another difference between the first production and the last two: the second and third productions begin at a much lower fundamental frequency than the first. This is due to the distinction, marked on the tone tier, between a single-tone H* pitch accent and a bitonal L+H* pitch accent. This contrast is independent of the difference in stress pattern, which depends on the pattern of pitch accent PLACEMENT and not on the type of pitch accent. To see this, compare the first two productions of the sentence in <> with the second two productions. (These first two sentences are the same as productions (1) and (2) in <>.)

EXAMPLE <>: Marianna made the marmalade.
in four productions
1) H* H* L-L%
2) L+H* L-L%
3) L+H* !H* L-L%
4) H* L-L%

The stress patterns are the same, but the choice of H* versus L+H* pitch accent type is the opposite. (For the relationship between the second pitch accent and the first in production (3) and the diacritic `!' that marks this relationship, see Section 2.8 below. The somewhat less low beginning in the third production is also discussed in Section 2.2.) The same stress patterns are illustrated again in the third and fourth productions in <> with yet another pitch accent type, this time a L* pitch accent (with a following rise into H- phrase accent and H% boundary tone).
EXAMPLE <>: Marianna made the marmalade. in four productions 1) L+H* !H* L-L% 2) H* L-L% 3) L* L* H-H% 4) L* H-H% In transcriptions using waves(tm) label files (or any similar computer labelling system), the stress that comes from associated pitch accents can be parsed from reading the tone tier, since the waveform is used to place the mark for a pitch accent somewhere in the syllable that is phonologically associated to the accent. In the non-waves(tm) transcription conventions, the stress is marked even more explicitly in the symbolic string, by putting an asterisk in the orthographic transcription just before the vowel of each accented syllable. 1.4. The marking of stress -- Intonational phrasing and prominence Above the level of contrast between pitch-accented versus unaccented words, native speakers of English can distinguish another level of stress contrast, that between the last accented word of a phrase and any preceding accent. In the first production in utterance <>, for example, the word "marmalade" feels more prominent than "Marianna". In the last production of the sentence, on the other hand, "marmalade" does not feel necessarily more prominent than "Marianna". The sentence has been divided into two intonational phrases, so that each of these words is the last accented word in its own phrase. (This level of prominence is often called the `nuclear stress' or `nuclear accent' of the phrase.) Note that the level of prominence need not be marked explicitly, since the word with nuclear stress is defined positionally; it is the last accented word, or the accented word (if there is only one in the phrase). Thus the prominence contrast between a nuclear accent and a mere (prenuclear) accent can be read from the transcription of the accents on the tone tier relative to the boundaries marked between the phrases. EXAMPLE <>: Marianna made the marmalade. 
in three productions 1) H* H* L-L% 1 1 1 4 2) L+H* L-L% 1 1 1 4 3) L+H*L-H% L* H* L-L% 4 1 1 4 There are two separate markings indicating the boundaries of an intonation phrase; one is the sequence of phrase accent and boundary tone on the tone tier, and the other is the 4 on the break-index tier. The break indices are numbered from 0 (for least disjuncture) to 4 (for most pronounced disjuncture). The numbering captures the hierarchical nature of these prosodic groupings. At the highest level of the break index hierarchy and at the next lower level, the sense of disjuncture between adjacent words is connected closely to the intonation pattern. The boundary after "Marianna" in the third production of the sentence in <> is one at the highest level in the break index hierarchy transcribed in ToBI. This level is marked tonally by a boundary tone (H% or L%) at its end (and sometimes at its beginning, too, in which case it is %H). The next lower level (break index 3) is marked by a phrase accent (H- or L-) at its end. An intonation phrase contains one or more intermediate phrases, and the end of an intonation phrase is by definition also the end of an intermediate phrase (break index 3). This fact is reflected on the tone tier in the requirement that there be a sequence of phrase accent (for the last intermediate phrase) followed by a boundary tone at the end of every intonation phrase. The last production of the sentence in <> illustrates this nicely with clear reflexes of the tone string in the fundamental frequency contour. Note first the fall from the peak for the L+H* nuclear pitch accent to the L- phrase accent for the first intermediate phrase, followed by the small rise in fundamental frequency to the H% boundary tone at the intonation phrase boundary. Utterance <> illustrates the next lower level of disjuncture, that between two intermediate phrases that are grouped into one intonation phrase. 
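Because nuclear stress is defined positionally, it can be read mechanically off the tone and break-index tiers. The following is only a minimal sketch of that idea, not part of the ToBI conventions. The (word, pitch accent, break index) triples and the function name nuclear_accents are hypothetical, break indices are simplified to plain integers, and any break index of 3 or 4 is assumed to close a phrase. The placement of the L* on "made" in the third production is also our assumption, since the alignment is not recoverable from the example as printed here.

```python
def nuclear_accents(words):
    """Return the last pitch-accented word of each phrase.

    `words` is a list of (word, pitch_accent_or_None, break_index) triples.
    A phrase is assumed to end at any word whose break index is 3 or 4.
    """
    nuclear = []
    last_accented = None
    for word, accent, break_index in words:
        if accent is not None:
            last_accented = word          # remember the most recent accented word
        if break_index >= 3:              # phrase edge reached
            if last_accented is not None:
                nuclear.append(last_accented)
            last_accented = None          # start a new phrase
    return nuclear


# First production: H* on "Marianna" and "marmalade", break indices 1 1 1 4.
first = [("Marianna", "H*", 1), ("made", None, 1),
         ("the", None, 1), ("marmalade", "H*", 4)]

# Third production: "Marianna" closes its own phrase (break index 4).
# Assumption: the L* is on "made".
third = [("Marianna", "L+H*", 4), ("made", "L*", 1),
         ("the", None, 1), ("marmalade", "H*", 4)]

print(nuclear_accents(first))
print(nuclear_accents(third))
```

Under these assumptions the first production yields a single nuclear accent on "marmalade", while the third yields one on each of "Marianna" and "marmalade", matching the description in the text.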
In the second production of the sentence "`I' means insert", there is a fall from a H* nuclear accent into a L- phrase accent, but there is no subsequent boundary tone, since this is not an intonation phrase boundary.

EXAMPLE <>: `I' means insert.
in two productions
1) H* H* L-L%
   1 1 4
2) H* L- H* L-L%
   3 1 4

Note that the first production of the sentence in <> contrasts with this second production in its stress pattern in the same way as the first and third productions of <>. The notion of nuclear accent is defined relative to the intermediate phrase. The contrasting productions in <> illustrate the same contrast in one versus two nuclear accents with L* pitch accents and a H- phrase accent at the boundary between the two intermediate phrases in the production with two nuclear accents. (The *? on the "made" in the first production illustrates a very common type of ambiguity about accent placement that is discussed below in Section 2.9.)

EXAMPLE <>: Marianna made the marmalade.
in two productions
1) L* *? L* H-H%
   1 1 1 4
2) L* H- L* H-H%
   3 1 1 4

1.5. What lines up with what?

The conventions for placing labels when using the waves(tm) labelling system are prescribed in the ToBI Annotation Conventions so that labellers can use tools such as John Pitrelli's checker program to check for inadvertent omissions and grammatical errors. To quickly summarize, the break index label is placed at or just after the word label. Phrase accent and boundary tone labels are placed on or just before the corresponding 3 or 4 break index label. Pitch accents are placed somewhere within the accented syllable, preferably within the interval that can be identified with the syllable's vowel. In the non-waves(tm) transcription conventions, the orthographic, tone, and break index labels are ordered within each line so that such a transcription could be generated fairly quickly by merging and sorting a set of waves(tm)-format label files.

2. More on the tone tier

2.1. Tones and fundamental frequency

As noted above, one of the basic parts of a ToBI transcription for an utterance is an electronic or paper record of the fundamental frequency contour. The transcription of events on the tone tier is closely linked to this record. In the case of pitch accents, the labeller can make this link explicit by choosing to place the label for the pitch accent specifically at the f0 maximum or minimum that realizes the starred tone of the accent, if this f0 event is within the interval of the accented syllable nucleus. (If the maximum or minimum does not actually occur within the syllable nucleus, there are optional conventions for marking the maximum or minimum as well as the accented syllable using the symbols `<' and `>', for a late or early f0 event, respectively -- see Section 4.2 in "The ToBI Annotation Conventions".) There is a more practical connection, as well, inasmuch as most transcribers find the fundamental frequency contour an invaluable aid in making the analysis of the intonation pattern that is embodied by the transcription on the tone tier. In interpreting the f0 contour to make the tonal transcription, it is important to keep in mind that several non-tonal aspects of an utterance can also strongly influence the fundamental frequency pattern. One of the most ubiquitous of these influences is the way in which consonant segments in the utterance interrupt the smooth course of the f0. Voiceless stops such as [p] and [t] and voiceless fricatives such as [f] and [s] create `holes' in the f0 contour just by being voiceless. Moreover, it is usually not possible to read the intended pitch during a voiceless consonant by interpolating from the last f0 value before voice offset to the first f0 after voice onset, because obstruent consonants (stops, fricatives, affricates) all cause dramatic perturbations in the fundamental frequency contour over and above any interruption of voicelessness per se.
As an `intrinsic' characteristic of its voiceless specification, a voiceless obstruent is usually associated with a dip into the consonant constriction and a dramatic fall starting from a much higher frequency just after the consonant release. Even voiced obstruents disturb the f0 contour; a voiced stop or fricative can be associated with a fall into and rise out of an often quite-deep valley during the consonant's constriction. Utterance <> illustrates some of these effects. There is a dip in the f0 around 1.9 s into the file for the [d] at the beginning of "difference" and a sharp fall around 5.29 s right after the [p] in "pink". (To be sure, the perturbation caused by the [p] here is very small compared to many cases of voiceless obstruents that we have seen.)

EXAMPLE <>: what's the difference among my long memory
H* !H* L-L% L+H* !H* H-H%
your blond baby and the pink carpeting
L+H* *? !H* L-H% L* L* H* L-L%

In interpreting such `intrinsic' segmental effects, it is important to note the actual voicing of the consonant, and not simply its phonemic status. For example, phonemically voiced obstruents in stressed syllable initial position for many speakers are not always really voiced. Note, for example, the /b/ of "blond" at about 3.95 to 4.0 s in <>, which is voiceless unaspirated and has f0 perturbing characteristics more like those of the /p/ of "pink". Also, the consonant /t/ in American and Australian English is usually a voiced flap (a short [d]-like segment) when it begins an unstressed syllable, as in the /t/ of "carpeting" in example utterance <>. Similarly, /h/ is often voiced between vowels. Thus, the perturbation caused by these two phonologically voiceless consonants is often like that for a /d/ or a /v/, rather than like a true [t], as shown by the /h/ in example utterance <> (at around 3.04 s).

EXAMPLE <>: The pink carpeting.
H* H* L-L%

EXAMPLE <>: Give him a hand with that.
H* L-L% Example utterance <> gives another environment where flapping is common; see the flapped /t/ across the word boundary at around 1.34s. The flapping here is also important for transcribing break indices (see Section 3.2). EXAMPLE <>: Don't hit it to Joey. H* L*+!H L-L% Another kind of problem in interpreting the f0 contour comes from shifts into voice qualities other than normal modal phonation. For example, for most speakers, subglottal pressure falls very sharply at the very end of an utterance. If the cross-glottal pressure difference becomes very weak, there may no longer be good glottal closings -- i.e. the phonation may become quite breathy -- so that even fairly robust pitch-tracking algorithms can easily fail. For some speakers this switch to breathy voice might happen even earlier if the utterance has a long low-pitched stretch corresponding to a L- phrase accent. Or, a speaker might break into creaky voice in such a region. In fact, many speakers break into creaky voice in almost any region with very low fundamental frequency. Since creaky voice is typically characterized by very irregular glottal periods (i.e., the fundamental frequency is physically not well-defined), pitch-tracking algorithms often do not do well during these portions of the utterance, creating a messy `spattering' of values, like that seen in the f0 trace between 4.95 and 5.08 seconds in <>. Here the creak is due to the L*. EXAMPLE <>: Will you have marmalade, or jam? L* H- L* H-H% The pitch tracker can also completely fail, and give no f0 values at all, as in the region of the L- in the second production of <> after about 3.4s. EXAMPLE <>: Marianna made the marmalade. second production 2) L+H* L-L% In these two examples (as in many other occurrences of the same tone types in many of the example utterances in this labelling guide), the creaky voice is reliably interpreted by native speakers as a very low pitch value for some low tone. 
However, creaky voice does not automatically mean a very low L tone. Creaky voice can also occur as one common manifestation of a glottal stop, a segment which in English often occurs phonetically as a way to set off a word beginning with a stressed syllable that has no onset consonant. For example, the word "airline" in <> begins phonetically with a glottal stop realized as creaky voice. EXAMPLE <>: And set training and experience standards H* H* H- H* H* L-L% for airline inspectors and mechanics. H* H* L- H* L-L% Nor are breathy voice and creaky voice the only source of pitch-tracking errors. Even in parts of the utterance with normal modal voicing, pitch-tracking algorithms can sometimes go wrong because of fluctuations of amplitude or because of the vowel's resonance characteristics. A perfectly ordinary period-to-period oscillation in amplitude can cause a halving of the estimated fundamental frequency value, as illustrated in the region between 4.8 and 4.93 seconds and again between 5.3 and 5.45 seconds in example utterance <>. (Compare this to <>, which is exactly the same utterance, pitch-tracked with somewhat different assumptions about the signal parameters which the pitch-tracking program uses in its consistency-checking algorithm.) Or, if the first formant is much higher than the fundamental, the pitch tracking program might take the amplitude of harmonics that it amplifies as an intervening glottal pulse, effectively doubling the pitch, as in the region between 14.07 and 14.18 seconds in example utterance <>. Transcribers must therefore learn when to trust their ears to catch such misparsings in the fundamental frequency track (or to use an alternate record of the fundamental frequency contour, such as the narrow-band spectrogram). When all of these perturbing effects are taken into account, however, the fundamental frequency contour becomes a valuable aid in transcribing the events on the tone tier. EXAMPLE <>: Jim builds a big daisy-chain. 
H* H* L-L% EXAMPLE <>: Jim builds a big daisy-chain. H* H* L-L% EXAMPLE <>: Then I don't know if I can explain H* L+H* it to you. L-L% 2.2. Some familiar contours, and the contrast between H* and L+H* The inventory of events that are transcribed on the tone tier are five pitch accents, two phrase accents, and two boundary tones (plus downstepped counterparts of pitch accents and phrase accents with H tones). The summary statement in Appendix A lists the symbols for all of these tones and defines their use. In the previous sections, we already illustrated several familiar intonation patterns involving these tones. For example, the first productions of the sentences in example utterances <> and <> were instances of the `declarative contour' which is an intonation phrase containing one or more H* pitch accents and ending in a sequence of L- phrase accent and L% boundary tone -- i.e. (H*) H* L- L%. (When there is more than one accent, particularly when there is one relatively early and one relatively late accent, this contour is often called the `hat pattern'.) The last production in utterance <> illustrated a sequence of L- phrase accent followed by H% boundary tone that is sometimes called the `continuation rise'. The first production in utterance <> was an example of the `yes-no question contour', consisting of one or more L* accents followed by a H- phrase accent and H% boundary tone -- i.e. (L*) L* H- H%. The productions in <> and the first two productions in <> also illustrate one of the more difficult contrasts in pitch accent type -- that between the two types of `peak accent' in which the peak is timed to occur on the accented syllable (H* versus L+H*). These two pitch accents are alike in that both have high fundamental frequency targets timed to occur on the accented syllable. They are alike also in that the actual timing of the f0 peak that realizes the high tone can vary depending on the phonetic length of the syllable and on the neighboring tones. 
In longer syllables just before a L- phrase accent, the peak tends to come fairly early in the syllable, whereas in short syllables with no immediately following tone target, the peak for the high tone can be quite late, sometimes after the actual acoustic end of the syllable. This is illustrated in the hat pattern utterance in <>. The peak for the high tone of the first H* on "word" comes rather late (in the last third of the syllable), whereas the peak for the high tone of the second H* comes very early in "word" before the L- low tone target (during the first quarter of the syllable). How then do the two pitch accents differ? EXAMPLE <>: Your word is your word. H* H* L-L% The essential difference is what happens before the high tone. The leading L tone in L+H* is meant to transcribe a rise from a fundamental frequency value low in the pitch range that cannot be attributed to a L* pitch accent on the preceding syllable or to a L- phrase accent or L% boundary tone at a preceding intermediate-phrase or intonation-phrase boundary. For H*, by contrast, there is at most a small rise from the middle of the speaker's voice range (unless, of course, the H* follows soon after some low tone such as a L* pitch accent or L- phrase accent). Example utterance <> is a minimal pair illustrating this contrast. EXAMPLE <>: Marianna won it. in two productions 1) H* L- L% 2) L+H* L- L% In the English intonation system as described by Pierrehumbert & Hirschberg (1990), H* and L+H* have distinct meanings, which make the latter more likely to occur in a contrastive context such as the one evoked by the second production of the sentence in <>. In theory, this contrast between H* and L+H* can occur anywhere within a phrase. However, the distinction is difficult to make when the accented syllable is the first in the utterance, as in the second production of the sentence in <>. 
These three productions are examples of almost exactly the same patterns as exemplified by the three productions in <>. However, because the word "Anna" has no unstressed syllables before the main stressed one, it is difficult to realize the low tone for the nuclear accent on the first word in the second production. In cases such as this, where the evidence for L+H* comes from (theory-dependent) intuitions about meaning rather than from any clear low pitched region in the fundamental frequency contour, the ToBI Annotation Conventions prescribe H* instead. (The *? on the "married" in the first production illustrates a very common type of ambiguity about accent placement that is discussed below in Section 2.9.) EXAMPLE <>: Anna married Lenny. in three productions 1) H* *? H* L-L% 2) H* L-L% 3) H* L-H% L+H* L-L% Even when there is a long enough stretch between the beginning of the utterance and the accent, L+H* can be difficult to distinguish from H* because the categorical distinction in meaning is not always matched by a categorical distinction in the f0 level of the low tone. (The mapping of phonetic continua onto discrete oppositions is a well-known problem in segmental phonology as well.) Utterance <> above illustrates this. The L tone of the L+H* in the third production is not so low as that in the second production. When such utterances are taken out of context, it is possible for even intonational experts to be confused, and in fact, another transcriber with long experience in transcribing English pitch accents questioned our transcription of this as L+H*. (We are confident in the transcription, and did not mark it as X*? -- see Section 2.9 below -- but only because we know the context.) The last productions in <> and <> are very similar to another type of contour where one needs to be especially careful in choosing between H* and L+H*. 
In both of these sentences, the nuclear stress for the second intonation phrase occurs late enough that the low-pitched region of the L+H* (nuclear) pitch accent could be distinguished even if there were no H% boundary tone intervening between the L+H* pitch accent and the L- phrase accent for the preceding phrase. In the very similar contours of example utterances <> and <>, on the other hand, there is no H% boundary tone, and one must pay close attention to the timing in order to decide whether the accent in the second phrase should be transcribed as H* or L+H*. (Note that the first utterance in <> also probably illustrates grouping at the level of the intermediate phrase and not a full intonation phrase; see Section 2.4 for the difficulty of telling these levels apart in this context.)

EXAMPLE <>: But Marianna knows noone.
L+H* L-L% L+H* L-L%

EXAMPLE <>:
1) That one's for Marianna.
   H* L- L+H* L-L%
2) Give me the brown one for Marianna.
   H* H* L-L% H* L-L%

The first response alternative in example utterance <> illustrates another idiomatic intonation contour which might be confused with L+H*. This is the `surprise-redundancy' contour described by Sag & Liberman (1975). Here the preceding low pitched region comes from a L* pitch accent on a prenuclear accented word. The second response alternative shows the subtle way in which this rising sequence differs from L+H*. The simple interpolation from the L* to the H* is more gradual than the steep rise within the L+H* accent, although the difference can be very subtle when there are only a few syllables between the two accents in the L* H* sequence, as is the case here.

EXAMPLE <>: Who's it for? Mary's mother. It's for Mary's mother.
L* H* L-L% L* H* L-L% *?
L+H* L-L% ******************************************************************** PRACTICE ONE -- H* versus L+H*, L* H* L- L%, L* H- H%, and other accents in familiar contours ******************************************************************** Transcribe these exercises using the exercises script. _______________________________________________________________________ EASY: EXERCISE <>: Amelia. (two productions) EXERCISE <>: Marianna's mother. EXERCISE <>: A new mole. EXERCISE <>: Anna married Lenny. [Compare to last production in <>] EXERCISE <>: That's what I thought. EXERCISE <>: He's lazy and crazy and stupid. _______________________________________________________________________ INTERMEDIATE: EXERCISE <>: Are you going to visit your mother when you're in Nashville? EXERCISE <>: My mother lives in Memphis. EXERCISE <>: Heavy rain possible. High around 70. [Transcribe only the second sentence for now, concentrating on the "seventy".] EXERCISE <>: Are you gonna wear your wellingtons? [Concentrate on the nuclear accent and following tones. (We know that there must be a prenuclear accent of the same type as the nuclear one, but we're not sure where it is.)] EXERCISE <>: Eileen is leaving. (two productions) EXERCISE <>: My classmate who lives in a treehouse was written up in Atlantic. _______________________________________________________________________ DIFFICULT: EXERCISE <>: So, what did you dream? [Don't worry about the tune on "So" for now.] EXERCISE <>: Keep the thermometer under your tongue. [Transcribe only the "under your tongue" for now.] EXERCISE <>: So, a lotta times they fail. [Concentrate on transcribing the pitch accents here, and don't worry about transcribing the "So".] EXERCISE <>: And what happens is, when you... [Concentrate only on the first clause (before "when").] EXERCISE <>: You know what I mean? EXERCISE <>: But anyway, if you can't see that then I don't know if I can explain it to you. 
[Note that the f0 tracker has doubled the pitch on the word "can", and that your transcription of the tones nearby should take this into account.] 2.3. The timing of the phrase accent and "upstep" The examples so far also illustrate two important points about the phrase accent. First the phrase accent is unlike the boundary tone in that it is not necessarily localized at the phrase edge. Rather, when the nuclear accent is far from the end of the intermediate phrase, the phrase accent fills in the space in between it and the phrase edge, creating a long flat valley for L- realized over a long stretch, as in the first speaker's production in example utterance <> (the region between 11.06 and 11.64 seconds), or a long plateau-like region for H- realized over a long stretch, as in the second speaker's production in example utterance <> (the region between 13.3 and 13.9). Second, the H- phrase accent triggers an "upstep" (a local raising of the pitch range to the end of the phrase), so that a following H% boundary tone is realized as a second rise at the end of the plateau-like region. This second point can also be seen clearly in example utterance <> starting at 13.9 seconds. Compare the much lower f0 target for the H% boundary tone in example utterance <> at 11.78, where the H% occurs after a L- phrase accent and therefore is not realized in such an upstepped pitch range. EXAMPLE <>: Anna may know my name, and yours too. Anna may know our names? H* L-H% H* H* L-L% L* H-H% (Some experienced transcribers may want to call the pitch accent on "Anna" in the first sentence a L+H*. 
We remind them of the annotation guidelines that say in effect "Whenever there might be any doubt whatsoever, such as on the absolute utterance initial syllable, choose H* rather than L+H*.") The summary statement on ToBI conventions prescribes that, in a waves(tm) label file, the phrase accent (or phrase accent and following boundary tone) should be marked at a point at or just before the end of the last segment in the word ending the intermediate phrase (or full intonational phrase) and always before the related break-index mark. The conventions say that the phrase accent should be placed here even when the nuclear accent occurs quite early and the phrase tone is realized over a long period of time, as in these two example utterances. Note that when the nuclear accent is close to the end of the intonation phrase, it is impossible to discern any inflection point between the high f0 target for the H- phrase accent and the even higher f0 target for the upstepped H% boundary tone. The upstepped boundary tone after "jam" in example utterance <> illustrated the smooth single rise that results in this case. Example utterance <> illustrates the full paradigm of combinations of phrase accent and following boundary tone that can occur at the end of an intonation phrase. Note that because of the upstep of the pitch range after the H- phrase accent, the L% boundary tone of a H- L% sequence does not have an absolutely low f0 target, just a lower one than that of the upstepped H% boundary tone. The contrast between a H- L% and a H- H% sequence is particularly salient when the preceding nuclear pitch accent is H*, as in the two sentences in example utterance <>. EXAMPLE <>: 1) Is that Marianna's money? H* H* L-H% 2) That's Marianna's money. H* L-L% 3) That's Marianna's money. H* H-L% 4) Is that Marianna's money? L* L* H-H% EXAMPLE <>: My name is Marianna. 
in two productions
1) H* H-H%
2) H* H-L%

The productions in <> of the two combinations with final L% boundary tone illustrate another potential difficulty, and highlight the importance of listening to the speech and not just looking at the f0 record when doing the intonational analysis for the tone tier transcription. When an intonation phrase is not the last one in an uninterrupted stretch of speech and it ends with a L% boundary tone, it is difficult to distinguish from an intermediate phrase ending with the corresponding phrase accent just by examining the f0 contour. That is, the pitch differences between a L- L% sequence and a mere L-, or between a H- L% sequence and a mere H-, are very subtle at best. Here the transcriber must rely on the subjective sense of degree of disjuncture, which is probably cued by such other things as the amount of preboundary lengthening or the degree of final lowering in the case of L- L% versus L-. (Note that the difference here must be transcribed also on the break index tier -- Section 3.) Example utterances <> and <> illustrate this difficulty. In <>, looking just at the f0 contour, we might have L- L% and a full intonation phrase boundary or just L- and a mere intermediate phrase boundary between the nuclear H* pitch accents on "probably" and "pleasantest". The ambiguous durational cues (two experienced transcribers argued even over whether the boundary should be before or after the "the") support the notion that this must be the latter, a mere intermediate phrase boundary. By contrast, the strong sense of pause (caused by the lengthening on "and"?) in the tonally identical stretch between the accents on "shortest" and "probably" supports a full intonation phrase. In <> the two productions contrast the L* H- intermediate phrases typical of a list in the first production with the ambiguous H* H- or H* H- L% plateaus of the second.
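Since the choice between a mere phrase accent and a phrase accent plus boundary tone must agree with the choice between break index 3 and break index 4, the two tiers can be cross-checked mechanically. The following is a minimal sketch of such a check, and is not the checker program mentioned in Section 1.5. The function name check_edges and the (word, edge tones, break index) triples are hypothetical, and break indices are simplified to plain integers without diacritics.

```python
import re

# Edge-tone shapes: phrase accents L-, H- (or downstepped !H-),
# and final boundary tones L%, H%.
PHRASE_ACCENT = r"(?:L-|!?H-)"
BOUNDARY_TONE = r"(?:L%|H%)"
FULL_EDGE = re.compile(rf"^{PHRASE_ACCENT}{BOUNDARY_TONE}$")   # e.g. L-L%, H-H%
INTERMEDIATE_EDGE = re.compile(rf"^{PHRASE_ACCENT}$")          # e.g. L-, H-


def check_edges(words):
    """Return (word, break_index, edge) for every tier mismatch.

    `words` is a list of (word, edge_tones_or_None, break_index) triples.
    Break index 4 should carry a phrase accent plus boundary tone,
    break index 3 a phrase accent only, and lower indices no edge tones.
    """
    problems = []
    for word, edge, break_index in words:
        label = edge.replace(" ", "") if edge else ""   # treat "L- L%" like "L-L%"
        if break_index == 4:
            ok = bool(FULL_EDGE.match(label))
        elif break_index == 3:
            ok = bool(INTERMEDIATE_EDGE.match(label))
        else:
            ok = label == ""
        if not ok:
            problems.append((word, break_index, edge))
    return problems


# Second production of "`I' means insert": L- with break index 3, L-L% with 4.
consistent = [("I", "L-", 3), ("means", None, 1), ("insert", "L-L%", 4)]
# A deliberately inconsistent variant: full L-L% marked with break index 3.
inconsistent = [("I", "L-L%", 3), ("means", None, 1), ("insert", "L-L%", 4)]

print(check_edges(consistent))
print(check_edges(inconsistent))
```

A check of this kind cannot decide which analysis is right in an ambiguous case like the ones discussed above. It only flags places where the tone tier and the break index tier disagree about the level of the boundary.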
EXAMPLE <>: Definitely the shortest and probably the pleasantest H* L- H* L-L% H* L- H* way to go is through the park. L- L+H* L-L% EXAMPLE <>: 1) Let's see I need oregano 'n marjoram 'n some H* H* L-L% L* H- L* H- fresh basil okay? L+H* !H* L- H* H-H% 2) Oh I don't know it's got oregano 'n marjoram H* !H* !H* L-L% H* H- H* H- 'n some fresh basil. H* H-L% The f0 patterns on "oregano" and "marjoram" illustrate also another difference between the H- and L- phrase accent, particularly in the contexts of unlike tones on the preceding nuclear accent -- i.e. in the context of L* H- versus, say, H* L-. When the nuclear accent is on an early syllable in the last word in the phrase, the L- of a H* L- sequence seems to kick in very immediately with a sharp fall that typically begins during or just after the accented syllable. In the analogous situation, the rise from a L* to a H- begins as early, but the f0 change is much more gradual. Here, for example, the f0 seems to be rising continuously from about a third of the way into the accented syllable all the way to the end of the phrase. In this case, there is no real inflection point leveling out into a plateau. The last clause of the first production in <> also illustrates anew the difficulty mentioned above in connection with utterance <> in Section 2.2. What is the best analysis of the fall to a low level immediately after "marjoram" and subsequent rise to a high f0 on "fresh"? How can we distinguish, say, a sequence of L* H* from the L+H* that we have transcribed? One thing to note is that, since accented syllables must be stressed, other characteristics of a syllable must be compatible with a tonal analysis that puts a pitch accent on it. The words "and" ("'n") and "some" here do not sound stressed at all. Both have been reduced to the point that they have syllabic nasals as their nuclei. 
This supports the analysis of L+H* on "fresh" over an analysis of L* H*, even though the fall from "marjoram" looks so much steeper than the gradual rise from "and" back up to the H tone on "fresh" that the f0 pattern may seem more compatible with a L* on "and". Note, however, that there may be mistracking due to breathy voice on "and". Also, the "some" shows a strong perturbation from the initial voiceless [s] that obscures how low the intended f0 is later in the syllable. 2.4. Difficult combinations of nuclear pitch accent and following phrase accent In most of the examples so far, nuclear H* has occurred before L-, where the following fall in pitch makes it easy to discern the pitch accent, and nuclear L* has occurred only before H-, where it was easy to spot from the immediately following rise in pitch. But the choice of pitch accent type is independent of the choice of following phrase accent, and there is nothing to preclude H* from occurring before H- or L* before L-. The second production in <> illustrated the first case of this `stylized high-rise' contour (Ladd, 1980), which is becoming more and more familiar to contemporary American English speakers. The combination of L* and following L- is also not rare. There are two situations where this sequence is typically encountered. The first is illustrated in <>, and the first sentence in <>. This L* L- H% pattern is typical of such vocative tags. The second sentence in <> shows that tag questions can have this contour too. However, tag questions can also take a H* L- L% intonation pattern (the third sentence in <>), which seems to be precluded on the vocative tag for pragmatic reasons (see Beckman & Pierrehumbert, 1986). EXAMPLE <>: Oh don't nuzzle me you marmalade-nose. X*? L- H* !H* L- L* L-H% (Section 2.8 will explain the `!' diacritic in the second pitch accent, and Section 2.9 will explain the X*? accent on "Oh".) EXAMPLE <>: 1) Where are you going, Willy? H* L- L* L-H% 2) He won't be going, will he? 
H* H* L- L* L-H% 3) He won't be going, will he? H* H* L- H* L-L% The f0 contour for the L* L- H% vocative tag contour can be confused with a longish postnuclear stretch in the sequence H* L- L%, as shown in example utterance <>. As with medial L- intermediate phrase boundary versus L- L% intonation phrase boundary discussed above, the transcriber may have to rely entirely on the subjective impression of greater versus lesser disjuncture to capture this difference between an intermediate phrase boundary at a vocative tag and no boundary. (Note again that the difference here must be accompanied by different symbols on the break index tier.) EXAMPLE <>: 1) Anna will win, Manny. H* L- L* L-H% 2) Anna will win Manny. (She won't lose him). H* H* L-L% The other situation in which one often sees a L* nuclear accent and following L- phrase accent is in the `contradiction contour', an intonational idiom illustrated in <> and <>. This contour is discussed at length in Sag & Liberman (1975) and chapter 3 of Ladd (1980). The L* L- H% sequence starting at the nuclear syllable is like the contour in the vocative tag, but this is not the only essential component of this intonational idiom. Crucially, there must be a fall from an early prenuclear H* pitch accent (or from an initial %H boundary tone -- see next section) onto a nearby L* accent. If the L* nuclear accented syllable is far from the beginning of the utterance (as is the case with the nuclear accent on "incurable" in <>), there might be another L* on some prenuclear syllable with relatively prominent secondary stress (e.g., the fourth syllable of "elephantiasis" in <>). Note that <> also illustrates the possibility of having two pitch accents on one word when there is more than one full stressed syllable (see Section 2.9 for more examples). EXAMPLE <>: Ah Gloria you're not ugly. H* L* L-L% H* L* L-H% EXAMPLE <>: Elephantiasis isn't incurable. H* L* L* L-H% 2.5. 
The initial %H boundary tone The contradiction contour also illustrates another phenomenon that we have not discussed so far -- namely, the possibility of a boundary tone marking the initial as well as the final boundary of an intonation phrase. Utterance <> is an example. Here the event that provides the high pitch for the early fall onto the L* tone cannot be an accent, since the first syllable of "bananas" is reduced (i.e. completely unstressed and hence unaccentable). EXAMPLE <>: Bananas aren't poisonous. %H L* L* L-H% In the intonational analysis assumed in the ToBI system, the final boundary tone is mandatory, whereas an initial one is not. The initial boundary tone differs from the final ones also in that it seems to be limited to absolute utterance-initial position, and in that it is always high. Thus, unlike the final boundary, where there is a paradigmatic choice between L% and H%, the phrase-initial boundary tone contrasts merely with the absence of a boundary tone. That is, %H contrasts with the default (unmarked) initial pattern, which in absolute utterance-initial position tends to start in the middle part of the speaker's pitch range (as opposed to beginning of utterance-medial intonation phrases, where the pitch simply continues from the value at which the previous phrase ended). This utterance-initial midrange pitch value is illustrated in <>, where the first and second productions show a rise from the mid value to H* and a fall from the same default mid value to L*, respectively. The third production then contrasts with the second in that it has an initial %H boundary tone. These two examples also illustrate the typical effect of having an initial %H boundary tone in the surprise-redundancy contour. The one with the initial %H has a greater vividness, conveying either more surprise or more insistence that this is the information that the hearer should really already know. EXAMPLE <>: You need a loan. 
In three productions
1) H* H* L-L%
2) L* H* L-L%
3) %H L* H* L-L%

The ToBI conventions prescribe that %H be an analysis of last resort. That is, like L+H*, which is used instead of H* only when there is no other possible explanation for the low pitch before the peak (see Section 2.2, above), %H is used only when there is no other plausible explanation for an initial high pitch. It should be marked only when a high-pitched beginning for an utterance cannot be attributed to a H* accent on the first few syllables in the utterance -- i.e., when the first word itself does not appear to be accented or when its accented syllable occurs too far into the word to account for the initial high target. Thus it should not be used in <>, where the high pitch at the beginning of the phrase "You're not ugly" is attributed to a H* accent on the first syllable "You're". In <>, similarly, although the main stress of the word "elephantiasis" clearly is at the L* accent on the fourth syllable, we have the option of analyzing the earlier high pitch as another pitch accent earlier in the word, since the first syllable has a lexical `secondary stress' (i.e. is rhythmically more prominent than the surrounding syllables).

***********************************************************
PRACTICE TWO -- phrase accent and boundary tone contrasts
***********************************************************

The following examples are practice utterances for the phrase accent and boundary tone contrasts discussed in the last few sections. Transcribe these exercises using the exercises script.
_______________________________________________________________________
EASY:
EXERCISE <>: Does Manitowoc have a bowling alley?
EXERCISE <>: For Marianna's mother.
EXERCISE <>: Would you like some cream?
EXERCISE <>: No, I think I'll wear my hiking boots.
   [Don't worry about transcribing the tune on "No" for now.]
EXERCISE <>: You lost your voice.
EXERCISE <>: Oh nothing special, you know flour and butter and sugar.
[Transcribe just the second part, after the "you know".] EXERCISE <>: Good evening radio audience. [Don't worry about the transcription of "radio".] EXERCISE <>: I thought it was good. EXERCISE <>: Legumes are a good source of vitamins, and of protein as well. EXERCISE <>: Legumes are a good source of vitamins, but not the best. [Transcribe only the first part, up through "vitamins".] EXERCISE <>: Legumes are a good source of vitamins, and so are greens. [Transcribe only the first part, up through "vitamins".] EXERCISE <>: I was wrong, and Stalin was right. I was wrong. _______________________________________________________________________ INTERMEDIATE: EXERCISE <>: A friend of mine um works for NASA. EXERCISE <>: I thought it was good? [Play <> for contrast.] EXERCISE <>: They've eaten the pigs. (two productions) EXERCISE <>: I need flour and sugar and butter and oh I don't know. [Transcribe only the part up through "butter" for now.] EXERCISE <>: Yes I would uh like the information on the flight leaving from uh Philadelphia to Atlanta. [Concentrate just on the parts "like the information" and "Philadelphia to Atlanta".] EXERCISE <>: Good afternoon. Information Services. EXERCISE <>: Mostly they just sat around and knocked stuff. You know, the school, other people. [Concentrate for now on the second sentence, starting at "You know..."] EXERCISE <>: I'm not going to drive to school today. EXERCISE <>: There's a spoon in here. _______________________________________________________________________ DIFFICULT: EXERCISE <>: I've told you a million times! It's for Mary's mother. [Compare <>; don't agonize too much over the tones around "for" in the second sentence.] EXERCISE <>: Well I mean, would you hire somebody that doesn't have no experience? EXERCISE <>: That's right at the traffic light. (two productions) 2.6. 
Pitch accent timing, and the L*+H pitch accent The examples so far have illustrated both possible phrase accents, both boundary tones, and three of the five types of pitch accent -- the low accent (L*), the plain `peak accent' (H*), and the `rising peak accent' (L+H*). We have also discussed the timing of the f0 peak in the two types of peak accent, pointing out that it is somewhat variable; in particular, that it occurs somewhat later relative to the segments of the accented syllable when it is the accent at the beginning of a `hat pattern' contour and relatively earlier before L- (see Section 2.2). Such differences in timing are not distinctive, and seem to be related to the phonetics of pre-boundary lengthening. For example, we might think of the relatively earlier placement of the peak in the latter case as a matter of lengthening the part of the syllable after the nuclear pitch accent peak in order to accommodate the L- phrase accent within the intermediate phrase (see, e.g., Silverman & Pierrehumbert, 1989, for a discussion of such phonetic accounts). These phonetic differences in timing can be ignored in the transcription on the tone tier. There is another difference in the timing of apparent peak accents, however, that must not be ignored, because it is distinctive. Both the small rise from mid pitch that is usually seen with an utterance-initial H* accent and the definitive rise from low pitch that is necessarily seen to transcribe a L+H* accent contrast phonologically with another accent type that involves a rise from low pitch into a peak that occurs much later, making the low tone align with the accented syllable. This is the `scooped' accent L*+H, illustrated in the first production of <>. The second production in this example utterance is of the contrasting `rising peak' accent L+H*. 
These two pitch accents have very different meanings, as described by Ladd (1980) and Ward & Hirschberg (1985), and the difference in timing here is a phonological difference that is represented in the ToBI system by the contrasting specifications of L*+H versus L+H*. That is, phonologically, both of these accents are a L plus a H, but in the `scooped' accent, the L is the starred tone (associated to the accented syllable) rather than the H. The associated phonetic difference is that the rise is much later in the `scooped' accent, and it is the timing of the minimum f0 relative to the segments of the associated syllable that is salient.

EXAMPLE <>: Only a millionaire.
in two productions
1) H* L*+H L-H%
2) H* L+H* L-H%

Because it is the L target in the `scooped' accent that is associated to the stressed syllable, and not the H, the high pitch target is specified only as occurring somewhat later than the L, and the timing of the peak f0 relative to the segments is not controlled. If the stressed syllable is long, the rise to the peak might be accomplished entirely within the accented syllable. But if the stressed syllable is short, the peak may occur one or more syllables later. This is illustrated in example utterance <>, which shows a relatively fixed rise relative to the low f0, which makes the peak occur within the last part of the long syllable "Stein" but two syllables later relative to the short accented first syllable in "rigamarole".

EXAMPLE <>: Stein's not a bad man.
L*+H L-H%
Rigamarole is monomorphemic.
L*+H L-H%

Note that although the crucial difference between L+H* and L*+H is the timing of the low-pitched portion, some speakers produce a secondary difference, whereby the L of L*+H is consistently somewhat lower in pitch than the L of L+H*. This is particularly apparent in the first accents of the two productions in <>. This also means that L*+H is not nearly so confusable with H* as is L+H*. There are also considerable interspeaker differences.
Some speakers have rather mid-level L tones even in L*+H. (This fact will be relevant when you transcribe <> and <> in PRACTICE THREE.) Other speakers have very low L tones even in L+H*. This does not affect the relative heights of the L's in L*+H versus L+H*.

EXAMPLE <>: There's a lovely one in Bloomingdale's.
in two productions:
1) L*+H L*+!H L-H%
2) L+H* L+!H* L-L%

Another thing to note in the sequence of accents in these two productions is the introduction of another set of symbols for the second (the nuclear) accents. These new symbols are actually the same accent types as the first accent in their respective utterances. The extra `!' in the symbols for the nuclear accents is a diacritic to denote the way in which the second accent peak is lower than the preceding peak. This lowering of the second peak is due to a process called `downstep', which is defined as a categorical compression of the pitch range that reduces the f0 targets for any H tones subsequent to the specification of the downstep -- i.e. the counterpart of the `upstep' triggered by the H-. We will describe downstep and what triggers it in more detail in Section 2.8 below after introducing the last remaining pitch accent type in the ToBI analysis, transcribed as H+!H*.

Finally, as in deciding how low the f0 must be to count as L+H* rather than H*, transcribers should be aware of slight interspeaker differences in the timing of the L tone in differentiating L+H* from L*+H. Our impression is that American speakers (such as the speaker of <>) do not always make L*+H rise as late as most RP British speakers do. The second (downstepped) L*+!H on the word "Bloomingdale's" in the first production, in particular, might seem quite early to a British transcriber. Note, however, that there is a very low pitch level throughout the [b] and the [l], and the f0 does not begin to rise until the voicing begins in the [u], making the peak occur considerably after the [m] release.
This is quite late for a nuclear L+H* before a L- (cf. our comments above in Section 2.2.), as can be seen by comparing this rise to the rise in the comparably downstepped nuclear L+H* in the second production. In the second production, the rise begins before the [b] and is completed well before the release of the [m]. 2.7. The H+!H* pitch accent The nuclear accent in the second production in example utterance <> illustrates this pitch accent type. It is characterized by a fall from a preceding higher pitch onto a lower pitch level on the accented syllable. This accent type corresponds to the type called H+L* in Pierrehumbert's original system. The substitution of the letters `!H' for `L' in the name of the pitch accent reflects the fact that the pitch target on the accented syllable is only somewhat lower than the preceding H tone target; it is not so low as the f0 target for the plain `low' accent (L*) or for the L tone of the `scooped' accent (L*+H), or even for the L of the `rising peak' accent (L+H*). The renaming of this pitch accent type was intended to make the analysis somewhat more concrete and intuitive for the transcriber. EXAMPLE <>: You want an example? How about Mother Theresa? H* H* H-H% H* *? H* L-L% You want an example? Mother Theresa. H* H* H-H% H+!H* L-L% (The *? in the first production indicates uncertainty about whether that word is accented -- see Section 2.9.) 2.8. Downstep Downstep is a phonologically triggered compression of the pitch range that lowers the f0 targets for any H tones subsequent to a downstep trigger. In Pierrehumbert's model of intonation, downstep is said to be triggered by any bitonal pitch accent. In example utterance <> discussed above, for example, the progressive reduction of the second L+H* or L*+H peak relative to the preceding one would be analyzed as an automatic consequence of the fact that these two pitch accent types are composed of two tones, L plus H. 
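The idea that downstep follows automatically from a bitonal accent can be made concrete with a small sketch. The function below is purely illustrative and is not part of any ToBI tool: the name `mark_downstep` is hypothetical, it handles only the accents H*, L*, L+H*, and L*+H, it ignores phrase accents and phrase boundaries, and it assumes that a trigger stays pending until the next accent containing a H tone. Given a sequence of accents in which downstep is left implicit, it adds the `!' diacritic to the first H-bearing accent after each bitonal accent.

```python
# Illustrative sketch only; names and simplifications are assumptions,
# not part of the ToBI conventions or of any existing labelling software.

BITONAL = {"L+H*", "L*+H"}          # accents assumed to trigger downstep
H_BEARING = {"H*", "L+H*", "L*+H"}  # accents that contain a H tone
KNOWN = H_BEARING | {"L*"}


def mark_downstep(accents):
    """Return the accents with '!' added where downstep is implied.

    `accents` is a list of pitch accent labels from one intermediate
    phrase, written without any explicit downstep marking.
    """
    marked = []
    pending = False  # True when a bitonal accent has triggered downstep
    for accent in accents:
        if accent not in KNOWN:
            raise ValueError(f"unsupported accent label: {accent!r}")
        label = accent
        if pending and accent in H_BEARING:
            # Mark the downstep on the H tone itself: H* -> !H*, etc.
            label = accent.replace("H", "!H")
            pending = False
        if accent in BITONAL:
            # A bitonal accent triggers downstep of the next H tone.
            pending = True
        marked.append(label)
    return marked


print(mark_downstep(["L*+H", "L*+H"]))   # the two `scooped' accents
print(mark_downstep(["L+H*", "L+H*"]))   # the two `rising peak' accents
```

For the two accent sequences discussed above, the sketch yields L*+H L*+!H and L+H* L+!H*, respectively.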
In the ToBI system, this compression of the pitch range is marked by having alternative names for accents which are used for the first downstepped high tone target after the downstep trigger. Thus in the first production in example utterance <>, the second `scooped' accent is transcribed with L*+!H rather than L*+H to denote that a downstep has occurred. And similarly in the second production in this example utterance, the second `rising peak' accent is transcribed with L+!H* rather than L+H*. When there are more than two such bitonal accents in a row, each accent triggers another instance of downstep, so that each subsequent accent peak is reduced yet again relative to the immediately preceding one. This is illustrated in example utterance <>. Example utterance <> shows that it is not just pitch accents which are affected by downstep. The !H- phrase accent here is reduced to a mid level by the downstep triggered by the preceding L+H* nuclear pitch accent. (Note the characteristic mid-tone tail, as the downstepped !H- phrase accent triggers a subsequent upstep of the L% boundary tone.) EXAMPLE <>: There's a lovely yellowish old one. H* L+H* L+!H* L+!H* L-L% EXAMPLE <>: Marianna. L+H* !H-L% Transcribers who are familiar with Pierrehumbert's system will recognize that ToBI differs in this explicit marking of the reduced pitch range directly on the first H tone affected by the downstep. In Pierrehumbert's system, downstep is not explicitly marked because it is redundant to the specification of the trigger in the preceding bitonal accent. Pierrehumbert's system differs from ToBI in yet another way; it includes a sixth pitch accent type, H*+L, which bears the same relationship to H+L* (ToBI's H+!H*) as L*+H does to L+H*. That is, the fall to a slightly lower pitch target occurs after the accented syllable instead of into the accented syllable. 
Typically, the endpoint of this fall is no lower than the pitch target of a subsequent downstepped H tone, and the contrast between H* and H*+L thus hinges on recognizing the downstep triggered by the H*+L. Many first-time transcribers find this comparatively abstract analysis unintuitive and therefore difficult. In the ToBI system, therefore, we have eliminated H*+L in favor of marking the downstep directly on the first reduced H tone. Thus H* in ToBI corresponds to both plain H* and the downstep-triggering H*+L. Users of databases transcribed with the ToBI system who need to analyze the data in terms of the intonational categories in Pierrehumbert's system can recover each H*+L tone by searching for a downstepped !H* or !H- marked immediately after a H* (or !H*) accent. For example, in utterance <> the second production is a plain `hat pattern' (H* H* L- L%) whereas the first is a `downstepped hat', which would be transcribed as H*+L H* L- L% in Pierrehumbert's system. The second production in utterance <> illustrates another very familiar intonation pattern, the `calling contour', which in Pierrehumbert's system would be transcribed with H*+L H- L%.

EXAMPLE <>: That's really illuminating.
in three productions
1) H* !H* L-L%
2) H* H* L-L%
3) Transcribe this one in PRACTICE THREE

EXAMPLE <>: Anna.
in two productions
1) L* H-H%
2) H* !H-L%

A fact to note about downstep is that it is local to an intermediate phrase. Each new intermediate phrase represents a new paradigmatic choice of pitch range, at which downstep can be reset. This is illustrated in <>, where the intermediate phrase boundary after "yellowish" allows a new choice of pitch range, so that the peak on "old" is not downstepped relative to that on "yellowish", unlike in <>. (See Section 2.9 to read about the X*? symbol marking the peak on "old". It indicates that there is an accent on "old" but we are not completely certain which type of accent it is.
It could be simply H* after an unexpectedly steep rise from the preceding L-. Or it could be L+H*, with a less steep rise than expected.)

EXAMPLE <>: It's lovely and yellowish, and it's an old one.
L+H* L+!H* L- X*? L-L%

Note, however, that the peak on "yellowish" looks downstepped relative to that on "lovely". This is due to the relationship between the pitch ranges chosen for the two intermediate phrases. The topic structure of a discourse is marked in part by the choice of pitch range for the succession of intermediate phrases; large topics begin with expanded pitch range and end with very reduced pitch range (see, e.g., Brown, Currie, & Kenworthy, 1980; Hirschberg & Pierrehumbert, 1986). This often creates an effect of `paragraph intonation' in which the relationship among successive phrasal pitch ranges mimics the phrase-internal relationship between preceding peaks and subsequent lower downstepped peaks. The successive pitch ranges in utterances <> through <> in PRACTICE TWO above illustrate `paragraph intonation' over a longer time frame. Sometimes it is not easy to tell the difference between two phrase-internal accents with the second downstepped relative to the first and two intermediate phrases with the second phrase in a lower pitch range relative to the first. Example utterance <> illustrates such a difficult case.

EXAMPLE <>: There are many intermediate levels.
L+H* L+!H* L+!H* L-L%

*******************************************
PRACTICE THREE -- L*+H, H+!H*, and downstep
*******************************************

Transcribe these exercises using the exercises script.
_______________________________________________________________________
EASY:
EXERCISE <>: There's a lovely yellowish old one.
   [Compare to <>.]
EXERCISE <>: That's really illuminating.
   [Transcribe third production now; first two are examples from 2.8.]
EXERCISE <>: Eileen's pro-English.
EXERCISE <>: Eileen's pro-English.
   [Compare to <>.]
EXERCISE <>: Marianna.
(two productions) EXERCISE <>: Becoming windy. EXERCISE <>: Okay to get from home to the station. EXERCISE <>: But uh in fact I have to go along the main road for a little ways it's probably about three hundred yards. EXERCISE <>: Legumes are a good source of vitamins, but not the best. [Repeated exercise from PRACTICE TWO. Now transcribe the second part, after "vitamins".] _______________________________________________________________________ INTERMEDIATE: EXERCISE <>: A friend of mine works for NASA. [Compare to <>.] EXERCISE <>: You give him an inch, he takes a mile. EXERCISE <>: That's really illuminating. (three productions) EXERCISE <>: It would be nice to be able to go right out the back door and into the park cause it's actually right behind the house. EXERCISE <>: I need flour and sugar and butter and oh I don't know. [Repeated exercise from PRACTICE TWO. Transcribe only the part after "butter".] EXERCISE <>: ... and denies speculation that Chief of Staff, John Sununu, is meddling in the region's environmental affairs. EXERCISE <>: We have a lean mini-noodle with beans. Well, we have a lean mini-noodle dish. EXERCISE <>: We have a lean mini-noodle with beans. We have a lean mini-noodle dish. EXERCISE <>: John Romanelli, John Romanelli, please return to the ticket counter. EXERCISE <>: Do you really think it's that one? (two productions) EXERCISE <>: Your word is your word. [Compare to <>.] _______________________________________________________________________ DIFFICULT: EXERCISE <>: Do you really think it's that one? (two more productions) [Don't agonize too much over the tones around "Do you" in the second production.] EXERCISE <>: Heavy rain possible. High around 70. [Repeated exercise from PRACTICE ONE. Transcribe the first sentence now, concentrating on the "rain".] EXERCISE <>: My classmate who lives in a treehouse was written up in Atlantic. [Compare <> in PRACTICE ONE.] EXERCISE <>: Mostly they just sat around and knocked stuff. 
You know, the school, other people. [Repeated exercise from PRACTICE TWO. Concentrate now on the first sentence, the part before "You know..."] EXERCISE <>: If he can then there's no argument about it. (two productions) EXERCISE <>: Sublime mnemonic rhyme and free meter. EXERCISE <>: Sublime mnemonic rhyme and free meter. EXERCISE <>: And what happens is: when you... when you buy my business, and you try to run my business, it's really hard for you to run my business. So a lotta times they fail. [Concentrate particularly on the "When you buy my business", and don't worry about the preceding interrupted "when you..."] EXERCISE <>: A lot of people have done this; they sell their business, and they have... If something goes wrong, and they have the first rights to buy it back. [Interviewer: Oh, really?] [You've already transcribed parts of this in earlier practice sets. Here concentrate on filling in the missing pieces up through "something goes wrong", leaving for now the "and they have..."] 2.9. Uncertainty about accent placement and accent type In addition to conveying topic structure, pitch range variation is used for many discourse purposes. For example, a lower (or higher) pitch range than surrounding phrases can be used to set off a stretch of speech as an aside or a parenthetical. This is illustrated in example utterance <>. Also, expanded pitch range can convey extra liveliness or involvement, as illustrated in the much larger pitch range on the phrase "Now be careful" in <>. EXAMPLE <>: Capote died Saturday at the Bellaire home of L+H* !H* L- H* H* L+H* L- Joanne Carson (estranged wife of talkshow host L+H* L-L% L+H* L- H* *? Johnny Carson), and she was among those who H* L-L% L+H* !H* !H* eulogised him. H+!H* L-L% EXAMPLE <>: Okay now chop the onions... Now be careful. H* L- H* H+!H* H- L* L+H* L-H% Okay, chop the onions, and put them into that bowl. 
L+H* L-L% H+!H* H- H* H+!H* L-L%

Because speakers can vary their pitch ranges seemingly without limit to convey discourse organization or degree of involvement, and because downstep can happen many times within a single phrase, sometimes it is difficult to tell whether a tone is H or L, even when one is sure a tone is there. The accents on "smoke" and "yeah" in the asides in example utterance <> illustrate this. Or, in very reduced pitch ranges, it can be difficult to tell whether a syllable is accented or not. Utterance <> illustrates this. In the very reduced pitch range after the second downstep on "else", it is impossible to know how many accents there are.

EXAMPLE <>: Can I smoke? <> X*? H-H% Yeah? <> No, it doesn't have to be; you can close it.

EXAMPLE <>: He sold it to somebody else, they bought the
H* !H* !H* *?
whole company, and he made lots of money on
*? -X? *? *?
the business...
H* L-L%

In the first type of uncertainty, ToBI prescribes that the transcriber use the notation X*? to simply mark the clear presence of an accent, without forcing an arbitrary commitment to the accent's type. Thus, the accent on "smoke" should be transcribed as X*? rather than as L* or H*. (X*? should not be used to mark uncertainty between the L- and H- phrase accents or between the L% and H% boundary tones. There the transcriber should instead mark X-? for the phrase tones and X%? for the boundary tones.) In the second type of uncertainty, when the transcriber is not certain even that there is a pitch accent (as, for example, on the "bought" in utterance <>), the mark *? should be used instead of X*?.

In addition to very compressed pitch ranges, there are several particular tone sequences which are prone to inducing uncertainty about the presence of accent. One such case is the downstepped H* !H* !H* ... sequence just illustrated. In many cases, words after the first H* in such sequences are ambiguous between being accented with !H* and being `deaccented' (i.e.
being in the postnuclear low stretch in a H* L- L% sequence). This is not always ambiguous, however. Utterance <> illustrates a clear contrast between downstepped and deaccented. EXAMPLE <>: Anna married Lenny. in two productions 1) H* L-L% 2) H* !H* !H* L-L% Another case of inherently ambiguous tone sequences which occurs very commonly is when there is a long stretch of speech in a `hat pattern' contour. Utterance <> illustrates this. The word "off" sounds very prominent, giving a strong subjective impression of accent. However, because the word lies in the plateau between the first H* on "peeled" and the nuclear H* pitch accent on "Hawaii", it is difficult to tell whether "off" also bears a H* accent, or just the preceding "peeled". The word "host" in the phrase "talkshow host Johnny Carson" in <>, the word "married" in the first production of <>, and the word "Mother" in the first production in <> in Section 2.7 above are three more illustrations of this very common ambiguity. The first production in example utterance <> (given in Section 1.4 above) illustrates the analogous situation with a L*; "Marianna" probably has a L* accent (note the dip down into it from the mid pitch level that begins the sentence) and "marmalade" clearly has a L* accent, but what about the "made" in between? EXAMPLE <>: [Ever since the roof of a 19-year old Aloha Airlines Boeing 737] peeled off over Hawaii last April, ... H* *? H* L- H* L-L% (See example utterance <> in the next PRACTICE for the whole context of this phrase.) In cases such as these it is better to err on the side of conservatism and mark the word with *? or nothing. In particular, the transcriber should take care not to let grammatical expectations guide the marking of accents. If we find ourselves giving in to such thoughts as "This is a content word and therefore probably is accented", we preclude the use of our transcriptions to test whether content words are indeed likely to be accented. 
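The tone labels and uncertainty marks introduced so far can be summarized in a small lookup. The sketch below is only an illustration and is not part of the ToBI distribution: the function name `classify_tone_label` and the category strings are hypothetical, and the label inventory is limited to the symbols discussed in the text up to this point. Labels outside that inventory are returned as None so that a caller can flag them for review rather than guess.

```python
# Illustrative sketch only; the function name and category strings are
# assumptions made for this example, not ToBI conventions.

PITCH_ACCENTS = {"H*", "!H*", "L*", "L+H*", "L+!H*", "L*+H", "L*+!H", "H+!H*"}
PHRASE_ACCENTS = ("!H-", "H-", "L-")
BOUNDARY_TONES = ("H%", "L%")

# Uncertainty marks described in Section 2.9.
UNCERTAIN = {
    "X*?": ("pitch accent", "type uncertain"),
    "*?": ("pitch accent", "presence uncertain"),
    "X-?": ("phrase accent", "type uncertain"),
    "X%?": ("boundary tone", "type uncertain"),
}


def classify_tone_label(label):
    """Return (category, certainty) for one tone-tier label, or None."""
    if label in UNCERTAIN:
        return UNCERTAIN[label]
    if label in PITCH_ACCENTS:
        return ("pitch accent", "certain")
    if label == "%H":
        return ("initial boundary tone", "certain")
    if label in PHRASE_ACCENTS:
        return ("phrase accent", "certain")
    if label in BOUNDARY_TONES:
        return ("boundary tone", "certain")
    # Combined labels such as L-L%, H-H%, or !H-L% at the end of an
    # intonation phrase: a phrase accent followed by a boundary tone.
    for phrase_accent in PHRASE_ACCENTS:
        if label.startswith(phrase_accent):
            if label[len(phrase_accent):] in BOUNDARY_TONES:
                return ("phrase accent + boundary tone", "certain")
    return None


for label in "H* !H* !H* *? X*? H-H%".split():
    print(label, classify_tone_label(label))
```

Keeping the uncertainty marks in a separate table makes it easy to count how often a transcriber fell back on X*? or *? in a given database.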
A final source of uncertainty arises particularly when transcribing sentences in isolation, extracted from the context in which they originally occurred. This is uncertainty about accent type due to unfamiliarity with a particular speaker's normal speaking range for a particular style of speech. For example in utterance <>, the nuclear pitch accent on "hurt" is probably L*; the pitch is lower than the "neutral" value at the beginning of the utterance. However, 200 Hz is very high for a low, and unless one knows from experience that this speaker has a very high-pitched voice, one might be tempted to transcribe this utterance with a H* nuclear accent.

EXAMPLE <>: But would it hurt you?
*? X*? H-H%

Utterance <> also illustrates this point. Here we have transcribed each of the accents in the second speaker's response as L*, since we know from many other examples in this labelling guide that this speaker's normal range is higher than the first speaker's voice. This utterance also exemplifies an intonation pattern we have not shown before: a sequence of all low tones, for all of the accents, the phrase accent, and the boundary tone.

EXAMPLE <>: Here's your Chateaubriand, ma'am.
H* L+H* L- L* L-H%
I don't eat beef.
L* L* L* L-L%

2.10. When something is accented that you would not expect

It is also important not to fall into the obverse reasoning and hesitate to mark an accent simply because the accented word is not a content word. Example utterance <> is a nice illustration of this point. Here, the pitch pattern unequivocally supports a nuclear `peak' accent on the "and". There is a suggestion of accent on the same word in <>, but less clearly; here the glottalization at the beginning of the word is the main cue that it is accented (recall that stressed vowel-initial syllables are set off by glottal stops -- Section 2.1).

EXAMPLE <>: ...design improvements, and a schedule...
H* H* L-H% H* L- H* L-L% EXAMPLE <>: Hennessy is widely respected for his legal H* L- H* !H* L-L% L+H* !H- scholarship and his administrative abilities. H* H- H* *? H* L-L% Another thing to watch out for is that in very emphatic speech, a word with two fairly strong syllables can bear two pitch accents. Utterance <> is such a case, where our normal expectation is that only the most stressed final syllable in "understand" will be accented. Here, however, the first syllable is also accented, so that the phrase "to understand" can realize the `surprise-redundancy' contour L* H* L- L%. EXAMPLE <>: I'm simply trying to get you to understand. H* L* H- L* H- L* H- L* H* L-L% Example <> gives another case of such double accents, apparently without the impetus of realizing any particular intonational idiom. EXAMPLE <>: from Philadelphia to Dallas L+H* !H* L- H* L-L% This phenomenon of double accents, and the apparently related phenomenon of `stress shift', have been examined extensively by Stefanie Shattuck-Hufnagel and her colleagues (see Ross, Ostendorf, & Shattuck-Hufnagel, 1992) in a corpus of newscasts which they have transcribed using something like the ToBI system. Some of the exercises involving this phenomenon are from this study. It is apparently fairly common. ***************************************************************** PRACTICE FOUR -- uncertainty about pitch accent placement or type ***************************************************************** Transcribe these exercises using the exercises script. _______________________________________________________________________ EASY: EXERCISE <>: Legumes are a good source of vitamins, and so are greens. [Repeated exercise from PRACTICE TWO. Transcribe the second part, after "vitamins".] EXERCISE <>: He sold the business to somebody else EXERCISE <>: State law now requires public construction projects to set aside 1% of their budgets for artwork. [Concentrate particularly on the part from "to set aside" on.] 
EXERCISE <>: I know we've gotta do it but I don't know how to do it. [There isn't an intermediate phrase break between "how" and "to". You'll learn how to transcribe sequences such as this in Section 3.] EXERCISE <>: Do you have a lean mini-noodle dish? EXERCISE <>: D'you have a lean mini-noodle dish? _______________________________________________________________________ INTERMEDIATE: EXERCISE <>: How'd your operation go? Don't talk to me about it; I'd like to strangle the butchers. EXERCISE <>: He's a physicist 'n works at NASA. EXERCISE <>: And Ballaga seems determined to stay the environmental course. EXERCISE <>: plenty of room to flex environmental muscles [We had trouble on the accents around "environmental" too, so don't agonize over their type.] EXERCISE <>: Ever since the roof of a 19-year old Aloha Airlines Boeing 737 peeled off over Hawaii last April, sweeping a flight attendant to her death, attention has been focused on the older aircraft. [Don't agonize too much over the "Ever since the roof" part. We found the tonal analysis really hard there too.] _______________________________________________________________________ DIFFICULT: EXERCISE <>: And I had registered for Spanish, simply because I'd taken it for five years in high school. EXERCISE <>: I'll buy it back from you for like two million, because ya done ran it into the ground, you're having problems, like you... you're not gonna make it, and you go bankrupt, so I'm gonna buy it back from you for like, for next to nothing. EXERCISE <>: 'coz he was like, a millionaire [Don't worry about the type of the boundary (if there is one) after "like".] EXERCISE <>: Hewlett-Packard has announced it's buying Massachusetts-based Apollo computer. 2.11. The point of highest f0 The only other label on the tone tier which we have not discussed is the transcription of HiF0, the point of highest fundamental frequency associated with a pitch accent within an intermediate phrase. 
HiF0 is used currently as a rough measure of the phrase's pitch range. It is transcribed only for intermediate phrases in which there is an accent with a H component -- i.e. a H*, L+H*, L*+H, or H+!H* accent. Thus, for example, in <> (given above) one should not transcribe HiF0. The summary statement in the Annotation Conventions (Appendix A) offers the following advice about HiF0: Transcribers should take reasonable care to choose a point in time that reflects the target of the H for the accent. In several cases this will mean choosing some point other than the actual f0 maximum. For example, sometimes the highest f0 value in an accented syllable reflects the `intrinsic' effect of a voiceless consonant and will thus be a poor estimate of the speaker's choice of pitch range. More seriously, in a phrase where the highest accent-related f0 occurs in a H* H- H% sequence, choosing the absolutely highest value for HiF0 will artifactually inflate the pitch range estimate by the amount of the upstep on the H%. In such cases, we recommend that the syllable's amplitude contour be used to pinpoint HiF0 within the candidate region. For an example, see <> from PRACTICE TWO. HiF0 would be at the amplitude peak for "good" at about time 2.61. 3. More on the break index tier 3.1. The break index tier relative to other tiers The other core part of the prosodic transcription proper is the break index tier. If we think of the tone tier as a marking of the speech signal mediated primarily by our interpretation of the analysis of the f0 contour, the analogous way to think of the break index tier is as a marking of the speech signal as mediated primarily by the rhythmic and segmental analysis implicit in the orthographic tier. The summary statement of ToBI conventions describes this relationship as follows: Break indices represent a rating for the degree of juncture perceived between each pair of words and between the final word and the silence at the end of the utterance. 
They are to be marked after all words that have been transcribed in the orthographic tier. All junctures -- including those after fragments and filled pauses -- must be assigned an explicit break index value; there is no default juncture type. Thus, the events on the break index tier are labels of the utterance's prosodic grouping -- that is, each label denotes a boundary of some kind of constituent which ends at the word that the transcriber has marked on the orthographic tier. The convention for placing break index marks in a waves(tm) label file is that the number should be associated with a point in time at the end of the marked word as indicated by the label in the orthographic tier. It should be located exactly at, or slightly to the right of, this word marker, so that break indices can be unambiguously associated with other tiers. There are 5 break indices, numbered 0 through 4, roughly in order of lesser to greater degree of perceived separation between the marked word and following material. The break indices are meant to be a label of the SUBJECTIVE strength of the boundary. However, this does not mean that there are no objective criteria for marking the boundaries, or that the five labels form a uniform five-point scale. For example, the lowest-level break index (0) is defined in terms of connected speech processes, such as the flapping of word-final /t/ and /d/ before a following vowel-initial word in many American and Australian dialects, processes that prosodically group words together into `clitic groups' -- larger compound-word-like constituents above the level of the word (see Section 3.2). At the other end of the scale, the two highest break indices (3 and 4) are defined in relationship to the prosodic constituents (intermediate phrases and intonation phrases) that are assumed by the marking of phrase accents and boundary tones on the tone tier (see Section 3.3).
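The dependence of break indices 3 and 4 on the tone tier lends itself to a mechanical consistency check. The following Python sketch is an illustration only and is not part of the ToBI conventions. It assumes a hypothetical in-memory representation in which each word is paired with its break index label and with the tone labels aligned to it; the function name check_tiers and the tuple layout are assumptions made for this sketch, not a waves(tm) file format. It reports words that lack a numeric break index, a 3 that has no phrase accent, and a 4 that has no boundary tone. Labels carrying the `-' uncertainty diacritic (see Section 3.6) are skipped.

```python
import re

# Phrase accents (L-, H-, !H-) and boundary tones (L%, H%) as written on the tone tier.
# The boundary-tone pattern needs an L or H before the %, so it does not match %r.
PHRASE_ACCENT = re.compile(r"!?[LH]-")
BOUNDARY_TONE = re.compile(r"[LH]%")


def check_tiers(rows):
    """Return a list of (position, word, message) for suspicious words.

    rows is a hypothetical list of (word, break_label, tone_labels) tuples,
    where tone_labels is a string such as "H* L-L%" or "" if no tone is aligned.
    """
    problems = []
    for position, (word, brk, tones) in enumerate(rows):
        # Every juncture needs an explicit break index; anything that does not
        # start with 0-4 (including labels such as X) is reported for review.
        if not brk or brk[0] not in "01234":
            problems.append((position, word, "missing or non-numeric break index"))
            continue
        if brk.endswith("-"):
            continue  # transcriber flagged uncertainty; skip the tonal check
        level = int(brk[0])
        if level == 3 and not PHRASE_ACCENT.search(tones):
            problems.append((position, word, "3 without a phrase accent"))
        if level == 4 and not BOUNDARY_TONE.search(tones):
            problems.append((position, word, "4 without a boundary tone"))
    return problems


if __name__ == "__main__":
    # Toy data; the word-to-tone alignment here is invented for illustration.
    rows = [("Anna", "1", "H*"), ("may", "1", ""), ("know", "1", ""),
            ("my", "1", ""), ("name", "4", "L-H%")]
    print(check_tiers(rows))
```

A check of this kind only flags candidates for a second look. It cannot decide whether the tone tier or the break index tier is the one that needs correcting.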
Mainstream phonological theory might lead us to expect that these intonational constituents and the lower-level clitic group constituents will form a strict hierarchy (see Selkirk, 1980; Nespor & Vogel, 1986). The numerical scale of break index values reflects a mild bias in favor of such strictly hierarchical models (see the discussion in Price et al. 1991). Rather than building the expectation rigidly into its transcriptions, however, ToBI provides two regular mechanisms for denoting mismatches between different cues to subjective boundary strength. First, break index 2 denotes a mismatch between the constituency prescribed by the tonal transcription and the sense of disjuncture due to pauses and pause-like phenomena (see Section 3.4). Second, there is a diacritic `p' that can be appended to break indices 1, 2, and 3 to convey some sort of prosodic disfluency -- for example, an abrupt cutoff after a false start or a perceptible prolongation or pause which sounds as if the speaker were hesitating while searching for the next word (see Section 3.5). These two provisions should allow transcribers to avoid the circularity of basing a theory about the nature of the prosodic hierarchy upon the transcription of databases that might be used to explore such issues as the relationship between intonational constituents and pause (see, e.g., Woodbury, 1993, who proposed that pauses can be placed independently of intonational boundaries when the discourse structure requires the indication of competing segmentation strategies for topic structure versus rhetorical structure). 3.2. Break indices 0 and 1 Except in more deliberate speech styles, such as the information-packed style of radio news announcers, the break index value that will be encountered most frequently is probably 1. The ToBI conventions define break index 1 negatively, as the label to be used for "most phrase-medial word boundaries", as contrasted with the marked phrase-medial cases transcribed by break index 0. 
Break index 0, conversely, is defined with positive criteria as the value "for cases of clear phonetic marks of clitic groups; e.g. the medial affricate in contractions of `did you' or a flap as in `got it'." Since the other break indices are also defined by positive criteria (markings on the tone tier -- see Sections 3.3 and 3.4), we can think of break index 1 as the `default' (although, of course, there is no real default index in the sense of having a value that need not be marked because it is understood). We have already seen many examples of break index 0 in previous example utterances. For example, in example <> in Section 2.10 above, there are three cases of 0 break index: the flapped /t/ on the two instances of the word "to" after "trying" and "you" and the palatalization of the /t/ at the juncture between "get" and "you" all are examples of connected-speech processes that we take as criteria for break level 0. EXAMPLE <>: I'm simply trying to get you to understand. 1 3 0 3 0 0 3 4 Example utterance <> illustrates yet another such connected-speech process: the apparent deletion of the vowel in "of" after "kinds", to make a phonotactically impermissible /zv/ word-final cluster. EXAMPLE <>: What kinds of planes... 1 0 1 4 Note that in some cases the phenomena denoted by break index 0 are so frequently encountered in particular types of sequences, that orthographic conventions have developed for marking them. For example, the flapping of the /t/ and consequent cliticization of the word "to" onto a preceding auxiliary verb "got" is sometimes indicated in writing by "gotta". Or the deletion of the initial /h/ and vowel of "have" in sequences such as "would have" can be indicated by spelling it "would've". In such cases, the transcriber has the alternative of marking the prosodic grouping by the choice of label on the orthographic tier instead.
For example, by labelling the word as "gotta" rather than "got to" the transcriber has eliminated the word boundary where a 0 label might be placed on the break index tier. 3.3. Break indices 3 and 4 Break indices 0 and 1 form a natural progression with indices 3 and 4. These two break index strengths are equated with the intonational categories of intermediate (intonation) phrase and (full) intonation phrase. Thus, whenever the tonal analysis indicates a L- or H- phrase accent, the transcriber should decide where the end of the intermediate phrase marked by this tone label is and place a 3 on the break index tier to align with the orthographic label for the last word in the intermediate phrase. Similarly, whenever the tonal analysis indicates a L% or H% boundary tone, the transcriber should place a 4 on the break index tier at the end of the last word in the intonation phrase. In actuality, the ordering of these two analyses is sometimes reversed. This is particularly the case with the L% boundary tone; the transcriber might be convinced of the percept of a 4 versus a 3 level boundary before deciding that there must be a L-L% or H-L% sequence as opposed to merely a L- or H- to be marked on the tone tier. Recall from the discussion in Section 2.3 that there may be little or no difference in f0 values between the end of a mere L- and a L-L% sequence or between a mere H- and a H-L% sequence; a L% following a L- is in the bottom of the speaker's pitch range just as a L- is, whereas a L% following a H- is upstepped to the same level as the preceding phrase accent. In such cases, the analysis is necessarily more subjective; the transcriber must rely on the percept of degree of disjuncture with less help from the f0 contour. Some pertinent examples from earlier sections are repeated here. EXAMPLE <>: Anna may know my name, and yours too. Anna may know our names?
H* L-H% H* H* L-L% L* H-H% 1 1 1 1 4 1 1 4 1 1 1 1 4 EXAMPLE <>: Definitely the shortest and probably the pleasantest H* L- H* L-L% H* L- H* 1 3 1 4 1 3 1 way to go is through the park. L- L+H* L-L% 0 1 3 1 1 1 4 EXAMPLE <>: 1) Let's see I need oregano 'n marjoram 'n some H* H* L-L% L* H- L* H- 1 4 1 1 3 0 3 0 1 fresh basil okay? L+H* !H* L- H* H-H% 1 3 4 2) Oh I don't know it's got oregano 'n marjoram H* !H* !H* L-L% H* H- H* H- 1 1 1 4 1 1 3 0 3 'n some fresh basil. H* H-L% 0 1 1 4 EXAMPLE <>: Oh don't nuzzle me you marmalade-nose. X*? L- H* !H* L- L* L-H% 3 1 1 3 1 1 4 When using waves(tm) label files, a 3 or 4 break index label and the corresponding phrase accent or boundary tone are placed together at the orthographic label, with the break index label coming last if the labels on the three tiers cannot be absolutely synchronized. *********************************************** PRACTICE FIVE: break indices 0, 1, 3, and 4 *********************************************** You have already transcribed the tones on the following. Now transcribe the break indices. _______________________________________________________________________ EASY: EXERCISE <>: Does Manitowoc have a bowling alley? [See PRACTICE TWO for tones.] EXERCISE <>: How'd your operation go? Don't talk to me about it; I'd like to strangle the butchers. [See PRACTICE FOUR for tones.] EXERCISE <>: I was wrong, and Stalin was right. I was wrong. [See PRACTICE TWO for tones.] EXERCISE <>: Oh nothing special, you know flour and butter and sugar. [See PRACTICE TWO for tones. Transcribe just the second part, after the "you know".] EXERCISE <>: That's what I thought. [See PRACTICE ONE for tones.] _______________________________________________________________________ INTERMEDIATE: EXERCISE <>: You know what I mean? [See PRACTICE ONE for tones.] EXERCISE <>: We have a lean mini-noodle with beans. Well, we have a lean mini-noodle dish. [See PRACTICE THREE for tones.] 
EXERCISE <>: Mostly they just sat around and knocked stuff. You know, the school, other people. [See PRACTICE TWO for tones.] _______________________________________________________________________ DIFFICULT: EXERCISE <>: If he can then there's no argument about it. (two productions) [See PRACTICE THREE for tones.] EXERCISE <>: State law now requires public construction projects to set aside 1% of their budgets for artwork. [See PRACTICE FOUR for tones.] EXERCISE <>: But anyway, if you can't see that then I don't know if I can explain it to you. [See PRACTICE ONE for tones.] 3.4. Break index 2 As noted in the previous section, each 3 on the break index tier must correspond to the marking of a phrase accent for the intermediate phrase on the tone tier, and each 4 must correspond to the marking of a boundary tone. The implication is that any other interword juncture will be something that can be transcribed on the break index tier with either a 0 or a 1. However, the subjective impression of boundary strength does not always allow such a neat correspondence. In the course of developing the ToBI transcription system, we encountered several utterances in which we felt a strong sense of disjuncture at a boundary between two words where the pitch pattern showed no evidence of the necessary tonal events for either of these two levels of intonational constituency. We also encountered the converse case: utterances in which the pitch pattern at a boundary between two words clearly indicated an intermediate or intonation phrase boundary with none of the preboundary lengthening or other cues that support the subjective sense of a strong disjuncture. Break index 2 was devised to mark cases of these two types of `mismatch' between the subjective boundary strength and the intonational constituency. These two types are described in the ToBI Annotation Conventions as follows: a strong disjuncture marked by a pause or virtual pause, but with no tonal marks; i.e. 
a well-formed tune continues across the juncture. OR a disjuncture that is weaker than expected at what is tonally a clear intermediate or full intonation phrase boundary. Example utterance <> illustrates the first type of mismatch, and example utterance <> illustrates the second. In <>, the smooth sequence of apparent downstepped peak accents with no clear intervening phrase accent suggests that the words "six", "southern", "iraqi", and "cities" all belong to the same intermediate phrase, yet there is an intonation phrase sized pausing between each adjacent pair of these words. In <>, the clear tonal markings for at least an intermediate phrase boundary are unaccompanied by any clear preboundary lengthening, making some transcribers uncomfortable in labelling this juncture with a 3. EXAMPLE <>: The Pentagon reports fighting in six southern L+H* L- L* H-H% H* !H* 1 1 3 4 2 2 2 iraqi cities. !H* X*? L-L% 2 4 EXAMPLE <>: uh Quincy. Could I have the number to uh H* L- H* !H* L-L% 4 2 1 1 1 1 1 1 4 Shore Cab. *? H* H-L% 1 4 Break index 2 was devised for cases where the mismatch between the tonal marking and the disjuncture is not accompanied by any sense of hesitancy or disfluency. When 2 is used in the first way (to indicate a stronger sense of disjuncture than 1 even while producing a coherent contour for an uninterrupted intermediate phrase), it can have the rhetorical effect of careful deliberation, as in the <> example. In the opposite case (when 2 is used to mark intermediate phrase boundaries which do not have a very strong sense of disjuncture) the speaker may be speaking quickly to hold the floor or to convey a sense of urgency, while using the tonal marks necessary to convey attentional focus on several closely placed words. 
We suspect that both types of 2 will be explained ultimately by a better understanding of the complexities of discourse structure, an understanding that can best be achieved by the transcription and analysis of many occurrences in natural dialogue. 3.5. The p diacritic (and the %r tone label) [Christine Nakatani and Elizabeth Shriberg contributed greatly to the preparation of this and the following sections.] There are other cases of mismatch between tone tier and segmental rhythm, however, where break index 2 does not seem to be appropriate. For example, in utterance <>, the pauses after "Baltimore", "which", and "leave" do not have the feel of a speaker striving for an effect of judicious deliberation, as in the "six southern Iraqi cities" phrase of the <> example, but rather sound disfluent, as if the speaker were hesitating as he searches for the next word. Such cases can be distinguished from fluent cases of 2 by the use of the p diacritic. EXAMPLE <>: Display all the flights from Baltimore to Dallas 1 1 1 1 3- 2p 0 4 which leave after 4:00 p.m. 2p 2p 3p 2p 4 The p diacritic is used in conjunction with a break index 1, 2, or 3, to indicate a disfluency in the timing or separation of words across a break. The notation `p' was chosen initially to denote the prolongation of the hesitation pause with break indices 2 and 3, but we have since extended the diacritic's usage to cover also abrupt cutoffs before restarts and repairs, which are often but not necessarily separated from the disfluent stop by a pause. In this case, the appropriate break index is 1. Thus the inventory of combinations of break index and p diacritic is: 1p -- an abrupt cutoff before an actual repair, or as if stopping to permit a repair or restart of some kind 2p -- a hesitation pause or prolongation of segmental material where there is no phrase accent perceived in the intonation contour 3p -- a hesitation pause or a pause-like prolongation where there is a phrase accent in the tone tier. 
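The inventory above amounts to a small decision rule, which the following Python sketch spells out. It is an illustration under stated assumptions rather than a prescribed procedure: the function name disfluent_break_label and the two category names "cutoff" and "hesitation" are invented here, and the judgment of which category applies, and of whether a phrase accent is present, remains the transcriber's.

```python
def disfluent_break_label(kind, has_phrase_accent=False):
    """Map a transcriber's judgment about a disfluent juncture to 1p, 2p, or 3p.

    kind is a hypothetical category name: "cutoff" for an abrupt cutoff before
    a repair or restart, "hesitation" for a hesitation pause or prolongation.
    """
    if kind == "cutoff":
        return "1p"
    if kind == "hesitation":
        # The phrase accent on the tone tier separates 3p from 2p.
        return "3p" if has_phrase_accent else "2p"
    raise ValueError("kind must be 'cutoff' or 'hesitation'")


if __name__ == "__main__":
    print(disfluent_break_label("hesitation", has_phrase_accent=True))
```

The sketch deliberately has no branch that returns a 4 with the diacritic, since the inventory stops at 3p.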
The p diacritic is not used with break index 4, because it is difficult to reliably identify hesitations between two full intonational phrases. Example utterances <> and <> illustrate the use of the diacritic with break indices 1 and 3. Example <> also had an occurrence of 3p. Note the presence of the phrase accent distinguishing this interword juncture from the surrounding cases of 2p. EXAMPLE <>: um But I had I mean the stuff he knows is kind of 0 0 1p 3 1 4 1 1 1 4 1 1 1 amazing 'coz he does a lot of um environmental 3 1p 1 1 X 0 1 4 1 impact stuff 2p 4 EXAMPLE <>: I want to see the cheapest flight from Atlanta 1 1 1 1 3p 1 1 1 3 to Baltimore 1 4 In general the p diacritic should be used conservatively, and should not become a substitute for 2. A good test for appropriateness is to imagine whether the break would have been the same if the speaker were asked to repeat the utterance with the same intonation, but more `fluently'. If the break were the same upon repetition, it should probably not get the p. Note also that the prolongation of segmental material for a 2p label can physically occur at the beginning of a word rather than at the end, as in example <>, where the hesitation lengthens the [l] of "least" rather than the vowel of "the". EXAMPLE <>: Between Boston and Denver I'd like to a flight that 3 1 1 4 1 1 3p 1 1 1 takes the least amount of stops to get to Boston 3p 2p 1 0 1 4 1 1 0 4 Closely associated with these definitions of 1p, 2p, and 3p in the break index tier is the tone tier label %r, for restarting with a brand new intonation contour when the last contour was interrupted by some disfluency without being finished. This is most common at a `repair', where the speaker abruptly stops and begins again with the intended or `repaired' material, as in example utterance <>, already cited above, and in example <>, below.
EXAMPLE <>: um But I had I mean the stuff he knows 0 0 1p 3 1 4 1 1 1 4 H* H* L- H* !H* L-L% H* !H* L-L% is kind of amazing 'coz he does a lot of um 1 1 1 3 1p 1 1 X 0 1 4 L+H* L- %r H* !H* L-L% environmental impact stuff 1 2p 4 H* H* H-L% EXAMPLE <>: What are the plane sizes for these flights and 1 1 1 1 4 1 1 4 1 H* L* H-H% H* H* H-L% do they ha(ve)- do are there any other flights 2p 1 1p 1p 1 1 1 1 1 %r H* !H* that have s- connections 1 1 1p 4 %r H* L-L% As with the use of the p diacritic, one should be conservative in using %r. It is needed only if there is good evidence that a new intonational phrase has begun after a disfluent pause, evidence such as a notable change in f0 range or amplitude. It should not be used in cases such as the "had" after the first 1p in <>, which continues with a fluent H* accent in the same pitch range (unlike the H* !H* on "he does" after the second 1p in this utterance, which is in a new pitch range). Nor should %r be used in example utterance <>, where after the speaker stumbles and pauses momentarily around the end of "what is the", the intonation on "abbreviation" continues as if there had been no interruption. EXAMPLE <>: What is the b- abbreviation n under 0 1 1 1p 3- 3 3p H* H- L+H* L- H* !H- the category d c mean 1 1 1 1 4 H* H* H* L-L% In particular, %r should not be used after a 3p, where the (re)start of a new intonation contour is already implicit in the break index for the intermediate phrase. 3.6. Ordinary uncertainty. In addition to these two well-defined types of `uncertainty' due either to conflicting evidence about boundary strength (break index 2) or to the interruption of fluent prosodic production at repairs and hesitations (the `p' diacritic), there will be cases of ordinary garden-variety uncertainty for other reasons.
For example, (as we have already discussed above in Section 2.3) the f0 contour for an utterance-medial intonation phrase that ends with a L% boundary tone is often difficult to distinguish from a mere intermediate phrase. In such cases, where the transcriber cannot decide from other cues whether the tonal analysis should be L- versus L-L% (or H- versus H-L%), the break index marking is also necessarily ambiguous. The ToBI conventions prescribe that in such cases of transcriber uncertainty, the higher-level boundary should be chosen, and uncertainty marked by appending the `-' diacritic. Thus, in <> given above in 2.3, if no decision can be made between L- and L-L%, the correct break index marking is `4-'. The same convention applies at lower levels of the hierarchy. For example, if the transcriber thinks that a word-final /d/ has been pronounced as a flap, joining the word it ends into a close prosodic unit with the following word, but is not certain that it is a flap and not just a rather short [d], then the correct break index marking is `1-'. A similar case involving /t/ is given in example utterance <>. Here it is not clear whether the /t/ at the end of the word "democrat" has been flapped, or not released. EXAMPLE <>: The chairman, Wendell Ford, democrat of Kentucky... L+H* L- L+H* !H* L- H* L+H* L-L% 1 3 1 3 1- 1 4 Examples <>, <>, and <> illustrate cases where tonal sequences evident in the pitch contour might seem compatible with several alternative analyses, some with and some without a medial intermediate phrase break. When such utterances are transcribed outside of their larger discourses, these contours might be highly ambiguous. EXAMPLE <>: A really rewarding day. L+H* L- H* L-L% EXAMPLE <>: We have a lean mini-noodle dish. L+H* L- L+H* L-L% (compare <> given above in PRACTICE THREE) The minus symbol associated with uncertainty in break index value cannot be used in conjunction with the p diacritic. 
Uncertainty about whether or not to use the p should be conveyed by using `p?'. 4. The miscellaneous tier (and other aspects of the marking of disfluencies) 4.1. The miscellaneous tier defined The miscellaneous tier is in essence a `comment' tier for the optional marking of events of any kind other than the standard words, tones, and disjunctures marked on the orthographic tier, the tone tier, and the break index tier. Many of the events labelled on the miscellaneous tier are things that span longish intervals. In this, miscellaneous events are like the word events labelled on the orthographic tier. However, the two types of events are very different, in that a strict succession of miscellaneous events is not essential to speech, whereas speech must be a succession of produced words (or pieces of words). Therefore, whereas the ToBI convention is to mark each event on the orthographic tier only at the end of the interval that the event spans, it prescribes that an event on the miscellaneous tier should in general be marked for both its end and its beginning, using the diacritics `>' and `<', respectively. Thus labels on the miscellaneous tier usually come in pairs, such as: breath< breath> laugh< laugh> cough< cough> Example <> in Section 1.1 illustrated the use of the miscellaneous tier to mark the cough that interrupts the utterance. EXAMPLE <>: Will you have marmalade ... L* L* 1 1 1 1p cough< cough> Another similar example is the laughter that interrupts the pitch contour in utterance <>. EXAMPLE <>: To me it this seems very obvious; to make it on 1 3 1 1 1 1 4 1 1 1 3 H* L- L+H* L-L% H* !H* L- laugh< >laugh laugh< to make it by hand is much more fun than to make it on 1 1 1 1 3 1 1 1 1 1 1 1 0 1 H* L+H* L- H* >laugh a computer. 1 4 L-L% Since such markings are useful for parsing the disruption of otherwise tonally well-formed intonation contours, we can think of them as a source of `disfluency'.
Indeed, the ToBI Annotation Conventions encourage the marking of disfluencies, and suggest the use of `disfl<...disfl>' (or `disfl') as a general flag for them: In general, it is the assumption of the participants in the common transcription group that silences should be automatically detectable, at least to a first approximation, and that transcriber time should not be spent marking these by hand. Disfluencies, by contrast, are not automatically detectable, and the absence of markings for them makes it difficult to parse the tone and break index tiers. For these reasons, transcribers are urged to mark disfluencies on the miscellaneous tier using `disfl<' and `disfl>' (or `disfl' if the disfluency is extremely localized), and to provide these marks in the miscellaneous tier menu when using waves(tm). However, it is often easier to determine that something is disfluent in some region than it is to determine exactly where the disfluency begins and ends. For this reason, the ToBI Annotation Conventions specify that the marks can be used more like a disfluency flag than like the demarcation of a precise region: ...the marks `disfl<' and `disfl>' (or simple `disfl') should be interpreted as rough pointers to the disfluent region and transcribers should not agonize over placing them precisely. Note that here the ToBI Annotation Conventions explicitly mention the use of a single mark, rather than a pair of marks for the beginning and end of a region. However, they specifically recommend this usage only for disfluencies, to encourage the marking of something that is typically very difficult to locate precisely in time. Transcribers should be careful about using a single (unpaired) label on the misc tier for anything other than marking the general location of a perceived disfluency, since in any other circumstance, the usual interpretation must be that the event is so localized that its beginning is virtually the same point as its end.
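The pairing convention for the miscellaneous tier can be checked with a few lines of code. The Python sketch below is a rough illustration, not part of the Annotation Conventions. It assumes the misc tier has already been read into a plain list of label strings in time order, with the opening mark written as `name<' and the closing mark as `name>'; the function name unmatched_misc_labels is invented for this sketch. It returns labels that look unpaired, treating a bare `disfl' as acceptable and returning any other single label so that the transcriber can confirm it was intended.

```python
def unmatched_misc_labels(labels):
    """Return misc-tier labels that appear to lack a partner.

    labels is a hypothetical list of strings in time order, e.g.
    ["cough<", "cough>"]. A bare "disfl" is accepted as a rough pointer.
    """
    open_events = []   # names of events that have begun but not yet ended
    suspicious = []
    for label in labels:
        if label.endswith("<"):
            open_events.append(label[:-1])
        elif label.endswith(">"):
            name = label[:-1]
            if name in open_events:
                open_events.remove(name)
            else:
                suspicious.append(label)   # an end mark with no beginning
        elif label != "disfl":
            suspicious.append(label)       # single label other than disfl
    # Anything still open never received its end mark.
    suspicious.extend(name + "<" for name in open_events)
    return suspicious


if __name__ == "__main__":
    print(unmatched_misc_labels(["laugh<", "laugh>", "laugh<"]))
```

Labels returned by such a check are not necessarily errors, since a single label is permitted for a very localized event. They are simply the places where the single-versus-paired choice deserves a second look.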
Example utterance <> is an example of a disfluency marked in this way. EXAMPLE <>: show me the cheapest fare from Da- from 1 1 1 1 4 1 1p 1 H* L+H* !H* L-L% %r disfl< disfl> Philadelphia to Dallas excluding restriction 3 1 4 4 4 L+H* !H* L- H* L-L% L+H* L-L% H* L-L% v u slash one 1 3 1 4 H* !H* L- H* H* L-L% Although the miscellaneous tier is a general-purpose `comment' tier, we recommend that when transcribers at a particular site find themselves often adding comments that fit some particular pattern other than these, they consider defining another extra tier for that purpose. Christine Nakatani and Elizabeth Shriberg, both of whom have worked extensively on disfluencies in naturally spoken utterances, differentiate more finely, and suggest guidelines for other transcribers who wish to differentiate types of disfluencies in the same way. The following section is adapted from their suggested guidelines, and uses many of their examples. 4.2. Suggested guidelines for marking disfluencies Nakatani and Shriberg have identified several different types of events that they feel should count as disfluencies in ToBI. Not all of these need be marked on the misc tier in order to be recovered. In particular, mere hesitation pauses can be recovered from the use of the 2p or 3p marks on the break index tier and (in the case of many filled pauses) from the transcription of the filler material on the orthographic tier. Phenomena that might be flagged as disfluencies on the misc tier include stumbling over a word, or abruptly cutting off a word or phrase in midstream to make a fragment, as in <> cited above, or <> below. These are examples of the first of the major classes of disfluency which Nakatani and Shriberg identify, including what they call `phonetic error'. EXAMPLE <>: show ground transpor- ground transportation 1 1 1p 1 4 disfl:repair< disfl:repair> at atlanta 2p 4 The second of the three major classes is the hesitation pause.
This includes both silent pauses as in the examples transcribed with 2p above, and filled pauses -- that is hesitation intervals during which the speaker holds the floor by producing hesitation noises or other material, as in <>. EXAMPLE <>: The weight on a six on a seven sixty seven is 1 0 1 1 2p 1 1 2p 1 3- 2p three thousand uh three hundred and twelve 1 2p 4 1 1 1 1 thousand pounds uh is that including passengers 2p 2p 4 1 1 3 4 Nakatani and Shriberg recommend that the spelling of hesitation noises be standardized so that later users of a ToBI transcribed database need search only for a limited set of `words' in recovering the disfluency. In particular, for standard American English, they recommend the use of only "um", "uh", or "mm". That is, transcribers should not invent other spellings such as "ah" or "uhhhh" to reflect differences in the quality of the reduced vowel or the duration of the syllable. With this stipulation, filled pauses of this sort would not need to be flagged on the misc tier, since they would be recoverable from the orthographic tier. A filled pause may be perceived as unaccented, and yet as constituting its own intermediate or intonational phrase. Normally each intonation phrase is required to have at least one pitch accent. In the case of filled pauses this criterion is relaxed; an unaccented filled pause in its own phrase can be labelled with the phrase accent (chosen from the full inventory) without the requirement that a pitch accent be marked on the filled pause. The last major class of disfluency is the class of repairs and fresh starts, which Nakatani and Shriberg define as "lexical self-corrections of parts of sentences and whole sentences, respectively". They give us utterance <> as an example of a repair, and <> as an example of a fresh start. (Here we have used the misc tier to mark these interpretations of the disfluencies.) These two examples also illustrate abrupt cutoffs resulting in word fragments. 
EXAMPLE <>:
    show me the cheapest fare from Da- from
    1 1 1 1 4 1 1p 1
    repair< repair>

    Philadelphia to Dallas excluding restriction
    3 1 4 4 4

    v u slash one
    1 3 1 4

EXAMPLE <>:
    What are the plane sizes for these flights and
    1 1 1 1 4 1 1 4 1

    do they ha(ve)- do are there any other flights
    2p 1 1p 1p 1 1 1 1 1
    restart< restart>

    that have s- connections
    1 1 1p 4

More detailed suggestions about how to flag repairs can be obtained by writing directly to Christine Nakatani (chn@das.harvard.edu) or Elizabeth Shriberg (ees@speech.sri.com).

***********************************************************
PRACTICE SIX: break index 2, the p diacritic, disfluencies
***********************************************************

Transcribe these exercises using the exercises script.

_______________________________________________________________________
EASY:

EXERCISE <>: Uh and then I go under a footbridge and into the park.

EXERCISE <>: A lot of people have done this; they sell their business, and they have... If something goes wrong, and they have the first rights to buy it back.
[Repeated exercise from PRACTICE THREE. Transcribe the phrase "and they have,..."]

EXERCISE <>: I know we've gotta do it but I don't know how to do it.
[Repeated exercise from PRACTICE FOUR.]

_______________________________________________________________________
INTERMEDIATE:

EXERCISE <>: Because I I mean, to make a map on computer is not n- nearly as much fun.

EXERCISE <>: The advisor to f- fill out my schedule for the first semester said "Why don't you take Introduction... Intro... Introductory Linguistics."

EXERCISE <>: There's a spoon in here.
[Compare <> in PRACTICE TWO.]

EXERCISE <>: The author of more than eight hundred state supreme court opinions (Hennessy is widely respected for his legal scholarship and his administrative abilities.)
[This is the first part of <> in Section 2.10.]

EXERCISE <>: Usually not, no. Nah. Usually they won't give you chances.
_______________________________________________________________________ DIFFICULT: EXAMPLE <>: My learning experiences are on the job, so when I screw something up instead of s- spending all this money to go to college... When I screw up a job, that's my tuition for college. That's exactly, exactly how it works, there's no difference at all. EXERCISE <>: And what happens is: when you... when you buy my business, and you try to run my business, it's really hard for you to run my business. So a lotta times they fail. [Repeated exercise from PRACTICE THREE. You've transcribed most of the tones already. Now you're ready to worry about the break indices, particularly those around the first "when you..." and "when you try to run my business".] EXERCISE <>: Half the job is accomplished by just starting it. [Interviewer: Mm-hmm] So just start doing it, and you'll figure it out. [Interviewer: Yeah] You know what I mean? 5. References Beckman, Mary E., and Janet B. Pierrehumbert. (1986) Intonational Structure in Japanese and English, Phonology Yearbook 3, 255-309. Bolinger, D. (1972) Accent is predictable (if you're a mind reader). Language 48, 633-644. Brown G., K. Currie, and J. Kenworthy (1980) Questions of Intonation. Croom Helm. Campbell, W. (1992) Prosodic encoding of English speech. In Proceedings of the 1992 International Conference on Spoken Language Processing, Banff, Canada, 663-666. Grosz, B., and J. Hirschberg (1992) Some intonational characteristics of discourse structure. In Proceedings of the 1992 International Conference on Spoken Language Processing, Banff, Canada, 429-432. Hirschberg, J. (1993) Pitch accent in context: Predicting intonational prominence from text. Artificial Intelligence, 63(1-2). Hirschberg, J., and J. Pierrehumbert (1986) The intonational structuring of discourse. In Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, New York, NY, 136-144. Hirschberg, J., and G. 
Ward (1992) The influence of pitch range, duration, amplitude and spectral features on the interpretation of the rise-fall-rise intonation contour in English. Journal of Phonetics, Vol 20, Number 2, 241-251.

Ladd, D. R. (1980) The Structure of Intonational Meaning: Evidence from English. Indiana University Press.

Ladd, D. R. (1983) Phonological features of intonational peaks. Language 59, 721-759.

Nakajima, S., and J. Allen (1992) Prosody as a cue for discourse structure. In Proceedings of the 1992 International Conference on Spoken Language Processing, Banff, Canada, 425-428.

Nespor, M. and I. Vogel (1986) Prosodic Phonology. Foris Publications.

Pierrehumbert, J., and J. Hirschberg (1990) The meaning of intonation contours in the interpretation of discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Plans and Intentions in Communication and Discourse (SDF Benchmark Series in Computational Linguistics), 271-311. MIT Press.

Pierrehumbert, Janet B. and Shirley Steele. (1987) How Many Rise-Fall-Rise Contours? In Proceedings of the 11th International Congress of Phonetic Sciences 3, 145-147. Estonian Academy of Sciences, Tallinn.

Price, P., Ostendorf, M., Shattuck-Hufnagel, S., and Fong, C. (1991) The use of prosody in syntactic disambiguation. Journal of the Acoustical Society of America 90, 2956-2970.

Ross, K., M. Ostendorf, & S. Shattuck-Hufnagel (1992) Factors affecting pitch accent placement. In Proceedings of the 1992 Conference on Spoken Language Processing, Vol. 1, pp. 365-368.

Sag, I., and M. Liberman (1975) The intonational disambiguation of indirect speech acts. In Papers from the 11th Regional Meeting, Chicago Linguistics Society, 487-497.

Selkirk, E. (1981) On the nature of phonological representation. In T. Myers, J. Laver, and J. Anderson, eds., The Cognitive Representation of Speech, 379-388. North-Holland.

Silverman, K. E. A., E. Blaauw, J. Spitz, and J. F.
Pitrelli (1992) Towards using prosody in speech recognition/understanding systems: Differences between read and spontaneous speech. Proceedings of the Fifth DARPA Workshop on Speech and Natural Language, Harriman, NY, February. Silverman, K., and J. Pierrehumbert (1989) The timing of pre-nuclear high accents in English. In J. Kingston and M. E. Beckman, eds., Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. 72-106. Cambridge University Press. Ward, G., and J. Hirschberg (1985) Implicating uncertainty: The pragmatics of fall-rise intonation. Language 61, 747-776. Woodbury, A. (1993) Against intonational phrases in Central Alaskan Yupik Eskimo. Presented at Linguistics Society of America Annual Meeting, Los Angeles, CA. APPENDIX A The ToBI Annotation Conventions by Mary E. Beckman and Julia Hirschberg 1. Synopsis A ToBI transcription for an utterance consists minimally of a recording of the speech, an associated record of the fundamental frequency contour, and (the transcription proper) symbolic labels for events on the following four parallel tiers: 1. an orthographic tier 2. a tone tier 3. a break-index tier 4. a miscellaneous tier Conventions are specified for both simple text-based transcription using this system and for waves(tm) label files and formats to accompany a speech file and associated time-aligned analysis records for the utterance. We first summarize the conventions assuming a computer-based labelling system such as waves(tm) label files and formats. A final section (Section 9), provided by Jacques Terken and Mari Ostendorf, summarizes the added guidelines for adapting the conventions to simple text-based transcription. 2. The Orthographic Tier The orthographic tier will be used only for the transcription of orthographic words. In the waves(tm) label file, each word's orthographic form should be marked at the end of the final segment in the word, as determined by the labeller from the waveform or spectrogram record. 
That is, each orthographic word will be marked at its right `edge'. Individual transcribers will also determine whether and how to transcribe phenomena such as filled pauses (e.g., ``um'', ``uh'') and whether to use contractions (e.g., ``gotta'') or not. There are several existing orthographic conventions for transcribing such phenomena, which labellers may want to consult. For example, the ATIS corpus conventions specify ``er'', ``mm'', ``uh'', and ``um'' as the allowable transcriptions for filled pauses.

3. The Break Index Tier

Break indices represent a rating for the degree of juncture perceived between each pair of words and between the final word and the silence at the end of the utterance. They are to be marked after all words that have been transcribed in the orthographic tier. All junctures -- including those after fragments and filled pauses -- must be assigned an explicit break index value; there is no default juncture type.

3.1 Break Index Values

Values for the break index are chosen from the following set:

  0   for cases of clear phonetic marks of clitic groups; e.g. the medial affricate in contractions of `did you' or a flap as in `got it'.

  1   most phrase-medial word boundaries.

  2   a strong disjuncture marked by a pause or virtual pause, but with no tonal marks; i.e. a well-formed tune continues across the juncture.
      OR
      a disjuncture that is weaker than expected at what is tonally a clear intermediate or full intonation phrase boundary.

  3   intermediate intonation phrase boundary; i.e. marked by a single phrase tone affecting the region from the last pitch accent to the boundary.

  4   full intonation phrase boundary; i.e. marked by a final boundary tone after the last phrase tone.

For example, a typical fluent utterance of the following sentence:

    Did you want an example?

might have a `0' between `Did' and `you' indicating palatalization of the /d j/ sequence across the boundary between these words.
Similarly, the break index value between `want' and `an' might again be `0' indicating deletion of /t/ and subsequent flapping of /n/. The remaining break index values would probably be `1' between `you' and `want' and between `an' and `example', indicating the presence of a mere word boundary, and `4' at the end of the utterance, indicating the end of a well-formed intonation phrase.

In the waves(tm) break index label file, the number should be associated with a point in time at the end of each word, as indicated in the orthographic tier (Section 2). It should be located exactly at, or slightly to the right of, this word marker, so that break indices can be unambiguously associated with other tiers.

3.2 Uncertainty and Underspecification

Transcriber uncertainty about break-index strength is to be indicated with a minus (`-') affixed directly to the right of the break index (e.g. `1-' to indicate uncertainty between `0' and `1'; `2-' to indicate uncertainty between `2' and `1'; and so on).

The full ToBI transcription must include both break index values and tone values. However, to accommodate backward compatibility with previously labelled databases or to allow intermediate stages in the labelling process, a partial ToBI transcription may have only break index values or only tone values assigned. Underspecification of break index values may be indicated by a value of `X' at the word boundary in the break index tier.

3.3 Disfluencies

The perception of an audible hesitation (for example, an abrupt cutoff or a prolongation) can be marked by the diacritic `p' immediately to the right of the break index (e.g. `3p'). This diacritic should be applied only to break indices of 1, 2, or 3. We expect that `1p' will be used for abrupt cutoffs, and `2p' and `3p' will be used to indicate prolongation, with `3p' suggesting hesitation after the onset of the tonal marks for an intermediate phrase. (See also Section 5.)

4.
The Tone Tier Two types of tones are marked in the tonal tier: pitch events associated with intonational boundaries (phrasal tones) and pitch events associated with accented syllables (pitch accents). The basic tone levels are high (H) in the local pitch range versus low (L) in the local pitch range. 4.1 Phrasal Tones Phrasal tones will be assigned at every intermediate or intonation phrase: L- or H- phrase accent, which occurs at an intermediate phrase boundary (level 3 and above); note that this represents a return to the notation in Pierrehumbert (1980) L% or H% (final) boundary tone, which occurs at every full intonation phrase boundary (level 4) %H high initial boundary tone; marks a phrase that begins relatively high in the speaker's pitch range; the default initial boundary is in the middle of the range or lower, and will be left unmarked in the transcription. Transcribers should use %H only when a high pitch at the beginning of an utterance cannot be attributed to a H accent (H* or H+!H*) on the first or second syllable in the utterance (i.e., when the first word itself does not appear to be accented, or when its accented syllable occurs too far into the word to account for the initial H), and where the utterance contrasts with a possible rendition with a lower-pitched onset. Note that, since intonation phrases are composed of one or more intermediate phrases plus a boundary tone, full intonation phrase boundaries will have two final tones, e.g.: L- L% for a full intonation phrase with a L phrase accent ending its final intermediate phrase and a L% boundary tone falling to a point low in the speaker's range, as in the standard `declarative' contour of American English. L- H% for a full intonation phrase with a L phrase accent closing the last intermediate phrase, followed by a H boundary tone, as in `continuation rise'. 
H- H% for an intonation phrase with a final intermediate phrase ending in a H phrase accent and a subsequent H boundary tone, as in the canonical `yes-no question' contour. Note that the H- phrase accent causes `upstep' on the following boundary tone, so that the H% after a H- rises to a very high value. H- L% for an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker's range, producing a final level `plateau'. For convenience, labellers may prefer to mark the tones at a break index with value `4' in a single step, with H-H%, L-L%, H-L%, or L-H%. We recommend that ToBI label menus include these symbols in addition to the separate symbols for phrasal and boundary tones described above; two additional symbols for downstepped phrase accent/boundary tone combinations will be described below in Section 4.3. In the waves(tm) label file, the phrase accent and/or boundary tone associated with a phrase should be marked at a point at or just before the end of the last segment in the word ending the intermediate or full intonation phrase, and always before the related break-index mark; high initial boundary tones should be marked at the beginning of the phrase, where the H tone is observed and should always be located after the break-index marker for any preceding phrase. %r Will be used to mark the left edge of an intonation phrase which begins after a hesitation or disfluency. The `%r' notation is used to indicate a `contour restart' -- i.e. the initiation of a new intonational contour after a disruption. This diacritic should be used only in cases where the disfluency has caused a clear contour discontinuity. 4.2 Pitch Accents Pitch accent tones will be marked at every accented syllable. Lack of pitch accent assignment for a syllable will be interpreted as meaning that the syllable is NOT accented. The ToBI transcription allows for the following five types of pitch accents. 
(Transcribers labelling utterances in dialects other than standard American English, standard Australian English, or RP British English may need to add additional types. These should be described in a general introduction to the transcribed database.) H* `peak accent' -- an apparent tone target on the accented syllable which is in the upper part of the speaker's pitch range for the phrase. This includes tones in the middle of the pitch range, but precludes very low F0 targets. [Corresponds to H* and H*+L in Pierrehumbert's six-accent inventory.] L* `low accent' -- an apparent tone target on the accented syllable which is in the lowest part of the speaker's pitch range. L*+H `scooped accent' -- a low tone target on the accented syllable which is immediately followed by relatively sharp rise to a peak in the upper part of the speaker's pitch range. L+H* `rising peak accent' -- a high peak target on the accented syllable which is immediately preceded by relatively sharp rise from a valley in the lowest part of the speaker's pitch range. H+!H* a clear step down onto the accented syllable from a high pitch which itself cannot be accounted for by a H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase; should only be used when the preceding material is clearly high-pitched and unaccented. (Otherwise the accent is a simple !H*.) In a waves(tm) label file, the pitch accent tone label should be placed within the nucleus of the accented syllable (i.e. the syllable that is phonologically associated to the starred tone of the accent), and always before the orthographic label and the break index mark at the end of the word. If the F0 peak or valley for the starred H or L tone does not occur within the accented syllable, labellers who so wish may mark the early (or late) F0 event with `>' (or `<') pointing to the following (or preceding) pitch accent label. 
Thus, for example, if the F0 maximum for a L+H* occurs after the end of the accented syllable, a labeller may mark the time of the F0 peak with a `<' pointing back to the L+H* label.

Implicit in our discussion of the five pitch accents is the notion that H* is the `default' accent type. So, if there is any uncertainty about how low the F0 is before the peak, as in some cases of possible L+H* near the beginning of an utterance, the transcriber should mark `H*' rather than `L+H*'.

4.3 Downstep Diacritic for Pitch Accents and Phrase Accents

Downstepped (high) tones will be marked explicitly using:

  !   preceding the downstepped pitch accent peak or downstepped H phrase accent.

Transcribers familiar with Pierrehumbert's full system should note that this eliminates the H*+L accent as a necessary downstep trigger within the system, since now the contrast between H* and H*+L will be marked by the absence versus presence of `!' on the following H tone. Note that, since it is the H tone in each case that is affected by the downstep, the `!' diacritic should immediately precede the affected H tone in a pitch accent or phrase accent. Note also that this diacritic is NEVER applied to the first H tone in a phrase.

Some example uses of the downstep diacritic are:

  H* !H- L%           for the downstepped high phrase tone in the ``calling contour'' that in Pierrehumbert's original system was analyzed as H*+L H- L%

  H* !H* L- L%        for the ``staircase'' pattern that in Pierrehumbert's original system was analyzed as H*+L H* L- L%

  L*+H L*+!H L*+!H    for the succession of downstepped peaks that would occur in a succession of scooped accents.

In light of our recommendations above that the possible tones of a level 4 break should be included as separate menu items, the possibility of the downstepped H phrase accent at full intonation phrase boundaries means that `!H-L%' and `!H-H%' should also be included as menu items.
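
The restriction in Section 4.3 that `!' is never applied to the first H tone in a phrase lends itself to a mechanical check. The Python sketch below is an illustration only and is not part of the ToBI conventions. The function name downstep_violations is hypothetical. The sketch assumes fully specified tone-tier labels given in temporal order, treats the intermediate phrase (closed by any label containing a phrase accent `-') as the domain of the rule, and ignores boundary tones, `%r', and `HiF0'.

```python
import re

# An H tone, optionally preceded by the downstep diacritic '!'.
H_TONE = re.compile(r"(!?)H")


def downstep_violations(labels):
    """Return the indices of labels in which '!' marks the first H tone
    of an intermediate phrase (assumed here to be the rule's domain)."""
    violations = []
    seen_h = False  # has any H tone occurred yet in the current phrase?
    for i, label in enumerate(labels):
        if "*" in label:
            # Pitch accent, e.g. H*, L+H*, L*+!H, H+!H*.
            tonal, ends_phrase = label, False
        elif "-" in label:
            # Phrase accent, alone (L-, !H-) or combined (L-H%, !H-L%).
            # Only the part before '-' is the phrase accent itself.
            tonal, ends_phrase = label.split("-")[0], True
        else:
            # Boundary tones (L%, H%, %H), %r, HiF0: ignored in this sketch.
            continue
        for match in H_TONE.finditer(tonal):
            if match.group(1) and not seen_h:
                violations.append(i)
            seen_h = True
        if ends_phrase:
            seen_h = False  # the next label starts a new intermediate phrase
    return violations


if __name__ == "__main__":
    for tune in ("H* !H- L%", "H* !H* L- L%", "L*+H L*+!H L*+!H", "!H* L-L%"):
        print(tune, "->", downstep_violations(tune.split()))
```

The three example tunes listed in Section 4.3 produce no violations under these assumptions, whereas a phrase that opens with `!H*' is flagged at its first label. A site that treats the full intonation phrase as the relevant domain would need to change the point at which seen_h is reset.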
4.4 Underspecification and Uncertainty The full ToBI transcription must include both break index values and tone values. However, to accommodate backward compatibility with previously labelled databases or to allow intermediate stages in the labelling process, a partial ToBI transcription may have only break index values or only tone values assigned. Underspecification of tonal values may be indicated by `*', `-', and `%' for a tonally unspecified pitch accent, phrase accent, and boundary tone, respectively. Note that this does not indicate uncertainty about the tonal value, but rather that the tonal values have yet to be assigned. On the tonal tier, two kinds of uncertainty may be indicated: uncertainty over whether an event of a particular type has occurred, and uncertainty over the tonal value of an event that clearly has occurred. Thus, for example, the labeller may be unsure whether a particular syllable is accented, or, knowing that it is accented, may be uncertain of the accent type. Uncertainty of the first sort (whether the event has occurred) is indicated by `*?', `-?', and `%?' for pitch accents, phrase accents, and boundary tones, respectively. Uncertainty of the second sort (over the tonal value of a clearly occurring event) is indicated by `X*?', `X-?', and `X%?'. Thus, for example: * means `This syllable is accented but the database does not yet have accent type transcribed.' *? means `I'm not sure whether this syllable is accented or not.' X*? means `I believe this syllable is accented but I am uncertain what accent type to assign.' A typical case where `*?' might be used is for a very strong syllable in a part of an utterance between a prenuclear H* and a nuclear H*, where the F0 contour is flat and high because of the preceding and following tones, making it difficult to detect intervening H* accents. A typical case where `X*?' 
might be used is a part of an utterance where the labeller cannot tell whether an accent is a L* accent or a H* accent in a compressed pitch range. 5. Miscellaneous Tier The miscellaneous tier will be used for any comments or markings (e.g., silence, audible breaths, laughter, disfluencies, and so on) desired by particular transcription groups. The only conventions ToBI specifies for this tier are that events in general should be labelled at their temporal beginnings and endings with labels of the form: event< ... event> These labels should be placed in the text transcription or in the waves(tm) label file to correspond as closely as possible to the temporal beginning and endings of the phenomena being described. So, a period of laughter plus speech might be indicated by marking the beginning and end of the laughter with: laughter< ... laughter> Single comments in the misc tier such as `bad pitch track' or `disfl' (for `disfluency') are also allowable. However, whenever a misc comment refers to a region and not just a particular point in time, mark the beginning and end of the region. For example, if the pitch tracking algorithm has made an identifiable error in a particular region, such as pitch doubling or pitch halving, the transcriber should consider giving this more specific information in the usual paired event label format. In general, it is the assumption of the participants in the common transcription group that silences should be automatically detectable, at least to a first approximation, and that transcriber time should not be spent marking these by hand. Disfluencies, by contrast, are not automatically detectable, and the absence of markings for them makes it difficult to parse the tone and break index tiers. 
For these reasons, transcribers are urged to mark disfluencies on the miscellaneous tier using `disfl<' and `disfl>' (or `disfl' if the disfluency is extremely localized), and to provide these marks in the miscellaneous tier menu when using waves(tm). Since demarcating a disfluent region is considerably more difficult than merely recognizing its presence, the marks `disfl<' and `disfl>' (or simple `disfl') should be interpreted as rough pointers to the disfluent region and transcribers should not agonize over placing them precisely. Suggested conventions for further specification of particular types of disfluencies and their labels are provided in the ``Guidelines for ToBI Labelling''.

6. Pitch Range

  HiF0   In transcriptions using waves(tm) label files, local pitch range will be marked for each intermediate phrase (interval between level 3 boundaries) with this diacritic.

To estimate a phrase's pitch range, mark a point within the pitch accent in the phrase which includes a `H' tone and which contains the F0 maximum for the phrase. That is, the accent containing the HiF0 mark should be one of H*, L+H*, L*+H, or H+!H*. Thus if an intermediate phrase contains only L* accents, HiF0 will NOT be marked for that phrase.

Transcribers should take reasonable care to choose a point in time that reflects the target of the H for the accent. In several cases this will mean choosing some point other than the actual F0 maximum. For example, sometimes the highest F0 value in an accented syllable reflects the `intrinsic' effect of a voiceless consonant and will thus be a poor estimate of the speaker's choice of pitch range. More seriously, in a phrase where the highest accent-related F0 occurs in a H* H- H% sequence, choosing the absolutely highest value for HiF0 will artifactually inflate the pitch range estimate by the amount of the upstep on the H%. In such cases, we recommend that the syllable's amplitude contour be used to pinpoint HiF0 within the candidate region.

7.
Redundancy Among Tiers There is some redundancy among tiers. For example, break index locations are redundant to the orthographic tier. Also, the occurrence of phrase accents and boundary tones on the tone tier is redundant to the presence of break index values `3' and `4' on the break index tier. In tonally underspecified databases, the marks `-' and `%' will be completely redundant to break index values `3' and `4'. Even in tonally specified databases, `-?' and `%?' will be redundant to break index values `3-' and `4-'. In order to save time and improve intertranscriber consistency, we recommend that labellers who use waves(tm) avail themselves of routines for automatically inserting redundant labels on either tier. 8. Files Associated with the Transcriptions 8.1 Speech File Formats Since utterances will be recorded and transcribed at different sites, and for different immediate research purposes, it seems unlikely that we can arrive at any simple guidelines for such matters as sampling rate. We recommend adoption of formats compatible with other corpora insofar as possible. 8.2 Transcription Label Files Each tier of ToBI should be representable in a simple text-based transcription, and as a separate label file in the waves(tm) label format. So, there will be separate label files for the orthographic tier, the break index tier, the tonal tier, and the miscellaneous tier. Such modularity allows partial transcription to be done and allows sites to add additional tiers as additional label files. All label files are of course aligned temporally via the waveform they label. This approach should also allow variation in display and access to different types of information. It is easy to provide software that supports labelling in such a format and that will generate summaries of prosodic information from such label files in a variety of formats. 9. Conventions for Non-waves(tm) Format by Jacques Terken and Mari Ostendorf Each line contains a number of fields. 
Fields are separated by markers to facilitate extraction of information. The format is as follows.

    field_1 ^field_2 $field_3 @field_4 ;field_5

The contents of the fields are as follows.

Field_1 contains the orthographic transcription. The syllable(s) containing a pitch accent is/are marked by an asterisk (*) before the vowel.

Field_2 contains the tonal transcription, including pitch accents, phrase accents and boundary tones. If a word contains more than one pitch accent, the association with asterisk-marked syllables in Field_1 is from left to right. Uncertainty about the occurrence or type of pitch accent is indicated in this field using the conventions described in Section 4.4. In addition, the accented syllable having highest pitch within an intermediate phrase can be marked by HiF0. The convention is that HiF0 is associated with an accented syllable containing an H (either H*, L+H*, L*+H or H+!H*) -- see Section 6. Finally, the convention with respect to phrasal accents is that a phrasal accent should be associated with the last word in the phrase, and that it is assumed to extend backwards until the last accented syllable in the phrase. This association convention is needed because the break index tier may not unambiguously indicate the location of an intermediate phrase boundary: ``At the break index tier, a 2 may signal a disjuncture that is weaker than expected at what is tonally a clear intermediate .. phrase boundary'' (See Section 3.1.)

Field_3 contains the break index value. This value gives the strength of the break between the word on the current line and the word on the next line. By definition, the beginning of a file is the beginning of the utterance; that is, there is an implied line before the first line containing only a 4 in the $-field, i.e. the break index field.

Field_4 contains the time markers associated with the break indices.
Since in waves(tm) each tonal marker also has a time stamp, a possible extension is to have a list of time stamps rather than a single time stamp. Since phrasal accents and boundary tones are by definition associated with word boundaries, separate time stamps would be needed only for tonal markers containing an asterisk. The convention would be to associate the list of time stamps from left to right with tonal markers and the word boundary. If there are tonal markers associated with the word but only one time specified in Field_4, the time marked is by default the word boundary time.

Field_5 contains miscellaneous information and comments. For comments continuing on the next line there is an obligatory continuation marker ``;'' at the beginning of the line, as follows:

    than ^ $1 @500 ;this is so much comment about
    ;all kinds of things that it
    *eight ^h* $1 @600 ;continues on the next lines

Thus, a typical line may be abstractly represented as:

    w(*)ord ^tonal_marker $break_index @(time_stamp) time_stamp ;comment

An example of the fields of a non-waves(tm) transcription is shown below. (The neat organization in columns is purely for reading convenience and is not a requirement.) The waveform, F0 contour, and associated labels in a waves(tm) transcription are given in Section 10.

    it's        ^             $1  @1.924903  ;
    l*ovely     ^L+H* HiF0    $1  @2.303698  ;
    and         ^             $1  @2.556273  ;
    y*ellowish  ^L+!H* L-     $3  @3.118653  ;
    and         ^             $1  @3.234365  ;
    it's        ^             $1  @3.406066  ;
    an          ^             $1  @3.514313  ;
    *old        ^X*? HiF0     $1  @3.733797  ;
    one         ^L-L%         $4  @4.074712  ;

10. Sample Utterance

Note: If you have waves(tm), you can get the example utterance and its F0 and label files to look at and play. To do this:

    ftp portia.ling.ohio-state.edu
    login as anonymous, using your local user-id as your password.
    cd pub
    get ToBI.example.tar
    tar -xvf ToBI.example.tar
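
The non-waves(tm) line format of Section 9 is simple enough to read with a short program. The Python sketch below is an illustration, not a tool distributed with these conventions. The names Entry, LINE_RE, and parse_lines are hypothetical. It assumes that the markers `^', `$', `@', and `;' occur in that order, that the first three never appear inside an earlier field, that time stamps are plain decimal numbers, and that a line beginning with `;' continues the comment of the preceding entry.

```python
import re
from dataclasses import dataclass
from typing import List

# One transcription line: field_1 ^field_2 $field_3 @field_4 ;field_5
LINE_RE = re.compile(
    r"""^(?P<word>[^\^]*)        # Field_1: orthographic word, '*' marks accent
        \^(?P<tones>[^$]*)       # Field_2: tonal labels, possibly empty
        \$(?P<brk>[^@;]*)        # Field_3: break index, e.g. 1, 3-, 2p, X
        (?:@(?P<times>[^;]*))?   # Field_4: one or more time stamps
        (?:;(?P<comment>.*))?    # Field_5: free comment
        $""",
    re.VERBOSE,
)


@dataclass
class Entry:
    word: str
    tones: List[str]
    break_index: str
    times: List[float]
    comment: str = ""


def parse_lines(lines):
    """Parse text-format ToBI lines into a list of Entry records."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            # Continuation marker: append to the previous entry's comment.
            if entries:
                extra = line[1:].strip()
                entries[-1].comment = (entries[-1].comment + " " + extra).strip()
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError("not a transcription line: %r" % line)
        entries.append(Entry(
            word=match.group("word").strip(),
            tones=match.group("tones").split(),
            break_index=match.group("brk").strip(),
            times=[float(t) for t in (match.group("times") or "").split()],
            comment=(match.group("comment") or "").strip(),
        ))
    return entries


SAMPLE = """\
it's ^ $1 @1.924903 ;
l*ovely ^L+H* HiF0 $1 @2.303698 ;
and ^ $1 @2.556273 ;
y*ellowish ^L+!H* L- $3 @3.118653 ;
and ^ $1 @3.234365 ;
it's ^ $1 @3.406066 ;
an ^ $1 @3.514313 ;
*old ^X*? HiF0 $1 @3.733797 ;
one ^L-L% $4 @4.074712 ;
"""

if __name__ == "__main__":
    for entry in parse_lines(SAMPLE.splitlines()):
        print(entry.word, entry.tones, entry.break_index, entry.times)
```

The break index is kept as a string because values such as `3-', `2p', and `X' are not numbers. The implied initial line with break index 4 is not materialized here; a program that needs it can prepend it to the result.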