MICASE Transcription and Mark-Up Conventions
Explanation of the colors, punctuation, and spelling in the MICASE online files.
| SGML TAG or SYMBOL | MEANING/DESCRIPTION | APPEARANCE IN ON-LINE TRANSCRIPTS (HTML VERSION) |
|---|---|---|
| SPEAKER ID | ||
| <U WHO=S1>, <U WHO=S2>, etc. | Speaker IDs, assigned in the order they first speak. | S1: at the beginning of each turn or interruption/backchannel. |
| <U WHO=SU>, <U WHO=SU-f>, <U WHO=SU-m> | Unknown speaker, without and with gender identified. | SU: SU-f, SU-m |
| <U WHO=SU-1> | Probable but not definite identity of speaker. | SU-1: |
| <SS> | Two or more speakers, in unison (used mostly for laughter). | SS: |
| PAUSES | ||
| <PAUSE DUR=:05> | Pauses of 4 seconds or longer are timed to the nearest second. | <P: 05> |
| , | Comma indicates a brief (1-2 second) mid-utterance pause with non-phrase-final intonation contour. | , |
| . | Period indicates a brief pause accompanied by an utterance final (falling) intonation contour; not used in a syntactic sense to indicate complete sentences. | . |
| … | Ellipses indicate a pause of 2-3 seconds | … |
| OVERLAPS | ||
| <OVERLAP>xxx</OVERLAP> | This tag encloses speech that is spoken simultaneously, either at the ends and beginnings of turns, or as interruptions or backchannel cues in the middle of one speaker’s turn. All overlaps are approximate and shown to the nearest word; a word is generally not split by an overlap tag. | Text of overlapping speech is in blue. |
| BACKCHANNEL CUES and FAILED INTERRUPTIONS | ||
| Embedded utterance (<U> tag within a <U> tag) | Backchannel cues from a speaker who doesn’t hold the floor and unsuccessful attempts to take the floor are embedded within the current speaker’s turn, and not shown as a separate line/paragraph. | [S3: Text of embedded speech is in orange and surrounded by orange square brackets.] |
| Embedded and overlapped utterance (<OVERLAP> tag within an embedded utterance) | Backchannel cues or unsuccessful interruptions that overlap with the main speaker’s speech. | [S3: Text of embedded speech that is overlapped is in blue and surrounded by orange speaker ID and square brackets.] |
| LAUGHTER | ||
| <EVENT DESC=LAUGH> or <EVENT DESC=LAUGH WHO=S2> | All laughter is marked. Speaker ID not marked if current speaker laughs. | <LAUGH>, <S8 LAUGH>, <SS LAUGH>, etc. |
| CONTEXTUAL EVENTS | ||
| <EVENT DESC="WRITING ON BOARD"> | Various contextual (non-speech) events are noted, usually only when they affect comprehension of the surrounding discourse. | <WRITING ON BOARD> |
| <EVENT DESC="APPLAUSE"> | | <APPLAUSE> |
| <EVENT DESC="AUDIO DISTURBANCE">, <EVENT DESC="BACKGROUND NOISE"> | | <AUDIO DISTURBANCE>, <BACKGROUND NOISE> |
| <EVENT DESC="SOUND EFFECT">, <EVENT DESC="GASP"> | | <SOUND EFFECT>, <GASP> |
| READING PASSAGES | ||
| <SEG TYPE="READING">xxx</SEG> | Used when part of an utterance is read verbatim. | <READING>xxx</READING> |
| FOREIGN WORDS | ||
| <FOREIGN>xxx</FOREIGN> | Used for non-English words or phrases. | Italics, e.g.: the mother says c’est quoi? and Annika says to parce que eh and then,… |
| PRONUNCIATION VARIATIONS | ||
| <SEG TYPE="PRON" SUBTYPE="/seltik/">Celtic</SEG> | Used when an unexpected pronunciation is used that would affect comprehension of the surrounding discourse. Dialect or other phonological variations are generally not represented. | Pronunciation guide follows the word, e.g.: …they asked the librarian for pictures of old Celtic <PRON: /seltik/> uniforms the basketball team, and it turns out that the project was he was supposed to find Celtic <PRON: /keltik/> costumes. |
| <SIC>xxx</SIC> | Used when a speaker makes a mistake without self-correcting, and the error might otherwise appear to be a transcribing error. | (sic) follows the word, e.g.: despite the fact that that was the era of Women’s Liberation like i say on the cover of Newsweek, and Gloria Steinman (sic) and uh Betty Friedan… |
| UNCERTAIN or UNINTELLIGIBLE SPEECH | ||
| (xx) (words) | Two x’s in parentheses indicate one or more words that are completely unintelligible. Words surrounded by parentheses indicate the transcription is uncertain. | i don’t (xx) whole (xx) analysis it just struck me… lemme not write it that way (lest it be confused) with C syntax… |
| NAMES | ||
| When participants’ names occur in a recording, they are changed to pseudonyms in the transcript, except in the case of most public colloquia (i.e. COL-prefixed files). In some cases, names of non-present people referred to in the recording are also changed. There is no SGML marking for names. | ||
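
The mapping from SGML tags to their on-line appearance, as listed in the table above, can be sketched in a few regular-expression substitutions. This is only an illustration under stated assumptions: the function name `render_for_display` is hypothetical, the exact attribute quoting and spacing in the distributed SGML files may differ from the table, and this is not the conversion tool used by the MICASE project. Colors, speaker turns, overlaps, and embedded utterances are not handled.

```python
import re


def render_for_display(sgml: str) -> str:
    """Convert a few MICASE SGML tags to the display forms listed in the table.

    Hypothetical helper for illustration; the patterns assume the tags are
    written exactly as they appear in the conventions table.
    """
    # <PAUSE DUR=:05> -> <P: 05>
    text = re.sub(r'<PAUSE DUR=:(\d+)>', r'<P: \1>', sgml)
    # <EVENT DESC=LAUGH WHO=S2> -> <S2 LAUGH> (event attributed to a speaker)
    text = re.sub(r'<EVENT DESC="?([^">]+?)"? WHO=([\w-]+)>', r'<\2 \1>', text)
    # <EVENT DESC="APPLAUSE"> -> <APPLAUSE> (quotes around DESC are optional)
    text = re.sub(r'<EVENT DESC="?([^">]+?)"?>', r'<\1>', text)
    # <SIC>word</SIC> -> word (sic)
    text = re.sub(r'<SIC>(.*?)</SIC>', r'\1 (sic)', text, flags=re.S)
    # <SEG TYPE="PRON" SUBTYPE="/seltik/">Celtic</SEG> -> Celtic <PRON: /seltik/>
    text = re.sub(
        r'<SEG TYPE="PRON" SUBTYPE="([^"]+)">\s*(.*?)\s*</SEG>',
        r'\2 <PRON: \1>',
        text,
        flags=re.S,
    )
    # <SEG TYPE="READING">xxx</SEG> -> <READING>xxx</READING>
    text = re.sub(
        r'<SEG TYPE="READING">(.*?)</SEG>',
        r'<READING>\1</READING>',
        text,
        flags=re.S,
    )
    return text


if __name__ == "__main__":
    print(render_for_display('old <SEG TYPE="PRON" SUBTYPE="/seltik/">Celtic</SEG> uniforms'))
    print(render_for_display('<EVENT DESC=LAUGH WHO=S2> <PAUSE DUR=:05>'))
```

The speaker-attributed event rule runs before the generic event rule so that a `WHO` attribute is not swallowed into the event description.
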

| | RULE or GUIDELINE | EXAMPLES |
|---|---|---|
| GENERAL | Standard orthography is used for most words, even though they may not be fully pronounced, may be pronounced with a foreign accent, etc. In general, phonologically reduced forms are not represented, except as noted below. | |
| CAPITALIZATION | Only proper nouns (names, departments, course titles, organizations, etc.) are capitalized (in addition to acronyms; see below). Neither the beginnings of turns nor the pronoun ‘i’ are capitalized. | Dr Hales received his M-S and B-S degrees at Stanford in nineteen eighty-two. his PhD at Princeton in eighty-six under the Harold W Dodds Honorific Fellowship… oh, i i think i know what you’re getting to. |
| FILLED PAUSES, BACKCHANNEL CUES, EXCLAMATIONS, etc. | All hesitation and filler words, backchannel cues, and transcribable exclamations are spelled out, as shown on the right. | Hesitation/Filler Words/Backchannels: hm, hm’, huh, mm, mhm, uh, um, mkay |
| | | Yes/No Responses: yes: mhm, mm, okey-doke, okey-dokey, uhuh, yeah, yep, yuhuh; no: uh’uh, huh’uh, ’m’m, huh’uh |
| | | Exclamations/Doubt/Misc.: ach, ah, ahah, gee, jeez, oh, ooh, oop, oops, tch, ugh, uh’oh, whoa, yay |
| CONTRACTIONS and LEXICALIZED REDUCED FORMS | All standard contractions of is, am, are, had, have, would, not are represented, including [noun + has been/have been/is]. | i’d, i’ve, i’m, i’ll, she’s, she’ll, he’s, they’ve, etc. that’ll, it’ll, there’re etc. |
| | Different forms of modals + have are represented. | coulda, could’ve, couldn’t, couldn’ve, couldna, woulda, would’ve, wouldn’t, wouldn’ve, wouldna, shoulda, should’ve, shouldn’t, shouldn’ve, shouldna |
| | Lexicalized phonological reductions are limited to those listed on the right. | betcha, cuz, ’em (=them), gimme, gotta, hafta, kinda, lookit (as vocative only), lotsa, lotta, oughta, sorta, wanna |
| ACRONYMS, ABBREVIATIONS, LETTERS AS VARIABLES | Acronyms are written in all caps. Three commonly abbreviated titles are left as abbreviations, but without periods. An acronym pronounced as a word is run together as one word. When an acronym is spelled out, it appears in all caps with hyphens between the letters (except PhD). | Exception: PhD (no hyphens, no period); Dr, Mr, Mrs (not spelled out); NASA, TOEFL; C-I-A, F-B-I, E-L-I, L-S-and-A |
| | Letters used as variables in math and science are written in all caps with hyphens between modifying or adjoining elements. | X-Y axis, N-squared, X-to-the-N-minus-one |
| HYPHENS | Standard hyphenation rules apply, as in the Chicago Manual of Style, where they exist. | pre-med, pre-calc, pre-law, mid-thirties, mid-nineteen-ninety-nine, pre-Christian, non-Euclidean, non-native |
| NUMBERS | All numbers are fully spelled out as words. Standard hyphenation rules apply, with some additional guidelines: page numbers, course numbers, and room numbers are all hyphenated. | nineteen ten; nineteen twenty-nine; page one-fifty-seven; Poli Sci one-sixty; room thirty-twelve |
| REPETITIONS and REPAIRS | All repetitions of a word, partial word or phrase are transcribed. | it’s no longer than a than a, calendar year… |
| | Truncated or cut-off words have a hyphen at the end of the last audible sound/letter. | so, come on up, grab yourself a ins- implement of destruction. |
| | An underscore at the end of a word indicates a false start in which a whole word is spoken but then the speaker re-starts the phrase. | well, it will be_ it’s sort of_ it’s a management human-resource kind of job… |
| FOREIGN WORDS | Foreign words are spelled as in the original language when it uses a roman alphabet; otherwise, an approximate phonetic transliteration is used. | and see what, the Buddha, was s- was saying um, the tatha, gata, Sanskrit’s a really interesting language… |
| PRONUNCIATION VARIATIONS | As mentioned above, minor pronunciation variations are not represented in the spelling, with the exception of the contractions and lexicalized forms listed in this table. | |
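
Two of the spelling conventions above are mechanical enough to sketch in code: the hyphenated form of an acronym that is spelled out letter by letter, and the trailing hyphen or underscore that marks a truncated word or a false start. The function names `spell_out_acronym` and `find_repairs` are hypothetical and are not part of any MICASE tool. The sketch assumes whitespace-separated tokens and Python 3.9 or later, and it cannot tell whether an acronym is pronounced as a word (NASA) or as letters (C-I-A); that decision stays with the transcriber.

```python
def spell_out_acronym(acronym: str) -> str:
    """Return the all-caps, hyphen-separated form used for spelled-out acronyms."""
    if acronym == "PhD":  # documented exception: no hyphens, no period
        return acronym
    return "-".join(acronym.upper())


def find_repairs(utterance: str) -> list[tuple[str, str]]:
    """List truncated words (trailing hyphen) and false starts (trailing underscore)."""
    repairs = []
    for token in utterance.split():
        core = token.rstrip(",.…")  # ignore trailing pause punctuation
        if len(core) > 1 and core.endswith("-"):
            repairs.append((core, "truncated"))
        elif len(core) > 1 and core.endswith("_"):
            repairs.append((core, "false start"))
    return repairs


if __name__ == "__main__":
    print(spell_out_acronym("FBI"))
    print(find_repairs("so, come on up, grab yourself a ins- implement of destruction."))
    print(find_repairs("well, it will be_ it’s sort of_ it’s a management human-resource kind of job…"))
```

Hyphens inside a word, as in human-resource, are left alone because only a hyphen or underscore at the end of a token is treated as a repair marker.
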