“The text is a tissue of citations, resulting from the thousand sources of culture.”
- Roland Barthes
Robert Kosara contrasts my version of the pay gap graphic with the NYT original and notes how small changes make a big difference in how a graphic reads.
But what Nathan's version is missing is the story. The additional data mostly adds confusion: move your mouse over the year in the lower right, and what do you see? Lots of points are moving around, but there doesn’t appear to be a clear trend. The additional categories are interesting, but what do they add?
Not much. When I was putting together the graphic, I was hoping for a clear trend — something so obvious that it didn't have to be explained. Instead I got fuzzy results. And that's where I stopped. On the other hand, the NYT version explains those fuzzy results, namely the outliers, such as women CEOs who work for non-profits or the greater percentage of men in medical specialties like surgery.
In analysis, assuming the users are experts on their data, annotation is less important. It's about allowing them to stay nimble and ask and answer a lot of questions. Graphics that tell stories with data, however, already have something interesting to say.
by Herman Stehouwer & Sebastian Drude
As all linguistic field workers know, transcribing and further annotating audio and video recordings and other texts is a very expensive and time-consuming procedure. For a single hour of a recording of a lesser documented language, it can take more than a hundred hours of expert time to create useful linguistic annotations such as “basic annotation” (a transcription and a translation) and “basic glossing”: additional information on individual units – usually morphs, sometimes words – such as an individual gloss (indication of meaning or function) and perhaps categorical information such as a part-of-speech tag (or its equivalent on the morphological level). More advanced glossing can take even longer.
Furthermore, information on the lexical units encountered in the texts needs to be transferred to a lexical tool. After all, often one of the goals of field work is to create a usable lexicon describing the endangered language.
Currently, this work is best supported by tools like (The Field Linguist’s) Toolbox or the FieldWorks Language Explorer (FLEx), neither of which offers proper support for media files. Many users have asked for support for advanced annotation tasks in ELAN, ideally using LEXUS to build, access and expand a lexical database. Making this possible is the objective of TLA’s newest project, called LEXAN: a modular annotation support framework coupled to a new interface in ELAN. It will support different “annotyzers”, i.e. modules that produce annotation suggestions for the researcher, including machine-learning modules.
The “annotyzers” will work on a tier or set of tiers, the “source tier[s]”, as chosen by the user, and typically produce an additional tier or a group of tiers, the “target tier[s]”, with content generated based on the source tiers and additional data, e.g. lexical data.
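To make the source-tier/target-tier idea concrete, here is a minimal sketch of what an annotyzer interface could look like. All names and the tier representation are hypothetical illustrations, not the actual LEXAN API.

```python
# Hypothetical sketch of an annotyzer interface; the real LEXAN design may differ.
from typing import Callable

# A tier is modeled here as a list of (start_ms, end_ms, text) annotation units.
Annotation = tuple[int, int, str]
Tier = list[Annotation]

class Annotyzer:
    """Produces suggested target tiers from one or more source tiers."""

    def __init__(self, name: str, process: Callable[[list[Tier]], list[Tier]]):
        self.name = name
        self.process = process

    def suggest(self, source_tiers: list[Tier]) -> list[Tier]:
        # Suggestions are offered to the researcher for review, not applied blindly.
        return self.process(source_tiers)

# Example: a trivial annotyzer that copies the first source tier unchanged.
copy_tier = Annotyzer("copy", lambda tiers: [list(tiers[0])])
src = [(0, 1200, "first utterance"), (1300, 2500, "second utterance")]
print(copy_tier.suggest([src])[0])
```

The key design point is that an annotyzer only proposes content for target tiers; the researcher remains in control of what ends up in the annotation document.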
A first annotyzer-like functionality of ELAN (without requiring interaction with a lexicon yet) would be the possibility to copy one entire source tier, for instance a detailed transcript, or a literal translation. The created target tier can then serve as a starting point for preparing another tier with similar but edited content, for instance a cleaner adapted version of the orthographic transcript, or an idiomatic free translation.
Similarly, a basic tokenizer would copy the individual words (recognized by spaces and perhaps hyphens or similar punctuation) on one source tier – containing an orthographical representation of a sentence – into separate annotation units on a new (target) word-tier which can then be corrected (e.g., cells can be joined in the case of composed words such as black board, or on the contrary split in the case of clitics which may orthographically be parts of more comprehensive words).
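Such a tokenizer could be sketched roughly as follows; the even division of the parent segment's time span across the words is purely illustrative, not a claim about how ELAN aligns child annotations.

```python
# Sketch of a basic tokenizer annotyzer: splits each sentence-level annotation
# on whitespace and hyphens into word-level units on a new target tier.
import re

def tokenize_tier(sentence_tier):
    """sentence_tier: list of (start_ms, end_ms, text) units."""
    word_tier = []
    for start, end, text in sentence_tier:
        words = [w for w in re.split(r"[\s\-]+", text) if w]
        # Divide the parent segment's span evenly, for illustration only.
        step = (end - start) // max(len(words), 1)
        for i, w in enumerate(words):
            word_tier.append((start + i * step, start + (i + 1) * step, w))
    return word_tier

sentences = [(0, 900, "the black-board"), (1000, 1600, "was clean")]
print(tokenize_tier(sentences))
```

The researcher would then correct the result by hand, joining cells for compounds or splitting cells for clitics, as described above.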
As a possible next step, already making use of interaction with a lexicon, an annotyzer would use the annotations on the word-tier to build an “intermediate” database of individual inflected word forms. Each entry in this database would have at least a field which contains the citation form of the lexical word for each given inflected word form, possibly together with a semantic label (lexical gloss) and a disambiguating homonym index in case two lexical words with identical citation forms exist. Some of these fields would be obtained from the lexicon once the citation form has been determined, and the citation form itself and other information (such as a “complete gloss” of the inflected word form which includes semantic effects of inflectional categories and the like) could be written back to new target tiers in ELAN. Although much of this information would still have to be added by hand the first time an inflected word form occurs, this simple setting would already help to: a) create lexical entries for new lexical units, b) reduce typing when the form occurs a second, third etc. time, and c) encourage and support consistency.
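The workings of such an intermediate database can be sketched with a simple lookup table. The field names below are illustrative only, not the actual LEXUS or LEXAN schema.

```python
# Sketch of the "intermediate" word-form database: each inflected word form
# maps to a record with its citation form, an optional disambiguating homonym
# index, and a gloss. Field names are hypothetical.
word_forms = {}  # inflected form -> record

def annotate_word(form):
    rec = word_forms.get(form)
    if rec is None:
        # First occurrence: an empty record the researcher fills in by hand.
        rec = {"citation": None, "homonym_index": None, "gloss": None}
        word_forms[form] = rec
    # Later occurrences reuse the stored record, reducing typing
    # and enforcing consistency across the corpus.
    return rec

# First occurrence: researcher supplies citation form and complete gloss.
annotate_word("houses").update(citation="house", gloss="house-PL")
# Second occurrence: the record is simply looked up again.
print(annotate_word("houses"))
```

This captures points b) and c) above: repeated forms cost no extra typing, and every occurrence of a form receives the same annotation unless deliberately changed.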
Many users acquainted with Toolbox or FLEx would expect the future LEXAN to offer a “glossing” functionality like the one they know from these tools. This would include a parser module (generic or language-specific, purely string-matching or advanced and context-aware, static or with learning capacities, etc.) which would split up the individual inflected word forms on a source word-tier into individual morphs on a new target morph-tier. This morph-tier would then serve as a source for adding further target tiers with annotations such as glosses (indication of lexical meaning or functional/categorical effects) and perhaps part-of-speech-like tags (on the morpheme level). In the lexicon, this functionality would presuppose corresponding fields in all entries, such as a part-of-speech label for each morph and a gloss, which are probably the most common fields in lexical databases in field research anyway (in addition to the citation and variant forms of the morph and possibly a way to distinguish different but related senses which are given as lexicographical definitions or translation equivalents). Again, correct parses and glosses would be stored in the intermediate database so that they can be re-used and referred to.
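A purely string-matching parser of the kind mentioned above can be illustrated in a few lines. The toy lexicon and the greedy suffix-stripping strategy are assumptions for the sake of the example; real parsers would use context, learning, and far richer lexical data.

```python
# Sketch of a purely string-matching morph parser with a toy lexicon.
# Suffix entries are marked with a leading hyphen, Toolbox-style.
lexicon = {
    "house": {"gloss": "house", "pos": "n"},
    "-s":    {"gloss": "PL",    "pos": "n.infl"},
    "walk":  {"gloss": "walk",  "pos": "v"},
    "-ed":   {"gloss": "PST",   "pos": "v.infl"},
}

def parse(word):
    # Greedily strip a known suffix if the remaining stem is also in the lexicon.
    for suffix in lexicon:
        if suffix.startswith("-") and word.endswith(suffix[1:]):
            stem = word[: -(len(suffix) - 1)]
            if stem in lexicon:
                return [stem, suffix]
    return [word] if word in lexicon else None

def gloss(morphs):
    # Join the per-morph glosses into an interlinear gloss line.
    return "-".join(lexicon[m]["gloss"] for m in morphs)

morphs = parse("walked")
print(morphs, gloss(morphs))
```

Even this naive matcher shows why such parsers favor agglutinative morphology: it relies on morphs being recoverable as contiguous substrings, which breaks down for fusional or suppletive forms.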
It is a well-known fact that general parsers work better for some languages than for others (for instance, morphological parsers usually score high with predominantly isolating and agglutinative languages and less well with inflectional and polysynthetic languages). It is also true that glossing schemes and set-ups are based on specific types of linguistic theories – for instance, the setting presented above (which corresponds to the default functionalities of Toolbox and FLEx) is clearly tied to an “item-and-arrangement” (less so “item-and-process”) reasoning about language structure. In principle, an infrastructure such as the one proposed here should strive to be as interoperable with different linguistic theories as possible, which would imply that “word-and-paradigm” theories, too, could fruitfully use the tools and functionalities. The proposal of an “intermediate” database with one entry for every individual (inflected) word form goes in that direction, allowing, for instance, forms to be characterized with respect to their functional categories without assigning these categories to individual morphs. Of course, to be fully functional and cater for arbitrary theories and language types, complex (multiple-word) forms must also be covered, which presupposes the development of modules (parsers and the like) that recognize syntactic structures and are able to cope with, say, discontinuous word forms.
More sophisticated and complete annotations on the morphological, syntactic and even other levels (phonetic/phonological, intonational) can be added by additional annotyzers as corresponding modules become available – for instance, morphological or syntactic constituent structures or grammatical relations could be generated (semi)automatically and represented in corresponding tiers in ELAN.
by Przemek Lenkiewicz
In the AVATecH project we are currently ready to share our initial results with the research community. The first recognizers are tested by MPI researchers and their valuable feedback is recorded in order to help us further improve our work and deliver tools that can save a lot of researchers’ time.
In order to spread the word about AVATecH and get more researchers interested, we have created this short movie clip that introduces the principal ideas of the project and shows some of our results.
by Herman Stehouwer
Is there a need to limit certain aspects of statistical language models?
Is it necessary to pre-limit the size of the n-gram?
Is it useful to use linguistic annotation, within alternative sequence selection tasks?
According to a new study by Herman Stehouwer, the size of the n-gram can be completely flexible depending on the situation. The study also finds that the addition of certain linguistic annotations, specifically part-of-speech annotations and dependency-parses, did not aid the model in making decisions.
The study compares the ability of a language model to select the correct alternative from sets of alternatives in hundreds of experiments. These experiments were performed for three different alternative sequence selection tasks, for four different annotations (and also for no annotation), and for four different ways to combine the annotation with the text. The results of the study have been used to write the thesis “Statistical Language Models for Alternative Sequence Selection”. This thesis will be defended on the 7th of December at 18:00 in the Aula of Tilburg University.
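The core of an alternative sequence selection task can be illustrated with a small sketch: score each candidate sequence under an n-gram model and pick the most probable one. The toy bigram counts and add-one smoothing below are assumptions for illustration, not the models or data used in the thesis.

```python
# Sketch of alternative sequence selection with a toy bigram model:
# score each candidate sentence and return the most probable one.
from math import log

# Hypothetical counts standing in for a real training corpus.
bigram_counts = {("<s>", "their"): 9, ("their", "house"): 8,
                 ("<s>", "there"): 10, ("there", "house"): 1}
unigram_counts = {"<s>": 19, "their": 9, "there": 10, "house": 9}

def score(sentence):
    """Log-probability under the bigram model, with add-one smoothing."""
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    vocab = len(unigram_counts)
    for a, b in zip(tokens, tokens[1:]):
        total += log((bigram_counts.get((a, b), 0) + 1)
                     / (unigram_counts.get(a, 0) + vocab))
    return total

def select(alternatives):
    # The selected alternative is the candidate with the highest score.
    return max(alternatives, key=score)

print(select(["their house", "there house"]))
```

A confusible-word task like this (their/there) is one typical instance of alternative sequence selection; the thesis examines several such tasks and what flexible n-gram sizes and linguistic annotations contribute to them.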
Coinciding with the defense a colloquium on language modeling is organized with invited talks by Colin de la Higuera, Louis ten Bosch, and Antal van den Bosch. For more information on the colloquium you can send an e-mail to herman.stehouwer [at] mpi.nl or look at its website.
It's great to join the 2011–12 HASTAC Scholars Program. I have enjoyed reading about the projects many of the other new scholars are working on over the past couple of weeks and am looking forward to conversing with you all over the coming year.
by Aarthy Somasundaram & Han Sloetjes
In this new release of ELAN a completely new “Transcription Mode” and an improved “Segmentation Mode” are introduced. Both have been developed in close cooperation with ELAN users.
The Transcription Mode is built for high-speed transcription. Where the traditional Annotation mode can be seen as accuracy-oriented rather than productivity-oriented, the Transcription mode aims at increasing the speed and efficiency of transcription work. The user interface has been designed with convenient text entry in mind: the main element is a table containing the annotations of selected tier types, displayed in a vertical order. Each cell in the table represents an annotation (or a position where a dependent annotation can be created). The segments (annotations) need to be created first, in the segmentation or annotation mode, after which text can be typed into the (empty) segments in this mode. Operation in this mode is very much keyboard-oriented. Selecting an annotation plays the corresponding segment automatically and brings it into edit mode: ready for you to start typing. Press the TAB key to replay. After editing, hit ENTER (or use the navigation keys) to jump to the next annotation, to play that segment automatically and to start typing right away, and so on… Activation of a cell will silently create child annotations if they don’t exist yet — merely clicking an empty cell (or moving there using the keyboard) creates an annotation and opens it for editing. All this brings the transcription work down to just listening and typing, making it easy for the transcriber.
“On-the-fly Segmentation” has been moved into the main window as the new Segmentation mode (instead of in a separate dialog). It is now easier to switch between tiers while the media is playing. Segments are created by keyboard strokes and can be modified by dragging with the mouse. This mode introduces a preliminary step-and-repeat playback mode.
Apart from that, some new multiple file processing functions have been added, like annotations from overlaps and annotation statistics. An option to add a group of tiers for a new participant has been implemented, as well as for deletion of multiple tiers in one action. Customization of the program has been improved by the introduction of new preference elements.
by Aarthy Somasundaram
Toward the end of last year a new version of ELAN was released, containing lots of new features and improved functionalities, a new media player solution for Windows and fixes for a number of issues and bugs in previous versions.
A first implementation of interaction with LEXUS, the MPI-developed web-based lexicon tool for creating and editing lexical databases, has been added. A new lexicon viewer allows the user to look up values in an online lexicon and to apply a value to the selected annotation.
ELAN has been facing many codec-related problems, especially with mpeg-1 and mpeg-2 files. To eliminate some of these, a new player for Windows has been developed based on DirectShow (JDS, Java-DirectShow).
To use this player, it is necessary to select it first in the Platform/OS tab in the “Edit Preferences” window.
This version extends its support for controlled vocabularies with externally defined closed controlled vocabularies (located e.g. on the web). The list of supported file formats for importing controlled vocabularies has been extended with .txt and .csv. The file format of externally defined closed controlled vocabulary files is .ecv, which is closely related to the .eaf format.
To make life easier and to increase the work speed of ELAN users, several improvements have been made to get things done with fewer steps and clicks. A few tier-based operations, like removing multiple annotations or annotation values from selected tiers or creating dependent annotations recursively on all dependent tiers, can now be performed much faster and more conveniently. It is also possible to automatically create dependent annotations when an annotation is created on a tier with dependent tiers. The merge transcriptions function is extended with options for appending one file to the other, making the merging process more versatile.
by Binyam Gebrekidan Gebre
The AVATecH project (Advancing Video Audio Technology in Humanities Research) aims at investigating, developing and applying advanced technology for semi-automatic annotation of collected audio-visual recordings used in humanities research. Currently, even the simplest annotations of, for example, recorded dialogs take too much time and effort. By making the annotation process more efficient through the use of automatic detectors, more data can be annotated, allowing new possibilities for search and corpus analysis and better theory building.
Initial research will focus on the creation of detector components which, given media recordings, generate lists of segments and annotations. Such detectors can be invoked from within annotation tools such as the widely used and proven ELAN software and from a batch-processing framework, to process a number of recordings in one go.
The project is organized in two major phases:
1. First, low hanging fruit detectors will be identified that can operate on a selected collection of typical audio/video material. They will be integrated into ELAN so that the developers can interact with researchers during the evaluation.
2. Second, more advanced and complex detector tasks will be tackled after the results of the low hanging fruit detectors have been evaluated.
The detectors developed will be made available via interactive annotation tools and batch processing. In this project, two Max Planck Institutes (the MPI for Psycholinguistics in Nijmegen and the MPI for Social Anthropology in Halle) and two Fraunhofer Institutes (the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS in Sankt Augustin and the Fraunhofer Heinrich Hertz Institute HHI in Berlin) are cooperating in different capacities. The Max Planck Institutes act as experts for the research driven questions resulting from an analysis of the AV material and for user-friendly interaction tools. The Fraunhofer Institutes act as experts for digital sound and video processing methods. More information on AVATecH can be found on the project’s homepage.