
A wide variety of options are available for annotating digital textual data; a first distinction is between automated and manual annotation. Only the former is covered here. Manual annotation can be carried out with INCEpTION, a multi-platform open-source tool (written in Java) for interactive annotation; its official documentation includes guides and video tutorials.


Further automated annotation tools (such as PyMUSAS for semantic annotations) will be documented in future versions of the compendium.

Stanza¹ (CATLISM, 290-295)

Stanza uses universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats) for part-of-speech annotation², accessible through pos, xpos, and ufeats respectively. Scripts [s5.19] (lines 33 and 39) and [s5.20] (line 34) both use the universal POS tags (pos).
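
The three annotation layers can be read off each word once a text has been processed. The following is a minimal sketch rather than part of the original scripts: the helper names `tag_rows` and `demo` and the sample sentence are illustrative assumptions, and `tag_rows` relies only on the attributes named above (`text`, `pos`, `xpos`, `ufeats`).

```python
def tag_rows(doc):
    """Return one (text, pos, xpos, ufeats) tuple per word in a processed document."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.text, word.pos, word.xpos, word.ufeats))
    return rows


def demo():
    """Print the tags for a sample sentence (requires Stanza and the English model)."""
    import stanza  # imported here so that tag_rows() can be used without Stanza installed

    nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
    for row in tag_rows(nlp("And now for something completely different!")):
        print(*row, sep="\t")
```

Calling `demo()` prints one tab-separated line per word; a word with no morphological features should show `None` in the last column.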

Installing the tool

Command [c5.35]
pip install stanza

Importing the module and installing a language model

Script [s5.18]
# Import the module
import stanza

# Define the Pipeline and download the language model if needed
nlp = stanza.Pipeline("en")

Annotating textual data from .txt files into XML format

Script [s5.19]
 1# Import modules to find files using wildcard patterns; use regexes; use Stanza; read and write XML files
 2import glob
 3import re
 4import stanza
 5from lxml import etree
 7# Initiate the Stanza module; define the language of the texts,
 8# and the types of information Stanza needs to add to each word
 9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Find all the .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14    # Read the name of the file, delete the extension '.txt', and store
15    # it inside of a variable
16    filename = re.sub(r"\.txt$", "", file)
17    # Open, read, and process the contents of the file with Stanza
18    doc = nlp(open(file, encoding="utf-8").read())
19    # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20    text = etree.Element("text")
21    # Initiate a counter that increases by 1 for every sentence, then
22    # for every sentence do:
23    for i, sentence in enumerate(doc.sentences, start=1):
24        # Create an <s> tag element to enclose the sentence, and assign
25        # the attribute 'n' containing as value the number of the sentence
26        s = etree.SubElement(text, "s", n=str(i))
27        # For every word in the sentence do:
28        for word in sentence.words:
29            # If the word is a punctuation mark:
30            if word.pos == "PUNCT":
31                # Enclose it inside of a <c> tag element, and add its POS and lemma as attributes
32                etree.SubElement(
33                    s, "c", pos=str(word.pos), lemma=str(word.lemma).lower()
34                ).text = str(word.text)
35            # If the word is not a punctuation mark:
36            else:
37                # Enclose it inside of a <w> tag element, and add its POS and lemma as attributes
38                etree.SubElement(
39                    s, "w", pos=str(word.pos), lemma=str(word.lemma).lower()
40                ).text = str(word.text)
41    # Wrap the 'text' element, which now contains all sentences and words, in an ElementTree object
42    tree = etree.ElementTree(text)
43    # Write the resulting XML structure to a file named after the input filename, using utf-8 encoding, adding the XML declaration
44    # at the start of the file and graphically formatting the layout ('pretty_print')
45    tree.write(
46        filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
47    )

How to use script [s5.19]

Copy or download the script into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.

cd Downloads/corpus_txt_data/

Finally, run the script from the terminal:


Annotating textual data from .txt files into verticalised XML format

Script [s5.20]
 1# Import modules to find files using wildcard patterns; use regexes; use Stanza; read and write XML files
 2import glob
 3import re
 4import stanza
 5from lxml import etree
 7# Initiate the Stanza module, and define the language of the texts,
 8# and the types of information Stanza needs to add to each word
 9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Select all .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14    # Read the name of the file, delete the extension '.txt', and store
15    # it inside a variable
16    filename = re.sub(r"\.txt$", "", file)
17    # Open, read, and process the file with Stanza
18    doc = nlp(open(file, encoding="utf-8").read())
19    # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20    text = etree.Element("text")
21    # Initiate a counter that increases by 1 for every sentence, then
22    # for every sentence do:
23    for i, sentence in enumerate(doc.sentences, start=1):
24        # Initialise an empty list that will contain the lines of tagged text for the currently processed sentence
25        tagged = []
26        # Create an <s> tag element to enclose the sentence, and assign
27        # the attribute 'n' containing as value the number of the sentence
28        s = etree.SubElement(text, "s", n=str(i))
29        # For every word in the sentence do:
30        for word in sentence.words:
31            # Construct the line that contains the word as it appears in the text,
32            # its lemma, and its POS function, each one separated from the other using
33            # a tab
34            line = f"{word.text}\t{word.lemma}\t{word.pos}"
35            # Add the line to the list of tagged lines
36            tagged.append(line)
37        # Collate all the tagged contents saved to the list 'tagged' and enclose them inside of the <s> element tag
38        s.text = "\n".join(tagged)
39    tree = etree.ElementTree(text)
40    # Write the resulting XML structure to a '.vrt' file named after the input filename, using utf-8 encoding, adding the XML
41    # declaration at the start of the file and graphically formatting the layout ('pretty_print')
42    tree.write(
43        filename + ".vrt", pretty_print=True, xml_declaration=True, encoding="utf-8"
44    )
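
Because each `<s>` element of the verticalised output holds one tab-separated line per word (word form, lemma, POS), the result can be read back with the standard library alone. The following is a minimal sketch, assuming the layout produced by the script above; the function name `read_vrt` and the inline sample are illustrative and not part of the original scripts.

```python
import xml.etree.ElementTree as ET


def read_vrt(xml_text):
    """Return a list of sentences, each a list of (word, lemma, pos) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        rows = []
        for line in (s.text or "").splitlines():
            # Skip blank lines, if any
            if line.strip():
                word, lemma, pos = line.split("\t")
                rows.append((word, lemma, pos))
        sentences.append(rows)
    return sentences


# Hand-written sample mimicking the structure of a '.vrt' file
sample = '<text><s n="1">And\tand\tCCONJ\nnow\tnow\tADV\n!\t!\tPUNCT</s></text>'
for sentence in read_vrt(sample):
    for word, lemma, pos in sentence:
        print(word, lemma, pos)
```

To read a file from disk instead of a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.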

How to use script [s5.20]

Copy or download the script into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.

cd Downloads/corpus_txt_data/

Finally, run the script from the terminal:


Example of data extracted with script [s5.19]

Example [e5.28]
<?xml version='1.0' encoding='UTF-8'?>
<text>
  <s n="1">
    <w pos="CCONJ" lemma="and">And</w>
    <w pos="ADV" lemma="now">now</w>
    <w pos="ADP" lemma="for">for</w>
    <w pos="PRON" lemma="something">something</w>
    <w pos="ADV" lemma="completely">completely</w>
    <w pos="ADJ" lemma="different">different</w>
    <c pos="PUNCT" lemma="!">!</c>
  </s>
</text>
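
Output in this shape can be queried with any XML parser. As a minimal sketch (the function name `pos_counts` and the inline sample are illustrative assumptions, and Python's standard `xml.etree.ElementTree` is used here rather than lxml), the following counts how often each POS tag occurs among the `<w>` and `<c>` elements:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def pos_counts(xml_text):
    """Count the 'pos' attribute values of all <w> and <c> elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag in ("w", "c"):
            counts[element.get("pos")] += 1
    return counts


# Abridged, hand-copied version of the example above, wrapped in a <text> root
sample = """<text>
  <s n="1">
    <w pos="CCONJ" lemma="and">And</w>
    <w pos="ADV" lemma="now">now</w>
    <w pos="ADV" lemma="completely">completely</w>
    <c pos="PUNCT" lemma="!">!</c>
  </s>
</text>"""

for tag, count in pos_counts(sample).most_common():
    print(tag, count)
```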