Annotations#
A wide variety of options is available for annotating digital textual data; a first distinction is between automated and manual annotation. Only the former is covered here; manual annotation can easily be carried out with INCEpTION, a cross-platform open-source tool (written in Java) for interactive annotation whose official documentation includes guides and video tutorials.
Note
Further automated annotation tools (such as PyMUSAS for semantic annotations) will be documented in future versions of the compendium.
Stanza#
CATLISM, 290-295
Stanza uses universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats) for Part-Of-Speech annotation (see https://stanfordnlp.github.io/stanza/pos.html#description), accessible through a word's pos, xpos, and feats attributes respectively. Scripts [s5.19] (lines 33 and 39) and [s5.20] (line 34) both employ universal POS tags (pos).
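As a minimal sketch of how these three levels appear on a Stanza Word object - using a stand-in object here, since running a real pipeline requires a downloaded language model, and with an illustrative feats value - they can be read as follows:

```python
# Sketch: the three annotation levels exposed on a Stanza Word object.
# word.pos holds the UPOS tag, word.xpos the treebank-specific tag, and
# word.feats the morphological features (None when a word carries none).
from types import SimpleNamespace


def describe(word):
    # Render the three levels as one tab-separated line; "_" stands in
    # for absent morphological features, as in CoNLL-U files
    feats = word.feats if word.feats is not None else "_"
    return f"{word.text}\t{word.pos}\t{word.xpos}\t{feats}"


# Stand-in for a Word produced by a real stanza.Pipeline; with an actual
# pipeline one would iterate over doc.sentences[i].words instead
word = SimpleNamespace(text="runs", pos="VERB", xpos="VBZ",
                       feats="Number=Sing|Person=3|Tense=Pres")
print(describe(word))
```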
Installing the tool#
pip install stanza
Importing the module and installing a language model#
1# Import the module
2import stanza
3
4# Define the Pipeline and download the language model if needed
5nlp = stanza.Pipeline("en")
Annotating textual data from .txt files into XML format#
1# Import modules to find files using wildcards; use regexes; use Stanza; read and write XML files
2import glob
3import re
4import stanza
5from lxml import etree
6
7# Initiate the Stanza module; define the language of the texts,
8# and the types of information Stanza needs to add to each word
9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Find all the .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14 # Read the name of the file, delete the extension '.txt', and store
15 # it inside of a variable
16 filename = re.sub(r"\.txt$", "", file)
17 # Open, read, and process the contents of the file with Stanza
18 doc = nlp(open(file, encoding="utf-8").read())
19 # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20 text = etree.Element("text")
21 # Initiate a counter that increases by 1 for every sentence, then
22 # for every sentence do:
23 for i, sentence in enumerate(doc.sentences, start=1):
24 # Create an <s> tag element to enclose the sentence, and assign
25 # the attribute 'n' containing as value the number of the sentence
26 s = etree.SubElement(text, "s", n=str(i))
27 # For every word in the sentence do:
28 for word in sentence.words:
29 # If the word is a punctuation mark:
30 if word.pos == "PUNCT":
31 # Enclose it inside of a <c> tag element, and add its POS and lemma as attributes
32 etree.SubElement(
33 s, "c", pos=str(word.pos), lemma=str(word.lemma).lower()
34 ).text = str(word.text)
35 # If the word is not a punctuation mark:
36 else:
37 # Enclose it inside of a <w> tag element, and add its POS and lemma as attributes
38 etree.SubElement(
39 s, "w", pos=str(word.pos), lemma=str(word.lemma).lower()
40 ).text = str(word.text)
41 # Add the sentences and the words inside of the 'text' element
42 tree = etree.ElementTree(text)
43 # Write the resulting XML structure to a file named after the input filename, using utf-8 encoding, adding the XML declaration
44 # at the start of the file and graphically formatting the layout ('pretty_print')
45 tree.write(
46 filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
47 )
How to use script [s5.19]#
Copy/download the file s5.19_use_stanza-XML.py into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.
cd Downloads/corpus_txt_data/
Finally, run the script from the terminal:
python s5.19_use_stanza-XML.py
Annotating textual data from .txt files into verticalised XML format#
1# Import modules to find files using wildcards; use regexes; use Stanza; read and write XML files
2import glob
3import re
4import stanza
5from lxml import etree
6
7# Initiate the Stanza module; define the language of the texts,
8# and the types of information Stanza needs to add to each word
9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Select all .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14 # Read the name of the file, delete the extension '.txt', and store
15 # it inside a variable
16 filename = re.sub(r"\.txt$", "", file)
17 # Open, read, and process the file with Stanza
18 doc = nlp(open(file, encoding="utf-8").read())
19 # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20 text = etree.Element("text")
21 # Initiate a counter that increases by 1 for every sentence, then
22 # for every sentence do:
23 for i, sentence in enumerate(doc.sentences, start=1):
24 # Initialise an empty list that will contain the lines of tagged text for the currently processed sentence
25 tagged = []
26 # Create an <s> tag element to enclose the sentence, and assign
27 # the attribute 'n' containing as value the number of the sentence
28 s = etree.SubElement(text, "s", n=str(i))
29 # For every word in the sentence do:
30 for word in sentence.words:
31 # Construct the line that contains the word as it appears in the text,
32 # its lemma, and its POS function, each one separated from the other using
33 # a tab
34 line = word.text + "\t" + word.lemma + "\t" + word.pos
35 # Add the line to the list of tagged lines
36 tagged.append(line)
37 # Collate all the tagged contents saved to the list 'tagged' and enclose them inside of the <s> element tag
38 s.text = "\n".join(tagged)
39 tree = etree.ElementTree(text)
40 # Write the resulting XML structure to a '.vrt' file named after the input filename, using utf-8 encoding, adding the XML
41 # declaration at the start of the file and graphically formatting the layout ('pretty_print')
42 tree.write(
43 filename + ".vrt", pretty_print=True, xml_declaration=True, encoding="utf-8"
44 )
How to use script [s5.20]#
Copy/download the file s5.20_use_stanza-vrt.py into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.
cd Downloads/corpus_txt_data/
Finally, run the script from the terminal:
python s5.20_use_stanza-vrt.py
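Each line of the resulting .vrt file carries one token together with its lemma and UPOS tag, tab-separated. As a minimal sketch (the sample string below is illustrative; real content comes from running the script above), such verticalised content can be read back into tuples:

```python
# Parse verticalised (one token per line) tab-separated content back into
# (word, lemma, pos) tuples; the sample string is illustrative only
sample = "And\tand\tCCONJ\nnow\tnow\tADV\nfor\tfor\tADP"

tokens = [tuple(line.split("\t")) for line in sample.split("\n")]
for word, lemma, pos in tokens:
    print(word, lemma, pos)
```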
Example of data extracted with script [s5.19]#
1<?xml version='1.0' encoding='UTF-8'?>
2<text>
3 <s n="1">
4 <w pos="CCONJ" lemma="and">And</w>
5 <w pos="ADV" lemma="now">now</w>
6 <w pos="ADP" lemma="for">for</w>
7 <w pos="PRON" lemma="something">something</w>
8 <w pos="ADV" lemma="completely">completely</w>
9 <w pos="ADJ" lemma="different">different</w>
10 <c pos="PUNCT" lemma="!">!</c>
11 </s>
12</text>
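The annotated XML can then be queried programmatically, e.g. to pull out the lemmas of each sentence. A minimal sketch using the standard-library parser, with a shortened version of the output above embedded as a string for illustration (the XML declaration is omitted, since ET.fromstring() rejects str input that carries an encoding declaration):

```python
import xml.etree.ElementTree as ET

# The structure produced by [s5.19]: <text> wrapping numbered <s> sentences,
# with <w> elements for words and <c> elements for punctuation
xml_data = """<text>
  <s n="1">
    <w pos="CCONJ" lemma="and">And</w>
    <w pos="ADV" lemma="now">now</w>
    <c pos="PUNCT" lemma="!">!</c>
  </s>
</text>"""

root = ET.fromstring(xml_data)
for s in root.findall("s"):
    # Collect lemmas of words only, skipping <c> punctuation elements
    lemmas = [w.get("lemma") for w in s.findall("w")]
    print(s.get("n"), " ".join(lemmas))  # prints: 1 and now
```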