Annotations#
A wide variety of options is available for annotating digital textual data; a first distinction is between automated and manual annotation. Only the former is covered here; manual annotation can easily be carried out with INCEpTION, a cross-platform open-source tool (written in Java) for interactive annotation whose official documentation includes guides and video tutorials.
Note
Further automated annotation tools (such as PyMUSAS for semantic annotations) will be documented in future versions of the compendium.
Stanza#
CATLISM, 290-295
Stanza uses universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats) for Part-Of-Speech annotation (see https://stanfordnlp.github.io/stanza/pos.html#description), accessible through a word's pos, xpos, and feats attributes respectively. Scripts [s5.19] (lines 33 and 39) and [s5.20] (line 34) both employ universal POS tags (pos).
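As a minimal sketch of how these three levels appear on a Stanza Word object - using a stand-in object here, since running a real pipeline requires a downloaded language model, and with an illustrative feats value - they can be read as follows:

```python
# Sketch: the three annotation levels exposed on a Stanza Word object.
# word.pos holds the UPOS tag, word.xpos the treebank-specific tag, and
# word.feats the morphological features (None when a word carries none).
from types import SimpleNamespace


def describe(word):
    # Render the three levels as one tab-separated line; "_" stands in
    # for absent morphological features, as in CoNLL-U files
    feats = word.feats if word.feats is not None else "_"
    return f"{word.text}\t{word.pos}\t{word.xpos}\t{feats}"


# Stand-in for a Word produced by a real stanza.Pipeline; with an actual
# pipeline one would iterate over doc.sentences[i].words instead
word = SimpleNamespace(text="runs", pos="VERB", xpos="VBZ",
                       feats="Number=Sing|Person=3|Tense=Pres")
print(describe(word))
```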
Installing the tool#
pip install stanza
Importing the module and installing a language model#
1# Import the module
2import stanza
3
4# Define the Pipeline and download the language model if needed
5nlp = stanza.Pipeline("en")
Annotating textual data from .txt files into XML format#
1# Import modules to find files using wildcards; use regexes; use Stanza; read and write XML files
2import glob
3import re
4import stanza
5from lxml import etree
6
7# Initiate the Stanza module; define the language of the texts,
8# and the types of information Stanza needs to add to each word
9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Find all the .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14 # Read the name of the file, delete the extension '.txt', and store
15 # it inside of a variable
16 filename = re.sub(r"\.txt$", "", file)
17 # Open, read, and process the contents of the file with Stanza
18 doc = nlp(open(file, encoding="utf-8").read())
19 # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20 text = etree.Element("text")
21 # Initiate a counter that increases by 1 for every sentence, then
22 # for every sentence do:
23 for i, sentence in enumerate(doc.sentences, start=1):
24 # Create an <s> tag element to enclose the sentence, and assign
25 # the attribute 'n' containing as value the number of the sentence
26 s = etree.SubElement(text, "s", n=str(i))
27 # For every word in the sentence do:
28 for word in sentence.words:
29 # If the word is a punctuation mark:
30 if word.pos == "PUNCT":
31 # Enclose it inside of a <c> tag element, and add its POS and lemma as attributes
32 etree.SubElement(
33 s, "c", pos=str(word.pos), lemma=str(word.lemma).lower()
34 ).text = str(word.text)
35 # If the word is not a punctuation mark:
36 else:
37 # Enclose it inside of a <w> tag element, and add its POS and lemma as attributes
38 etree.SubElement(
39 s, "w", pos=str(word.pos), lemma=str(word.lemma).lower()
40 ).text = str(word.text)
41 # Add the sentences and the words inside of the 'text' element
42 tree = etree.ElementTree(text)
43 # Write the resulting XML structure to a file named after the input filename, using utf-8 encoding, adding the XML declaration
44 # at the start of the file and graphically formatting the layout ('pretty_print')
45 tree.write(
46 filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
47 )
How to use script [s5.19]#
Copy/download the file s5.19_use_stanza-XML.py into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.
cd Downloads/corpus_txt_data/
Finally, run the script from the terminal:
python s5.19_use_stanza-XML.py
Annotating textual data from .txt files into verticalised XML format#
1# Import modules to find files using wildcards; use regexes; use Stanza; read and write XML files
2import glob
3import re
4import stanza
5from lxml import etree
6
7# Initiate the Stanza module; define the language of the texts,
8# and the types of information Stanza needs to add to each word
9nlp = stanza.Pipeline("en", processors="tokenize,mwt,pos,lemma")
10# Select all .txt files in the current folder
11files = glob.glob("*.txt")
12# For each file, do:
13for file in files:
14 # Read the name of the file, delete the extension '.txt', and store
15 # it inside a variable
16 filename = re.sub(r"\.txt$", "", file)
17 # Open, read, and process the file with Stanza
18 doc = nlp(open(file, encoding="utf-8").read())
19 # Create the root <text> element tag - with no extra attributes - that will wrap the content of the file
20 text = etree.Element("text")
21 # Initiate a counter that increases by 1 for every sentence, then
22 # for every sentence do:
23 for i, sentence in enumerate(doc.sentences, start=1):
24 # Initialise an empty list that will contain the lines of tagged text for the currently processed sentence
25 tagged = []
26 # Create an <s> tag element to enclose the sentence, and assign
27 # the attribute 'n' containing as value the number of the sentence
28 s = etree.SubElement(text, "s", n=str(i))
29 # For every word in the sentence do:
30 for word in sentence.words:
31 # Construct the line that contains the word as it appears in the text,
32 # its lemma, and its POS function, each one separated from the other using
33 # a tab
34 line = word.text + "\t" + word.lemma + "\t" + word.pos
35 # Add the line to the list of tagged lines
36 tagged.append(line)
37 # Collate all the tagged contents saved to the list 'tagged' and enclose them inside of the <s> element tag
38 s.text = "\n".join(tagged)
39 tree = etree.ElementTree(text)
40 # Write the resulting XML structure to a '.vrt' file named after the input filename, using utf-8 encoding, adding the XML
41 # declaration at the start of the file and graphically formatting the layout ('pretty_print')
42 tree.write(
43 filename + ".vrt", pretty_print=True, xml_declaration=True, encoding="utf-8"
44 )
How to use script [s5.20]#
Copy/download the file s5.20_use_stanza-vrt.py into the folder containing the data to be annotated (in .txt format); then navigate to that folder in the terminal, e.g.
cd Downloads/corpus_txt_data/
Finally, run the script from the terminal:
python s5.20_use_stanza-vrt.py
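Each line of the resulting .vrt file carries one token together with its lemma and UPOS tag, tab-separated. As a minimal sketch (the sample string below is illustrative; real content comes from running the script above), such verticalised content can be read back into tuples:

```python
# Parse verticalised (one token per line) tab-separated content back into
# (word, lemma, pos) tuples; the sample string is illustrative only
sample = "And\tand\tCCONJ\nnow\tnow\tADV\nfor\tfor\tADP"

tokens = [tuple(line.split("\t")) for line in sample.split("\n")]
for word, lemma, pos in tokens:
    print(word, lemma, pos)
```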
Example of data extracted with script [s5.19]#
1<?xml version='1.0' encoding='UTF-8'?>
2<text>
3 <s n="1">
4 <w pos="CCONJ" lemma="and">And</w>
5 <w pos="ADV" lemma="now">now</w>
6 <w pos="ADP" lemma="for">for</w>
7 <w pos="PRON" lemma="something">something</w>
8 <w pos="ADV" lemma="completely">completely</w>
9 <w pos="ADJ" lemma="different">different</w>
10 <c pos="PUNCT" lemma="!">!</c>
11 </s>
12</text>
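The annotated XML can then be queried programmatically, e.g. to pull out the lemmas of each sentence. A minimal sketch using the standard-library parser, with a shortened version of the output above embedded as a string for illustration (the XML declaration is omitted, since ET.fromstring() rejects str input that carries an encoding declaration):

```python
import xml.etree.ElementTree as ET

# The structure produced by [s5.19]: <text> wrapping numbered <s> sentences,
# with <w> elements for words and <c> elements for punctuation
xml_data = """<text>
  <s n="1">
    <w pos="CCONJ" lemma="and">And</w>
    <w pos="ADV" lemma="now">now</w>
    <c pos="PUNCT" lemma="!">!</c>
  </s>
</text>"""

root = ET.fromstring(xml_data)
for s in root.findall("s"):
    # Collect lemmas of words only, skipping <c> punctuation elements
    lemmas = [w.get("lemma") for w in s.findall("w")]
    print(s.get("n"), " ".join(lemmas))  # prints: 1 and now
```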