Language detection

The language(s) of a document can be identified with various tools, such as langid or py3langid, its more recent and updated fork (by Adrien Barbaresi, the author of Trafilatura).

In line with the book, only langid is demonstrated in the script below. Usage of py3langid will be included in future versions of the compendium.

Options and arguments for langid can be found in the official documentation.
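As a brief orientation, the sketch below shows two langid calls that are relevant here: `set_languages()`, which limits the candidate languages, and `classify()`, which returns a (language code, score) tuple. The function name `classify_restricted` and its `identifier` parameter are hypothetical, introduced only so that the example can be tried with a stand-in object when langid is not installed. Note that `set_languages()` changes the module's global state, so the restriction stays in force for later calls.

```python
def classify_restricted(text, languages, identifier=None):
    """Return the (language code, score) pair for `text`, choosing only among `languages`.

    `identifier` is a hypothetical hook: any object exposing set_languages()
    and classify(). When omitted, the langid module itself is used.
    """
    if identifier is None:
        # Imported lazily so the function can also be exercised with a stand-in
        import langid

        identifier = langid
    # Limit the candidate languages (ISO 639-1 codes) considered by the classifier
    identifier.set_languages(languages)
    # classify() returns a tuple: (language code, score)
    return identifier.classify(text)


# Example call (requires langid to be installed):
# code, score = classify_restricted(
#     "Questo è un esempio di frase in italiano.",
#     ["en", "it", "es", "fr", "de"],
# )
```

Restricting the candidates in this way may be useful when, as in the script below, only a known set of languages is expected in the data.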

[1] CATLISM, 281-283

Identify a set of predefined languages in .txt files and write a summary report in spreadsheet format [1]

Script [s5.16]
# Import the modules to find files by path pattern, to read/write csv files,
# to check the output file, to use regular expressions, and to detect the languages
import glob
import csv
import os
import re
import langid

# Name of the output csv file that will contain the results of the detections
output_path = "language_count.csv"
# Write the header only when the file is new or empty, so that repeated runs do not duplicate it
write_header = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
# Create the output csv file (and its writer) using the 'append' ("a") mode to continuously write
# new lines to the end of the file; newline="" lets the csv module manage line endings
csvfile = open(output_path, "a", encoding="utf-8", newline="")
csvfile_writer = csv.writer(csvfile)
# Write the header of the csv file
if write_header:
    csvfile_writer.writerow(
        [
            "doc_id",
            "en",
            "% en",
            "it",
            "% it",
            "es",
            "% es",
            "fr",
            "% fr",
            "de",
            "% de",
            "n_lines",
        ]
    )

# Search for .txt files in the current folder (where the script resides) and in all its subfolders
files = glob.glob("./**/*.txt", recursive=True)

# For each file do:
for file in files:
    # Extract the filename and its path, without the file extension
    filename = re.sub(r"\.txt$", "", file)
    # Create an empty list that will contain the language code detected for each line of the input file
    all_lines = []
    # Open the input file, read all its lines, and close it automatically
    with open(file, "r", encoding="utf-8") as text_file:
        text_content = text_file.readlines()
    # Read each line of the input file, and for each one do:
    for line in text_content:
        # Detect the language of the line
        langcode = langid.classify(line)[0]
        # Add the language ISO 639-1 code to the created list
        all_lines.append(langcode)
    # Count the total number of lines in the input file, and store it into a variable
    lines_count = len(all_lines)
    # Skip empty files, which would otherwise cause a division by zero when computing percentages
    if lines_count == 0:
        continue
    # Count the number of lines detected as English and other languages, and store each one in a separate variable
    en_count = all_lines.count("en")
    it_count = all_lines.count("it")
    es_count = all_lines.count("es")
    fr_count = all_lines.count("fr")
    de_count = all_lines.count("de")
    # Compute the percentage of each language in the document, using the above results, and store results in separate variables
    en_perc = round((en_count / lines_count) * 100)
    it_perc = round((it_count / lines_count) * 100)
    es_perc = round((es_count / lines_count) * 100)
    fr_perc = round((fr_count / lines_count) * 100)
    de_perc = round((de_count / lines_count) * 100)
    # Create the csv line to be written, using the variables storing the different collected values
    csv_line = [
        filename,
        en_count,
        en_perc,
        it_count,
        it_perc,
        es_count,
        es_perc,
        fr_count,
        fr_perc,
        de_count,
        de_perc,
        lines_count,
    ]
    # Write the above line to the csv file
    csvfile_writer.writerow(csv_line)

# Close the output csv file once all the input files have been processed
csvfile.close()
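The per-language counts and percentages computed in the script can also be expressed more compactly. The following is a minimal sketch, not part of the book's script: the function name `language_shares` and its parameters are hypothetical, and `classify` stands for any callable that returns a (language code, score) tuple, such as `langid.classify`.

```python
from collections import Counter


def language_shares(lines, classify, languages=("en", "it", "es", "fr", "de")):
    """Count lines per language and compute rounded percentages.

    `classify` is any callable returning a (language code, score) tuple,
    for example langid.classify. The result maps each language in
    `languages` to a (count, percentage) pair, and "n_lines" to the total.
    """
    # Tally the language code detected for each line
    counts = Counter(classify(line)[0] for line in lines)
    total = sum(counts.values())
    summary = {}
    for lang in languages:
        # A Counter returns 0 for codes that were never detected
        count = counts[lang]
        # Guard against empty input to avoid dividing by zero
        percentage = round(count / total * 100) if total else 0
        summary[lang] = (count, percentage)
    summary["n_lines"] = total
    return summary
```

Keeping the classifier as a parameter means the counting logic can be checked with a simple stand-in function, and a different detector could be swapped in without touching the rest of the code.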