Language detection
The language(s) of a document can be identified with various tools, such as langid or its more recent, updated fork py3langid (by Adrien Barbaresi, the author of Trafilatura).
Following the book, only langid is exemplified in the script below. Usage of py3langid will be included in future versions of the compendium.
Options and arguments for langid can be found in the official documentation.
CATLISM, 281-283
Identify a set of predefined languages in .txt files and write a summary report in spreadsheet format
# Import the modules to find files by pattern, to read/write csv files,
# to use regular expressions, and to detect the languages
import glob
import csv
import re
import langid

# Create the output csv file (and its writer) that will contain the results of the detections, using the 'append'
# ("a") mode to continuously write new lines to the end of the file; newline="" is recommended by the csv module
csvfile = open("language_count.csv", "a", encoding="utf-8", newline="")
csvfile_writer = csv.writer(csvfile)
# Write the header of the csv file
csvfile_writer.writerow(
    [
        "doc_id",
        "en",
        "% en",
        "it",
        "% it",
        "es",
        "% es",
        "fr",
        "% fr",
        "de",
        "% de",
        "n_lines",
    ]
)

# Search for .txt files in the current working folder and in all of its subfolders
files = glob.glob("./**/*.txt", recursive=True)

# For each file do:
for file in files:
    # Extract the filename and its path, without the file extension
    # (the dot is escaped and the pattern anchored so that only a final ".txt" is removed)
    filename = re.sub(r"\.txt$", "", file)
    # Create an empty list that will contain the language code detected for each line of the input file
    all_lines = []
    # Open the input file, read all of its lines, and close it
    with open(file, "r", encoding="utf-8") as input_file:
        text_content = input_file.readlines()
    # Read each line of the input file, and for each one do:
    for i in text_content:
        # Detect the language of the line
        langcode = langid.classify(i)[0]
        # Add the language ISO 639-1 code to the created list
        all_lines.append(langcode)
    # Count the total number of lines in the input file, and store it into a variable
    lines_count = len(all_lines)
    # Skip empty files, which would otherwise cause a division by zero below
    if lines_count == 0:
        continue
    # Count the number of lines detected as English and other languages, and store each one in a separate variable
    en_count = all_lines.count("en")
    it_count = all_lines.count("it")
    es_count = all_lines.count("es")
    fr_count = all_lines.count("fr")
    de_count = all_lines.count("de")
    # Compute the percentage of each language in the document, using the above results, and store results in separate variables
    en_perc = round((en_count / lines_count) * 100)
    it_perc = round((it_count / lines_count) * 100)
    es_perc = round((es_count / lines_count) * 100)
    fr_perc = round((fr_count / lines_count) * 100)
    de_perc = round((de_count / lines_count) * 100)
    # Create the csv line to be written, using the variables storing the different collected values
    csv_line = [
        filename,
        en_count,
        en_perc,
        it_count,
        it_perc,
        es_count,
        es_perc,
        fr_count,
        fr_perc,
        de_count,
        de_perc,
        lines_count,
    ]
    # Write the above line to the csv file
    csvfile_writer.writerow(csv_line)

# Close the output csv file once all the input files have been processed
csvfile.close()
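
The per-language counting and percentage steps of the script can also be written more compactly. The sketch below is an illustration and not the book's code: the names `language_shares` and `toy_classify` are hypothetical, and the classifier is passed in as a function so that the example runs without langid installed. With langid, the `classify` argument would be something like `lambda line: langid.classify(line)[0]`.

```python
from collections import Counter


def language_shares(lines, classify, languages):
    """Return ({language: (count, rounded percentage)}, total number of lines).

    `classify` is any function mapping a line of text to a language code.
    """
    # Count how many lines were assigned to each language code
    counts = Counter(classify(line) for line in lines)
    total = sum(counts.values())
    report = {}
    for lang in languages:
        n = counts[lang]
        # Guard against empty input to avoid a division by zero
        perc = round(n / total * 100) if total else 0
        report[lang] = (n, perc)
    return report, total


# Hypothetical stand-in classifier, used only to keep the sketch self-contained
def toy_classify(line):
    return "it" if "ciao" in line.lower() else "en"


report, n_lines = language_shares(
    ["Hello there", "Ciao a tutti", "Good morning", "See you"],
    toy_classify,
    ["en", "it", "es", "fr", "de"],
)
print(report, n_lines)
```

Keeping the list of languages in one place means that adding a language to the report only requires extending that list, rather than adding a new pair of count and percentage variables.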