Youtube#
Warning
As of March 2023, the tool youtube-dl (release 2021.12.17) suggested in the book has a number of issues and therefore cannot be used to download data from Youtube correctly. The commands below use yt-dlp, an alternative tool (forked from youtube-dl, with issues fixed and additional options) whose basic usage is identical to that of youtube-dl. Any command in the book that uses youtube-dl can be replicated with yt-dlp instead, as is done on this page.
Data from Youtube (and from more than 1,200 other platforms, the same ones supported by youtube-dl) can be collected using yt-dlp.
Options and arguments for the tool can be found in the official documentation.
Note
While yt-dlp supports the extraction of comments (youtube-dl has no such option), this page currently follows the contents of the book, where comments are downloaded using youtube-comment-downloader. Future versions of the compendium will include options for downloading comments using yt-dlp.
Installing the tools (CATLISM, 247)#
pip install yt-dlp
[c5.28]
pip install youtube-comment-downloader
[c5.30]
Using the tools (CATLISM, 247; 254; 263)#
yt-dlp 'URL' --write-info-json --skip-download --write-annotations --write-description
yt-dlp 'URL' --write-info-json --skip-download --write-subs --sub-langs LL --sub-format FORMAT
youtube-comment-downloader --url "URL" --output FILE.jsonl
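When several videos have to be collected, the commands above can also be assembled programmatically. The following is a minimal sketch and not part of the book: the helper name `build_subtitle_command` and the placeholder URL are hypothetical, while the flags are the ones used in the commands above. The sketch only builds and prints the argument list; it does not download anything.

```python
import shlex


def build_subtitle_command(url, language="en", sub_format="srv3"):
    """Return the yt-dlp argument list for metadata plus subtitles (no video).

    Hypothetical helper: it only mirrors the second command shown above.
    """
    return [
        "yt-dlp",
        url,
        "--write-info-json",
        "--skip-download",
        "--write-subs",
        "--sub-langs",
        language,
        "--sub-format",
        sub_format,
    ]


# Placeholder URL, used only for illustration
command = build_subtitle_command("https://www.youtube.com/watch?v=VIDEO_ID")
# Show the command as it would be typed in the terminal
print(shlex.join(command))
```

Assuming yt-dlp is installed, the list could then be passed to `subprocess.run(command, check=True)` to perform the actual download.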
Extracting the data#
Extract collected Youtube data (everything except comments) to XML format (CATLISM, 264-269)#
# Import modules for: working on local folders and files; regular expressions; finding files in folder;
# reading JSON files; using BeautifulSoup; working with XML files
import os
import re
from glob import glob
import json
from bs4 import BeautifulSoup
from lxml import etree

# Manually set the 2-letter ISO 639-1 language code (e.g. English = en), so that only the subtitle files for the set
# language are processed
language_code = "en"

# Create a regular expression to capture the title of the video preceding youtube-dl default naming conventions, where:
# [FILENAME].info.json = the JSON file containing the metadata details
# [FILENAME].[LL].srv3 = the XML file containing the subtitles in SRV3 format, where [LL] is the 2-letter ISO 639-1 language code
# [FILENAME].[LL].ttml = the XML file containing the subtitles in TTML format, where [LL] is the 2-letter ISO 639-1 language code
# A raw string (r"...") is used so that the backslashes reach the regular expression engine unchanged
filename_filter = re.compile(
    r"(.*?)\.(info\.json|[A-Za-z]{1,3}\.srv3|[A-Za-z]{1,3}\.ttml)"
)
# Create an empty list to store all the video titles
unique_filenames_list = []
# List all filenames present in the folder where the script resides
files = glob("*.*")

# For every single filename found in the folder, do:
for single_file in files:
    # Search for the regular expression for capturing metadata and subtitle files in the filename, and store the result
    # in the 'found_filename' variable
    found_filename = re.search(filename_filter, single_file)
    # If the filename matches the regular expression, extract the filename without the extensions; then check if the cleaned
    # filename is present in the unique_filenames_list, and if not add it
    if found_filename is not None and found_filename[1] not in unique_filenames_list:
        unique_filenames_list.append(found_filename[1])


# Create the function to convert the time-format employed by TTML (HH:MM:SS.MS) into the one employed by SRV3
# (total number of milliseconds); adapted from
# https://stackoverflow.com/questions/59314323/how-to-convert-timestamp-into-milliseconds-in-python
def convert_timestamp(msf):
    hours, minutes, seconds = msf.split(":")
    seconds, milliseconds = seconds.split(".")
    hours, minutes, seconds, milliseconds = map(
        int, (hours, minutes, seconds, milliseconds)
    )
    return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds


# For each unique filename found do:
for filename in unique_filenames_list:
    # Recreate the full filenames with extensions, and store each one of them into a single variable
    json_file = filename + ".info.json"
    srv3_file = filename + "." + language_code + ".srv3"
    ttml_file = filename + "." + language_code + ".ttml"
    # Create the XML element <text>, root element of the final output
    text_tag = etree.Element("text")

    # Open the metadata JSON file and read its contents (the file is closed automatically afterwards):
    with open(json_file, encoding="utf-8") as f:
        metadata_file = json.load(f)
    # Add a set of metadata points as attributes of the <text> element tag
    text_tag.attrib["uploader"] = metadata_file["uploader"]
    text_tag.attrib["date"] = metadata_file["upload_date"]
    text_tag.attrib["views"] = str(metadata_file["view_count"])
    text_tag.attrib["title"] = metadata_file["fulltitle"]
    # Check if the 'like_count' metadata point is present, if not assign the value "na" to the 'likes' attribute
    text_tag.attrib["likes"] = str(metadata_file.get("like_count", "na"))

    # Check if the SRV3 file exists (priority is given to SRV3 over TTML due to the presence of autocaptioning details);
    # if so, do:
    if os.path.isfile(srv3_file):
        # Assign the attribute 'format' with a value of 'srv3' to the <text> element tag
        text_tag.attrib["format"] = "srv3"
        # Create the output filename using the input filename
        output_filename = srv3_file + ".xml"
        # Open the SRV3 file and parse its XML contents using BeautifulSoup
        with open(srv3_file, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f, features="xml")
        # If any <s> element tag carries the attribute 'ac' (= autocaption), whatever its value, then the subtitles are the
        # result of autocaptioning; hence assign the value 'true' to the variable 'is_ac'. Otherwise assign the value 'false'
        if soup.body.find("s", attrs={"ac": True}):
            is_ac = "true"
        else:
            is_ac = "false"

        # Assign the value of 'is_ac' to the <text> element tag attribute 'autocaption'
        text_tag.attrib["autocaption"] = is_ac

        # For each paragraph (i.e. each line of the subtitles) in the file do:
        for sent in soup.body.find_all("p"):
            # Check if the textual content of the paragraph is longer than 1 character; this avoids adding empty paragraphs
            # to the final output (paragraphs that are not longer than 1 character are skipped)
            if len(sent.get_text()) > 1:
                # Create a <p> element tag inside of the XML output
                p_tag = etree.SubElement(text_tag, "p")
                # Add the attribute 'time' (indicating the starting time of the paragraph) and assign it the value appearing in 't'
                p_tag.attrib["time"] = str(sent["t"])
                # Add the attribute 'is_ac' and assign it the value of the previously created variable 'is_ac'
                p_tag.attrib["is_ac"] = is_ac
                p_tag.text = sent.get_text()

    # If the SRV3 file does not exist, check if the TTML file does instead; then do (only the steps that differ from the
    # SRV3 procedure are commented):
    elif os.path.isfile(ttml_file):
        text_tag.attrib["format"] = "ttml"
        text_tag.attrib["autocaption"] = "na"
        output_filename = ttml_file + ".xml"
        with open(ttml_file, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f, features="xml")

        for sent in soup.body.find_all("p"):
            if len(sent.get_text()) > 1:
                p_tag = etree.SubElement(text_tag, "p")
                # Add the 'time' attribute, assigning it as value the starting time from the 'begin' attribute in the original file,
                # converted into milliseconds using the 'convert_timestamp' function
                p_tag.attrib["time"] = str(convert_timestamp(str(sent["begin"])))
                p_tag.attrib["is_ac"] = "na"
                p_tag.text = sent.get_text()
    # If neither the SRV3 nor the TTML files are found, print 'No subtitle files found.' and move on to the next video
    # without writing any output (no output filename has been defined for this video)
    else:
        print("No subtitle files found.")
        continue

    # Write the extracted data formatted in XML to the final XML structure
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
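To make the timestamp conversion concrete, the `convert_timestamp` function from the script is repeated here as a standalone sketch and applied to the `begin` value that appears in the TTML sample further down this page. Like the script, it assumes the fractional part of the timestamp always has three digits; the second input value is invented for illustration.

```python
def convert_timestamp(msf):
    """Convert a TTML timestamp (HH:MM:SS.mmm) into total milliseconds."""
    hours, minutes, seconds = msf.split(":")
    seconds, milliseconds = seconds.split(".")
    hours, minutes, seconds, milliseconds = map(
        int, (hours, minutes, seconds, milliseconds)
    )
    return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds


# 9 seconds and 870 milliseconds
print(convert_timestamp("00:00:09.870"))  # 9870
# 1 hour, 2 minutes, 3 seconds and 4 milliseconds (invented value)
print(convert_timestamp("01:02:03.004"))  # 3723004
```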
How to use script [s5.13]#
Copy or download the file s5.13_youtube-dl_subs-to-XML.py into the folder containing the data downloaded through yt-dlp (e.g. through c5.29 and c5.32); then move into that folder from the terminal, e.g.
cd Downloads/youtube_data/
Finally, run the script from the terminal:
python s5.13_youtube-dl_subs-to-XML.py
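Before running the script, it can be useful to check which video titles it would pick up from the folder. The sketch below applies the regular expression used in [s5.13] to a hypothetical list of filenames (invented for illustration) instead of the actual folder contents.

```python
import re

# Same pattern as in the script, written as a raw string
filename_filter = re.compile(
    r"(.*?)\.(info\.json|[A-Za-z]{1,3}\.srv3|[A-Za-z]{1,3}\.ttml)"
)

# Hypothetical folder contents, invented for illustration
files = [
    "My video [abc123].info.json",
    "My video [abc123].en.srv3",
    "Another video [xyz789].info.json",
    "Another video [xyz789].en.ttml",
    "notes.txt",
]

unique_filenames_list = []
for single_file in files:
    found_filename = filename_filter.search(single_file)
    # Keep each video title once; files that do not match are ignored
    if found_filename is not None and found_filename[1] not in unique_filenames_list:
        unique_filenames_list.append(found_filename[1])

print(unique_filenames_list)
# ['My video [abc123]', 'Another video [xyz789]']
```

Files that are neither metadata nor subtitle files (here `notes.txt`) do not match the pattern and are left out.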
Extract collected Youtube comments to XML format (CATLISM, 269-272)#
# Import modules for: regular expressions; reading timestamps as date objects; finding files through filename patterns;
# generating random numbers; reading JSONL files; working with XML files
import re
from datetime import datetime
from glob import glob
from random import randint
import jsonlines
from lxml import etree

# List all filenames with the .jsonl extension
files = glob("*.jsonl")

# For each file do:
for file in files:
    # Create the root <text> element tag
    text_tag = etree.Element("text")
    # Generate and assign a random number as attribute 'id' of the <text> element tag
    text_tag.attrib["id"] = str(randint(0, 100000))
    # Create the output filename using the original one and substituting the final '.jsonl' with '.xml'
    output_filename = re.sub(r"\.jsonl$", ".xml", file)
    # Read the file as a jsonlines one:
    with jsonlines.open(file) as comments:
        # For each line (i.e. metadata data-points for one comment) do:
        for comment in comments:
            # Create a <comment> element tag to enclose the comment
            comment_tag = etree.SubElement(text_tag, "comment")
            # Extract the comment id ('cid') and save it to a variable
            comment_id = str(comment["cid"])
            # Check if the 'cid' contains a full stop character. If so, the comment is a reply to another comment: take the string on
            # the right of the full stop and assign it as value of the attribute 'comment_id', then the string on the left (the id
            # of the comment being replied to) and assign it as value of the attribute 'comment_reply_to' to preserve the original
            # hierarchical structure
            reply_match = re.search(r"(.*?)\.(.*)", comment_id)
            if reply_match is not None:
                comment_tag.attrib["comment_id"] = reply_match.group(2)
                comment_tag.attrib["comment_reply_to"] = reply_match.group(1)
            # If there is no full stop character, assign the 'comment_id' as value of the <comment> attribute 'comment_id' and the
            # value 'na' to the 'comment_reply_to' attribute
            else:
                comment_tag.attrib["comment_id"] = comment_id
                comment_tag.attrib["comment_reply_to"] = "na"

            # Extract other metadata data-points and assign them to a set of attributes of the <comment> element tag
            comment_tag.attrib["username"] = str(comment["author"])
            comment_tag.attrib["votes"] = str(comment["votes"])
            comment_tag.attrib["heart"] = str(comment["heart"])
            # Read the Unix timestamp from the metadata data-point 'time_parsed' and convert it into a human-readable datetime object,
            # then store it into the 'comment_date' variable
            comment_date = datetime.fromtimestamp(comment["time_parsed"])
            # Format the time at which the message was posted into the format HHMM (hours and minutes)
            comment_date_time = comment_date.strftime("%H%M")
            # Assign the date elements to different metadata attributes
            comment_tag.attrib["date_d"] = str(comment_date.day)
            comment_tag.attrib["date_m"] = str(comment_date.month)
            comment_tag.attrib["date_y"] = str(comment_date.year)
            comment_tag.attrib["date_time"] = str(comment_date_time)

            # At last, write the content of the comment as the text value of the <comment> element tag
            comment_tag.text = str(comment["text"])

    # Write the extracted data formatted in XML to the final XML structure
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
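The handling of reply identifiers can be tried in isolation. The following is a minimal sketch: the helper name `split_comment_id` and the two `cid` values are hypothetical, while the regular expression and the order of the groups are the ones used in the script above.

```python
import re


def split_comment_id(cid):
    """Return the pair (comment_id, comment_reply_to) for a 'cid' value."""
    match = re.search(r"(.*?)\.(.*)", cid)
    if match is not None:
        # Reply: the part after the full stop identifies the reply itself,
        # the part before it identifies the comment being replied to
        return match.group(2), match.group(1)
    # Top-level comment: no full stop, hence no parent comment
    return cid, "na"


# Invented identifiers, used only for illustration
print(split_comment_id("UgxPARENT"))  # ('UgxPARENT', 'na')
print(split_comment_id("UgxPARENT.9REPLY"))  # ('9REPLY', 'UgxPARENT')
```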
Example of data extracted with [s5.13] (CATLISM, 264)#
<?xml version='1.0' encoding='UTF-8'?>
<text uploader="USERNAME" date="YYYYMMDD" views="NUMBER" likes="NUMBER" title="TITLE_OF_THE_VIDEO" format="FORMAT_NAME" autocaption="TRUE_OR_FALSE">
  <p time="NUMBER" is_ac="TRUE_OR_FALSE">SUBTITLES LINE</p>
</text>
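Files in this format can be read back for analysis. The following is a minimal sketch that uses the standard library's `xml.etree.ElementTree` (an assumption made to keep the example free of dependencies; the scripts on this page use lxml) on a small sample whose attribute values and subtitle lines are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample following the structure shown above
sample = """<text uploader="USERNAME" date="20230301" views="10" title="TITLE" likes="na" format="srv3" autocaption="true">
  <p time="9870" is_ac="true">first line</p>
  <p time="16110" is_ac="true">second line</p>
</text>"""

root = ET.fromstring(sample)
# Collect (start time in milliseconds, text) pairs for every subtitle line
lines = [(int(p.get("time")), p.text) for p in root.iter("p")]
# Join the subtitle lines into one running transcript
transcript = " ".join(text for _, text in lines)

print(root.get("title"), root.get("autocaption"))
print(lines)
print(transcript)
```

To read an actual output file, `ET.parse("FILE.xml").getroot()` can be used in place of `ET.fromstring(sample)`.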
Example of data extracted with [s5.14] (CATLISM, 264)#
<?xml version='1.0' encoding='UTF-8'?>
<text id="TEXT_ID">
  <comment comment_id="UNIQUE_ID" comment_reply_to="UNIQUE_ID_OF_THE_COMMENT_OR_NA" username="USERNAME" votes="NUMBER" heart="TRUE_OR_FALSE" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" date_time="NUMBER">TEXTUAL CONTENT OF THE COMMENT</comment>
</text>
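Because the `comment_reply_to` attribute preserves the thread structure, replies can be counted per parent comment. The sketch below does so with the standard library's `xml.etree.ElementTree` (an assumption; lxml would work similarly) on a sample whose identifiers and texts are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample following the structure shown above
sample = """<text id="123">
  <comment comment_id="A" comment_reply_to="na" username="u1" votes="3" heart="False" date_d="1" date_m="3" date_y="2023" date_time="0930">Top-level comment</comment>
  <comment comment_id="B" comment_reply_to="A" username="u2" votes="0" heart="False" date_d="1" date_m="3" date_y="2023" date_time="0945">A reply</comment>
  <comment comment_id="C" comment_reply_to="A" username="u3" votes="1" heart="False" date_d="2" date_m="3" date_y="2023" date_time="1210">Another reply</comment>
</text>"""

root = ET.fromstring(sample)
# Count how many replies each parent comment received; top-level comments
# (comment_reply_to == "na") are not replies and are therefore skipped
replies_per_comment = Counter(
    comment.get("comment_reply_to")
    for comment in root.iter("comment")
    if comment.get("comment_reply_to") != "na"
)

print(replies_per_comment)  # Counter({'A': 2})
```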
Example of subtitle formats#
TTML format (CATLISM, 252-253)#
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ttp:profile="http://www.w3.org/TR/profile/sdp-us">
  <head>
    <styling>
      <style xml:id="s1" tts:textAlign="center" tts:extent="90% 90%" tts:origin="5% 5%" tts:displayAlign="after" />
      <style xml:id="s2" tts:fontSize=".72c" tts:backgroundColor="black" tts:color="white" />
    </styling>
    <layout>
      <region xml:id="r1" style="s1" />
    </layout>
  </head>
  <body region="r1">
    <div>
      <p begin="00:00:09.870" end="00:00:16.110" style="s2">
        In light of recent events concerning newscasters
        <br />
        being lost in the fog of… memory.
      </p>
    </div>
  </body>
</tt>
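The elements of a TTML file live in the `http://www.w3.org/ns/ttml` namespace, which matters when the file is read with a namespace-aware parser. The following is a minimal sketch using the standard library's `xml.etree.ElementTree` (an assumption; the script on this page uses BeautifulSoup instead) on a shortened version of the sample above, with the XML declaration, the header and the styling attributes left out.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this sketch, mapped to the TTML default namespace
TTML_NS = {"tt": "http://www.w3.org/ns/ttml"}

# Shortened version of the sample above
sample = """<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:09.870" end="00:00:16.110">
        In light of recent events concerning newscasters
        <br />
        being lost in the fog of… memory.
      </p>
    </div>
  </body>
</tt>"""

root = ET.fromstring(sample)
cues = []
for p in root.iterfind(".//tt:p", TTML_NS):
    # Join the text found before and after each <br /> into a single line
    text = " ".join(part.strip() for part in p.itertext() if part.strip())
    cues.append((p.get("begin"), p.get("end"), text))

print(cues)
```

Each tuple holds the start time, the end time and the text of one subtitle paragraph, with the line break replaced by a single space.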