YouTube#

Warning

As of March 2023, the tool youtube-dl (release 2021.12.17) suggested in the book has a number of issues and therefore cannot be used to correctly download data from YouTube.
The commands below use yt-dlp, an alternative tool (forked from youtube-dl, with the issues fixed and additional options) whose basic usage is identical to that of youtube-dl. Any command included in the book that uses youtube-dl can be replicated with yt-dlp instead - as is done in the contents below.

Data can be collected from YouTube (and from more than 1,200 other platforms - the same ones supported by youtube-dl) using yt-dlp.
Options and arguments for the tool can be found in the official documentation.

Note

While yt-dlp supports the extraction of comments (youtube-dl does not have such an option), this page currently follows the contents of the book, where comments are downloaded using youtube-comment-downloader.
Future versions of the compendium will include options for downloading comments using yt-dlp.
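
For reference, yt-dlp can already retrieve comments on its own through its --write-comments option, which stores them inside the video's .info.json metadata file; a possible invocation (not covered in the book) would be:

yt-dlp 'URL' --write-comments --write-info-json --skip-download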


Installing the tools [CATLISM, 247]#

Command [c5.28] #
pip install yt-dlp
Command [c5.30]#
pip install youtube-comment-downloader
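
To verify that the installation was successful, yt-dlp can print its version from the terminal:

yt-dlp --version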

Using the tools [CATLISM, 247; 254; 263]#

Command [c5.29] #
yt-dlp 'URL' --write-info-json --skip-download --write-annotations --write-description
Command [c5.32] #
yt-dlp 'URL' --write-info-json --skip-download --write-subs --sub-langs LL --sub-format FORMAT
Command [c5.31] #
youtube-comment-downloader --url "URL" --output FILE.jsonl
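
As an illustration, the placeholders in [c5.32] may be filled with concrete values - for instance English subtitles (the same language code used by script [s5.13] below) in the TTML format:

yt-dlp 'URL' --write-info-json --skip-download --write-subs --sub-langs en --sub-format ttml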

Extracting the data#


Extract collected YouTube data (everything except comments) to XML format [CATLISM, 264-269]#

Script [s5.13] #
# Import modules for: working on local folders and files; regular expressions; finding files in a folder;
# reading JSON files; using BeautifulSoup; working with XML files
import os
import re
from glob import glob
import json
from bs4 import BeautifulSoup
from lxml import etree

# Manually set the 2-letter ISO 639-1 language code (e.g. English = en), so that only the subtitle files for the set
# language are processed
language_code = "en"

# Create a regular expression to capture the part of the filename that precedes youtube-dl's default extensions, where:
# [FILENAME].info.json = the JSON file containing the metadata details
# [FILENAME].[LL].srv3 = the XML file containing the subtitles in SRV3 format, where [LL] is the 2-letter ISO 639-1 language code
# [FILENAME].[LL].ttml = the XML file containing the subtitles in TTML format, where [LL] is the 2-letter ISO 639-1 language code
filename_filter = re.compile(
    r"(.*?)\.(info\.json|[A-Za-z]{1,3}\.srv3|[A-Za-z]{1,3}\.ttml)"
)
# Create an empty list to store all the video titles
unique_filenames_list = []
# List all filenames present in the folder where the script resides
files = glob("*.*")

# For every single filename found in the folder, do:
for single_file in files:
    # Search for the regular expression for capturing metadata and subtitle files in the filename, and store the result
    # in the 'found_filename' variable
    found_filename = re.search(filename_filter, single_file)
    # If the filename matches the regular expression, extract the filename without the extensions; then check if the cleaned
    # filename is present in the unique_filenames_list, and if not add it
    if found_filename is not None and found_filename[1] not in unique_filenames_list:
        unique_filenames_list.append(found_filename[1])


# Create the function to convert the time-format employed by TTML (HH:MM:SS.MS) into the one employed by SRV3
# (total number of milliseconds); adapted from
# https://stackoverflow.com/questions/59314323/how-to-convert-timestamp-into-milliseconds-in-python
def convert_timestamp(msf):
    hours, minutes, seconds = msf.split(":")
    seconds, milliseconds = seconds.split(".")
    hours, minutes, seconds, milliseconds = map(
        int, (hours, minutes, seconds, milliseconds)
    )
    return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds
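# For example (cf. the subtitle format examples [e5.14] and [e5.15] below), the TTML timestamp
# "00:00:09.870" is converted into 9870, the value found in the 't' attribute of the SRV3 format:
# convert_timestamp("00:00:09.870") == 9870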


# For each unique filename found do:
for filename in unique_filenames_list:
    # Recreate the full filenames with extensions, and store each one of them into a single variable
    json_file = filename + ".info.json"
    srv3_file = filename + "." + language_code + ".srv3"
    ttml_file = filename + "." + language_code + ".ttml"
    # Create the XML element <text>, root element of the final output
    text_tag = etree.Element("text")

    # Open the metadata JSON file:
    metadata_file = json.loads(open(json_file, encoding="utf-8").read())
    # Add a set of metadata points as attributes of the <text> element tag
    text_tag.attrib["uploader"] = metadata_file["uploader"]
    text_tag.attrib["date"] = metadata_file["upload_date"]
    text_tag.attrib["views"] = str(metadata_file["view_count"])
    text_tag.attrib["title"] = metadata_file["fulltitle"]
    # Check if the 'like_count' metadata point is present; if not, assign the value "na" to the 'likes' attribute
    text_tag.attrib["likes"] = str(
        metadata_file["like_count"] if "like_count" in metadata_file else "na"
    )

    # Check if the SRV3 file exists (priority is given to SRV3 over TTML due to the presence of autocaptioning details);
    # if so, do:
    if os.path.isfile(srv3_file):
        # Assign the attribute 'format' with a value of 'srv3' to the <text> element tag
        text_tag.attrib["format"] = "srv3"
        # Create the output filename using the input filename
        output_filename = srv3_file + ".xml"
        # Open the SRV3 file
        f = open(srv3_file, "r", encoding="utf-8")
        # Parse its XML contents using BeautifulSoup
        soup = BeautifulSoup(f, features="xml")
        # If the attribute 'ac' (= autocaption) with value '255' is found in an <s> element tag, the subtitles are the
        # result of autocaptioning; hence assign the value 'true' to the variable 'is_ac', otherwise assign 'false'
        if soup.body.find("s", attrs={"ac": "255"}):
            is_ac = "true"
        else:
            is_ac = "false"

        # Assign the value of 'is_ac' to the <text> element tag attribute 'autocaption'
        text_tag.attrib["autocaption"] = is_ac

        # For each paragraph (i.e. each line of the subtitles) in the file do:
        for sent in soup.body.find_all("p"):
            # Check if the textual content of the paragraph is longer than 1 character; this avoids adding empty
            # paragraphs to the final output
            if len(sent.get_text()) > 1:
                # Create a <p> element tag inside of the XML output
                p_tag = etree.SubElement(text_tag, "p")
                # Add the attribute 'time' (indicating the starting time of the paragraph) and assign it the value appearing in 't'
                p_tag.attrib["time"] = str(sent["t"])
                # Add the attribute 'is_ac' and assign it the value of the previously created variable 'is_ac'
                p_tag.attrib["is_ac"] = is_ac
                p_tag.text = sent.get_text()
            # If the paragraph contains no text (i.e. its length is 1 character or less), skip it
            else:
                continue

    # If the SRV3 file does not exist, check whether the TTML file does instead; if so, do (only the steps that differ
    # from the SRV3 procedure are commented):
    elif os.path.isfile(ttml_file):
        text_tag.attrib["format"] = "ttml"
        text_tag.attrib["autocaption"] = "na"
        output_filename = ttml_file + ".xml"
        f = open(ttml_file, "r", encoding="utf-8")
        soup = BeautifulSoup(f, features="xml")

        for sent in soup.body.find_all("p"):
            if len(sent.get_text()) > 1:
                p_tag = etree.SubElement(text_tag, "p")
                # Add the 'time' attribute, assigning it as value the starting time from the 'begin' attribute in the original file,
                # converted into milliseconds using the 'convert_timestamp' function
                p_tag.attrib["time"] = str(convert_timestamp(str(sent["begin"])))
                p_tag.attrib["is_ac"] = "na"
                p_tag.text = sent.get_text()
            else:
                continue
    # If neither the SRV3 nor the TTML file is found, print 'No subtitle files found.' and skip to the next video
    # (otherwise 'output_filename' would be undefined in the writing step below)
    else:
        print("No subtitle files found.")
        continue

    # Write the extracted data formatted in XML to the final XML structure
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )

How to use script [s5.13]#

Copy/download the file s5.13_youtube-dl_subs-to-XML.py into the folder where the data downloaded through yt-dlp (e.g. through [c5.29] and [c5.32]) resides; then navigate to that folder in the terminal, e.g.

cd Downloads/youtube_data/

Finally, run the script from the terminal:

python s5.13_youtube-dl_subs-to-XML.py

Extract collected YouTube comments to XML format [CATLISM, 269-272]#

Script [s5.14] #
# Import modules for: regular expressions; reading timestamps as date objects; listing files in a folder;
# generating random numbers; reading JSONL files; working with XML files
import re
from datetime import datetime
from glob import glob
from random import randint
import jsonlines
from lxml import etree

# List all filenames with the .jsonl extension
files = glob("*.jsonl")

# For each file do:
for file in files:
    # Create the root <text> element tag
    text_tag = etree.Element("text")
    # Generate and assign a random number as attribute 'id' of the <text> element tag
    text_tag.attrib["id"] = str(randint(0, 100000))
    # Create the output filename using the original one, removing the '.jsonl' extension and adding '.xml'
    output_filename = file.replace(".jsonl", "") + ".xml"
    # Read the file as a jsonlines one:
    with jsonlines.open(file) as comments:
        # For each line (i.e. metadata data-points for one comment) do:
        for comment in comments:
            # Create a <comment> element tag to enclose the comment
            comment_tag = etree.SubElement(text_tag, "comment")
            # Extract the comment id ('cid') and save it to a variable
            comment_id = str(comment["cid"])
            # Check if the 'cid' contains a full stop character. If so, the comment is a reply to another comment: take the
            # string on the right of the full stop and assign it as value of the attribute 'comment_id', then take the string
            # on the left and assign it as value of the attribute 'comment_reply_to', to preserve the original hierarchical structure
            if re.search(r"(.*?)\.(.*)", comment_id) is not None:
                comment_tag.attrib["comment_id"] = str(
                    re.search(r"(.*?)\.(.*)", comment_id).group(2)
                )
                comment_tag.attrib["comment_reply_to"] = str(
                    re.search(r"(.*?)\.(.*)", comment_id).group(1)
                )
            # If there is no full stop character, assign the 'comment_id' as value of the <comment> attribute 'comment_id' and the
            # value 'na' to the 'comment_reply_to' attribute
            else:
                comment_tag.attrib["comment_id"] = comment_id
                comment_tag.attrib["comment_reply_to"] = "na"
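            # For instance, a hypothetical reply whose 'cid' is "PARENTID.REPLYID" is stored with
            # comment_id="REPLYID" and comment_reply_to="PARENTID"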

            # Extract other metadata data-points and assign them to a set of attributes of the <comment> element tag
            comment_tag.attrib["username"] = str(comment["author"])
            comment_tag.attrib["votes"] = str(comment["votes"])
            comment_tag.attrib["heart"] = str(comment["heart"])
            # Read the Unix timestamp from the metadata data-point 'time_parsed' and convert it into a human-readable datetime object,
            # then store it into the 'comment_date' variable
            comment_date = datetime.fromtimestamp(comment["time_parsed"])
            # Format the time at which the message was posted into the format HHMM (hours and minutes)
            comment_date_time = comment_date.strftime("%H%M")
            # Assign the date elements to different metadata attributes
            comment_tag.attrib["date_d"] = str(comment_date.day)
            comment_tag.attrib["date_m"] = str(comment_date.month)
            comment_tag.attrib["date_y"] = str(comment_date.year)
            comment_tag.attrib["date_time"] = str(comment_date_time)

            # At last, write the content of the comment as the text value of the <comment> element tag
            comment_tag.text = str(comment["text"])

    # Write the extracted data formatted in XML to the final XML structure
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )

Example of data extracted with [s5.13] [CATLISM, 264]#

Example [e5.18]#
<?xml version='1.0' encoding='UTF-8'?>
<text uploader="USERNAME" date="YYYYMMDD" views="NUMBER" likes="NUMBER" title="TITLE_OF_THE_VIDEO" format="FORMAT_NAME" autocaption="TRUE_OR_FALSE">
  <p time="NUMBER" is_ac="TRUE_OR_FALSE">SUBTITLES LINE</p>
</text>
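
The XML files produced by [s5.13] can then be read back for further processing; below is a minimal sketch using lxml, where the filename VIDEO.en.srv3.xml is a hypothetical example following the script's output naming:

# Print each subtitle line of an extracted file together with its starting time
from lxml import etree

tree = etree.parse("VIDEO.en.srv3.xml")
for p in tree.getroot().findall("p"):
    print(p.get("time"), p.text)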

Example of data extracted with [s5.14] [CATLISM, 264]#

Example [e5.19]#
<?xml version='1.0' encoding='UTF-8'?>
<text id="TEXT_ID">
  <comment comment_id="UNIQUE_ID" comment_reply_to="UNIQUE_ID_OF_THE_COMMENT_OR_NA" username="USERNAME" votes="NUMBER" heart="TRUE_OR_FALSE" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" date_time="NUMBER">TEXTUAL CONTENT OF THE COMMENT</comment>
</text>

Example of subtitle formats#


TTML format [CATLISM, 252-253]#

Example [e5.14]#
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ttp:profile="http://www.w3.org/TR/profile/sdp-us">
    <head>
        <styling>
            <style xml:id="s1" tts:textAlign="center" tts:extent="90% 90%" tts:origin="5% 5%" tts:displayAlign="after" />
            <style xml:id="s2" tts:fontSize=".72c" tts:backgroundColor="black" tts:color="white" />
        </styling>
        <layout>
            <region xml:id="r1" style="s1" />
        </layout>
    </head>
    <body region="r1">
        <div>
            <p begin="00:00:09.870" end="00:00:16.110" style="s2">
                In light of recent events concerning newscasters
                <br />
                being lost in the fog of… memory.
            </p>
        </div>
    </body>
</tt>

SRV format without auto-captioning [CATLISM, 253]#

Example [e5.15]#
<?xml version="1.0" encoding="utf-8"?>
<timedtext format="3">
    <body>
        <p t="9870" d="6240">In light of recent events concerning newscasters
being lost in the fog of… memory.</p>
        <p t="16110" d="4340">It may be pertinent to ask:
can we trust the news media?</p>
    </body>
</timedtext>

SRV format with auto-captioning [CATLISM, 253-254]#

Example [e5.16]#
<?xml version="1.0" encoding="utf-8"?>
<timedtext format="3">
    <head>
        <ws id="0" />
        <ws id="1" mh="2" ju="0" sd="3" />
        <wp id="0" />
        <wp id="1" ap="6" ah="20" av="100" rc="2" cc="40" />
    </head>
    <body>
        <w t="0" d="957600" id="1" wp="1" ws="1" />
        <p t="1040" d="3360" w="1">
            <s ac="255">welcome</s>
            <s t="320" ac="255"> to</s>
            <s t="560" ac="255"> super</s>
            <s t="799" ac="255"> mario</s>
            <s t="1040" ac="255"> bros</s>
            <s t="1280" ac="255"> chaos</s>
        </p>
        <p t="2710" d="1690" w="1" a="1"></p>
        <p t="2720" d="2960" w="1">
            <s ac="255">edition</s>
            <s t="320" ac="255"> enjoy</s>
            <s t="719" ac="255"> the</s>
            <s t="800" ac="255"> peaceful</s>
            <s t="1199" ac="255"> music</s>
            <s t="1520" ac="255"> while</s>
        </p>
    </body>
</timedtext>
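
The ac="255" attribute on the <s> element tags above is what script [s5.13] checks to decide whether a transcript was auto-generated; a minimal standalone version of that check (assuming the example above is saved as a hypothetical file named autocaptions.srv3) could read:

# Report whether an SRV3 file contains auto-captioned subtitles, i.e. at least one <s> element with ac="255"
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("autocaptions.srv3", encoding="utf-8"), features="xml")
print(soup.body.find("s", attrs={"ac": "255"}) is not None)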