Youtube
Warning
As of March 2023, the tool youtube-dl (release 2021.12.17) suggested in the book has a number of issues and therefore cannot be used to correctly download data from Youtube. The commands below make use of yt-dlp, an alternative tool (forked from youtube-dl, with fixed issues and additional options) whose basic usage is identical to youtube-dl. Any command included in the book that uses youtube-dl can be replicated with yt-dlp instead, as done in the contents below.
Data can be collected from Youtube (and from more than 1,200 other platforms - the same ones supported by youtube-dl) using yt-dlp.
Options and arguments for the tool can be found in the official documentation.
Note
While yt-dlp supports the extraction of comments (youtube-dl does not have such an option), this page currently follows the contents of the book, where comments are downloaded using youtube-comment-downloader. Future versions of the compendium will include options for downloading comments using yt-dlp.
Installing the tools [CATLISM, 247]
pip install yt-dlp
[c5.28]
pip install youtube-comment-downloader
[c5.30]
Using the tools [CATLISM, 247; 254; 263]
yt-dlp 'URL' --write-info-json --skip-download --write-annotations --write-description
yt-dlp 'URL' --write-info-json --skip-download --write-subs --sub-langs LL --sub-format FORMAT
youtube-comment-downloader --url "URL" --output FILE.jsonl
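The same collection can also be scripted from Python through yt-dlp's embedding interface. The sketch below is a hedged translation of the command-line flags above into `YoutubeDL` option keys; the key names are assumptions based on yt-dlp's embedding documentation and should be checked there before use. The dict would be passed to `yt_dlp.YoutubeDL(options).download(["URL"])`.

```python
# Hypothetical translation of the CLI flags above into yt-dlp Python option keys;
# requires 'pip install yt-dlp' and is intended for yt_dlp.YoutubeDL(options)
options = {
    "skip_download": True,      # --skip-download
    "writeinfojson": True,      # --write-info-json
    "writesubtitles": True,     # --write-subs
    "subtitleslangs": ["en"],   # --sub-langs LL
    "subtitlesformat": "ttml",  # --sub-format FORMAT
}
```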
Extracting the data
Extract collected Youtube data (everything except comments) to XML format [CATLISM, 264-269]
# Import modules for: working on local folders and files; regular expressions; finding files in folder;
# reading JSON files; using BeautifulSoup; working with XML files
import os
import re
from glob import glob
import json
from bs4 import BeautifulSoup
from lxml import etree

# Manually set the 2-letter ISO 639-1 language code (e.g. English = en), so that only the subtitle files for
# the set language are processed
language_code = "en"

# Create a regular expression to capture the title of the video preceding youtube-dl default naming conventions, where:
# [FILENAME].info.json = the JSON file containing the metadata details
# [FILENAME].[LL].srv3 = the XML file containing the subtitles in SRV3 format, where [LL] is the 2-letter ISO 639-1 language code
# [FILENAME].[LL].ttml = the XML file containing the subtitles in TTML format, where [LL] is the 2-letter ISO 639-1 language code
filename_filter = re.compile(
    r"(.*?)\.(info\.json|[A-Za-z]{1,3}\.srv3|[A-Za-z]{1,3}\.ttml)"
)
# Create an empty list to store all the video titles
unique_filenames_list = []
# List all filenames present in the folder where the script resides
files = glob("*.*")

# For every single filename found in the folder, do:
for single_file in files:
    # Search the filename for the regular expression capturing metadata and subtitle files, and store the
    # result in the 'found_filename' variable
    found_filename = re.search(filename_filter, single_file)
    # If the filename matches the regular expression, extract the filename without the extensions; then check
    # if the cleaned filename is already present in unique_filenames_list, and if not add it
    if found_filename is not None and found_filename[1] not in unique_filenames_list:
        unique_filenames_list.append(found_filename[1])


# Create the function to convert the time format employed by TTML (HH:MM:SS.MS) into the one employed by SRV3
# (total number of milliseconds); adapted from
# https://stackoverflow.com/questions/59314323/how-to-convert-timestamp-into-milliseconds-in-python
def convert_timestamp(msf):
    hours, minutes, seconds = msf.split(":")
    seconds, milliseconds = seconds.split(".")
    hours, minutes, seconds, milliseconds = map(
        int, (hours, minutes, seconds, milliseconds)
    )
    return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds


# For each unique filename found do:
for filename in unique_filenames_list:
    # Recreate the full filenames with extensions, and store each one of them in a variable
    json_file = filename + ".info.json"
    srv3_file = filename + "." + language_code + ".srv3"
    ttml_file = filename + "." + language_code + ".ttml"
    # Create the XML element <text>, root element of the final output
    text_tag = etree.Element("text")

    # Open the metadata JSON file:
    metadata_file = json.loads(open(json_file, encoding="utf-8").read())
    # Add a set of metadata points as attributes of the <text> element tag
    text_tag.attrib["uploader"] = metadata_file["uploader"]
    text_tag.attrib["date"] = metadata_file["upload_date"]
    text_tag.attrib["views"] = str(metadata_file["view_count"])
    text_tag.attrib["title"] = metadata_file["fulltitle"]
    # Check if the 'like_count' metadata point is present; if not, assign the value "na" to the 'likes' attribute
    text_tag.attrib["likes"] = str(
        metadata_file["like_count"] if "like_count" in metadata_file else "na"
    )

    # Check if the SRV3 file exists (priority is given to SRV3 over TTML due to the presence of
    # autocaptioning details); if so, do:
    if os.path.isfile(srv3_file):
        # Assign the attribute 'format' with a value of 'srv3' to the <text> element tag
        text_tag.attrib["format"] = "srv3"
        # Create the output filename using the input filename
        output_filename = srv3_file + ".xml"
        # Open the SRV3 file
        f = open(srv3_file, "r", encoding="utf-8")
        # Parse its XML contents using BeautifulSoup
        soup = BeautifulSoup(f, features="xml")
        # If the attribute 'ac' (= autocaption) with value '255' is found in an <s> element tag, the
        # subtitles are the result of autocaptioning; hence assign the value 'true' to the variable 'is_ac'.
        # Otherwise assign the value 'false' to 'is_ac'
        if soup.body.find("s", attrs={"ac": "255"}):
            is_ac = "true"
        else:
            is_ac = "false"

        # Assign the value of 'is_ac' to the <text> element tag attribute 'autocaption'
        text_tag.attrib["autocaption"] = is_ac

        # For each paragraph (i.e. each line of the subtitles) in the file do:
        for sent in soup.body.find_all("p"):
            # Check if the textual content of the paragraph is longer than 1 character; this avoids adding
            # empty paragraphs to the final output
            if len(sent.get_text()) > 1:
                # Create a <p> element tag inside of the XML output
                p_tag = etree.SubElement(text_tag, "p")
                # Add the attribute 'time' (indicating the starting time of the paragraph) and assign it the
                # value appearing in 't'
                p_tag.attrib["time"] = str(sent["t"])
                # Add the attribute 'is_ac' and assign it the value of the previously created variable 'is_ac'
                p_tag.attrib["is_ac"] = is_ac
                p_tag.text = sent.get_text()
            # If the paragraph does not contain any text (i.e. its length is < 1), skip it
            else:
                continue

    # If the SRV3 file does not exist, check if the TTML file does instead; then do (only the steps that
    # differ from the SRV3 procedure are commented):
    elif os.path.isfile(ttml_file):
        text_tag.attrib["format"] = "ttml"
        text_tag.attrib["autocaption"] = "na"
        output_filename = ttml_file + ".xml"
        f = open(ttml_file, "r", encoding="utf-8")
        soup = BeautifulSoup(f, features="xml")

        for sent in soup.body.find_all("p"):
            if len(sent.get_text()) > 1:
                p_tag = etree.SubElement(text_tag, "p")
                # Add the 'time' attribute, assigning it as value the starting time from the 'begin'
                # attribute in the original file, converted into milliseconds using the
                # 'convert_timestamp' function
                p_tag.attrib["time"] = str(convert_timestamp(str(sent["begin"])))
                p_tag.attrib["is_ac"] = "na"
                p_tag.text = sent.get_text()
            else:
                continue
    # If neither the SRV3 nor the TTML file is found, print 'No subtitle files found.' and skip to the next
    # filename, since there is no subtitle data to write out (and 'output_filename' would be undefined)
    else:
        print("No subtitle files found.")
        continue

    # Build the final XML structure from the extracted data
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
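The TTML-to-SRV3 timestamp conversion at the heart of the script can be checked in isolation. The snippet below reproduces the `convert_timestamp` function from the script above and runs it on two sample timestamps:

```python
def convert_timestamp(msf):
    """Convert a TTML timestamp (HH:MM:SS.MS) into total milliseconds,
    the unit used by the 't' attribute of SRV3 subtitle files."""
    hours, minutes, seconds = msf.split(":")
    seconds, milliseconds = seconds.split(".")
    hours, minutes, seconds, milliseconds = map(
        int, (hours, minutes, seconds, milliseconds)
    )
    return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds

print(convert_timestamp("00:00:09.870"))  # → 9870
print(convert_timestamp("01:02:03.045"))  # → 3723045
```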
How to use script [s5.13]
Copy/download the file s5.13_youtube-dl_subs-to-XML.py inside the folder where the data downloaded through yt-dlp (e.g. through c5.29 and c5.32) resides; then browse inside the folder through the terminal, e.g.
cd Downloads/youtube_data/
Finally, run the script from the terminal:
python s5.13_youtube-dl_subs-to-XML.py
Extract collected Youtube comments to XML format [CATLISM, 269-272]
# Import modules for: regular expressions; reading timestamps as date objects; loading files using regular
# expressions; generating random numbers; reading JSONL files; working with XML files
import re
from datetime import datetime
from glob import glob
from random import randint
import jsonlines
from lxml import etree

# List all filenames with the .jsonl extension
files = glob("*.jsonl")

# For each file do:
for file in files:
    # Create the root <text> element tag
    text_tag = etree.Element("text")
    # Generate and assign a random number as attribute 'id' of the <text> element tag
    text_tag.attrib["id"] = str(randint(0, 100000))
    # Create the output filename by replacing the original '.jsonl' extension with '.xml'
    output_filename = file.replace(".jsonl", "") + ".xml"
    # Read the file as a jsonlines one:
    with jsonlines.open(file) as comments:
        # For each line (i.e. the metadata data-points for one comment) do:
        for comment in comments:
            # Create a <comment> element tag to enclose the comment
            comment_tag = etree.SubElement(text_tag, "comment")
            # Extract the comment id ('cid') and save it to a variable
            comment_id = str(comment["cid"])
            # Check if the 'cid' contains a full stop character. If so, the comment is a reply to another
            # comment: take the string on the right of the full stop and assign it as value of the attribute
            # 'comment_id', then take the string on the left and assign it as value of the attribute
            # 'comment_reply_to' to preserve the original hierarchical structure
            if re.search(r"(.*?)\.(.*)", comment_id) is not None:
                comment_tag.attrib["comment_id"] = str(
                    re.search(r"(.*?)\.(.*)", comment_id).group(2)
                )
                comment_tag.attrib["comment_reply_to"] = str(
                    re.search(r"(.*?)\.(.*)", comment_id).group(1)
                )
            # If there is no full stop character, assign the 'comment_id' as value of the <comment> attribute
            # 'comment_id' and the value 'na' to the 'comment_reply_to' attribute
            else:
                comment_tag.attrib["comment_id"] = comment_id
                comment_tag.attrib["comment_reply_to"] = "na"

            # Extract other metadata data-points and assign them to a set of attributes of the <comment> element tag
            comment_tag.attrib["username"] = str(comment["author"])
            comment_tag.attrib["votes"] = str(comment["votes"])
            comment_tag.attrib["heart"] = str(comment["heart"])
            # Read the Unix timestamp from the metadata data-point 'time_parsed' and convert it into a
            # human-readable datetime object, then store it in the 'comment_date' variable
            comment_date = datetime.fromtimestamp(comment["time_parsed"])
            # Format the time at which the message was posted into the format HHMM (hours and minutes)
            comment_date_time = comment_date.strftime("%H%M")
            # Assign the date elements to different metadata attributes
            comment_tag.attrib["date_d"] = str(comment_date.day)
            comment_tag.attrib["date_m"] = str(comment_date.month)
            comment_tag.attrib["date_y"] = str(comment_date.year)
            comment_tag.attrib["date_time"] = str(comment_date_time)

            # At last, write the content of the comment as the text value of the <comment> element tag
            comment_tag.text = str(comment["text"])

    # Build the final XML structure from the extracted data
    tree = etree.ElementTree(text_tag)
    # Write the XML to the output file
    tree.write(
        output_filename, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
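The hierarchical 'PARENT.REPLY' structure of the cid field can be tested on its own. The sketch below isolates the splitting logic used above; the identifier values in the example are made up for illustration:

```python
import re

def split_cid(comment_id):
    """Split a youtube-comment-downloader 'cid' value into (comment_id, comment_reply_to).
    Replies have the form 'PARENT.REPLY'; top-level comments contain no full stop."""
    match = re.search(r"(.*?)\.(.*)", comment_id)
    if match is not None:
        # Right side is the reply's own id, left side is the parent comment's id
        return match.group(2), match.group(1)
    return comment_id, "na"

print(split_cid("UgwParent.UgxReply"))  # → ('UgxReply', 'UgwParent')
print(split_cid("UgwParent"))           # → ('UgwParent', 'na')
```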
Example of data extracted with [s5.13] [CATLISM, 264]
<?xml version='1.0' encoding='UTF-8'?>
<text uploader="USERNAME" date="YYYYMMDD" views="NUMBER" likes="NUMBER" title="TITLE_OF_THE_VIDEO" format="FORMAT_NAME" autocaption="TRUE_OR_FALSE">
  <p time="NUMBER" is_ac="TRUE_OR_FALSE">SUBTITLES LINE</p>
</text>
Example of data extracted with [s5.14] [CATLISM, 264]
<?xml version='1.0' encoding='UTF-8'?>
<text id="TEXT_ID">
  <comment comment_id="UNIQUE_ID" comment_reply_to="UNIQUE_ID_OF_THE_COMMENT_OR_NA" username="USERNAME" votes="NUMBER" heart="TRUE_OR_FALSE" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" date_time="NUMBER">TEXTUAL CONTENT OF THE COMMENT</comment>
</text>
Example of subtitle formats
TTML format [CATLISM, 252-253]
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ttp:profile="http://www.w3.org/TR/profile/sdp-us">
  <head>
    <styling>
      <style xml:id="s1" tts:textAlign="center" tts:extent="90% 90%" tts:origin="5% 5%" tts:displayAlign="after" />
      <style xml:id="s2" tts:fontSize=".72c" tts:backgroundColor="black" tts:color="white" />
    </styling>
    <layout>
      <region xml:id="r1" style="s1" />
    </layout>
  </head>
  <body region="r1">
    <div>
      <p begin="00:00:09.870" end="00:00:16.110" style="s2">
        In light of recent events concerning newscasters
        <br />
        being lost in the fog of… memory.
      </p>
    </div>
  </body>
</tt>
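TTML is namespaced XML, which is one reason the extraction script parses it with BeautifulSoup in xml mode; with only the standard library, the TTML namespace must be spelled out explicitly when looking up elements. A minimal sketch, run on a trimmed-down version of the sample above:

```python
import xml.etree.ElementTree as ET

# A reduced TTML document (bytes, so the encoding declaration is accepted)
TTML = b"""<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="00:00:09.870" end="00:00:16.110">In light of recent events concerning newscasters</p>
    </div>
  </body>
</tt>"""

root = ET.fromstring(TTML)
# Element tags live in the TTML namespace, so they must be fully qualified
for p in root.iter("{http://www.w3.org/ns/ttml}p"):
    print(p.get("begin"), "|", p.text)
# → 00:00:09.870 | In light of recent events concerning newscasters
```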