Corpus Approaches to Language in Social Media | Online Compendium

A living technical companion for digital textual data, corpus workflows, and social media research

Cover for the book Corpus Approaches to Language in Social Media

Note

You are browsing version v1.0.1

This website was originally created as the online compendium for the book Corpus Approaches to Language in Social Media (CATLISM; Di Cristofaro 2023), published in the series Routledge Advances in Corpus Linguistics.¹

¹ A preview of the book (Chapter 1) is available from taylorfrancis.com.

It now also serves as a living technical companion space for research on the collection, processing, formatting, preservation, and analysis of digital textual data for corpus purposes. Its primary focus remains social media data, broadly understood as “any digital content [that] provides the user with the ability to interact with it […] through a unique uniform resource locator (URL)”.² The scope of the website nonetheless extends to related forms of digitally mediated textual data, platform-specific constraints, technical workflows, and methodological issues that emerge when language data are produced, accessed, transformed, and analysed in digital environments.

² CATLISM, 4.

The central concern of both the book and this compendium is the role of digital technicalities in corpus-based research. These include the infrastructural, procedural, and platform-specific mechanisms that are not always treated as linguistic objects in themselves, but that shape what can be collected, how it can be processed, and what kinds of linguistic evidence can ultimately be produced. The aim is therefore to support

a broad view of corpus approaches able to include those notions and mechanisms [‘digital technicalities’] that – while not classically associated with natural language – are […] i) foundational of the digital environments in which language production and exchanges occur and ii) at the core of the techniques that are used to produce, collect, and process the focus of investigation, that is, digital textual data.³

³ CATLISM, 2-4.

For this reason, the website should be read not only as a companion to CATLISM, but also as a working archive of scripts, procedures, updates, supplementary materials, and technical notes connected to later research outputs that continue to address the methodological consequences of digital textuality.

As such, this online compendium contains:

the scripts included in the volume,⁴ downloadable and formatted using colour-coded syntax highlighting, aimed at collecting and processing data from webpages, blogs, fora, Facebook, Instagram, Twitter, and YouTube;

⁴ Version 1.0.0 reflects the contents, scripts and code snippets as they appear in the printed book; subsequent versions may contain modifications and updates. Consult the changelog for a list of changes.

interactive videos documenting the use of the commands and tools employed throughout the volume;

further scripts, examples, and instructions for tools aimed at collecting, processing, and formatting digital data from platforms and sources that, for reasons of space or timing, could not be fully covered in the volume;

updates to scripts, commands, dependencies, and workflows when technical changes make earlier procedures ineffective, deprecated, or partially outdated;

technical notes on platform changes, data access, formatting issues, preservation strategies, and other digital technicalities relevant to corpus-based research;

supplementary materials connected to research outputs published after CATLISM, including additional scripts, datasets, examples, documentation, and methodological notes;

updates to topics discussed in the book, especially where subsequent technical, methodological, or platform-related developments affect the original discussion;

links to preservation copies of online materials referenced in the volume, where available, as archived through The Wayback Machine.

Where possible, and unless stated otherwise (e.g. in the case of quotations), all textual contents are published under Creative Commons CC BY-NC 4.0, while all scripts are licensed under the open-source GPLv3 licence. See the FAQs for more details on how to (re)use the materials.

Important

Descriptions and further details for scripts and code originally available in the book are left out of this compendium. Scripts and code exclusive to this online compendium are fully described and detailed in each relevant page/section.
A number of answers to common questions are included in the FAQs section.

How to use this online compendium

Consult the Using the online compendium section for more details on how to use this website, as well as a legend of the symbols used throughout the pages.
