Corpus Approaches to Language in Social Media | Online Compendium
Note
This website serves as the online compendium for the book Corpus Approaches to Language in Social Media (CATLISM; Di Cristofaro 2023), published in the series Routledge Advances in Corpus Linguistics. A preview of the book (Chapter 1) is available from taylorfrancis.com.
The focus of both this compendium and the book is the collection, processing, and formatting of digital data from social media (understood as “any digital content [that] provides the user with the ability to interact with it […] through a unique uniform resource locator (URL)”; CATLISM, 4) for corpus purposes (see also On scripts and tools).
The aim is “proposing a broad view of corpus approaches able to include those notions and mechanisms [‘digital technicalities’] that – while not classically associated with natural language – are […] i) foundational of the digital environments in which language production and exchanges occur and ii) at the core of the techniques that are used to produce, collect, and process the focus of investigation, that is, digital textual data” (CATLISM, 2-4).
As such, this online compendium contains:
- the scripts included in the volume, downloadable and formatted using colour-coded syntax highlighting, aimed at collecting and processing data from webpages, blogs, fora, Facebook, Instagram, Twitter, and YouTube (Version 1.0.0 reflects the contents, scripts and code snippets as they appear in the printed book; subsequent versions may contain modifications and updates. Consult the changelog for a list of changes);
- interactive videos documenting the use of the commands and tools employed throughout the volume;
- further scripts and instructions for tools aimed at collecting data from platforms that, for reasons of space, could not be included in the volume;
- updates to scripts and commands in case they become ineffective or outdated due to technical changes;
- updates to topics discussed in the book;
- links to preservation copies of all the online materials referenced in the volume, as archived through The Wayback Machine.
Where possible and unless stated otherwise (e.g. in the case of quotations), all the textual contents are published under Creative Commons CC BY-NC 4.0, while all the scripts are licensed under the open-source GPLv3 licence; see the FAQs for more details on how to (re)use the materials.
Important
Descriptions and further details for scripts and code originally available in the book are left out of this compendium. Scripts and code exclusive to this online compendium are fully described and detailed in each relevant page/section.
A number of answers to common questions are included in the FAQs section.
How to use this online compendium
Consult the Using the online compendium section for more details on how to use this website, as well as a legend of the symbols used throughout the pages.
Structure of the online compendium
Contents
- On scripts and tools
- Using the online compendium
- From the book
- Setting up the working environment
- Metadata evaluation
- Data collection
- General purpose scrapers
- #LancsBox
- Archivebox
- Trafilatura
- BeautifulSoup
- Extracting the data
- Extract links from HTML pages
- Download HTML pages
- Use requests in script [s5.02a]
- Extract metadata from the downloaded HTML pages
- Download PDF files linked in HTML pages
- Extract the contents of PDF files as plain-text
- Create an XML corpus combining the metadata from HTML pages and the contents of PDF files
- Basic structure of the metadata table included in MoreThesis pages
- Social Media Platforms
- Data processing
- Date, time, and Unix
- Text normalisation
- PDF, Word, images
- Language detection
- Emoticons and emojis
- Hashtags (word segmentation)
- Other elements
- Regular expression to capture usernames (e.g. @matteodic)
- Regular expression to capture simple URLs (e.g. http://example.com and https://example.com)
- Regular expression to capture complex URLs (e.g. simple URLs plus email addresses, mailto: links, URLs with optional parameters)
- Regular expression to capture cashtags (e.g. $EUR)
- Annotations
- Verticalised (.vrt) format
- Data exploration
- Data preservation
- Wayback Machine
- Git 101: the basics
- Initialise git in a local folder
- Clone a remote repository
- Add all changes (even from previously untracked files) to the local git database (i.e. stage the changes)
- Record (commit) all changes, along with a textual description of what has been changed
- Send (push) all changes to the remote repository
- Obtain (fetch) all changes from the remote repository
- Include/apply (merge) all changes from the remote repository to the local repository
- Obtain and include/apply (pull) all changes from the remote repository to the local repository
- Case-studies: CATLISM practical applications
- Analysing crypto-drug market fora
- Analysing the language of far-right groups on Twitter and Facebook
- The communicative modus operandi of online child sexual groomers
- Digital technicalities: a reading list
- FAQs
- How do I cite your work?
- Can I use/modify script X for my own purposes/publication?
- A script/command is not working, can you help?
- I have found an error in a script, can you fix it?
- I have found an error in a webpage, can you fix it?
- I have found an error in the book, can you add it to the Errata section?
- I have feedback/suggestions, how can I share it/them with you?
- Can I download the online compendium for local/offline use?
- Can I download all the scripts (and only the scripts) at once?
- The website looks too dark!
- Acknowledgments
- Changelog
- References
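
As a rough illustration of the kind of patterns listed under "Other elements" in the contents above, the following Python sketch captures usernames, simple URLs, and cashtags from a string. The patterns and the helper name `extract_elements` are assumptions written for this example only; they are not the regular expressions published in the book or in the relevant compendium pages, which should be consulted for the actual versions.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only; they are NOT the expressions
# published in the book or in this compendium.

# "@" not preceded by a word character (so email addresses are skipped),
# followed by one or more letters, digits, or underscores.
USERNAME = re.compile(r"(?<!\w)@(\w+)")

# "http://" or "https://" followed by anything up to whitespace or a quote.
SIMPLE_URL = re.compile(r"https?://[^\s<>\"']+")

# "$" not preceded by a word character, followed by 1 to 6 letters.
CASHTAG = re.compile(r"(?<!\w)\$([A-Za-z]{1,6})\b")


def extract_elements(text: str) -> Dict[str, List[str]]:
    """Return the usernames, simple URLs, and cashtags found in text."""
    return {
        "usernames": USERNAME.findall(text),
        "urls": SIMPLE_URL.findall(text),
        "cashtags": CASHTAG.findall(text),
    }


if __name__ == "__main__":
    sample = "Thanks @matteodic, see https://example.com for $EUR rates"
    print(extract_elements(sample))
```

The simple URL pattern is deliberately loose: it would also swallow punctuation that immediately follows a link (for instance a full stop at the end of a sentence), and it ignores email addresses and `mailto:` links, which fall under the "complex URLs" entry in the contents.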