General-purpose scrapers
- #LancsBox
- Archivebox
- Trafilatura
- BeautifulSoup
- Extracting the data
- Extract links from HTML pages
- Download HTML pages
- Use requests in script [s5.02a]
- Extract metadata from the downloaded HTML pages
- Download PDF files linked in HTML pages
- Extract the contents of PDF files as plain-text
- Create an XML corpus combining the metadata from HTML pages and the contents of PDF files
- Basic structure of the metadata table included in MoreThesis pages
- Extracting the data
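
The steps listed above (extract links, read the metadata table, locate the linked PDF files, build an XML corpus) can be sketched roughly as follows. This is a minimal illustration, not the scripts the outline refers to. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies. The sample page, the two-column label/value table layout, the `example.org` URL, and the names `LinkAndTableParser`, `parse_page` and `build_entry` are all assumptions made for the example. Downloading pages with `requests` and extracting text from the PDF files are left out so the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkAndTableParser(HTMLParser):
    """Collect link targets and two-cell table rows from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # absolute URLs found in <a href="...">
        self.rows = []         # (label, value) pairs from two-cell rows
        self._cells = None     # cells of the current <tr>, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))
        elif tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 2:
                label, value = self._cells
                self.rows.append((label.strip(), value.strip()))
            self._cells = None

    def handle_data(self, data):
        if self._in_cell and self._cells:
            self._cells[-1] += data


def parse_page(html, base_url):
    """Return (metadata dict, list of PDF URLs) for one page."""
    parser = LinkAndTableParser(base_url)
    parser.feed(html)
    parser.close()
    pdf_links = [u for u in parser.links if u.lower().endswith(".pdf")]
    return dict(parser.rows), pdf_links


def build_entry(metadata, pdf_text):
    """Wrap the PDF text in a <doc> element carrying metadata as attributes."""
    doc = ET.Element("doc")
    for key, value in metadata.items():
        doc.set(key.lower().replace(" ", "_"), value)
    doc.text = pdf_text
    return doc


# Hypothetical page: the real metadata table may be structured differently.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Author</th><td>Rossi, Maria</td></tr>
  <tr><th>Title</th><td>An example thesis</td></tr>
</table>
<a href="files/thesis.pdf">Full text</a>
<a href="/about.html">About</a>
</body></html>
"""

metadata, pdf_links = parse_page(SAMPLE_HTML, "https://example.org/theses/1/")
corpus = ET.Element("corpus")
corpus.append(build_entry(metadata, "Plain text extracted from the PDF would go here."))
xml_string = ET.tostring(corpus, encoding="unicode")
print(xml_string)
```

In a real run, the HTML string would come from a downloaded page (for instance the `text` attribute of the response returned by `requests.get`), and the placeholder text would be replaced by the output of whichever PDF-to-text tool is used.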