PDF, Word, images#
CATLISM, 278-279
Extract text from PDF and other multimedia formats#
textract is a Python module and command-line tool that acts as a “single interface for extracting content from any type of file, without any irrelevant markup” (https://textract.readthedocs.io/en/stable/). As such, it can be used to process a variety of formats containing some form of language, including images and audio files; consult the list of all supported formats. Processing images and audio files requires installing additional modules. Options and further arguments are described in the official documentation.
Installing the tool#
pip install textract
[c5.33]
Using the tool#
Usage as a Python module is exemplified in [s5.05]; CLI usage is exemplified below.
textract filename.extension -o output.extension
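For comparison with the command above, the following is a minimal sketch of a roughly equivalent call from Python. It is not the script referenced as [s5.05]. It assumes that `textract.process()` returns the extracted text as UTF-8 encoded bytes (the library's documented default). The helper names `output_path_for` and `extract_to_file`, and the convention of writing a `.txt` file next to the source, are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def output_path_for(source: Path) -> Path:
    """Return the path of the .txt file to write next to the source file.

    Hypothetical naming convention: same folder, same stem, .txt suffix.
    """
    return source.with_suffix(".txt")


def extract_to_file(source: Path, encoding: str = "utf-8") -> Path:
    """Extract the text of `source` with textract and save it as plain text."""
    # Imported here so that the helper above can be used without textract installed.
    import textract

    # textract.process() picks a parser from the file extension and returns bytes.
    raw = textract.process(str(source))

    target = output_path_for(source)
    target.write_text(raw.decode(encoding), encoding=encoding)
    return target
```

Calling `extract_to_file(Path("filename.pdf"))` would then write `filename.txt` in the same folder, provided that textract and the parser required for that format are installed.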