PDF, Word, images

Extract text from PDF and other multimedia formats ¹

¹ CATLISM, 278-279

textract is a Python module and command-line tool that acts as a “single interface for extracting content from any type of file, without any irrelevant markup” ². It can therefore be used to process a variety of formats that contain some form of language, including images and audio files (consult the list of all supported formats). Processing images and audio files requires the installation of additional modules. Options and further arguments are described in the official documentation.

² https://textract.readthedocs.io/en/stable/

Installing the tool

Command [c5.33]

pip install textract

Using the tool

Usage as a Python module is exemplified in [s5.05]; CLI usage is exemplified below.

Command [c5.34]

textract filename.extension -o output.extension
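
When a whole folder of files needs converting, the command above can be repeated for each file from a short Python script. The following is a minimal sketch, not part of the original examples: the function names `build_command` and `convert_folder`, the default `*.pdf` pattern, and the choice of `.txt` as output extension are all illustrative assumptions. It relies only on the `textract filename.extension -o output.extension` form shown in [c5.34] and assumes the `textract` executable is available on the system path.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, output_dir: Path) -> list[str]:
    """Build the textract CLI call for one file (hypothetical helper).

    The output file keeps the source file's name, with a .txt extension.
    """
    target = output_dir / (source.stem + ".txt")
    # Same form as command [c5.34]: textract <input> -o <output>
    return ["textract", str(source), "-o", str(target)]


def convert_folder(input_dir: Path, output_dir: Path, pattern: str = "*.pdf") -> list[Path]:
    """Run textract on every file matching `pattern` (hypothetical helper).

    Returns the list of files for which the command exited with an error.
    Raises FileNotFoundError if the textract executable cannot be found.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in sorted(input_dir.glob(pattern)):
        result = subprocess.run(
            build_command(source, output_dir),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Calling `convert_folder(Path("documents"), Path("extracted"))`, where both folder names are placeholders, would write one `.txt` file per PDF into `extracted` and return the paths of any files that could not be processed. Because output names are derived from the file stem alone, files that share a name but differ in extension would overwrite each other if a broader pattern were used.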
