AFAIK there is currently no module available for this. I have written a function in the past to find a QR code inside a PDF, and this was quite straightforward in Java. To extract text from a PDF I would go for Apache PDFBox.
See https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox for a simple example of a standalone Java class that extracts the text from a PDF. With a few alterations and tweaks you should be able to use it.
I’ve been using PDFBox to generate and update information in PDFs, and this works like a charm.
I have tried PDFBox, which works for getting text out of PDF files, as you mentioned. Finding data in tables is more tedious, though, since the PDF format represents a table as nothing more than line strokes drawn around text. For this, I’ve taken a look at https://github.com/tabulapdf/tabula-java, which seems able to achieve this goal when I try the demo command-line app using tabula-1.0.2-jar-with-dependencies.jar.
Do you have any tips for creating a temporary Java File from a Mendix FileDocument, and for dealing with Mendix not finding the required classes in the jar? I have since solved the task of extracting information from tables, but the question about using a temporary File object is still relevant to me.
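On the temporary File question, here is a minimal sketch using only the Java standard library. It assumes you can get the document content as an InputStream; in a Mendix Java action I believe this would come from something like Core.getFileDocumentContent(getContext(), fileDocument.getMendixObject()), but treat that call as an assumption and check it against your Mendix version. The class name TempFileHelper and the method name toTempFile are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Mendix or PDFBox.
class TempFileHelper {

    /**
     * Copies the given stream into a newly created temporary file.
     * The caller is responsible for closing the stream and for
     * deleting the returned file when it is no longer needed.
     */
    static File toTempFile(InputStream content, String suffix) throws IOException {
        // createTempFile already creates an empty file on disk,
        // so the copy below has to be allowed to overwrite it.
        Path temp = Files.createTempFile("filedocument-", suffix);
        try {
            Files.copy(content, temp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a half-written file behind if the copy fails.
            Files.deleteIfExists(temp);
            throw e;
        }
        return temp.toFile();
    }
}
```

Wrapping the InputStream in a try-with-resources block and deleting the file in a finally block after the PDF library is done with it should keep the temp directory from filling up. Libraries that accept an InputStream directly may let you skip the temporary file altogether.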
I published a module for reading the content of PDF files, and for reading and setting their metadata. It might be useful for future reference or for other people with the same issue.
You can find the module here: https://appstore.home.mendix.com/link/app/109922/