Uploaded PDFs: search limit of 100,000 characters


#1

Enonic version: XP 6.15.5
OS: OSX

There seems to be a limit of 100,000 indexed characters for uploaded PDFs.


Here are two PDFs. When I use the search function, I would like both publications to appear when I search for e.g. “pendling”, which appears in both publications.

Currently only one document is presented in the search result due to the 100,000-character limit, reported in the log as:

c.e.x.e.impl.BinaryExtractorImpl - Error extracting binary: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).

Our plan is to add 22,000 PDFs to an Enonic XP instance, so it’s important that the search function works as expected!

BR

Preben


#2

I plan to extract the raw text from the PDFs in the script that populates data into XP, so I’m not very stressed about this problem anymore. It would have been nice if it were possible to do content.modify on uploaded PDFs that throw an error, but I plan to work around this too by putting the raw data on a separate content type and uploading the PDF as an attachment on that content instead. The search will then be done on the new content type, which contains the raw text from the PDF.
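A rough sketch of that workaround, using XP’s lib-content on the server side. The content type name, folder path, and field name are assumptions for illustration, not the actual app’s setup:

```javascript
// Sketch: store raw PDF text on a custom content type via lib-content.
// Runs inside an XP app only; contentType/parentPath/field are assumed names.
function createTextContent(rawText, pdfName) {
  // XP server-side library, available only in an Enonic XP controller
  const contentLib = require('/lib/xp/content');
  return contentLib.create({
    name: pdfName,
    parentPath: '/pdf-texts',            // assumed target folder
    contentType: app.name + ':pdf-text', // assumed custom content type
    data: { rawText: rawText }
  });
}
```

The PDF itself would then be attached to the created content, and the search query would target the `rawText` field of this content type.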


#3

My conclusions after investigating PDFs and XP:
A better approach to indexing PDFs is to parse them with an external tool like pdf-parse (available as an npm module). Load the PDF with e.g. request-promise-native (in a separate Node program), populate the raw text into XP through a service on a content type you have created, and then upload the PDF as an attachment or media.
With this method you can be sure that all the data is indexed, and you can use that data to build an autocomplete function and also provide context for the search result (presenting the sentence the search word occurred in).
In summary, use this method to:

  1. create a dictionary for your autocomplete function
  2. ensure that all the data is indexed for large PDFs
  3. provide data for a search function to show the context in which a search word occurred
  4. override the inherent limitations of XP’s PDF parsing capabilities
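A minimal sketch of that pipeline, assuming the npm modules pdf-parse and request-promise-native and a hypothetical XP service URL that receives the raw text; this is an illustration of the approach, not a drop-in implementation:

```javascript
// Sketch: download a PDF, extract ALL its text with pdf-parse, and POST
// the raw text to an assumed XP service. Module use is as documented;
// the service URL and JSON shape are assumptions.
async function pushPdfText(pdfUrl, serviceUrl) {
  // Required lazily so the sketch stays self-contained
  const rp = require('request-promise-native');
  const pdfParse = require('pdf-parse');

  // 1. Download the PDF as a raw buffer (encoding: null keeps it binary)
  const buffer = await rp({ uri: pdfUrl, encoding: null });

  // 2. Extract the full text -- no 100,000-character limit here
  const { text } = await pdfParse(buffer);

  // 3. POST the raw text to the XP service that stores it on the
  //    custom content type, with the PDF added as an attachment
  return rp({
    method: 'POST',
    uri: serviceUrl,
    json: { source: pdfUrl, rawText: text }
  });
}
```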

The text fields in XP happily accept unlimited data and index it; at least I have tested up to 700 kB of text so far. So now I can search all the text in a PDF, not only the first 100,000 characters.
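For point 3 above (showing the context a search word occurred in), a small helper over the raw text might look like this. Plain JavaScript, no XP APIs; the naive sentence splitting is an assumption that works for ordinary prose:

```javascript
// Sketch: given raw text extracted from a PDF, return the first sentence
// containing the search term, for display as context in a search result.
function contextFor(rawText, term) {
  // Split on sentence-ending punctuation followed by whitespace
  const sentences = rawText.split(/(?<=[.!?])\s+/);
  const needle = term.toLowerCase();
  // Case-insensitive match; null when the term is not found
  return sentences.find(s => s.toLowerCase().includes(needle)) || null;
}

contextFor('Mange pendler daglig. Pendling gir mye trafikk.', 'pendling');
// → 'Pendling gir mye trafikk.'
```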


#4

Hi Preben, glad to hear you found a workaround.
However, it would be much simpler for everyone if this limit were configurable (and maybe it already is, somehow?).

I’m converting this discussion into a feature request!


#5

Thanks, it would be nice if all text were searchable! Although for my purposes I may still need the raw text for a dictionary and for extracting the context around the search word. As I understand it, you use Tika for extracting text from PDFs, which is indeed configurable. I would advise simply setting it to a large enough number so that we don’t need to configure it; 10 MB should probably be enough for all purposes.

It would also be nice if it were possible to add x-data with content.modify() for uploads of mime type application/pdf, which I haven’t managed to do. It works fine for images, somehow.