Enterprise search, not Google search, is big business, especially with the advent of LLMs and people wanting to know everything about an organisation immediately!

To dig into this, I interviewed Tim Allison, who recently left NASA’s JPL, where he had worked on many significant projects, including DARPA’s Safedocs program, which looked at how to make PDFs more secure against attacks. One outcome was a new corpus of 8 million PDFs for developers to use.

We also discussed Tim’s association with Apache Tika, the document parsing toolkit that powers many search engines today, and how it extracts information from PDFs and from documents in many other formats.

Search and document parsing have many wrinkles. What documents are you parsing? Are they text, images or something completely different? How will you or your users search for them? How will you tune your engine so that users find meaningful content fast? Search is not set and forget: you can’t just stick everything in an Elasticsearch cluster and hope for the best. I also expect most people don’t realise how much open-source software is used under the hood to deliver results to users. From the query interface to the document parser and document storage, so much of this is open source, and it is often maintained by teams small enough to count on your fingers.

If you’re interested in what Tim does, you can follow him on LinkedIn. If you’re interested in Apache Tika, click here, and if you’re interested in Safedocs, you can find out more right here.

If your company could use our expert knowledge in deploying and scaling systems, then book an introductory call and find out how Spicule can help.

