A search engine that autonomously crawls documents from a given domain, including its subdomains, analyzes them, and renders the results in a search frontend. This implementation demonstrates the functionality with Stanford University's website. The project was built during the Next Iteration Hackathon 2018.
Python implementation with Scrapy here.
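The subdomain-inclusive crawl scope can be sketched as a simple URL filter. The helper name `is_in_scope` and the base domain are illustrative, not taken from the project's Scrapy code (which would typically rely on `allowed_domains`):

```python
from urllib.parse import urlparse

def is_in_scope(url: str, base_domain: str) -> bool:
    """Return True if the URL's host is the base domain or one of its subdomains.

    Hypothetical helper illustrating the crawl scope; the actual spider
    configures this via Scrapy's own domain filtering.
    """
    host = urlparse(url).hostname or ""
    return host == base_domain or host.endswith("." + base_domain)

# A subdomain like cs.stanford.edu is in scope; an unrelated or
# lookalike domain (e.g. evilstanford.edu) is not.
print(is_in_scope("https://cs.stanford.edu/people", "stanford.edu"))
print(is_in_scope("https://evilstanford.edu/", "stanford.edu"))
```

The suffix check uses a leading dot so that lookalike domains ending in the same string are rejected.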
Python implementation using Watson NLU (for named entities and keywords), gensim (for summarization and semantic representation), and a custom document-type classifier (Random Forest, with sklearn). The title, a thumbnail, and embedded images are also extracted from each document. See the notebooks for the specific implementations.
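As a rough illustration of the keyword-extraction step: the project uses Watson NLU for this, so the stdlib-only frequency sketch below is a simplified stand-in, not the actual pipeline:

```python
import re
from collections import Counter

# Minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def top_keywords(text: str, n: int = 5) -> list:
    """Naive frequency-based keyword extraction; a stand-in for Watson NLU.

    Lowercases the text, drops stopwords and very short tokens, and
    returns the n most frequent remaining words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

keywords = top_keywords(
    "Stanford research on search engines. Search engines index documents; "
    "Stanford hosts many documents."
)
```

A real deployment would instead send the document text to the Watson NLU API and read entities and keywords from its response.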
A React frontend that displays the extracted information, enriched with image results from Bing Image Search, here.
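The Bing Image Search lookup can be sketched server-side in Python. The endpoint path and header name below follow Microsoft's public v7 API, but treat the exact values as assumptions to verify against the current Bing documentation; the request is constructed, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_bing_image_request(query: str, api_key: str) -> Request:
    """Construct (but do not send) a Bing Image Search v7 request.

    Endpoint URL and subscription-key header are assumptions based on
    the public v7 API; a real key is required to actually send it.
    """
    base = "https://api.bing.microsoft.com/v7.0/images/search"
    url = base + "?" + urlencode({"q": query, "count": 5})
    return Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})

req = build_bing_image_request("stanford university", "YOUR_KEY")
```

In the actual project this lookup is driven from the React frontend; the sketch only shows the shape of the API call.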
- Swig etc. for Textract: https://textract.readthedocs.io/en/stable/installation.html
- Ghostscript: https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript
- ImageMagick 6: ImageMagick/ImageMagick#953