The Portable Document Format (PDF) is a file format introduced by Adobe to represent documents independently of the software, hardware and operating system used to create them, and of the output devices used to display them. Because reading a high volume of electronic documents can be a long, cumbersome process, Swansea University has developed a system for extracting information from electronic files such as PDFs in an interactive, user-centric manner. The approach is similar to querying a database or an Internet search engine, but it operates on electronic files instead.
The project's mission is to create a software suite that extends the capabilities of existing electronic file viewers so that users can enter search queries to retrieve specific information. A query can be applied to more than one electronic document at the same time, and the query results can be stored with the electronic document to aid future reading. The software design includes a pop-up user interface, accessed from the menu (similar to the search/replace function in Word), in which users enter search queries in a simple scripting language and designate what information should be highlighted in the viewed document. Researchers face a steadily growing demand to read large volumes of documents, especially in areas where they have little or no prior knowledge. The proposed method of information extraction is intended to make reading electronic documents more efficient and more collaborative, and so to improve productivity in discovering relevant information.
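The workflow described above (one query, several documents, highlights stored with each document) can be illustrated with a minimal sketch. It is not the patented method or its scripting language, neither of which is specified here. The `Document` class, the `AND`-only query syntax and all function names are hypothetical, and the sketch assumes that the text has already been extracted from each file.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for an electronic file with already-extracted text."""
    name: str
    text: str
    # Maps a query string to the highlight spans it produced in this document.
    saved_results: dict = field(default_factory=dict)


def parse_query(query):
    """Split a query such as 'protein AND folding' into lowercase terms.

    The AND-only syntax is an assumption made for this illustration.
    """
    return [t.strip().lower() for t in query.split(" AND ") if t.strip()]


def find_highlights(text, terms):
    """Return sorted (start, end) spans of every term, or [] if any term is absent."""
    spans = []
    for term in terms:
        matches = [m.span() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]
        if not matches:
            return []  # every term must occur for the document to match
        spans.extend(matches)
    return sorted(spans)


def run_query(documents, query):
    """Apply one query to several documents and store the spans with each match."""
    terms = parse_query(query)
    results = {}
    for doc in documents:
        spans = find_highlights(doc.text, terms)
        if spans:
            doc.saved_results[query] = spans  # kept with the document for later reading
            results[doc.name] = spans
    return results


if __name__ == "__main__":
    docs = [
        Document("a.pdf", "Protein folding is hard. Folding matters."),
        Document("b.pdf", "Protein only."),
    ]
    print(run_query(docs, "protein AND folding"))
```

In this sketch a viewer would use the returned character spans to decide what to highlight, and the `saved_results` mapping plays the role of query results stored alongside the document.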
The invention allows efficient extraction of information from PDF documents, turning static PDF reading and viewing into an interactive process. The proposed method can be readily integrated into existing software products such as Adobe Reader, which gives the invention mass-market appeal.
A patent application for this invention has been filed by Swansea University under PCT/GB2013/000369.