Sintelix’s Performance


At Semantic Sciences we have worked to provide the highest quality entity extractor on the market.  Our customers tell us that we have succeeded.  

The five areas of performance in which we try to make Sintelix excel are:

  • entity recognition accuracy (precision, recall, F1, F2),
  • document processing speed,
  • search speed,
  • hardware footprint, and
  • ease of use of the graphical user interface and the system’s integration interfaces.

Entity and Relationship Recognition Accuracy

A snapshot of Sintelix’s entity recognition performance is shown in the table below.  It shows scores and direct counts of results calculated using 10-fold cross-validation, which ensures that each document is tested by a model that was not trained on it.  The documents are the 100 documents of the MUC 7 development collection.  We have added new classes and relationships to the original MUC 7 annotations and corrected mistakes and inconsistencies.

Entity recognition performance table
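
For readers who want to reproduce this style of evaluation on their own data, the following is a minimal sketch of how F1 and F2 scores are derived from raw counts and how a 10-fold split over 100 documents can be formed. It is not Sintelix code: the function names and the example counts are assumptions made for illustration, and the fold assignment (document index modulo 10) is just one simple way to split a collection.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; beta > 1 weights recall more heavily."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # fraction of reported entities that are correct
    recall = tp / (tp + fn)      # fraction of true entities that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def k_fold_splits(n_docs: int, k: int = 10):
    """Yield (train_ids, test_ids) so every document is tested exactly once."""
    for fold in range(k):
        test_ids = [i for i in range(n_docs) if i % k == fold]
        train_ids = [i for i in range(n_docs) if i % k != fold]
        yield train_ids, test_ids


# Hypothetical counts, for illustration only: 6 correct, 2 spurious, 4 missed.
f1 = f_beta(tp=6, fp=2, fn=4, beta=1.0)
f2 = f_beta(tp=6, fp=2, fn=4, beta=2.0)

# 100 documents, as in the MUC 7 development collection, split into 10 folds.
splits = list(k_fold_splits(100, k=10))
```

In this sketch, F2 uses beta = 2, so a missed entity costs more than a spurious one. Summing the counts over all ten test folds before computing the scores gives a single set of figures for the whole collection.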

Document Processing Speed

The fastest way to process documents is via the Java API.  With this method, Sintelix can process 1 million XML-encoded newswire reports (2.8 GB of raw documents) per hour on a modern 4-core workstation with 12 GB of RAM.  Depending on the network overhead, this speed is approximately halved when using the web service interface.  If documents and annotations are stored in Sintelix’s database, just over 600,000 newswire reports are processed per hour.
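
To put these figures in per-second terms, or to estimate how long a batch might take, the arithmetic can be sketched as below. The rates come from the paragraph above; treating "approximately halved" as exactly one half, "just over 600,000" as 600,000, and 1 GB as 10^9 bytes are simplifying assumptions, and the variable and function names are illustrative only.

```python
SECONDS_PER_HOUR = 3600

# Rates quoted above (documents per hour).
java_api_docs_per_hour = 1_000_000
web_service_docs_per_hour = java_api_docs_per_hour / 2  # assumption: exactly half
database_docs_per_hour = 600_000                        # assumption: lower bound

raw_bytes_per_hour = 2.8e9  # assumption: 2.8 GB taken as decimal gigabytes

docs_per_second = java_api_docs_per_hour / SECONDS_PER_HOUR        # about 278
avg_doc_bytes = raw_bytes_per_hour / java_api_docs_per_hour        # about 2.8 KB
mb_per_second = raw_bytes_per_hour / SECONDS_PER_HOUR / 1e6        # about 0.78


def hours_to_process(n_docs: int, docs_per_hour: float) -> float:
    """Estimated wall-clock hours for a batch at a steady rate."""
    return n_docs / docs_per_hour


# Hypothetical batch of 5 million reports stored in the database.
batch_hours = hours_to_process(5_000_000, database_docs_per_hour)
```

Actual throughput depends on document size and hardware, so an estimate like `batch_hours` is best read as a planning figure and not as a guarantee.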

Search Speed

We set Sintelix up on a 4-core 2011 workstation and ingested the 806,000-document Reuters Corpus.  In trials of randomized searches, each returning the first ten instances, the system was capable of responding to 3,000 queries per second.

Hardware Footprint

Sintelix has been designed to make the best possible use of hardware resources.  It runs well on a dual-core laptop with 4 GB of RAM and an SSD, where it gives a very snappy response.  In operational applications we recommend that 5 GB of RAM be made available to the program.  If processed documents are stored within the system’s database, we recommend budgeting six times the disk space used for the source documents.
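
As a quick sizing aid, the recommendations above can be turned into a small check. This is a sketch under stated assumptions: the six-fold figure is read as the total disk budget for the database, and the function names are hypothetical and not part of Sintelix.

```python
DISK_MULTIPLIER = 6        # recommended disk budget relative to source size
RECOMMENDED_RAM_GB = 5     # recommended RAM for operational use


def disk_budget_gb(source_gb: float, multiplier: int = DISK_MULTIPLIER) -> float:
    """Disk space to budget when processed documents are stored in the database."""
    return source_gb * multiplier


def meets_ram_recommendation(available_gb: float) -> bool:
    """True if the RAM available to the program meets the operational recommendation."""
    return available_gb >= RECOMMENDED_RAM_GB


# Example: the 2.8 GB newswire collection mentioned under Document Processing Speed.
budget = disk_budget_gb(2.8)              # about 16.8 GB
laptop_ok = meets_ram_recommendation(4)   # a 4 GB laptop is below the operational figure
```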