Features

Sintelix has a wide range of features that let you rapidly configure high-quality information extraction components for your workflows. It uses novel proprietary language technology, text analytics and text mining algorithms to achieve high accuracy at great speed.

Document Ingestion

Information Extraction Rate

30 full pages of text per core per second. 2.5 million pages per core per day.

Sintelix will extract whatever text it can find from files of any type — including text from executables and file fragments recovered from hard drives. We provide the following features:

  • deNISTing (exclusion of computer system files)
  • deduplication
  • Culling (exclusion) of files by:
    • file content type (e.g. binary, application, image, etc. - over 1,200 file types)
    • file extension (e.g. .exe, .inf, .gif, etc.)
    • language (>50 languages supported)
    • user-defined file hash list
      • to exclude unwanted files
      • to mark known files of interest (e.g. suspect images, virus files or other files of interest)
  • Optionally save source files
  • Ingest archives:
    • compression (e.g. zip, bzip, gzip, etc.)
    • email (PST, MBOX)
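
As a rough illustration of how hash-list culling and deduplication can work (this is a sketch, not Sintelix's implementation or API), the Python below hashes each file's content, drops files on an exclusion list, drops repeated content, and flags files on a list of interest. The function names, the use of SHA-256 and the in-memory file representation are all assumptions made for the example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

def triage_files(files, exclude_hashes=frozenset(), interest_hashes=frozenset()):
    """Split (name, bytes) pairs into kept file names and flagged file names.

    Hypothetical helper for illustration only:
    - files whose hash is in exclude_hashes are culled
    - files whose content has already been seen are dropped as duplicates
    - files whose hash is in interest_hashes are kept and also flagged
    """
    seen = set()
    kept, flagged = [], []
    for name, data in files:
        digest = sha256_hex(data)
        if digest in exclude_hashes or digest in seen:
            continue
        seen.add(digest)
        kept.append(name)
        if digest in interest_hashes:
            flagged.append(name)
    return kept, flagged
```

Hashing the content rather than comparing file names means that renamed copies of the same file are still treated as duplicates.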

Document Normalization

Document normalization handles all the character encoding issues and extracts document structures such as paragraphs, tables, headers etc. This provides the base for subsequent text mining and analysis.

Entity Extraction

Accuracy

95% F1 on MUC 7 documents.

(Named) Entity Recognition automatically finds proper nouns of interest and assigns them to classes, including people, organizations and artifacts. Sintelix also extracts dates, times, percentages, money amounts and relationships of various types. Special features of Sintelix's entity recognition include:

  • Handles text in:
    • mixed case (normal)
    • upper case
    • lower case
    • title case
  • Splitting of entities into their subcomponents is configurable (e.g. "President James Black" can optionally be split into a job title and a name)
  • Can be optimized to your data
  • Users can include their own hand-crafted rules for extraction, combination and deletion of entities using Sintelix's powerful context-sensitive grammar parser (see below).

Accuracy

Sintelix Entity Recognition has world-leading accuracy.  Sintelix was created because Australian Government agencies could not find entity extraction tools of sufficient accuracy on the market.

precision (percentage of extracted entities that Sintelix got correct, using the MUC scoring algorithm) - Sintelix 96.21%; Lead competitor < 85% (i.e. Sintelix makes less than a third of the errors)

recall (percentage of true entities that Sintelix found, using the MUC scoring algorithm) - Sintelix 94.54%; Lead competitor < 78% (i.e. Sintelix has less than a quarter of the misses)
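
The precision and recall figures above can be combined into a balanced F-measure and compared as error rates. The short Python sketch below shows the arithmetic; the function names are illustrative. Because the competitor figures are upper bounds, the ratios computed at those bounds are upper bounds as well.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def error_ratio(ours: float, theirs: float) -> float:
    """Ratio of the error rates (1 - score) of two systems."""
    return (1 - ours) / (1 - theirs)

# Figures quoted above; the competitor values are the stated upper bounds.
sintelix_f1 = f1_score(0.9621, 0.9454)
precision_error_ratio = error_ratio(0.9621, 0.85)  # share of competitor's errors
recall_miss_ratio = error_ratio(0.9454, 0.78)      # share of competitor's misses
```

With these inputs the F-measure comes to roughly 95.4%, the error ratio is just over a quarter (under a third), and the miss ratio is just under a quarter.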

Scalability & Speed

Very fast - 30 full pages of text per core per second, or 2.5 million pages per core per day (Intel X980 processor)
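
For capacity planning, the quoted rate can be turned into a rough estimate. The Python sketch below assumes that throughput scales linearly with the number of cores, which the figures above do not state for extraction, so treat the results as an approximation.

```python
PAGES_PER_CORE_PER_SECOND = 30  # rate quoted above
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_day(cores: int = 1) -> int:
    """Pages processed per day, assuming linear scaling across cores."""
    return PAGES_PER_CORE_PER_SECOND * SECONDS_PER_DAY * cores

def hours_to_process(pages: int, cores: int) -> float:
    """Estimated wall-clock hours to process a collection of the given size."""
    return pages / (PAGES_PER_CORE_PER_SECOND * cores) / 3600
```

One core at 30 pages per second works out to 2,592,000 pages per day, which is the "2.5 million" figure rounded down.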

Entity Finding

Customers commonly have databases of entities of interest that they want to detect in their document collections. Entity Finding locates reference entities within the documents using the full power of Sintelix's Entity Recognition system. Entity Finding happens at the same time as Entity Recognition. It uses a fast, scored approximate matching algorithm and handles aliases and the multiple ways names can be written (e.g. "John Smith" and "SMITH, John"). Entity Finding takes into account word frequencies, fame and context, where available.
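
To make the idea of scored approximate matching concrete, here is a minimal Python sketch using the standard library's difflib. It is not Sintelix's algorithm: it only normalizes name order and case, scores string similarity, and checks aliases. It ignores word frequencies, fame and context. The function names and the 0.85 threshold are assumptions for the example.

```python
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lower-case a name and rewrite "LAST, First" as "first last"."""
    name = name.strip()
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(name.lower().split())

def match_score(mention: str, reference: str) -> float:
    """Similarity in [0, 1] between a mention and a reference name."""
    return SequenceMatcher(None, normalize_name(mention), normalize_name(reference)).ratio()

def find_entity(mention, reference_entities, threshold=0.85):
    """Return (entity_id, score) for the best-matching reference entity, or None.

    reference_entities maps an entity id to a list of names and aliases.
    """
    best = None
    for entity_id, names in reference_entities.items():
        score = max((match_score(mention, n) for n in names), default=0.0)
        if best is None or score > best[1]:
            best = (entity_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

For example, with a reference list `{"P1": ["John Smith", "Johnny Smith"]}`, both "SMITH, John" and the misspelling "Jon Smith" resolve to `P1`, while an unrelated mention returns `None`.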

Entity Resolution & Network Building (i.e. Identity Resolution, Sense-making)

Sintelix provides a very high performance entity resolver that connects up references to the same underlying entity across a document collection. It clusters the references, and each cluster refers to the same underlying entity.

For example, across a document collection or data set there may be hundreds of references to three people called "James Adams". Sintelix Entity Resolution creates a cluster of references for each person.

Sintelix's entity resolver can be used independently of the rest of Sintelix and can be applied to both structured and unstructured data.
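
The clustering idea can be sketched in a few lines of Python. This toy version is a deterministic stand-in, not Sintelix's resolver: it links two references when they have the same name and share at least one context term, then groups linked references with a union-find structure. The data layout, the `resolve` function and the `min_shared` parameter are assumptions for illustration, and the pairwise comparison would not scale to large collections.

```python
from itertools import combinations

def resolve(references, min_shared=1):
    """Cluster references that share a name and enough context terms.

    references: list of (name, set_of_context_terms) tuples.
    Returns a sorted list of clusters, each a list of reference indices.
    """
    parent = list(range(len(references)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        name_i, context_i = references[i]
        name_j, context_j = references[j]
        same_name = name_i.lower() == name_j.lower()
        if same_name and len(context_i & context_j) >= min_shared:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(references)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Given five references to "James Adams" where two mention the same employer, two mention the same profession and one shares nothing with the others, the sketch returns three clusters, one per person.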

Accuracy

  • Sintelix has world-leading accuracy: f-measure is 95.9% (best comparable solution on same data is 88.2%)

Scalability & Speed

  • Very fast - 466,000 entities resolved per minute (Intel X980 processor). Comparable systems (e.g. R-Swoosh on Oyster) achieve less than 15,000 per minute for similar data on similar hardware, and perform only deterministic entity resolution on structured data. Such systems fail to apply the probabilistic contextual constraints that give high accuracy.
  • Scales linearly (see graph below)

Figure: Scalability of Entity Resolution with Sintelix


Capabilities

  • Handles every entity type (Person, Organization, Location etc.)
  • Handles misspellings
  • Uses context and any available attributes
  • Works with structured and unstructured data
  • For structured data, entity resolution performs collapse (deduplication) and record linking (foreign key finding)
  • Sintelix can scrape tables within documents into database tables and then link these together to create a fully linked relational database.
  • Can be operated entirely independently from other Sintelix functions, if required


Benefits

  • All documents relevant to a particular person, location, organization etc. are automatically identified
  • Better than search because results are grouped by underlying entity - you don't have to wade through irrelevant cases
  • In combination with collocation and relationship extraction, the result of entity resolution is a network summarizing your entire data collection
  • If the instances of the entities are time tagged (as with dated documents) entity resolution also creates timelines
  • Information is automatically extracted from complex tables in a collection of documents to create a fully linked database

Contextual Geotagging

Sintelix finds the map positions (latitude and longitude) of location entities extracted from documents and provides other data about them (country, population, location type, standardized name).

Sintelix uses the context of the location references within the document to achieve very high accuracy.

This capability can be used via the Sintelix GUI (see below), the Java API or Sintelix's web services.

Customers can use their own location gazetteers with whatever attributes they provide.

If the Sintelix GUI is used to provide a map view of locations, users are free to opt for any map tile server they choose.

Geotagging is sometimes also called geocoding, but that term refers more narrowly to address location.
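
One simple way to picture contextual disambiguation is to let the unambiguous locations in a document vote on how to resolve the ambiguous ones. The Python sketch below does this by country, falling back to population when there is no context. It is an illustration only, not Sintelix's method. The tiny gazetteer is hypothetical and its coordinates and populations are approximate.

```python
from collections import Counter

# Toy gazetteer for illustration; values are approximate.
GAZETTEER = {
    "paris": [
        {"name": "Paris", "country": "FR", "lat": 48.86, "lon": 2.35, "population": 2_100_000},
        {"name": "Paris", "country": "US", "lat": 33.66, "lon": -95.56, "population": 25_000},
    ],
    "dallas": [
        {"name": "Dallas", "country": "US", "lat": 32.78, "lon": -96.80, "population": 1_300_000},
    ],
    "houston": [
        {"name": "Houston", "country": "US", "lat": 29.76, "lon": -95.37, "population": 2_300_000},
    ],
}

def geotag(mentions, gazetteer=GAZETTEER):
    """Map each known location mention to its most plausible gazetteer entry."""
    # Unambiguous mentions vote for their country.
    votes = Counter()
    for mention in mentions:
        candidates = gazetteer.get(mention.lower(), [])
        if len(candidates) == 1:
            votes[candidates[0]["country"]] += 1

    result = {}
    for mention in mentions:
        candidates = gazetteer.get(mention.lower(), [])
        if candidates:
            # Prefer the country supported by context, then the larger place.
            result[mention] = max(
                candidates, key=lambda c: (votes[c["country"]], c["population"])
            )
    return result
```

In a document that also mentions Dallas and Houston, "Paris" resolves to the Texas entry. With no other locations in the document it falls back to the larger city in France.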

Accuracy

Sintelix geotagging grew out of a customer's dissatisfaction with the market leading solution.  Our accuracy figures for geotagging of locations extracted from newswire articles are excellent:

precision (percentage of geotagged locations with correct geotags) - Sintelix  97.6%; Lead competitor < 89%

recall (percentage of locations that were correctly geotagged) - Sintelix  96.7%; Lead competitor < 60%

Note: a sizable minority of the incorrect geotags have practically identical locations, for example the administrative region of a city rather than the city itself. If these cases are counted as correct, precision becomes 98.2% and recall becomes 97.6%.

Speed

Geotagging is included in Sintelix standard processing times (30 full pages of text per core per second).  It is more than an order of magnitude faster than its leading competitor.

Figure: Document showing geotagged locations (other entity types are not shown)

Event Recognition

Events are recognized via faceted search over co-occurrences of participating factors, such as dates, people, locations, topics etc.

Key-Value Extraction

Sintelix can recognize key-value pairs, such as "Date of Birth: <Date>", and can extract and normalize the "<Date>" value to directly generate fields for database records.
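
As a minimal sketch of the idea (not Sintelix's extractor), the Python below pulls "Key: value" lines out of text with a regular expression and normalizes any value that parses as a date to ISO 8601. The pattern, the list of accepted date formats (day-first is assumed for numeric dates) and the function names are all assumptions for the example.

```python
import re
from datetime import datetime

# One "Key: value" pair per line; keys are letters and spaces only.
KEY_VALUE = re.compile(
    r"^[ \t]*(?P<key>[A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*(?P<value>\S.*?)[ \t]*$",
    re.MULTILINE,
)

# Assumed input formats, e.g. "3 March 1985", "03/03/1985", "1985-03-03".
DATE_FORMATS = ("%d %B %Y", "%d/%m/%Y", "%Y-%m-%d")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def extract_record(text):
    """Build a database-style record (dict) from the key-value lines in text."""
    record = {}
    for match in KEY_VALUE.finditer(text):
        field = match.group("key").strip().lower().replace(" ", "_")
        value = match.group("value")
        record[field] = normalize_date(value) or value
    return record
```

Applied to the two lines "Name: Jane Doe" and "Date of Birth: 3 March 1985", this yields a record with fields `name` and `date_of_birth`, with the date stored in normalized form.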

Entity Discovery

Sintelix discovers, annotates and names new categories of entities that were not previously seen or included in the training documents.

Relationship Recognition

Sintelix identifies and extracts relationships between entities, for example:

  • Person-job title
  • Person-associated company
  • Person-employer
  • Person-associated job location
  • Family relationships
  • Co-occurrences

Anaphora Resolution

Sintelix resolves pronouns ("he", "she", "it", etc.) to the proper nouns to which they refer.

Entity Alias Generation

Sintelix generates alternative words and phrases (aliases) for extracted entities from its reference database.

Structure-Based Information Extraction

Sintelix offers a versatile, high-precision information extraction capability that enables entities and content from complex tables and data structures to be extracted to create database records.

Context-Sensitive Grammar Annotator

Sintelix comes with a very fast and fully featured context-sensitive grammar (CSG) annotation tool.  Advanced users can write CSG rules to create annotations and give them features.  The CSG parser has a system for automatically checking that all rules in the system are achieving their intended purpose. This system provides a debugging report for any rules that are failing.

Entity Features

Sintelix extracts features for entities, including:

  • Person
    • gender
    • first name/initial
    • middle name/initial
    • last name
    • title
    • job title/activity
    • location
    • organization
    • Wikipedia page
  • Location
    • country
    • broad type
    • narrow type
    • latitude
    • longitude
    • population
    • Wikipedia page
  • Organization
    • country
    • type
    • location
    • activity
    • web address
    • email
    • telephone
    • fax
    • Wikipedia page
  • Date/Time
    • full local normalized date/time
    • range
    • duration
  • Money
    • currency
    • amount
    • range
  • Etc.
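
For readers planning how to store these features, the list above maps naturally onto simple record types. The Python dataclasses below are a hypothetical schema for two of the entity types. The class and field names are assumptions derived from the list, not Sintelix's API or export format.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class LocationFeatures:
    """Features listed above for a Location entity; every field is optional."""
    country: Optional[str] = None
    broad_type: Optional[str] = None
    narrow_type: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    population: Optional[int] = None
    wikipedia_page: Optional[str] = None

@dataclass
class MoneyFeatures:
    """Features listed above for a Money entity."""
    currency: Optional[str] = None
    amount: Optional[float] = None
    range: Optional[Tuple[float, float]] = None  # (low, high) when a range is given

# Example: a partially populated location, converted to a plain dict for export.
sydney = LocationFeatures(country="Australia", latitude=-33.87, longitude=151.21)
sydney_row = asdict(sydney)
```

Keeping every field optional reflects that not every feature will be available for every extracted entity.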

Document Annotation Tools

Sintelix possesses a suite of tools for rapidly creating very high quality gold standard annotations of document collections.  These tools include:

  • A rapid manual annotation tool
  • An annotation refinement tool that collates annotations from across one or more document collections.  The tool helps rapid identification and correction of irregularities and ambiguities in gold standard collections.

Evaluation Tools

Creation of the highest quality annotations requires tools that indicate and diagnose problems.  Sintelix provides:

  • An entity recognition evaluator
  • An entity recognition change analyzer
  • An annotator analyzer/debugger
  • A context-sensitive grammar parser debugger

Concept Tree Editor

Sintelix provides a drag-and-drop editor for constructing concept trees with color schemes selectable by the user.

Concept Mapping Editor

Sintelix also provides a drag-and-drop editor for mapping one or more input concept trees onto an output concept tree.  Mappings provide great flexibility in search and for managing gold standard collections.

Information Extraction Manager

The Information Extraction Manager provides the capability for combining all the entity recognition and structure-based information extraction capabilities of Sintelix to create multi-field records for export to databases.  It includes the following components:

  • Document loader
  • Extraction template editor
  • Document element finder
  • Document element finder analyzer
  • Table transformer
  • Table interpreter
  • Table transformer/interpreter analyzer
  • Output viewer

Sophisticated Search

Speed

3,000 searches per second handled by a 2010 workstation.

Sintelix offers a sophisticated tool for high-speed searches over entities/events, their features, and relationships.

Administration

Users are registered and given privileges via Sintelix’s administration interface.