International Journal on Advanced Science, Engineering and Information Technology, Vol. 8 (2018) No. 4-2: Special Issue on Empowering the Nation via 4IR (The Fourth Industrial Revolution), pages: 1437-1445, Chief Editor: Khairuddin Omar | Editorial Boards: Shahnorbanun Sahran Hassan, Nor Samsiah Sani, Heuiseok Lim & Danial Hoosyar, DOI:10.18517/ijaseit.8.4-2.6831

Building Compact Entity Embeddings Using Wikidata

Mohamed Lubani, Shahrul Azman Mohd Noah

Abstract

Representing natural language sentences has always been a challenge in statistical language modelling. Atomic discrete representations of words make it difficult to represent semantically related sentences. Other sentence components such as phrases and named-entities should be recognized and given representations as units instead of as individual words. Different senses of an entity should be assigned different representations even though they share identical words. In this paper, we focus on building the vector representations (embeddings) of named-entities from their contexts to facilitate the task of ontology population, where named-entities need to be recognized and disambiguated in natural language text. Given a list of target named-entities, Wikidata is used to compensate for the lack of a labelled corpus by supplying the contexts of all target named-entities and all their senses. Description text and semantic relations with other named-entities are considered when building the contexts from Wikidata. To avoid noisy and uninformative features in the embeddings generated from artificially built contexts, we propose a method for building compact entity representations that sharpens entity embeddings by removing irrelevant features and emphasizing the most descriptive ones. An extended version of the Continuous Bag-of-Words (CBOW) model is used to build the joint vector representations of words and named-entities from the Wikidata contexts. Each entity context is then represented by a subset of elements that maximizes the chances of keeping the most descriptive features of the target entity. The final entity representations are built by compressing the embeddings of the chosen subset using a deep stacked autoencoder model. Cosine similarity and the t-SNE visualisation technique are used to evaluate the final entity vectors. Results show that semantically related entities are clustered near each other in the vector space. Entities that appear in similar contexts are assigned similar compact vector representations.
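
The cosine-similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. The function names (`cosine_similarity`, `nearest_entities`) and the three-dimensional toy vectors are hypothetical and chosen only for illustration; they are not the authors' code or data, and real entity embeddings would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_entities(target, vectors, k=2):
    """Rank all other entities by cosine similarity to `target`, best first."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors (assumed values, not results from the paper).
toy_vectors = {
    "Paris (city)": [0.9, 0.1, 0.0],
    "London (city)": [0.8, 0.2, 0.1],
    "Paris (mythology)": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for name, score in nearest_entities("Paris (city)", toy_vectors):
        print(f"{name}: {score:.3f}")
```

Under these assumed values, the two city entities score much closer to each other than the two senses of "Paris" do, which is the kind of behaviour the evaluation looks for: entities from similar contexts should lie near each other, while different senses that share a surface form should not.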

Keywords:

Entity Embeddings; Entity Vector Representations; Named Entity Disambiguation.
