
Taxonomic-Relation-Identification

These are study notes on taxonomy construction.


Research Summary

Statistical Approaches

  • Words that frequently co-occur are more likely to have a taxonomic relationship
  • However, this approach depends heavily on the choice of feature types, and its accuracy is low

Pattern-Based Methods

  • Based on hypernym-hyponym pairs from the corpus (lexical-semantic structures)
  • Because of the diversity of linguistic structures and their compositional uncertainty, searching with specific patterns yields low coverage and low accuracy

Word Embedding

  • Mainly focuses on learning word embeddings from word co-occurrence
  • As a result, similar words tend to have similar embeddings
  • However, this approach performs poorly at identifying taxonomic relations

Clustering-Based Methods (unsupervised learning)

Supervised Methods

  • The training set can never contain all taxonomic relations, so such methods are inherently limited

Contributions

  1. An adaptive spherical clustering module for allocating terms to proper levels when splitting a coarse topic into fine-grained ones.
  2. A local embedding module for learning term embeddings that maintain strong discriminative power at different levels of the taxonomy.

Methods

1. Adaptive Spherical Clustering

  • Identify general terms, refine the sub-topics, and push the general terms back up to the parent topic.
  • Use a TF-IDF-style score to select representative terms, since a representative term should appear frequently in topic S but not in the siblings of S (a sketch follows).
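A minimal sketch of such a TF-IDF-style representativeness score, assuming each topic is a bag of term counts; the data and names (`topic_term_counts`, `representativeness`) are illustrative, not from the paper:

```python
import math

def representativeness(term, topic, siblings, topic_term_counts):
    """TF-IDF-style score: high when `term` is frequent in `topic`
    but appears in few of the sibling topics (illustrative formula)."""
    tf = topic_term_counts[topic].get(term, 0)
    # Count how many sibling topics also contain the term.
    df = 1 + sum(1 for s in siblings if term in topic_term_counts[s])
    return tf * math.log((1 + len(siblings)) / df)

# Stand-in term counts for one parent's sub-topics.
topic_term_counts = {
    "machine_learning": {"gradient": 40, "model": 55},
    "databases": {"query": 60, "model": 50},
    "systems": {"kernel": 30, "model": 45},
}
print(representativeness("gradient", "machine_learning",
                         ["databases", "systems"], topic_term_counts))
```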

2. Local Embedding

  • Use Skip-Gram to learn word embeddings.
  • Retrieve a sub-corpus for each topic, as sketched below.
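A minimal sketch of the local-embedding step with gensim's Skip-Gram (`sg=1`); `sub_corpus` is stand-in data for the documents retrieved for one topic:

```python
from gensim.models import Word2Vec

# Stand-in tokenized titles retrieved for a single topic.
sub_corpus = [
    ["convolutional", "neural", "networks", "for", "image", "classification"],
    ["recurrent", "neural", "networks", "for", "sequence", "modeling"],
]

# sg=1 selects Skip-Gram; training only on the topic's sub-corpus keeps
# the embeddings discriminative at this level of the taxonomy.
model = Word2Vec(sub_corpus, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("neural", topn=3))
```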

Datasets

  1. DBLP (contains 600 thousand computer science paper titles)
  2. SP (contains 91 thousand paper abstracts)

Compared Methods

  1. HLDA (Hierarchical Latent Dirichlet Allocation Model), from "Hierarchical Topic Models and the Nested Chinese Restaurant Process"
  2. HPAM (Hierarchical Pachinko Allocation Model), from "Mixtures of Hierarchical Topics with Pachinko Allocation"
  3. HCLUS (Hierarchical Clustering)
  4. NoAC
  5. NoLe

Quantitative Analysis

  1. Relation Accuracy
  2. Term Coherence
  3. Cluster Quality

Contributions

  1. We are the first to propose and formalize the problem of knowledge graph embedding that differentiates between concepts and instances.
  2. We propose a novel knowledge embedding method named TransC, which distinguishes between concepts and instances and deals with the transitivity of isA relations.
  3. We construct a new dataset based on YAGO for evaluation. Experiments on link prediction and triple classification demonstrate that TransC successfully addresses the above problems and outperforms state-of-the-art methods.

Related Work

  • Translation-based Models
  • Bilinear Models
  • External Information Learning Models

Approach

1. TransC

  • InstanceOf Triple Representation
  • SubClassOf Triple Representation
  • Relation Triple Representation

2. Training Method

  • Margin-based ranking loss (a sketch follows this list)
    • Le: instanceOf triples
    • Lc: subClassOf triples
    • Ll: relational triples
  • Overall loss: L = Le + Lc + Ll
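A minimal numpy sketch of a margin-based ranking loss of this shape; the scores and margins are made up, and the real TransC scores come from the triple representations above:

```python
import numpy as np

def margin_ranking(pos, neg, gamma):
    """Hinge loss: push positive-triple scores at least `gamma`
    below negative-triple scores (lower score = more plausible)."""
    return np.maximum(0.0, gamma + pos - neg).sum()

# Stand-in scores for the three triple types.
L_e = margin_ranking(np.array([0.2, 0.5]), np.array([1.4, 0.9]), 1.0)  # instanceOf
L_c = margin_ranking(np.array([0.3]), np.array([1.1]), 1.0)            # subClassOf
L_l = margin_ranking(np.array([0.4, 0.1]), np.array([0.8, 1.2]), 0.5)  # relational
L = L_e + L_c + L_l
print(L)
```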

Experiments and Analysis

  • Datasets: YAGO
  • Link Prediction
  • Triple Classification

Contributions

  • Focused topic hierarchy construction: We introduce the focused topic hierarchy to provide users a customized view of the social media contents, in which the information need is seamlessly incorporated into the topic hierarchy construction process.
  • Customized representative content selection: We develop a probability model to identify the representative content for each topic node on the hierarchy, which enables fast information retrieval on the hierarchically organized social media corpus.

Introduction

  • Information overload and noise: organize the social media contents into a general topic hierarchy.
  • Finding useful content is time-consuming: identify representative contents for each node on the hierarchy.
  • Steps
    • Step 1: use a propagation algorithm to collect the potentially useful topics.
    • Step 2: devise a function to estimate the likelihood of a topic hierarchy, and use it to build a hierarchy that fits both the social media corpus and the user's need.
    • Step 3: propose a probability-based ranking model.

Related work

  • Information organization

    • Past: split the corpus into shallow clusters
    • Recent: organize the corpus into cluster hierarchies, such as hierarchical LDA…
  • Topic hierarchy generation

    • Did not consider users' information needs as input
  • Problem formulation

    • Social media corpus
    • User information needs
    • Focused topic hierarchy

Methodology

Corpus collection and topic modeling

  1. Topic extraction
    • TF-IDF generates the topic set
    • A salience score measures topic importance
  2. Topic relations
    • Topic relevance: semantic relatedness between two topics
    • Subtopic relation strength: likelihood of the subtopic relation

Graph-based focused topic discovery

Given an information need and a social media corpus, the task is to collect a subset Tq of relevant topics.

  • Co-occurrence-based method (topics that co-occur with the need form a coarse subset)
  • A graph-based label propagation algorithm refines the subset (a sketch follows)
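A minimal sketch of graph-based label propagation over a topic graph; the adjacency matrix, seed scores, and damping factor are all illustrative, not the paper's:

```python
import numpy as np

# Stand-in topic graph: row-normalized adjacency over four topics.
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = W / W.sum(axis=1, keepdims=True)

seed = np.array([1.0, 0.0, 0.0, 0.0])  # topic 0 matched the information need
scores = seed.copy()
alpha = 0.8  # weight of the graph signal vs. the co-occurrence seeds
for _ in range(50):
    scores = alpha * P @ scores + (1 - alpha) * seed

# Topics with high propagated scores join the refined subset Tq.
print(scores.round(3))
```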

Information need-aware topic hierarchy construction

  • likelihood of topic hierarchy construction

    • Information need’s perspective
    • Taxonomy structure’s perspective
  • topic hierarchy construction algorithm

Topic hierarchy based customized corpus organization

  • Representative document selection

    • (1) its position on the hierarchy
    • (2) the document's source
  • Okapi BM25 ranking function (a scoring sketch follows this list)

  • Bayes' rule (prior probability)
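A minimal sketch of the standard Okapi BM25 scoring function over tokenized documents; the corpus here is stand-in data, and k1 and b use common defaults:

```python
import math

def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Standard Okapi BM25 score of `doc` for `query_terms`."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                           # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["topic", "hierarchy", "construction"],
          ["social", "media", "topic", "analysis"]]
print(bm25(["topic", "hierarchy"], corpus[0], corpus))
```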

Evaluation

  • social media corpus

  • test information needs (source: related searches from search engines)

  • focused topic discovery (compared with variants that use only one signal: topic salience or topic relevance)

  • focused topic hierarchy construction

    • (1) validate the assumptions
    • (2) compare against other methods
  • customized corpus organization

    • nDCG
    • ablation study

Contributions

  • We introduce a dynamic distance margin model to learn term embeddings that capture hypernymy properties, and we train an SVM classifier for hypernymy identification using the embeddings as features.
  • Previous work on hypernymy identification relied mainly on lexical patterns and the distributional inclusion hypothesis, and its accuracy was low.
  • This work designs a distance-margin neural network that learns term embeddings from pre-collected hypernymy pairs, then uses those embeddings as features to identify hypernymy relations with a supervised method.
  • Earlier work focused on learning from term co-occurrence data, so similar words tend to get similar embeddings; but the task here is to decide whether two terms hold a specific relation, not whether they often co-occur.
  • Higher accuracy, and not domain-dependent

Methods

1. Dynamic distance-margin model

  • Embeddings for the hypernymy relationship

    • Each term has two embeddings: a hyponym embedding O and a hypernym embedding E
    • All embeddings satisfy three properties:
      1. hyponym-hypernym similarity
      2. co-hyponym similarity
      3. co-hypernym similarity
  • Learning the embeddings

    • Hypernymy relationship: x = (v, u, q), where v is the hypernym and u the hyponym
    • Objective: O(u) should be close to E(v)
  • Neural network architecture

2. Supervised hypernymy identification

  • Classify by f(x) < Δ, but determining the threshold Δ is difficult
  • Therefore an SVM is used instead, with input features O(u), E(v), and O(u) - E(v) (a sketch follows)
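A minimal scikit-learn sketch of this supervised step; the embeddings and labels are random stand-ins, but the feature construction [O(u); E(v); O(u) - E(v)] follows the note:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dim = 200, 50

O_u = rng.normal(size=(n, dim))        # stand-in hyponym embeddings O(u)
E_v = rng.normal(size=(n, dim))        # stand-in hypernym embeddings E(v)
labels = rng.integers(0, 2, size=n)    # 1 = hypernymy holds

# Feature vector per candidate pair: [O(u); E(v); O(u) - E(v)].
X = np.hstack([O_u, E_v, O_u - E_v])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:5]))
```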

Contributions

  1. For this purpose, we first design a dynamic weighting neural network to learn term embeddings based on not only the hypernym and hyponym terms, but also the contextual information between them. (propose a dynamic weighting neural network for learning term embeddings)
  2. We then apply such embeddings as features to identify taxonomic relations using a supervised method. (use the learned embeddings as features and classify with an SVM)

Methods

  • Very similar to the work of Yu et al. (2015) and their distance-margin neural network (worth a closer look)

1. Learning Term Embedding

  • Extracting taxonomic relations

    • Use the WordNet hierarchies to obtain all taxonomic relations, removing the top-level terms
  • Extracting training triples

    • From Wikipedia, collect sentences containing terms that hold a taxonomic relation; every other word in the sentence is treated as context
    • We use the Stanford parser (Manning et al., 2014) to parse each sentence and check whether any pair of nouns or noun phrases in it has a taxonomic relationship.
  • Training neural network

    • Specifically, the target of the neural network is to predict the hypernym term from the given hyponym term and contextual words.
    • All words are represented as one-hot vectors (this could perhaps be improved)
    • Weights are adjusted dynamically according to the number of context words
    • Softmax output layer (a rough sketch follows this list)
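A rough sketch of that prediction step. The dynamic weighting here is simply a mean over the hyponym and context embeddings (so each context word's weight shrinks as their number grows); this is an assumption for illustration, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim = 1000, 64
W_in = rng.normal(scale=0.1, size=(vocab, dim))   # one-hot -> dense embeddings
W_out = rng.normal(scale=0.1, size=(dim, vocab))  # softmax output weights

def predict_hypernym(hyponym_id, context_ids):
    # Hidden layer: mean of the hyponym and context embeddings.
    h = W_in[[hyponym_id] + list(context_ids)].mean(axis=0)
    logits = h @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()            # softmax over the vocabulary

p = predict_hypernym(42, [7, 99, 123])
print(p.argmax(), float(p.max()))
```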

2. Supervised Taxonomic Relation Identification

  • The learned embeddings serve as features
  • The SVM input features are the three components (x, y, x - y)

Datasets

  1. BLESS: covers 200 distinct, unambiguous concepts (terms), each of which is involved with other terms, called relata, in some relations.
  2. ENTAILMENT: consists of 2,770 pairs of terms, with an equal number of positive and negative examples of taxonomic relations.
  3. Animal, Plant, and Vehicle datasets: taxonomies constructed from dictionaries and data crawled from the Web for the corresponding domains.

Compared Methods

  1. SVM + Our
  2. SVM + Word2Vec
  3. SVM + Yu

Contributions

  • This paper proposes a novel method for semantic hierarchy construction based on word embeddings, which are trained using a large-scale corpus.
  • Generally speaking, the proposed method greatly improves the recall and F-score, but hurts the precision.

Methods

1. Word Embedding Training

2. A Uniform Linear Projection

Intuitively, we assume that all words can be projected to their hypernyms based on a uniform transition matrix.

3. Piecewise Linear Projections

Specifically, the input space is first segmented into several regions: all word pairs (x, y) in the training data are clustered into several groups, where word pairs in each group are expected to exhibit similar hypernym–hyponym relations. A sketch follows.
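A minimal sketch of the piecewise idea: cluster training pairs by their offset y - x, then fit one linear projection per cluster by least squares. The embedding pairs are random stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n, dim, k = 300, 20, 5

# Stand-in (hyponym x, hypernym y) embedding pairs.
X = rng.normal(size=(n, dim))
Y = X @ rng.normal(scale=0.3, size=(dim, dim)) + rng.normal(scale=0.01, size=(n, dim))

# Segment the input space: cluster pairs by their offset y - x.
groups = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y - X)

# Fit one transition matrix per cluster (least squares: min ||X @ Phi - Y||).
projections = {}
for g in range(k):
    Phi, *_ = np.linalg.lstsq(X[groups == g], Y[groups == g], rcond=None)
    projections[g] = Phi   # maps a hyponym vector toward its hypernym
print(len(projections), projections[0].shape)
```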

4. Removing Circles

If a circle has only two nodes, we remove the weakest path. If a circle has more than two nodes, we reverse the weakest path to form an indirect hypernym–hyponym relation.

Datasets

  • Learning word embeddings: Baidubaike, which contains about 30 million sentences (about 780 million words).
  • Chinese word segmentation is provided by the open-source Chinese language processing platform LTP.
  • The training data for projection learning is collected from CilinE.
  • For evaluation, we collect the hypernyms for 418 entities, which are selected randomly from Baidubaike.
  • The final data set contains 655 unique hypernyms and 1,391 hypernym–hyponym relations among them.
  • The labeled data is randomly split into 1/5 for development and 4/5 for testing.

Compared Methods

  • MWiki+CilinE refers to the manually-built hierarchy extension method of Suchanek et al. (2008).
  • MPattern refers to the pattern-based method of Hearst (1992).
  • MSnow refers to the method originally proposed by Snow et al. (2005).

Contributions

An approach to capture more precise context information and to incorporate neighbor information dynamically.

  1. First, we apply effective neighbor selection to reduce the number of neighbors.
  2. Second, we encode neighborhood information with context embeddings.
  3. Third, we further utilize an attention mechanism to focus on the most influential nodes, since different neighbors provide different levels of information.

Methods

1. Neighbor Selection

For each epoch t, we derive θ_e^t, the number of neighbors to be considered for an entity e.

2. Neighbor-based Representation

Each object (entity or relation) is represented by two vectors: one is called the object embedding, and the other the context embedding. An attention sketch follows.
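A minimal sketch of dot-product attention over an entity's neighbors; the scoring function and embeddings are illustrative stand-ins, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
dim = 32
entity = rng.normal(size=dim)          # the entity's object embedding
neighbors = rng.normal(size=(6, dim))  # context embeddings of selected neighbors

# Attention: weight each neighbor by its compatibility with the entity,
# then aggregate into one neighborhood-aware representation.
weights = softmax(neighbors @ entity)
aggregated = weights @ neighbors
print(weights.round(3), aggregated.shape)
```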


Contributions

A novel framework, ETF, to enrich large-scale, generic taxonomies with new concepts from resources such as news and research publications.

  1. We develop a novel, fully automated framework, ETF, that generates semantic text-vector embeddings for each new concept.
  2. We propose the use of a learning algorithm that combines a carefully selected set of graph-theoretic and semantic similarity based features to rank candidate parent relations.
  3. We test ETF on large, real-world, publicly available knowledge bases such as Wikipedia and Wordnet, and outperform baselines at the task of inserting new concepts.

Methods

1. Finding Concepts and Taxonomic Relations

Acquire the entities and categories from the given taxonomy structure, then obtain the novel concepts to be integrated into T.

2. Learning Concept Representations

  • To get the representation of an entity, we add a tf-weighted sum of the word2vec embeddings of its context terms to the doc2vec representation of its associated document.
  • After creating embeddings for the existing concepts in T, we next learn representations for the new concepts to be inserted into T (a sketch follows).
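A minimal sketch of that composition; the vectors and frequencies are stand-ins for real doc2vec/word2vec outputs:

```python
import numpy as np

def concept_vector(doc_vec, context_terms, term_freqs, word_vecs):
    """doc2vec vector of the concept's document plus a term-frequency-
    weighted sum of the word2vec vectors of its context terms."""
    weighted = sum(term_freqs[t] * word_vecs[t] for t in context_terms)
    return doc_vec + weighted

dim = 8
word_vecs = {"planet": np.ones(dim), "orbit": np.full(dim, 0.5)}  # stand-in word2vec
term_freqs = {"planet": 3, "orbit": 1}
doc_vec = np.zeros(dim)                                           # stand-in doc2vec
print(concept_vector(doc_vec, ["planet", "orbit"], term_freqs, word_vecs))
```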

3. Filtering and Ranking Potential Parents
