Introduction

https://img.shields.io/badge/Python-3.6-blue.svg
https://img.shields.io/badge/PyTorch-0.4.1-blue.svg

Cosmetic Entity Linking (CosmEL) is an entity linking tool for the cosmetic domain.

About

Cosmetic Entity Linking (CosmEL) is an entity linking tool for the cosmetic domain. We train an entity linking model from given articles with manual labels. Since manually labeling every mention is inefficient, we provide several decision-tree-based rules derived from careful observation of the data, and train the entity linking model on the labels generated automatically by these rules together with a small amount of manually labeled data.

Flow Chart

We split the CosmEL pipeline into four parts:

Database Generation

Generates the database (including brand and product information) from the raw CSV data.

Corpus Generation

Creates the corpus from plain-text articles. This step exports the rule-labeled articles; please manually label the exported XML files to create the golden labeled articles.

(Figure: flowchart.001.png)

Training

Trains the model using rule-labeled articles and golden labeled articles.

(Figure: flowchart.002.png)

Prediction

Predicts the mention labels using the model.

(Figure: flowchart.003.png)

Quick Start

Installation

One may install using either pip or Conda.

Install with pip

pip install torch==0.4.1 torchvision
pip install beautifulsoup4==4.6 gensim==3.6 lxml==4.2 numpy==1.15 scikit-learn==0.20 tqdm==4.27

Install with Conda

First, install Miniconda. Conda is an open source package management system; it quickly installs, runs, and updates packages and their dependencies.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh ./Miniconda3-latest-Linux-x86_64.sh

Make sure to prepend the Miniconda3 install location to PATH in your .bashrc. Restart your shell to pick up the new PATH, or export it manually:

export PATH="$HOME/miniconda3/bin:$PATH"

Next, create a new Conda environment for CosmEL, named cosmel, with Python version 3.6.

conda create -n cosmel python=3.6

Install Packages

First activate the CosmEL Conda environment:

source activate cosmel

Now, (cosmel) will be prepended to the prompt string:

(cosmel) <user>@<host>:~$

Next, install the Python packages:

conda install pytorch=0.4.1 -c pytorch
conda install beautifulsoup4=4.6 gensim=3.4 lxml=4.2 numpy=1.15 scikit-learn=0.20 tqdm=4.26

CKIPWS

Please install the CKIPWS package. Choose one of the following installation methods.

Install With Cython

Denote the root path of CKIPWS as <ckipws-root>.

pip install cython
LIBRARY_PATH=<ckipws-root>/lib:$LIBRARY_PATH pip install 'ckipws>=0.2'
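
Optionally, a quick import check confirms the package built and installed correctly. This is only a sketch, assuming the module is importable as ckipws (matching the pip package name); if the import fails because libWordSeg.so cannot be found, make <ckipws-root>/lib/ visible at run time (for example via LD_LIBRARY_PATH).

# Sketch only: verify that the ckipws Cython package is importable.
# The module name is assumed to match the pip package name (ckipws).
import ckipws
print('ckipws imported from', ckipws.__file__)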

Install Without Cython

Denote the root path of CosmEL (the folder containing this README) as <cosmel-root>, and the root path of CKIPWS as <ckipws-root>.

Replace <cosmel-root>/cosmel/util/ckipws.py with <cosmel-root>/cosmel/util/ckipws-old.py, then copy the following files:

cp <ckipws-root>/lib/libWordSeg.so <cosmel-root>/libWordSeg.so
cp <ckipws-root>/Data2             <cosmel-root>/Data2 -r

You may add <ckipws-root>/lib/ to LD_LIBRARY_PATH instead of copying libWordSeg.so.

Example

First, go to the root path of CosmEL (<cosmel-root>, the folder containing this README), and create the working directory for this example (data/demo/).

cd <cosmel-root>
mkdir -p data/demo

Database Generation

Generate database from demo/styleme.csv:

python3 ./util/database_generate.py -i demo/styleme.csv -d data/demo/repo

You can modify the files in data/demo/repo/etc/ to refine the database.

You may also use the predefined database by adding --etc:

python3 ./util/database_generate.py -i demo/styleme.csv -d data/demo/repo --etc

The database is stored in data/demo/repo/.

Training

In the training step, first generate the corpus (data/demo/corpus1/) from the articles (demo/original_article1/). Here demo/repo/ is used as the database.

python3 ./tool/corpusgen.py -c data/demo/corpus1 -d demo/repo -i demo/original_article1 -x data/demo/output/rid1  -X data/demo/output/nil1

The rule-labeled articles are exported to data/demo/output/rid1/, and the empty XML articles are exported to data/demo/output/nil1/. You may modify the gid flags in the empty XML articles for manual annotation. (For the HTML format, please refer to HTML Encoding and HTML Decoding.)
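
The sketch below shows one way to list the mentions that still need a gid. It is only an illustration: the mention element name used here is an assumption, so check the files under data/demo/output/nil1/ for the actual element and attribute names.

# Sketch only: list mentions whose gid flag is still unset in the exported XML articles.
# The 'mention' element name is an assumption; check the exported files under
# data/demo/output/nil1/ for the actual element and attribute names.
from pathlib import Path
from lxml import etree

parser = etree.XMLParser(recover=True)
for xml_file in sorted(Path('data/demo/output/nil1').glob('**/*.xml')):
    tree = etree.parse(str(xml_file), parser)
    for node in tree.iter('mention'):
        if not node.get('gid'):
            print(xml_file.name, etree.tostring(node, encoding='unicode').strip())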

Next, you may train word embeddings from the corpus (stored in data/demo/corpus1/embeddings/):

python3 ./util/word2vec.py -c data/demo/corpus1

Alternatively, you may use other embeddings, but make sure that all brand aliases are contained in the embedding vocabulary.
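
If you bring your own embeddings, a check along these lines can confirm coverage before training. This is only a sketch: it assumes the embeddings are a binary word2vec file, and the alias list is a placeholder for the aliases in the generated database (data/demo/repo/).

# Sketch only: check that brand aliases are present in a word2vec-format embedding file.
# The alias list below is a placeholder; the real aliases come from the generated
# database under data/demo/repo/.
from gensim.models import KeyedVectors

emb = KeyedVectors.load_word2vec_format('demo/emb1.bin', binary=True)  # assumes a binary word2vec file
aliases = ['brand_alias_1', 'brand_alias_2']  # hypothetical examples
missing = [alias for alias in aliases if alias not in emb.vocab]
if missing:
    print('Aliases missing from the embeddings:', missing)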

Finally, train the model using the corpus (data/demo/corpus1/), with manually-labeled articles demo/purged_article_gid_xml1/ and embeddings file demo/emb1.bin:

python3 ./tool/train.py -c data/demo/corpus1 -m data/demo/model1 -x demo/purged_article_gid_xml1 --emb demo/emb1.bin

The model data are stored in data/demo/model1/.

Prediction

In the prediction step, first generate the corpus (data/demo/corpus2/) from the articles (demo/original_article2/). Here demo/repo/ is used as the database.

python3 ./tool/corpusgen.py -c data/demo/corpus2 -d demo/repo -i demo/original_article2

Next, predict the mention labels of the corpus (data/demo/corpus2/) with the model data/demo/model1/:

python3 ./tool/predict.py -c data/demo/corpus2 -m data/demo/model1 -o data/demo/output/nid2

The results are exported to data/demo/output/nid2/.

Documentation

To build the documentation, please install the following packages.

(Using pip)

pip install sphinx==1.8.1 sphinx_rtd_theme==0.4.2 sphinxcontrib-programoutput==0.11

(Using Conda)

conda install sphinx=1.8.1 sphinx_rtd_theme=0.4.2
conda install sphinxcontrib-programoutput=0.11 -c conda-forge

Next, build the HTML documentation.

cd <cosmel-root>/docs
make html

The outputs are located in <cosmel-root>/docs/_build/html/.

You may also build PDF documentation using LaTeX if you have latexmk and xelatex installed.

make latex

The outputs are located in <cosmel-root>/docs/_build/latex/.