Introduction¶

Cosmetic Entity Linking (CosmEL) is an entity linking tool for the cosmetic domain.

Author¶

CKIP Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
- Mu Yang <emfomy@gmail.com>
- Chi-Yen Chen <jina199312@gmail.com>
- Yi-Hui Lee <lilyyhlee30@gmail.com>
- Wei-Yun Ma <ma@iis.sinica.edu.tw>

Links¶

GitHub https://github.com/emfomy/cosmel

Requirements¶

Program and Tools
- Python 3.6.
- CKIPWS Linux version.
- CKIPWS Cython Package 0.2.
Python Packages
- BeautifulSoup 4.6.
- gensim 3.6.
- lxml 4.2.
- NumPy 1.15.
- PyTorch 0.4.1.
- scikit-learn 0.20.
- tqdm 4.27.
Documentation Packages (Optional)
- sphinx 1.8.1.
- sphinx_rtd_theme 0.4.2.
- sphinxcontrib-programoutput 0.11.
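
As a quick sanity check, the installed packages can be compared against the versions listed above. The following is a minimal sketch and not part of CosmEL: the helper names (`version_matches`, `installed_version`, `find_mismatches`) are hypothetical, and the distribution names are assumed to be the ones used in the pip command in the Installation section. `importlib.metadata` is in the standard library from Python 3.8; on Python 3.6 the sketch falls back to the `importlib_metadata` backport, which would need to be installed separately.

```python
try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:  # Python 3.6/3.7: needs the importlib_metadata backport
    import importlib_metadata as metadata

# Versions listed under "Python Packages", keyed by pip distribution name.
REQUIRED = {
    "beautifulsoup4": "4.6",
    "gensim": "3.6",
    "lxml": "4.2",
    "numpy": "1.15",
    "torch": "0.4.1",
    "scikit-learn": "0.20",
    "tqdm": "4.27",
}


def version_matches(installed, required):
    """Return True if `installed` starts with the components of `required`."""
    wanted = required.split(".")
    return installed.split(".")[:len(wanted)] == wanted


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required=REQUIRED, lookup=installed_version):
    """Map each missing or differently versioned package to the version found."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        if found is None or not version_matches(found, wanted):
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches().items():
        print(f"{name}: need {REQUIRED[name]}, found {found}")
```

The comparison is by leading version components, so an installed 4.6.3 satisfies a required 4.6. This is an assumption about how strictly the versions need to match.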

About¶

Cosmetic Entity Linking (CosmEL) is an entity linking tool for the cosmetic domain. It trains an entity linking model from articles whose mentions have been labeled manually. Labeling every mention by hand is usually inefficient, so we provide several decision-tree-based rules, derived from careful observation of the data, that label mentions automatically. The entity linking model is then trained on this rule-labeled data together with a small amount of manually labeled data.

Flow Chart¶

The workflow is split into four parts:

Database Generation

Generates the database (including the information of brands and products) from raw CSV data.

See also

Specification - Database Generation

Quick Start - Database Generation

Corpus Generation

Creates the corpus from articles in plain text. This step exports the rule-labeled articles. Please manually label the XML files to create golden labeled articles.

See also

Specification - Corpus Generation

Quick Start - Training

Quick Start - Prediction

Training

Trains the model using rule-labeled articles and golden labeled articles.

See also

Specification - Model Training

Quick Start - Training

Prediction

Predicts the mention labels using the model.

See also

Specification - Model Prediction

Quick Start - Prediction

Quick Start¶

Installation¶

One may install using either pip or Conda.

Install with pip¶

pip install torch==0.4.1 torchvision
pip install beautifulsoup4==4.6 gensim==3.6 lxml==4.2 numpy==1.15 scikit-learn==0.20 tqdm==4.27

Install with Conda¶

First install Conda, an open source package management system that quickly installs, runs, and updates packages and their dependencies.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh ./Miniconda3-latest-Linux-x86_64.sh

Make sure to prepend the Miniconda3 install location to PATH in your .bashrc. Now restart your shell to enable PATH, or export it manually:

export PATH="$HOME/miniconda3/bin:$PATH"
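
To confirm that the PATH change took effect, one can check whether a `conda` executable is found. This is a minimal sketch using only the standard library; the helper name `find_conda` is hypothetical, and the suggested location assumes the default `$HOME/miniconda3` install path used above.

```python
import os
import shutil


def find_conda(path=None):
    """Return the full path of the `conda` executable on `path`, or None.

    When `path` is None, the PATH environment variable is searched.
    """
    return shutil.which("conda", path=path)


if __name__ == "__main__":
    location = find_conda()
    if location is None:
        expected = os.path.join(os.path.expanduser("~"), "miniconda3", "bin")
        print("conda not found; is", expected, "on PATH?")
    else:
        print("conda found at", location)
```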

Next, create a new Conda environment for CosmEL, named cosmel, with Python version 3.6.

conda create -n cosmel python=3.6

Install Packages¶

First activate the CosmEL Conda environment:

source activate cosmel

Now, (cosmel) will be prepended to the prompt string:

(cosmel) <user>@<host>:~$

Next, install the Python packages:

conda install pytorch=0.4.1 -c pytorch
conda install beautifulsoup4=4.6 gensim=3.4 lxml=4.2 numpy=1.15 scikit-learn=0.20 tqdm=4.26

CKIPWS¶

Please install the CKIPWS Cython Package. Choose one of the following sets of instructions.

Install With Cython¶

Denote the root path of CKIPWS as <ckipws-root>.

pip install cython
LIBRARY_PATH=<ckipws-root>/lib:$LIBRARY_PATH pip install "ckipws>=0.2"

Install Without Cython¶

Denote the root path of CosmEL (the folder containing this README) as <cosmel-root>, and the root path of CKIPWS as <ckipws-root>.

Replace <cosmel-root>/cosmel/util/ckipws.py with <cosmel-root>/cosmel/util/ckipws-old.py, then copy the following files:

cp <ckipws-root>/lib/libWordSeg.so <cosmel-root>/libWordSeg.so
cp -r <ckipws-root>/Data2 <cosmel-root>/Data2

You may add <ckipws-root>/lib/ to LD_LIBRARY_PATH instead of copying libWordSeg.so.
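
Before running CosmEL without Cython, it can help to confirm that the files are where they are expected. The sketch below is an illustration only and not part of CosmEL: the helper names are hypothetical, and it merely checks that the files exist in `<cosmel-root>` or in a directory listed in `LD_LIBRARY_PATH`. The dynamic loader's real search order is more involved.

```python
import os


def find_wordseg_library(cosmel_root, ld_library_path=None):
    """Return the first existing path to libWordSeg.so, or None.

    Looks in `cosmel_root` first, then in each directory of
    `ld_library_path` (defaults to the LD_LIBRARY_PATH variable).
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    extra_dirs = [d for d in ld_library_path.split(os.pathsep) if d]
    for directory in [cosmel_root] + extra_dirs:
        candidate = os.path.join(directory, "libWordSeg.so")
        if os.path.isfile(candidate):
            return candidate
    return None


def has_data2(cosmel_root):
    """Return True if the Data2 directory was copied into `cosmel_root`."""
    return os.path.isdir(os.path.join(cosmel_root, "Data2"))


if __name__ == "__main__":
    root = os.getcwd()  # assumes the script is run from <cosmel-root>
    print("libWordSeg.so:", find_wordseg_library(root))
    print("Data2 present:", has_data2(root))
```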

Example¶

First, go to the root path of CosmEL (<cosmel-root>, the folder containing this README), and create the working space for this example (data/demo/).

cd <cosmel-root>
mkdir -p data/demo

Database Generation¶

Generate database from demo/styleme.csv:

python3 ./util/database_generate.py -i demo/styleme.csv -d data/demo/repo

You can modify data/demo/repo/etc/ to improve the database.

You may also use the predefined database by adding --etc:

python3 ./util/database_generate.py -i demo/styleme.csv -d data/demo/repo --etc

The database is stored in data/demo/repo/.

Training¶

In the training step, first generate the corpus (data/demo/corpus1/) from the articles (demo/original_article1/). Here demo/repo/ is used as the database.

python3 ./tool/corpusgen.py -c data/demo/corpus1 -d demo/repo -i demo/original_article1 -x data/demo/output/rid1 -X data/demo/output/nil1

The rule-labeled articles are exported to data/demo/output/rid1/, and the empty XML articles are exported to data/demo/output/nil1/. You may modify the gid flags in the empty XML articles for manual annotation. (For the HTML format, please refer to HTML Encoding and HTML Decoding.)

Next, you may train word embeddings from the corpus (stored in data/demo/corpus1/embeddings/):

python3 ./util/word2vec.py -c data/demo/corpus1

Alternatively, use other embeddings, but make sure that they contain all brand aliases.
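
The requirement that every brand alias appears in the embeddings can be checked before training. The sketch below is illustrative and not part of CosmEL: `missing_aliases` is a hypothetical helper, and it assumes you have already obtained the alias list and the embedding vocabulary as plain strings. How to load them depends on the database and embedding format, which is not covered here.

```python
def missing_aliases(aliases, vocabulary):
    """Return the aliases absent from the embedding vocabulary.

    The result keeps the input order and lists each missing alias once.
    Matching is by exact string equality (an assumption of this sketch).
    """
    vocab = set(vocabulary)
    missing = []
    for alias in aliases:
        if alias not in vocab and alias not in missing:
            missing.append(alias)
    return missing


if __name__ == "__main__":
    # Made-up names for illustration only.
    aliases = ["brand_a", "brand_b", "brand_b"]
    vocabulary = ["brand_a", "lipstick"]
    print(missing_aliases(aliases, vocabulary))  # prints ['brand_b']
```

An empty result means the embeddings cover every alias under this exact-match assumption.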

Finally, train the model using the corpus (data/demo/corpus1/), with the manually labeled articles demo/purged_article_gid_xml1/ and the embeddings file demo/emb1.bin:

python3 ./tool/train.py -c data/demo/corpus1 -m data/demo/model1 -x demo/purged_article_gid_xml1 --emb demo/emb1.bin

The model data are stored in data/demo/model1/.

Prediction¶

In the prediction step, first generate the corpus (data/demo/corpus2/) from the articles (demo/original_article2/). Here demo/repo/ is used as the database.

python3 ./tool/corpusgen.py -c data/demo/corpus2 -d demo/repo -i demo/original_article2

Next, predict the labels of the corpus (data/demo/corpus2/) with the model data/demo/model1/.

python3 ./tool/predict.py -c data/demo/corpus2 -m data/demo/model1 -o data/demo/output/nid2

The results are exported to data/demo/output/nid2/.
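
The training and prediction commands above can also be collected into one small driver. The following is a sketch only and not part of CosmEL: the helper names are hypothetical, while the scripts, flags, and paths are copied from the commands in this example. The optional word2vec step is left out because the demo trains with demo/emb1.bin. The script only prints the commands; running them is left to the reader.

```python
import shlex


def corpusgen_cmd(corpus, repo, articles, rule_out=None, empty_out=None):
    """Build the corpusgen.py command; the -x/-X exports are optional."""
    cmd = ["python3", "./tool/corpusgen.py",
           "-c", corpus, "-d", repo, "-i", articles]
    if rule_out is not None:
        cmd += ["-x", rule_out]   # rule-labeled articles
    if empty_out is not None:
        cmd += ["-X", empty_out]  # empty XML articles for manual labeling
    return cmd


def train_cmd(corpus, model, gold_xml, emb):
    """Build the train.py command."""
    return ["python3", "./tool/train.py",
            "-c", corpus, "-m", model, "-x", gold_xml, "--emb", emb]


def predict_cmd(corpus, model, output):
    """Build the predict.py command."""
    return ["python3", "./tool/predict.py",
            "-c", corpus, "-m", model, "-o", output]


def demo_commands():
    """Return the demo commands in the order they appear in this example."""
    return [
        corpusgen_cmd("data/demo/corpus1", "demo/repo",
                      "demo/original_article1",
                      "data/demo/output/rid1", "data/demo/output/nil1"),
        train_cmd("data/demo/corpus1", "data/demo/model1",
                  "demo/purged_article_gid_xml1", "demo/emb1.bin"),
        corpusgen_cmd("data/demo/corpus2", "demo/repo",
                      "demo/original_article2"),
        predict_cmd("data/demo/corpus2", "data/demo/model1",
                    "data/demo/output/nid2"),
    ]


if __name__ == "__main__":
    for cmd in demo_commands():
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute the steps instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)` from `<cosmel-root>`, so that a failing step stops the sequence.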

Documentation¶

To build the documentation, please install the following packages.

(Using pip)

pip install sphinx==1.8.1 sphinx_rtd_theme==0.4.2 sphinxcontrib-programoutput==0.11

(Using Conda)

conda install sphinx=1.8.1 sphinx_rtd_theme=0.4.2
conda install sphinxcontrib-programoutput=0.11 -c conda-forge

Next, build the HTML documentation.

cd <cosmel-root>/docs
make html

The outputs are located in <cosmel-root>/docs/_build/html/.

You may also build PDF documentation using LaTeX if you have latexmk and xelatex installed.

make latex

The outputs are located in <cosmel-root>/docs/_build/latex/.