Data Structure

CosmEL defines three groups of data structures: one for the cosmetics database, one for the blog articles, and one for the output of the CKIP tools.

Repository

To maintain the cosmetics database from PIXNET in an organized way, we create a data structure named Repo, containing several properties of Brand and Product.

Brand contains the aliases of each brand. See more details in Brand.

Product has the following attributes: the product ID, the corresponding brand, the product name, the product head word, and the description. The brand, name, and description are also word-segmented by the CKIPWS tool and stored in segmented form. A full product name can be split into four parts: brand, infix, head word, and suffix.
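The four-part split of a product name can be illustrated with a small sketch. The class name ProductSketch, its field names, and the sample values below are all hypothetical and are not taken from the CosmEL source; the sketch also assumes the parts are joined without a separator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductSketch:
    """Hypothetical stand-in for a Repo product entry (not the CosmEL class)."""
    pid: str          # product ID
    brand: str        # brand part of the full name
    infix: str        # words between the brand and the head word
    head: str         # head word, e.g. the product type
    suffix: str = ""  # trailing words after the head word
    descr: str = ""   # product description

    @property
    def full_name(self) -> str:
        # Assumption: the four parts are concatenated with no separator.
        return self.brand + self.infix + self.head + self.suffix


# Invented placeholder values, only to show the decomposition.
product = ProductSketch(pid="0", brand="brand", infix="infix",
                        head="head", suffix="suffix")
print(product.full_name)
```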

Corpus

To preserve all the information offered by PIXNET blog articles, we convert this information into several structured data objects. The main object, Corpus, contains the mention information (MentionSet) and the article information in both a word-segmented version (ArticleSet) and a parsed version (ParsedArticleSet).

MentionSet, consisting of all mentions in the corpus, is constructed from MentionBundleSet, where a MentionBundle comprises all Mention objects in an Article. See more details in MentionSet.
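The relationship between per-article bundles and the flat set of mentions can be sketched as follows. MentionSketch, MentionBundleSketch, and flatten_bundles are hypothetical names for illustration only; the actual CosmEL classes may store different fields and expose a different interface.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import List


@dataclass
class MentionSketch:
    """Hypothetical mention record: article ID, sentence index, word index."""
    aid: str
    sid: int
    mid: int


@dataclass
class MentionBundleSketch:
    """Hypothetical bundle holding every mention of one article."""
    aid: str
    mentions: List[MentionSketch] = field(default_factory=list)


def flatten_bundles(bundles: List[MentionBundleSketch]) -> List[MentionSketch]:
    # The corpus-level mention set is the concatenation of all bundles.
    return list(chain.from_iterable(b.mentions for b in bundles))


# Invented article IDs and indices, only to show the grouping.
bundles = [
    MentionBundleSketch("article_a", [MentionSketch("article_a", 0, 6),
                                      MentionSketch("article_a", 0, 9)]),
    MentionBundleSketch("article_b", [MentionSketch("article_b", 2, 1)]),
]
all_mentions = flatten_bundles(bundles)
print(len(all_mentions))
```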

An Article contains the information of a single article, including its article ID, the parsed version of the article, and the bundle of mentions in the article. ArticleSet is the collection of Article objects. See more details in Article.

A ParsedArticle contains parsed sentences of the original Article and other information of the corresponding Article. Also, ParsedArticleSet is the collection of ParsedArticle. See more details in ParsedArticle.

Text and Word

All texts are purged in our framework. Briefly, we convert them to lowercase, remove special symbols, remove spaces adjacent to non-alphabetic characters, and replace the remaining spaces with ‘□’. See more details in the source code of PurgeString.
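As a rough illustration of these four steps, the following sketch approximates the purging with regular expressions. It is not the PurgeString implementation: the function name purge_string_sketch is hypothetical, "special symbols" is assumed to mean any character that is neither a word character nor whitespace, and "non-alphabetic" is assumed to mean anything other than the ASCII letters a to z.

```python
import re


def purge_string_sketch(text: str) -> str:
    """Approximate the purging steps; NOT the real PurgeString."""
    # Step 1: convert to lowercase.
    text = text.lower()
    # Step 2 (assumption): drop everything that is not a word character
    # or whitespace. CJK characters and digits count as word characters.
    text = re.sub(r"[^\w\s]", "", text)
    # Normalize whitespace runs to single spaces and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3 (assumption): remove a space unless both of its neighbours
    # are ASCII letters.
    text = re.sub(r"(?<![a-z]) | (?![a-z])", "", text)
    # Step 4: mark the remaining spaces with the box character.
    return text.replace(" ", "□")


print(purge_string_sketch("Hello,  World!"))
print(purge_string_sketch("唇膏 Rouge 01"))
```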

To work with the CKIP tools, including the word segmentation tool and the parser, we also create several data structures, all collected in the utility modules.

WsWords stores the segmented words as a list of str. The PoS (part of speech) information and the role information (the word roles: brand, infix, and head) are likewise stored as lists of str. See more details in WsWords.
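The parallel-list layout described above can be sketched as below. WsWordsSketch, its attribute names (txts, tags, roles), the helper with_role, and the tag and role labels are assumptions made for illustration; consult WsWords for the actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WsWordsSketch:
    """Hypothetical container of three parallel lists of str."""
    txts: List[str]   # segmented words
    tags: List[str]   # part-of-speech tag of each word
    roles: List[str]  # role of each word; "" when the word has no role

    def __post_init__(self) -> None:
        # The three lists describe the same words, so lengths must agree.
        if not (len(self.txts) == len(self.tags) == len(self.roles)):
            raise ValueError("txts, tags and roles must have equal length")

    def __len__(self) -> int:
        return len(self.txts)

    def with_role(self, role: str) -> List[str]:
        # Collect the words carrying the given role label.
        return [t for t, r in zip(self.txts, self.roles) if r == role]


# Placeholder words, tags and role labels (not real CKIPWS output).
ws = WsWordsSketch(
    txts=["w0", "w1", "w2", "w3"],
    tags=["N", "A", "N", "N"],
    roles=["Brand", "Infix", "Head", ""],
)
print(ws.with_role("Head"))
```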

XML Format

We use XML/HTML tags to represent mention attributes.

Syntax:

<product sid="SID" mid="MID" gid="GID" nid="NID" rid="RID" rule="RULE">MENTION</product>

Attribute Values:

SID
the sentence index in the article.
MID
the mention word index in the sentence.
GID
the golden ID (human labeled).
NID
the network-predicted ID.
RID
the rule-labeled ID.
RULE
the rule name.
MENTION
the mention word.

Take the title of the article part-00001/choyce96_324901572 as an example:

巴黎萊雅限量玫瑰珍藏版訂製<product sid="0" mid="6" gid="6097" nid="" rid="6097" rule="P_rule1">唇膏</product>、
奢華<product sid="0" mid="9" gid="4339" rid="OSP" rule="OSP_rule4">唇釉</product>新品體驗

Here we have two mentions, 唇膏 and 唇釉.

唇膏
From the attributes, we know that the mention is the 6th term of the 0th sentence, linked to product PID6097 by the P_rule1 rule, and linked to product PID6097 by a human annotator.
唇釉
From the attributes, we know that the mention is the 9th term of the 0th sentence, linked to an Other Specific Product (OSP) by the OSP_rule4 rule, and linked to product PID4339 by a human annotator.
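The attributes of such a line can also be read programmatically. The sketch below is not part of CosmEL: the function extract_mentions is a hypothetical helper that wraps the line in a temporary root element and parses it with Python's standard xml.etree.ElementTree. It assumes the line is well-formed XML once wrapped (for example, that it contains no unescaped & characters) and treats a missing attribute as an empty string.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_mentions(line: str) -> List[Dict[str, str]]:
    """Return the attributes and text of every <product> tag in a line."""
    # Wrap the line so that the mixed text and tags form one XML document.
    root = ET.fromstring(f"<line>{line}</line>")
    mentions = []
    for node in root.iter("product"):
        mentions.append({
            "sid": node.get("sid", ""),
            "mid": node.get("mid", ""),
            "gid": node.get("gid", ""),
            "nid": node.get("nid", ""),   # absent in the second tag below
            "rid": node.get("rid", ""),
            "rule": node.get("rule", ""),
            "mention": node.text or "",
        })
    return mentions


# The article title from the example above, joined into one string.
title = (
    '巴黎萊雅限量玫瑰珍藏版訂製<product sid="0" mid="6" gid="6097" nid="" '
    'rid="6097" rule="P_rule1">唇膏</product>、'
    '奢華<product sid="0" mid="9" gid="4339" rid="OSP" '
    'rule="OSP_rule4">唇釉</product>新品體驗'
)
for m in extract_mentions(title):
    print(m["mention"], m["sid"], m["mid"], m["gid"], m["rid"], m["rule"])
```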