.. _SectionDataStructure:

Data Structure
==============

In CosmEL, we define data structures for the cosmetics database, the blog articles, and the CKIP tools, respectively.

Repository
----------

To maintain the cosmetics database from PIXNET in an organized way, we create a data structure named :class:`.Repo`, whose properties hold :class:`.Brand` and :class:`.Product` objects.

A **Brand** contains the aliases of a brand. See more details in :class:`.Brand`.

A **Product** has the attributes product ID, corresponding brand, product name, product head word, and description. In addition, the product brand, name, and description are word-segmented by the CKIPWS tool and stored in segmented form. A full product name can be split into four parts: **brand**, **infix**, **head word**, and **suffix**.

Corpus
------

To preserve all the information provided by the PIXNET blog articles, we convert it into several structured data objects. The main object, :class:`.Corpus`, contains the mentions (:class:`.MentionSet`) and the articles in both a word-segmented version (:class:`.ArticleSet`) and a parsed version (:class:`.ParsedArticleSet`).

**MentionSet**, consisting of all mentions in the corpus, is constructed from :class:`.MentionBundleSet`, where a :class:`.MentionBundle` comprises all :class:`.Mention` objects in an :class:`.Article`. See more details in :class:`.MentionSet`.

An **Article** contains the information of an article, including its article ID, the parsed version of the article, and the mention bundle of the article. In addition, :class:`.ArticleSet` is the collection of :class:`.Article` objects. See more details in :class:`.Article`.

A **ParsedArticle** contains the parsed sentences of the original **Article** and other information about the corresponding :class:`.Article`. Likewise, :class:`.ParsedArticleSet` is the collection of :class:`.ParsedArticle` objects. See more details in :class:`.ParsedArticle`.

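
The containment described above (a mention set built from per-article bundles of mentions) can be illustrated with a minimal sketch. The class names ``SketchMention`` and ``SketchMentionBundle``, their fields, and the helper functions are hypothetical and are not the CosmEL API; see :class:`.Mention` and :class:`.MentionBundle` for the real interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchMention:
    """One mention (hypothetical fields, not the CosmEL Mention class)."""
    aid: str   # article ID
    sid: int   # sentence index in the article
    mid: int   # word index in the sentence
    text: str  # the mention word


@dataclass
class SketchMentionBundle:
    """All mentions that belong to a single article."""
    aid: str
    mentions: List[SketchMention] = field(default_factory=list)


def bundle_by_article(mentions: List[SketchMention]) -> Dict[str, SketchMentionBundle]:
    """Group a flat list of mentions into one bundle per article ID."""
    bundles: Dict[str, SketchMentionBundle] = {}
    for m in mentions:
        bundles.setdefault(m.aid, SketchMentionBundle(m.aid)).mentions.append(m)
    return bundles


def flatten(bundles: Dict[str, SketchMentionBundle]) -> List[SketchMention]:
    """Recover the corpus-wide list of mentions from the bundles."""
    return [m for b in bundles.values() for m in b.mentions]


# Two mentions from the same article end up in the same bundle.
demo = [
    SketchMention("choyce96_324901572", 0, 6, "唇膏"),
    SketchMention("choyce96_324901572", 0, 9, "唇釉"),
]
demo_bundles = bundle_by_article(demo)
```

Keying the bundles by article ID keeps the per-article view and the corpus-wide view consistent, since the flat list is always derived from the bundles.
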
Text and Word
-------------

All texts are purged in our framework. Briefly, we convert them to lower case, remove special symbols, remove spaces adjacent to non-alphabetic characters, and replace the remaining spaces with '□'. See more details in the source code of :class:`.PurgeString`.

To use the CKIP tools, including the word segmentation tool and the parser, we also create several data structures, all collected in the utility modules.

**WsWords** stores the segmented words as a :class:`list` of :class:`str`. The PoS (part of speech) information and the role information (denoting the word roles: brand, infix, and head) are likewise stored as :class:`list`\ s of :class:`str`. See more details in :class:`.WsWords`.

.. _XMLFormat:

XML Format
----------

We use XML/HTML tags to represent mention attributes.

Syntax:

.. code-block:: xml

   MENTION

Attribute Values:

SID
   the sentence index in the article.

MID
   the mention word index in the sentence.

GID
   the golden ID (human labeled).

NID
   the network-predicted ID.

RID
   the rule-labeled ID.

RULE
   the rule name.

MENTION
   the mention word.

Take the title of the article *part-00001/choyce96_324901572* as an example:

.. code-block:: xml

   巴黎萊雅限量玫瑰珍藏版訂製唇膏、 奢華唇釉新品體驗

Here we have two mentions, **唇膏** and **唇釉**.

For **唇膏**, the attributes show that the mention is the **6th** term of the **0th** sentence, linked to the **PID6097** product by the **P_rule1** rule, and linked to the **PID6097** product by the human annotator.

For **唇釉**, the attributes show that the mention is the **9th** term of the **0th** sentence, linked to an **Other Specific Product** by the **OSP_rule4** rule, and linked to the **PID4339** product by the human annotator.
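
Reading the mention attributes back from a tagged sentence might look like the following minimal sketch. The element name ``mention``, the lower-case attribute names, and the way the ID values are encoded in ``sample`` are all assumptions made for illustration; only the attribute meanings and the example values (term indexes, rule names, product numbers) come from the description above. The function ``extract_mentions`` is hypothetical and not part of CosmEL.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_mentions(tagged_sentence: str) -> List[Dict[str, str]]:
    """Return one dict per inline element: its attributes plus its text.

    The sentence is wrapped in a dummy root element so that plain text
    mixed with inline tags becomes a well-formed XML document. Any
    element name is accepted, so the sketch does not depend on the
    actual tag used by CosmEL.
    """
    root = ET.fromstring(f"<root>{tagged_sentence}</root>")
    return [dict(child.attrib, text=child.text or "") for child in root]


# Hypothetical encoding of the two mentions described above.
sample = (
    "巴黎萊雅限量玫瑰珍藏版訂製"
    '<mention sid="0" mid="6" gid="6097" rid="6097" rule="P_rule1">唇膏</mention>'
    "、奢華"
    '<mention sid="0" mid="9" gid="4339" rid="OSP" rule="OSP_rule4">唇釉</mention>'
    "新品體驗"
)
mentions = extract_mentions(sample)
for m in mentions:
    print(m["text"], m["sid"], m["mid"], m["rule"])
```

Attribute values are returned as strings; converting ``sid`` and ``mid`` to integers is left to the caller.
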