Utility Specification

Article Preprocessing

Purge the texts and apply word segmentation.

Usage:

usage: article_preprocess.py [-h] -c CORPUS [-t THREAD]

CosmEL: Preprocesses Article.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
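
Example:

Assuming the corpus data lives in a hypothetical directory "corpus/" and the scripts are run with Python 3:

  python3 article_preprocess.py -c corpus -t 4

Omitting -t lets the script use all available CPU cores.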

Requirement:

<CORPUS>/repo/
the repository. [from Database Generation]
<CORPUS>/article/original_article/
the original articles. [from HTML Preprocessing]

Output:

<CORPUS>/html/purged_article/
the purged articles.
<CORPUS>/article/purged_article_ws_re#/
the temporary word-segmented articles.
<CORPUS>/article/bad_article.txt
the list of bad articles (those with overly long sentences).

Article Postprocessing

Add roles to the word-segmented articles.

Usage:

usage: article_postprocess.py [-h] -c CORPUS [-t THREAD]

CosmEL: Postprocesses Article.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
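
Example:

Using the same hypothetical "corpus/" directory:

  python3 article_postprocess.py -c corpus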

Requirement:

<CORPUS>/repo/
the repository. [from Database Generation]
<CORPUS>/article/purged_article_ws_re#/
the temporary word-segmented articles. [from Article Preprocessing]

Output:

<CORPUS>/article/purged_article_ws/
the word-segmented articles.
<CORPUS>/article/purged_article_role/
the word-segmented articles with roles.

HTML Decoding

Extract mentions from HTML.

Usage:

usage: html_decode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

CosmEL: Decode HTML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -i INPUT, --input INPUT
                        load HTML from "<CORPUS>/html/<INPUT>/"
  -o OUTPUT, --output OUTPUT
                        dump mention into "data/<ver>/mention/<OUTPUT>/";
                        default is <INPUT>
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
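
Example:

Assuming the HTML files to decode sit in a hypothetical directory "corpus/html/my_html/":

  python3 html_decode.py -c corpus -i my_html

Without -o, the mention files are written to a directory named after <INPUT>.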

Requirement:

<CORPUS>/html/<INPUT>/
the HTML files.
<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]
<CORPUS>/mention/purged_article/
the mention files. [from Mention Detection]

Output:

<CORPUS>/mention/<OUTPUT>/
the mention files.

HTML Encoding

Embed mentions into HTML.

Usage:

usage: html_encode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

CosmEL: Encode HTML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -i INPUT, --input INPUT
                        load mention from "<CORPUS>/mention/<INPUT>/"
  -o OUTPUT, --output OUTPUT
                        dump HTML into "<CORPUS>/html/<OUTPUT>/"; default is
                        <INPUT>
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
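
Example:

Assuming the mention files sit in a hypothetical directory "corpus/mention/my_mention/" and the encoded HTML should go to "corpus/html/my_html/":

  python3 html_encode.py -c corpus -i my_mention -o my_html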

Requirement:

<CORPUS>/html/html_article/
the HTML files. [from HTML Preprocessing]
<CORPUS>/html/purged_article_idx/
the index files. [from HTML Postprocessing]
<CORPUS>/mention/<INPUT>/
the mention files.

Output:

<CORPUS>/html/<OUTPUT>/
the encoded HTML files.

HTML Extracting

Extract HTML articles from JSON sources.

Usage:

usage: html_extract.py [-h] -c CORPUS [-t THREAD]

CosmEL: Extracting HTML files.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
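
Example:

Using the hypothetical "corpus/" directory:

  python3 html_extract.py -c corpus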

Requirement:

<CORPUS>/html/article_filtered/
the JSON files.

Output:

<CORPUS>/html/html_article/
the HTML files.

HTML Preprocessing

Extract texts from HTML articles.

Usage:

usage: html_preprocess.py [-h] -c CORPUS [-t THREAD]

CosmEL: Preprocesses HTML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
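
Example:

Using the hypothetical "corpus/" directory:

  python3 html_preprocess.py -c corpus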

Requirement:

<CORPUS>/html/html_article/
the HTML files. [from HTML Extracting]

Output:

<CORPUS>/html/html_article_notag/
the HTML files without HTML tags.
<CORPUS>/article/original_article/
the text files extracted from the above HTML files.

HTML Postprocessing

Index the HTML files with the word-segmented articles.

Usage:

usage: html_postprocess.py [-h] -c CORPUS [-t THREAD]

CosmEL: Postprocesses HTML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
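
Example:

Using the hypothetical "corpus/" directory:

  python3 html_postprocess.py -c corpus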

Requirement:

<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]
<CORPUS>/html/html_article_notag/
the HTML files without HTML tags. [from HTML Preprocessing]

Output:

<CORPUS>/html/purged_article_idx/
the index files.

Mention Merging

Merge the mentions.

Usage:

usage: mention_detect.py [-h] -c CORPUS [-t THREAD]

CosmEL: Detect Mention.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`

Requirement:

<CORPUS>/mention/<BASE>/
the base mention files.
<CORPUS>/mention/<INPUT>/
the input mention files.

Output:

<CORPUS>/mention/<OUTPUT>/
the output mention files.

Mention Detection

Detect the mentions in the articles.

Usage:

usage: mention_detect.py [-h] -c CORPUS [-t THREAD]

CosmEL: Detect Mention.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
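
Example:

Using the hypothetical "corpus/" directory:

  python3 mention_detect.py -c corpus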

Requirement:

<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]

Output:

<CORPUS>/mention/purged_article/
the mention files.

Database Generation

Generate the repository.

Usage:

usage: database_generate.py [-h] -i INPUT -d DATABASE [--etc]

CosmEL: Generate Database.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        load CSV from file <INPUT>
  -d DATABASE, --database DATABASE
                        save/load database from directory <DATABASE>
  --etc                 use predefined etc files (from ./etc)
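
Example:

Assuming the purged CSV is named "products.csv" (a hypothetical name) and the repository should be built under "corpus/repo/":

  python3 database_generate.py -i products.csv -d corpus/repo --etc

The --etc flag uses the predefined setting files shipped in ./etc.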

Requirement:

<INPUT>
the original purged CSV file.
<DATABASE>/etc/
the setting files.

Output:

<DATABASE>/
the repository files.

Rule Annotation

Annotate the mentions using the rules.

Usage:

usage: rule_exact.py [-h] -c CORPUS [-t THREAD]

CosmEL: Annotate by Rule (Exact Match).

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`

usage: rule_parser.py [-h] -c CORPUS [-t THREAD]

CosmEL: Annotate by Rule (Parser).

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
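
Example:

Using the hypothetical "corpus/" directory:

  python3 rule_exact.py -c corpus
  python3 rule_parser.py -c corpus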

Requirement:

<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]
<CORPUS>/mention/purged_article/
the mention files. [from Mention Detection]

Output:

<CORPUS>/xml/purged_article_ws_rid/
the rule-labeled XML articles.

Sentence Parsing

Parse the sentences.

Usage:

usage: parse.py [-h] -c CORPUS [-t THREAD] --host HOST --port PORT

CosmEL: Parse Sentence.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
  --host HOST           connect to host with IP <HOST>
  --port PORT           connect to host with port <PORT>
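
Example:

Assuming a parser server is reachable at the hypothetical address 192.168.0.1 on port 8000:

  python3 parse.py -c corpus --host 192.168.0.1 --port 8000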

Requirement:

<CORPUS>/article/purged_article_ws_re#/
the temporary word-segmented articles. [from Article Preprocessing]

Output:

<CORPUS>/article/parsed_article_parse/
the parsed articles.

Word2Vec

Embed words using Word2Vec.

Usage:

usage: word2vec.py [-h] -c CORPUS [-t THREAD]

CosmEL: Word2Vec.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
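
Example:

Using the hypothetical "corpus/" directory:

  python3 word2vec.py -c corpus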

Requirement:

<CORPUS>/repo/
the repository. [from Database Generation]
<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]

Output:

<CORPUS>/embeddings/purged_article.dim300.emb.bin
the embedding file.

XML Decoding

Extract mentions from XML.

Usage:

usage: xml_decode.py [-h] -c CORPUS [-iw INPUT_WS] -i INPUT [-o OUTPUT]
                     [-t THREAD]

CosmEL: Decode XML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -iw INPUT_WS, --input-ws INPUT_WS
                        load word-segmented XML from "<CORPUS>/xml/<INPUT-
                        WS>/"
  -i INPUT, --input INPUT
                        load XML from "<CORPUS>/xml/<INPUT>/"
  -o OUTPUT, --output OUTPUT
                        dump mention into "data/<ver>/mention/<OUTPUT>/";
                        default is <INPUT>
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
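
Example:

Assuming the XML files to decode sit in a hypothetical directory "corpus/xml/my_xml/":

  python3 xml_decode.py -c corpus -i my_xml

If only a word-segmented version is available, pass it with -iw instead; the XML under <INPUT> is then generated from it.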

Requirement:

<CORPUS>/xml/<INPUT_WS>/
the word-segmented XML files. [optional]
<CORPUS>/xml/<INPUT>/
the XML files. [required if <INPUT_WS> is not set]
<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]
<CORPUS>/mention/purged_article/
the mention files. [from Mention Detection]

Output:

<CORPUS>/xml/<INPUT>/
the XML files. [generated if <INPUT_WS> is set]
<CORPUS>/mention/<OUTPUT>/
the mention files.

XML Encoding

Encode articles and mentions into XML.

Usage:

usage: xml_encode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

CosmEL: Encode XML.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -i INPUT, --input INPUT
                        load mention from "<CORPUS>/mention/<INPUT>/"
  -o OUTPUT, --output OUTPUT
                        dump XML into "<CORPUS>/xml/<OUTPUT>/"; default is
                        <INPUT>
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
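
Example:

Assuming the mention files sit in a hypothetical directory "corpus/mention/my_mention/":

  python3 xml_encode.py -c corpus -i my_mention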

Requirement:

<CORPUS>/article/purged_article_role/
the word-segmented articles with roles. [from Article Postprocessing]
<CORPUS>/mention/<INPUT>/
the mention files.

Output:

<CORPUS>/xml/<OUTPUT>/
the encoded XML files.