Utility Specification
Article Preprocessing
Purge the texts and apply word segmentation.
Usage:
    usage: article_preprocess.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Preprocesses Article.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/repo/
- the repository. [from Database Generation]
- <CORPUS>/article/original_article/
- the original articles. [from HTML Preprocessing]
Output:
- <CORPUS>/html/purged_article/
- the purged articles.
- <CORPUS>/article/purged_article_ws_re<#>/
- the temporary word-segmented articles.
- <CORPUS>/article/bad_article.txt
- the list of bad articles (those with overly long sentences).
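The usage text above maps onto a small `argparse` definition. The following is a minimal sketch reconstructed from the help text only; it is not the actual source of `article_preprocess.py`. The function name `build_parser` and the directory name `corpus_dir` are hypothetical, and treating `THREAD` as an integer is an assumption.

```python
import argparse
import os


def build_parser():
    """Hypothetical reconstruction of a parser matching the help text."""
    parser = argparse.ArgumentParser(
        prog="article_preprocess.py",
        description="CosmEL: Preprocesses Article.",
    )
    # -c/--corpus appears without brackets in the usage line, so it is required.
    parser.add_argument(
        "-c", "--corpus", required=True,
        help='store corpus data in directory "<CORPUS>/"',
    )
    # -t/--thread is optional; the documented default is os.cpu_count().
    # Assumption: the value is an integer thread count.
    parser.add_argument(
        "-t", "--thread", type=int, default=os.cpu_count(),
        help="use <THREAD> threads; default is `os.cpu_count()`",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["-c", "corpus_dir"])
print(args.corpus, args.thread)
```

Most utilities in this specification share the same `-c` and `-t` options, so a parser of this shape would apply to them as well, with extra options added where the usage text lists them.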
Article Postprocessing
Add roles into word-segmented articles.
Usage:
    usage: article_postprocess.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Postprocesses Article.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/repo/
- the repository. [from Database Generation]
- <CORPUS>/article/purged_article_ws_re<#>/
- the temporary word-segmented articles. [from Article Preprocessing]
Output:
- <CORPUS>/article/purged_article_ws/
- the word-segmented articles.
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles.
HTML Decoding
Extract mentions from HTML.
Usage:
    usage: html_decode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

    CosmEL: Decode HTML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -i INPUT, --input INPUT
                            load HTML from "<CORPUS>/html/<INPUT>/"
      -o OUTPUT, --output OUTPUT
                            dump mention into "data/<ver>/mention/<OUTPUT>/";
                            default is <INPUT>
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/html/<INPUT>/
- the HTML files.
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
- <CORPUS>/mention/purged_article/
- the mention files. [from Mention Detection]
Output:
- <CORPUS>/mention/<OUTPUT>/
- the mention files.
HTML Encoding
Embed mentions into HTML.
Usage:
    usage: html_encode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

    CosmEL: Encode HTML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -i INPUT, --input INPUT
                            load mention from "<CORPUS>/mention/<INPUT>/"
      -o OUTPUT, --output OUTPUT
                            dump HTML into "<CORPUS>/html/<OUTPUT>/"; default is
                            <INPUT>
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/html/html_article/
- the HTML files. [from HTML Preprocessing]
- <CORPUS>/html/purged_article_idx/
- the index files. [from HTML Postprocessing]
- <CORPUS>/mention/<INPUT>/
- the mention files.
Output:
- <CORPUS>/html/<OUTPUT>/
- the encoded HTML files.
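The help text states that `<OUTPUT>` defaults to `<INPUT>` when `-o` is omitted. A minimal sketch of how that default could be resolved into the two directories named above follows; the function name `resolve_paths` is hypothetical and this is not taken from `html_encode.py` itself.

```python
from pathlib import Path


def resolve_paths(corpus, input_name, output_name=None):
    """Return (mention_dir, html_dir) for html_encode.py-style arguments."""
    if output_name is None:
        # Documented default: <OUTPUT> falls back to <INPUT>.
        output_name = input_name
    root = Path(corpus)
    mention_dir = root / "mention" / input_name   # <CORPUS>/mention/<INPUT>/
    html_dir = root / "html" / output_name        # <CORPUS>/html/<OUTPUT>/
    return mention_dir, html_dir
```

With only an input name given, both directories end in the same name; passing a separate output name changes only the HTML directory.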
HTML Extracting
Extract HTML articles from JSON sources.
Usage:
    usage: html_extract.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Extracting HTML files.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/html/article_filtered/
- the JSON files.
Output:
- <CORPUS>/html/html_article/
- the HTML files.
HTML Preprocessing
Extract texts from HTML articles.
Usage:
    usage: html_preprocess.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Preprocesses HTML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/html/html_article/
- the HTML files. [from HTML Extracting]
Output:
- <CORPUS>/html/html_article_notag/
- the HTML files without HTML tags.
- <CORPUS>/article/original_article/
- the text files extracted from above.
HTML Postprocessing
Index HTML with word-segmented articles.
Usage:
    usage: html_postprocess.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Postprocesses HTML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
- <CORPUS>/html/html_article_notag/
- the HTML files without HTML tags. [from HTML Preprocessing]
Output:
- <CORPUS>/html/purged_article_idx/
- the index files.
Mention Merging
Merge the mentions.
Usage:
    usage: mention_detect.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Detect Mention.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/mention/<BASE>/
- the base mention files.
- <CORPUS>/mention/<INPUT>/
- the input mention files.
Output:
- <CORPUS>/mention/<OUTPUT>/
- the output mention files.
Mention Detection
Detect the mentions from articles.
Usage:
    usage: mention_detect.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Detect Mention.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
Output:
- <CORPUS>/mention/purged_article/
- the mention files.
Database Generation
Generate the repository.
Usage:
    usage: database_generate.py [-h] -i INPUT -d DATABASE [--etc]

    CosmEL: Generate Database.

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT, --input INPUT
                            load CSV from file <INPUT>
      -d DATABASE, --database DATABASE
                            save/load database from directory <DATABASE>
      --etc                 use predefined etc files (from ./etc)
Requirement:
- <INPUT>
- the original purged CSV file.
- <DATABASE>/etc/
- the setting files.
Output:
- <DATABASE>/
- the repository files.
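When this step is driven from another Python program, the documented options translate directly into an argument list. The sketch below only builds that list; the helper name `database_generate_cmd` is hypothetical, and the script location (`database_generate.py` in the current directory) is an assumption.

```python
import sys


def database_generate_cmd(input_csv, database, use_etc=False,
                          script="database_generate.py"):
    """Build an argv list for database_generate.py from its documented options.

    `script` is an assumed path; adjust it to wherever the utility lives.
    """
    cmd = [sys.executable, script, "-i", input_csv, "-d", database]
    if use_etc:
        # --etc: use predefined etc files (from ./etc)
        cmd.append("--etc")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single shell string avoids quoting problems with paths that contain spaces.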
Rule Annotation
Annotate the mentions by the rules.
Usage:
    usage: rule_exact.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Annotate by Rule (Exact Match).

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`

    usage: rule_parser.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Annotate by Rule (Parser).

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
- <CORPUS>/mention/purged_article/
- the mention files. [from Mention Detection]
Output:
- <CORPUS>/xml/purged_article_ws_rid/
- the rule-labeled XML articles.
Sentence Parsing
Parse the sentences.
Usage:
    usage: parse.py [-h] -c CORPUS [-t THREAD] --host HOST --port PORT

    CosmEL: Parse Sentence.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
      --host HOST           connect to host with IP <HOST>
      --port PORT           connect to host with port <PORT>
Requirement:
- <CORPUS>/article/purged_article_ws_re<#>/
- the temporary word-segmented articles. [from Article Preprocessing]
Output:
- <CORPUS>/article/parsed_article_parse/
- the parsed articles.
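Because this step needs a reachable `<HOST>` and `<PORT>`, it can be useful to check the connection before starting a long run. The following is a minimal sketch that assumes the parser service accepts plain TCP connections, which this specification does not state; the function name `parser_reachable` is hypothetical.

```python
import socket


def parser_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Assumption: the parsing service listens on TCP. This only checks that
    the port accepts connections, not that the service behaves correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

A caller could run this check with the same values it intends to pass as `--host` and `--port`, and skip the run when it returns `False`.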
Word2Vec
Embed words using Word2Vec.
Usage:
    usage: word2vec.py [-h] -c CORPUS [-t THREAD]

    CosmEL: Word2Vec.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/repo/
- the repository. [from Database Generation]
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
Output:
- <CORPUS>/embeddings/purged_article.dim300.emb.bin
- the embedding file.
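Each utility lists the directories it expects under "Requirement", so a short pre-flight check can report missing inputs before a run. The sketch below uses the two Word2Vec requirements as an example; the helper name `missing_inputs` and the constant `WORD2VEC_REQUIRED` are hypothetical and not part of CosmEL.

```python
from pathlib import Path

# Inputs listed under "Requirement" for Word2Vec, relative to <CORPUS>/.
WORD2VEC_REQUIRED = ["repo", "article/purged_article_role"]


def missing_inputs(corpus, required):
    """Return the entries of `required` that are not directories under `corpus`."""
    root = Path(corpus)
    return [rel for rel in required if not (root / rel).is_dir()]
```

An empty list means every required directory exists. The check does not verify that the directories contain usable files.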
XML Decoding
Extract mentions from XML.
Usage:
    usage: xml_decode.py [-h] -c CORPUS [-iw INPUT_WS] -i INPUT [-o OUTPUT] [-t THREAD]

    CosmEL: Decode XML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -iw INPUT_WS, --input-ws INPUT_WS
                            load word-segmented XML from "<CORPUS>/xml/<INPUT-WS>/"
      -i INPUT, --input INPUT
                            load XML from "<CORPUS>/xml/<INPUT>/"
      -o OUTPUT, --output OUTPUT
                            dump mention into "data/<ver>/mention/<OUTPUT>/";
                            default is <INPUT>
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/xml/<INPUT_WS>/
- the word-segmented XML files. [optional]
- <CORPUS>/xml/<INPUT>/
- the XML files. [required if <INPUT_WS> is not set]
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
- <CORPUS>/mention/purged_article/
- the mention files. [from Mention Detection]
Output:
- <CORPUS>/xml/<INPUT>/
- the XML files. [generated if <INPUT_WS> is set]
- <CORPUS>/mention/<OUTPUT>/
- the mention files.
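The role of `<CORPUS>/xml/<INPUT>/` depends on whether `-iw` is given: it is an input when `<INPUT_WS>` is not set and an output when it is. The sketch below encodes one reading of the Requirement and Output lists above; the function name `xml_decode_io` is hypothetical and the logic is inferred from this page, not from `xml_decode.py`.

```python
def xml_decode_io(input_name, input_ws=None, output_name=None):
    """Return (required, produced) paths relative to <CORPUS>/.

    This reflects the Requirement/Output lists as read from this page only.
    """
    if output_name is None:
        output_name = input_name  # documented default: <OUTPUT> is <INPUT>

    required = ["article/purged_article_role/", "mention/purged_article/"]
    produced = [f"mention/{output_name}/"]

    if input_ws is not None:
        # Word-segmented XML is the source; xml/<INPUT>/ is generated from it.
        required.append(f"xml/{input_ws}/")
        produced.append(f"xml/{input_name}/")
    else:
        # No word-segmented XML given; xml/<INPUT>/ must already exist.
        required.append(f"xml/{input_name}/")

    return required, produced
```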
XML Encoding
Encode articles and mentions into XML.
Usage:
    usage: xml_encode.py [-h] -c CORPUS -i INPUT [-o OUTPUT] [-t THREAD]

    CosmEL: Encode XML.

    optional arguments:
      -h, --help            show this help message and exit
      -c CORPUS, --corpus CORPUS
                            store corpus data in directory "<CORPUS>/"
      -i INPUT, --input INPUT
                            load mention from "<CORPUS>/mention/<INPUT>/"
      -o OUTPUT, --output OUTPUT
                            dump XML into "<CORPUS>/xml/<OUTPUT>/"; default is
                            <INPUT>
      -t THREAD, --thread THREAD
                            use <THREAD> threads; default is `os.cpu_count()`
Requirement:
- <CORPUS>/article/purged_article_role/
- the word-segmented articles with roles. [from Article Postprocessing]
- <CORPUS>/mention/<INPUT>/
- the mention files.
Output:
- <CORPUS>/xml/<OUTPUT>/
- the encoded XML files.
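The "[from ...]" notes in the Requirement lists imply an order in which the utilities can be run. The sketch below collects those notes into a dependency map and derives one valid order with the standard-library `graphlib` module (Python 3.9 or later). The map is transcribed from this page; Mention Merging is left out because its requirements carry no "[from ...]" note, and inputs chosen through `<INPUT>` are not modeled.

```python
from graphlib import TopologicalSorter

# step -> steps whose output it requires, per the "[from ...]" notes.
DEPENDS_ON = {
    "Database Generation": set(),
    "HTML Extracting": set(),
    "HTML Preprocessing": {"HTML Extracting"},
    "Article Preprocessing": {"Database Generation", "HTML Preprocessing"},
    "Article Postprocessing": {"Database Generation", "Article Preprocessing"},
    "Sentence Parsing": {"Article Preprocessing"},
    "HTML Postprocessing": {"Article Postprocessing", "HTML Preprocessing"},
    "Mention Detection": {"Article Postprocessing"},
    "Word2Vec": {"Database Generation", "Article Postprocessing"},
    "Rule Annotation": {"Article Postprocessing", "Mention Detection"},
    "XML Encoding": {"Article Postprocessing"},
    "XML Decoding": {"Article Postprocessing", "Mention Detection"},
    "HTML Decoding": {"Article Postprocessing", "Mention Detection"},
    "HTML Encoding": {"HTML Preprocessing", "HTML Postprocessing"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

for position, step in enumerate(order, start=1):
    print(position, step)
```

Several orders satisfy the same constraints, so the printed sequence is one valid choice rather than a prescribed one.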