Tool Specification

Corpus Generation

Usage:

usage: corpusgen.py [-h] -c CORPUS [-d DATABASE] [-i INPUT] [-x XML]
                    [-X XML_EMPTY] [-r RULE | --rule-exact | --rule-parser]
                    [-t THREAD]

CosmEL Tool: Corpus Generator.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -d DATABASE, --database DATABASE
                        load product database from directory <DATABASE>;
                        default is "<CORPUS>/repo/"
  -i INPUT, --input INPUT
                        load articles in directory <INPUT>; default is
                        "<CORPUS>/article/original_article/"
  -x XML, --xml XML     output rule labeled XML articles into directory <XML>;
                        default is "<CORPUS>/xml/purged_article_rid/"
  -X XML_EMPTY, --xml-empty XML_EMPTY
                        output empty XML articles into directory <XML-EMPTY>
                        for human annotation; default is
                        "<CORPUS>/xml/purged_article/"
  -r RULE, --rule RULE  use file <RULE> as rule
  --rule-exact          use rule with only exact-match
  --rule-parser         use rule with parser
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
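
The defaults listed above all derive from the `<CORPUS>` directory, and the usage line groups `-r`, `--rule-exact`, and `--rule-parser` as alternatives. As a rough illustration, the Python sketch below resolves those defaults and assembles an argument list for `corpusgen.py`. The helper name `corpusgen_args`, its keyword parameters, and the path `data/demo` are hypothetical and not part of CosmEL; only the flags and default paths are taken from the help text.

```python
def corpusgen_args(corpus, database=None, input_dir=None, xml=None,
                   xml_empty=None, rule=None, rule_exact=False,
                   rule_parser=False, thread=None):
    """Build an argument list for corpusgen.py (hypothetical helper)."""
    # The usage line shows the three rule options as alternatives,
    # so reject any combination of more than one.
    if sum([rule is not None, rule_exact, rule_parser]) > 1:
        raise ValueError("choose at most one of rule, rule_exact, rule_parser")

    # Fill in the defaults documented in the help text.
    root = corpus.rstrip("/")
    database = database or f"{root}/repo/"
    input_dir = input_dir or f"{root}/article/original_article/"
    xml = xml or f"{root}/xml/purged_article_rid/"
    xml_empty = xml_empty or f"{root}/xml/purged_article/"

    args = ["corpusgen.py", "-c", corpus, "-d", database,
            "-i", input_dir, "-x", xml, "-X", xml_empty]
    if rule is not None:
        args += ["-r", rule]
    elif rule_exact:
        args.append("--rule-exact")
    elif rule_parser:
        args.append("--rule-parser")
    if thread is not None:
        args += ["-t", str(thread)]
    return args


if __name__ == "__main__":
    # "data/demo" is a made-up corpus directory used only for illustration.
    print(" ".join(corpusgen_args("data/demo", rule_exact=True, thread=4)))
```

Passing the defaults explicitly is redundant for the tool itself, but it makes visible where each directory ends up when only `-c` is supplied.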

Model Training

Usage:

usage: train.py [-h] -c CORPUS [-m MODEL] [-x XML] [--emb EMB]
                [-s {c,cd,cn,cdn}]
                [-l [{gid,rid,joint} [{gid,rid,joint} ...]]]
                [-L [{gid,rid,joint} [{gid,rid,joint} ...]]] [-t THREAD]

CosmEL Tool: Training.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -m MODEL, --model MODEL
                        store model data in directory "<MODEL>/"; default is
                        "<CORPUS>/model/"
  -x XML, --xml XML     load golden labeled XML articles from directory <XML>;
                        default is "<CORPUS>/xml/purged_article_gid/"
  --emb EMB             load pretrained embeddings from file <EMB>; default is
                        "<CORPUS>/embeddings/purged_article.dim300.emb.bin"
  -s {c,cd,cn,cdn}, --structure-eem {c,cd,cn,cdn}
                        use model structure <STRUCTURE-EEM> for entity
                        embeddings model; default is "cdn"
  -l [{gid,rid,joint} [{gid,rid,joint} ...]], --label-eem [{gid,rid,joint} [{gid,rid,joint} ...]]
                        use label type <LABEL-EEM> for entity embeddings
                        model; default is "joint"
  -L [{gid,rid,joint} [{gid,rid,joint} ...]], --label-mtc [{gid,rid,joint} [{gid,rid,joint} ...]]
                        use label type <LABEL-MTC> for mention type
                        classifier; default is "gid"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
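
In the usage line, `-l` and `-L` each accept zero or more values from `gid`, `rid`, and `joint`, so several label types can follow a single flag. A minimal sketch of assembling such a call is shown below. The helper `train_args` and the path `data/demo` are hypothetical; how `train.py` handles multiple label types internally is not described here, so the sketch only builds and validates the command line.

```python
STRUCTURES = ("c", "cd", "cn", "cdn")
LABELS = ("gid", "rid", "joint")


def train_args(corpus, structure="cdn", label_eem=("joint",),
               label_mtc=("gid",), thread=None):
    """Build an argument list for train.py (hypothetical helper)."""
    # Validate against the choices shown in the help text.
    if structure not in STRUCTURES:
        raise ValueError(f"unknown structure: {structure!r}")
    for label in (*label_eem, *label_mtc):
        if label not in LABELS:
            raise ValueError(f"unknown label type: {label!r}")

    args = ["train.py", "-c", corpus, "-s", structure]
    # Each flag is followed by all of its values, e.g. "-l gid rid".
    args += ["-l", *label_eem]
    args += ["-L", *label_mtc]
    if thread is not None:
        args += ["-t", str(thread)]
    return args


if __name__ == "__main__":
    # "data/demo" is a made-up corpus directory used only for illustration.
    print(" ".join(train_args("data/demo", label_eem=("gid", "rid"))))
```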

Model Prediction

Usage:

usage: predict.py [-h] -c CORPUS [-m MODEL] [-o OUTPUT] [-s {c,cd,cn,cdn}]
                  [-l {gid,rid,joint}] [-L {gid,rid,joint}] [-t THREAD]

CosmEL Tool: Prediction.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS, --corpus CORPUS
                        store corpus data in directory "<CORPUS>/"
  -m MODEL, --model MODEL
                        store model data in directory "<MODEL>/"; default is
                        "<CORPUS>/model/"
  -o OUTPUT, --output OUTPUT
                        output predicted XML articles into directory <OUTPUT>;
                        default is "<CORPUS>/xml/purged_article_gnrid/"
  -s {c,cd,cn,cdn}, --structure-eem {c,cd,cn,cdn}
                        use model structure <STRUCTURE-EEM> for entity
                        embeddings model; default is "cdn"
  -l {gid,rid,joint}, --label-eem {gid,rid,joint}
                        use label type <LABEL-EEM> for entity embeddings
                        model; default is "joint"
  -L {gid,rid,joint}, --label-mtc {gid,rid,joint}
                        use label type <LABEL-MTC> for mention type
                        classifier; default is "gid"
  -t THREAD, --thread THREAD
                        use <THREAD> threads; default is `os.cpu_count()`
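
Unlike `train.py`, the prediction usage line accepts exactly one value for `-l` and one for `-L`. The following `argparse` sketch reproduces a similar interface and shows how the `<CORPUS>`-relative defaults could be resolved after parsing. It is an approximation written from the help text, not the actual source of `predict.py`; the function names `build_parser` and `resolve_defaults` and the path `data/demo` are assumptions.

```python
import argparse
import os

LABELS = ["gid", "rid", "joint"]


def build_parser():
    """Approximate the predict.py interface (not the tool's real source)."""
    parser = argparse.ArgumentParser(
        prog="predict.py", description="CosmEL Tool: Prediction.")
    parser.add_argument("-c", "--corpus", required=True)
    parser.add_argument("-m", "--model")
    parser.add_argument("-o", "--output")
    parser.add_argument("-s", "--structure-eem",
                        choices=["c", "cd", "cn", "cdn"], default="cdn")
    parser.add_argument("-l", "--label-eem", choices=LABELS, default="joint")
    parser.add_argument("-L", "--label-mtc", choices=LABELS, default="gid")
    parser.add_argument("-t", "--thread", type=int, default=os.cpu_count())
    return parser


def resolve_defaults(args):
    """Fill in paths that the help text documents as relative to <CORPUS>."""
    root = args.corpus.rstrip("/")
    if args.model is None:
        args.model = f"{root}/model/"
    if args.output is None:
        args.output = f"{root}/xml/purged_article_gnrid/"
    return args


if __name__ == "__main__":
    # "data/demo" is a made-up corpus directory used only for illustration.
    ns = resolve_defaults(build_parser().parse_args(["-c", "data/demo"]))
    print(ns.model, ns.output, ns.structure_eem, ns.label_eem, ns.label_mtc)
```

Resolving the paths after parsing, rather than through `default=`, is one way to make a default depend on another option's value.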