Studying sequence tagging

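The concrete way to use an SVM for structuring text is as follows: the information about each word (lemma, part-of-speech type, word length, and so on) is converted into a vector, and the SVM is then used to decide whether that vector has a particular attribute (that is, whether the word corresponds to "body part", "finding", and so on).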
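The word-to-vector step described above can be sketched roughly as follows. This is a minimal illustration, not the method of the cited source: the feature names (`lemma=...`, `pos=...`, `length`), the example words, and the helper functions are all hypothetical. Categorical features are one-hot encoded and the word length is kept as a number. The SVM itself is not shown; the resulting vectors, together with one label per token, would be passed to whichever SVM implementation is used.

```python
from typing import Dict, List


def token_features(lemma: str, pos: str) -> Dict[str, float]:
    """Describe one token as named features (feature names are illustrative)."""
    return {
        f"lemma={lemma}": 1.0,       # one-hot feature for the lemma
        f"pos={pos}": 1.0,           # one-hot feature for the part of speech
        "length": float(len(lemma)), # numeric feature: word length
    }


def build_index(feature_dicts: List[Dict[str, float]]) -> Dict[str, int]:
    """Assign each feature name seen in the data a fixed vector position."""
    names = sorted({name for d in feature_dicts for name in d})
    return {name: i for i, name in enumerate(names)}


def vectorize(features: Dict[str, float], index: Dict[str, int]) -> List[float]:
    """Turn a feature dict into a dense vector; unknown features are ignored."""
    vec = [0.0] * len(index)
    for name, value in features.items():
        if name in index:
            vec[index[name]] = value
    return vec


# Hypothetical tokens: (lemma, part of speech)
tokens = [("liver", "NOUN"), ("enlarged", "ADJ")]
feature_dicts = [token_features(lemma, pos) for lemma, pos in tokens]
index = build_index(feature_dicts)
vectors = [vectorize(d, index) for d in feature_dicts]
```

Each attribute such as "body part" or "finding" can then be treated as a yes/no question about a vector, which is the kind of decision a binary SVM makes.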

No.14 Extraction of Medical Knowledge

Both the training file and the test file need to be in a particular format for YamCha to work properly. Generally speaking, the training and test files must consist of multiple tokens, and each token consists of multiple columns (a fixed number of them). The definition of a token depends on the task, but in most typical cases a token simply corresponds to a word. Each token must be written on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. To mark the boundary between sentences, insert an empty line (or a line containing only 'EOS').

You can give as many columns as you like; however, the number of columns must be the same for all tokens. Furthermore, each column carries its own "semantics". For example, the first column is the word, the second column is the POS tag, the third column is the POS sub-category, and so on.

The last column represents the true answer tag, which is what the SVMs are trained to predict.
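The format rules above (one token per line, whitespace-separated columns, a fixed column count, an empty line or 'EOS' between sentences, the answer tag in the last column) can be checked with a small reader. The sketch below is not part of YamCha; `parse_sentences` and the sample data are hypothetical and only illustrate the layout, here with three columns (word, POS tag, answer tag).

```python
from typing import List, Optional


def parse_sentences(text: str) -> List[List[List[str]]]:
    """Split text into sentences, each a list of tokens, each a list of columns."""
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    width: Optional[int] = None  # column count, fixed by the first token seen

    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # An empty line or a line containing only 'EOS' ends the sentence.
        if stripped == "" or stripped == "EOS":
            if current:
                sentences.append(current)
                current = []
            continue
        columns = stripped.split()  # splits on spaces and tabs
        if width is None:
            width = len(columns)
        elif len(columns) != width:
            raise ValueError(
                f"line {line_no}: expected {width} columns, got {len(columns)}"
            )
        current.append(columns)

    if current:  # the last sentence may lack a trailing boundary
        sentences.append(current)
    return sentences


# Illustrative sample: word, POS tag, answer tag (chunk label)
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP

The DT B-NP
deficit NN I-NP
EOS
"""

sentences = parse_sentences(SAMPLE)
words = [[token[0] for token in s] for s in sentences]   # first column
tags = [[token[-1] for token in s] for s in sentences]   # last column: answer tag
```

Rejecting a line with the wrong number of columns early is a deliberate choice here, since a silently shifted column would otherwise put the wrong value in the answer-tag position.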

YamCha: Yet Another Multipurpose CHunk Annotator