William Cohen's group at CMU (http://curtis.ml.cmu.edu/w/courses/index.php/ACE_2005_Dataset)
The ACE 2005 evaluation addresses five primary tasks: the recognition of entities, values, temporal expressions, relations, and events.
The dataset is available at the Linguistic Data Consortium. The data is taken from a variety of sources and is available for the tasks in the following languages: Arabic, Chinese and English.
Four versions of each document are provided:
Source text files (.sgm): All source files, including the Chinese files, are encoded in UTF-8.
APF files (.apf.xml): The ACE Program Format.
AG files (.ag.xml): The LDC Annotation Graph Format.
TABLE files (.tab): Files that store mapping tables between the IDs used in each ag.xml file and their corresponding apf.xml file.
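Since each document comes in four versions, a quick sanity check is to group the files by document ID and flag any document that is missing a version. The following is a minimal sketch, not part of the corpus tooling: the function names and the sample ID `DOC_0001` are hypothetical, and it assumes that all versions of a document share the same base name. Suffixes are matched with `str.endswith` rather than `Path.suffix`, so that base names containing dots are handled correctly.

```python
from collections import defaultdict
from pathlib import Path

# The four per-document suffixes listed above.
SUFFIXES = (".apf.xml", ".ag.xml", ".sgm", ".tab")


def split_name(filename):
    """Return (doc_id, suffix) for a recognised file name, else None."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return None


def group_versions(filenames):
    """Map each document ID to the set of suffixes found for it."""
    groups = defaultdict(set)
    for name in filenames:
        parts = split_name(Path(name).name)  # ignore directory components
        if parts is not None:
            doc_id, suffix = parts
            groups[doc_id].add(suffix)
    return dict(groups)


def incomplete_documents(groups):
    """Return the sorted document IDs missing at least one version."""
    return sorted(d for d, found in groups.items() if found != set(SUFFIXES))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [
        "data/DOC_0001.sgm",
        "data/DOC_0001.apf.xml",
        "data/DOC_0001.ag.xml",
        "data/DOC_0001.tab",
        "data/DOC_0002.sgm",
    ]
    print(incomplete_documents(group_versions(files)))
```

In practice the list of file names could come from something like `Path(corpus_dir).rglob("*")`, where `corpus_dir` is wherever the LDC release was unpacked.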
The detailed statistics for the training portion of this corpus are as follows:
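Source | Training epoch | Approximate size |
English resources |
Broadcast news | 3/03-6/03 | 60,000 words |
Broadcast conversations | 3/03-6/03 | 45,000 words |
Newswire | 3/03-6/03 | 60,000 words |
Weblog | 3/03-6/03 | 45,000 words |
Usenet newsgroups | 11/04-2/05 | 45,000 words |
Conversational telephone speech | 11/04-12/04 (differentiated by topic, etc.) | 45,000 words |
Arabic resources |
Broadcast news | 10/00-12/00 | 60,000 words |
Newswire | 10/00-12/00 | 60,000 words |
Weblog | 11/04-2/05 | 30,000 words |
Chinese resources (1.5 characters = 1 word) |
Broadcast news | 10/00-12/00 | 120,000 words |
Newswire | 10/00-12/00 | 120,000 words |
Weblog | 11/04-2/05 | 60,000 words |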
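
To get a feel for the relative size of each language portion, the figures in the table above can be summed per language. This is only a rough sketch: the numbers are copied from the table, the names `SIZES`, `total_words` and `chinese_characters` are made up for illustration, and the character count simply applies the table's stated ratio of 1.5 characters per word.

```python
# Approximate training sizes in words, copied row by row from the table above.
SIZES = {
    "English": [60000, 45000, 60000, 45000, 45000, 45000],
    "Arabic": [60000, 60000, 30000],
    "Chinese": [120000, 120000, 60000],
}

# Conversion ratio stated in the table for the Chinese resources.
CHARS_PER_WORD_ZH = 1.5


def total_words(language):
    """Sum the approximate word counts over all sources of one language."""
    return sum(SIZES[language])


def chinese_characters(words):
    """Convert a Chinese word count to characters using the table's ratio."""
    return int(words * CHARS_PER_WORD_ZH)


if __name__ == "__main__":
    for language in SIZES:
        print(language, total_words(language))
    print("Chinese characters:", chinese_characters(total_words("Chinese")))
```

By these figures the English and Chinese portions are each about 300,000 words, and the Arabic portion is about half that.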