ACE 2005 Dataset (Introduction 1)

Post author: xfxia
Post published: September 14, 2023
Post category: Other

The following content comes from William Cohen's group at CMU (http://curtis.ml.cmu.edu/w/courses/index.php/ACE_2005_Dataset).

The ACE 2005 dataset addresses five primary tasks: the recognition of entities, values, temporal expressions, relations, and events.

The dataset is available from the Linguistic Data Consortium (LDC). The data is drawn from a variety of sources and covers these tasks in three languages: Arabic, Chinese, and English.

Four versions of each document are provided:

Source text files (.sgm): All source files, including the Chinese files, are encoded in UTF-8.

APF files (.apf.xml): The ACE Program Format.

AG files (.ag.xml): The LDC Annotation Graph Format.

TABLE files (.tab): Files that store the mapping tables between the IDs used in each ag.xml file and those in the corresponding apf.xml file.
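
The four versions listed above can be collected per document with a short script. The following is only a minimal sketch: it assumes that all versions of a document share the same base name and differ only in the extension, and the file names used with it (such as DOC_001) are hypothetical, not taken from the corpus.

```python
from collections import defaultdict
from pathlib import Path

# Extensions from the list above. None of them is a suffix of another,
# so the order in which they are checked does not matter.
EXTENSIONS = (".apf.xml", ".ag.xml", ".sgm", ".tab")


def group_document_files(filenames):
    """Map each document ID to {extension: filename}.

    Assumption: every version of a document shares one base name.
    Files with any other extension are ignored.
    """
    groups = defaultdict(dict)
    for name in filenames:
        base = Path(name).name
        for ext in EXTENSIONS:
            if base.endswith(ext):
                # Strip the extension by length, since document IDs
                # may themselves contain dots.
                groups[base[: -len(ext)]][ext] = name
                break
    return dict(groups)


def missing_versions(groups):
    """Return {doc_id: [missing extensions]} for incomplete documents."""
    return {
        doc_id: [ext for ext in EXTENSIONS if ext not in found]
        for doc_id, found in groups.items()
        if len(found) < len(EXTENSIONS)
    }
```

Calling `group_document_files` on a directory listing and then `missing_versions` on the result gives a quick check that every document has all four files before any annotation is parsed.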

The detailed statistics for the training portion of this corpus are as follows:

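Statistics for the 2005 ACE training corpus, release LDC2005E18.
Source	Training epoch	Approximate size
English resources
Broadcast News	3/03-6/03	60,000 words
Broadcast Conversations	3/03-6/03	45,000 words
Newswire	3/03-6/03	60,000 words
Weblog	3/03-6/03	45,000 words
Usenet	11/04-2/05	45,000 words
Conversational Telephone Speech	11/04-12/04 (differentiated by topic, etc.)	45,000 words
Arabic resources
Broadcast News	10/00-12/00	60,000 words
Newswire	10/00-12/00	60,000 words
Weblog	11/04-2/05	30,000 words
Chinese resources (1.5 characters = 1 word)
Broadcast News	10/00-12/00	120,000 words
Newswire	10/00-12/00	120,000 words
Weblog	11/04-2/05	60,000 words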
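
The approximate sizes in the table above can be totaled per language with a few lines of arithmetic. This is only a sketch: the numbers are copied row by row from the table, and the character estimate assumes the Chinese figures are word counts under the stated ratio of 1.5 characters per word.

```python
# Approximate sizes in words, copied row by row from the table above.
SIZES = {
    "English": [60000, 45000, 60000, 45000, 45000, 45000],
    "Arabic": [60000, 60000, 30000],
    "Chinese": [120000, 120000, 60000],
}

# Ratio stated in the table heading for the Chinese resources.
CHARS_PER_WORD_ZH = 1.5

# Sum the per-source sizes for each language.
totals = {lang: sum(sizes) for lang, sizes in SIZES.items()}

# Rough character count for the Chinese portion (assumption: the
# table's Chinese figures are in words, not characters).
chinese_chars = int(totals["Chinese"] * CHARS_PER_WORD_ZH)

for lang, total in totals.items():
    print(f"{lang}: about {total:,} words")
print(f"Chinese: about {chinese_chars:,} characters")
```

By these figures, the English and Chinese portions each come to roughly 300,000 words and the Arabic portion to roughly 150,000 words.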
