In this article, I will show how to retrieve close to one million public text or PDF documents. Some of these documents are raw text, some are clean text, and some include categorical labelling. I will also introduce KILT, a benchmark framework for natural language models.
List of Lists of Public NLP Datasets
The following are non-exhaustive lists of lists of NLP datasets:
Raw Text
- Google datasets;
- Text datasets;
- Kaggle datasets;
- fast.ai datasets;
- UCI Machine Learning Repository datasets;
- pyquora: A Python module to fetch and parse data from Quora.
Federal
- IRS Documents;
- Patent tables;
- 10-K filing rules.
Bias
- StereoSet is a dataset of 17,000 sentences that measures model preferences across gender, race, religion, and profession. StereoSet is used to measure bias in NLP models.
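StereoSet's headline number can be thought of as how often a model assigns higher probability to the stereotypical sentence of a pair, with 0.5 as the unbiased ideal. Below is a minimal sketch of that idea; the function name and the toy log-probabilities are my own illustration, not StereoSet's official scorer.

```python
def stereotype_score(pairs):
    """Fraction of pairs where the model prefers the stereotypical
    sentence (higher log-probability). 0.5 is the unbiased ideal."""
    if not pairs:
        raise ValueError("need at least one pair")
    preferred = sum(1 for stereo_lp, anti_lp in pairs if stereo_lp > anti_lp)
    return preferred / len(pairs)

# Toy log-probabilities for (stereotypical, anti-stereotypical) sentence pairs.
pairs = [(-4.2, -5.1), (-3.8, -3.5), (-6.0, -6.4), (-2.9, -2.7)]
print(stereotype_score(pairs))  # 0.5 here: the model prefers each half the time
```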
COVID-19, Medical and NIH Raw Text
- What to do if you are sick;
- https://www.cdc.gov/coronavirus/2019-ncov/downloads/10Things.pdf;
- COVID-19 Open Research Database. It is a resource of over 200,000 scholarly articles, including over 97,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses;
- CDC guidance documents;
- National Institutes of Health (NIH) Funding: FY1995-FY2021 (search NIH PDF);
- WebMD text.
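Documents like the CDC PDF above can be retrieved with nothing beyond the Python standard library. A minimal sketch follows; the helper name `fetch_document` is my own, and the commented-out example assumes the CDC URL is still reachable from your network.

```python
import urllib.request

def fetch_document(url: str) -> bytes:
    """Download a document (PDF, HTML, plain text) and return its raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Example (assumes network access):
# pdf_bytes = fetch_document(
#     "https://www.cdc.gov/coronavirus/2019-ncov/downloads/10Things.pdf")
# with open("10Things.pdf", "wb") as f:
#     f.write(pdf_bytes)
```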
Non-English Natural Language
- 25 Best Parallel Text Datasets for Machine Translation Training;
- 20 Best French Language Datasets for Machine Learning.
Specialized NLP Datasets
1.
2.
The Ultimate Meta-Dataset
Google Dataset Search: Finding Millions of Datasets on the Web
Google Dataset Search was released to the public in January 2020 [1].
Instead of grepping or web-scraping a dataset of interest, you can filter many candidate PDF, Word text, image, sound, structured-data, and somebody-already-created-it-for-you datasets from Google Dataset Search.
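Once candidate datasets are collected from a search session, the filtering step itself can be done locally. The following is a toy sketch of that workflow, assuming you have exported candidate records as dictionaries; every field name and record here is hypothetical.

```python
# Hypothetical candidate records, e.g. collected from a dataset search session.
candidates = [
    {"name": "IRS Documents", "format": "pdf", "labeled": False},
    {"name": "StereoSet", "format": "text", "labeled": True},
    {"name": "CORD-19", "format": "text", "labeled": False},
    {"name": "Urban Sounds", "format": "audio", "labeled": True},
]

def filter_candidates(records, formats, labeled_only=False):
    """Keep records whose format is wanted, optionally labeled ones only."""
    return [r for r in records
            if r["format"] in formats and (r["labeled"] or not labeled_only)]

text_sets = filter_candidates(candidates, {"text"}, labeled_only=True)
print([r["name"] for r in text_sets])  # ['StereoSet']
```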
Benchmarking
KILT
Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure.
[2].
KILT (Knowledge-Intensive Language Tasks) is a benchmark for natural language models. The KILT benchmark covers a wide range of knowledge-intensive tasks.
Admittedly, it is “specialized” to Natural Language Processing (NLP) models.
However, the Turing test, the widely accepted AGI (Artificial General Intelligence) test, is natural language-based [3].
… solving knowledge-intensive tasks requires (even for humans) access to a large body of information [].
KILT uses 5.9 million Wikipedia pages for its knowledge base [2,4].
By starting with a large corpus and then feeding more text to KILT, the researchers at Facebook hope that KILT will become a benchmark for any NLP model in any domain.
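A KILT-style evaluation checks not only the answer but also the provenance, i.e., whether the model points to Wikipedia pages that support its answer. The sketch below is a toy illustration of that idea, not the official KILT scorer; all names and data are hypothetical.

```python
def kilt_style_score(predictions, gold):
    """Toy KILT-style scoring: exact-match accuracy on answers, plus
    provenance accuracy (did the model cite a gold Wikipedia page?)."""
    answer_hits, provenance_hits = 0, 0
    for pred, ref in zip(predictions, gold):
        if pred["answer"] == ref["answer"]:
            answer_hits += 1
        if set(pred["pages"]) & set(ref["pages"]):
            provenance_hits += 1
    n = len(gold)
    return {"accuracy": answer_hits / n, "provenance": provenance_hits / n}

gold = [
    {"answer": "Paris", "pages": {"France"}},
    {"answer": "1969", "pages": {"Apollo 11"}},
]
predictions = [
    {"answer": "Paris", "pages": {"France", "Paris"}},
    {"answer": "1968", "pages": {"Apollo 11"}},
]
print(kilt_style_score(predictions, gold))
# {'accuracy': 0.5, 'provenance': 1.0}
```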
Being able to benchmark any domain is a lofty goal. Below are domain-specific NLP tasks:
- Business-specific entities, like artifacts, events, and actors;
- Relationships between entities;
- Business processes;
- Meta-knowledge: knowledge about what knowledge you know.
Summary
You are presented with 33 lists of datasets.
Fast.ai probably has the datasets most common to researchers.
Kaggle has datasets of text, Q&A, structured data, audio, and 2-D and 3-D images.
You were presented with a formal NLP benchmark framework: KILT.
Finally, you were introduced to an awesome dataset search engine: Google Dataset Search.
Compiling lists of datasets has helped me. I hope it helps you.