Data Mining Datasets: A Compiled List of Download Links




1. Climate monitoring datasets


http://cdiac.ornl.gov/ftp/ndp026b



2. Some useful sites for downloading test datasets

Data for MATLAB hackers (Handwritten Digits, Faces, Text)

http://www.cs.toronto.edu/~roweis/data.html

3. UCI KDD Archive (datasets of various types)

http://kdd.ics.uci.edu/summary.task.type.html

http://kdd.ics.uci.edu/summary.data.type.html

4. Machine learning datasets collected by UCI

ftp://pami.sjtu.edu.cn/

http://www.ics.uci.edu/~mlearn//MLRepository.htm

5. Sample databases

http://kdd.ics.uci.edu/

Manually classified WWW pages

http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

6. CMU World Wide Knowledge Base (Web->KB) project (classified web pages, relational data describing pages and hyperlinks)

http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/

7. Artificial intelligence and machine learning

http://duch-links.wikispaces.com/

8. Text classification: the datasets used with rainbow

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

9. StatLib: a library of programs for mathematical statistics

http://liama.ia.ac.cn/SCILAB/scilabindexgb.htm

http://lib.stat.cmu.edu/

http://lib.stat.cmu.edu/datasets/

http://lib.stat.cmu.edu/modules.php?op=modload&name=Downloads&file=index&req=viewdownload&cid=2

10. Cancer genes:

http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

11. Financial and medical data:

http://lisp.vse.cz/pkdd99/Challenge/chall.htm

12. Time series data

http://www.stat.wisc.edu/~reinsel/bjr-data/

13. KDnuggets: links to a variety of datasets:

http://www.kdnuggets.com/datasets/index.html

14. German intelligent analysis and information systems

http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html

http://dctc.sjtu.edu.cn/adaptive/datasets/

http://fimi.cs.helsinki.fi/data/

15. IBM intelligent information

http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

http://www.almaden.ibm.com/software/quest/Resources/index.shtml

16. Frequent Set Counting

http://miles.cnuce.cnr.it/~palmeri/datam/DCI/datasets.php

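17. Rating datasets

MovieLens movie rating data

Basic description: includes the following three datasets:

a. 100,000 ratings from 943 users on 1,682 movies

b. 1 million ratings from 6,040 users on 3,900 movies

c. 10 million ratings from 71,567 users on 10,681 movies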

http://www.grouplens.org/

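Book-Crossing book rating data

Basic description: contains 1,149,780 ratings from 278,858 users on 271,379 books. Cai-Nicolas Ziegler collected the dataset with a four-week crawl of the Book-Crossing community in August and September 2004.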

http://www.informatik.uni-freiburg.de/~cziegler/BX/

Jester Joke Data Set: a collection of joke ratings

A dataset for recommender systems released by Ken Goldberg of UC Berkeley. It contains 4.1 million continuous ratings of 100 jokes from 73,496 users.

http://www.ieor.berkeley.edu/~goldberg/jester-data/

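Netflix dataset

Also a movie rating dataset: 480,189 users, 17,770 movies, and 100,480,507 rating records. By comparison, the MovieLens datasets are one to three orders of magnitude smaller (100,000 to 10 million ratings). MovieLens will probably be displaced gradually by the Netflix data, a natural consequence of progress.

Note: all four of the above are user rating datasets.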
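To put the sizes quoted above side by side, here is a minimal Python sketch that computes how densely each user-item matrix is filled and how far apart two rating counts are in orders of magnitude. The figures are copied from the descriptions above (the MovieLens and Jester counts are the rounded ones given there); the dataset labels and the function names `density` and `orders_of_magnitude` are only illustrative.

```python
import math

# (label, users, items, ratings) as quoted in the descriptions above.
# The MovieLens and Jester rating counts are the rounded figures from the text.
DATASETS = [
    ("MovieLens (a)", 943, 1682, 100_000),
    ("MovieLens (b)", 6040, 3900, 1_000_000),
    ("MovieLens (c)", 71567, 10681, 10_000_000),
    ("Book-Crossing", 278858, 271379, 1_149_780),
    ("Jester", 73496, 100, 4_100_000),
    ("Netflix", 480189, 17770, 100_480_507),
]


def density(users, items, ratings):
    """Fraction of all possible user-item pairs that carry a rating."""
    return ratings / (users * items)


def orders_of_magnitude(smaller, larger):
    """Base-10 logarithm of the ratio between two rating counts."""
    return math.log10(larger / smaller)


netflix_ratings = DATASETS[-1][3]
for label, users, items, ratings in DATASETS:
    gap = orders_of_magnitude(ratings, netflix_ratings)
    print(f"{label:15s} density={density(users, items, ratings):.6f} "
          f"log10(Netflix/this)={gap:.2f}")
```

Density is worth checking before picking a dataset: a nearly full matrix such as Jester behaves very differently in a recommender experiment from a very sparse one such as Book-Crossing.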

21. GPS trajectory data

GeoLife GPS Trajectories

http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/default.aspx

GPS Trajectories with transportation mode labels

http://research.microsoft.com/apps/pubs/?id=141896

Movebank animal trajectories

http://www.movebank.org/

22. Mobile phone, Wi-Fi, and Bluetooth data

A Community Resource for Archiving Wireless Data At Dartmouth

http://crawdad.cs.dartmouth.edu/

crowdflow: mobile phone and Wi-Fi trajectories

http://crowdflow.net/

23. OpenStreetMap Data

planet.openstreetmap.org or http://metro.teczno.com/

24. OpenPaths: data upload plus API

https://openpaths.cc/

25. FOURSQUARE

26. GeoTime

http://www.geotime.com/GeoTime(s)/January-2012/Cupid-Strikes-Again–Time-Series—GIS–Together-a.aspx

27. Datatang

http://www.datatang.com/

28. http://www.kdnuggets.com/datasets/

29. http://appsrv.cse.cuhk.edu.hk/~kdd/data_collection.html

IBM Almaden Research Center Data Mining Projects

Data Sets:

· Synthetic Data Generation Code for Associations and Sequential Patterns

· Synthetic Data Generation Code for Classification

· “Dense” Data-Sets (apriori binary format, 3.2Mb)

· Enron Email Data Set

Demos:

· General Visualizations for Associations

· Visualization Demo: Market Basket Analysis

IBM Intelligent Miner:

· IBM Intelligent Miner for Data

· Video and image clips from IBM Data Mining T.V. Ad

IBM Data Mining Resources:

· Business Intelligence Solutions: Our colleagues offering data mining consultancy and services.

· Data Abstraction Research Group: Our colleagues in IBM Thomas J. Watson Research Center. Our colleagues in France.

· Data Mining: Extending the Information Warehouse Framework. IBM White Paper on Data Mining.

The Reuters dataset can be found at the following address:

http://www.research.att.com/~lewis/reuters21578.html

A website on data mining for investment funds

http://www.gotofund.com/index.asp

http://lans.ece.utexas.edu/~strehl/

Reuters dataset

http://www.research.att.com/~lewis/reuters21578.html

http://www-2.cs.cmu.edu/webkb

http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf

Related:

http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar

http://www.phys.uni.torun.pl/~duch/software.html

WEKA:

http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar

1. A jarfile containing 37 classification problems, originally obtained from the UCI repository

http://prdownloads.sourceforge.net/weka/datasets-UCI.jar

2. A jarfile containing 37 regression problems, obtained from various sources

http://prdownloads.sourceforge.net/weka/datasets-numeric.jar

3. A jarfile containing 30 regression datasets collected by Luis Torgo

http://prdownloads.sourceforge.net/weka/regression-datasets.jar
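A jarfile is an ordinary zip archive, so its contents can be inspected without Java. The sketch below lists the ARFF files inside one using Python's standard `zipfile` module. It assumes the datasets are stored as `.arff` entries, which is not stated above. The helper names `list_arff_members` and `make_demo_jar` are illustrative, and the demo runs on a tiny in-memory archive standing in for a downloaded jar.

```python
import io
import zipfile


def list_arff_members(jar):
    """Return the sorted names of .arff entries in a jar.

    `jar` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".arff")
        )


def make_demo_jar():
    """Build a small in-memory archive that stands in for a real jar."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        archive.writestr("demo/example.arff", "@relation example\n")
    buffer.seek(0)
    return buffer


# For a real download, pass the local path of the jar instead.
print(list_arff_members(make_demo_jar()))
```

The same `zipfile.ZipFile` object also offers `extractall` for unpacking the files to a directory once the listing looks right.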

Data mining competitions and their datasets

2005 University of California data mining contest, predicting bad accounts and their churn date using real-world CRM data, deadline June 30, 2005.

ILP 2005 Challenge, on the prediction of functional classes of genes.

KDD Cup 2005, on classifying internet user search queries, deadline July 8.

Data Mining Cup 2005 (Chemnitz, Germany), for students; topic: How data mining can ascertain the risk of loss of payments and reduce this risk.

KDD Cup 2004, focuses on data mining for several performance criteria using datasets from bioinformatics and quantum physics.

InfoVis 2004 Contest, The History of InfoVis.

DATA MINING CUP 2004 (Chemnitz, Germany), for students.

InfoVis 2003 Contest: Visualization and Pair Wise Comparison of Trees, results announced Sep 5, 2003.

KDD CUP 2003

http://www.cs.cornell.edu/projects/kddcup/index.html


KDD Cup 2003, focuses on problems motivated by network mining and the analysis of usage logs.

DATA MINING CUP 2003 (Chemnitz, Germany). The task is to identify spam emails before they reach the user's mailbox.

KDD Cup 2002, focus on data mining in molecular biology.

Student Data Mining Cup (2002), Chemnitz University and Prudential Systems.