1. Climate monitoring datasets
http://cdiac.ornl.gov/ftp/ndp026b
2. Some useful sites for downloading test datasets
Data for MATLAB hackers (Handwritten Digits, Faces, Text)
http://www.cs.toronto.edu/~roweis/data.html
3. UCI KDD Archive (datasets of various types)
http://kdd.ics.uci.edu/summary.task.type.html
http://kdd.ics.uci.edu/summary.data.type.html
4. Machine learning datasets collected by UCI
ftp://pami.sjtu.edu.cn/
http://www.ics.uci.edu/~mlearn//MLRepository.htm
5. Sample databases
http://kdd.ics.uci.edu/
Manually classified WWW pages
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
6. CMU World Wide Knowledge Base (Web->KB) project (classified web pages, relational data describing pages and hyperlinks)
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
7. Artificial intelligence and machine learning
http://duch-links.wikispaces.com/
8. Text classification: the datasets used with rainbow
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
9. StatLib: a library of programs and data for mathematical statistics
http://liama.ia.ac.cn/SCILAB/scilabindexgb.htm
http://lib.stat.cmu.edu/
http://lib.stat.cmu.edu/datasets/
http://lib.stat.cmu.edu/modules.php?op=modload&name=Downloads&file=index&req=viewdownload&cid=2
10. Cancer genes:
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
11. Financial and medical data:
http://lisp.vse.cz/pkdd99/Challenge/chall.htm
12. Time series data
http://www.stat.wisc.edu/~reinsel/bjr-data/
13. KDnuggets: links to a wide range of datasets:
http://www.kdnuggets.com/datasets/index.html
14. Intelligent analysis and information systems (Germany)
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html
http://dctc.sjtu.edu.cn/adaptive/datasets/
http://fimi.cs.helsinki.fi/data/
15. IBM intelligent information
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
16. Frequent Set Counting
http://miles.cnuce.cnr.it/~palmeri/datam/DCI/datasets.php
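17. Rating datasets
MovieLens: movie rating data
Basic description: three datasets are included:
a. 100,000 ratings from 943 users on 1,682 movies
b. 1 million ratings from 6,040 users on 3,900 movies
c. 10 million ratings from 71,567 users on 10,681 movies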
http://www.grouplens.org/
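The three MovieLens sizes listed above can be compared by how densely the user-by-movie rating matrix is filled. The following is a minimal sketch using only the figures quoted in this list; the rating counts are the rounded values given above, so the results are approximate, and the names `movielens` and `density` are illustrative only.

```python
# (users, movies, ratings) exactly as quoted in the list above.
# The rating counts are rounded, so the densities are approximate.
movielens = {
    "100K": (943, 1682, 100_000),
    "1M": (6040, 3900, 1_000_000),
    "10M": (71567, 10681, 10_000_000),
}


def density(users, items, ratings):
    """Fraction of the user-by-item matrix that actually holds a rating."""
    return ratings / (users * items)


for name, (users, movies, ratings) in movielens.items():
    print(
        f"{name}: density={density(users, movies, ratings):.4f}, "
        f"ratings per user={ratings / users:.1f}"
    )
```

With these figures the density falls as the dataset grows, which is worth keeping in mind when choosing a set for testing a recommender.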
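Book-Crossing: book rating data
Basic description: contains 1,149,780 ratings from 278,858 users on 271,379 books. The dataset was crawled from the Book-Crossing community by Cai-Nicolas Ziegler over four weeks in August and September 2004.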
http://www.informatik.uni-freiburg.de/~cziegler/BX/
Jester Joke Data Set: a collection of joke ratings
A dataset for recommender systems released by Ken Goldberg of UC Berkeley. It contains 4.1 million continuous ratings of 100 jokes from 73,496 users.
http://www.ieor.berkeley.edu/~goldberg/jester-data/
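Netflix dataset
Also a movie rating dataset: 480,189 users, 17,770 movies, and 100,480,507 rating records. By comparison, the MovieLens datasets are one to three orders of magnitude smaller (two for the 1 million rating set). MovieLens will probably be replaced gradually by the Netflix data, as a natural consequence of progress.
Note: all four of the above are user rating datasets.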
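The size comparison between the Netflix and MovieLens data can be checked with a short calculation. This is a sketch that uses only the rating counts quoted in this section (the MovieLens counts are the rounded values); the function name `orders_of_magnitude` is illustrative.

```python
import math

# Rating counts as quoted in this section (MovieLens counts are rounded).
netflix_ratings = 100_480_507
movielens_ratings = {"100K": 100_000, "1M": 1_000_000, "10M": 10_000_000}


def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the size ratio between two datasets."""
    return math.log10(larger / smaller)


for name, count in movielens_ratings.items():
    gap = orders_of_magnitude(netflix_ratings, count)
    print(f"Netflix vs MovieLens {name}: about {gap:.2f} orders of magnitude")
```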
21. GPS trajectory data
GeoLife GPS Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/default.aspx
GPS Trajectories with transportation mode labels
http://research.microsoft.com/apps/pubs/?id=141896
Movebank: animal trajectories
http://www.movebank.org/
22. Mobile phone, Wi-Fi, and Bluetooth data
A Community Resource for Archiving Wireless Data At Dartmouth
http://crawdad.cs.dartmouth.edu/
crowdflow: mobile phone and Wi-Fi trajectories
http://crowdflow.net/
23. OpenStreetMap Data
planet.openstreetmap.org or http://metro.teczno.com/
24. OpenPaths: data upload plus API
https://openpaths.cc/
25. FOURSQUARE
26. GeoTime
http://www.geotime.com/GeoTime(s)/January-2012/Cupid-Strikes-Again–Time-Series—GIS–Together-a.aspx
27. Datatang
http://www.datatang.com/
28. http://www.kdnuggets.com/datasets/
29. http://appsrv.cse.cuhk.edu.hk/~kdd/data_collection.html
IBM Almaden Research Center Data Mining Projects
Data Sets:
· Synthetic Data Generation Code for Associations and Sequential Patterns
· Synthetic Data Generation Code for Classification
· “Dense” Data-Sets (apriori binary format, 3.2Mb)
· Enron Email Data Set
Demos:
· General Visualizations for Associations
· Visualization Demo: Market Basket Analysis
IBM Intelligent Miner:
· IBM Intelligent Miner for Data
· Video and image clips from IBM Data Mining T.V. Ad
IBM Data Mining Resources:
· Business Intelligence Solutions: our colleagues offering data mining consultancy and services.
· Data Abstraction Research Group: our colleagues at the IBM Thomas J. Watson Research Center. Our colleagues in France.
· Data Mining: Extending the Information Warehouse Framework (IBM white paper on data mining).
The Reuters dataset can be found at the following address
http://www.research.att.com/~lewis/reuters21578.html
Sites on data mining for investment funds
http://www.gotofund.com/index.asp
http://lans.ece.utexas.edu/~strehl/
Reuters dataset
http://www.research.att.com/~lewis/reuters21578.html
http://www-2.cs.cmu.edu/webkb
http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf
Related:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
http://www.phys.uni.torun.pl/~duch/software.html
WEKA:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
1. A jarfile containing 37 classification problems, originally obtained from the UCI repository
http://prdownloads.sourceforge.net/weka/datasets-UCI.jar
2. A jarfile containing 37 regression problems, obtained from various sources
http://prdownloads.sourceforge.net/weka/datasets-numeric.jar
3. A jarfile containing 30 regression datasets collected by Luis Torgo
http://prdownloads.sourceforge.net/weka/regression-datasets.jar
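A jar file is an ordinary zip archive, so the WEKA dataset jars listed above can be inspected without Java. The following is a minimal sketch, assuming a jar has already been downloaded to a local path and that the datasets inside are stored as `.arff` files (WEKA's usual format; the layout of these particular jars is an assumption). The helper names `list_arff_members` and `extract_arff` are illustrative, not part of WEKA.

```python
import zipfile


def list_arff_members(jar_path):
    """Return the sorted names of .arff entries in a jar (a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name for name in jar.namelist() if name.lower().endswith(".arff")
        )


def extract_arff(jar_path, dest_dir):
    """Extract only the .arff entries into dest_dir and return their names."""
    members = list_arff_members(jar_path)
    with zipfile.ZipFile(jar_path) as jar:
        jar.extractall(dest_dir, members=members)
    return members
```

For example, `list_arff_members("datasets-UCI.jar")` would list the dataset files, assuming the first jar above was saved under that local name.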
Data mining competitions and their datasets
2005 University of California data mining contest, predicting bad accounts and their churn date using real-world CRM data, deadline June 30, 2005.
ILP 2005 Challenge, on the prediction of functional classes of genes.
KDD Cup 2005, on classifying internet user search queries, deadline July 8.
Data Mining Cup 2005 (Chemnitz, Germany), for students; topic: How data mining can ascertain the risk of loss of payments and reduce this risk.
KDD Cup 2004, focuses on data mining for several performance criteria using datasets from bioinformatics and quantum physics.
InfoVis 2004 Contest, The History of InfoVis.
DATA MINING CUP 2004 (Chemnitz, Germany), for students.
InfoVis 2003 Contest: Visualization and Pair Wise Comparison of Trees, results announced Sep 5, 2003.
KDD CUP 2003
http://www.cs.cornell.edu/projects/kddcup/index.html
KDD Cup 2003, focuses on problems motivated by network mining and the analysis of usage logs.
DATA MINING CUP 2003 (Chemnitz, Germany). The task is to identify spam emails before they reach the user's mailbox.
KDD Cup 2002, focus on data mining in molecular biology.
Student Data Mining Cup (2002), Chemnitz University and Prudential Systems.