sklearn: Loading Datasets

2018-11-01

description: California Housing dataset; median house values for California districts, derived from the 1990 U.S. census
samples: 20640
dimensionality: 8
features: real
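Difference from the Boston house-prices data: this dataset is based on attributes of the housing itself, whereas the Boston data is based on attributes of the city area, so this one is arguably a little closer to reality.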

MedInc: median income in block
HouseAge: median house age in block
AveRooms: average number of rooms
AveBedrms: average number of bedrooms
Population: block population
AveOccup: average house occupancy
Latitude: house block latitude
Longitude: house block longitude

5. Generated data

Classification: single-label

Function: make_blobs
Function: make_classification
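As a minimal sketch of the two single-label generators, the snippet below draws a small sample from each. The sample sizes, feature counts, and `random_state` are arbitrary values chosen for illustration, not settings from the original post.

```python
from sklearn.datasets import make_blobs, make_classification

# make_blobs: isotropic Gaussian clusters, one cluster per class label.
X_blobs, y_blobs = make_blobs(
    n_samples=150, n_features=2, centers=3, random_state=0
)

# make_classification: a mix of informative, redundant, and noise features.
X_clf, y_clf = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=3,
    n_redundant=1,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

print(X_blobs.shape, y_blobs.shape)
print(X_clf.shape, y_clf.shape)
```

make_blobs gives direct control over cluster centers and spread, while make_classification is more useful when the features themselves should include redundant or uninformative columns.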

Classification: multilabel

Function: make_multilabel_classification
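A short sketch of the multilabel generator follows. The parameter values are illustrative assumptions; the point is only that the target comes back as a binary indicator matrix with one column per class.

```python
from sklearn.datasets import make_multilabel_classification

# Y is a dense indicator matrix: Y[i, j] == 1 means sample i carries label j.
X, Y = make_multilabel_classification(
    n_samples=100,
    n_features=10,
    n_classes=4,
    n_labels=2,  # average number of labels per sample
    random_state=0,
)

print(X.shape, Y.shape)
print(Y[:3])
```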

Biclustering

Function: make_biclusters, generates an array with constant block diagonal structure for biclustering.
Function: make_checkerboard, generates an array with block checkerboard structure for biclustering.
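The sketch below calls both biclustering generators on a small matrix. The shape, cluster counts, and noise level are assumptions made for the example. Each call returns the data plus boolean row and column indicator arrays, one row per bicluster.

```python
from sklearn.datasets import make_biclusters, make_checkerboard

# Block diagonal structure: 3 biclusters, each row and column in exactly one.
X_bi, rows_bi, cols_bi = make_biclusters(
    shape=(30, 20), n_clusters=3, noise=1.0, random_state=0
)

# Checkerboard structure: 4 row clusters x 3 column clusters = 12 biclusters.
X_cb, rows_cb, cols_cb = make_checkerboard(
    shape=(30, 20), n_clusters=(4, 3), noise=1.0, random_state=0
)

print(X_bi.shape, rows_bi.shape, cols_bi.shape)
print(X_cb.shape, rows_cb.shape, cols_cb.shape)
```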

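Regression generators

Function: make_regression, produces regression targets as an optionally sparse random linear combination of random features, with noise.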
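To make the "optionally sparse linear combination" concrete, here is a minimal sketch that asks make_regression to return the ground-truth coefficients as well. The sizes and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression

# Only n_informative features get a non-zero coefficient;
# coef=True also returns the true coefficient vector.
X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=3,
    noise=5.0,
    coef=True,
    random_state=0,
)

print(X.shape, y.shape, coef.shape)
print("non-zero coefficients:", np.count_nonzero(coef))
```

Having the true coefficients available makes it easy to check whether a sparse model such as Lasso recovers the informative features.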

Manifold learning generators

Function: make_s_curve, generates an S-curve dataset.
Function: make_swiss_roll, generates a swiss roll dataset.
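A brief sketch of the two manifold generators is shown below, with sample count and noise chosen arbitrarily for illustration. Both return 3-D points together with the position of each point along the main dimension of the manifold.

```python
from sklearn.datasets import make_s_curve, make_swiss_roll

# X holds the 3-D coordinates; t is the position along the manifold,
# which is handy for coloring a scatter plot.
X_s, t_s = make_s_curve(n_samples=500, noise=0.05, random_state=0)
X_roll, t_roll = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

print(X_s.shape, t_s.shape)
print(X_roll.shape, t_roll.shape)
```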

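6. Downloading public datasets: openml.org

openml.org is a public repository for machine learning data and experiments that allows anyone to upload open datasets.
Function: sklearn.datasets.fetch_openml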

>>> from sklearn.datasets import fetch_openml
>>> mice = fetch_openml(name='miceprotein', version=4)

# Inspect the dataset's information and attributes
# DESCR: free-text description of the data
# details: metadata in dictionary form
>>> print(mice.DESCR)
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
Syndrome. PLoS ONE 10(6): e0129126...

>>> mice.details
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}

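7. Loading external datasets

When a dataset has already been prepared, load it yourself and feed it to the model.
Available toolkits: pandas.io, scipy.io, numpy
Miscellaneous data (images, audio): skimage.io, imageio, scipy.misc.imread (removed in SciPy 1.2; imageio.imread is the usual replacement), scipy.io.wavfile.read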
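As one possible sketch of the numpy route, the snippet below parses a small CSV and splits it into a feature matrix and a label vector in the (n_samples, n_features) layout that scikit-learn estimators expect. The CSV content and column names are hypothetical and are held in memory here so the example runs without a file on disk.

```python
import io

import numpy as np

# Hypothetical CSV content standing in for a real file.
csv_text = """f1,f2,label
5.1,3.5,0
4.9,3.0,0
6.2,2.9,1
5.9,3.0,1
"""

# skiprows=1 drops the header line; the last column holds the label.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
X = data[:, :-1]
y = data[:, -1].astype(int)

print(X.shape, y.tolist())
```

For a real file, a path can be passed to np.loadtxt in place of the StringIO object; pandas.read_csv is an alternative when the columns have mixed types.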

References

Dataset loading utilities @ sklearn Chinese documentation

If you link this blog, please refer to this page, thanks!
Post link：https://tsinghua-gongjing.github.io/posts/sklearn_dataset.html
