scikit-learn toy datasets
scikit-learn provides some built-in datasets that are useful for prototyping because they train quickly and offer different levels of complexity. They're all available in the sklearn.datasets package and share a common structure: the data attribute contains the whole input set X, while the target attribute contains the labels for classification or the target values for regression. For example, for the Boston house pricing dataset (used for regression), we have the following:
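# load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so this listing needs an earlier release
from sklearn.datasets import load_boston
boston = load_boston()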
X = boston.data
Y = boston.target
print(X.shape)
(506, 13)
print(Y.shape)
(506,)
In this case, we have 506 samples with 13 features and a single target value. In this book, we're going to use it for regression and the handwritten digit dataset (load_digits(), a small MNIST-like collection of 8 × 8 images) for classification tasks. scikit-learn also provides functions for creating dummy datasets from scratch: make_classification(), make_regression(), and make_blobs() (which is particularly useful for testing clustering algorithms). They're very easy to use and, in many cases, they're the best choice for testing a model without loading more complex datasets.
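As a minimal sketch of the three generator functions mentioned above, the following snippet creates one small dataset of each kind. The sample counts, feature counts, noise level, and random_state value are arbitrary choices made for illustration, not values taken from the text:

```python
from sklearn.datasets import make_classification, make_regression, make_blobs

# Binary classification problem: 200 samples, 5 features (3 of them informative)
Xc, Yc = make_classification(n_samples=200, n_features=5, n_informative=3,
                             n_redundant=0, random_state=1000)

# Regression problem: 200 samples, 5 features, Gaussian noise added to the target
Xr, Yr = make_regression(n_samples=200, n_features=5, noise=10.0,
                         random_state=1000)

# Three isotropic Gaussian blobs in two dimensions, handy for clustering tests
Xb, Yb = make_blobs(n_samples=200, n_features=2, centers=3,
                    random_state=1000)

print(Xc.shape, Yc.shape)
print(Xr.shape, Yr.shape)
print(Xb.shape, Yb.shape)
```

Each function returns the input set and the corresponding targets as a pair of NumPy arrays, so the result can be passed to a model in the same way as the data and target attributes of a built-in dataset.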
The digit dataset provided by scikit-learn is limited: it contains only 1,797 images of 8 × 8 pixels. If you want to experiment with the original MNIST dataset, refer to the website managed by Y. LeCun, C. Cortes, and C. Burges: http://yann.lecun.com/exdb/mnist/. There, you can download the full version, made up of 70,000 handwritten digits that are already split into training and test sets.
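To get a feel for how small the bundled digit dataset is compared with the full 70,000-sample version, it can be inspected in the same way as the Boston example. This is only an illustrative sketch; the variable names are our own:

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Flattened inputs: one row per image, one column per pixel
X = digits.data
# Class labels (the digits 0 to 9)
Y = digits.target

print(X.shape)              # (1797, 64)
print(Y.shape)              # (1797,)

# The same samples are also exposed as 8 x 8 arrays through the images attribute
print(digits.images.shape)  # (1797, 8, 8)
```

Each sample is an 8 × 8 grayscale image flattened into 64 features, which keeps training times short.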