How to create data for n-class problems using Scikit-learn

sklearn.datasets.make_classification(n_samples=100,
                                     n_features=20,
                                     n_informative=2,
                                     n_redundant=2,
                                     n_repeated=0,
                                     n_classes=2,
                                     n_clusters_per_class=2,
                                     weights=None,
                                     flip_y=0.01,
                                     class_sep=1.0,
                                     hypercube=True,
                                     shift=0.0,
                                     scale=1.0,
                                     shuffle=True,
                                     random_state=None)

Parameters

n_samples: This is the number of samples. Its value type is int, and its default value is 100.
n_features: This is the total number of features. It comprises the informative, redundant, and repeated features, plus n_features - n_informative - n_redundant - n_repeated noise features drawn at random. Its value type is int, and its default value is 20.
n_informative: This is the number of informative features. Its value type is int, and its default value is 2.
n_redundant: This is the number of redundant features. These are generated as random linear combinations of the informative features. Its value type is int, and its default value is 2.
n_repeated: This is the number of duplicated features, drawn randomly from the informative and redundant features. Its value type is int, and its default value is 0.
n_classes: This is the number of classes (or labels) of the classification problem. Its value type is int, and its default value is 2.
n_clusters_per_class: This is the number of clusters per class. Its value type is int, and its default value is 2.
weights: This is the proportion of samples assigned to each class. If it's None, the classes are balanced. If only n_classes - 1 values are given, the last class weight is inferred. Its value type is array-like of shape (n_classes,) or (n_classes - 1,), and its default value is None.
flip_y: This is the fraction of samples whose class is assigned randomly. Larger values add noise to the labels and make the classification task harder. Its value type is float, and its default value is 0.01.
class_sep: This is the factor by which the size of the hypercube is multiplied. Larger values spread out the clusters and make the classification task easier. Its value type is float, and its default value is 1.0.
hypercube: This is a boolean value. If it's set to True, the clusters are placed on the vertices of a hypercube. If it's set to False, the clusters are placed on the vertices of a random polytope. Its default value is True.
shift: This shifts the features by the specified value. If it's None, the features are shifted by a random value drawn from [-class_sep, class_sep]. Its value type is float, an array of shape (n_features,), or None, and its default value is 0.0.
scale: This multiplies the features by the specified value. If it's None, the features are scaled by a random value drawn from [1, 100]. Scaling is applied after shifting. Its value type is float, an array of shape (n_features,), or None, and its default value is 1.0.
shuffle: This shuffles the samples and the features. Its value type is bool, and its default value is True.
random_state: This controls the random number generation used to create the dataset. Pass an int to get reproducible output across calls. Its value type is int, a RandomState instance, or None, and its default value is None.
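
To see how n_informative, n_redundant, and n_repeated divide up the columns, here is a minimal sketch that disables shuffling so the features keep their generation order (informative first, then redundant, then repeated, then noise). The sizes and variable names are illustrative assumptions, and shift and scale are left at their defaults so the linear relationships are preserved.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative sizes: 3 informative + 2 redundant + 1 repeated + 2 noise = 8
X, y = make_classification(n_samples=200,
                           n_features=8,
                           n_informative=3,
                           n_redundant=2,
                           n_repeated=1,
                           shuffle=False,   # keep columns in generation order
                           random_state=0)

informative = X[:, :3]
redundant = X[:, 3:5]
repeated = X[:, 5]

# Redundant columns are linear combinations of the informative ones,
# so stacking them together should not raise the rank above 3.
rank = np.linalg.matrix_rank(np.hstack([informative, redundant]))
print("Rank of informative + redundant columns:", rank)

# The repeated column is a copy of one of the first five columns.
is_copy = any(np.array_equal(repeated, X[:, j]) for j in range(5))
print("Repeated column duplicates an earlier column:", is_copy)
```

With shuffle=True (the default), the same kinds of columns are present but appear in a random order.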

Return values

The function returns the following two values:

X: This contains the generated samples as a NumPy array of shape (n_samples, n_features).
y: This contains the integer class label of each sample as a NumPy array of shape (n_samples,).
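
As a quick check of the two returned arrays, the sketch below calls the function with its default parameters (only random_state is set, as an illustrative choice) and inspects the shapes and the distinct labels.

```python
import numpy as np
from sklearn.datasets import make_classification

# All parameters at their defaults except random_state (illustrative choice)
X, y = make_classification(random_state=0)

print(X.shape)       # shape is (n_samples, n_features)
print(y.shape)       # shape is (n_samples,)
print(np.unique(y))  # distinct class labels present in y
```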

Example

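The code snippet below uses the make_classification() function to generate 100 samples with 10 informative features and two classes weighted roughly 30% and 70%, then prints the first five rows of the features and targets.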

# import library
from sklearn.datasets import make_classification
# create features and target
features, target = make_classification(n_samples=100,
                                      n_features=10,
                                      n_informative=10,
                                      n_redundant=0,
                                      n_classes=2,
                                      weights=[0.3, 0.7],
                                      random_state=42)
# print features and target
print("Features:")
print(features[:5])
print("Targets:")
print(target[:5])
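
The example above uses two classes. For an n-class problem, a sketch along the following lines may help. The parameter values are illustrative assumptions. Note that scikit-learn requires n_classes * n_clusters_per_class to be at most 2 ** n_informative, so n_informative must be large enough for the number of classes requested.

```python
import numpy as np
from sklearn.datasets import make_classification

# Illustrative values: 3 classes with roughly 20% / 30% / 50% of the samples
X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_redundant=1,
                           n_classes=3,
                           n_clusters_per_class=1,  # 3 * 1 <= 2 ** 3
                           weights=[0.2, 0.3, 0.5],
                           flip_y=0.0,     # no label noise, so counts follow the weights
                           class_sep=2.0,  # spread the classes further apart
                           random_state=42)

# Count how many samples ended up in each class
counts = np.bincount(y)
print("Samples per class:", counts)
```

Raising flip_y above 0 reassigns some labels at random, so the observed class proportions will then drift from the requested weights.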
