What is sklearn.datasets load_svmlight_file() in Python?
Overview
load_svmlight_file() loads a dataset stored in the svmlight/libsvm format into a sparse matrix. This text-based format suits sparse datasets because it does not store zero-valued features. Each line describes one sample, and the first element in each line stores the target value to predict.
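When we repeatedly work on the same dataset, parsing the text file on every run is slow, so it is better to cache the loaded result with the joblib library (for example, by wrapping the loader with joblib.Memory.cache). Both svmlight and libsvm use the following line format by default:
<label> <feature-index>:<feature-value> <feature-index>:<feature-value> ...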
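As a rough illustration of the format and of the caching advice above, the sketch below writes a tiny file and loads it through a joblib cache. The sample values, the file name sample.svmlight, and the helper get_data are made up for this example; they are not part of the original article.

```python
import os
import tempfile

from joblib import Memory
from sklearn.datasets import load_svmlight_file

# Hypothetical sample data: "<label> <index>:<value> ..." with one-based indices.
# Zero-valued features are simply left out of each line.
SAMPLE = (
    "1 1:0.5 3:2.0\n"
    "-1 2:1.5\n"
    "1 1:1.0 2:3.0 3:4.0\n"
)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.svmlight")
with open(path, "w") as fh:
    fh.write(SAMPLE)

# Store parsed results on disk so repeated calls skip the text parsing step.
memory = Memory(os.path.join(workdir, "cache"), verbose=0)


@memory.cache
def get_data(file_path):
    """Hypothetical helper: load one svmlight file and return (X, y)."""
    return load_svmlight_file(file_path)


X, y = get_data(path)    # first call parses the file
X2, y2 = get_data(path)  # second call is answered from the cache

dense = X.toarray()
print(X.shape)
print(dense)
print(y)
```

Note that the cache key here depends on the path argument, not on the file contents, so the cache should be cleared if the file changes.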
Syntax
sklearn.datasets.load_svmlight_file(
    f,
    *,
    n_features=None,
    dtype=<class 'numpy.float64'>,
    multilabel=False,
    zero_based='auto',
    query_id=False,
    offset=0,
    length=-1
)
Parameters
f: The path of the file to load. A file-like object opened in binary mode or a file descriptor is also accepted.
n_features: The number of features to use. If None, it is inferred from the data.
dtype: The data type of the dataset to be loaded.
multilabel: Set to True if each sample may have several labels.
zero_based: Indicates whether the column indices in the file are zero-based (True) or one-based (False). One-based indices are converted to zero-based. The default, 'auto', guesses from the file contents.
query_id: If True, the function also returns the query_id array for each file.
offset: Skips the first offset bytes by seeking forward, then discards the following bytes up to the next newline character.
length: If strictly positive, stops reading new lines once the position in the file reaches the (offset + length) bytes threshold.
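To make the effect of zero_based and n_features more concrete, here is a small sketch that loads the same in-memory data three ways. The two sample lines are invented for illustration, and io.BytesIO stands in for a real file on disk.

```python
import io

from sklearn.datasets import load_svmlight_file

# Hypothetical two-sample dataset whose smallest feature index is 1.
DATA = b"1 1:0.5 3:2.0\n0 2:1.5\n"

# zero_based='auto' (the default): no index 0 appears, so the indices are
# treated as one-based and shifted down, giving 3 columns.
X_auto, _ = load_svmlight_file(io.BytesIO(DATA))

# zero_based=True: indices are used as-is, so column 0 exists but stays empty.
X_zero, _ = load_svmlight_file(io.BytesIO(DATA), zero_based=True)

# n_features=6: force a wider matrix, e.g. to match another split of the data.
X_wide, _ = load_svmlight_file(io.BytesIO(DATA), n_features=6)

print(X_auto.shape)  # (2, 3)
print(X_zero.shape)  # (2, 4)
print(X_wide.shape)  # (2, 6)
```

Setting n_features explicitly is useful when a training file and a test file do not mention the same highest feature index.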
Return value
X: scipy.sparse matrix of n_samples * n_features dimensions.
Y: ndarray of shape n_samples, or, in the multilabel case, a list of tuples of length n_samples.
query_id: The query ID for each sample. This optional value is only returned when query_id is set to True.
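The three return values can be seen together in a short sketch like the one below. The ranking-style sample lines, including their qid values, are assumptions made up for this example.

```python
import io

from scipy import sparse
from sklearn.datasets import load_svmlight_file

# Hypothetical ranking data: each line carries a "qid:<n>" token after the label.
RANKING = (
    b"2 qid:1 1:0.7 2:0.1\n"
    b"1 qid:1 2:0.4\n"
    b"0 qid:2 1:0.3 3:0.9\n"
)

# With query_id=True the function returns a third array of query IDs.
X, y, qid = load_svmlight_file(io.BytesIO(RANKING), query_id=True)

print(sparse.issparse(X), X.shape)  # sparse feature matrix
print(y)                            # one target per sample
print(qid)                          # one query ID per sample
```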
Code
In line 3, we use the load_svmlight_file() function to load data.csv and get the sparse feature matrix X and the multilabel targets Y.
main.py
from sklearn.datasets import load_svmlight_file
# Load data in svmlight format from the data.csv file
X, Y = load_svmlight_file("data.csv", multilabel=True, zero_based=True)
print("Sparse matrix of n_samples * n_features dimensions: ")
print(X)  # X value
print("ndarray or multilabel list of tuples: ")
print(Y)  # Y value
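The contents of data.csv are not shown here, so the following self-contained sketch creates a stand-in file first. The two sample lines are assumptions chosen only to show what a multilabel file might look like (comma-separated labels, zero-based indices); the real data.csv may differ.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Hypothetical stand-in for data.csv: the first sample has labels 0 and 2,
# the second has label 1. Feature indices are zero-based.
CONTENT = (
    "0,2 0:1.0 3:2.5\n"
    "1 1:4.0\n"
)

path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as fh:
    fh.write(CONTENT)

X, Y = load_svmlight_file(path, multilabel=True, zero_based=True)

print(X.shape)  # (2, 4)
print(X)        # stored (row, column) entries and their values
print(Y)        # list with one tuple of labels per sample
```

With multilabel=True, Y is a plain Python list of tuples rather than an ndarray, so it usually needs a further encoding step (for example a binarizer) before being passed to an estimator.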