What is sklearn.datasets load_svmlight_file() in Python?

Overview

load_svmlight_file() loads a dataset stored in the svmlight/libsvm text format and returns the features as a SciPy sparse matrix. The format suits sparse datasets because it does not store features whose value is zero. The first element in each line stores the target value to predict.

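When we repeatedly work on the same dataset, it is better to cache the parsed result with the joblib library, because parsing a text file is comparatively slow. In both svmlight and libsvm, each line describes one sample in the following default format: <label> <index>:<value> <index>:<value> ...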
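To make the line format concrete, here is a minimal sketch that parses three hand-written samples from memory instead of from a file. The sample values are made up for illustration, and the indices are assumed to be one-based, so zero_based=False is passed explicitly.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Three illustrative samples: "<label> <index>:<value> ...".
# Indices are one-based here; features with value zero are simply left out.
raw = b"1 1:0.5 3:2.0\n0 2:1.5\n1 1:1.0 2:3.0 3:4.0\n"

# A file-like object must be opened in binary mode, hence BytesIO.
X, y = load_svmlight_file(BytesIO(raw), zero_based=False)

print(X.shape)      # (3, 3)
print(X.toarray())  # dense view, with the omitted features shown as zeros
print(y)            # the first element of each line
```

Only six of the nine cells are stored, which is why the result is a sparse matrix rather than a dense array.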

Syntax

sklearn.datasets.load_svmlight_file(
  f,
  *,
  n_features=None,
  dtype=<class 'numpy.float64'>,
  multilabel=False,
  zero_based='auto',
  query_id=False,
  offset=0,
  length=-1
)

Parameters

  • f: The file to load, given as a path, a file-like object opened in binary mode, or an integer file descriptor.
  • n_features: The number of features to use. If None, it is inferred from the largest column index found in the file.
  • dtype: The data type of the returned arrays X and Y.
  • multilabel: Set to True if a sample may carry several labels.
  • zero_based: States whether the column indices in f are zero-based (True) or one-based (False). One-based indices are shifted to zero-based. The default 'auto' decides with a heuristic check on the file contents.
  • query_id: If True, the function also returns the query_id array for each file.
  • offset: Skips the first offset bytes of the file by seeking forward, then discards the following bytes up to the next newline character.
  • length: If strictly positive, stops reading new lines once the position in the file reaches the offset + length bytes threshold.
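The n_features parameter matters most when two files must produce matrices of the same width. The sketch below uses made-up in-memory data with one-based indices and shows how the inferred width of a smaller file can differ from the training set, and how passing n_features aligns them.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Illustrative data: the training file uses feature indices up to 4,
# while the test file only happens to mention index 2.
train = b"1 1:1.0 4:2.0\n0 2:3.0\n"
test = b"1 2:5.0\n"

X_train, y_train = load_svmlight_file(BytesIO(train), zero_based=False)

# Without n_features, the width is inferred from the largest index seen.
X_inferred, _ = load_svmlight_file(BytesIO(test), zero_based=False)

# With n_features, the test matrix gets the same width as the training one.
X_test, y_test = load_svmlight_file(
    BytesIO(test), n_features=X_train.shape[1], zero_based=False
)

print(X_train.shape)     # (2, 4)
print(X_inferred.shape)  # (1, 2)
print(X_test.shape)      # (1, 4)
```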

Return value

  • X: scipy.sparse matrix of shape (n_samples, n_features).
  • Y: ndarray of shape (n_samples,), or, in the multilabel case, a list of tuples of length n_samples.
  • query_id: Array of shape (n_samples,) holding the query ID for each sample. This optional value is only returned when query_id is set to True.
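As a rough illustration of all three return values, the sketch below parses made-up ranking-style data in which each line carries comma-separated labels and a qid:<id> token. The data and the one-based indexing are assumptions chosen for the example.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Illustrative samples: "<label>,<label> qid:<id> <index>:<value> ..."
raw = b"1,2 qid:1 1:0.5\n3 qid:1 2:1.0\n2,3 qid:2 1:2.0 2:0.5\n"

X, y, qid = load_svmlight_file(
    BytesIO(raw), multilabel=True, query_id=True, zero_based=False
)

print(X.shape)  # (3, 2)
print(y)        # a list with one tuple of labels per sample
print(qid)      # one query ID per sample
```

With multilabel=False the second element would instead be a one-dimensional ndarray, and with query_id=False the third element would not be returned at all.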

Code

In line 3, we use the load_svmlight_file() method to load data.csv and get the sparse matrix of features together with the labels. Because multilabel is set to True, the labels come back as a list of tuples, one per sample.

from sklearn.datasets import load_svmlight_file
# Load data.csv, which is expected to contain svmlight/libsvm-formatted text
X, Y = load_svmlight_file("data.csv", multilabel=True, zero_based=True)
print("Sparse matrix of shape (n_samples, n_features):")
print(X)  # stored (row, column) entries and their values
print("Labels, one tuple per sample because multilabel=True:")
print(Y)
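The overview recommends joblib when the same dataset is loaded repeatedly. A minimal sketch of that idea follows, assuming joblib.Memory as the caching mechanism. The file name sample.svmlight, the temporary directories, and the helper get_data() are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
from joblib import Memory
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "sample.svmlight")  # hypothetical file name

# Write a tiny dataset in svmlight format so there is something to load.
dump_svmlight_file(
    np.array([[0.0, 1.5], [2.0, 0.0]]), np.array([1, 0]), data_path, zero_based=True
)

# Results of get_data() are stored on disk under this cache directory.
memory = Memory(os.path.join(workdir, "cache"), verbose=0)


@memory.cache
def get_data(path):
    """Parse the svmlight file; joblib reuses the stored result on later calls."""
    X, y = load_svmlight_file(path, zero_based=True)
    return X, y


X1, y1 = get_data(data_path)  # first call parses the text file
X2, y2 = get_data(data_path)  # second call is served from the cache

print(X1.toarray())
print(y1)
```

The cache is keyed on the function arguments, so if the file changes on disk under the same path, the cache should be cleared (for example with memory.clear()) before reloading.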
