Loading Data in Scikit-Learn
Before you can train a model, you need to get your data into a format that Scikit-Learn understands. Scikit-Learn works primarily with NumPy arrays or Pandas DataFrames, but it also provides built-in tools to help you get started quickly.
1. The Scikit-Learn Data Format
Regardless of how you load your data, Scikit-Learn expects two main components:
- The Feature Matrix (X): A 2D array of shape (n_samples, n_features).
- The Target Vector (y): A 1D array of shape (n_samples,) containing the labels or values you want to predict.
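To make the two shapes concrete, here is a minimal sketch using made-up numbers (the values and variable names are purely illustrative). It also covers the common case of a single feature stored as a 1D array, which needs reshaping into a 2D column before it can serve as a feature matrix.

```python
import numpy as np

# Feature matrix: 3 samples (rows) x 2 features (columns). Values are made up.
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

# Target vector: one label per sample.
y = np.array([0, 0, 1])

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)

# A single feature stored as a 1D array must become a 2D column.
single_feature = np.array([5.1, 4.9, 6.2])
X_single = single_feature.reshape(-1, 1)
print(X_single.shape)  # (3, 1)
```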
2. Built-in "Toy" Datasets
Scikit-Learn comes bundled with small datasets that require no internet connection. These are perfect for testing your code or learning new algorithms.
- load_iris(): Classic classification dataset (flowers).
- load_diabetes(): Regression dataset.
- load_digits(): Classification dataset (handwritten digits).
from sklearn.datasets import load_iris
# Load the dataset
iris = load_iris()
# Access data and labels
X = iris.data
y = iris.target
print(f"Features: {iris.feature_names}")
print(f"Target Names: {iris.target_names}")
3. Fetching Large Real-World Datasets
For larger datasets, Scikit-Learn provides "fetchers" that download data from the internet on first use and cache it locally, by default in your ~/scikit_learn_data folder.
- fetch_california_housing(): Predict median house values.
- fetch_20newsgroups(): Text dataset for NLP.
- fetch_lfw_people(): Labeled Faces in the Wild (for face recognition).
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(f"Dataset shape: {housing.data.shape}")
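The cache location can be inspected or redirected. The sketch below uses get_data_home with a temporary directory standing in for a custom cache folder (the temporary path and the "sk_cache" name are assumptions for illustration). Fetchers such as fetch_california_housing accept the same data_home argument.

```python
import os
import tempfile

from sklearn.datasets import get_data_home

# Hypothetical custom cache folder; a temporary directory is used here
# so the example does not touch your real home directory.
custom_cache = os.path.join(tempfile.mkdtemp(), "sk_cache")

# get_data_home returns the resolved path and creates the folder if needed.
cache_dir = get_data_home(data_home=custom_cache)
print(cache_dir)

# The same path can be passed to a fetcher (this call would download data):
# housing = fetch_california_housing(data_home=cache_dir)
```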
4. Loading from External Sources
In a professional environment, you will rarely use the built-in datasets. You will likely load data from CSVs, SQL Databases, or Pandas DataFrames.
From Pandas to Scikit-Learn
Scikit-Learn is designed to be "Pandas-friendly." You can pass DataFrames directly into models.
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load your own CSV
df = pd.read_csv('my_data.csv')
# Split into X and y
X = df[['feature1', 'feature2']] # Select specific columns
y = df['target_column']
# Train model directly
model = LinearRegression().fit(X, y)
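The same pattern applies to SQL sources. As a rough sketch, the example below builds a throwaway in-memory SQLite table (the table name, column names, and values are invented for illustration) and reads it into a DataFrame with pd.read_sql_query before fitting.

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LinearRegression

# Build a throwaway in-memory database; names and values are made up.
# By construction, target_column = feature1 + 2 * feature2.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (feature1 REAL, feature2 REAL, target_column REAL)"
)
rows = [(1.0, 2.0, 5.0), (2.0, 1.0, 4.0), (3.0, 4.0, 11.0), (4.0, 3.0, 10.0)]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

# Read the query result straight into a DataFrame.
df = pd.read_sql_query("SELECT * FROM measurements", conn)
conn.close()

# Split into X and y exactly as with a CSV-backed DataFrame.
X = df[["feature1", "feature2"]]
y = df["target_column"]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```

With a real database you would swap the in-memory connection for your own connection object and query; everything after the read stays the same.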
5. Generating Synthetic Data
Sometimes you need to create "fake" data to test how an algorithm handles specific scenarios (like high noise or non-linear patterns).
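from sklearn.datasets import make_blobs, make_moons
# Create 3 distinct clusters for a classification task
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
# Create two interleaving half-circles (a non-linear pattern) with some noise
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=42)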
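One way to probe how an algorithm copes with noise is to generate the same pattern at two noise levels and compare scores. The sketch below is only an illustration: the choice of a k-nearest-neighbours classifier, the noise values, and the helper name moons_accuracy are all assumptions, not part of the original text.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def moons_accuracy(noise):
    """Fit a k-NN classifier on two-moons data and return test accuracy."""
    X, y = make_moons(n_samples=400, noise=noise, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return clf.score(X_test, y_test)


# Same generator, two noise levels: nearly clean vs. heavily overlapping.
acc_low = moons_accuracy(noise=0.05)
acc_high = moons_accuracy(noise=0.5)
print(f"low noise: {acc_low:.2f}, high noise: {acc_high:.2f}")
```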
6. Understanding the "Bunch" Object
When you use load_* or fetch_*, Scikit-Learn returns a Bunch object. This is essentially a dictionary that contains:
- .data: The feature matrix.
- .target: The labels.
- .feature_names: The names of the columns.
- .DESCR: A full text description of where the data came from.
Many loaders accept as_frame=True, which returns the data as Pandas objects. The combined DataFrame (features plus a target column) is then available as .frame: data = load_iris(as_frame=True).frame
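A short sketch of the access patterns, using the bundled iris data so no download is needed. It shows dictionary-style and attribute-style access side by side, plus the as_frame and return_X_y options of load_iris.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Keys can be read like a dictionary or like attributes.
print(list(iris.keys()))
print(iris["data"].shape)   # same array as iris.data
print(iris.DESCR[:200])     # first part of the text description

# as_frame=True: .frame holds the features plus a "target" column.
frame = load_iris(as_frame=True).frame
print(frame.shape)

# return_X_y=True skips the Bunch and returns the two arrays directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```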
References for More Details
- Sklearn Dataset Loading Guide: Exploring all 20+ available fetchers and loaders.
- OpenML Integration: Accessing thousands of community-uploaded datasets via fetch_openml.
Now that you can load data, the next step is to ensure it's in the right shape and split correctly for training and testing.