The success of a neural network largely depends on the quality and structure of its data. In this example, we’ll focus on preparing and visualizing data for a neural network designed to classify points into two categories based on their coordinates.
Step 1: Generate Synthetic Data
Synthetic data simplifies experimentation because the relationship between features and labels is known exactly, so results are easy to verify.
import numpy as np
# Generate random points with two features
np.random.seed(42) # For reproducibility
X = np.random.rand(1000, 2) # 1,000 samples, 2 features, uniform in [0, 1)
# Define the labels based on a rule (e.g., sum of coordinates > 1)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
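Before moving on, it is worth confirming that the labeling rule produces roughly balanced classes. A minimal sanity check, assuming the code above has already run:
# Count how many points fall on each side of the x1 + x2 = 1 line
print(np.bincount(y)) # roughly 500 points per class for uniform samples
print(X[:3], y[:3]) # peek at the first few samples and their labels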
Step 2: Split Data into Training and Testing Sets
Dividing data into training and testing subsets ensures the model is evaluated on unseen data.
from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
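A quick check of the resulting shapes confirms the 80/20 split. If the classes were imbalanced, passing stratify=y to train_test_split would preserve the class ratio in both subsets; with this roughly balanced data it is optional. A minimal sketch:
# Verify the split: 800 training and 200 test samples
print(X_train.shape, X_test.shape) # (800, 2) (200, 2)
print(y_train.mean(), y_test.mean()) # fraction of Class 1 in each subset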
Step 3: Visualize the Data
Data visualization helps in understanding patterns and relationships, which is crucial before model training.
import matplotlib.pyplot as plt
# Plot the data points
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class 0", color='blue', alpha=0.6)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class 1", color='red', alpha=0.6)
plt.axline((0.5, 0.5), slope=-1, color='green', linestyle='--', label="Decision Boundary (x1 + x2 = 1)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Visualization of Synthetic Data")
plt.show()
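An equivalent, more compact option is to pass the labels directly through the c argument of a single scatter call; this is purely a stylistic alternative to the two calls above:
# Same plot with one scatter call, colored by label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.6)
plt.colorbar(label="Class")
plt.show()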
Step 4: Normalize the Data
Normalization scales features to a common range, which typically speeds up convergence for gradient-based optimizers. Here the features are already uniform in [0, 1], so the scaler changes little, but the step demonstrates the standard workflow: fit the scaler on the training set only, then apply it to both sets to avoid data leakage.
from sklearn.preprocessing import MinMaxScaler
# Normalize features to the range [0, 1]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
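Inspecting the fitted parameters confirms that the scaler learned bounds close to 0 and 1. Note that the transformed test set can fall marginally outside [0, 1], since the bounds come from the training set alone:
# Inspect what the scaler learned from the training set
print(scaler.data_min_) # per-feature minimums, close to 0
print(scaler.data_max_) # per-feature maximums, close to 1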
Step 5: Convert Data to TensorFlow Tensors
Keras models accept NumPy arrays directly, but converting the data to tensors up front makes the dtypes explicit (float32 is TensorFlow's default) and avoids repeated implicit conversion during training.
import tensorflow as tf
# Convert numpy arrays to TensorFlow tensors
X_train_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_test_tf = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_test_tf = tf.convert_to_tensor(y_test, dtype=tf.float32)
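For larger datasets, a common next step is to wrap the tensors in a tf.data.Dataset so batching and shuffling are handled by the input pipeline. A minimal sketch (the batch size of 32 is an arbitrary choice here):
# Build a shuffled, batched input pipeline from the tensors
train_ds = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf))
train_ds = train_ds.shuffle(buffer_size=800).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((X_test_tf, y_test_tf)).batch(32)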
Code Overview
Here’s the full code for preparing and visualizing the data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
# Step 1: Generate data
np.random.seed(42)
X = np.random.rand(1000, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
# Step 2: Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Visualize data
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class 0", color='blue', alpha=0.6)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class 1", color='red', alpha=0.6)
plt.axline((0.5, 0.5), slope=-1, color='green', linestyle='--', label="Decision Boundary (x1 + x2 = 1)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Visualization of Synthetic Data")
plt.show()
# Step 4: Normalize data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 5: Convert to TensorFlow tensors
X_train_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_test_tf = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_test_tf = tf.convert_to_tensor(y_test, dtype=tf.float32)
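Running the full script should leave the tensors with the shapes below; a quick check before handing them to a model:
# Final sanity check on shapes and dtypes
print(X_train_tf.shape, X_train_tf.dtype) # (800, 2) float32
print(y_train_tf.shape, y_train_tf.dtype) # (800,) float32
print(X_test_tf.shape, y_test_tf.shape) # (200, 2) (200,)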
Key Insights
- Data Distribution: Inspect how the classes are distributed before training; plotting reveals whether a simple boundary separates them.
- Normalization: Scale features for models trained with gradient-based optimizers, and fit the scaler on the training set only to avoid data leakage.
- Test Set: Hold out a portion of the data so performance is measured on examples the model has never seen.