The success of a neural network largely depends on the quality and structure of its data. In this example, we’ll focus on preparing and visualizing data for a neural network designed to classify points into two categories based on their coordinates.
Step 1: Generate Synthetic Data
Synthetic data simplifies experimentation because the relationship between features and labels is known exactly, so results are easy to verify.
import numpy as np
# Generate random points with two features
np.random.seed(42) # For reproducibility
X = np.random.rand(1000, 2) # 1,000 samples, 2 features, uniform in [0, 1)
# Define the labels based on a rule (e.g., sum of coordinates > 1)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
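Before moving on, it is worth confirming that the labeling rule produces roughly balanced classes. A minimal sanity check, assuming the code above has already run:
# Count how many points fall on each side of the x1 + x2 = 1 line
print(np.bincount(y)) # roughly 500 points per class for uniform samples
print(X[:3], y[:3]) # peek at the first few samples and their labels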
Step 2: Split Data into Training and Testing Sets
Dividing data into training and testing subsets ensures the model is evaluated on unseen data.
from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
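A quick check of the resulting shapes confirms the 80/20 split. If the classes were imbalanced, passing stratify=y to train_test_split would preserve the class ratio in both subsets; with this roughly balanced data it is optional. A minimal sketch:
# Verify the split: 800 training and 200 test samples
print(X_train.shape, X_test.shape) # (800, 2) (200, 2)
print(y_train.mean(), y_test.mean()) # fraction of Class 1 in each subset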
Step 3: Visualize the Data
Data visualization helps in understanding patterns and relationships, which is crucial before model training.
import matplotlib.pyplot as plt
# Plot the data points
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class 0", color='blue', alpha=0.6)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class 1", color='red', alpha=0.6)
plt.axline((0.5, 0.5), slope=-1, color='green', linestyle='--', label="Decision Boundary (x1 + x2 = 1)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Visualization of Synthetic Data")
plt.show()
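An equivalent, more compact option is to pass the labels directly through the c argument of a single scatter call; this is purely a stylistic alternative to the two calls above:
# Same plot with one scatter call, colored by label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.6)
plt.colorbar(label="Class")
plt.show()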
Step 4: Normalize the Data
Normalization scales features to a common range, which typically speeds up convergence for gradient-based optimizers. Here the features are already uniform in [0, 1], so the scaler changes little, but the step demonstrates the standard workflow: fit the scaler on the training set only, then apply it to both sets to avoid data leakage.
from sklearn.preprocessing import MinMaxScaler
# Normalize features to the range [0, 1]
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
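Inspecting the fitted parameters confirms that the scaler learned bounds close to 0 and 1. Note that the transformed test set can fall marginally outside [0, 1], since the bounds come from the training set alone:
# Inspect what the scaler learned from the training set
print(scaler.data_min_) # per-feature minimums, close to 0
print(scaler.data_max_) # per-feature maximums, close to 1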
Step 5: Convert Data to TensorFlow Tensors
Keras models accept NumPy arrays directly, but converting the data to tensors up front makes the dtypes explicit (float32 is TensorFlow's default) and avoids repeated implicit conversion during training.
import tensorflow as tf
# Convert numpy arrays to TensorFlow tensors
X_train_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_test_tf = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_test_tf = tf.convert_to_tensor(y_test, dtype=tf.float32)
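For larger datasets, a common next step is to wrap the tensors in a tf.data.Dataset so batching and shuffling are handled by the input pipeline. A minimal sketch (the batch size of 32 is an arbitrary choice here):
# Build a shuffled, batched input pipeline from the tensors
train_ds = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf))
train_ds = train_ds.shuffle(buffer_size=800).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((X_test_tf, y_test_tf)).batch(32)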
Code Overview
Here’s the full code for preparing and visualizing the data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
# Step 1: Generate data
np.random.seed(42)
X = np.random.rand(1000, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
# Step 2: Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Visualize data
plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class 0", color='blue', alpha=0.6)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class 1", color='red', alpha=0.6)
plt.axline((0.5, 0.5), slope=-1, color='green', linestyle='--', label="Decision Boundary (x1 + x2 = 1)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Visualization of Synthetic Data")
plt.show()
# Step 4: Normalize data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 5: Convert to TensorFlow tensors
X_train_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_test_tf = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_test_tf = tf.convert_to_tensor(y_test, dtype=tf.float32)
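Running the full script should leave the tensors with the shapes below; a quick check before handing them to a model:
# Final sanity check on shapes and dtypes
print(X_train_tf.shape, X_train_tf.dtype) # (800, 2) float32
print(y_train_tf.shape, y_train_tf.dtype) # (800,) float32
print(X_test_tf.shape, y_test_tf.shape) # (200, 2) (200,)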
Key Insights
- Data Distribution: Inspect how the classes are distributed before training; plotting reveals whether a simple boundary separates them.
- Normalization: Scale features for models trained with gradient-based optimizers, and fit the scaler on the training set only to avoid data leakage.
- Test Set: Hold out a portion of the data so performance is measured on examples the model has never seen.