In this blog, we'll explore how to use TensorFlow to create a simple regression model that predicts housing prices using the Boston Housing dataset. We'll walk through data preprocessing, model building, training, and evaluation.
Understanding the Boston Housing Dataset
The Boston Housing dataset is one of the most famous datasets in the machine learning community. It contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It is commonly used for regression analysis, where the objective is to predict the median value of owner-occupied homes from features such as the crime rate and the average number of rooms per dwelling. The dataset contains 506 instances with 13 features each, and the target variable is the median value of owner-occupied homes in $1000s. Below is a brief description of each feature, followed by the target:
CRIM: Per capita crime rate by town.
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: Proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
NOX: Nitric oxides concentration (parts per 10 million).
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
RAD: Index of accessibility to radial highways.
TAX: Full-value property tax rate per $10,000.
PTRATIO: Pupil-teacher ratio by town.
B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town.
LSTAT: Percentage of the population classified as lower status.
MEDV: Median value of owner-occupied homes in $1000s (the target variable, not one of the 13 input features).
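The Keras copy of the dataset returns plain NumPy arrays without column names. As a small sketch, assuming the array columns follow the same order as the list above (worth checking against your own copy of the data), a name-to-index lookup can make later code easier to read. `FEATURE_NAMES` and `column_index` are hypothetical helpers for illustration, not part of TensorFlow.

```python
# Feature names in the order listed above.
# Assumption: the array columns returned by the loader follow this same order.
FEATURE_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]


def column_index(name):
    """Return the zero-based column index of a feature name."""
    return FEATURE_NAMES.index(name)


rm_col = column_index("RM")
print(f"RM is column {rm_col} of {len(FEATURE_NAMES)}")
```

With such a lookup, an expression like `train_data[:, column_index("RM")]` would select the rooms-per-dwelling column, provided the ordering assumption holds.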
Exploring the Boston Housing Dataset with TensorFlow in Python
Working with the Boston Housing dataset in TensorFlow is a hands-on way to learn regression with neural networks. The dataset combines socio-economic and geographical features that are used to predict the median value of homes in the Boston, Massachusetts area. With TensorFlow we can preprocess the data, build a predictive model, and evaluate its performance. Along the way we will see how a neural network is trained, why standardizing the inputs matters, and how to interpret the resulting error metrics.
Loading and Preprocessing the Boston Housing Dataset
First, let's load the dataset and perform some basic preprocessing. TensorFlow has the Boston Housing dataset available in its keras.datasets module, making it easy to load the data.
import tensorflow as tf
from tensorflow.keras.datasets import boston_housing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load the dataset
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
# Standardize the data
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
# Print the shape of the data
print(f'Training data shape: {train_data.shape}')
print(f'Test data shape: {test_data.shape}')
Output for the above code:
Training data shape: (404, 13)
Test data shape: (102, 13)
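Standardizing rescales each feature to a mean of 0 and a standard deviation of 1 on the training set, which helps the neural network train more efficiently. The scaler is fitted on the training data only, and the test data is transformed with those same statistics so that no information from the test set leaks into training.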
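To make the standardization step concrete, here is a minimal NumPy sketch of the same computation on made-up toy data. The arrays `toy_train` and `toy_test` are illustrative stand-ins, not the real dataset, and the code mirrors rather than replaces `StandardScaler`.

```python
import numpy as np

# Toy data standing in for the real features (values are made up).
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(8, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(4, 3))

# Statistics come from the training split only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (toy_train - mean) / std
test_scaled = (toy_test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled.mean(axis=0))   # generally not exactly 0
```

The test columns end up only approximately centered, which is expected: they are scaled with the training statistics, not their own.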
Building the Sequential Model
We'll build a simple feedforward neural network using TensorFlow's Keras API. The model has two dense hidden layers with ReLU activations, a dropout layer between them, and a linear output layer.
# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(train_data.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1)  # Output layer for regression
])
# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
In the above code:
Dense(64, activation='relu'): Creates a dense (fully connected) layer with 64 neurons and ReLU activation function.
Dropout(0.5): Randomly sets 50% of the previous layer's outputs to zero on each training step, which helps prevent overfitting. Dropout is inactive during evaluation and prediction.
Dense(1): The output layer has a single neuron since we're predicting a continuous value.
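As a quick sanity check on the architecture, the number of trainable parameters can be worked out by hand. This sketch assumes the 13 input features described earlier; `dense_params` is a hypothetical helper for illustration, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 13  # number of input features in the dataset

layer_params = [
    dense_params(n_features, 64),  # first hidden layer
    0,                             # Dropout has no trainable parameters
    dense_params(64, 64),          # second hidden layer
    dense_params(64, 1),           # output layer
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Calling `model.summary()` on the compiled model is one way to compare these figures against what Keras reports.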
Training the Model
Next, we'll train the model using the training data. We'll also include validation to monitor the model's performance on unseen data.
# Train the model
history = model.fit(train_data, train_targets,
                    epochs=100,
                    validation_split=0.2,
                    batch_size=32,
                    verbose=1)
Output for the above code (final epochs only):
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 22.1980 - mae: 3.5504 - val_loss: 13.1642 - val_mae: 2.6839
Epoch 97/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 19.4921 - mae: 3.3381 - val_loss: 13.5337 - val_mae: 2.7524
Epoch 98/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 25.4868 - mae: 3.5843 - val_loss: 13.4094 - val_mae: 2.7224
Epoch 99/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 20.7868 - mae: 3.4319 - val_loss: 13.5221 - val_mae: 2.7711
Epoch 100/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 19.2941 - mae: 3.2998 - val_loss: 14.3015 - val_mae: 2.8350
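Here, we train the model for 100 epochs with a batch size of 32. Setting validation_split=0.2 holds out the last 20% of the training samples (taken before any shuffling) as validation data, and the model is never trained on them.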
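The "11/11" in the training log is the number of batches per epoch, and it can be reproduced with a little arithmetic. This sketch assumes the split point is the floor of the sample count times (1 - validation_split), which is how the split is understood to work; treat it as an illustration rather than a statement of the library's internals.

```python
import math

n_samples = 404          # training rows reported earlier
validation_split = 0.2
batch_size = 32

# Assumption: the split point is floor(n_samples * (1 - validation_split)),
# with the validation rows taken from the end of the arrays.
n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Under that assumption the model trains on 323 samples per epoch in 11 batches and validates on the remaining 81.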
Evaluating the Model
After training, we can evaluate the model on the test set to see how well it generalizes to new data.
# Evaluate the model
model.evaluate(test_data, test_targets)
Output for the above code:
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 20.8195 - mae: 3.3060
[25.54439926147461, 3.5264275074005127]
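The list returned by evaluate holds the test loss (MSE, about 25.54) followed by the test MAE (about 3.53). The mean absolute error (MAE) tells us how far off our predictions are from the actual values on average. Because the targets are in $1000s, an MAE of 3.53 means the predictions miss by roughly $3,500 on average.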
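The two numbers returned by evaluate can be turned into more readable quantities. The sketch below copies the values from the output above; the RMSE is an extra figure derived from the loss, not something the original run reported.

```python
import math

# Values copied from the evaluate() output shown above.
test_mse, test_mae = 25.54439926147461, 3.5264275074005127

# Targets are in $1000s, so MAE converts to dollars by multiplying by 1000.
mae_dollars = test_mae * 1000

# RMSE puts the squared-error loss back on the same scale as the targets.
test_rmse = math.sqrt(test_mse)

print(f"MAE:  ${mae_dollars:,.0f}")
print(f"RMSE: {test_rmse:.2f} (in $1000s)")
```

The RMSE is larger than the MAE here, which suggests a few test samples carry comparatively large errors, since squaring weights them more heavily.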
Making Predictions
Finally, let's use the trained model to make predictions on the test data.
# Make predictions
predictions = model.predict(test_data)
# Print some predictions
for i in range(5):
    print(f'Predicted value: {predictions[i][0]:.2f}, Actual value: {test_targets[i]:.2f}')
Output for the above code:
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Predicted value: 8.02, Actual value: 7.20
Predicted value: 16.52, Actual value: 18.80
Predicted value: 20.18, Actual value: 19.00
Predicted value: 31.13, Actual value: 27.00
Predicted value: 24.20, Actual value: 22.20
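To connect these printed values back to the metrics, the same errors can be computed by hand. This is a small sketch over only the five pairs shown above, so the results are not comparable to the full test-set figures.

```python
import numpy as np

# The five (predicted, actual) pairs printed above, in $1000s.
predicted = np.array([8.02, 16.52, 20.18, 31.13, 24.20])
actual = np.array([7.20, 18.80, 19.00, 27.00, 22.20])

errors = predicted - actual
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error

print(f"MAE over 5 samples: {mae:.3f}")
print(f"MSE over 5 samples: {mse:.3f}")
```

The fourth sample, with an error of about 4.13, contributes most of the squared error, which illustrates why MSE is more sensitive to individual large misses than MAE.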
Conclusion
In this blog, we've walked through the process of building a simple regression model using TensorFlow to predict housing prices from the Boston Housing dataset. We started by loading and preprocessing the data, then built and trained a neural network, and finally evaluated its performance.
This example demonstrates how easy it is to get started with TensorFlow for regression tasks. The Boston Housing dataset is just one of many datasets available for experimentation, and TensorFlow's powerful yet intuitive API makes it a great tool for both beginners and experts alike.
Whether you're interested in building more complex models or experimenting with different datasets, TensorFlow provides the flexibility and performance to help you achieve your goals in machine learning.