Chapter 3: The Dataset: Generation and Analysis
Welcome to Chapter 3! In this chapter, we will focus on the dataset that we will use to train our neural network. We will start by understanding the type of data we're working with and its purpose. We will then discuss the basic linear function we'll be modeling and the role of noise in making the data realistic. After that, we will walk through the process of creating a synthetic dataset using NumPy. Finally, we will visualize the dataset using Python's Matplotlib library and provide a detailed line-by-line explanation of the Python code used to generate the dataset.
3.1: Understanding the Dataset
In machine learning, a dataset is a collection of examples that are used to train and evaluate a model. Each example in the dataset consists of one or more features and a target. The features are the inputs to the model, and the target is the output that the model is trying to predict.
For our simple neural network, we will use a synthetic dataset. A synthetic dataset is a dataset that is artificially created, rather than collected from real-world observations. The advantage of using a synthetic dataset is that we can control the exact relationship between the features and the target, which makes it easier to understand how the neural network is learning.
Our synthetic dataset will consist of one feature and one target. The feature will be a random number between 0 and 1, and the target will be 10 times the feature plus some random noise. This means that the true relationship between the feature and the target is a simple linear function, f(x) = 10x, where x is the feature.
3.2: Linear Functions and Noise
A linear function is a function that creates a straight line when graphed. In our case, the linear function f(x) = 10x multiplies the input x by 10. This means that if x is 0, the output is 0, and if x is 1, the output is 10. For any value of x between 0 and 1, the output is somewhere between 0 and 10.
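As a small illustration of the noiseless function, the sketch below evaluates f(x) = 10x at the two endpoints and the midpoint of the feature range. The function name f and the array names xs and ys are chosen for this example only.

```python
import numpy as np

def f(x):
    """The noiseless linear function from this section: f(x) = 10x."""
    return 10 * x

# Evaluate f at the endpoints and the midpoint of the feature range [0, 1].
xs = np.array([0.0, 0.5, 1.0])
ys = f(xs)

print(ys.tolist())  # [0.0, 5.0, 10.0]
```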
However, real-world data is rarely perfectly linear. There are usually other factors at play that cause some variation in the data. To simulate this variation, we add some random noise to the target. We generate the noise with the np.random.randn function from NumPy, which draws from a standard normal distribution (mean 0, standard deviation 1), and we scale the result by 0.2. The noise is therefore not confined to a fixed range, but most values fall within a few tenths of 0. This means that the actual relationship between the feature and the target is f(x) = 10x + noise, where noise is a normally distributed random number with mean 0 and standard deviation 0.2.
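A quick check of what np.random.randn returns can be sketched as follows. The sample size of 100,000 and the names raw and noise are choices made for this illustration; the 0.2 scale factor is the one used in Section 3.3.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

raw = np.random.randn(100_000)  # standard normal: mean 0, standard deviation 1
noise = 0.2 * raw               # scaled noise: mean 0, standard deviation 0.2

print(round(float(raw.std()), 2))    # close to 1
print(round(float(noise.std()), 2))  # close to 0.2
print(bool(np.abs(raw).max() > 1))   # True: draws are not confined to [-1, 1]
```

Scaling a normal sample by a constant scales its standard deviation by the same constant, which is why multiplying by 0.2 keeps the noise small relative to a target that spans roughly 0 to 10.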
3.3: Creating a Synthetic Dataset with NumPy
Now that we understand the type of data we're working with and the relationship between the feature and the target, let's create our synthetic dataset. We will use the np.random.rand function from NumPy to generate the feature and the np.random.randn function to generate the noise.
Here's the Python code to create the dataset:
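The listing below is a reconstruction assembled from the line-by-line walkthrough in Section 3.5, so treat it as a sketch rather than the exact original. The final print call is an addition for checking the array shapes.

```python
import numpy as np

np.random.seed(0)  # fix the seed so every run produces the same numbers

# Feature: 100 uniform random values in [0, 1), rounded to 3 decimal places.
X = np.round(np.random.rand(100, 1), 3)

# Target: 10 times the feature plus scaled Gaussian noise, rounded to 3 decimal places.
Y = np.round(10 * X + 0.2 * np.random.randn(100, 1), 3)

print(X.shape, Y.shape)  # (100, 1) (100, 1)
```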
In this code, np.random.rand(100, 1) generates a 2-dimensional array with 100 rows and 1 column, filled with random numbers drawn uniformly from the interval [0, 1). We round these numbers to 3 decimal places using np.round. This is our feature X.
To generate the target Y, we multiply X by 10 and add some random noise. The noise is generated by 0.2 * np.random.randn(100, 1), which creates a 2-dimensional array of 100 random numbers that are normally distributed around 0. Multiplying by 0.2 reduces the standard deviation of the noise from 1 to 0.2, so it doesn't overwhelm the signal from the feature. We round the target to 3 decimal places.
3.4: Visualizing the Dataset
Visualizing the dataset is a crucial step in understanding the data. It can help us see the relationship between the feature and the target, detect any outliers or errors in the data, and get a sense of the data distribution.
We can visualize our dataset using the scatter function from the Matplotlib library, which creates a scatter plot of the feature versus the target. Here's the Python code to create the scatter plot:
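The following is a minimal sketch of such a plot, not the original listing. It uses the same pyplot calls the text describes, wrapped in a hypothetical helper named plot_dataset so the figure can be built without opening a window. The label and title strings are assumptions, and the dataset is rebuilt as described in Section 3.3 to keep the block self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_dataset(X, Y, show=False):
    """Scatter-plot the feature X against the target Y and return the figure."""
    fig = plt.figure()
    plt.scatter(X, Y)
    plt.xlabel("X (feature)")  # label text is an assumption
    plt.ylabel("Y (target)")   # label text is an assumption
    plt.title("Synthetic dataset: Y = 10X + noise")  # title text is an assumption
    if show:
        plt.show()  # displays the plot when a display is available
    return fig

# Rebuild the dataset as described in Section 3.3.
np.random.seed(0)
X = np.round(np.random.rand(100, 1), 3)
Y = np.round(10 * X + 0.2 * np.random.randn(100, 1), 3)

fig = plot_dataset(X, Y)  # pass show=True to display the plot
```

Calling plot_dataset(X, Y, show=True) in an interactive session opens the plot window.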
In this code, plt.scatter(X, Y) creates a scatter plot with the feature X on the x-axis and the target Y on the y-axis. plt.xlabel, plt.ylabel, and plt.title add labels to the x-axis, y-axis, and the plot itself, respectively. plt.show displays the plot.
When you run this code, you should see a scatter plot that shows a clear linear relationship between the feature and the target, with some variation due to the noise.
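The visual impression can also be checked numerically. The sketch below is an optional addition: np.corrcoef and np.polyfit are not used elsewhere in this chapter and are swapped in here only as a quick sanity check. The dataset is rebuilt as described in Section 3.3.

```python
import numpy as np

np.random.seed(0)
X = np.round(np.random.rand(100, 1), 3)
Y = np.round(10 * X + 0.2 * np.random.randn(100, 1), 3)

# Pearson correlation between feature and target (1.0 would be a perfect line).
r = np.corrcoef(X.ravel(), Y.ravel())[0, 1]

# Least-squares straight-line fit; polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(X.ravel(), Y.ravel(), 1)

print(f"correlation = {r:.3f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With noise this small, the correlation should be very close to 1, the slope close to 10, and the intercept close to 0.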
3.5: Code Explanation: Dataset Generation
Let's go through the Python code for dataset generation line by line to ensure we understand each step.
The first line, import numpy as np, imports the NumPy library under the alias np. We will use it to generate the dataset.
The second line, np.random.seed(0), sets the random seed to 0. The random seed is a number that initializes the random number generator. By setting the random seed, we ensure that we get the same random numbers each time we run the code. This is useful for reproducibility.
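The effect of the seed can be seen in a short sketch. The names first, second, and third are chosen for this illustration only.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(3)

np.random.seed(0)           # resetting the seed restarts the same sequence
second = np.random.rand(3)

third = np.random.rand(3)   # no reset: the generator moves on to new values

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```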
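The third line, X = np.round(np.random.rand(100, 1), 3), generates the feature X. np.random.rand(100, 1) generates a 2-dimensional array with 100 rows and 1 column, filled with random numbers drawn uniformly from the interval [0, 1). np.round(..., 3) rounds these numbers to 3 decimal places.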
The fourth line, Y = np.round(10 * X + 0.2 * np.random.randn(100, 1), 3), generates the target Y. 10 * X multiplies the feature by 10. 0.2 * np.random.randn(100, 1) generates a 2-dimensional array of 100 random numbers that are normally distributed around 0 and multiplies them by 0.2. 10 * X + 0.2 * np.random.randn(100, 1) adds the scaled feature and the noise together. np.round(..., 3) rounds the result to 3 decimal places.
That's it for this chapter! You now understand how to generate and analyze a synthetic dataset for a simple neural network. In the next chapter, we will discuss how to initialize the weights of the neural network. Stay tuned!