Predicting Salary with Neural Networks: A Regression Analysis Based on Age, Education, Gender, and Experience




Introduction

In the realm of machine learning and artificial intelligence, one of the most intriguing challenges is making predictions based on numerical values. This task, known as scalar regression, involves predicting a continuous numeric output. Imagine the ability to forecast someone's salary based on various factors such as age, education level, gender, and years of experience. It's not just an academic exercise; it has real-world applications in fields like human resources, finance, and economics.

For this assignment, I ventured into scalar regression using a powerful machine learning tool: neural networks. But before diving into the technicalities, let me set the stage.

The path to building an accurate regression model involves several critical steps, each with its unique challenges and opportunities. Through this blog post, I'll take you on a journey through these steps, providing insights into data preparation, preprocessing, model design, hyperparameter selection, training, optimization, and testing.

As you read on, you'll gain a deep understanding of how to tackle scalar regression problems using neural networks. Along the way, you'll encounter code snippets, visualizations, and valuable insights into the art and science of machine learning. So, fasten your seatbelts as we embark on a journey to develop a neural network that predicts salaries based on age, education, gender, and experience. Let's get started!

Stay tuned for the subsequent sections where we delve into the details of this exciting project, starting with "Data Preparation."

Data Preparation

Before diving into building our Neural Network model, it's essential to lay a strong foundation by preparing our dataset appropriately. 

1.    Understanding the Dataset

Our dataset comprises 373 records with 5 columns: age, gender, education level, years of experience, and salary. It's vital to understand the nature of the data, including its size, structure, and the information it holds.

code:
import pandas as pd
#reading data file
salary=pd.read_csv('Salary Data  Copy.csv')
print(salary)

output:

Age  Gender Education Level  Years of Experience  Salary
0     32    Male      Bachelor's                  5.0   90000
1     28  Female        Master's                  3.0   65000
2     45    Male             PhD                 15.0  150000
3     36  Female      Bachelor's                  7.0   60000
4     52    Male        Master's                 20.0  200000
..   ...     ...             ...                  ...     ...
368   35  Female      Bachelor's                  8.0   85000
369   43    Male        Master's                 19.0  170000
370   29  Female      Bachelor's                  2.0   40000
371   34    Male      Bachelor's                  7.0   90000
372   44  Female             PhD                 15.0  150000

[373 rows x 5 columns]


2.    Feature Engineering

        a. Handling Categorical Variables

Two of the columns in our dataset, "Gender" and "Education Level," are categorical. To incorporate these into our Neural Network, we perform one-hot encoding. This process converts categorical variables into binary indicator columns, making them suitable for machine learning models. With drop_first=True, the first category of each variable is dropped and serves as the baseline, which is why the encoded data has 6 columns.

Code:

salary_encoded = pd.get_dummies(salary, columns=["Gender", "Education Level"], drop_first=True)
salary_encoded.shape

Output:

(373, 6)
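
To see what drop_first=True does, here is a small sketch using only the five rows printed in the dataset preview. The dtype=float argument is my addition and not part of the original code; it simply makes the indicator columns numeric instead of boolean.

```python
import pandas as pd

# First five rows of the dataset, copied from the printed output above
sample = pd.DataFrame({
    "Age": [32, 28, 45, 36, 52],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's", "Master's"],
    "Years of Experience": [5.0, 3.0, 15.0, 7.0, 20.0],
    "Salary": [90000, 65000, 150000, 60000, 200000],
})

# drop_first=True drops the first (alphabetically sorted) category of each
# column, so "Female" and "Bachelor's" become the implicit baselines.
sample_encoded = pd.get_dummies(
    sample, columns=["Gender", "Education Level"], drop_first=True, dtype=float
)

print(list(sample_encoded.columns))
print(sample_encoded.shape)
```

The three numeric columns are kept as they are, "Gender" contributes one indicator column, and "Education Level" contributes two, which gives the 6 columns reported above.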

        b. Grouping Data

We separate our data into input (X) and output (y) variables. The input (X) contains every column except the target, that is, "Age," "Years of Experience," and the encoded "Gender" and "Education Level" columns, while the output (y) is the target variable, "Salary."

Code:

# Separate input features (X) and output feature (y)
X = salary_encoded.drop(columns=['Salary'])
y = salary_encoded['Salary']

3. Data Splitting

For training and evaluating our Neural Network, we split our dataset into two parts: a training set (X_train, y_train) and a test set (X_test, y_test). We allocate 80% of the data for training and reserve the remaining 20% for testing. 

Code:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
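
One thing to keep in mind: the call above shuffles the rows differently on every run, so the reported errors can change slightly between runs. Passing random_state to train_test_split fixes the shuffle. The sketch below illustrates the idea using only the standard library; shuffle_split and the seed value 42 are hypothetical names and values of mine, and this is not scikit-learn's implementation.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed -> same split every run
    n_test = math.ceil(test_size * n_rows)    # this sketch rounds the test set up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(373)
print(len(train_idx), len(test_idx))
```

With 373 rows and this rounding, the split yields 298 training rows and 75 test rows, and calling the function again with the same seed returns the same rows.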

4. Data Normalization

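Data normalization ensures that all input features have similar scales, which prevents features with large values from dominating the learning process. We standardize each feature to zero mean and unit standard deviation, using statistics computed on the training set only and then applying the same statistics to the test set.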

Code:

# Calculate the mean and standard deviation for X_train
mean_X_train = X_train.mean(axis=0)
std_X_train = X_train.std(axis=0)

# Calculate the mean and standard deviation for y_train
mean_y_train = y_train.mean()
std_y_train = y_train.std()

# Normalize X_train
X_train= (X_train - mean_X_train) / std_X_train

# Normalize y_train
y_train= (y_train - mean_y_train) / std_y_train

# Normalize X_test using the same mean and standard deviation from X_train
X_test = (X_test - mean_X_train) / std_X_train

# Normalize y_test using the same mean and standard deviation from y_train
y_test= (y_test - mean_y_train) / std_y_train
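
Because the target is standardized as well, the model's predictions come out in standardized units rather than dollars. The sketch below shows the forward and inverse transforms on the five salaries from the dataset preview, which stand in for y_train here. The helper names standardize and destandardize are hypothetical.

```python
import statistics

def standardize(values, mean, std):
    """Map raw values to zero-mean, unit-standard-deviation units."""
    return [(v - mean) / std for v in values]

def destandardize(values, mean, std):
    """Map standardized values back to the original units."""
    return [v * std + mean for v in values]

# Salaries from the first five rows shown earlier, standing in for y_train
salaries = [90000, 65000, 150000, 60000, 200000]
mean_y = statistics.mean(salaries)
std_y = statistics.stdev(salaries)   # sample standard deviation, like pandas .std()

scaled = standardize(salaries, mean_y, std_y)
restored = destandardize(scaled, mean_y, std_y)
print([round(s, 2) for s in scaled])
print([round(r) for r in restored])
```

The same inverse step, applied with mean_y_train and std_y_train, would turn the network's outputs back into salary figures.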



With our data well-prepared, we're now ready to design and train our Neural Network. In the next section, we'll delve into the exciting process of creating our model architecture.

Designing the Neural Network

In this section, we will delve into the architecture of our Neural Network, which plays a pivotal role in solving our scalar regression problem.

    1.   Model Architecture

Code:

#neural network
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop  

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))  # Output layer with 1 neuron (since it's a scalar regression)


Input Layer: The input layer is designed to accept the features of our dataset.

Hidden Layers: I have included two hidden layers, each with 64 neurons and ReLU (Rectified Linear Unit) activation functions. ReLU is commonly used in hidden layers as it helps the network learn complex patterns in the data.

Output Layer: The output layer contains a single neuron, which is suitable for scalar regression tasks. No activation is specified for this layer, so Keras applies the default linear (identity) activation, which allows the network to predict unbounded continuous values.
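
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes 5 input features (the 6 encoded columns minus "Salary"); dense_params is a hypothetical helper of mine. Calling model.summary() in Keras reports the same kind of count.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 5  # 6 encoded columns minus the Salary target
layer_sizes = [64, 64, 1]

total = 0
n_in = n_features
for n_units in layer_sizes:
    total += dense_params(n_in, n_units)
    n_in = n_units

print(total)  # 384 + 4160 + 65 = 4609
```

That is several thousand parameters for fewer than 300 training rows, which is one reason to keep an eye on the test error and not only the training loss.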

    2.   Model Compilation

Before training the model, we need to compile it by specifying the optimizer, loss function, and metrics:

Code:
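# `learning_rate` is the current argument name; `lr` is deprecated in recent Keras versions
optimizer = RMSprop(learning_rate=0.001)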
model.compile(optimizer=optimizer, loss='mse',metrics=['mae'])


Optimizer: I used the RMSprop optimizer with a learning rate of 0.001. The optimizer is responsible for adjusting the model's weights during training.

Loss Function: The loss function is 'mse' (Mean Squared Error), which is suitable for regression tasks. It measures the mean squared difference between predicted and actual values.

Metrics: To evaluate the model's performance during training, I track the Mean Absolute Error (MAE) as a metric. MAE represents the average absolute difference between predicted and actual values.
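
To make the difference between the two measures concrete, here is a minimal sketch that computes both by hand on three made-up values. The numbers are purely illustrative and do not come from the salary data.

```python
def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only: two perfect predictions and one that is off by 2
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]

print(mse(actual, predicted))
print(mae(actual, predicted))
```

The single error of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is why MSE penalizes large misses more heavily than MAE.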


    3.   Hyperparameter Initialization

code:
# Keep the returned History object so the training loss can be plotted later
history = model.fit(X_train, y_train, epochs=1000, batch_size=32)

The learning rate, batch size, and number of epochs are essential hyperparameters. In our model, we set the learning rate to 0.001 and use a batch size of 32. Additionally, we train the model for 1000 epochs.
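
To get a feel for what these settings mean in practice, the sketch below counts the weight updates performed during training. It assumes 298 training rows (roughly 80% of 373; the exact count depends on how the split rounds), and updates_per_run is a hypothetical helper.

```python
import math

def updates_per_run(n_train, batch_size, epochs):
    """Return (steps per epoch, total weight updates) for mini-batch training."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be smaller
    return steps_per_epoch, steps_per_epoch * epochs

steps, total_updates = updates_per_run(n_train=298, batch_size=32, epochs=1000)
print(steps, total_updates)  # 10 10000
```

A smaller batch size would mean more, noisier updates per epoch, while fewer epochs would cut the total number of updates proportionally.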

With the Neural Network architecture defined and the model trained, we are now ready to check how well it makes scalar regression predictions.

In the next sections, we will explore prediction, model evaluation, and hyperparameter tuning.

Prediction, Evaluation, and Testing

After training our neural network model, we test it on the previously set-aside test data.
The code below generates predictions and places the actual and predicted values side by side in a DataFrame.

Code for prediction:

predictions = model.predict(X_test)
y_test_array = y_test.to_numpy().flatten()

# Create a DataFrame to store actual and predicted values
results_df = pd.DataFrame({'Actual': y_test_array, 'Predicted': predictions.flatten(),'Loss':(y_test_array-predictions.flatten())})

print(results_df)

output:

Now let's look at the evaluation.

Code:
# Evaluate the model on the test dataset
loss, mae = model.evaluate(X_test, y_test)

print(f"Test Loss (MSE): {loss:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")

Output:

Test Loss (MSE): 0.0984
Mean Absolute Error (MAE): 0.2033

Mean Squared Error (MSE): MSE measures the average squared difference between the actual and predicted values. In our case, the MSE is approximately 0.0984.

Mean Absolute Error (MAE): MAE calculates the average absolute difference between the actual and predicted values. Our model achieved an MAE of about 0.2033.

These metrics help us understand how well our model is performing in terms of prediction accuracy. Lower values of MSE and MAE indicate better performance. Note that both values are measured on the standardized salary scale, not in dollars.
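
Since the target was standardized in the Data Normalization step, these errors can be converted back to dollars by multiplying by std_y_train. The actual value of std_y_train is not reported in this post, so the sketch below uses a clearly hypothetical value of 50,000 purely for illustration; errors_in_original_units is also a hypothetical helper.

```python
import math

def errors_in_original_units(mse_scaled, mae_scaled, std_y):
    """Convert errors measured on a standardized target back to raw units."""
    rmse = math.sqrt(mse_scaled) * std_y   # MSE scales with std_y squared
    mae = mae_scaled * std_y               # MAE scales linearly with std_y
    return rmse, mae

# Hypothetical standard deviation, NOT the real std_y_train from the dataset
std_y_hypothetical = 50_000.0

rmse_dollars, mae_dollars = errors_in_original_units(0.0984, 0.2033, std_y_hypothetical)
print(round(rmse_dollars), round(mae_dollars))
```

Replacing the hypothetical value with the real std_y_train gives the typical prediction error as a salary amount, which is usually easier to interpret.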

Now let's visualize the results we obtained before fine-tuning the hyperparameters.

code:
import matplotlib.pyplot as plt

# Scatter plot for actual values (in blue)
plt.scatter(y_test, y_test, c='blue', label='Actual Values')

# Scatter plot for predicted values (in red)
plt.scatter(y_test, predictions, c='red', label='Predicted Values')

plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.legend()  # Display legend to differentiate colors
plt.show()

output:


Next, let's look at the graph of loss vs. epochs (number of epochs = 1000).

Code:
import matplotlib.pyplot as plt

# Plot training history
plt.plot(history.history['loss'], label='Training Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Output:


Optimization of Hyperparameters

Now let's tune the hyperparameters by trying different values for the number of epochs, number of nodes, activation function, batch size, and learning rate. After running the model with different values, I prepared a table of the resulting MSE and MAE values.
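
A manual search like this can be organized as a simple loop over combinations. The sketch below is one possible way to structure it, not the exact procedure used for the table. The candidate values in grid are hypothetical, and score_fn stands for any function that builds, trains, and evaluates a model for one configuration and returns its MSE.

```python
from itertools import product

def manual_search(grid, score_fn):
    """Try every combination in `grid`; return (best_config, best_mse, all_results)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, score_fn(config)))
    best_config, best_mse = min(results, key=lambda item: item[1])
    return best_config, best_mse, results

# Hypothetical candidate values; the values actually tried are not listed here
grid = {
    "learning_rate": [0.01, 0.001],
    "nodes": [64, 128],
    "activation": ["relu", "sigmoid"],
    "batch_size": [16, 32],
    "epochs": [500, 1000],
}
print(len(list(product(*grid.values()))))  # 32 combinations
```

Even two values per hyperparameter produce 32 training runs, so in practice it helps to vary only a few settings at a time.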

Of these combinations, the lowest error occurs with learning rate = 0.001, number of nodes = 128, epochs = 1000, sigmoid activation in the hidden layers, and batch size = 32. I therefore use this configuration as the optimized model and train and test it.

Code:

#neural network
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop  # Import the RMSprop optimizer

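# Create an RMSprop optimizer with a specific learning rate (e.g., 0.001)
# `learning_rate` is the current argument name; `lr` is deprecated in recent Keras versions
optimizer = RMSprop(learning_rate=0.001)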

model = Sequential()
model.add(Dense(128, activation='sigmoid', input_shape=(X_train.shape[1],)))
model.add(Dense(128, activation='sigmoid'))
model.add(Dense(1))  # Output layer with 1 neuron (since it's a scalar regression)

model.compile(optimizer=optimizer, loss='mse',metrics=['mae'])

history=model.fit(X_train, y_train, epochs=1000, batch_size=32)

After training the model, let's make the predictions:

Code:

predictions = model.predict(X_test)

y_test_array = y_test.to_numpy().flatten()

# Create a DataFrame to store actual and predicted values
results_df = pd.DataFrame({'Actual': y_test_array, 'Predicted': predictions.flatten(),'Loss':(y_test_array-predictions.flatten())})

# Display the DataFrame
print(results_df)

Output:

If you go back to the graph produced by the very first model, you can clearly see how much closer the predicted values are to the actual values after tuning the hyperparameters.

Conclusion

In this blog post, we embarked on a journey to build and fine-tune a neural network model for scalar regression. We covered various crucial steps, from data preparation to model design, training, evaluation, and hyperparameter tuning. Let's recap the key takeaways:

  1. Data Preparation: Properly preparing your dataset is essential. We explored techniques such as one-hot encoding, data normalization, and splitting the data into training and testing sets to ensure robust model training and evaluation.
  2. Model Design: We designed a neural network model with customizable hyperparameters such as the number of layers, nodes, and activation functions. The architecture of your neural network plays a significant role in its performance.
  3. Model Training: We trained the model on the training data using an appropriate optimizer and loss function. Monitoring the training history helps you understand how your model is learning over time.
  4. Evaluation: We evaluated the model's performance using metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). Visualizing the actual vs. predicted values provided insights into how well the model generalizes to unseen data.
  5. Hyperparameter Tuning: We tuned the hyperparameters manually to find the best configuration among those we tried.
  6. Comparison: To make informed decisions, we compared the performance of our optimized model with the initial model, highlighting the improvements achieved through hyperparameter tuning.
