
Random Forest Models!

Sometimes one machine-learning model isn't enough. In this post, we will create a random forest model, which builds upon the information and code in the previous blog post. Best to check that one out if you haven't yet!


Unsplash Photo by Steven Kamenar

Just like forests in the real world, random forest models are made up of many decision trees working together. This usually makes them much more accurate than any single decision tree. A lone decision tree is more susceptible to under-fitting, over-fitting, and other sources of error when applied to test data.


Under-fitting vs. Over-fitting


These terms both refer to mistakes in how a machine-learning model is fit to training data.


Photo by Geeks for Geeks

The image above shows examples of under-fitting and over-fitting in a model meant to draw a line on the graph that separates the X's and O's. On the far left, a straight line doesn't do a good enough job of separating the two classes to be meaningful. Although the middle graph misses a few X's amongst the O's, it does a pretty good job. If that curved line were applied to another set of data, it would still separate the two classes reasonably well.


The graph on the right, however, loops around every data point to sort it into the correct category. It fits too well onto the training data, making it not as useful for future use on different data. Random forest models are good at using a Goldilocks-style approach to prevent these issues.
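To see this in action, here's a quick optional demo on made-up data (separate from the housing data we'll use later in this post). A single, fully grown decision tree tends to memorize the noise in its training data, while a random forest's averaged prediction usually holds up better on the validation data:


import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Made-up data: a noisy curve for the models to learn
rng = np.random.RandomState(0)
demo_X = rng.uniform(0, 10, size=(500, 1))
demo_y = np.sin(demo_X).ravel() + rng.normal(scale=0.3, size=500)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(demo_X, demo_y, random_state=1)

# A single, fully grown tree tends to memorize the noise (over-fitting)
tree = DecisionTreeRegressor(random_state=1).fit(demo_train_X, demo_train_y)

# A forest averages many such trees, which smooths out that noise
forest = RandomForestRegressor(random_state=1).fit(demo_train_X, demo_train_y)

print("Single tree MAE:  ", mean_absolute_error(demo_val_y, tree.predict(demo_val_X)))
print("Random forest MAE:", mean_absolute_error(demo_val_y, forest.predict(demo_val_X)))

The exact numbers will change with the data, but the forest should consistently come out ahead of the single tree.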


Random Forest Model


Photo by Analytics Vidhya

The image above shows the basic design of a random forest model. Like the decision tree regressor from the last post, this model takes in data, processes it, and returns a prediction. The main difference is that a random forest contains many decision trees, each producing its own result; for regression, those results are averaged to produce a single, more accurate prediction.


In the last post, we saw how the decision tree was built from the training feature data passed into it. In a random forest, different random subsets of the training data are used to build many similar-but-slightly-different decision trees.
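To peek at that averaging in action, here's a small optional example on a tiny made-up dataset (again, separate from the housing data). In sklearn, the individual fitted trees live in the forest's .estimators_ attribute, and for a regressor the forest's prediction is just the mean of their predictions:


import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up dataset, just to show the averaging
rng = np.random.RandomState(0)
toy_X = rng.rand(100, 3)
toy_y = toy_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=5, random_state=1).fit(toy_X, toy_y)

# Each tree in .estimators_ was trained on its own random sample of the rows
tree_preds = np.array([tree.predict(toy_X[:1]) for tree in forest.estimators_])
print("Individual tree predictions:", tree_preds.ravel())

# The forest's prediction is simply the average of its trees' predictions
print("Average of the trees:", tree_preds.mean())
print("forest.predict():    ", forest.predict(toy_X[:1])[0])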


The Code!


To follow along with the code, please navigate to Kaggle's exercise called "Machine Learning Competitions" in the beginner machine learning course. The page should look like this:


Screenshot of Kaggle

We'll be following along with Kaggle's code for the most part, but skipping the "Submit to the competition" section. Like the last post, I'll be omitting Kaggle's code-checking functions for clarity.


import os

# Kaggle setup: if the data isn't already at ../input/train.csv, create links
# pointing to the housing competition's train and test files
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")

Here, Kaggle is importing the data we'll be creating the random forest model from. It's not super necessary to understand every line of this snippet of code, as the important stuff comes later!


import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load the Iowa housing training data into a pandas DataFrame
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)

We imported RandomForestRegressor from sklearn because we'll be making a random forest model. Other than that, this snippet is the same as in the last post! The file path of the data is transformed into a usable DataFrame through the Pandas library.


# The target we want to predict: each home's sale price
y = home_data.SalePrice

# The columns the model will use to make its predictions
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

Next up, we're assigning the SalePrice column of the home_data DataFrame to our variable y. This is the target data, the values we will be trying to predict. We're also assigning the features data columns from home_data to the variable X. So far, so good!
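If you'd like to double-check what the model will actually see, an optional peek at X and y looks like this:


# Optional: peek at the feature columns and the target values
print(X.head())          # first few rows of the seven selected features
print(y.head())          # first few sale prices we're trying to predict
print(X.shape, y.shape)  # both should report the same number of rows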


train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

Again, we are splitting the data in X and y into 4 new variables: train_X, val_X, train_y, and val_y. train_X and train_y are the data used to fit (train) our random forest model, while val_X and val_y are held back to evaluate it on data it hasn't seen.
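As an optional sanity check, you can print the shapes of the new variables; by default, train_test_split holds back 25% of the rows for validation:


# Optional: see how the rows were divided (75% training, 25% validation by default)
print("Training rows:  ", train_X.shape[0])
print("Validation rows:", val_X.shape[0])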


rf_model = RandomForestRegressor(random_state=1)

Here comes the new code! Instead of initializing a DecisionTreeRegressor, we are using sklearn's RandomForestRegressor class to store a model in the variable rf_model. The Python library sklearn makes it relatively easy to initialize different types of machine-learning models. The random_state value isn't too important here; it fixes the randomness the forest uses when building its many decision trees, so we get the same results every time we run the code.
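RandomForestRegressor also accepts other hyperparameters we could experiment with later. The values below are only illustrative (they aren't part of Kaggle's exercise), but they show a few of the commonly tuned options:


# Illustrative only -- not part of Kaggle's exercise code
rf_model_example = RandomForestRegressor(
    n_estimators=100,  # how many decision trees to build (sklearn's default)
    max_depth=None,    # None lets each tree grow until its leaves are pure
    random_state=1,    # fix the randomness so results are reproducible
)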


# Train the forest on the training split, then predict prices for the validation split
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)

The structure of the code is super similar to how we fit and made predictions from the decision tree regressor! We use sklearn's .fit and .predict methods in the same way.
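If you're curious, an optional comparison of the first few predictions against the actual sale prices gives a feel for how close the model gets:


# Optional: compare a few predicted prices with the actual prices
for predicted, actual in zip(rf_val_predictions[:5], val_y.iloc[:5]):
    print(f"predicted: {predicted:>10,.0f}   actual: {actual:>10,.0f}")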


rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

This last bit of code calculates the mean absolute error of our model. The mean absolute error measures how accurate our model is by averaging the absolute differences between the values our model predicts and the correct values. The lower, the better!
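Under the hood, that calculation is nothing fancy. Using the variables from the code above, you could compute the same number by hand with NumPy:


import numpy as np

# Mean absolute error "by hand": average the absolute prediction errors
manual_mae = np.mean(np.abs(rf_val_predictions - val_y.to_numpy()))
print(manual_mae)  # matches the mean_absolute_error result above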


print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Running the code prints a mean absolute error of $21,857. Not bad!


We can move on to more complicated machine-learning models once we're comfortable with using sklearn and Python. Understanding accuracy metrics such as mean absolute error lets us improve and fine-tune our models!


In short:


  • Random forest models use many decision trees to reduce errors such as under-fitting and over-fitting. Using multiple decision trees prevents any one tree from "memorizing" the training data and not being useful for new data.


  • Sklearn's RandomForestRegressor has the same .fit and .predict methods as the DecisionTreeRegressor. The main difference in the code is simply which model we initialize; the home_data DataFrame is still split into training and validation data the same way.


  • The mean absolute error of a machine-learning model is a measure of how accurate it is. The lower this value is, the closer its predictions are to the correct target values. We can tune the hyperparameters of more complex models to improve them.
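To give a taste of that tuning, here's a small optional sketch (my own example, reusing the variables and imports from the walkthrough above) that compares a few forest sizes by their validation MAE:


# Illustrative only -- try a few forest sizes and compare their validation MAE
for n_trees in [50, 100, 200]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"{n_trees} trees -> validation MAE: {mae:,.0f}")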






