This guide walks you through building your first machine learning project: predicting house prices using linear regression in Python. We’ll cover setup on Ubuntu, a code walkthrough, and an explanation of each step.
Ubuntu Setup
- Install Python, pip, and venv
sudo apt update
sudo apt install python3 python3-pip python3-venv -y
- Create a project folder and virtual environment
mkdir house-price-predictor
cd house-price-predictor
python3 -m venv .venv
source .venv/bin/activate
- Install required Python packages
pip install pandas scikit-learn matplotlib jupyter
- Launch Jupyter Notebook
jupyter notebook
This will open a browser window where you can create a new notebook to write and run Python code interactively.
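Before moving on, it can help to confirm that the packages landed in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this illustration and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed with pip in the setup step.
REQUIRED = ["pandas", "scikit-learn", "matplotlib", "jupyter"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, the virtual environment is probably not activated in the shell that launched Jupyter.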
Step 1: Prepare the Dataset
Create a notebook cell and add:
import pandas as pd
# Sample housing dataset
data = {
'Size (sqft)': [1500, 1600, 1700, 1800, 2000],
'Bedrooms': [3, 3, 3, 4, 4],
'Age (years)': [10, 5, 3, 20, 15],
'Price': [300000, 340000, 360000, 400000, 410000]
}
df = pd.DataFrame(data)
df
Explanation: We’re using pandas to create a DataFrame (a table-like data structure) from a Python dictionary. Each row represents one house, and each column holds one property of the house: three features (size, bedrooms, age) plus the price we want to predict. This is your training data, the data that the machine learning model will learn from.
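A quick way to check that the table looks as expected is to inspect its shape, column types, and summary statistics. This sketch rebuilds the same sample dataset so it can run in a fresh cell on its own:

```python
import pandas as pd

# Same sample housing dataset as in Step 1.
data = {
    'Size (sqft)': [1500, 1600, 1700, 1800, 2000],
    'Bedrooms': [3, 3, 3, 4, 4],
    'Age (years)': [10, 5, 3, 20, 15],
    'Price': [300000, 340000, 360000, 400000, 410000],
}
df = pd.DataFrame(data)

print(df.shape)       # (rows, columns) -> (5, 4)
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, min, max, and quartiles per column
```

Five rows is enough to follow the workflow, but far too few to trust any model trained on it.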
Step 2: Train the Model
from sklearn.linear_model import LinearRegression
X = df[['Size (sqft)', 'Bedrooms', 'Age (years)']] # Features
y = df['Price'] # Target label
model = LinearRegression()
model.fit(X, y)
Explanation: This is where machine learning starts. We separate the features (inputs) and the target (output we want to predict). In this case, features are size, number of bedrooms, and age. The target is the price.
The model learns a mathematical formula that best maps these inputs to the output. This process is called “training” the model.
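For linear regression, the "mathematical formula" is an intercept plus one weight per feature. After fitting, scikit-learn exposes these as `model.intercept_` and `model.coef_`. The sketch below refits the same model and prints the learned terms; the exact numbers depend on the data, so none are shown here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['Size (sqft)', 'Bedrooms', 'Age (years)']
df = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 2000],
    'Bedrooms': [3, 3, 3, 4, 4],
    'Age (years)': [10, 5, 3, 20, 15],
    'Price': [300000, 340000, 360000, 400000, 410000],
})

model = LinearRegression()
model.fit(df[features], df['Price'])

# Price is modeled as: intercept + sum(weight_i * feature_i)
print(f"Intercept: {model.intercept_:,.2f}")
for name, weight in zip(features, model.coef_):
    print(f"Weight for {name}: {weight:,.2f}")
```

Each weight is the change in predicted price for a one-unit change in that feature, holding the others fixed.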
Step 3: Predict Price for a New House
new_house = pd.DataFrame([[1900, 4, 10]], columns=['Size (sqft)', 'Bedrooms', 'Age (years)'])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.2f}")
Explanation: Now that the model is trained, we can use it to predict the price of a new house. The input must use the same column names, in the same order, as the training data. The model applies its learned formula to estimate a price.
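To see what `predict` does for a linear model, you can apply the learned formula by hand and compare. This is a sketch that refits the same model so it runs on its own; the variable names `manual` and `library` are chosen for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['Size (sqft)', 'Bedrooms', 'Age (years)']
df = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 2000],
    'Bedrooms': [3, 3, 3, 4, 4],
    'Age (years)': [10, 5, 3, 20, 15],
    'Price': [300000, 340000, 360000, 400000, 410000],
})
model = LinearRegression().fit(df[features], df['Price'])

new_house = pd.DataFrame([[1900, 4, 10]], columns=features)

# Learned formula applied by hand: intercept + dot(feature values, weights)
manual = model.intercept_ + np.dot(new_house.to_numpy()[0], model.coef_)
library = model.predict(new_house)[0]

print(f"By hand:   ${manual:,.2f}")
print(f"predict(): ${library:,.2f}")
```

The two values should agree up to floating-point rounding.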
Step 4: Visualize (Optional)
import matplotlib.pyplot as plt
plt.scatter(df['Size (sqft)'], df['Price'], color='blue', label='Actual')
plt.xlabel('Size (sqft)')
plt.ylabel('Price')
plt.title('Size vs Price')
plt.legend()
plt.show()
Explanation: Visualization helps you see how the data is distributed and whether a linear model is a reasonable fit. Here, a scatter plot shows the relationship between house size and price; across these five houses, price rises with size.
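One possible extension is to overlay the model's predictions on the same axes, which makes any gap between actual and predicted prices visible. This sketch refits the model from Step 2 and is not part of the original walkthrough; because the model uses three features, the predicted points need not fall on a straight line when plotted against size alone.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['Size (sqft)', 'Bedrooms', 'Age (years)']
df = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 2000],
    'Bedrooms': [3, 3, 3, 4, 4],
    'Age (years)': [10, 5, 3, 20, 15],
    'Price': [300000, 340000, 360000, 400000, 410000],
})
model = LinearRegression().fit(df[features], df['Price'])
predicted = model.predict(df[features])

fig, ax = plt.subplots()
# Actual prices as blue dots, model predictions as red crosses.
ax.scatter(df['Size (sqft)'], df['Price'], color='blue', label='Actual')
ax.scatter(df['Size (sqft)'], predicted, color='red', marker='x', label='Predicted')
ax.set_xlabel('Size (sqft)')
ax.set_ylabel('Price')
ax.set_title('Actual vs Predicted Price by Size')
ax.legend()
plt.show()
```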
Step 5: Evaluate the Model
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
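Explanation: Once you’ve trained the model, it’s important to evaluate how well it’s doing. Note that this step scores the model on the same five rows it was trained on, so the results are optimistic; scoring on held-out data the model has not seen gives a more honest estimate.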
- MSE (Mean Squared Error): Average of the squared differences between actual and predicted prices. Lower is better.
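- R² Score: The share of the variation in price that the model explains. The best possible value is 1, a model that always predicts the mean price scores 0, and the score can be negative when the model does worse than that. Closer to 1 means a better fit.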
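Both metrics can be computed by hand, which makes the definitions concrete. The sketch below uses plain Python; the function names and the two prediction lists are made up for illustration, while the actual prices come from the sample dataset.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score_manual(actual, predicted):
    """1 minus (squared error of the model / squared error of predicting the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [300000, 340000, 360000, 400000, 410000]  # prices from the dataset
close = [310000, 330000, 360000, 390000, 420000]   # hypothetical predictions, near the truth
wrong = [410000, 400000, 360000, 340000, 300000]   # hypothetical predictions, order reversed

print(mean_squared_error_manual(actual, close))  # 80000000.0
print(round(r2_score_manual(actual, close), 4))  # 0.9505
print(r2_score_manual(actual, wrong) < 0)        # True
```

MSE is in squared dollars, which is why it looks so large; taking its square root gives a typical error in dollars (about $8,944 for the `close` list).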
Summary
- You created a basic machine learning model using linear regression.
- You learned how to prepare a dataset, train a model, make predictions, visualize the data, and evaluate the model’s performance.
This is the foundation of many machine learning workflows. Understanding this will make it easier to tackle more advanced topics. Next up: try using a real-world dataset or move to classification problems like the Iris Flower Classifier.