Introduction

In this study, we aim to identify the main factors influencing housing prices in Beijing, such as floor area (square), number of living rooms and bathrooms, subway proximity, number of kitchens, construction time, and community average price. As China’s political and economic hub, instabilities in Beijing’s housing market affect local living standards, affordability, investment decisions, and even national monetary policy. Understanding these factors is critical for designing effective policy interventions.

Several studies have already addressed these aspects of Beijing’s housing market. Xiao et al. (2017) evaluate how structural characteristics such as the number of living rooms and bathrooms influence property values, which supports our focus on how internal house features affect pricing in Beijing. Similarly, Hui and Yue (2006) studied housing price bubbles in Hong Kong, Beijing, and Shanghai; their analysis showed that higher per capita disposable income boosts affordability and demand, contributing to price appreciation. This report does not discuss the impact of per capita income on house prices, a limitation imposed by our data sources. Furthermore, Li, Chen, and Zhao (2019) examined the effect of metro accessibility on housing prices; their results highlight the critical role of public transport infrastructure in shaping property values.

We construct a multiple linear regression model to analyze how various factors influence housing prices. The model focuses on interpretability, using estimated regression coefficients to reveal the direction and magnitude of each factor’s impact. This allows us to describe the effect of each variable while accounting for others.

Methods

The original dataset used in this analysis comes from the Lianjia website. It covers Beijing house prices from 2011 to 2017 and consists of over 300,000 observations with 26 variables. To refine the data for our study, we wrote a function to transform the encoding format of the dataset and removed all unrecognizable characters. We also filtered out null observations and observations outside plausible ranges. Our processed dataset includes the following variables: totalPrice, square, livingRoom, bathRoom, subway, kitchen, constructionTime, and communityAverage.

The dataset with all variables was used to build the model. After checking the significance level of each variable, predictors that were not significant were removed. Through Variance Inflation Factor (VIF) checking, we dropped the predictor with a large VIF and verified that the remaining predictors’ VIF values were below 2, well under the common threshold of 5. We then performed an F-test; its p-value was below 0.05, indicating that at least one coefficient contributes to the response variable. Using partial F-tests via ANOVA tables, we constructed three reduced models, each with some variables dropped, to investigate whether a simpler model would perform better. Finally, using all-subsets methods, we compared adjusted R-squared, RSS, SSreg, AIC, and BIC across candidate models; a higher adjusted R-squared and lower AIC and BIC indicate a better model. Following all these procedures, the full model was selected as our final model.
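
A minimal sketch of the VIF check and a single partial F-test follows, assuming a fitted full model named model_full on the data frame train (these object names are illustrative, not taken from the source code):

library(car)  # provides vif()

# Variance Inflation Factors: flag predictors with large values
vif(model_full)

# Partial F-test: compare the full model to one with a candidate predictor dropped
model_reduced <- update(model_full, . ~ . - kitchen)  # kitchen chosen as an example
anova(model_reduced, model_full)  # a small p-value favors keeping the predictor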

Two essential conditions of the final model were checked before drawing conclusions from the plots: the conditional mean of the response must be a single function of a linear combination of the predictors, and each predictor must be related to the other predictors in no more complicated a way than linearly. We then used residual and QQ plots to check the four assumptions: linearity, normality, constant variance, and uncorrelated errors. The best way to detect a violation of the model’s assumptions is a residual plot, which should show no pattern, clusters, or fanning. The QQ plot is used to check normality and should follow a straight line. If constant variance is violated, natural logarithmic or power transformations are applied to the response. A Box-Cox transformation can be used to address normality and linearity violations; it can be applied to the response, to the predictors, or to both simultaneously.
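
Where a transformation is needed, the Box-Cox machinery in the car package can suggest one. A brief sketch, assuming model_raw (an illustrative name) is a model fitted on the untransformed, strictly positive response:

library(car)

# Profile the Box-Cox likelihood for the response;
# a lambda near 0 supports a log transformation
boxCox(model_raw)

# Numeric estimate of the suggested response power
summary(powerTransform(model_raw))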

Finally, we identified problematic observations such as high-leverage and influential points. Observations with hii larger than 2(p+1)/n are classified as leverage points: good leverage points improve or maintain the regression model’s fit, while bad leverage points, which tend to be outliers with standardized residuals r > 4 or r < -4, lead to biased coefficients and reduced predictive accuracy. Influential points were extracted by analyzing Cook’s distance.
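
A short sketch of these diagnostics, reusing final_model and train from the source code appendix (the leverage and outlier cutoffs follow the rules quoted above; the F-quantile cutoff for Cook’s distance is a common textbook choice rather than one stated in the source):

n <- nrow(train)
p <- length(coef(final_model)) - 1        # number of predictors

h <- hatvalues(final_model)               # leverage values h_ii
r <- rstandard(final_model)               # standardized residuals

leverage_pts <- which(h > 2 * (p + 1) / n)               # all leverage points
bad_leverage <- which(h > 2 * (p + 1) / n & abs(r) > 4)  # leverage points that are also outliers

d <- cooks.distance(final_model)                     # Cook's distance
influential <- which(d > qf(0.5, p + 1, n - p - 1))  # influential observations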

Results

Initially, our model was constructed using seven predictor variables (living room, square, bathroom, subway, community average, construction time, and kitchen) without any transformations; we then found pronounced fanning patterns in our residual plots. As a result, we applied a log transformation to the response variable and square-root transformations to community average and construction time in order to stabilize the variance. We then constructed our primary model on these transformed data.

Primary Model

Log_TotalPrice = -4.998 + 0.009020 * square + 0.07134 * livingroom - 0.08517 * bathroom + 0.03911 * subway + 1.339e-05 * communityAverage + 0.004443 * constructiontime + 0.1641 * kitchen 

In our primary model, although every variable’s p-value indicated significance, the coefficient for bathroom was unexpectedly negative, conflicting with findings in the literature. A VIF analysis revealed multicollinearity, particularly for square, whose VIF exceeded 4 while all other values were below 2. Consequently, we dropped square and constructed a new model from the remaining variables, summarized in the table below. Descriptions of the variables can be found in the appendix.

Table 1: Summary of the numerical variables used in the MLR model.

| Variable | Mean | Median | Minimum | Maximum | 1st quartile | 3rd quartile |
| Total price | 3.446176e+2 | 295 | 61 | 1790 | 205 | 425 |
| Livingroom | 1.999722 | 2 | 0 | 7 | 1 | 2 |
| Bathroom | 1.171594 | 1 | 0 | 5 | 1 | 1 |
| Community average | 6.359298e+4 | 59015 | 20483 | 183109 | 46505 | 75738 |
| Construction time | 1.999168e+3 | 2001 | 1944 | 2016 | 1994 | 2006 |
| Kitchen | 9.953752e-1 | 1 | 0 | 3 | 1 | 1 |

Optimized Model

Log_TotalPrice = -60.38 + 0.3002 * livingroom + 0.1749 * bathroom + 0.02503 * subway1 + 0.007259 * communityAverage + 1.416 * constructiontime + 0.1373 * kitchen

To verify that this change moved the model in the right direction, we repeated the significance tests and VIF checks, which confirmed that all variables were significant (p < 2e-16) and that all VIF values had been reduced to below 1.60. To investigate whether model 2 could be optimized further, we conducted partial F-tests for kitchen and bathroom, the variables that appeared least influential on the response. Our approach was to use ANOVA to compare model 2 against each reduced model, to see whether either variable (or both) could be dropped to simplify the model. The reduced models are constructed as follows:

Reduced Model 1: Bathroom Removed
Log_TotalPrice = -70.88 + 0.3558 * livingroom + 0.02538 * subway1 + 0.00734 * communityAverage + 1.652 * constructiontime + 0.1685 * kitchen
Reduced Model 2: Kitchen Removed
Log_TotalPrice = -60.00 + 0.3011 * livingroom + 0.02507 * subway1 + 0.007273 * communityAverage + 1.411 * constructiontime + 0.1771 * bathroom
Reduced Model 3: Both Predictors Removed
Log_TotalPrice = -70.58 + 0.3578 * livingroom + 0.02544 * subway1 + 0.007358 * communityAverage + 1.649 * constructiontime
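
As a compact sketch of the comparisons described below, reusing the model objects (model_log, alter1_model, alter2_model, alter3_model) and the all_subsets fit defined in the source code appendix:

library(leaps)

# Partial F-tests: compare the full model against each reduced model
anova(model_log, alter1_model)
anova(model_log, alter2_model)
anova(model_log, alter3_model)

# Selection criteria: higher adjusted R^2 and lower AIC/BIC indicate a better model
AIC(model_log, alter1_model, alter2_model, alter3_model)
BIC(model_log, alter1_model, alter2_model, alter3_model)
crit <- summary(all_subsets)
data.frame(size = seq_along(crit$rss), RSS = crit$rss,
           adjR2 = crit$adjr2, BIC = crit$bic)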

For each reduced model, the null hypothesis is that the removed coefficient(s) equal zero. Performing ANOVA on reduced models 1, 2, and 3, we found p-values all below 2e-16, indicating that the variables removed in each reduced model contribute significantly to the response. We therefore rejected the null hypotheses and kept the optimized model as our final model. To further verify this decision, we conducted an all-subsets regression analysis; the results confirmed that the model including all six variables performed best on the selection criteria (adjusted R-squared, AIC, and BIC). Therefore, no variable should be excluded, and our final model is as follows:

Final Model

Log_TotalPrice = -60.38 + 0.3002 * livingroom + 0.1749 * bathroom + 0.02503 * subway1 + 0.007259 * communityAverage + 1.416 * constructiontime + 0.1373 * kitchen

We now proceed to checking the goodness of our final model. Conducting a t-test for each variable and an F-test for the model as a whole, we obtained an F statistic with a p-value below 2e-16; the coefficient estimates are summarized in Table 2.


| Variable | Coefficient | Std. Error | t value | Pr(>|t|) |
| (Intercept) | -60.38 | 0.5344 | -112.99 | <2e-16 |
| livingroom | 0.3002 | 0.001214 | 247.36 | <2e-16 |
| bathroom | 0.1749 | 0.002164 | 80.85 | <2e-16 |
| subway1 | 0.02503 | 0.00162 | 15.45 | <2e-16 |
| communityAverage | 0.007259 | 2.10e-05 | 346.49 | <2e-16 |
| constructiontime | 1.416 | 0.01193 | 118.74 | <2e-16 |
| kitchen | 0.1373 | 0.007305 | 18.80 | <2e-16 |

Table 2. Summary of Our Final Model


We begin by verifying the basic conditions of the model. In the two plots below, we saw random scatter around the diagonal with no identifiable non-linear trend in the first plot, and no curvature or other non-linear pattern in any of the predictor pairwise scatterplots. We conclude that the model satisfies the conditional mean response and conditional mean predictor conditions, allowing us to perform the assumption checks by examining the plots.

Figure 1. Response versus Fitted and Predictors Pairwise Scatterplots

We have four model assumptions to verify: Normality, Linearity, Constant Variance and Independence.

We begin by constructing residual plots as follows.

Figure 2: Residual Versus Fitted and Residual QQ Plots

In the residual QQ plot, the body of the points lies tightly along the diagonal, with acceptable deviations at the head and tail. In the residuals versus fitted plot, we saw no extreme violation of the constant variance assumption: the fanning is mild and the residuals are tightly clustered. We conclude that the normality and constant variance assumptions are satisfied.

Meanwhile, as shown in Figure 3, the scatterplots of predictor variables against fitted values, along with the residuals versus fitted plot, revealed no non-linear patterns or systematic trends, confirming the linearity assumption.

Additionally, boxplots of residuals by subway group showed consistent distributions across groups, and histograms indicated no severe skewness or outliers.

Figure 3. Residual Plots and Fitted Versus Predictor Plots of Selected Predictors. Histogram of Final Model’s Fitted Values.

To conclude, though minor issues were identified, our final model satisfies all four model assumptions and provides an interpretable, well-performing fit, allowing us to draw conclusions with confidence.

Conclusion and Limitations

The multiple linear regression model shows that housing prices in Beijing are significantly affected by many factors, including the number of rooms, proximity to the subway, construction time, and the community average price. Among the room configuration variables, the numbers of living rooms and bathrooms have the largest positive impact on housing prices; the estimated coefficient for living room is the highest (0.3002), representing the average change in log total price for a one-unit increase in the number of living rooms when all other predictors are held fixed (roughly a 35% increase in price, since exp(0.3002) ≈ 1.35). These results are consistent with our research questions, revealing how multiple factors work together to affect housing prices in Beijing. Additionally, subway accessibility has a positive impact on prices, consistent with the conclusions of the cited studies and emphasizing the effect of proximity to transportation. Also, newer properties tend to be more expensive than those built in earlier years. However, the minimal impact of community average price on individual housing prices is somewhat surprising. A possible reason is that characteristics of the home itself (such as size and proximity to the subway) are weighted more heavily in pricing, while the average neighborhood price may be more reflective of the broader market context.
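
Because the response is log-transformed, each coefficient converts to an approximate percentage effect on price; a one-line sketch using the appendix’s final_model:

# exp(beta) - 1 gives the multiplicative effect per one-unit increase in a predictor;
# e.g. exp(0.3002) - 1 is about 0.35, roughly a 35% price increase per extra living room
round((exp(coef(final_model)) - 1) * 100, 1)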

On the other hand, several limitations are worth noting. First, most of our predictor variables were not normally distributed (see the histograms in the appendix), and some have extremely concentrated distributions; for kitchen, almost all observations equal 1, with only a small fraction taking other values. Such imbalance can introduce problematic data points that significantly affect the overall performance of the model. Moreover, our MLR model was fitted to housing prices from a specific period; it may no longer be suitable for predicting future housing prices if unexpected changes occur in the local region’s policy or market.

Further studies and approaches will be needed to investigate the housing prices of Beijing in an ever-changing world.

References

Hui, E.C.M., Yue, S. Housing Price Bubbles in Hong Kong, Beijing and Shanghai: A Comparative Study. J Real Estate Finan Econ 33, 299–327 (2006). https://doi.org/10.1007/s11146-006-0335-2

Li, S., Chen, L. & Zhao, P. The impact of metro services on housing prices: a case study from Beijing. Transportation 46, 1291–1317 (2019). https://doi.org/10.1007/s11116-017-9834-7

Xiao Y, Chen X, Li Q, Yu X, Chen J, Guo J. Exploring Determinants of Housing Prices in Beijing: An Enhanced Hedonic Regression with Open Access POI Data. ISPRS International Journal of Geo-Information. 2017; 6(11):358. https://doi.org/10.3390/ijgi6110358

Lianjia. (n.d.). Housing price in Beijing. Dataset hosted on Kaggle. Retrieved from https://www.kaggle.com/datasets/ruiqurm/lianjia; Originally from https://bj.lianjia.com/chengjiao

Appendix

Figure 4. Residual Plot of All Predictors

Figure 5. Fitted Versus Predictor Plots of All Variables

| Variable | Description |
| Square | The total size of the property in square meters. |
| Livingroom | Numerical variable; the number of living rooms in the property, indicating the amount of shared living space. |
| Bathroom | Numerical variable; classifies the property by number of bathrooms. An increased number of bathrooms often indicates a higher level of convenience. |
| Subway | Categorical variable; a binary indicator (1 = near subway, 0 = not near subway) that captures the effect of proximity to public transportation. |
| Kitchen | Numerical variable; classifies the property by number of kitchens, indicating potential additional functionality and value. |
| Construction time | Numerical variable; the year the property was built. It captures the age of the building, which influences its condition and market appeal. |
| Community average | Numerical variable; the average price per square meter for properties in the same community, measured in CNY/sqm. |

Table 3. Data Descriptions


R Source Code

knitr::opts_chunk$set(echo = TRUE)

INSTALL

install.packages("dplyr")
library(dplyr)

install.packages("tidyverse")
library(tidyverse)

install.packages("ggplot2") 
library(ggplot2)

install.packages("dplyr")
library(dplyr)

install.packages("gridExtra")
library(gridExtra)

install.packages("GGally")
library(GGally)

str(data)
library(stringr)

install.packages("car")
library(car)

DATASET


setwd("/Users/alexxonmacbookpro/Desktop/HOUSING IN BEIJING")
rawdata <- read.csv("Housing in Beijing - 302.csv", stringsAsFactors = FALSE, fileEncoding = "GB2312")

# Ensure the file ends with a newline so every row is read correctly
lines <- readLines("Housing in Beijing - 302.csv", warn = TRUE)
if (!grepl("\\n$", lines[length(lines)])) {
  lines <- c(lines, "")  
  writeLines(lines, "Housing in Beijing - 302.csv")  # Saving Encoded File
}

# Remove Chinese characters - the dataset contains Chinese characters that cannot be
# recognized under this encoding; this function strips them (empty results become "0")

remove_chinese <- function(text) {
  if (is.character(text)) {
    cleaned_text <- str_replace_all(text, "[\u4e00-\u9fa5]", "")
    return(ifelse(cleaned_text == "", "0", cleaned_text))
  } else {
    return(text)
  }
}

rawdata2 <- rawdata %>%
  mutate(across(everything(), ~ remove_chinese(.)))

# Filtering
filtered_data <- rawdata2 %>%
  select(totalprice = totalPrice,         
         livingroom = livingRoom,  
         bathroom = bathRoom,    
         subway = subway,
         kitchen = kitchen,
         constructiontime = constructionTime,
         communityAverage = communityAverage ) %>%  
  na.omit()
cleaned_data <- filtered_data %>%
  filter(totalprice > 60 & totalprice < 1800) %>%                 
  filter(livingroom > 0 & livingroom < 5 & bathroom < 4) %>%            
  filter(communityAverage > 20000 & communityAverage < 130000) %>% 
  filter(constructiontime > 1989 & constructiontime < 2020) %>% 
  na.omit()     

# Setting Seed
set.seed(1006747175) #Student Number of ZIXIANG ZHANG

total_rows <- nrow(cleaned_data)

# FULL DATA SELECTION
sampled_cleaned_data <- cleaned_data[sample(1:total_rows, total_rows, replace = FALSE), ]
write.csv(sampled_cleaned_data, "Cleaned_Sampled_Housing_Data.csv", row.names = FALSE)
# Transfer to Numeric
sampled_cleaned_data$livingroom <- as.numeric(sampled_cleaned_data$livingroom)
sampled_cleaned_data$bathroom <- as.numeric(sampled_cleaned_data$bathroom)
sampled_cleaned_data$constructiontime <- as.numeric(sampled_cleaned_data$constructiontime)
summary(sampled_cleaned_data)

n = nrow(sampled_cleaned_data)

train = sampled_cleaned_data # NOT USING TRAINING/TESTING SPLITTING

PRIMARY MODEL CONSTRUCTION


str(train)
summary(train$totalprice)

train$Log_TotalPrice <- log(train$totalprice + 1) 
train$subway <- as.factor(train$subway)
train$communityAverage <- sqrt(train$communityAverage)
train$constructiontime <- sqrt(train$constructiontime)

model_log <- lm(Log_TotalPrice ~ livingroom + bathroom + subway + communityAverage + constructiontime + kitchen, data = train)

log_residuals <- resid(model_log)

residual_data <- data.frame(
  Residuals = log_residuals,
  LivingRoom = train$livingroom,
  BathRoom = train$bathroom,
  Subway = train$subway,
  CommunityAverage = train$communityAverage,
  ConstructionTime = train$constructiontime,
  Kitchen = train$kitchen
)

RESPONSE VS PREDICTOR - PLOTTING


p1 <- ggplot(train, aes(x = livingroom, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Living Room",
       x = "Living Room",
       y = "Log TotalPrice") +
  theme_minimal()

p2 <- ggplot(train, aes(x = bathroom, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Bathroom",
       x = "Bathroom",
       y = "Log TotalPrice") +
  theme_minimal()

p3 <- ggplot(train, aes(x = subway, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Subway",
       x = "Subway (0 = No, 1 = Yes)",
       y = "Log TotalPrice") +
  theme_minimal()

p4 <- ggplot(train, aes(x = communityAverage, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Community Average",
       x = "Community Average",
       y = "Log TotalPrice") +
  theme_minimal()

p5 <- ggplot(train, aes(x = constructiontime, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Construction Time",
       x = "Construction Time",
       y = "Log TotalPrice") +
  theme_minimal()

p6 <- ggplot(train, aes(x = kitchen, y = Log_TotalPrice)) +
  geom_point(color = "black") +  
  geom_smooth(method = "lm", color = "red") +  
  labs(title = "Log TotalPrice vs Kitchen",
       x = "Kitchen",
       y = "Log TotalPrice") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)

RESIDUAL vs VARIABLES


plot1 <- ggplot(residual_data, aes(x = LivingRoom, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Living Room", x = "Living Room", y = "Residuals")

plot2 <- ggplot(residual_data, aes(x = BathRoom, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Bathroom", x = "Bathroom", y = "Residuals")

plot3 <- ggplot(residual_data, aes(x = Subway, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Subway", x = "Subway (0 = No, 1 = Yes)", y = "Residuals")

plot4 <- ggplot(residual_data, aes(x = CommunityAverage, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Community Average", x = "Community Average", y = "Residuals")

plot5 <- ggplot(residual_data, aes(x = ConstructionTime, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Construction Time", x = "Construction Time", y = "Residuals")

plot6 <- ggplot(residual_data, aes(x = Kitchen, y = Residuals)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs Kitchen", x = "Kitchen", y = "Residuals")


grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2)
boxplot(Log_TotalPrice ~ subway, data = train,
        main = "Boxplot of Log Total Price by Subway",
        xlab = "Subway (0 = No, 1 = Yes)",
        ylab = "Log Total Price",
        col = c("skyblue", "orange"))

ggplot(residual_data, aes(x = Subway, y = Residuals)) +
  geom_boxplot(aes(fill = Subway)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Boxplot of Residuals by Subway",
       x = "Subway (0 = No, 1 = Yes)",
       y = "Residuals") +
  theme_minimal()

RESPONSE vs FITTED / QQ_PLOT


train$FittedPrice <- exp(predict(model_log, newdata = train)) - 1 

ggplot(train, aes(x = totalprice, y = FittedPrice)) +
  geom_point(alpha = 0.5, color = "blue") + 
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") + 
  labs(title = "Response vs Fitted Scatterplots", x = "Actual Price", y = "Fitted Price") +
  theme_minimal()

plot(model_log$fitted.values, resid(model_log),
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values (Log TotalPrice)",
     ylab = "Residuals",
     col = "black", pch = 20)
abline(h = 0, col = "red", lwd = 2)  



qqnorm(resid(model_log), main = "Residual QQ Plot")
qqline(resid(model_log), col = "red", lwd = 2)  
numeric_data <- train[, 2:8]  # columns 2:8 = the six predictors plus Log_TotalPrice
pairs(numeric_data, 
      main = "Predictor Variables Pairwise Scatterplots", 
      pch = 21, 
      bg = "lightblue", 
      col = "black")
ggplot(train, aes(x = Log_TotalPrice)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Log TotalPrice",
       x = "Log TotalPrice",
       y = "Frequency") +
  theme_minimal()
  
pp1 <- ggplot(train, aes(x = livingroom)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of livingroom",
       x = "livingroom",
       y = "Frequency") +
  theme_minimal()

pp2 <- ggplot(train, aes(x = bathroom)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of bathroom",
       x = "bathroom",
       y = "Frequency") +
  theme_minimal()

pp3 <- ggplot(train, aes(x = subway)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar Chart of Subway", x = "Subway", y = "Frequency")


pp4 <- ggplot(train, aes(x = communityAverage)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of communityAverage",
       x = "communityAverage",
       y = "Frequency") +
  theme_minimal()

pp5 <- ggplot(train, aes(x = constructiontime)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of constructiontime",
       x = "constructiontime",
       y = "Frequency") +
  theme_minimal()

pp6 <- ggplot(train, aes(x = kitchen)) +
  geom_histogram(binwidth = 0.3, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of kitchen",
       x = "kitchen",
       y = "Frequency") +
  theme_minimal()
  
grid.arrange(pp1, pp2, pp3, pp4, pp5, pp6, ncol = 2)

T / PARTIAL F / VIF CHECK


# 1. T Test / F Test[ANOVA for FULL MODEL]
summary(model_log)

# 2. VIF
vif_values <- vif(model_log)
print(vif_values)
# 3. Partial F Tests for bathroom and kitchen

# PRIMARY (FULL) MODEL, repeated for reference:
# model_log <- lm(Log_TotalPrice ~ livingroom + bathroom + subway + communityAverage + constructiontime + kitchen, data = train)

alter1_model <- lm(Log_TotalPrice ~ livingroom + subway + communityAverage + constructiontime + kitchen, 
                    data = train) # bathroom deleted
alter2_model <- lm(Log_TotalPrice ~ livingroom + subway + communityAverage + constructiontime + bathroom, 
                    data = train) # Kitchen Deleted
alter3_model <- lm(Log_TotalPrice ~ livingroom + subway + communityAverage + constructiontime, 
                    data = train) # Kitchen AND Bathroom Deleted

anova(model_log, alter1_model)
anova(model_log, alter2_model)
anova(model_log, alter3_model)

# 4. T/F Test for Reduced Models

summary(alter1_model)
summary(alter2_model)
summary(alter3_model)
# 5. FINAL MODEL CONSTRUCTION

final_model <- model_log
summary(final_model)
# anova(model_log, final_model)
library(leaps)

all_subsets <- regsubsets(
  Log_TotalPrice ~ livingroom + bathroom + subway + communityAverage + constructiontime + kitchen,
  data = train,
  nbest = 1,          
  nvmax = 6
)

summary(all_subsets)

This work is licensed under a Creative Commons Attribution 4.0 International License.