This analysis, a collaboration between Greta Moses, a statistics student at WWU, and myself, investigates the relationship between wine quality and physicochemical properties, aiming to determine whether objective measures, rather than subjective perceptions, drive wine quality. Properties such as density, alcohol, pH, and sugars are key factors in wine certification and quality evaluation. By studying these relationships in red vinho verde wine from Portugal, we aim to enhance winemaking processes and improve overall wine quality. This research is particularly relevant as the wine industry embraces new technologies to support consumer growth and consistency in production.
The data for this analysis were collected by five chemists and obtained from the UC Irvine Machine Learning Repository.
Based on the nature of our model, we determined that no variable transformations were necessary. Diagnostic plots, including the residuals vs. fitted plot and the Q-Q plot, indicate that a linear fit is appropriate, with no evident deviations from normality. However, when we regressed wine quality on each variable individually, the sulfur dioxide variables (both free and total) produced abnormal residuals vs. fitted plots that show heteroscedasticity. This violation of the homoscedasticity assumption means that the variance of the residuals is not constant and may depend on the predictor variable. Despite this, we judged that variable transformations would not correct the heteroscedasticity; thus, all predictor variables remain untransformed.
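A visual reading of a residuals vs. fitted plot can be backed up with a formal test for non-constant variance. The sketch below is an illustration only, not the code used in this analysis. It runs on synthetic data with a single predictor and applies the studentized (Koenker) form of the Breusch-Pagan test: the squared residuals are regressed on the predictor, and n times the resulting R² is compared with a chi-square critical value. The function names and the simulated data are assumptions made for the example.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    a, b = simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """LM statistic n * R^2 from regressing the squared residuals on x."""
    a, b = simple_ols(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)


# Synthetic data (NOT the wine data): one response with constant error
# variance and one whose error spread grows with the predictor.
rng = random.Random(0)
x = [rng.uniform(1.0, 10.0) for _ in range(500)]
y_const = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y_fan = [2.0 + 0.5 * xi + rng.gauss(0.0, 0.3 * xi) for xi in x]

CHI2_CRIT_1DF = 3.841  # 5% critical value of chi-square with 1 degree of freedom

lm_const = breusch_pagan_lm(x, y_const)
lm_fan = breusch_pagan_lm(x, y_fan)
print("constant variance LM:", round(lm_const, 2))
print("fanning variance LM: ", round(lm_fan, 2), "reject:", lm_fan > CHI2_CRIT_1DF)
```

A statistic above the critical value is evidence against constant variance, which is the pattern we saw by eye in the sulfur dioxide residual plots.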
For model selection, we used a forward/backward stepwise procedure to determine which variables are significant in predicting wine quality. Before applying model selection, we fit a model with all physicochemical properties; its diagnostic plots and table of coefficients are shown below.
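To make the idea of forward/backward stepwise selection concrete, here is a minimal sketch on synthetic data. It is not the procedure we actually ran. It assumes AIC as the selection criterion and an empty starting model, neither of which is specified above, and the predictor names `x1` to `x4` are hypothetical. At each step it tries adding every excluded variable and dropping every included one, and keeps the single change that lowers AIC the most.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           + [sum(row[a] * yi for row, yi in zip(design, y))]
           for a in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, k):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= factor * aug[c][j]
    # Back substitution for the coefficients.
    beta = [0.0] * k
    for c in reversed(range(k)):
        tail = sum(aug[c][j] * beta[j] for j in range(c + 1, k))
        beta[c] = (aug[c][k] - tail) / aug[c][c]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def aic(names, data, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(params)."""
    n = len(y)
    rss = ols_rss([data[name] for name in names], y)
    return n * math.log(rss / n) + 2 * (len(names) + 1)


def stepwise(data, y):
    """Bidirectional stepwise search that stops when no move lowers AIC."""
    selected = []
    best = aic(selected, data, y)
    while True:
        candidates = []
        for name in data:
            if name in selected:
                trial = [s for s in selected if s != name]  # backward move
            else:
                trial = selected + [name]                   # forward move
            candidates.append((aic(trial, data, y), trial))
        trial_aic, trial = min(candidates, key=lambda item: item[0])
        if trial_aic >= best:
            return selected
        best, selected = trial_aic, trial


# Synthetic data (NOT the wine data): only x1 and x2 affect the response.
rng = random.Random(1)
n = 300
data = {name: [rng.gauss(0.0, 1.0) for _ in range(n)]
        for name in ("x1", "x2", "x3", "x4")}
y = [1.0 + 2.0 * a - 1.5 * b + rng.gauss(0.0, 1.0)
     for a, b in zip(data["x1"], data["x2"])]

chosen = stepwise(data, y)
print("selected predictors:", sorted(chosen))
```

Because a variable is kept only if it improves the criterion given the others already in the model, a predictor with a large coefficient in the full model can still be dropped.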
After refining the model with the forward/backward stepwise method, we found that density, the variable with the largest beta coefficient (in absolute value) in our original regression, did not appear in the final, streamlined model. Our analysis also identified a notable number of outliers and influential points in the dataset. We identified 104 influential points using a cutoff value of 0.015, and two outliers, points 391 and 441, based on their standardized residuals.
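The following sketch shows how influential points and outliers of this kind can be flagged. It is an illustration on synthetic data with a single predictor, not the code behind the counts above. The text does not name the influence measure behind the 0.015 cutoff, so the sketch assumes Cook's distance. It also assumes an outlier rule of an absolute standardized residual above 3, which is likewise not stated in the text.

```python
import math
import random


def influence_measures(x, y):
    """Return (standardized residuals, Cook's distances) for y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    # Leverage of each point in a simple linear regression.
    leverage = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    std_resid = [e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(resid, leverage)]
    # Cook's distance written in terms of the standardized residual.
    cooks = [r * r * h / (p * (1.0 - h)) for r, h in zip(std_resid, leverage)]
    return std_resid, cooks


# Synthetic data (NOT the wine data) with one gross outlier planted at index 10.
rng = random.Random(2)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y[10] += 10.0

std_resid, cooks = influence_measures(x, y)

COOKS_CUTOFF = 0.015   # cutoff value quoted in the text; the measure is assumed
OUTLIER_LIMIT = 3.0    # assumed rule, not taken from the text

influential = [i for i, d in enumerate(cooks) if d > COOKS_CUTOFF]
outliers = [i for i, r in enumerate(std_resid) if abs(r) > OUTLIER_LIMIT]
print("influential points:", len(influential))
print("outliers (0-based indices):", outliers)
```

A fixed cutoff this small will tend to flag many points in a large dataset, so the flagged set is best treated as a list of observations to inspect rather than to remove.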
The analysis indicates a strong linear relationship between most predictor variables and wine quality. However, we recommend reconsidering the inclusion of free and total sulfur dioxide, because their residual plots show clear heteroscedasticity, that is, non-constant residual variance. Among the predictors, density, chlorides, and volatile acidity exhibited the most pronounced negative linear relationships with wine quality, while sulfates showed the largest increase in quality per unit change. Although our dataset contained relatively few outliers, over 100 influential points were identified, so we recommend verifying the accuracy of data entry for these points.