{"id":546,"date":"2018-12-08T09:53:23","date_gmt":"2018-12-08T09:53:23","guid":{"rendered":"https:\/\/alternative-spaces.com\/blog\/?p=546"},"modified":"2023-05-12T09:19:15","modified_gmt":"2023-05-12T09:19:15","slug":"what-you-must-know-about-weighted-linear-regression-in-r","status":"publish","type":"post","link":"https:\/\/alternative-spaces.com\/blog\/what-you-must-know-about-weighted-linear-regression-in-r\/","title":{"rendered":"What You Must Know About Weighted Linear Regression in R"},"content":{"rendered":"<p dir=\"ltr\">The narrow path to\u00a0machine learning\u00a0(ML) leads through the rough terrain of the land of statistics. If you are striving to become a data specialist, then you could go deeper and learn the ABCs of weighted linear regression in R (the programming language and the development environment). It\u2019s helpful not only for job interviews but also for solving problems that improve our quality of life.<\/p>\n<p dir=\"ltr\">To be able to handle ML and BI you need to make friends with regression equations. It\u2019s not enough to learn two or three methods and pass an exam. You\u2019ve got to learn to solve problems from daily life: to find dependencies between variables and, ideally, to be able to differentiate signal from noise.<br \/>\nFitting a regression can be a piece of cake. A basic model such as Model &lt;- lm(y ~ x) would suffice. The issue is how to make the model more accurate. Let\u2019s imagine your model gave an adjusted R squared (R2) of 0.867; how can you improve it? Read on to find the answer.<\/p>\n<h2 dir=\"ltr\">Digesting the pie of regression<\/h2>\n<p dir=\"ltr\">Regression is a parametric tool used to predict the dependent variable from one or more independent variables. It\u2019s called parametric because certain assumptions are made about the data set. 
If the data set corresponds to these assumptions, regression yields great dividends.<\/p>\n<p dir=\"ltr\">However, if you are struggling to achieve greater accuracy, there is no need to panic. We\u2019ll learn some tips on how to achieve more accurate results.<\/p>\n<p dir=\"ltr\">Mathematically, a linear regression predicts the dependent variable with a model of the form Y = \u03b20 + \u03b21X + \u03b5<\/p>\n<p dir=\"ltr\"><em>Where \u00a0\u03b5 &#8211; Error<\/em><\/p>\n<p dir=\"ltr\"><em>\u03b20 &#8211; Intercept (coefficient)<\/em><\/p>\n<p dir=\"ltr\"><em>\u03b21 &#8211; Slope (coefficient)<\/em><\/p>\n<p dir=\"ltr\"><em>Y &#8211; Dependent variable<\/em><\/p>\n<p dir=\"ltr\"><em>X &#8211; Independent variable<\/em><\/p>\n<p dir=\"ltr\">This equation is known as a simple linear regression. It\u2019s called simple since we have only one independent variable (X). In multiple regression, we have two or more independent variables. You might recall studying this equation at school.<\/p>\n<p dir=\"ltr\"><em>Y &#8211; is what we try to predict<\/em><\/p>\n<p dir=\"ltr\"><em>X &#8211; is a variable which we use to make a prediction<\/em><\/p>\n<p dir=\"ltr\"><em>\u03b20 is the intercept. It\u2019s the value of Y you get when X = 0.<\/em><\/p>\n<p dir=\"ltr\"><em>\u03b21 is the slope. It describes the change in Y when X increases by one unit.<\/em><\/p>\n<p dir=\"ltr\"><em>\u03b5 stands for the residual value, that is to say, it shows the discrepancy between the real and predicted values.<\/em><\/p>\n<p dir=\"ltr\">Error is part and parcel of our lives. We could choose the most robust algorithm, but there\u2019ll always be an \u03b5, reminding us that we simply can\u2019t predict the future precisely.<\/p>\n<div>\n<h2 dir=\"ltr\">Squaring the ordinary least squares<\/h2>\n<p dir=\"ltr\">Although error is unavoidable, we can try to minimize it as much as possible. 
This technique is generally known as Ordinary Least Squares (OLS).<\/p>\n<p dir=\"ltr\">Linear regression gives an estimate that minimizes the distance between the fitted line and all of the data points. Practically speaking, OLS minimizes the sum of all squared residuals.<\/p>\n<p dir=\"ltr\">Nowadays, with programming languages and free code, you can do so much more! You can go beyond ordinary least squares when your data call for it. In R, fitting a linear regression by ordinary least squares takes only one line of\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/weighted-linear-regression-in-r\" rel=\"no follow\">code<\/a>:<\/p>\n<p dir=\"ltr\"><em>Model &lt;- lm(Y ~ X, data = X_data)<\/em><\/p>\n<p dir=\"ltr\">X can be replaced by several variables (for example, X1 + X2) to get a multiple regression. The model can then be used to predict from a new data set by adding another line of\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/weighted-linear-regression-in-r\" rel=\"no follow\">code<\/a>. Note that the argument is called newdata and that the new data frame must contain a column named X:<\/p>\n<p dir=\"ltr\"><em>Y_pred &lt;- predict(Model, newdata = new_X_data)<\/em><\/p>\n<div>\n<h2 dir=\"ltr\">Generation of \u201cperfect\u201d data points<\/h2>\n<p>After that, we can easily generate \u201cperfect\u201d data points in R, add noise to them, and fit a regression. 
Using the\u00a0<em>lm(Y ~ X, data)<\/em>\u00a0method, we obtain the following results:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1901\" height=\"1202\" class=\"aligncenter size-full wp-image-548\" src=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1.jpg\" alt=\"weighted regression in r\" srcset=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1.jpg 1901w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1-150x95.jpg 150w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1-300x190.jpg 300w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1-768x486.jpg 768w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/1-1024x647.jpg 1024w\" sizes=\"auto, (max-width: 1901px) 100vw, 1901px\" \/><\/p>\n<p dir=\"ltr\">On the left we have the so-called \u201cnoisy data.\u201d The line here is the linear regression fit. On the right, we have a graph showing the residuals, plotted as a histogram with a normal curve of the same mean and standard deviation superimposed. R also provides a convenient summary of the fitted model. 
We get the following\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/weighted-linear-regression-in-r\" rel=\"no follow\">model summary<\/a>:<\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0<\/em><code><tt>\u00a0<em>\u00a0&gt; summary(Model)<\/em><\/tt><\/code><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Call:<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0lm(formula = Y_noisy ~ X, data = Y)<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Residuals:<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Min \u00a0\u00a01Q \u00a0\u00a0Median \u00a0\u00a03Q \u00a0\u00a0Max<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0-11.1348 \u00a0\u00a0-2.9799 \u00a0\u00a00.3627 \u00a0\u00a02.9478 \u00a0\u00a010.3814<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Coefficients:<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Estimate \u00a0\u00a0Std. Error \u00a0\u00a0t value \u00a0\u00a0Pr(&gt;|t|)<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0(Intercept) \u00a0\u00a03.51543 \u00a0\u00a00.98362 \u00a0\u00a03.574 \u00a0\u00a00.000548 ***<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0X \u00a0\u00a02.11284 \u00a0\u00a00.01691 \u00a0\u00a0124.946 \u00a0\u00a0&lt; 2e-16 ***<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0---<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Signif. 
codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Residual standard error: 4.881 on 98 degrees of freedom<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0Multiple R-squared: \u00a00.9938, Adjusted R-squared: \u00a00.9937<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\"><em><code><tt>\u00a0\u00a0\u00a0\u00a0\u00a0F-statistic: 1.561e+04 on 1 and 98 DF, \u00a0p-value: &lt; 2.2e-16<\/tt><\/code><\/em><\/p>\n<p dir=\"ltr\">The coefficients deviate only slightly from those of the underlying model. On top of that, both parameter estimates are highly significant.<\/p>\n<div>\n<h2 dir=\"ltr\">A more sophisticated plot<\/h2>\n<p dir=\"ltr\">A more complex data set can be generated with\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/weighted-linear-regression-in-r\" rel=\"no follow\">code<\/a>\u00a0like this:<\/p>\n<p dir=\"ltr\"><em>X_data &lt;- seq(1, 1000, 1)<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0#<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0# Y is linear in X with uniform, periodic, and skewed noise<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0#<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Y_raw &lt;- 1.37 + 2.097 * X_data<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Y_noise &lt;- (X_data \/ 100) * 25 * (sin(2 * pi * X_data\/100)) *<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0runif(n = length(X_data), min = 3, max = 4.5) +<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0(X_data \/ 100)^3 * runif(n = 100, min = 1, max = 5)<\/em><\/p>\n<p dir=\"ltr\"><em>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Y &lt;- data.frame(X = X_data, Y = Y_raw + Y_noise)<\/em><\/p>\n<p dir=\"ltr\">We can get 
the following graphs by fitting\u00a0<em>lm(Y ~ X, data = Y)<\/em>:<\/p>\n<p dir=\"ltr\"><img loading=\"lazy\" decoding=\"async\" width=\"1900\" height=\"1200\" class=\"aligncenter size-full wp-image-549\" src=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2.jpg\" alt=\"weighted least square in r\" srcset=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2.jpg 1900w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2-150x95.jpg 150w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2-300x189.jpg 300w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2-768x485.jpg 768w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/2-1024x647.jpg 1024w\" sizes=\"auto, (max-width: 1900px) 100vw, 1900px\" \/><\/p>\n<p dir=\"ltr\">On the left, we see a graph of the raw data. The red line is the ordinary least squares line, and the dashed line depicts the \u201cactual Y\u201d, which in practice would usually be unknown. The right graph again shows the residuals as a histogram with a normal curve superimposed.<\/p>\n<p dir=\"ltr\">A disclaimer is needed here. First of all, this is quite a contrived scenario. Even knowing nothing else, you might guess from the plot that the data are periodic, with an amplitude that grows with X, around an underlying trend that is most likely linear.<\/p>\n<p dir=\"ltr\">Let\u2019s suppose that there\u2019s a good reason to believe that the true Y is linear and that the red line sits too high. Taking all of it into consideration, you have to make your mind up as to what to do with all of these data points.<\/p>\n<p dir=\"ltr\">There\u2019s always the option to sack the statistician who collected such error-ridden data and start from scratch again. 
For our purposes, let\u2019s say you are launching some project tomorrow and you need to come up with a workable regression line to serve as the predictive model for your control program.<\/p>\n<h2 dir=\"ltr\">A \u201cnoisy\u201d graph<\/h2>\n<p dir=\"ltr\">It just so happens that we have a lot of noise here. However, the histogram of residuals in the right-hand chart above looks fairly well behaved.<\/p>\n<p dir=\"ltr\"><img loading=\"lazy\" decoding=\"async\" width=\"1900\" height=\"1200\" class=\"aligncenter size-full wp-image-550\" src=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3.jpg\" alt=\"weighted least square in r\" srcset=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3.jpg 1900w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3-150x95.jpg 150w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3-300x189.jpg 300w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3-768x485.jpg 768w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/3-1024x647.jpg 1024w\" sizes=\"auto, (max-width: 1900px) 100vw, 1900px\" \/><\/p>\n<p dir=\"ltr\">On the left is a plot of the residuals versus the fitted Y values. The fundamental premise behind ordinary least squares is that the residuals don\u2019t depend on the function\u2019s value and that their distribution is the same everywhere.<\/p>\n<p dir=\"ltr\">However, the left chart shows that this assumption is violated. In fact, it gives us more information than the histogram did. On the right is a normal Q-Q plot, on which residuals from a normal distribution would all fall on the red line. Here, the departure from the line grows toward the upper end. 
So it\u2019s indicative of what we could have missed if we had looked only at the histogram or the fit.<\/p>\n<p dir=\"ltr\">This type of behavior is called heteroscedasticity, which means that the variance of the residuals is not constant. What can be done about that? There are plenty of approaches, but for our case, we are going to focus on the weighted linear regression model.<\/p>\n<h2 dir=\"ltr\">A weighted linear regression model<\/h2>\n<p dir=\"ltr\">Statistics as a science can be instrumental in a myriad of ways. For instance, it can help us find proper weights to apply to the raw data points to make the regression model more accurate.<\/p>\n<p dir=\"ltr\">Typically, the \u201cweights\u201d argument works like this: the most suitable weight for each observation in the weighted linear model is 1 divided by the variance of the residuals at that point.<\/p>\n<p dir=\"ltr\">The hot potato question becomes: How do you get the variance values? The pragmatic answer is: a bin size, in our case 10 measurements, has to be defined.<\/p>\n<p dir=\"ltr\">Then a rolling value of the residual variance over the 10 measurements in the bin can be calculated. This value can be used for the X that corresponds to the bin center.<\/p>\n<p dir=\"ltr\">At the left and right ends of the data, we\u2019ve simply used the variance of the first and the last 10 values, respectively. 
Voila, the variance of the residuals is now known for every Y value.<\/p>\n<p dir=\"ltr\">Now we can fit a robust linear regression, using rlm from the MASS package, with these\u00a0<a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blogs\/weighted-linear-regression-in-r\" rel=\"no follow\">weights<\/a>:<\/p>\n<p dir=\"ltr\"><em>Weighted_fit &lt;-\u00a0<\/em><em>rlm<\/em><em>(Y ~ X, data = Y, weights = 1\/sd_variance)<\/em><\/p>\n<p dir=\"ltr\">Applying\u00a0<em>rlm<\/em>, we get the following results:<\/p>\n<p dir=\"ltr\"><img loading=\"lazy\" decoding=\"async\" width=\"1900\" height=\"1200\" class=\"aligncenter size-full wp-image-551\" src=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4.jpg\" alt=\"weighted linear regression\" srcset=\"https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4.jpg 1900w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4-150x95.jpg 150w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4-300x189.jpg 300w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4-768x485.jpg 768w, https:\/\/alternative-spaces.com\/blog\/wp-content\/uploads\/2018\/12\/4-1024x647.jpg 1024w\" sizes=\"auto, (max-width: 1900px) 100vw, 1900px\" \/><\/p>\n<div class=\"blogPost__container\">\n<div class=\"col l12 blogPost__content\">\n<div>\n<p dir=\"ltr\">On the left, we see a new addition: a green line for the weighted fit. Note that the only new information we have added is the residual variance. On the right, one can spot how skewed the residuals are.<\/p>\n<p dir=\"ltr\">Nonetheless, we are closer to the\u00a0<em>true Y<\/em>.<\/p>\n<h2 dir=\"ltr\">Wrapping up:<\/h2>\n<p dir=\"ltr\">The purpose of this article was to get you deeper into solving the regression problem. Sooner or later you\u2019ll come across the issue of low model accuracy, and you\u2019ll need to tackle it. 
Hopefully, you&#8217;ve gained some insight into working with weighted linear regression in R.<\/p>\n<p dir=\"ltr\">You should now also see the importance of weighted least squares in R for predicting your future sales. This technique could save the day for your company if it helps you predict future trends more accurately.<\/p>\n<p dir=\"ltr\">Content created by our partner, Onix-systems.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The narrow path to\u00a0machine learning\u00a0(ML) leads through the rough terrain of the land of statistics. If you are striving to become a data specialist, then you could go deeper and learn the ABC\u2019s of weighted linear regression in R (the programming language and the development environment). It\u2019s helpful for organizing job interviews but also for [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":547,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-546","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/posts\/546","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/comments?post=546"}],"version-history":[{"count":4,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/posts\/546\/revisions"}],"predecessor-version":[{"id":2624,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json
\/wp\/v2\/posts\/546\/revisions\/2624"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/media\/547"}],"wp:attachment":[{"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/media?parent=546"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/categories?post=546"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/alternative-spaces.com\/blog\/wp-json\/wp\/v2\/tags?post=546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}