{"id":5301,"date":"2019-03-29T20:34:44","date_gmt":"2019-03-29T12:34:44","guid":{"rendered":"http:\/\/www.finereport.com\/en\/?p=5301"},"modified":"2019-10-24T16:52:52","modified_gmt":"2019-10-24T08:52:52","slug":"top-8-statistics-in-2019","status":"publish","type":"post","link":"https:\/\/frg.fineres.com\/en\/2019\/03\/29\/top-8-statistics-in-2019\/","title":{"rendered":"Top 8 Statistics Data Analysts Need to Master in 2019"},"content":{"rendered":"\n<p>Statistics is the cornerstone of <a href=\"http:\/\/www.finereport.com\/en\/category\/data-analysis\">data analysis<\/a>. Once you have learned some statistics, you will notice how often an analysis turns out to be unreliable. For example, many people like to summarize results with an average alone, but this is often rough and inaccurate. 
With statistics, you can look at data from a more rigorous perspective.<\/p>\n\n\n\n<p>Most data analysis draws on the following areas of statistics, so these are the ones to focus on:<\/p>\n\n\n\n<ul><li>Basic statistics: mean, median, mode, variance, standard deviation, percentiles, etc.<\/li><li>Probability distributions: geometric, binomial, Poisson, normal, etc.<\/li><li>Population and sample: the basic concepts and the idea of sampling<\/li><li>Confidence intervals and hypothesis testing: how to validate a claim<\/li><li>Correlation and regression analysis: basic models for general data analysis<\/li><\/ul>\n\n\n\n<p>With basic statistics, you can build more varied visualizations and analyze data at a finer grain. You will also need to learn the Excel functions that perform the basic calculations, or the corresponding visualization methods in Python and R.<\/p>\n\n\n\n<p>With the concepts of population and sample, you will know how to analyze a sample when the full data set is too large to work with. You can apply a hypothesis test to check an intuitive assumption more rigorously. Using regression analysis, you can make basic predictions about future or missing values.<\/p>\n\n\n\n<p>Even after you understand the statistical principles, you may not know how to implement them with <a href=\"http:\/\/www.finereport.com\/en\/category\/reporting-tools\">tools<\/a>. In that case, look for an implementation online or in a book.<\/p>\n\n\n\n<p>In addition, you can learn the principles behind popular algorithms such as linear regression, logistic regression, decision trees, neural networks, correlation analysis, clustering, collaborative filtering, and random forests. Going a little deeper, you can also study related areas such as text analysis, deep learning, and image recognition. 
For these algorithms, you need not only to understand the principles but also to be able to explain them fluently, and you should know some of their application scenarios in various industries. If your current job does not require them, they need not be your focus.<\/p>\n\n\n\n<p>This article summarizes the key points.<\/p>\n\n\n\n<p>Summary:<\/p>\n\n\n\n<ol><li>Central tendency<\/li><li>Variability<\/li><li>Normalization<\/li><li>Normal distribution<\/li><li>Sampling distribution<\/li><li>Estimate<\/li><li>Hypothesis testing<\/li><li>T-test<\/li><\/ol>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-1.TheConcentrationTrend\">1. Central Tendency<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-1.1Mode\">1.1 Mode<\/h4>\n\n\n\n<p>The mode is the value that appears most often in a set of numbers. A set can have more than one mode.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-1.2Median\">1.2 Median<\/h4>\n\n\n\n<p>The median of a finite list of numbers is found by arranging the numbers from smallest to largest and taking the middle value.<\/p>\n\n\n\n<p>If the list has an odd number of values, the middle one is the median. If it has an even number of values, there is no single middle value; the median is then usually defined as the&nbsp;mean&nbsp;of the two middle values.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-1.3Average\">1.3 Average<\/h4>\n\n\n\n<p>The average (arithmetic mean) is a calculated &#8220;central&#8221; value of a set of numbers.&nbsp;To calculate it, add up all the numbers, then divide by how many numbers there are.<\/p>\n\n\n\n<p>The average is familiar to most people, but it can be greatly affected by a few extreme values. For example, suppose there are 20 people in your class and almost everyone has a similar income: 19 of them earn around 5,000. 
However, one student has successfully started a business and has an annual income of 100 million. The average income in your class is then about 5 million, even though nobody in the class earns anything close to that. In this case the median, which stays around 5,000, is more reasonable and reflects the real situation.<\/p>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.Variability\">2. Variability<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.1Quartile\">2.1 Quartile<\/h4>\n\n\n\n<p>The median, mentioned above, divides the sorted sample into two halves. Finding the median of each half then divides the sample into four parts. The value at the 1\/4 position is recorded as Q1, the value at 2\/4 as Q2 (the median itself), and the value at 3\/4 as Q3.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.2InterquartilerangeIQR=Q3-Q1\">2.2 Interquartile Range: IQR = Q3 - Q1<\/h4>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p9.pstatp.com\/large\/pgc-image\/bed3939cd89d4b1a9d2c3f6b0385e2c4\" alt=\"Interquartile Range IQR=Q3-Q1\"\/><\/figure>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.3Outliers\">2.3 Outliers<\/h4>\n\n\n\n<p>A value is commonly treated as an outlier if it is smaller than Q1 - 1.5 &times; IQR or greater than Q3 + 1.5 &times; IQR.<\/p>\n\n\n\n<p>Outliers usually need to be handled during data processing, most often by removing them.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.4Variance\">2.4 Variance<\/h4>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p9.pstatp.com\/large\/pgc-image\/c0baadb6c9664077bc252240615aacd8\" alt=\"Variance\"\/><\/figure>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.5Squaredeviation\">2.5 Standard Deviation<\/h4>\n\n\n\n<p>The standard deviation is the non-negative square root of the variance.<\/p>\n\n\n\n<h4 
id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-2.6.Beziercorrection:correctsamplevariance\">2.6 Bessel&#8217;s Correction: Correcting the Sample Variance<\/h4>\n\n\n\n<p>In practice, when the variance is estimated from a sample, the denominator is n-1 rather than the sample size n. Suppose, for example, that we draw a sample from a large data set that follows a Gaussian distribution and use the variance of the sample to estimate the variance of the whole data set. Most sampled values fall near the center x=&mu; and few come from the tails of the distribution, and the deviations are measured from the sample mean rather than the true mean, so dividing by n tends to give an estimate smaller than the variance of the full data set. To compensate, we replace n with n-1 in the formula, which increases the estimate. This adjustment is called Bessel&#8217;s correction.<\/p>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-3.Normalization\">3. Normalization<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-3.1Standardscore\">3.1 Standard Score<\/h4>\n\n\n\n<p>The major purpose of standard scores is to place scores for any individual on any variable having any mean and standard deviation on the same standard scale so that comparisons can be made. Without some standard scale, comparisons across individuals and\/or across variables would be difficult to make (Lomax, 2001, p. 68). In other words, a standard score is another way to compare a student&#8217;s performance to that of the standardization sample. A standard score (or scaled score) is calculated by taking the raw score and transforming it to a common scale. In its most common form, the z-score, this means subtracting the mean from the raw score and dividing by the standard deviation. A standard score is based on a normal distribution with a mean and a standard deviation (see Figure 1). The black line at the center of the distribution represents the mean. 
The turquoise lines represent standard deviations.<\/p>\n\n\n<p><img src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019032903J.jpg\" width=\"750px\"><\/p>\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-4.TheNormalDistribution\">4. The Normal Distribution<\/h3>\n\n\n\n<p>Normal distributions are important in&nbsp;statistics&nbsp;and are often used in the&nbsp;natural&nbsp;and&nbsp;social sciences&nbsp;to represent real-valued&nbsp;random variables&nbsp;whose distributions are not known.<\/p>\n\n\n\n<p>The normal distribution is sometimes informally called the&nbsp;<strong>bell curve<\/strong>.&nbsp;<\/p>\n\n\n<p>&nbsp;<\/p>\n<p><img src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019032904J.png\" width=\"750px\"><\/p>\n<p>&nbsp;<\/p>\n\n\n<p>The red curve is the&nbsp;<em>standard normal distribution<\/em>, which has mean 0 and standard deviation 1.<\/p>\n\n\n\n<p>Many things approximately follow a normal distribution:<\/p>\n\n\n\n<ul><li>heights of people<\/li><li>size of things produced by machines<\/li><li>errors in measurements<\/li><li>blood pressure<\/li><li>marks on a test<\/li><\/ul>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-5.Samplingdistribution\">5. Sampling Distribution<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-5.1Centrallimittheorem\">5.1 Central Limit Theorem<\/h4>\n\n\n\n<p>The Central Limit Theorem\u00a0(CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately normal, whatever the shape of the population distribution. The mean of the sample means is approximately equal to the mean of the population. 
See\u00a0<a href=\"https:\/\/builtin.com\/data-science\/understanding-central-limit-theorem\">Un<\/a><a href=\"https:\/\/towardsdatascience.com\/understanding-the-central-limit-theorem-642473c63ad8\">derstanding The Central Limit\u00a0Theorem<\/a>\u00a0to learn more about the CLT.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-5.2Samplingdistribution\">5.2 Sampling Distribution<\/h4>\n\n\n\n<p>A sampling distribution is the&nbsp;probability distribution&nbsp;of a statistic obtained from a large number of samples drawn from a specific population. It describes how often each of the different possible values of that statistic would occur across those samples.<\/p>\n\n\n\n<p>Sampling distributions are important in statistics because they provide a major simplification en route to&nbsp;statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the&nbsp;joint probability distribution&nbsp;of all the individual sample values.<\/p>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-6.Estimate\">6. 
Estimate<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-6.1Errorbound\">6.1 Error Bound<\/h4>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/22159c0e74d1487eb8b09804449a79a9\" alt=\"Error Bound\"\/><\/figure>\n\n\n\n<h4>6.2 Confidence<\/h4>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-6.3Confidenceinterval\">6.3 Confidence Interval<\/h4>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/833d7a6124814cd9898d6c6f6bb39042\" alt=\"Confidence Interval\"\/><\/figure>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-7.HypothesisTesting\">7. Hypothesis Testing<\/h3>\n\n\n\n<p>Hypothesis testing is the use of statistics to judge how plausible a given hypothesis is: it asks how likely the observed data would be if the hypothesis were true.<\/p>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.T-test\">8. T-test<\/h3>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.1Introduction\">8.1 Introduction<\/h4>\n\n\n\n<p>A t-test is a type of inferential&nbsp;statistic&nbsp;used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is mostly used when the data, such as the outcomes recorded from flipping a coin 100 times, are expected to follow a normal distribution and the variances may be unknown. 
A t-test is used as a hypothesis testing tool, which allows you to test an assumption about a population.<\/p>\n\n\n\n<p>See Investopedia&#8217;s&nbsp;<em><a href=\"https:\/\/www.investopedia.com\/terms\/t\/t-test.asp\">T-Test Definition<\/a><\/em>&nbsp;for more detail.<\/p>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.2IndependentSampleT-test:\">8.2 Independent Sample T-test<\/h4>\n\n\n\n<p>Suppose we want to analyze whether boys and girls have the same average height. The two samples come from different individuals, so they are independent. What distinguishes this test from the paired sample t-test is the source of the data and the question being analyzed.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/9833efa7bd0b454ea891de8789b9df73\" alt=\"Independent Sample T-test\"\/><\/figure>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.3PairedSampleT-Test\">8.3 Paired Sample T-Test<\/h4>\n\n\n\n<p>To find out whether a person&#8217;s height differs between morning and evening, I found some people and measured their height in the morning and again in the evening. Everyone here has two values. 
The two measurements from each person form a matched pair.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p1.pstatp.com\/large\/pgc-image\/421f65fe90af49eaa3b59762e1960059\" alt=\"Paired Sample T-Test\"\/><\/figure>\n\n\n\n<p>Sample Error<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/056fc212690f49a1a346ab5b12d33769\" alt=\"Sample Error\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/8d92191c8e2b43e5a495deee08f39251\" alt=\"Sample Error\"\/><\/figure>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.4PooledVariance\">8.4 Pooled Variance<\/h4>\n\n\n\n<p>When two samples have different means but their variances are assumed to be the same, the two sample variances are combined into a single pooled estimate.<\/p>\n\n\n\n<p>Don&#8217;t be scared by the formula: in essence, it is a weighted average of the two sample variances.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p1.pstatp.com\/large\/pgc-image\/b56a18667db347a19196eafe1dd9e8bf\" alt=\"Pooled Variance\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p1.pstatp.com\/large\/pgc-image\/6ea47ff5ffd14dfa9780b045c51279c3\" alt=\"Pooled Variance\"\/><\/figure>\n\n\n\n<h4 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-8.5Cohen\u2019sd\">8.5 Cohen\u2019s d<\/h4>\n\n\n\n<p>Cohen\u2019s d is an effect size for the difference between two means, expressed in units of standard deviation. In&nbsp;statistics, an&nbsp;effect size&nbsp;is a quantitative measure of the magnitude of a&nbsp;phenomenon.&nbsp;Examples of effect sizes are the&nbsp;correlation&nbsp;between two variables, the&nbsp;regression&nbsp;coefficient in a regression, the&nbsp;mean&nbsp;difference, or even the risk with which something happens, such as how many people survive after a heart attack for every one person that does not survive. 
For most types of effect size, a larger&nbsp;absolute value&nbsp;indicates a stronger effect; the main exception is the odds ratio.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/24e90fd3d65c46d5863cbcd316197d15\" alt=\"Cohen\u2019s d\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"http:\/\/p3.pstatp.com\/large\/pgc-image\/440e0c6ba540486599320fcc1f5eca91\" alt=\"Cohen\u2019s d\"\/><\/figure>\n\n\n\n<p>What statistical technique do you use most to analyze data?&nbsp;Let me know in the comments below!<\/p>\n\n\n\n<p>Follow&nbsp;<a href=\"https:\/\/www.facebook.com\/finereport\/\">FineReport Reporting Software<\/a>&nbsp;on Facebook to master data science together!<\/p>\n\n\n\n<h3 id=\"id-190329-\u5b98\u7f51-Top8StatisticalTechniquesDataScientistsNeedtoMasterin2019-References\">References<\/h3>\n\n\n\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Median\">Wikipedia-Median<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.mathsisfun.com\/definitions\/\">Mathsisfun-Definitions<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/www.investopedia.com\/terms\/s\/sampling-distribution.asp\">Investopedia-Sampling Distribution<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/mathworld.wolfram.com\/HypothesisTesting.html\">MathWorld-Hypothesis Testing<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Effect_size\">Wikipedia-Effect Size<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Statistical_hypothesis_testing\">Wikipedia-Statistical Hypothesis Testing<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Statistics is the cornerstone of data analysis. Here is a summary of the key statistical concepts, covering basic statistics, probability distributions, population and sample, confidence intervals, etc. 
<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[161],"tags":[],"_links":{"self":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/5301"}],"collection":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/comments?post=5301"}],"version-history":[{"count":13,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/5301\/revisions"}],"predecessor-version":[{"id":8563,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/5301\/revisions\/8563"}],"wp:attachment":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/media?parent=5301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/categories?post=5301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/tags?post=5301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}