ANOVA
Crash Course Statistics - S1 - E33
Today we're going to continue our discussion of statistical models by showing how we can find out whether there are differences between multiple groups using a collection of models called ANOVA. ANOVA, which stands for Analysis of Variance, is similar to regression (which we discussed in episode 32), but allows us to compare three or more groups for statistical significance.
Crash Course Statistics: Season 1 - 44 Episodes
1x1 - What Is Statistics
January 24, 2018
Welcome to Crash Course Statistics! In this series we're going to take a look at the important role statistics play in our everyday lives, because statistics are everywhere! Statistics help us better understand the world and make decisions from what you'll wear tomorrow to government policy. But in the wrong hands, statistics can be used to misinform. So we're going to try to do two things in this series. Help show you the usefulness of statistics, but also help you become a more informed consumer of statistics. From probabilities, paradoxes, and p-values there's a lot to cover in this series, and there will be some math, but we promise only when it's most important. But first, we should talk about what statistics actually are, and what we can do with them. Statistics are tools, but they can't give us all the answers.
1x2 - Mathematical Thinking
January 31, 2018
Today we’re going to talk about numeracy - that is, understanding numbers. From really really big numbers to really small numbers, it's difficult to comprehend information at this scale, but these are often the types of numbers we see most in statistics. So understanding how these numbers work, how to best visualize them, and how they affect our world can help us become better decision makers - from deciding if we should really worry about Ebola to helping improve fighter jets during World War II!
1x3 - Mean, Median, and Mode: Measures of Central Tendency
February 7, 2018
Today we’re going to talk about measures of central tendency - those are the numbers that tend to hang out in the middle of our data: the mean, the median, and mode. All of these numbers can be called “averages” and they’re the numbers we tend to see most often - whether it’s in politics when talking about polling or income equality to batting averages in baseball (and cricket) and Amazon reviews. Averages are everywhere so today we’re going to discuss how these measures differ, how their relationship with one another can tell us a lot about the underlying data, and how they are sometimes used to mislead.
1x4 - Measures of Spread
February 14, 2018
Today, we're looking at measures of spread, or dispersion, which we use to understand how well medians and means represent the data, and how reliable our conclusions are. They can help understand test scores, income inequality, spot stock bubbles, and plan gambling junkets. They're pretty useful, and now you're going to know how to calculate them!
1x5 - Charts Are Like Pasta - Data Visualization Part 1
February 21, 2018
Today we're going to start our two-part unit on data visualization. Up to this point we've discussed raw data - which are just numbers - but usually it's much more useful to represent this information with charts and graphs. There are two types of data we encounter, categorical and quantitative data, and they likewise require different types of visualizations. Today we'll focus on bar charts, pie charts, pictographs, and histograms and show you what they can and cannot tell us about their underlying data as well as some of the ways they can be misused to misinform.
1x6 - Plots, Outliers, and Justin Timberlake: Data Visualization Part 2
February 28, 2018
Today we’re going to finish up our unit on data visualization by taking a closer look at how dot plots, box plots, and stem and leaf plots represent data. We’ll also talk about the rules we can use to identify outliers and apply our new data viz skills by taking a closer look at how Justin Timberlake’s song lyrics have changed since he went solo.
1x7 - The Shape of Data: Distributions
March 7, 2018
When collecting data to make observations about the world it usually just isn't possible to collect ALL THE DATA. So instead of asking every single person about student loan debt, for instance, we take a sample of the population, and then use the shape of our samples to make inferences about the true underlying distribution of our data. It turns out we can learn a lot about how something occurs, even if we don't know the underlying process that causes it. Today, we’ll also introduce the normal (or bell) curve and talk about how we can learn some really useful things from a sample's shape - like if an exam was particularly difficult, how often Old Faithful erupts, or if there are two types of runners that participate in marathons!
1x8 - Correlation Doesn’t Equal Causation
March 14, 2018
Today we’re going to talk about data relationships and what we can learn from them. We’ll focus on correlation, which is a measure of how two variables move together, and we’ll also introduce some useful statistical terms you’ve probably heard of like regression coefficient, correlation coefficient (r), and r^2. But first, we’ll need to introduce a useful way to represent bivariate continuous data - the scatter plot. The scatter plot has been called “the most useful invention in the history of statistical graphics” but that doesn’t necessarily mean it can tell us everything. Just because two data sets move together doesn’t necessarily mean one CAUSES the other. This gives us one of the most important tenets of statistics: correlation does not imply causation.
1x9 - Controlled Experiments
March 21, 2018
We may be living IN a simulation (according to Elon Musk and many others), but that doesn't mean we don't need to perform simulations ourselves. Today, we're going to talk about good experimental design and how we can create controlled experiments to minimize bias when collecting data. We'll also talk about single and double blind studies, randomized block design, and how placebos work.
1x10 - Sampling Methods and Bias with Surveys
March 28, 2018
Today we’re going to talk about good and bad surveys. Surveys are everywhere, from user feedback surveys to telephone polls, and those questionnaires at your doctor's office. Still, with their ease to create and distribute, they're also susceptible to bias and error. So today we’re going to talk about identifying good and bad survey questions, and how groups (or samples) are selected to represent the entire population since it's often just not feasible to ask everyone.
1x11 - Science Journalism
April 11, 2018
We’ve talked a lot in this series about how often you see data and statistics in the news and on social media - which is ALL THE TIME! But how do you know who and what you can trust? Today, we’re going to talk about how we, as consumers, can spot flawed studies, sensationalized articles, and just plain poor reporting. And this isn’t to say that all science articles you read on Facebook or in magazines are wrong, but that it's valuable to read those catchy headlines with some skepticism.
1x12 - Henrietta Lacks, the Tuskegee Experiment, and Ethical Data Collection
April 18, 2018
Today we’re going to talk about ethical data collection. From the Tuskegee syphilis experiments and Henrietta Lacks’ HeLa cells to the horrifying experiments performed at Nazi concentration camps, many strides have been made from Institutional Review Boards (or IRBs) to the Nuremberg Code to guarantee voluntariness, informed consent, and beneficence in modern statistical gathering. But as we’ll discuss, with the complexities of research in the digital age many new ethical questions arise.
1x13 - Probability Part 1: Rules and Patterns
April 25, 2018
Today we’re going to begin our discussion of probability. We’ll talk about how the addition (OR) rule, the multiplication (AND) rule, and conditional probabilities help us figure out the likelihood of sequences of events happening - from optimizing your chances of having a great night out with friends to seeing Cole Sprouse at IHOP!
1x14 - Probability Part 2: Updating Your Beliefs with Bayes
May 2, 2018
Today we're going to introduce Bayesian statistics and discuss how this new approach to statistics has revolutionized the field from artificial intelligence and clinical trials to how your computer filters spam! We'll also discuss the Law of Large Numbers and how we can use simulations to help us better understand the "rules" of our data, even if we don't know the equations that define those rules.
1x15 - The Binomial Distribution
May 9, 2018
Today we're going to discuss the Binomial Distribution and a special case of this distribution known as a Bernoulli Distribution. The formulas that define these distributions provide us with shortcuts for calculating the probabilities of all kinds of events that happen in everyday life. They can also be used to help us look at how probabilities are connected! For instance, knowing the chance of getting a flat tire today is useful, but knowing the likelihood of getting one this year, or in the next five years, may be more useful. And heads up, this episode is going to have a lot more equations than normal, but to sweeten the deal, we added zombies!
1x16 - Geometric Distributions and The Birthday Paradox
May 16, 2018
Geometric probabilities, and probabilities in general, allow us to guess how long we'll have to wait for something to happen. Today, we'll discuss how they can be used to figure out how many Bertie Bott's Every Flavour Beans you could eat before getting the dreaded vomit flavored bean, and how they can help us make decisions when there is a little uncertainty - like getting a Pikachu in a pack of Pokémon Cards! We'll finish off this unit on probability by taking a closer look at the Birthday Paradox (or birthday problem) which asks the question: how many people do you think need to be in a room for there to likely be a shared birthday? (It's likely much fewer than you would expect!)
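The birthday problem mentioned above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the episode: it assumes 365 equally likely birthdays and ignores leap years, and the function names are made up for illustration.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday.

    Assumes every day is equally likely (an idealization).
    """
    if people > days:
        return 1.0  # more people than days guarantees a match
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group_for_even_odds(days: int = 365) -> int:
    """Smallest group size where a shared birthday is at least 50% likely."""
    n = 1
    while shared_birthday_probability(n, days) < 0.5:
        n += 1
    return n


if __name__ == "__main__":
    n = smallest_group_for_even_odds()
    print(n, round(shared_birthday_probability(n), 4))
```

Under these assumptions the threshold is 23 people, which is far fewer than most people guess.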
1x17 - Randomness
May 23, 2018
There are a lot of events in life that we just can’t predict, but just because something is random doesn’t mean we don’t know or can’t learn anything about it. Today, we’re going to talk about how we can extract information from seemingly random events starting with the expected value or mean of a distribution and walking through the first four “moments” - the mean, variance, skewness, and kurtosis.
1x18 - Z-Scores and Percentiles
May 30, 2018
Today we’re going to talk about how we compare things that aren’t exactly the same - or aren’t measured in the same way. For example, if you wanted to know if a 1200 on the SAT is better than a 25 on the ACT. For this, we need to standardize our data using z-scores - which allow us to make comparisons between two sets of data as long as they’re normally distributed. We’ll also talk about converting these scores to percentiles and discuss how percentiles, though valuable, don’t actually tell us how “extreme” our data really is.
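As a rough illustration of the SAT versus ACT comparison, here is a minimal sketch of the z-score calculation. The means and standard deviations below are hypothetical placeholders chosen for easy arithmetic, not real test statistics, and the percentile step assumes both score distributions are normal.

```python
from statistics import NormalDist


def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations `value` lies above the mean."""
    return (value - mean) / std_dev


# Hypothetical population parameters, assumed for illustration only.
SAT_MEAN, SAT_SD = 1000, 200
ACT_MEAN, ACT_SD = 21, 5

z_sat = z_score(1200, SAT_MEAN, SAT_SD)
z_act = z_score(25, ACT_MEAN, ACT_SD)

# Convert each z-score to a percentile using the standard normal CDF.
standard_normal = NormalDist()
pct_sat = standard_normal.cdf(z_sat) * 100
pct_act = standard_normal.cdf(z_act) * 100

if __name__ == "__main__":
    print(f"SAT 1200: z = {z_sat:.2f}, percentile = {pct_sat:.1f}")
    print(f"ACT 25:   z = {z_act:.2f}, percentile = {pct_act:.1f}")
```

With these made-up parameters the SAT score sits further above its mean, so it would count as the stronger result; different parameters could reverse that.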
1x19 - The Normal Distribution
June 6, 2018
Today is the day we finally talk about the normal distribution! The normal distribution is incredibly important in statistics because distributions of means are normally distributed even if populations aren't. We'll get into why this is so - due to the Central Limit Theorem - but it's useful because it allows us to make comparisons between different groups even if we don't know the underlying distribution of the population being studied.
1x20 - Confidence Intervals
June 13, 2018
Today we’re going to talk about confidence intervals. Confidence intervals allow us to quantify our uncertainty, by allowing us to define a range of values for our predictions and assigning a likelihood that something falls within that range. And confidence intervals come up a lot, like when you get delivery windows for packages, during elections when pollsters cite margins of error, and we use them instinctively in everyday decisions. But confidence intervals also demonstrate the tradeoff of accuracy for precision - the greater our confidence, usually the less useful our range.
1x21 - How P-Values Help Us Test Hypotheses
June 27, 2018
Today we're going to begin our three-part unit on p-values. In this episode we'll talk about Null Hypothesis Significance Testing (or NHST) which is a framework for comparing two sets of information. In NHST we assume that there is no difference between the two things we are observing and use our p-value as a predetermined cutoff for if something seems sufficiently rare or not to allow us to reject that these two observations are the same. This p-value tells us if something is statistically significant, but as you'll see that doesn't necessarily mean the information is significant or meaningful to you.
1x22 - P-Value Problems
July 11, 2018
Last week we introduced p-values as a way to set a predetermined cutoff when testing if something seems unusual enough to reject our null hypothesis - that they are the same. But today we’re going to discuss some problems with the logic of p-values, how they are commonly misinterpreted, how p-values don’t give us exactly what we want to know, and how that cutoff is arbitrary - and arguably not stringent enough in some scenarios.
1x23 - Playing with Power: P-Values Pt 3
July 18, 2018
We're going to finish up our discussion of p-values by taking a closer look at how they can get it wrong, and what we can do to minimize those errors. We'll discuss Type 1 (when we think we've detected an effect, but there actually isn't one) and Type 2 (when there was an effect we didn't see) errors and introduce statistical power - which tells us the chance of detecting an effect if there is one.
1x24 - You Know I’m All About that Bayes
July 25, 2018
Today we’re going to talk about Bayes Theorem and Bayesian hypothesis testing. Bayesian methods like these are different from how we've been approaching statistics so far, because they allow us to update our beliefs as we gather new information - which is how we tend to think naturally about the world. And this can be a really powerful tool, since it allows us to incorporate both scientifically rigorous data AND our previous biases into our evolving opinions.
1x25 - Bayes in Science and Everyday Life
August 1, 2018
Today we're going to finish up our discussion of Bayesian inference by showing you how it can be used for continuous data sets and be applied both in science and everyday life. From A/B testing of websites and getting a better understanding of psychological disorders to helping with language translation and purchase recommendations, Bayesian statistics really are being used everywhere!
1x26 - Test Statistics
August 8, 2018
Test statistics allow us to quantify how close things are to our expectations or theories. Instead of going on our gut feelings, they allow us to add a little mathematical rigor when asking the question: “Is this random… or real?” Today, we’ll introduce some examples using both t-tests and z-tests and explain how critical values and p-values are different ways of telling us the same information. We’ll get to some other test statistics like F tests and chi-square in a future episode.
1x27 - T-Tests: A Matched Pair Made in Heaven
August 15, 2018
Today we're going to walk through a couple of statistical approaches to answer the question: "is coffee from the local cafe, Caf-fiend, better than that other cafe, The Blend Den?" We'll build a two sample t-test which will tell us how many standard errors away from the mean our observed difference is in our tasting experiment, and then we'll introduce matched pair t-tests, which allow us to remove variation in the experiment. All of these approaches rely on the test statistic framework we introduced last episode.
1x28 - Degrees of Freedom and Effect Sizes
August 22, 2018
Today we're going to talk about degrees of freedom - which are the number of independent pieces of information that make up our models. More degrees of freedom typically mean more concrete results. But something that is statistically significant isn't always practically significant. And to measure that, we'll introduce another new concept - effect size.
1x29 - Chi-Square Tests
August 29, 2018
Today we're going to talk about Chi-Square Tests - which allow us to measure differences in strictly categorical data like hair color, dog breed, or academic degree. We'll cover the three main Chi-Square tests: goodness of fit test, test of independence, and test of homogeneity. And explain how we can use each of these tests to make comparisons.
1x30 - P-Hacking
September 5, 2018
Today we're going to talk about p-hacking (also called data dredging or data fishing). P-hacking is when data is analyzed to find patterns that produce statistically significant results, even if there really isn't an underlying effect, and it has become a huge problem in science since many scientific theories rely on p-values as proof of their existence! Today, we're going to talk about a few ways researchers have "hacked" their data, and give you some tips for identifying and avoiding these types of problems when you encounter stats in your own lives.
1x31 - The Replication Crisis
September 26, 2018
Replication (re-running studies to confirm results) and reproducibility (the ability to repeat an analysis on data) have come under fire over the past few years. The foundation of science itself is built upon statistical analysis and yet there has been more and more evidence that suggests possibly even the majority of studies cannot be replicated. This "replication crisis" is likely being caused by a number of factors which we'll discuss as well as some of the proposed solutions to ensure that the results we're drawing from scientific studies are reliable.
1x32 - Regression
October 3, 2018
Today we're going to introduce one of the most flexible statistical tools - the General Linear Model (or GLM). GLMs allow us to create many different models to help describe the world - you see them a lot in science, economics, and politics. Today we're going to build a hypothetical model to look at the relationship between likes and comments on a trending YouTube video using the Regression Model. We'll be introducing other popular models over the next few episodes.
1x33 - ANOVA
October 10, 2018
Today we're going to continue our discussion of statistical models by showing how we can find out whether there are differences between multiple groups using a collection of models called ANOVA. ANOVA, which stands for Analysis of Variance, is similar to regression (which we discussed in episode 32), but allows us to compare three or more groups for statistical significance.
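To make the idea of comparing three or more groups concrete, here is a minimal sketch of the F statistic for a one-way ANOVA, computed by hand on made-up data. The groups and the function name are hypothetical and not from the episode; a real analysis would also look up a p-value for the resulting F and check the model's assumptions.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> float:
    """F statistic: between-group variance divided by within-group variance."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation explained by group membership (group means vs. grand mean).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Leftover variation inside each group (values vs. their own group mean).
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    return (ss_between / df_between) / (ss_within / df_within)


# Made-up measurements for three groups, for illustration only.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]

if __name__ == "__main__":
    print(one_way_anova_f(example_groups))
```

For this toy data the between-group mean square is 3 and the within-group mean square is 1, giving F = 3 with 2 and 6 degrees of freedom.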
1x34 - ANOVA Part 2: Dealing with Intersectional Groups
October 17, 2018
Do you think a red minivan would be more expensive than a beige one? Now what if the car was something sportier like a corvette? Last week we introduced the ANOVA model which allows us to compare measurements of more than two groups, and today we’re going to show you how it can be applied to look at data that belong to multiple groups that overlap and interact. Most things after all can be grouped in many different ways - like a car has a make, model, and color - so if we wanted to try to predict the price of a car, it’d be especially helpful to know how those different variables interact with one another.
1x35 - Fitting Models Is like Tetris
October 24, 2018
Today we're going to wrap up our discussion of General Linear Models (or GLMs) by taking a closer look at two final common models: ANCOVA (Analysis of Covariance) and RMA (Repeated Measures ANOVA). We'll show you how additional variables, known as covariates, can be used to reduce error, and show you how to tell if there's a difference between 2 or more groups or conditions. Between Regression, ANOVA, ANCOVA, and RMA you should have the tools necessary to better analyze both categorical and continuous data.
1x36 - Supervised Machine Learning
October 31, 2018
We've talked a lot about modeling data and making inferences about it, but today we're going to look towards the future at how machine learning is being used to build models to predict future outcomes. We'll discuss three popular types of supervised machine learning models: Logistic Regression, Linear Discriminant Analysis (or LDA) and K Nearest Neighbors (or KNN). For a broader overview of machine learning, check out our episode in Crash Course Computer Science!
1x37 - Unsupervised Machine Learning
November 7, 2018
Today we're going to discuss how machine learning can be used to group and label information even if those labels don't exist. We'll explore two types of clustering used in Unsupervised Machine Learning: k-means and Hierarchical clustering, and show how they can be used in many ways - from book suggestions and medical interventions, to giving people better deals on pizza!
1x38 - Intro to Big Data
November 14, 2018
Today, we're going to begin our discussion of Big Data. Everything from which videos we click (and how long we watch them) on YouTube to our likes on Facebook says a lot about us - and increasingly sophisticated algorithms are being designed to learn about us from our clicks and not-clicks. Today we're going to focus on some ways Big Data impacts our lives, from what liking Hello Kitty says about us to how Netflix chooses just the right thumbnail to encourage us to watch more content. And Big Data isn't necessarily a good thing; next week we're going to discuss some of the problems that arise from collecting all that data.
1x39 - Big Data Problems
November 21, 2018
There is a lot of excitement around the field of Big Data, but today we want to take a moment to look at some of the problems it creates. From questions of bias and transparency to privacy and security concerns, there is still a lot to be done to manage these problems as Big Data plays a bigger role in our lives.
1x40 - Statistics in the Courts
November 28, 2018
As we near the end of the series, we're going to look at how statistics impacts our lives. Today, we're going to discuss how statistics is often used and misused in the courtroom. We're going to focus on three stories in which three huge statistical errors were made: the handwriting analysis of French officer Alfred Dreyfus in 1894, the murder charges of mother Sally Clark in 1998, and the expulsion of student Jonathan Dorfman from UC San Diego in 2011.
1x41 - Neural Networks
December 12, 2018
Today we're going to talk big picture about what Neural Networks are and how they work. Neural Networks, which are computer models that act like neurons in the human brain, are really popular right now - they're being used in everything from self-driving cars and Snapchat filters to even creating original art! As data gets bigger and bigger neural networks will likely play an increasingly important role in helping us make sense of all that data.
1x42 - War
December 19, 2018
Today we're going to discuss the role of statistics during war. From helping the Allies break Nazi Enigma codes and estimate tank production rates to finding sunken submarines, statistics have played, and continue to play, a critical role on the battlefield.
1x43 - When Predictions Fail
January 2, 2019
Today we’re going to talk about why many predictions fail - specifically we’ll take a look at the 2008 financial crisis, the 2016 U.S. presidential election, and earthquake prediction in general. From inaccurate or just too little data to biased models and polling errors, knowing when and why we make inaccurate predictions can help us make better ones in the future. And even knowing what we can’t predict can help us make better decisions too.
1x44 - When Predictions Succeed
January 9, 2019
In our series finale, we're going to take a look at some of the times we've used statistics to gaze into our crystal ball, and actually got it right! We'll talk about how stores know what we want to buy (which can sometimes be a good thing), how baseball was changed forever when Paul DePodesta created a record-winning Oakland A's baseball team, and how statistics keeps us safe with the incredible strides we've made in weather forecasting. Statistics are everywhere, and even if you don't remember all the formulae and graphs we've thrown at you in this series, we hope you take with you a better appreciation of the many ways statistics impacts your life, and hopefully we've given you a more math-y perspective on how the world works. Thanks so much for watching DFTBAQ!