
Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.

What did we do and why?

Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data: data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.

Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts.

This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.

This leaves students underprepared for actually using statistics in an undirected project (their thesis, in a job). When confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.

What kind of data?

How does this course fit into a broader social science statistical education?

Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument is that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.

Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:

simulated --- real world

Simulated vs. Real?

As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course status quo and less commonly taught courses such as mine and what (I think) Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?

Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.

Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.

Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.

On the other hand, when teaching with student-simulated data I've definitely had the experience that Zachary describes:

I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
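
The coverage point in the quote can be sketched in a few lines. The following is a minimal illustration in Python, not code from the course or from Zachary's post; the function name `ci_coverage`, the sample size, and the number of replications are all assumptions chosen for the example. It draws repeated samples from a normal distribution with a known mean and counts how often an approximate 95% interval contains that mean.

```python
import random
import statistics


def ci_coverage(true_mean=0.0, sd=1.0, n=50, reps=2000, z=1.96, seed=1):
    """Share of simulated samples whose approximate 95% interval
    contains the true mean. All defaults are illustrative assumptions."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(reps):
        # Students generate the data themselves from a known process.
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        # Normal-approximation interval: mean +/- z * standard error.
        if mean - z * se <= true_mean <= mean + z * se:
            hits += 1
    return hits / reps


coverage = ci_coverage()
print(f"Estimated coverage: {coverage:.3f}")
```

Because students set the true mean themselves, they can check that the share of intervals containing it is close to the nominal 95%, and then lower `n` to see how the normal approximation holds up in small samples.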

The more important distinction for thinking about what makes social science statistics education more or less effective is:

student gathered/generated --- instructor gathered/generated

Prepackaged vs. student generated data

There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.
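
To make the point about simulated data concrete, here is a rough sketch (in Python, with hypothetical names and settings that are not from the course materials) of how students could estimate a test's Type I error rate and power by generating the data themselves. It assumes a simple two-sided test of whether a mean is zero, using a normal critical value.

```python
import random
import statistics


def rejection_rate(effect, n=30, reps=2000, z_crit=1.96, seed=2):
    """Share of simulated samples in which the null hypothesis
    'mean = 0' is rejected. `effect` is the true mean of the data
    (in standard deviation units); all defaults are assumptions."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        se = statistics.stdev(sample) / n ** 0.5
        # Reject when the test statistic exceeds the critical value.
        if abs(statistics.fmean(sample) / se) > z_crit:
            rejections += 1
    return rejections / reps


type_one = rejection_rate(effect=0.0)  # null is true: false positive rate
power = rejection_rate(effect=0.5)     # null is false: power
print(f"Type I error rate: {type_one:.3f}, power: {power:.3f}")
```

With no true effect the rejection rate should sit near the nominal 5%, and with a true effect it shows how often the test detects it. Varying `n` and `effect` lets students see how each one changes power.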

