
Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.

What did we do and why?

Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data: data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.

Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts.

This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.

This leaves students underprepared for actually using statistics in an undirected project (their thesis, a job). Typically, when confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.

What kind of data?

How does this course fit into a broader social science statistical education?

Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument being that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.

Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:

simulated --- real world

Simulated vs. Real?

As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course status quo and less commonly taught courses such as mine and (I think) the one Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?

Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.

Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.

Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.

On the other hand, when teaching with student simulated data I've definitely had the experience that Zachary describes:

I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
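To make the coverage point in the quote concrete, here is a minimal sketch of the kind of simulation a student could write themselves. It is not taken from Zachary's post or from the course materials: it is written in Python using only the standard library, the function name `ci_coverage` and all parameter values are hypothetical, and it assumes normally distributed data with a known population standard deviation so that a simple z-interval applies.

```python
import random
import statistics


def ci_coverage(n_sims=2000, n_obs=30, true_mean=5.0, true_sd=2.0,
                level=0.95, seed=1):
    """Estimate how often a z-interval for the mean contains the true mean.

    Illustrative assumptions: normal data and a known population SD.
    All names and default values here are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Critical value for a two-sided interval at the chosen level
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * true_sd / n_obs ** 0.5

    covered = 0
    for _ in range(n_sims):
        # Simulate one data set from the known "truth"
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n_obs)]
        sample_mean = statistics.fmean(sample)
        # The interval is sample_mean +/- half_width; check whether it
        # contains the true mean
        if abs(sample_mean - true_mean) <= half_width:
            covered += 1
    return covered / n_sims


coverage = ci_coverage()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

Because the students set the true mean themselves, they can see directly that the share of intervals containing it sits close to the nominal level, and they can change `n_obs` or `level` to watch how the interval width responds.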

The more important distinction for thinking about what makes social science statistics education more or less effective is:

student gathered/generated --- instructor gathered/generated

Prepackaged vs. student generated data

There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.

