Why Social Science Research Is Suspect
"... in the end, we often had to choose between keeping a social norm, and getting quantifiable data."
Human behavior is really, really hard to quantify.
In the early 2000s I participated in a linguistic field research project in Asia. The goal was to determine whether two local languages could be considered dialects of each other or were, in fact, mutually unintelligible. Word lists from both languages had already been taken and compared, but they fell into that grey area (about 70% similar) that made further testing necessary.
Our method was called Recorded Text Testing. To do this, you record a person telling a two- or three-minute story in one language. You take it into the other language area, play it for the people there, and see how well they understand it. Then you do the same thing in reverse.
Sounds simple, but it was astoundingly hard to prove - or, in some cases, determine - how much of the story people actually understood.
Our team had originally tried to measure comprehension by giving a comprehension test. But they found that answering an oral quiz about a story was not natural to the people. When asked content questions, they simply started re-telling the entire story. Obviously this meant they understood it a bit, but since they had not answered the questions, comprehension had not actually been proved. So the team changed their method. We would go with what came naturally. We would ask people to retell the story.
However, this still was not good enough for a research project, because simply having someone re-tell a story does not yield a number, and for research you need numbers. You cannot say the subject re-told the story "pretty well." So, we counted up the significant details in the story and graded each person's re-telling based on how many of them they hit.
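To make the arithmetic concrete, here is a minimal sketch in Python of that kind of checklist scoring. The story details, the retelling, and the function name retelling_score are all invented for illustration. This is not the project's actual scoring sheet, and it assumes someone has already decided by hand which details a retelling hit.

```python
def retelling_score(key_details, details_hit):
    """Return the fraction of checklist details that a retelling mentioned."""
    key = set(key_details)
    if not key:
        raise ValueError("the checklist needs at least one detail")
    hit = key & set(details_hit)  # anything not on the checklist earns nothing
    return len(hit) / len(key)


# Hypothetical checklist for a short story (not from the real project).
story_details = [
    "man goes fishing",
    "boat leaks",
    "storm arrives",
    "brother rescues him",
    "they share the catch",
]

# Hypothetical retelling: three checklist details plus some commentary.
listener_hit = [
    "man goes fishing",
    "storm arrives",
    "they share the catch",
    "asks why he went alone",  # a reaction, not a checklist item
]

score = retelling_score(story_details, listener_hit)
print(f"{score:.0%}")  # prints "60%"
```

Notice what the number leaves out: the listener's question about the story is evidence of understanding, but it scores zero because it is not on the checklist.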
Even with this method, our quantification process was imperfect, to say the least. For one thing, the story was so long that people often forgot parts of it when retelling. (Was this because of language or not?) Sometimes our intuitive sense of how much the person understood conflicted with their score on our "test." Some people, instead of re-telling the story, wanted to give commentary on it or react to it with questions of their own. We came away with a decent informal sense that people understood it pretty well, but this sense was undermined by our total inability to prove this quantitatively.
To get this basically useless data, we had had to impose a great deal on those lovely people. They allowed us into their homes, and we sat there with clipboards, asked a bunch of questions in our broken trade language, and all this seemed an awful lot like either a test at school or an interview at a government office (neither one a happy association). Of course, we did our best to be gentle and respectful and explain that we were just researching these languages, that there were no wrong answers, etc., etc., but in the end, we often had to choose between keeping a social norm, and getting quantifiable data. The low point on our survey was probably when we asked each participant: "How similar would you say this language is to your own: 0 to 25%, 25 to 50%, 50 to 75% ...?"
That last ridiculous question is not too different from the methods that we often read about in various social science studies, where participants are asked to rate their own or another person's personal qualities, such as attractiveness, intelligence, likeability, etc., on a scale of one to five or one to ten. These kinds of qualities are even more abstract and subjective than whether another dialect is understandable. Though people generally rate them about the same, and this is thought to show that they are reliable, it's worth remembering that quantifying another person's character is not a task anyone is asked to do, ever, in everyday life. Of course we make character judgments all the time, and base decisions on them, but we don't use numbers.
I myself do not even like to be asked, as we often are at a hospital, to "rate your pain on a scale of one to ten."
"One is no pain at all, and ten is the worst thing you've ever felt."
The first time I was asked this, I was in labor. I thought, How should I know? I'm uncomfortable now, but I've been told it's going to get a lot worse. Maybe I have never felt a 10. What if I say 6 now, and later it turns out it was only a 2?
Statistical significance is weird.
For a real-life example of this quantification problem, take a recent article in The Economist (July 25, 2015). The headline is, "Science and Justice: Looks Could Kill."
According to The Economist,
[The researchers] selected 371 prisoners on death row and a further 371 who were serving life sentences. ... A group of 208 volunteers ... were then invited to rate photographs of each convict's face for trustworthiness, on a scale of one to eight, where one was "not trustworthy at all" and eight was "very trustworthy."
The results of all this work revealed that the faces of prisoners who were on death row had an average trustworthiness of 2.76 and that those serving life sentences averaged 2.87. Not a huge difference, but one that was statistically significant (it, or something larger, would have happened by chance less often than one time in 100).
... In Floridian courts, at least, it seems that your face really is your fortune. (page 64)
The difference between the two groups' average scores may be statistically significant, but it is - how to put it? - tiny. Not even a full point of difference. As a lay person, I couldn't help noticing that the death row group "scored" 2 3/4, and the lifer group scored not quite 3.
Is a difference like that really significant in real life? I'm not sure. But I'm wary.
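One way to see the gap between "statistically significant" and "big" is to compute both from the numbers the article quotes. What follows is only a sketch. The standard deviation is an assumed value, since the article does not give one. The plain two-sample z-test is a stand-in, since the article does not say which test the researchers ran.

```python
import math
from statistics import NormalDist

# Figures quoted in the article.
MEAN_DEATH_ROW = 2.76
MEAN_LIFE = 2.87
N_PER_GROUP = 371
SCALE_MIN, SCALE_MAX = 1, 8

# ASSUMPTION: the article reports no spread, so this value is invented.
ASSUMED_SD = 0.5

diff = MEAN_LIFE - MEAN_DEATH_ROW

# Statistical significance: two-sample z-test, equal group sizes and SDs.
standard_error = ASSUMED_SD * math.sqrt(2 / N_PER_GROUP)
z = diff / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Practical size: the gap in SD units (Cohen's d) and as a share of the scale.
cohens_d = diff / ASSUMED_SD
share_of_scale = diff / (SCALE_MAX - SCALE_MIN)

print(f"p = {p_value:.4f}, d = {cohens_d:.2f}, share of scale = {share_of_scale:.1%}")
```

Under that assumed spread, the p-value clears the one-in-100 bar, yet the gap is about a fifth of a standard deviation and less than 2% of the seven-point range. A different assumed spread would change the p-value, but the gap would still be 0.11 on a scale that runs from one to eight.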
This hit home for me in 2009 when I read an op-ed that took evangelical Christians to task because apparently, in polls, torture (e.g. of suspected terrorists) was approved of by just over half of Evangelicals ... and just under half of the non-religious. It was something like 55% of Evangelicals approving, compared to 45% of non-evangelicals approving. That is a pretty big difference statistically. But, in real life, it means that in both groups about half of the people said they would approve of torture and about half said they wouldn't. The numbers were close enough to 50% in both groups to show that, whether Christian or not, people were deeply conflicted about the morality of torture. Yet on the strength of this statistically-significant-yet-real-life-ambiguous difference, the author asserted that Christians don't stand up to evil, just as they didn't stand up to Hitler. (Yep, he mentioned Hitler. In his opening paragraph.) We have learned that teeny tiny differences in numbers are hugely significant, that they can even help us to see into the souls of entire groups of people. But I am not so sure.
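The same point can be sketched for poll percentages. The 55% and 45% figures are the ones recalled above. The sample size of 500 per group is purely an assumption for illustration, as is the choice of a standard two-proportion z-test; I do not know how the original poll was analyzed.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# ASSUMPTION: 500 respondents per group; the real sample sizes are unknown.
z, p_value = two_proportion_z_test(0.55, 500, 0.45, 500)
print(f"z = {z:.2f}, p = {p_value:.4f}")

# The same percentages, person by person, in a group of 100.
for label, share in [("group A", 0.55), ("group B", 0.45)]:
    approve = round(share * 100)
    print(f"{label}: {approve} approve, {100 - approve} do not")
```

With samples that size, the ten-point gap passes a conventional significance test easily, while the head count shows each group split close to down the middle.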
Social scientists will often measure something other than what they think they're measuring.
Because of the impossibility of actually measuring the things in the human mind that we really want to measure, a popular method is to measure something related. Sometimes the researchers seem to think that they actually are measuring the target thing, but they're not.
For example, you may have heard of "pregnancy brain." Many women report that they are more forgetful during pregnancy. During my childbearing years I read a blurb somewhere (probably reporting on this study) that said this idea of pregnancy brain had been disproved. It turns out that pregnant women usually do just as well on memory tests as the non-pregnant. Sorry, ladies, it looks like your excuse is gone.
Except, of course, that taking a memory test in a controlled setting is a completely different mental skill from being able to concentrate and remember all your duties and commitments in the whirlwind of actual, everyday life.
The only way to determine whether some pregnant women have a harder time with the latter task is to ask them. But this is not "hard data," so we will test something related instead.
I remember a similar study from my college years that "proved" that children under a certain age can't feel empathy. The experiment involved sitting a child down at a sand table, across from a large doll. On the sand table were three hills - one close to the child, two closer to the doll. The children were asked to "draw what the hills look like to the dolly."
They drew three hills, one closer to the viewer and two farther away.
At the time I heard about this study, I accepted that it meant young children aren't capable of empathy. Only years later did I realize that all it proved is that they find it hard to visually rotate an object in their mind.
The study did not prove a lack of empathy at all. The kids might be unable to imagine what visual scene another person is facing, but perfectly able to read and identify with their emotions.
Social science findings often "prove" what most human beings already know.
Much as you might not suspect it from the title of this article, I enjoy reading social science books and articles. Part of what I enjoy is the mental "click" when a study confirms my sense of how things are in the real world. It's nice to feel that your common sense has been vindicated by Science. And now you can sound smart when you talk about it! But all too often, this ends up meaning that "studies" tell us what everyone already knows, as in the example above, where most of us could have guessed that defendants with shiftier faces would get harsher sentences.
Many people (myself included) also enjoy reading a study that finds the opposite of what we would expect. Then we feel that the conclusion is truly insightful, precisely because it's unexpected. And we get double smart points for knowing something from Science that most people wouldn't guess! Unfortunately, when it comes to human behavior, a scientific conclusion that sounds nonsensical is often precisely that.
When a study doesn't "prove" what the average person has already known for millennia, it may instead "prove" whatever the researcher, or the research community, knows or wants to be true. This is not to say that social science researchers are dishonest. Failing to notice things that we didn't expect to see happens to everyone, and it's called Confirmation Bias.
With the scientific method, you start with an hypothesis and set out to prove or disprove it. If the hypothesis has to do with the physical behavior of moving objects or gasses or chemicals, it's usually pretty clear whether or not your hypothesis was right. But when the hypothesis has to do with human behavior, with all its ambiguities and difficulties of interpretation, often the results are inconclusive. But since the researchers themselves have to interpret the results, inconclusive results look like a confirmation of the hypothesis, whatever it might be.
Even in the hard(er) sciences, findings tend to change with changing social values, and even with fads. Think about how many foods have been "proven" to be unhealthy in study after study, only to be rehabilitated after a couple of decades. It wasn't that all the studies were completely invalid, it was just that they didn't prove as much as was first thought. Think about how much greater this problem is when the subject of study is people.
Conclusions reached from suspect data get reported as unassailable fact because it's "Science."
To finish the rant part of this article before I propose a solution ...
We all know the great respect accorded to Science. If something was scientifically studied, it is considered authoritative and indisputable. This works pretty well for the very hard sciences, like astrophysics, biochemistry, and so on. (Although even there, scientists lament that the way their studies get reported is often so oversimplified as to be misleading.) But this idea of scientific-studies-as-proven-fact does not work so well for fields of study that, to put it mildly, do not lend themselves to a quantifiable, laboratory approach.
The reader can probably see the danger of having magazines and other news sources report as sober fact the results of studies that, for all the reasons given above, are very difficult to conduct accurately ... to interpret ... and to guard from becoming little more than the codifying of a subculture's biases. But it's so hard to resist. When we like the implications of a "conclusion" that has been reached, we cheerfully share it. It makes us feel smart, and vindicated. Only on rare occasions does a study come along that "proves" something that we know from personal experience to be untrue. When that does happen, we might question how well the study was carried out. But we know our motives taint the questioning. So, my solution is to become skeptical of all "studies" involving people, short amounts of time, and quantifiable data.
What kind of study would get my respect?
When it comes to social behavior, detailed descriptions are better than numerical "data."
One social science researcher whom I respect deeply is Deborah Tannen. She is a Georgetown linguistics professor and is the author of several books about human communication. The ones with the catchiest titles are her books That's Not What I Meant, about communication styles, and You Just Don't Understand, about the dynamics of communication between men and women.
What I appreciate about Tannen's books is the amount of detailed description of these tricky dynamics. There is a lot of analysis. People's actual words are mentioned. So are their communication norms (this is a big theme of Tannen's), and their intentions, what they are trying to do. This is so much deeper than a misleading quantifiable study. Tannen's methods, often, are interviews, reading, analysis and comparing the results to her personal experience. She does this in a thorough way. This stuff does not tend to yield numbers very well, so Tannen does not rely much on methods like polls and people rating each other on a scale. Instead, she just describes what is going on.
That is the way to research people, and to report on it.
The best social science research is the kind that takes a long time.
In this article, I do not mean to say that there is no useful information among the reams that have been written in the field of social science. Of course there are helpful studies out there. Some are better than others, depending upon what is being studied. For example, the recorded-text testing that I described above was relatively hard to do well. Some other kinds of language tests, such as taking word lists and sentence-repetition tests, yield better data (though they do not measure the exact same things).
Of course, when comparing two languages or language communities, the best picture of the situation would come from living in one or both of them, learning the language, and doing a thorough analysis of the languages' sound, grammar, and rhetoric. But that takes a very, very long time.
Similarly, with other social science research, the best data are the data that take a lot of time and expense to get. Interviews, case histories, longitudinal studies that follow the research subjects throughout their lives... And, of course, large sample sizes.
All of these very involved research methods are hard to do in a semester with graduate students on a low budget. Also, they tend to yield data that are rich, complex, and difficult to distill into a headline or even a single article.
That's why my favorite examples of good social science tend to be books like Tannen's. Particularly when those books are written by people who have been in their field for years, acquiring wisdom. These people are in a unique position to provide detailed and insightful descriptions of the human behaviors that we most want to understand. Though their main conclusions may, in a sense, only confirm what we already know, their detailed descriptions of the dynamics of how people relate can, indeed, help us understand each other better.
And the very best social science study of all? That might just be a really well-researched, honestly written memoir or history book.