Open Data and Evolution

Swamidass · October 22, 2016, 10:32pm

I am getting a chance to talk about my work with BioLogos and AAAS at an upcoming scientific conference (in early January). I’m looking forward to sharing about this with my scientific colleagues, and many of them have been very supportive of my efforts.

Title: Open Data and Evolution Education

Abstract: Darwin’s theory of common descent, published nearly 160 years ago, is one of the most important discoveries in all of science. Nothing in biology makes sense outside of evolution. Now, in the genomic age, the genetic evidence for evolution increases exponentially. The public, however, remains skeptical, with more than 44% of the United States rejecting evolution on religious grounds. Several national efforts, including both the AAAS Science for Seminaries Program and Dr. Francis Collins’s organization BioLogos, are devoted to better engaging skeptical religious audiences with the evidence for evolution. Even with these efforts, the public is often left weighing arguments about data instead of engaging the data itself. The rise of open data, reproducible science, and constantly growing genomic databases presents an opportunity. Rather than asking a skeptical public to “trust us,” we might instead enable them to directly engage the genetic evidence for evolution themselves. In the genomic age, anyone can download and analyze genomes. Bioinformatics requires only a personal computer, open data and software, and clear scientific thinking. Here, I will share my experiences working with AAAS and BioLogos to educate religious audiences about evolution, and explain a role that open data might play in the public debate. In particular, we will examine specific cases where open data and tools have substantially altered the debate, and even convinced skeptical critics of evolution. Perhaps, in the near future, this could be the way forward. Open data could open minds.

I’m curious what people on the forum think of this notion. Do you think open data could open minds about evolution? I think back to how @vjtorley’s shifts were influenced by open data and tools, and also how the debate about egg-yolk genes played out.

I’m also very curious to hear how my colleagues respond to this idea. It should be fun =). Let me know what you think.

jammycakes · October 22, 2016, 11:54pm

Having been involved with open data myself, I can say it’s definitely the way to go. However, it needs to be easy to find, easy to use, and easy to understand. This means the following:

The data needs to be presented in an easily accessible format such as JSON, XML or plain text as appropriate. If I want to compare genetic sequences, I want to see text files consisting of strings of A, C, G and T. I don’t want to see it all compressed into some strange binary format where I have to do a whole lot of bit-twiddling to get it into a usable format. If you need to compress it to reduce the download size, use standard compression tools such as gzip.
The code to process it needs to be open source and easily accessible. This means one thing: develop and release it on GitHub, where anyone can review it in their web browser, comment on it, submit pull requests, and so on. It isn’t sufficient in 2016 to just bundle your source code into a zip file as a “take it or leave it” bundle. It also needs to be written in easily accessible, freely available programming languages such as Python or Go. I don’t want to have to shell out for a MATLAB licence in order to tinker with it.
It needs to be well documented. It would be good to have some tutorials on how bioinformatics works, how genomes are compared, and so on.
For non-programmers, we need tools that can visualise the data. For example, we’re told that human and chimp DNA are 98% similar – is there any way of visualising that? Or is there any way of highlighting things such as endogenous retroviruses, for instance? Again, these tools need to be open source.
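
To make the "plain text plus gzip" point concrete, here is a minimal sketch of how little code is needed to read sequence data when it ships as text. It assumes the data is in FASTA format (a ">" header line followed by lines of A, C, G and T), which is an assumption of this example and not something specified above. The function names `read_fasta` and `base_counts` are hypothetical, and only the Python standard library is used.

```python
import gzip
from collections import Counter


def read_fasta(path):
    """Parse a FASTA text file (optionally gzip-compressed) into {header: sequence}."""
    # Pick the opener from the file extension; both return text-mode handles.
    opener = gzip.open if str(path).endswith(".gz") else open
    records = {}
    header = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]  # a new record starts here
                records[header] = []
            elif header is not None:
                records[header].append(line.upper())
    # Join the per-line chunks into one string per record.
    return {name: "".join(chunks) for name, chunks in records.items()}


def base_counts(sequence):
    """Count how often each letter (A, C, G, T, ...) occurs in a sequence."""
    return Counter(sequence)
```

With a file in this shape, `read_fasta("some_file.fa.gz")` (a placeholder path) returns ordinary Python strings that can be compared, counted, or printed without any bit-twiddling.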

There are some other datasets that I’d like to see released as open data beyond bioinformatics. For example, one thing I’d love to see would be a comprehensive dataset of worldwide radiometric dating results, right down to the individual points on the isochron plots, and also including such things as geolocation data.

AMWolfe · October 23, 2016, 8:00pm

As a layman, I think this would be excellent. I always love it when folks like Dennis Venema put unvarnished chunks of genetic code in their blog posts so you can look at the actual differences between various species. (Of course, I realize it’s not quite “unvarnished” in the sense that Dennis had to know how to line up the relevant portions of the genetic code in order for us to easily compare them, so we got to see them neatly lined up like a problem set for homework, rather than as one might find them “in the wild,” as it were. What I mean is that I get to see A/C/T/Gs in strings instead of just someone’s interpretation of the data.) How much cooler would it be if we felt like we had unmediated access to the data to “see for ourselves.”

Would it change minds? I think so. Sure, the Scriptural issues have to be addressed, and usually they have to be addressed first. At the same time, there’s this idea you alluded to that evolutionary biologists just expect people to trust that they know what they’re talking about, and people aren’t willing to do that. But if you’re able to say, “Don’t take my word for it; see for yourself!” it’s more welcoming and I think you’ll get people wading into the data and finding that there isn’t some big conspiracy to pull the wool over their eyes.

I think the trick would be making it user-friendly. I’ve seen in a few different fields, sometimes there are programs that are incredibly powerful and open-source, but you have to take a course or a seminar or a long series of tutorials in order to be able to use them properly. Ideally this would instead be something that would be virtually self-explanatory (if possible). I second James’s suggestion about color-coding things and marking retroviruses, etc. Never having looked at a full genome myself, I would imagine that during the development of the software, someone would probably have to put a lot of hours into lining up two different genomes so that they could be compared (this human chromosome goes with that orangutan chromosome, etc.) and then highlighting points of interest so people could find their way around (“this is the GULO pseudogene over here,” etc.). Otherwise, I can only imagine it would be quite overwhelming to wade into millions and millions of base pairs without a guide. That would have the opposite of the desired effect.

All the best in this. Sounds really promising. I hope you’ll let us know how the presentation is received.

[Edit: for clarity]

jammycakes · October 23, 2016, 8:43pm

Just another couple of suggestions here.

First of all, is comparative genomics something that can be crowdsourced? If you can get the general public involved in finding and interpreting matches between the different genomes, not only will you be able to convince them a whole lot more easily, you will also get a whole lot of the grunt work of research done for free. I know that some astronomers have experimented with this approach in trying to identify exoplanets by asking members of the public to pick out dips in the light curves of distant stars, for example. (See https://www.planethunters.org/).

Secondly, if it can be crowdsourced, it can also be gamified. You could award points or badges for identifying matches, which could then be upvoted or downvoted as appropriate. A lot of online communities use this approach. For example, Stack Overflow and the Stack Exchange network of Q&A sites award increased editing privileges based on your reputation on the site. The Discourse software used to run this forum also offers various gamification features in the form of badges. This can provide some people with a pretty strong incentive to return to the task over and over again, especially if you make their reputation something that they can include on their CV.

Argon · October 23, 2016, 9:00pm

My biggest concern is with ‘Garbage in. Garbage out.’

There is so much more to sequence analysis and modelling than simply running comparison programs. Some serious background knowledge and understanding of the tools is required to interpret and analyze the results. The overwhelming majority of tinkerers are not going to reach that level. It’s bad enough that we can find simple mistakes in many papers, but see what problems others can get themselves into when they really want to misinterpret the data (e.g., the claim that human DNA is < 70% similar to chimp DNA, etc.).

jammycakes · October 23, 2016, 9:13pm

That’s why I emphasised that the code (complete with test cases) needs to be out in the open as well as the data. The general idea behind open source software is, “with many eyeballs, all bugs are shallow.”

Well written software will have a comprehensive suite of unit tests and integration tests, which would be broken by a pull request showing human DNA and chimp DNA to be only 70% similar. For starters, it would likely show that human DNA is only 75% similar to itself.

Every comparative genomics program should have a unit test that asserts that two copies of the same genome are indeed reported as 100% identical. If it doesn’t, its test coverage is not fit for purpose.
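
The kind of sanity test described here might look roughly like the sketch below. It is only an illustration: `percent_identity` is a hypothetical toy function that compares two already-aligned, equal-length sequences position by position, not the comparison routine of any real genomics package.

```python
import unittest


def percent_identity(seq_a, seq_b):
    """Percentage of matching positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


class IdentitySanityTest(unittest.TestCase):
    def test_sequence_is_identical_to_itself(self):
        # Two copies of the same sequence must always score 100%.
        genome = "ACGTTGCAACGT"
        self.assertEqual(percent_identity(genome, genome), 100.0)

    def test_single_mismatch_lowers_identity(self):
        # One differing base out of four leaves three matches.
        self.assertEqual(percent_identity("ACGT", "ACGA"), 75.0)
```

Saved in a file, these tests can be run with `python -m unittest`. A change that made a sequence score less than 100% against itself would fail the first test immediately.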

Argon · October 23, 2016, 9:21pm

OK. Absolutely the tools should be available for open debugging.

I was thinking about the application of the tools and how the tools are used for analyses. End users are buggier than most programs.

Larry_Bunce · October 25, 2016, 12:29am

The real problem with computer analysis is “garbage in, gospel out.” With billions of base pairs to compare, genome analysis requires a computer, but trusting the computer’s analysis requires a certain amount of faith when the comparison is not a straightforward 1 for 1 process. Computer programs need to allow for transposition of segments, shifts of a few base pairs before continuing identical patterns, and probably huge numbers of other things I wouldn’t know about. The details of processing all of this can get extremely messy. Trying to debug the comparison program can become more difficult than analyzing the genome itself.
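
As an illustration of why the comparison is not a simple 1 for 1 process, here is a minimal sketch of the textbook Needleman-Wunsch global alignment, which lets one sequence "shift" by inserting gaps. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for this example. It handles only small insertions and deletions, not transposed segments, and it is far too slow for whole genomes.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # pair the two bases
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]
```

For example, `global_align("ACGTACGT", "ACGACGT")` places a single gap in the second sequence so that the remaining seven bases line up. A naive position-by-position comparison of the same two strings would report mismatches at almost every position after the missing base.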

People who don’t accept the idea of common descent are likely not to accept a computer program’s analysis of raw data as telling them anything about how related two species are. They will make their own conclusions, based on their own presuppositions, with or without plenty of impartially obtained data.

Swamidass · October 26, 2016, 3:13am

This is certainly the risk. But I am hopeful.

That would be one of the first cases I would want to go after.

Exactly.

So anyhow, is anyone interested in joining a working group to build a resource like this to see if it could help?

Casper_Hesp · October 26, 2016, 7:29am

@Swamidass

Great idea Joshua! What about including a module which allows people to play with evolutionary algorithms to discover the creative power of evolution? Larry Yaeger (currently working at Google) developed a program called Polyworld. This is a virtual world in which the “dna” determines the makeup and neural architecture of the creatures. You can witness all sorts of “species” evolve, including adaptations like mimicking behavior for warding off predators. In my own research on the underpinnings of emotion I’m also working with such algorithms to evolve minimal agents to develop communicative abilities. It often feels a bit like a game and I think lay people can also have fun with that. Something like this genetic algorithm car game but then focused on the evolution of species.

Jay313 · October 26, 2016, 12:43pm

I wouldn’t be much help in building it, but I can certainly serve as a test monkey, er, ignorant layman, to try it out.

Relates · October 26, 2016, 8:00pm

@Swamidass

The problem is that evolution is NOT common descent. Evolution is Variation plus Natural Selection.

Genetics only covers Variation. Darwinism has not explained Natural Selection.

gbrooks9 · November 18, 2016, 11:46pm

@Relates

No, of course not. “Evolution” is not identical to “Common Descent”. But if you accept the former, it is virtually impossible to reject the latter. All the mainstream theories of Evolution include the idea that with enough change, a population will inevitably change its genetics, its behavior and its appearance.

And while you are right to say that “Genetics only covers Variation” … I’m not sure you are making much sense by saying “Darwinism has not explained Natural Selection.” The usual Bon Mot is that Darwin didn’t understand the genetics of Natural Selection - - but he Certainly Understood how Natural Selection worked! That was one of his great contributions. What exactly do you think Darwin wrote about in his seminal book? Since he didn’t know the mechanism for variation - - it means he must have been discussing the more visible side of his theories - - Natural Selection and the environmental factors shaping Natural Selection!

Since his day, Natural Selection can be mathematically defined, measured and even forecasted by quantifying the rates of offspring reproduction 1) over a given period of time, and 2) measured as an average per individual during that time period.
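
As a rough sketch of that kind of quantification: count the offspring each individual leaves during the period, average per individual for each variant, and compare the averages. The function names and the idea of scaling to the most successful variant (often called relative fitness) are assumptions of this illustration, and any numbers fed in would be observations, not something supplied here.

```python
def mean_offspring(offspring_counts):
    """Average number of offspring per individual over the observation period."""
    if not offspring_counts:
        raise ValueError("need at least one individual")
    return sum(offspring_counts) / len(offspring_counts)


def relative_fitness(offspring_by_variant):
    """Scale each variant's mean offspring count by the most successful variant's mean."""
    absolute = {
        variant: mean_offspring(counts)
        for variant, counts in offspring_by_variant.items()
    }
    best = max(absolute.values())
    if best == 0:
        raise ValueError("no variant left any offspring")
    return {variant: value / best for variant, value in absolute.items()}
```

Called with per-individual offspring counts grouped by variant, `relative_fitness` returns 1.0 for the most successful variant and a fraction of that for the others.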

I certainly do agree with you that Evolution is Variation plus Natural Selection!

gbrooks9 · November 18, 2016, 11:47pm

@Swamidass…Swami! Over here… my hand is waving … I would love to be of help … if I can be!

George

Relates · November 20, 2016, 3:10pm

That is not an explanation.

gbrooks9 · November 20, 2016, 3:26pm

@Relates

It is not an explanation for what?

Once you develop a way to quantify Natural Selection, then it becomes possible to make hypotheses regarding what physical or behavioral features will have the biggest effect on reproductive success.

Sometimes scientists develop good predictions… sometimes they don’t and it is back to the drawing board.

But success in reproduction (measured quantifiably) speaks for itself.

Swamidass · January 5, 2017, 6:23am

As a quick update, I gave that talk today to a room full of scientists. I talked about the AAAS Science for Seminaries program and BioLogos.

I explained how open data reshaped the dialogue between the Discovery Institute and myself (remember the vitellogenin pseudogene in humans?). I also told the story of Glen Richardson (aka roohif, Is 1% a myth? – roohif), the programmer who took Tomkins at AiG to task about the similarity between humans and chimps. Yes, we are about 98% similar, not 70% or 88% as they think.

In particular, I was encouraged by the agreement that scientists needed to engage the public to serve the common good. As if perfectly timed, Nature published a note making just this case: Why researchers should resolve to engage in 2017 | Nature.

It was well received. My “boss” even tweeted about me https://twitter.com/prpayne5/status/816471198800363520.

Let us hope that “open data might open minds.”