Archive

Editorial

Note: This article first appeared in Research Europe and Research Professional News.

The Covid-19 pandemic has triggered an explosion of knowledge, with more than 200,000 papers published to date. At one point last year, scientific output on the topic was doubling every 20 days. This huge growth poses big challenges for researchers, many of whom have pivoted to coronavirus research without experience or preparation.

Mainstream academic search engines are not built for such a situation. Tools such as Google Scholar, Scopus and Web of Science provide long, unstructured lists of results with little context.

These work well if you know what you are looking for. But for anyone diving into an unknown field, it can take weeks, even months, to identify the most important topics, publication venues and authors. This is far too long in a public health emergency.

The result has been delays, duplicated work, and problems with identifying reliable findings. This lack of tools to provide a quick overview of research results and to evaluate them correctly has created a crisis in discoverability itself.

The pandemic has highlighted this, but with three million research papers published each year, and a growing diversity of other outputs such as datasets, discoverability has become a challenge in all disciplines.

For years a few large companies have dominated the market for discovery systems: Google Scholar; Microsoft’s soon-to-be-closed Academic; Clarivate’s Web of Science, formerly owned by Thomson Reuters; and Elsevier’s Scopus. 

But investment hasn’t kept pace with the growth of scientific knowledge. What were once groundbreaking search engines have only been modestly updated, so that mainstream discovery systems are now of limited use. 

This would not be a problem if others could build on companies’ search indices and databases. But usually they can’t. 

A new openness
In the shadows of these giants, however, an alternative discovery infrastructure has been created, built on thousands of public and private archives, repositories and aggregators, and championed by libraries, non-profit organisations and open-source software developers. Unlike the commercial players, these systems make their publication data and metadata openly available. 

Building on these, meta-aggregators such as Base, Core and OpenAIRE have begun to rival and in some cases outperform the proprietary search engines. Their openness supports a rich ecosystem of value-added services, such as the visual discovery system Open Knowledge Maps, which I founded, or the open-access service Unpaywall.

This open infrastructure has become the strongest driver of innovation in discovery, enabling the quick development of a variety of discovery tools during the pandemic. Technologies such as semantic search, recommendation systems and text and data mining are increasingly available.

Many open systems, though, are not sustainably funded. Some of the most heavily used make ends meet with a tiny core team. Half, including Open Knowledge Maps, rely on volunteers to provide basic services. 

The funding options for non-profit organisations and open-source projects are very limited. Most rely on research grants, which are meant as a jumping-off point, not a long-term solution. 

The academic community needs to step up and secure the future of this crucial infrastructure. The shape of research infrastructure depends on institutions’ buying decisions. If most of the money goes to closed systems, these will prevail.

A first step would be to create dedicated budget lines for open infrastructures. The initial investment would be relatively small, as their membership fees are usually orders of magnitude cheaper than the license fees of their proprietary counterparts. Over time, strengthening open infrastructure will enable research institutions to cancel their proprietary products. 

It’s not just about money. Open infrastructures do not lock institutions into closed systems, and save them from selling off their researchers’ user data, an issue gaining prominence as large commercial publishers become data analytics businesses.

The coronavirus pandemic has shown that the challenges of our globalised world demand international collaboration. That requires building on each other's knowledge. 

This is not possible with closed and proprietary discovery infrastructures that have fallen behind the growth of scientific knowledge. Instead, we need to guarantee the sustainability of the open discovery infrastructure, so that we can rely on it for today’s and tomorrow’s challenges. 

Note: This is a reblog from the OKFN Science Blog. To my excitement and delight, I was recently awarded a Panton Fellowship. As part of my duties, I will be regularly blogging there about my activities concerning open data and open science.

Peter Kraker at Barcamp Graz 2012. Photo by Rene Kaiser

Hi, my name is Peter Kraker and I am one of the new Panton Fellows. After an exciting week at OKCon, I was asked to introduce myself and what I want to achieve during my fellowship, which I am very happy to do. I am a research assistant at the Know-Center of Graz University of Technology and a late-stage PhD student at the University of Graz. Like many others, I believe that an open approach is essential for science and research to make progress. Open science to me is about reproducibility and comparability of scientific output. Research data should therefore be put into the public domain, as called for in the Panton Principles.

In my PhD, I am concerned with research practices on the web and with how academic literature search can be improved with overview visualizations. I have developed and open-sourced a knowledge domain visualization called Head Start. Head Start is based on altmetrics data rather than citation data. Altmetrics are indicators of scholarly activity and impact on the web. Have a look at the altmetrics manifesto for a thorough introduction.
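
For readers who are curious what a knowledge domain visualization does under the hood, here is a minimal, hypothetical sketch in Python: papers that share many readers are treated as similar, projected onto a 2D map and grouped into areas. This is not Head Start's actual implementation; the toy data, the similarity measure and the clustering choices below are my own assumptions for illustration.

```python
# Minimal sketch of a co-readership-based overview map (NOT Head Start's code).
# All paper IDs and reader sets below are made up for illustration.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Toy altmetrics data: which (hypothetical) readers bookmarked which paper.
readers_per_paper = {
    "paper_a": {"r1", "r2", "r3"},
    "paper_b": {"r2", "r3", "r4"},
    "paper_c": {"r7", "r8"},
    "paper_d": {"r7", "r8", "r9"},
}

papers = sorted(readers_per_paper)
all_readers = sorted(set().union(*readers_per_paper.values()))

# Paper x reader occurrence matrix.
occurrence = np.array(
    [[reader in readers_per_paper[p] for reader in all_readers] for p in papers],
    dtype=float,
)

# Papers that share many readers are considered similar (co-readership).
similarity = cosine_similarity(occurrence)
distance = 1.0 - similarity

# Project the papers onto a 2D map so that similar papers end up close together ...
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distance)

# ... and group them into research areas.
areas = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(coords)

for paper, (x, y), area in zip(papers, coords, areas):
    print(f"{paper}: area {area}, position ({x:.2f}, {y:.2f})")
```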

In my evaluation of Head Start, I noticed that altmetrics are prone to sample biases. It is therefore important that analyses based on altmetrics are transparent and reproducible, and that the underlying data is openly available. Contributing to open and transparent altmetrics will be my first objective as a Panton Fellow. I will establish an altmetrics data repository for the upcoming open access journal European Information Science. This will allow the information science community to analyse the field based on this data, and add an additional data source for the growing altmetrics community. My vision is that in the long run, altmetrics will not only help us to evaluate science, but also to connect researchers around the world.

My second objective as a Panton Fellow is to promote open science based on an inclusive approach. The case of the Bermuda Rules, which state that DNA sequences should be rapidly released into the public domain, has shown that open practices can be established if the community stands together. In my opinion, it is therefore necessary to get as many researchers aboard as possible. From a community perspective, it is the commitment to openness that matters, and the willingness to promote this openness. The inclusive approach puts the researcher in his or her many roles at the center of attention. This approach is not intended to replace existing initiatives but to make researchers aware of these initiatives and to help them choose their approach to open science. You can find more on that on my blog.

Locally, I will be working with the Austrian Chapter of the Open Knowledge Foundation to promote open science based on this inclusive approach. Together with the Austrian Students' Union, we will be holding workshops with students, faculty, and librarians. I will also make the case for open science in the research communities that I am involved in. For the International Journal on Technology Enhanced Learning, for example, I will develop an open data policy.

I am very honored to be selected as a Panton Fellow, and I am excited to get started. If you want to work with me on one or the other objective, please do not hesitate to contact me. You can also follow my work on Twitter and on my blog. Looking forward to furthering the cause of open data and open science with you!

Image by alancleaver

I am usually not a fast blogger. This post, however, has been rather long in the making, even by my standards. I first started to explore the topic of post privacy – i.e. the notion that the (almost) total loss of privacy is inevitable – in 2010. My interest was based on two observations. But before we get to these observations, let’s look at the term privacy first.

Defining privacy

Recently, there was an interesting discussion on the W3C mailing list on the definition of privacy. It quickly emerged that data protection and confidentiality ("the right to be left alone") are two important concepts in that context. But as Kasey Chappelle put it, privacy is more than that. He defined privacy as informational self-determination: the individual right to decide which information is shared about oneself and under what circumstances. On top of that, I would put Seda Gürses's definition of privacy as a practice: not only can the individual decide on the use of personal information, there is also a social convention on what is acceptable and what is not. This "what is acceptable and what is not" is fluid and subject to a social negotiation process.

The loss of privacy

Now for the observations that ignited my interest:

  1. All data about us is stored in digital form. Most of this data is held by third parties, such as the state, insurance companies and so on. There is a lot of data about us that we would never think of: location data collected by cashback cards and digital traffic surveillance, for example, or connection data in telecommunications. And this is not even taking into account the data about us that we or others put into the world – such as photos, tweets etc.
  2. Digital data is fugitive. It is in the nature of digital data that it can easily be copied and replicated, and we have a hard time protecting it. Countermeasures such as encryption are not widely adopted. Also, different entities have different interests; Facebook is not in the data protection business, after all.

In a highly interconnected world, these two factors spell trouble. In a recent keynote at WWW 2012 (the World Wide Web Conference), Tim Berners-Lee addressed further issues. One of them is jigsaw identification: while information from one source might not be enough to identify someone, the combination of information from different sources might well be. For example, if one source publishes the post code and age of a person, that information will not be enough to identify them: these characteristics usually apply to more than one person. But if another source publishes the gender and profession of the same person (which by themselves apply to several people as well), the combination of these four characteristics might be enough to uniquely identify that person. And thanks to the eternal memory of the web, the dates of publication might lie far apart.
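
To make the jigsaw effect concrete, here is a small illustration with an entirely made-up population: either pair of published attributes matches several people on its own, but the combination of the four attributes singles out exactly one person.

```python
# Illustrative sketch of jigsaw identification; the population is made up.
population = [
    {"name": "A. Adams",    "post_code": "8010", "age": 34, "gender": "f", "profession": "teacher"},
    {"name": "B. Berger",   "post_code": "8010", "age": 34, "gender": "m", "profession": "engineer"},
    {"name": "C. Conti",    "post_code": "8010", "age": 41, "gender": "f", "profession": "teacher"},
    {"name": "D. Dietrich", "post_code": "8020", "age": 34, "gender": "f", "profession": "teacher"},
]

def matches(**attributes):
    """Return everyone in the population who fits all of the given attributes."""
    return [person["name"] for person in population
            if all(person[key] == value for key, value in attributes.items())]

# Source 1 alone (post code + age): two candidates.
print(matches(post_code="8010", age=34))           # ['A. Adams', 'B. Berger']

# Source 2 alone (gender + profession): three candidates.
print(matches(gender="f", profession="teacher"))   # ['A. Adams', 'C. Conti', 'D. Dietrich']

# Combined, the four characteristics identify exactly one person.
print(matches(post_code="8010", age=34, gender="f", profession="teacher"))  # ['A. Adams']
```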

All of that leads me to the conclusion that data protection and confidentiality are a lost cause in a digital and highly interconnected world. And with all the data out there, informational self-determination will become impossible. As Tim said, we cannot know which data will be published about us in the future. If a potential employer can buy my health records from a data provider, then the whole notion of privacy as we know it is bound to fail. Furthermore, the data that we publish voluntarily is only the tip of the iceberg. More important is the data that we expose involuntarily (e.g. connection data), data that others expose voluntarily or involuntarily about us (see a data loss scandal near you), and information that can be inferred based on data from various sources (e.g. our social graph). As time passes by, the evidence grows for me that we are headed towards the loss of privacy.

What will happen?

Interestingly enough, most discussions that I had about the consequences of these developments followed the same pattern. The two most prominent views are: a) you are wrong, because I am the one who controls which data is out there (by setting everything to private on Facebook, disallowing photo tagging etc.), and b) let's simply abandon privacy. If all of the data is out there, we actually level the playing field for everyone. Once we discover that everyone has faults, we will attribute less importance to these faults, and society will come out better as a whole.

I think that both of these statements are wrong. Regarding the former, I pointed out earlier that the main problem is not the data that we voluntarily put on the web, but rather what others expose about us, what we expose involuntarily and what can be inferred from different data sources. I do not believe that anyone will have everybody's data at their fingertips. But I do think that all data will be somehow obtainable. The gray market for data that was lost or stolen from third parties is already huge, and it will continue to grow. It will be supplemented by companies that explicitly seek to infer data from various sources, exploiting the jigsaw effect.

With the latter, the argument is not so easy. I think it is an intriguing idea. But I have a hard time believing that abandoning privacy will make the world a better place, mainly for two reasons:

  1. Even though all the data is out there, we will not have a level playing field. We will still have different capabilities in processing the data to get something meaningful out of it. After all, we need to make sense of the data first, and even though everyone can theoretically access it, processing capabilities will not be evenly distributed. There will therefore be parties with a competitive advantage that they can use to exert power over those with fewer processing capabilities.
  2. Even assuming that society becomes more tolerant, there will still be things that are more frowned upon than others on a moral scale. A lot will also depend on how facts are presented to others. The "shitstorms" that we already witness on social media are often based on incomplete or outright false facts.

What can we do?

Now we get to the question that I think is the really important one: how can we deal with the loss of privacy? The only concept that I know of so far is information accountability. It was postulated by Weitzner et al. and builds on informational self-determination. Information accountability is a different paradigm: instead of protecting the data on the sender's side, the receiver is held accountable for how the data is used. That means you only take note in case something happens (you do not get a job or an insurance policy because of leaked data). In that event, the offending party would have to present which data they used to make the decision. Bearing that in mind, one of the major questions is: how can we ensure accountability on a technological level?

One proposal comes from Oshani Seneviratne. In her PhD at MIT, she is developing HTTPa, an accountability-aware web protocol. In essence, the protocol enables you to tell the receiver what they are allowed to do with the transmitted data. This is a kind of Creative Commons for personal data. A network of provenance trackers stores logs of those permissions and can be consulted in case something goes wrong. There is a lot more to Oshani's work, and I suggest checking out this presentation as a start.
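
To give a rough idea of what accountability could look like on a technological level, here is a conceptual sketch in Python. It is explicitly not the HTTPa protocol or its API; all class and field names are hypothetical. It only illustrates the pattern: attach usage terms to a transfer, log the transfer with a provenance tracker, and consult the log when something goes wrong.

```python
# Conceptual sketch of accountability-aware data transfer, loosely inspired by
# the ideas behind HTTPa. NOT the actual protocol; all names are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class UsageRestriction:
    resource: str      # which piece of data the terms apply to
    allowed_uses: set  # e.g. {"claims-processing"}, like a licence for personal data

@dataclass
class ProvenanceTracker:
    """Stores who received which data under which terms."""
    log: list = field(default_factory=list)

    def record_transfer(self, sender, receiver, restriction):
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "sender": sender,
            "receiver": receiver,
            "restriction": restriction,
        })

    def audit(self, receiver, resource, actual_use):
        """Check a reported use against the logged terms, after the fact."""
        for entry in self.log:
            terms = entry["restriction"]
            if entry["receiver"] == receiver and terms.resource == resource:
                return actual_use in terms.allowed_uses
        return False  # no logged transfer: the receiver should not hold the data at all

tracker = ProvenanceTracker()
tracker.record_transfer(
    sender="alice",
    receiver="insurer.example",
    restriction=UsageRestriction("alice/health-record", {"claims-processing"}),
)

# In case something goes wrong, the log can be consulted:
print(tracker.audit("insurer.example", "alice/health-record", "claims-processing"))  # True
print(tracker.audit("insurer.example", "alice/health-record", "risk-scoring"))       # False
```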

So should we abolish data protection and confidentiality now, and move solely to information accountability? I do not think so. I see accountability as a good addition which may be suitable to deal with the new requirements of a digital and heavily interconnected world. As Oshani points out, accountability is quite compatible with anonymity, because you only have to reveal your identity in case something goes wrong. Apart from the technical solutions, we also need to discuss legal frameworks for accountability to work; otherwise there will be no way to hold people accountable in court. Therefore, we need to have a broad debate on what is acceptable and what is not on a social level. Thankfully, that debate has already started and is getting more and more attention. This is exactly what Seda Gürses means by privacy as a practice. After all, technology can only give us the tools; what we want to do with them is up to us.

If you made it this far, thanks for reading. Below are a few slides that are meant as a short summary. Of course, I would love to hear your comments and ideas on the subject! Does it make sense to you? Which concepts am I missing?

In the spirit of the upcoming RDSRP’11, I decided to list a few Research 2.0 communities that I check in on more or less frequently. That means communities specifically on the topic of Research 2.0, not just Web 2.0 tools for science. Without further ado:

I am sure I missed tons of places here. What are your favourite Research 2.0 hangouts?

Welcome back in 2011! I haven’t written too many posts lately (due to a lot of work), so it is a nice incentive that the stats team from WordPress.com thinks this blog did quite well last year. Below is a high-level overview of this blog's stats since its inception in March 2010 – courtesy of Andy, Joen, Martin, Zé, and Automattic at WordPress.com, with some amendments and reformulations from me:

Healthy blog!

The Blog-Health-o-Meter™ reads "This blog is doing awesome!" – as you can see, this is a rather close call though 😉

Crunchy numbers

A helper monkey made this abstract painting, inspired by your stats.

The Leaning Tower of Pisa has 296 steps to reach the top. This blog was viewed about 1,100 times in 2010. If those views were steps, it would have climbed the Leaning Tower of Pisa 4 times.

In 2010, there were 11 new posts, not bad for the first year! (Seeing post counts from various other scientific bloggers leaves me with some doubt about that statement.)

The busiest day of the year was October 11th with 37 views. The most popular post that day was Blinded peer reviews – a thing of the past?

Where did they come from?

The top referring site in 2010 was twitter.com (by far). This is not surprising as I announce all new posts there.

People who came via search engines searched mostly for science 2.0, for me, or for a combination of both. Content-wise, the most popular searches related to conducting a group discussion.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1. Blinded peer reviews – a thing of the past? (October 2010, 1 comment)
2. Barcamp Graz 2010 – A weekend in review (May 2010, 1 comment)
3. A Publication Feed Ecosystem for Technology Enhanced Learning [UPDATED] (July 2010, 1 comment)
4. Reminder: Research 2.0 Workshop at ECTEL 2010 (June 2010)
5. IJTEL Young Researcher Special Issue CfP and CfR (September 2010)

With that little overview, I would like to say “Thank you!” to my readers. I wish all of you a successful 2011!

Hi, my name is Peter Kraker and I am a research assistant at the Know-Center (Graz University of Technology). Currently, I am involved in STELLAR, an EU-funded Network of Excellence revolving around Technology Enhanced Learning. My main research interest and the topic of my PhD thesis is “Science 2.0”: the way in which researchers use Web 2.0 for their work and the effects this has on science itself.

I will use this blog to report on my research, to cover important developments in the area, and to publish interesting stuff I come across. I am looking forward to your input and I sincerely hope that this will lead to a fruitful exchange!