The world without Wikipedia would be a much sadder place. Everyone knows about it and everyone uses it – daily. The dreamlike promise of a great internet encyclopedia, accessible to anyone, anywhere, all the time – for free, has become a reality. Wikidata is a lesser-known part of the Wikimedia family, represents a data backend system of all Wikimedia projects, and fuels Apple’s Siri, Google Assistant, and Amazon’s Alexa among many other popular and widely-used applications and systems.
Wikipedia is one of the most popular websites in the world. It represents everything glorious about the open web, where people share knowledge freely, generating exponential benefits for humanity. Its economic impact can’t be calculated; being used by hundreds of millions, if not billions of people worldwide, it fuels everything from the work of academics to business development.
Wikipedia is far more than just a free encyclopedia we all love. It’s part of the Wikimedia family, which is, in their own words: “a global movement whose mission is to bring free educational content to the world.” To summarize its vision: “Imagine a world in which every single human being can freely share in the sum of all knowledge.”
Not that many people know enough about Wikidata, which acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
Goran Milovanović is one of the most knowledgeable people I know. I invited him to lecture about Data Science potential in the public administration at the policy conference that I organized five years ago. We remained friends and I enjoy talking with him about everything that has ever popped in the back of my head. Interested in early Internet development and the role of RAND in immediate postwar America? No worries, he’ll speak 15 minutes about it in one breath.
Goran earned a Ph.D. in Psychology (a 500+ pages long one, on the topic of Rationality in Cognitive Psychology) from the University of Belgrade in 2013, following two years as a graduate student in Cognition and Perception at NYU, United States. He spent a lot of years doing research in Online Behaviour, Information Society Development, and Internet Governance, co-authoring 5 books about the internet in Serbia. He provides consultancy services in Data Science for Wikidata to Wikimedia Deutschland since 2017, where he is responsible for the full-stack development and maintenance of several analytical systems and reports on this complex knowledge base, and runs his own Data Science boutique consultancy DataKolektiv from Belgrade.
He’s a perfect mixture of humanities, math, and engineering, constantly contemplating the world from a unique perspective that takes so many different angles into account.
We focused our chat around Wikidata and the future of the web, but, as always, touched many different phenomena and trends.
Before we jump to Wikidata: What is Computational Cognitive Psychology and why were you so fascinated with it?
Computational Cognitive Psychology is a theoretical approach to the study of the mind. We assume that human cognitive processes – processes involved in the generation of knowledge: perception, reasoning, judgment, decision making, language, etc. – can essentially be viewed as computational processes, algorithms running not in silico but on a biologically evolved, physiological hardware of our brains. My journey into this field began when I entered the Department of Psychology in Belgrade, in 1993, following more than ten years of computer programming since the 80s and a short stay at the Faculty of Mathematics. In the beginning, I was fascinated by the power of the very idea, by the potential that I saw in the possible crossover of computer science and psychology. Nowadays, I do not think that all human cognitive processes are computational, and the research program of Computational Cognitive Psychology has a different meaning for me. I would like to see all of its potential fully explored, to know the limits of the approach, and then try to understand, do describe somehow, even intuitively only, what was left unexplained. The residuum of that explanatory process might represent the most interesting, significant aspect of being human at all. The part that remains irreducible to the most general scientific concept that we have ever discovered, the concept of computation, that part I believe to be very important. That would tell us something about the direction that the next scientific revolution, the next paradigm change, needs to take. For me, it is a question of philosophical anthropology: what is to be human? – only driven by an exact methodology. If we ever invent true, general AI in the process, we should treat it as a by-product, as much as the invention of the computer was a by-product of Turing’s readiness to challenge some of the most important questions in the philosophy of mathematics in his work on computable numbers. For me, Computational Cognitive Psychology, and Cognitive Science in general, do not pose a goal in themselves: they are tools to help us learn something of a higher value than how to produce technology and mimic human thinking.
What is Wikidata? How does it work? What’s the vision?
Wikidata is an open knowledge base, initially developed by parsing the already existing structured data from Wikipedia, then improved by community edits and massive imports of structured data from other databases. It is now the fastest-growing Wikimedia project, recently surpassing one billion edits. It represents knowledge as a graph in which nodes stand for items and values and links between them for properties. Such knowledge representations are RDF compliant, where RDF stands for Resource Description Framework, a W3C standard for structured data. All knowledge in systems like Wikidata takes a form of a collection of triples, or basic sentences that describe knowledge about things – anything, indeed – at the “atomic” level of granularity. For example, “Tim Berners-Lee is a human” in Wikidata translates to a sentence in which Q80 (the Wikidata identifier for “Tim Berners Lee”) is P31 (the Wikidata identifier for the “instance of” property of things) of Q5 (the Wikidata identifier for a class of items that are “humans”). So, Q80 – P31 – Q5 is one semantic triple that codifies some knowledge on Sir Timothy John Berners-Lee, who is the creator of the World Wide Web by the invention of the Hypertext Transfer Protocol (HTTP) and 2016. recipient of the Turing Award. All such additional facts about literally anything can be codified as semantic triples and composed to describe complex knowledge structures: in Wikidata, HTTP is Q8777, WWW is Q466, discoverer or inventor is P61, etc. All triples take the same, simple form: Subject-Predicate-Object. The RDF standard defines, in a rather abstract way, the syntax, the grammar, the set of rules that any such description of knowledge must follow in order to ensure that it will always be possible to exchange knowledge in an unambiguous way, irrespectively of whether the exchange takes place between people or computers.
Wikidata began as a project to support structured data for Wikipedia and other Wikimedia projects, and today represents the data backbone of the whole Wikimedia system. Thanks to Wikidata, many repetitions that might have occurred in Wikipedia and other places are now redundant and represent knowledge that can be served to our readers from a central repository. However, the significance of Wikidata goes way beyond what it means for Wikipedia and its sisters. The younger sister now represents knowledge on almost one hundred million things – called items in Wikidata – and grows. Many APIs on the internet rely on it. Contemporary, popular AI systems like virtual assistants (Google Assistant, Siri, Amazon Alexa) make use of it. Just take a look at the number of research papers published on Wikidata, or using its data to address fundamental questions in AI. By means of the so-called external identifiers – references from our items to their representations in other databases – it represents a powerful structured data hub. I believe Wikidata nowadays has the full potential to evolve into a central node in the network of knowledge repositories online.
Wikidata External Identifiers: a network of Wikidata external identifiers based on their overlap across tens of millions of items in Wikidata, produced by Goran S. Milovanovic, Data Scientist for Wikidata @WMDE, and presented at the WikidataCon 2019, Berlin
What’s your role in this mega system?
I take care about the development and maintenance of analytical systems that serve us to understand how Wikidata is used in Wikimedia websites, what is the structure of Wikidata usage, how do human editors and bots approach editing Wikidata, how does the use of different languages in Wikidata develop, whether it exhibits any systematic biases that we might wish to correct for, what is the structure of the linkage of other online knowledge systems connected with Wikidata by means of external identifiers, how many pageviews we receive across the Wikidata entities, and many more. I am also developing a system that tracks the Wikidata edits in real-time and informs our community if there are any online news relevant for the items that are currently undergoing many revisions. It is a type of position which is known as a generalist in the Data Science lingo; in order to be able to do all these things for Wikidata I need to stretch myself quite a bit across different technologies, models and algorithms, and be able to keep them all working together and consistently in a non-trivial technological infrastructure. It is also a full-stack Data Science position where most of the time I implement the code in all development phases, from the back-end where data acquisition (the so-called ETL) takes place in Hadoop, Apache Spark, SPARQL, through machine learning where various, mostly unsupervised learning algorithms are used, towards the front-end development where we finally serve our results in interactive dashboards and reports, and finally production in virtualized environments. I am a passionate R developer and I tend to make use of the R programming language consistently across all the projects that I manage, however it ends up being pretty much a zoo in which R co-exists with Python, SQL, SPARQL, HiveQL, XML, JSON, and other interesting beings as well. It would be impossible for a single developer to take control of the whole process if there were no support from my colleagues in Wikimedia Deutschland and the Data Engineers from the Wikimedia Foundation’s Analytics Engineering team. My work on any new project feels like solving a puzzle; I face the “I don’t know how to do this” situation every now and then; I learn constantly and the challenge is so motivating that I truly suspect there can be many similarly interesting Data Science positions like this one. It is a very difficult position, but also one professionally very rewarding.
If you were to explain Wikidata as a technological architecture, how would you do it in a few sentences?
Strictly speaking, Wikidata is a dataset. Nothing more, nothing less: a collection of data represented so as to follow important standards that makes it interoperable, usable in any imaginable context where it makes sense to codify knowledge in an exact way. Then there is Wikibase, a powerful extension of the MediaWiki software that runs Wikipedia as well as many other websites. Wikibase is where Wikidata lives, and where it is served from wherever anything else – a Wikipedia page, for example – needs it. But Wikibase can run any other dataset that complies to the same standards as Wikidata, of course, and Wikidata can inhabit other systems as well. If by the technological architecture you mean the collection of data centers, software, and standards that make Wikidata join in wherever Wikipedia and other Wikimedia projects need it – well, I assure you that it is a huge and a rather complicated architecture underlying that infrastructure. If you imagine all possible uses of Wikidata, external to the Wikimedia universe, run in Wikibase or otherwise… then… it is the sum of all technology relying on one common architecture of knowledge representation, not of the technologies themselves.
How does Wikidata overcome different language constraints and barriers? It should be language-agnostic, right?
Wikidata is and is not language-agnostic at the same time. It would be best to say that it is aware of many different languages in parallel. At the very bottom of its toy box full of knowledge, we find abstract identifiers for things: Q identifiers for items, P identifiers for properties, L for lexemes, S for senses, F for forms… but those are just identifiers, and yes they are language agnostic. But things represented in Wikidata do not have only identifiers, but labels, aliases, and descriptions in many different languages too. Moreover, we have tons of such terms in Wikidata currently: take a look at my Wikidata Languages Landscape system for a study and an overview of the essential statistics.
What are the knowledge graphs and why they are important for the next generation of the web?
They are important for this generation of the web too. To put it in a nutshell: graphs allow us to represent knowledge in the most abstract and most general way. They are simply very suitable to describe things and relations between them in a way that is general, unambiguous, and in a form that can quickly evolve into new, different, alternative forms of representation that are necessary for computers to process it consistently. By following common standards in graph-based knowledge representation, like RDF, we can achieve at least two super important things. First, we can potentially relate all pieces of our knowledge, connect anything that we know at all so that we can develop automated reasoning across vast collections of information and potentially infer new, previously undiscovered knowledge from them. Second, interoperability: if we all follow the same standards of knowledge representation, and program our APIs that interact over the internet so to follow that standard, then anything online can easily enter any form of cooperation. All knowledge that can be codified in an exact way thus becomes exchangeable across the entirety of our information processing systems. It is a dream, a vision, and we find ourselves quite far away from it at the present moment, but a one rather worth of pursuit. Knowledge graphs just turn out to be the most suitable way of expressing knowledge in a way desirable to achieve these goals. I have mentioned semantic triples, sentences of the Subject-Predicate-Object form, that we use to represent the atomic pieces of knowledge in this paradigm. Well, knowledge graphs are just sets of connected constituents of such sentences. When you have one sentence, you have a miniature graph: a Subject points to the Object. Now imagine having millions, billions of sentences that can share some of the constituents, and a serious graph begins to emerge.
A part of the knowledge graph for English (Q1860 in Wikidata)
Where do you think the internet will go in the future? What’s Wikidata’s role in that transformation?
The future is, for many reasons, always a tricky question. Let’s give it a try: on the role of Wikidata, I think we have clarified that in my previous responses: it will begin to act as a central hub of Linked Open Data sooner or later. On the future of the Internet in general, talking from the perspective of the current discussion solely: I do not think that the semantic web standards like RDF will ever reach universal acceptance, and I do not think that is even necessary for that to happen to enter the stage of internet evolution where complex knowledge is almost seamlessly interacting all over the place. It is desirable, but not necessary in my opinion. Look at the de facto situation: instead of evolving towards one single, common standard of knowledge and data representation, we have a connected network of APIs and data providers exchanging information by following similar enough, easily learnable standards – enough to not make software engineers and data scientists cry. Access to knowledge and data will ease and governments and institutions will increasingly begin to share more open data, increasing the data quality along the way. It will become a thing of good manners and prestige to do so. Data openness and interoperability will become one of the most important development indicators, tightly coupled with questions of fundamental human rights and freedoms. To have your institution’s open data served via an API and offered in different serializations that comply with the formal standards will become as expected as publishing periodicals on your work is now. Finally, the market: more data, more ideas to play with.
A lot of your work is relying on technologies used in natural language processing (NLP), typically handling language at scale. What are your impressions of Open AI’s GPT-3 which is quite a buzz recently?
It is fascinating, except for it works better in language production that can fool someone than in language production that exhibits anything like the traces of human-like thinking. Contemporary systems like GPT-3 make me think if the Turing test was ever a plausible test to detect intelligence in something – I always knew there was something I didn’t like about it. Take a look, for example, at what Gary Marcus and Ernest Davis did to GPT-3 recently: GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about. It is a clear example of a system that does everything to language except for it does not understand it. Its computational power and the level up to which it can mimic the superficial characteristics of the language spoken by a natural speaker are fascinating. But it suffers – and quite expectedly, I have to add – from a lack of understanding of the underlying narrative structure of the events, the processes it needs to describe, the complex interaction of language semantics and pragmatics that human speakers face no problems with. The contemporary models in NLP are all essentially based on an attempt to learn the structure of correlations between linguistic constituents of words, sentences, and documents, and that similarity-based approach has very well known limits. It was Noam Chomsky who in the late 50s – yes, 50s – tried to explain to the famous psychologist B. F. Skinner that just observing statistical data on the co-occurrences of various constituents of language will never provide for a representational structure powerful enough to represent and process natural language. Skinner didn’t seem to care back in time, and so didn’t the fans of the contemporary Deep Learning paradigm which is essentially doing exactly that, just in a way orders of magnitude more elaborated than anyone ever tried. I think we are beginning to face the limits of that approach with GPT-3 and similar systems. Personally, I am more worried about the possible misuse of such models to produce fake news and fool people into silly interpretations and decisions based on false, simulated information, than to question if GPT-3 will ever grow up to become a philosopher because it will certainly not. It can simulate language, but only the manifest characteristics of it; it is not a sense-generating machine. It does not think. For that, you need some strong symbolic, not connectionist representation, engaged in the control of associative processes. Associations and statistics alone will not do.
Do humanities have a future in the algorithmic world? How do you see the future of humanities in the fully data-driven world?
First, a question: is the world ever going to be fully data-driven, and what does that mean at all? Is a data-driven world a one in which all human activity is passivized and all our decisions transferred to algorithms? It is questionable if something like that is possible at all, and I think that we all already agree that it is certainly not desirable. While the contemporary developments in Data Science, Machine Learning, AI, and other related fields, are really fascinating, and while our society is becoming more and more dependent upon the products of such developments, we should not forget that we are light years away from anything comparable to true AI, sometimes termed AGI (Artificial General Intelligence). And I imagine only true AI would be powerful enough to run the place so that we can take a permanent vacation? But then comes the ethical question, one of immediate and essential importance, of would such systems, if they ever come to existence, be possible to judge human action and act upon the society in a way we as their makers would accept as moral? And only then comes the question of do we want something like that in the first place: wouldn’t it be a bit boring to have nothing to do and go meet your date because something smarter than us has invented a new congruence score and started matching people while following an experimental design for further evaluation and improvement?
Optimization is important, but it is not necessarily beautiful. Many interesting and nice things in our lives and societies are deeply related to the fact that there are real risks that we need to take into account because irreducible randomness is present in our environment. Would an AI system in full control prevent us from trying to conquer Mars because it is dangerous? Wait, discovering the Americas, the radioactive elements, and the flight to the Moon were dangerous too! I can imagine that humanity would begin to invent expertise in de-optimizing things and processes in our environments if some fully data-driven and AI-based world would ever come to existence. Any AI in a data-driven world that we imagine nowadays can be no more than our counselor, except for if the edge case that it develops true consciousness turns out to be a realistic scenario. If that happens, we would have to seriously approach the question of how to handle our relationship to AI and its power ethically.
I do not see how humanities could possibly be jeopardized in this world that is increasingly dependent on information technologies, data, and automation. To understand and discover new ways to interpret the sense generating narratives of our lives and societies was always a very deep, a very essential human need. Does industrialization, including our own Fourth age of data and automation, necessarily conflicts with the world aware of the Shakespearean tragedy, as it does in Huxley’s “Brave New World”? I don’t think it is necessarily so. I enjoy the dystopian discourse very much because I find that it offers so many opportunities to reflect upon our possible futures, but I do not see us living in a dystopian society anytime soon. It is just that somehow the poetics of the dystopian discourse are well-aligned, correlated with the current technological developments, but if there ever was a correlation from which we should not infer any fatalistic causation that is the one.