
Technology: A genealogy of data portals

Exploring the history of portal platforms, the evolution of portal technology and the shifting focus of portal providers

Published on Dec 09, 2021

In this post, with a nod to Jonathan Gray’s Towards a Genealogy of Open Data, I’ll attempt to trace a partial history of the software tools that power many different data platforms, focussing in particular on how they have self-described their roles, and the particular concepts or features they have brought into the portal space.

The early portals

It’s not uncommon for researchers to note the relatively small field of portal platforms. If you don’t want to roll your own, implementing an open data portal involves a choice between two broad families. On the one hand, there are the open source ‘KANs’: CKAN, DKAN and even JKAN. These are oriented towards the developer in their out-of-the-box instantiations, though widely customisable. On the other, there are the commercial offerings, such as Socrata, OpenDataSoft, Junar and Knoema, promoted much more for their visualisation and analytics features than their internal architecture, and increasingly framed as tools for enterprise, with a public sector offering in the mix, rather than framed first and foremost as open government data portals.

So, how did we get here? We need to go back to 2006 to find the early roots of portal software. That’s when the first version of CKAN emerged, the initials standing for the ‘Comprehensive Knowledge Archive Network’ in reference to the Perl programming language package repository, CPAN. One of a number of software projects hosted on the Open Knowledge Foundation’s ‘Knowledge Forge’, CKAN represents a conscious attempt to bring ideas from open source software development into the production of knowledge. In 2010, CKAN was described as creating a ‘debian of data’, invoking a parallel to software package management in which the data analyst would be able to type a few commands at a terminal and get hold of ready-to-use and dependency-managed datasets to work with. This brought with it the idea of the data package: a container that wraps a dataset along with its metadata - although few of the dataset links on modern portals point to data packaged in this way.
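To make the data package idea concrete, here is a minimal sketch (in Python, for illustration) of the kind of descriptor such a package might carry alongside its data files. The field names loosely follow the shape of the Frictionless Data datapackage.json descriptor rather than any particular portal’s schema, and the dataset itself is hypothetical.

```python
import json

# A minimal, illustrative data package descriptor: the data files travel
# together with the metadata needed to find, license and load them.
# Field names loosely follow datapackage.json conventions; the dataset
# and file paths are hypothetical.
package = {
    "name": "example-air-quality",
    "title": "Example Air Quality Readings",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "readings",
            "path": "data/readings.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "station_id", "type": "string"},
                    {"name": "recorded_at", "type": "datetime"},
                    {"name": "no2_ugm3", "type": "number"},
                ]
            },
        }
    ],
}

# Writing the descriptor next to the data is what turns a folder of CSVs
# into something a 'data package manager' could fetch and resolve.
with open("datapackage.json", "w") as f:
    json.dump(package, f, indent=2)
```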

The developer-orientation of CKAN was also evident in the feature set of early versions, which lacked any substantive content management features. This required the data catalogue CKAN provided to be embedded into a wider Content Management System in order to provide any user-facing documentation or information other than dataset meta-data. In the case of data.gov.uk, and a number of other local and national portals around the world, Drupal was the CMS of choice.

Just a year after the first CKAN code emerged, in 2007 a small company in Seattle released an online database product named blist. Shortly after the platform was used by President Obama’s transition team to publish their financial disclosures, the company, now Socrata, made a pivot to focus on public sector transparency, delivering open government data portals across the US, and gaining global reach when used by the World Bank in 2010 for the Kenya Open Data Portal. Given its database heritage, online data tables, rather than meta-data, were at the heart of Socrata’s early offering.

A growing field

As the data portal phenomenon gathered pace from 2010 onwards, a number of new platforms entered the scene, on both the open source and commercial sides of the landscape. In 2012, DKAN emerged as a CKAN clone, but built in PHP (rather than CKAN’s Python) as a module for the Drupal Content Management System (CMS). This offers, in theory, the potential for datasets to be treated as first-class objects integrated with other website content, although in practice DKAN is offered as a standalone Drupal distribution. DKAN 2.0, released in 2020, adopts an API-first model, with separate front-end and back-end, and places emphasis on the interoperability of meta-data through the use of common schemas. A second CKAN clone emerged in 2016, in the form of JKAN. The J stands for Jekyll, the Ruby-based static site generator. Whereas CKAN and DKAN rely on database backends and relatively complex hosting infrastructure, in JKAN datasets are defined with simple markdown files, used to generate a static catalogue website. Although broadly copying the interface of CKAN, JKAN can be read as a challenge to the growing complexity and cost of data catalogues.
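As a flavour of the flat-file approach, the sketch below (Python, illustrative only) writes the kind of markdown file with YAML front matter that a Jekyll-based catalogue such as JKAN builds its listings from; the field names are indicative rather than JKAN’s exact schema.

```python
from pathlib import Path

# Illustrative only: a dataset entry as a flat markdown file with YAML
# front matter - the raw material a static site generator turns into a
# catalogue page. Field names are indicative, not JKAN's exact schema.
entry = """---
title: Street Trees
organization: Parks Department
notes: Location and species of council-maintained street trees.
resources:
  - name: Street trees (CSV)
    url: https://example.org/data/street-trees.csv
    format: csv
category:
  - Environment
---
"""

Path("_datasets").mkdir(exist_ok=True)
Path("_datasets/street-trees.md").write_text(entry)
```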

2016 also saw the first code commits on the latest open source portal to be released, Magda.io. Developed in Australia, Magda has an emphasis on federated catalogue management, big and small data, and supporting data publishers to create high-quality meta-data, as well as adopting a microservice architecture that allows extensions to be written in almost any programming language. udata, the platform behind the data.gouv.fr portal, has also been available as an open source project since 2016, developing a range of social features, placing emphasis on inviting submissions of dataset re-use, and supporting user-submitted as well as government-published datasets. Described as a “customizable and skinnable social platform dedicated to (open)data”, udata has seen adoption in Tunisia, Portugal and Luxembourg, but appears to have had surprisingly limited influence to date on wider portal design practices.

On the commercial side, OpenDataSoft launched in France in 2011, and the JUNAR Data Platform launched in Chile in 2012. Both platforms placed emphasis on their data visualisation capabilities, and their ability to host datasets as well as meta-data about them. Knoema, publicly launched through a collaboration with the African Development Bank in 2014, also foregrounds data analysis rather than access, with an emphasis on time-series data. Informed by the needs of national statistical agencies, and with built-in SDMX (Statistical Data and Metadata eXchange) compatibility, the Knoema platform places emphasis on searching for facts, observations and trends, rather than specific datasets.

Another market entrant came in 2015, when ESRI, the commercial firm behind one of the Geographic Information Systems most widely used in the public sector, launched an offering to enable their existing customers to create data portals from selected geographic datasets - highlighting the potential for firms already involved in aspects of government data management to enter the data portal space. ESRI’s entry into the market also points to the growing importance since at least 2012 of geospatial features. In part sparked by the requirements of the European INSPIRE directive, CKAN and other open source platforms increasingly gained features for harvesting geospatial datasets, providing geospatial meta-data and supporting geospatial data exploration. The open source GeoNode platform, under development since 2009 and initially targeted at international development contexts, can also perform a data portal function for geospatial datasets.

A directory of data publishing tools compiled in 2017/18 by the Open Data Institute also identified a number of other offerings in the data catalogue or portal space, including DataDock, EntryScape and qri (pronounced ‘Query’). DataDock was created as an open source .NET project in 2018, providing a workflow for publishing CSV files to GitHub and a shared data portal, whilst also adding optional semantic enrichment to create RDF triples from published data. EntryScape Catalog, framed as a data management platform, is a commercial offering used by a number of Swedish municipalities, and is paired with a ‘Registry’ platform aimed at supporting improved data quality management.

qri, launched in 2020, takes a different tack, framing itself as a ‘data bazaar’, and focussing on issues of data versioning and archival, alongside effective packaging of dataset meta-data. The qri.cloud platform provides dataset previews and change-logs, as well as supporting update-triggered workflows that can process a dataset. Behind the scenes, qri uses IPFS (the distributed InterPlanetary File System) to manage data resources.

Shifting focus

Over time, the framing of different portal offerings has shifted. In 2014, Socrata launched their ‘Open Data Network’, designed to support a greater two-way flow of data between the public and private sectors, and in 2017 they launched a product focussed on internal government data, signalling a shift towards providing enterprise data collaboration tools first, and open data portals second.

OpenDataSoft, part of the Open Data Institute StartUp programme in 2014, has followed a similar path. Now framed as “the data sharing platform teams use to access, reuse and share data that grows business”, OpenDataSoft targets a number of private sector industries, as well as government data sharing. Much like Socrata, the OpenDataSoft platform places an emphasis on data exploration and visualisation tools. Through their Open Data Network they seek to provide their customers with easier access to data from others on the OpenDataSoft platform, and have also placed emphasis on using published datasets to create KPI dashboards for internal and external organisational use. The JUNAR Data Platform perhaps stays closest to a conventional open data framing, currently describing its focus as allowing customers to “transform your hard-to-find and useless data assets into dynamic tables, visualizations, maps, dashboards and APIs – so citizens, developers and companies can re-use them for their interests in a simple way.” Although still powering a number of statistical open data portals, Knoema’s focus appears to have shifted to serving data marketplace platforms, and acting as an ‘alternative data provider’ primarily to the finance sector.

Open source portals have also seen some shifts in how they are framed. CKAN, for example, now describes itself as ‘the world’s leading open source data management system’, and is supported by an active network of commercial providers offering hosting and support, including to enterprise customers.

New routes or dead ends?

From 2015 to 2018, an EU Horizon 2020 project, ROUTE-TO-PA (short for ‘Raising Open and User-friendly Transparency-Enabling Technologies for Public Administrations’), undertook research into ways to make data from open data portals more accessible. The resulting work identified a range of other software tools used as open data platforms in the EU, including Semantic MediaWiki and DataTank, and put forward designs for two software tools to sit alongside existing portals: the Social Platform for Open Data (SPOD), which introduced discussion rooms around datasets, and the Transparency Enhancing Toolkit (TET), which prototyped a stripped-back search interface and enhanced visualisation tools for dataset discovery. However, like much academic research (see a later post in this series), these experiments appear to have left little mark on the data portal software space - highlighting an apparent ongoing challenge when it comes to bringing new concepts into open source data portal design.

Semantics and standards

Throughout this history, two other strands in portal development are worth noting: semantics, and standardisation.

When data.gov.uk launched, the software stack around it also included a number of linked data components (building on a research programme from a number of years prior): converting selected datasets into RDF, providing the ability to query certain datasets with SPARQL, and assigning URIs to key data elements. From 2011 to 2012, Talis’ Kasabi platform explored providing a commercial linked data marketplace that might have become a proto-portal, although the platform closed after struggling to find a sustainable business model. The idea of providing knowledge graph access as part of a data catalogue remains in play through data.world, which converts uploaded datasets into triples for SPARQL querying, and through considerable ongoing academic experimentation. Organisation and dataset-level linked data publishing tools also exist in the form of platforms like PublishMyData.

The one place RDF has secured a stable foothold in the open data portal world is through the DCAT standard for portal metadata. First released as a W3C Recommendation in 2014, DCAT provides an RDF vocabulary for representing data catalogues, resources, datasets, distributions and the services that make datasets available to access. Version 2 of DCAT, published in 2020, introduced new models to represent ‘loosely structured catalogs’ (i.e. those that do not follow the CKAN model of packages, distributions and downloads), and added more detail on the representation of data provenance and quality, as well as outlining alignment with the schema.org Dataset type. DCAT terms provide the basis for the widely used data.json format for catalogue federation, originally defined in the US, but also used in a number of other countries and contexts.
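As a rough sketch of what DCAT metadata looks like in practice, the snippet below uses Python’s rdflib library (which bundles the DCAT and Dublin Core namespaces) to describe a hypothetical catalogue containing one dataset and one CSV distribution; a real portal would typically generate this from its catalogue database or expose it via a harvestable endpoint.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()

# Hypothetical identifiers for a catalogue, a dataset and its distribution.
catalog = URIRef("https://data.example.org/catalog")
dataset = URIRef("https://data.example.org/dataset/street-trees")
dist = URIRef("https://data.example.org/dataset/street-trees/csv")

# The Catalog -> Dataset -> Distribution shape that DCAT defines.
g.add((catalog, RDF.type, DCAT.Catalog))
g.add((catalog, DCTERMS.title, Literal("Example City Open Data")))
g.add((catalog, DCAT.dataset, dataset))

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Street Trees")))
g.add((dataset, DCTERMS.description, Literal("Location and species of street trees.")))
g.add((dataset, DCAT.distribution, dist))

g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, URIRef("https://data.example.org/files/street-trees.csv")))
g.add((dist, DCAT.mediaType, URIRef("https://www.iana.org/assignments/media-types/text/csv")))

# Serialise the catalogue description as Turtle.
print(g.serialize(format="turtle"))
```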

In parallel to the development of DCAT, in 2013 schema.org introduced a Dataset type based on the draft DCAT v1 (proposed in 2012), responding to growth in Open Government Data publication. This provided a language for meta-data about datasets anywhere on the web to be marked up for easier discovery, whether contained within a data catalogue or not. For example, schema.org Dataset markup is used on gov.uk in a number of places outside of data.gov.uk. Minor updates to the schema.org Dataset class were introduced between 2017 and 2019, primarily responding to demand from the scientific data community.
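By way of illustration, the snippet below assembles the kind of JSON-LD Dataset markup schema.org enables, which a publisher would embed in a web page (typically inside a script tag of type application/ld+json); the dataset and URLs are hypothetical, and only a handful of the available properties are shown.

```python
import json

# Illustrative schema.org Dataset markup as JSON-LD. Embedded in a web page,
# this is the structured data that crawlers use for dataset discovery.
# The dataset and URLs are hypothetical.
markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Street Trees",
    "description": "Location and species of council-maintained street trees.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://data.example.org/files/street-trees.csv",
        }
    ],
}

print(json.dumps(markup, indent=2))
```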

Adoption of DCAT and Schema.org dataset markup was pivotal in enabling the launch in 2018 (beta; full release in 2020) of Google Dataset Search. Some have feared that Google Dataset Search may render non-catalogue front-end features of data portals redundant, as users turn to Google’s global search and bypass local sites. To date, this fear does not appear to have been realised.

Enterprise meta-data management

Google are also active in another, quite distinct, area of the data catalogue space. Their Google Cloud Data Catalog is one of a number of products seeking to address enterprise data management challenges. Along with open source tools like LinkedIn’s DataHub, and Amundsen, originally developed at Lyft, these ‘metadata management platforms’ aim to “enable data discovery, data observability and federated governance” within large corporate environments. Able to assume that data, and its analysis, will be managed through data platforms such as Postgres, BigQuery, Redshift and Redash, these tools aim to capture ‘table level’ meta-data, keeping track of the properties of the columns inside a dataset, as well as describing the dataset container. These tools introduce a number of new concepts, such as Amundsen’s search result prioritisation based on dataset use (by looking at where datasets are referenced in machine-learning notebooks, or in queries behind dashboards), DataHub’s ‘dataset lineage’ to show how one dataset is related to another, and search over data uses as well as datasets.
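To give a flavour of the ‘table level’ meta-data these platforms capture, here is a toy Python structure recording column properties, a usage signal and a simple lineage link between two tables. The field names are purely illustrative and do not correspond to Amundsen’s or DataHub’s actual data models.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    # One column inside a table: the granularity enterprise catalogues work at.
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableMetadata:
    # A catalogue entry for one table, with a simple lineage pointer to the
    # upstream tables it was derived from. Field names are illustrative only.
    database: str
    table: str
    columns: list = field(default_factory=list)
    derived_from: list = field(default_factory=list)  # upstream table ids
    query_count_30d: int = 0  # usage signal for ranking search results


raw = TableMetadata(
    database="warehouse",
    table="raw_trips",
    columns=[ColumnMetadata("trip_id", "string"), ColumnMetadata("distance_km", "float")],
)

daily = TableMetadata(
    database="warehouse",
    table="daily_trip_summary",
    columns=[ColumnMetadata("day", "date"), ColumnMetadata("total_km", "float")],
    derived_from=["warehouse.raw_trips"],  # one hop of dataset lineage
    query_count_30d=42,
)

# Usage-based prioritisation of search results, in the spirit of Amundsen.
for t in sorted([raw, daily], key=lambda t: t.query_count_30d, reverse=True):
    print(t.table, t.query_count_30d)
```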

These enterprise tools have their own genealogy, linked to the evolution of data warehouses, data lakes and data meshes, and to a focus on machine learning and data analytics. Their nearest ‘open’ cousins are perhaps found in platforms like Kaggle, which, whilst not framed as a data portal, nevertheless provides access to hundreds of datasets and situates them alongside code used to process them. In connecting data with specific analytical tools like Pandas and R (and working with analysis-ready tidy data), Kaggle naturally chooses to generate summary statistics and graphics for each column in a source table.

Where next?

For the open source data portal platforms, we can take a look at where they might be heading by turning to published roadmap documents and discussions.

DKAN’s published roadmap for late 2021 outlines a future focus on data quality analysis and tools to improve data quality, as well as tools for interaction with data users. It highlights goals of better understanding “How do data consumers and data publishing organisations want to communicate through the catalog?” and “What role should visualisation and other more graphic elements play in a catalog?”

For new entrant Magda, future work is planned on “adding an integrated, customizable authorization system into Magda based on Open Policy Agent”, as well as work on improved interfaces for meta-data creation and dataset de-duplication. Longer-term plans for the project include surfacing metrics on “Subjective Data Usefulness/Usability/Ease-of-Use/Interest” and providing features for “Dataset Feedback/Collaboration” aimed at “Closing the loop between data users and custodians by providing a Quora-styled Q&A interface that allows users to ask questions / report problems with the data, that causes data custodians (as identified in the dataset’s metadata) to be notified and invited to join the discussion with the user.”

CKAN does not currently appear to have an agreed public roadmap outside its issue tracker, but discussions in 2020 covering a possible version 3 suggested work on decoupling the front-end and back-end, improving the extensions ecosystem, and building in better per-dataset versioning, as well as providing additional features for data validation and data summaries. Rich permission structures, allowing public meta-data but private data, were also mentioned.

It may be instructive too to look at the roadmaps of the enterprise metadata tools, but for the moment that is beyond this inquiry.

Where does this leave us?

A genealogical approach highlights the non-linear, messy nature of a field’s evolution. There is no overall design, but rather an exchange of ideas between independent and interdependent projects. After all, software platforms are collections of ideas as much as they are collections of code. Choices and configurations of portal software affect what portals can achieve, and how organisations have to be arranged to make the most of them. Understanding the way the data portal software landscape has evolved can help us to ask about the opportunities for future change.

For example, does ESRI’s move from internal GIS tool to open data portal point to future opportunities for internal data management platforms to become next generation portals? Or does the long-standing, but unrealised, goal of embedding social features further into portal platforms tell us something about the approach to take in future? Are there new ideas that need to become embedded across a wide range of portals? Or is portal specialisation to be embraced?

In a future post, I’ll look at a number of examples of portals in practice, seeking to pick out different ways in which the software tools described above have been appropriated, adapted and configured with particular focus and features - with the goal of sparking conversation about the kinds of features future portals might be based around.

Comments
Ivan Begtin:

Hi Tim!

Thanks for the insightful overview, with a lot of new and useful information about the history of open data portals.

I would like to add some points:

  1. I think it’s worth mentioning Invenio/InvenioRDM and Zenodo (which is based on Invenio), Dataverse and other Research Data Management (RDM) products that are also used as data portals, primarily by academic institutions. Re3Data is also a good source of open academic data repositories.

  2. I think there are at least three classifications of data portals/catalogues:

    • open data portals (CKAN, DKAN, JKAN, etc.)

    • academic data repositories (Dataverse, InvenioRDM, Zenodo, CERN Open Data, etc.)

    • open source enterprise [meta]data catalogues (Amundsen, LinkedIn/Acryl DataHub, OpenMetadata and others)

Leigh Dodds:

Also in rights/asset management, and in the need to track data lineage for regulatory reasons. For example, some enterprise data catalogues began with a focus on tracking licensed datasets and how these were being used and shared between internal products.

Leigh Dodds:

Other examples in this area: the Azure Data Marketplace, which I believe is what evolved into their Data Catalog. Also Tamr (https://web.archive.org/web/20150927142829/http://www.tamr.com/tamr-catalog-2/).

Leigh Dodds:

I think it was a bit more than just a parallel, in that CKAN replicated all of the elements of a software package manager: CKAN was the central repository, datasets were called “packages”, and the software suite included “DPM” (the data package manager; I wrote about using it with DGU here: https://blog.ldodds.com/2013/01/27/how-to-use-dpm-with-data-gov-uk/). The notion of data packages has gone on to be largely separate to CKAN, and the software package manager element has fallen away.