In this post, I want to take a somewhat scattershot look at a range of data portals and projects that, in my view, point to a number of concepts and configurations that might be productive to explore in future portal development. The selection is based on nothing more than portals I’ve encountered in different contexts over recent years, and is far from exhaustive.
First stop, London
First up on this brief journey: the London Datastore. Launched in 2010, London’s portal has evolved to cover more of the data spectrum, adding the ability to host data collaborations as well as open datasets. As the London Office of Technology and Innovation describe it:
Data collaboration projects are complex and require alignment of outcomes, information governance processes and technical architecture across multiple stakeholders. The London DataStore is a free service for London that enables boroughs and their partners to share data without building bespoke infrastructure for each data sharing project.
There are two further aspects of the London Datastore worth noting. The first is the evident framing of the Datastore not just as a website, but also as a service, backed by projects and people (‘The City Data Team’). The second is the presence of well-curated pages for selected datasets that provide custom visualisations: such as the COVID Case Data or Google Mobility Data (also a good example of a dataset of public value, originating from the private sector, and available from the public portal).
One format, many publishers
The next stop on this tour is the IATI Registry, one of the earliest CKAN instances I worked with, and still more or less unchanged over the past decade. It does one thing, and whether it does it well… that’s an open question. Central to this portal, though, is that it curates just two kinds of dataset: International Aid Transparency Initiative (IATI) Activity and Organisation files, and that the metadata is maintained by thousands of IATI publisher organisations, each needing their own portal accounts and access. The comparable 360 Giving Registry is a custom-built platform (with a design informed by learning from the IATI Registry) that fetches dataset information from a Salesforce database (where it is managed as part of a wider CRM), carries out dataset validation and summarisation, and generates a standardised data.json file.
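To make that registry-plus-data.json pattern a little more concrete, here is a minimal sketch of how a consumer might read a registry-style data.json file and summarise it. The URL, the assumed structure (a JSON array of dataset entries) and the field names are placeholders for illustration, not the actual 360 Giving registry schema.

```python
# A minimal sketch of consuming a registry-style data.json file.
# The URL and field names below are illustrative assumptions only.
import requests

REGISTRY_URL = "https://example.org/data.json"  # hypothetical registry endpoint


def summarise_registry(url: str) -> None:
    """Fetch a data.json registry and print one line per dataset entry."""
    datasets = requests.get(url, timeout=30).json()
    for dataset in datasets:
        # 'publisher', 'title' and 'distribution' are illustrative keys.
        publisher = dataset.get("publisher", {}).get("name", "unknown publisher")
        title = dataset.get("title", "untitled dataset")
        files = len(dataset.get("distribution", []))
        print(f"{publisher}: {title} ({files} file(s))")


if __name__ == "__main__":
    summarise_registry(REGISTRY_URL)
```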
Do not leave any packages unattended
Then we come to another long-standing CKAN instance: https://datahub.io/. The DataHub has been through a number of iterations, but in its latest revision it offers a clear template of what a data-package centric data portal can be. Offering open hosting of data, combined with curated and ‘certified’ data packages, it places an emphasis on basic visualisation, data validation, and field-level metadata, building on the tooling around frictionless data packages. The documentation points towards all the skills that a data portal management team might need if the ‘Debian of data’ vision of early CKAN is to be fully realised. And if, like me, you are not sure that many open government data portal teams will have the capacity to manage the breadth of data their portals might host in this way, it raises the question of how portals should be architected for the skills-as-they-are in government data management.
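For a flavour of the frictionless tooling the DataHub builds on, here is a minimal sketch using the frictionless Python library to infer field-level metadata for a file and then validate it. The file name is a placeholder, and the exact API details may differ between library versions.

```python
# A minimal sketch of the data-package workflow, assuming the frictionless
# Python library; 'example.csv' is a placeholder path.
from frictionless import describe, validate

# Infer field-level metadata (field names and types) for a local CSV file.
resource = describe("example.csv")
print(resource.schema)  # the inferred Table Schema for the file

# Validate the file and report any row- or field-level problems found.
report = validate("example.csv")
print("valid:", report.valid)
if not report.valid:
    print(report)  # the report lists the individual errors
```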
Simplicity takes a lot of work
Turning to the Humanitarian Data Exchange, we find a technically robust platform, but one designed around a lowest-common-denominator approach. HDX, as a heavily modified CKAN portal, tightly integrates support for HXL, the Humanitarian Exchange Language. Rather than requiring data packaging and schemas, HXL asks data providers to add minimal #tag based markup to spreadsheets or other formats, helping guide best-efforts data integration and visualisation. The design principle was more or less: "It should be possible for someone, in an emergency response team in a disaster zone, working on a laptop that’s rapidly running out of battery, to quickly access, manipulate and share interoperable data." There is lots to explore on HDX. The site hosts visual data stories and thematic data explorers, and has developed ‘Data Grids’ to report on the coverage and quality of key datasets for each country where UNOCHA may have operations. By focussing on the known data needs of portal users, it provides an immediate sense of whether key data is available and usable, rather than leaving users to discover that the data they are looking for is not there, or is not up to scratch. There is also a big friendly button inviting users to ‘Add Data’, leading to a journey of either uploading a dataset, or simply declaring the metadata for a dataset that others might request, through a service called HDX Connect.
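To illustrate the kind of minimal markup HXL asks for, here is a small, self-contained sketch (the data and the choice of tags are made up for the example) showing how a #tag row lets a script find columns by tag rather than by locally varying header text.

```python
# A minimal sketch of HXL-style #tag markup: the second row carries tags,
# so tools can locate columns regardless of the human-readable headers.
# The sample data and tags below are illustrative only.
import csv
import io

SAMPLE = """Organisation,Province,People affected
#org,#adm1,#affected
Relief Org A,Northern,1200
Relief Org B,Coastal,450
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
headers, hxl_tags, records = rows[0], rows[1], rows[2:]

# Look up a column by its HXL tag rather than by its header text.
affected_index = hxl_tags.index("#affected")
total = sum(int(record[affected_index]) for record in records)
print(f"Total people affected: {total}")  # -> 1650
```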
A bespoke search
This role of the portal as a data broker is central to the fifth and final stop on our tour. OpenEnergy is a prototype portal, developed by the IceBreaker One project. Long-term, the project intends to create a ‘marketplace’ of both open and shared data, and has been developing a metadata model based on the idea of pre-emptive licensing. Through the Access Control and Capability Grant Language, data owners are encouraged to describe the different conditions for dataset access, the capabilities granted to data users (e.g. to redistribute or not), and the obligations placed upon users. Coupled with services to validate the identity and credentials of data users, the idea (as I understand it) is that this could enable a data owner to articulate that "Dataset X can be downloaded and analysed for free by accredited researchers, subject to citation of the data source", whilst "Dataset X can also be accessed via API under a Service Level Agreement by commercial users, on payment of a fee." In addition to introducing a few other useful ideas, such as a ‘heartbeat service’ that data owners should provide to enable portals to better monitor uptime of APIs and other means of data access, OpenEnergy points towards a future where portals could be much more aware of who their users are: providing results oriented towards the user’s role and interests, rather than giving the same search results to everyone.
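To make the pre-emptive licensing idea a little more tangible, here is a purely hypothetical sketch of how conditions, capabilities and obligations might be expressed and looked up for a verified user role. It is not the actual Access Control and Capability Grant Language schema, just one way of expressing the concept.

```python
# A purely illustrative sketch of 'pre-emptive licensing' metadata.
# This is NOT the Access Control and Capability Grant Language schema;
# the field names and values are invented for the example.
ACCESS_RULES = [
    {
        "audience": "accredited-researcher",
        "access": "download",
        "fee": None,
        "capabilities": ["analyse"],
        "obligations": ["cite-data-source"],
    },
    {
        "audience": "commercial-user",
        "access": "api",
        "fee": "paid",
        "capabilities": ["analyse", "redistribute"],
        "obligations": ["service-level-agreement"],
    },
]


def grants_for(audience: str) -> list[dict]:
    """Return the access rules that apply to a given (verified) user role."""
    return [rule for rule in ACCESS_RULES if rule["audience"] == audience]


print(grants_for("accredited-researcher"))
```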
Whilst researchers have generated somewhat exhaustive lists of common portal features, asking the simple question ‘What one thing strikes you as distinct about this portal?’ can be very instructive. For all the common code behind the portals I’ve explored above (most are CKAN instances), there is a significant amount of experimentation evident. In fact, it’s probably because they’re built on open source CKAN, rather than one of the commercial offerings, that this variety has been able to emerge.
However, the question for the future is whether any of the features we’ve found on this journey, or that you might find in surveying your favourite portals, could or should become part of the default portal feature set.
I can’t help but mention that most open data portals are not actually friendly to data analysts. Most analysts use commercial or other common tools, such as Tableau or Power BI, to process and visualise data.
Another practical issue is that most existing open data portals don’t actually support huge datasets with billions of records, and don’t provide API proxying components, offering only a simple metadata description of an API instead.
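To illustrate the difference between merely describing an API in metadata and actually proxying it, here is a minimal sketch of a pass-through proxy that a portal could run in front of an upstream data API. The framework choice (FastAPI and httpx) and the upstream URL are my assumptions for illustration, not how any of the portals above are built.

```python
# A minimal sketch of an API-proxying component in front of an upstream
# data API. Framework (FastAPI + httpx) and the upstream URL are assumptions.
import httpx
from fastapi import FastAPI, Response

UPSTREAM = "https://example.org/api"  # hypothetical upstream data API

app = FastAPI()


@app.get("/proxy/{path:path}")
async def proxy(path: str) -> Response:
    """Pass the request through to the upstream API and relay the response.

    A real portal proxy would add caching, authentication, rate limiting and
    usage logging; this sketch only shows the pass-through itself.
    """
    async with httpx.AsyncClient() as client:
        upstream_response = await client.get(f"{UPSTREAM}/{path}")
    return Response(
        content=upstream_response.content,
        status_code=upstream_response.status_code,
        media_type=upstream_response.headers.get("content-type"),
    )
```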
This gap helps explain why API catalogues like api.gouv.fr have been created outside data catalogues. I see evidence that government agencies are slowly shifting to commercial and open-source enterprise tools to provide a better developer experience.