An Update on punctum books Usage Data: Interoperability, Interpretability, and AI
punctum books has been collecting usage data of its ebooks since our founding in 2011, and over the years both the collection and processing of these data have been expanded as our catalog of open access books has grown. Since 2020, thanks to the collaboration with our digital infrastructure partner Cloud68.co, we have been using phpMyAdmin as a MySQL database client and Metabase as data visualization software. I have previously written blog posts here and here on our process of usage data collection and processing.
This internal process is still very much a manual one. Once a month I gather all the different ebook usage data output files from their various sources, clean up the files, and import them into the database. Whenever a new source becomes available (for example, we recently started adding usage data from EBSCO eBooks), tables and visualizations need to be manually added and adjusted. We are currently provided with monthly ebook usage data from the following platforms: OAPEN, JSTOR, Project MUSE, Internet Archive, Google Books, and EBSCO eBooks.
The last five years have given us a sense of what it means to collect and process these data over an extended period of time, and the challenges that remain in terms of interpreting and evaluating these data in any meaningful manner.
Visualization
Even though Metabase is a powerful open source data visualization platform, persistent bugs in the software make it impossible to perform the specific types of table merges and data visualization we need to summarize and display our usage data in an optimal manner. As a result, we often still rely on manual work to provide our authors and libraries with usage data specific to their books or institutions, respectively, despite these data being readily available. Ideally, a visualization dashboard is interactive and customizable for a specific end user, but Metabase does not offer this functionality.
As a result, we have been looking into other open software options, such as Apache Superset, but moving platforms would put a considerable strain on our and Cloud68.co’s resources. On the one hand, moving would require Cloud68.co to add an entirely new software platform to their roster of products and maintain it in perpetuity, while for us it would mean recreating every single query from scratch in addition to a significant amount of manual coding, as Superset does not have the same visual table-merging interface as Metabase. On the other hand, Superset’s visualization options are much more powerful and interactive.
Interoperability
Since 2020, nothing has improved in terms of the interoperability of usage data across the open access book landscape. Different platforms appear to be simply locked in to business as usual, and there appears to be insufficient external pressure (from publishers, infrastructures, funders, libraries) to align usage data reporting even in some minimal measure, such as by adopting persistent identifiers and international standards.
Of the platforms that provide us with institution-specific data (for example, the university network from which a specific book or chapter was accessed), namely JSTOR and Project MUSE, neither has standardized its representation of institutions using the ROR persistent identifier, posing an ongoing challenge to meaningful data comparison on an institutional level.
Of the platforms that provide us with country-specific data, namely OAPEN, JSTOR, Project MUSE, and the Internet Archive, only the Internet Archive does so using an international standard, such as ISO 3166-1 alpha-2, that would allow easy correlation between data. As with institutions, JSTOR and Project MUSE use their own proprietary country lists, and OAPEN has not provided any meaningful way of exporting country-specific data from their usage data backend since 2021. Similarly, timestamp formats are anything but harmonized between data providers.
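To illustrate the kind of harmonization this pushes onto the publisher, here is a minimal Python sketch. It is not our actual ingest code: the country labels, the lookup table, and the list of timestamp formats are hypothetical examples, and each platform's real export would need its own mapping.

```python
from datetime import datetime

# Hypothetical lookup from platform-specific country labels
# to ISO 3166-1 alpha-2 codes. A real table would be much longer.
COUNTRY_TO_ALPHA2 = {
    "united states": "US",
    "usa": "US",
    "united kingdom": "GB",
    "great britain": "GB",
    "netherlands": "NL",
    "the netherlands": "NL",
}

# Hypothetical timestamp formats, tried in order.
TIMESTAMP_FORMATS = ("%Y-%m", "%m/%Y", "%b %Y", "%Y-%m-%d")


def normalize_country(label):
    """Return an alpha-2 code, or None if the label is not recognized."""
    cleaned = label.strip()
    # Assumption: a two-letter value is already an ISO code.
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned.upper()
    return COUNTRY_TO_ALPHA2.get(cleaned.lower())


def normalize_month(raw):
    """Reduce a provider timestamp to a common YYYY-MM reporting month."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")
```

Unrecognized country labels come back as `None` rather than being guessed, so that they can be reviewed by hand instead of silently ending up in the wrong bucket.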
Shifting Formats
Related to the interoperability of usage data between different platforms is the interoperability of usage data within a single platform. As perhaps can be gathered from the above graph, there are three different colors representing OAPEN data, and two colors representing Project MUSE data. This is the result of these platforms changing usage data reporting formats to such an extent that this necessitated a complete change of the data ingest workflow on our end, creating new database tables and data visualization algorithms to process the data.
In the case of OAPEN, the shift from COUNTER 4 to COUNTER 5 reporting (see below) produced a significant disruption of usage data reporting and the loss of country-specific book usage data. Later changes to the data export format necessitated further changes on our end and the creation of new tables and ingest protocols.
In the case of Project MUSE, the transition happened in 2024, and coincided with the introduction of chapter-level usage reporting. A later change in 2024 included the introduction of an additional column in the output spreadsheet for the publisher, but fortunately this did not force us into another major redesign.
It should be said that all of these changes in the specific ways usage data are reported back to the publisher are hardly ever clearly announced or explained, and publishers, one of the end users of these data, are not consulted on this. Nevertheless, changes in reporting have a significant impact on our data processing and visualization workflows, which in turn impacts the ease with which authors, libraries, and others can navigate the data that we make publicly available. For example, exporting Project MUSE usage data to a particular supporting library now requires me to export data from two different, incompatible tables.
Interpretability
Related to the question of inter- and intra-platform interoperability of data is their interpretability.1 What precisely is counted as an “interaction,” “download,” or “view”? In theory, many of the above platforms produce data that are “COUNTER compliant.” The OAPEN Dashboard provides “COUNTER conformant usage statistics”; JSTOR publisher reports for books are “built on the same reporting system as our COUNTER 5 reports” but “technically not COUNTER compliant because the report types JSTOR offers are not mandated by the COUNTER code of practice”; and Project MUSE usage statistics are “compliant with Release 5 of the COUNTER Code of Practice (COP).”
In particular, with regard to what is counted as a proper usage, this means the following:
Usage data collected by content providers for the usage reports to be sent to customers should meet the basic requirement that only intended usage is recorded and that all requests that are not intended by the user are removed.
Yet at the same time, we have seen a significant increase in usage of our books on Project MUSE ever since the 2024 switch to chapter-level reporting, despite the fact that it hosts only part of our back catalog and no new books were added that year. The likely interpretation is that the newly introduced reporting of chapter-level data has artificially inflated the number of interactions with punctum books publications on the Project MUSE platform. Without these reporting systems being open to external auditing, it is difficult to interpret these changes.
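How a change in counting granularity can raise totals without any change in readership can be shown with a small Python sketch. The request log below is invented for illustration, and we do not know how Project MUSE actually computes its figures; the two functions loosely mirror the distinction COUNTER Release 5 draws between item-level requests and unique title requests.

```python
# Hypothetical request log: (session_id, book_id, chapter_id).
requests = [
    ("s1", "book-a", "ch1"),
    ("s1", "book-a", "ch2"),
    ("s1", "book-a", "ch3"),
    ("s2", "book-a", "ch1"),
    ("s2", "book-b", "ch1"),
]


def count_item_requests(log):
    """Count every chapter request toward its book."""
    counts = {}
    for _session, book, _chapter in log:
        counts[book] = counts.get(book, 0) + 1
    return counts


def count_unique_title_requests(log):
    """Count each book at most once per session."""
    seen = {(session, book) for session, book, _chapter in log}
    counts = {}
    for _session, book in seen:
        counts[book] = counts.get(book, 0) + 1
    return counts


print(count_item_requests(requests))          # book-a: 4, book-b: 1
print(count_unique_title_requests(requests))  # book-a: 2, book-b: 1
```

The same two reading sessions yield four "interactions" with book-a under chapter-level counting and two under title-level counting, so a switch from one to the other would change the reported totals even if reader behavior stayed the same.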
Another interesting example is provided by Google Books usage data which, as far as we know, are not COUNTER 5 compliant. Google does not remove “robot and crawler” data from its usage reporting, which as a result appears at times highly inflated. The enormous usage spikes in 2024, for example, can be ascribed to basically a single book in our catalogue, Porno-graphics and Porno-tactics, which in 2022 was featured in a YouTube video from the right-wing hate group Moms for Liberty, generating, as a result, a huge interest in that particular title.
What is more interesting is the spike in “book visits” in January 2025, which we may compare to July 2024, a somewhat ordinary month. These were the top five publications in both months:
Google Books, July 2024 | Book visits |
Porno-graphics & Porno-tactics | 54,794 |
Dotawo 3 | 660 |
Warez | 380 |
Hippolytus | 379 |
The Retro-Futurism of Cuteness | 228 |

Google Books, January 2025 | Book visits |
Critique of Fantasy, Vol. 1 | 15,191 |
Object Oriented Environs | 12,025 |
Matches | 11,468 |
Queer and Bookish | 11,396 |
Sappho | 10,201 |
The only reasonable thing to conclude from these data is that the massive usage of our catalog on Google Books in January 2025, more than half a million “book visits,” has nothing to do with human usage, but everything to do with the crawling of our catalog by AI platform bots, in particular Google’s own Gemini. While we have nothing against nonhumans reading our books, this significantly skews usage data unless their activity is filtered out.
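One rough way to make the contrast between the two months explicit is to compare how concentrated the top five titles are. The Python sketch below uses the figures from the two lists above; the share calculation is only an illustrative heuristic and not a validated method for detecting bots.

```python
# Top-five "book visits" as listed above.
july_2024 = {
    "Porno-graphics & Porno-tactics": 54794,
    "Dotawo 3": 660,
    "Warez": 380,
    "Hippolytus": 379,
    "The Retro-Futurism of Cuteness": 228,
}
january_2025 = {
    "Critique of Fantasy, Vol. 1": 15191,
    "Object Oriented Environs": 12025,
    "Matches": 11468,
    "Queer and Bookish": 11396,
    "Sappho": 10201,
}


def top_share(visits):
    """Share of the listed visits going to the single most-visited title."""
    return max(visits.values()) / sum(visits.values())


print(f"July 2024:    {top_share(july_2024):.2f}")     # about 0.97
print(f"January 2025: {top_share(january_2025):.2f}")  # about 0.25
```

In July 2024 nearly all of the top-five traffic goes to a single title, which fits attention driven by a specific event. In January 2025 the visits are spread almost evenly across unrelated titles, which is consistent with, though not proof of, systematic crawling of the catalog.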
Another example, at the other extreme, is the usage data provided by EBSCO eBooks, which we also assume not to be COUNTER compliant. These data are not automatically generated, and need to be requested by email. The reporting format is not standardized in any way, and shows a host of interaction categories, including “eMail” and “Export to Google Drive,” that are difficult to interpret. Despite a request from our end, no guidance has so far been offered as to how to interpret any of these. Moreover, considering the market share of EBSCO, a significant commercial player, the extent to which their ebook platform appears to be actually used through library discovery systems is surprisingly underwhelming.
What Next?
From the above, it is apparent that making open access ebook usage data both interoperable and interpretable still has a long way to go. As a scholar-led open access press, punctum books is committed to contributing to the development of the open infrastructure necessary to do so.
As one of the pilot partners of the Open Access Ebook Usage Data Trust, we are actively contributing to the development of an open protocol that will allow the secure and reliable transfer of usage data between publishers, funders, libraries, and other end users.
Through the Open Book Futures project, punctum books is engaged in the development and management of Thoth Open Metadata. Thoth is currently working on expanding the usage data visualization tool developed for the OPERAS Metrics Service2 to be integrated into the suite of metadata solutions offered by Thoth, including publisher reports and website widgets. An OPERAS working group has been recently set up including key stakeholders that should hopefully be able to move collectively on unifying open standards and infrastructures.
While the OAeBUDT is still only in its initial stages, the development work around the OPERAS Metrics Service provides a good basis for future visualization tools for book usage data, provided that it can handle the wide variety of data sources and standards currently in place. But for true interoperability all the data sources will need to commit to embracing open standards and a common, practical understanding of what COUNTER 5 compliance actually means, while taking into account the challenges posed by the ever increasing share of nonhuman readers of online open resources.
Footnotes
- Snijder, R. (2023). Measured in a context: making sense of open access book data. Insights the UKSG Journal, 36. https://doi.org/10.1629/uksg.627
  See this recent blog post by OAPEN’s Ronald Snijder on some of the issues.
- Arias, J. (2018). Collecting Inclusive Usage Metrics for Open Access Publications: the HIRMEOS Project. Presented at the 22nd International Conference on Electronic Publishing. https://doi.org/10.4000/proceedings.elpub.2018.11