The rise of the middle author: Investigating collaboration and division of labor in biomedical research using partial alphabetical authorship

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft * E-mail: philippe.mongeon@umontreal.ca Affiliation École de bibliothéconomie et des sciences de l’information, Université de Montréal, Montréal, Québec, Canada

Roles Conceptualization, Writing – original draft Affiliation École de bibliothéconomie et des sciences de l’information, Université de Montréal, Montréal, Québec, Canada

Roles Software Affiliation Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada

Roles Funding acquisition, Project administration, Supervision, Writing – review & editing Affiliations École de bibliothéconomie et des sciences de l’information, Université de Montréal, Montréal, Québec, Canada, Observatoire des Sciences et des Technologies (OST), Centre Interuniversitaire de Recherche sur la Science et la Technologie (CIRST), Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

Philippe Mongeon,
Elise Smith,
Bruno Joyal,
Vincent Larivière

Published: September 14, 2017
https://doi.org/10.1371/journal.pone.0184601

Abstract

Contemporary biomedical research is performed by increasingly large teams. Consequently, an increasingly large number of individuals are being listed as authors in the bylines, which complicates the proper attribution of credit and responsibility to individual authors. Typically, more importance is given to the first and last authors, while it is assumed that the others (the middle authors) have made smaller contributions. However, this may not properly reflect the actual division of labor because some authors other than the first and last may have made major contributions. In practice, research teams may differentiate the main contributors from the rest by using partial alphabetical authorship (i.e., by listing middle authors alphabetically, while maintaining a contribution-based order for more substantial contributions). In this paper, we use partial alphabetical authorship to divide the authors of all biomedical articles in the Web of Science published over the 1980–2015 period into three groups: primary authors, middle authors, and supervisory authors. We operationalize the concept of middle author as those who are listed in alphabetical order in the middle of an authors’ list. Primary and supervisory authors are those listed before and after the alphabetical sequence, respectively. We show that alphabetical ordering of middle authors is frequent in biomedical research, and that the prevalence of this practice is positively correlated with the number of authors in the bylines. We also find that, for articles with 7 or more authors, the average proportion of primary, middle and supervisory authors is independent of the team size, more than half of the authors being middle authors. This suggests that growth in author lists is not due to an increase in secondary contributions (or middle authors) but, rather, to equivalent increases in all types of roles and contributions (including many primary authors and many supervisory authors). Nevertheless, we show that the relative contribution of alphabetically ordered middle authors to the overall production of knowledge in the biomedical field has greatly increased over the last 35 years.

Citation: Mongeon P, Smith E, Joyal B, Larivière V (2017) The rise of the middle author: Investigating collaboration and division of labor in biomedical research using partial alphabetical authorship. PLoS ONE 12(9): e0184601. https://doi.org/10.1371/journal.pone.0184601

Editor: Miguel A. Andrade-Navarro, Johannes Gutenberg Universitat Mainz, GERMANY

Received: December 13, 2016; Accepted: August 14, 2017; Published: September 14, 2017

Copyright: © 2017 Mongeon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Restrictions apply to the availability of the raw bibliometric data, which is used under license from Clarivate Analytics. Readers can contact Clarivate Analytics at the following URL: http://clarivate.com/scientific-and-academic-research/research-discovery/web-of-science/. Data used for Figs 2-5 is available on FIGSHARE: https://doi.org/10.6084/m9.figshare.5363944.

Funding: Funded by Social Sciences and Humanities Research Council of Canada.

Competing interests: Vincent Larivière is currently a member of PLOS ONE's Editorial Board. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Introduction

With the increasing costs, complexity and interdisciplinarity of modern science [1], research collaboration has become the norm [2]. Scientific knowledge is now being produced by increasingly large teams [3,4], often involving researchers from multiple disciplines, institutions and countries [5]. Many funding agencies encourage and facilitate collaboration [6–8] and there is evidence that funded research is indeed more collaborative [9,10]. A growing body of evidence also suggests that collaborative research has more impact and that increasingly large and diverse teams are necessary to achieve greater impact [4].

Larger teams translate into a larger number of authors listed in the byline of scholarly articles. The term ‘team size’ is thus used hereafter to refer to the number of authors on an article. In certain cases, there may be hundreds of authors on a paper, a phenomenon termed ‘hyperauthorship’ [11]. Larger teams, together with the diversity of collaboration types [12], team composition [13], and work division within the team [14], greatly complicate the attribution of credit and responsibility to individual team members [15]. This is an important issue since the advancement of researchers’ careers largely depends on the credit they obtain for their work [16,17]. Because credit is so important, conflicts regarding authorship are becoming commonplace [16,17] and may introduce tensions in the workplace. The growing complexity of credit attribution is also potentially detrimental for the scientific system as a whole, which works best when excellence is properly identified and rewarded [18].

While it may be difficult for an external observer to assess the respective contributions of individual authors of a collaborative work, their position on the byline may be used as a proxy for the extent and nature of their contributions, since names are typically ordered following implicit disciplinary norms [19]. For example, in the biomedical field, as in most lab-based disciplines, authorship order is based on the importance and type of the contribution as well as the hierarchical position within the team or laboratory. Generally, the first and last positions are given the most importance. The first author is a PhD student or a postdoctoral fellow who contributed most to the research, and the last is the lab director [20]. Between the first and last authors are listed an increasingly large number of ‘middle authors’ who typically played a less significant role in the research [14]. Another way to obtain information about individual authors’ contributions to a given work is the contribution statement that many scientific journals (e.g., JAMA, BMJ, the Lancet, NEJM and PLoS) require. These statements are intended to provide information about each individual author’s contribution. However, their value is limited because of significant reporting biases [21,22], and because they address the type of work performed by each author but not the relative value or importance of the work. Nonetheless, several analyses of the relation between authors’ rank on the byline and their reported contributions [e.g., 14,23] confirmed the polarization of ‘core’ contributors towards the first and the last positions of the authors list, while authors who made fewer types of contributions were listed in the middle. Therefore, in this paper we divide the bylines of biomedical articles into three distinct groups using a terminology similar to the one proposed by Baerlocher and colleagues [23]:

Primary authors: main contributors to the experimental work;
Supervisory authors: senior researchers who supervised the research; and
Middle authors: individuals with relatively small contributions to the research who are listed between the primary and supervisory authors.

This raises a difficult question: how can we distinguish primary, middle and supervisory authors? In other words, where does the middle begin and where does it end? Previous bibliometric analyses of biomedical research [24–26] have typically avoided this question by defining the middle authors as all those listed between the first and last position. This poorly reflects reality since it allows only one primary author and only one supervisory author. This is problematic, as collaborative research (especially inter-institutional or interdisciplinary research) is likely to have multiple primary authors, perhaps leading different parts of the experimental work, and also multiple supervisory authors [27]. Thus, the ‘first author + middle authors + last author’ model is an arbitrary division of authors that might unfairly tag as middle authors some researchers who played major roles in the research.

In this paper, we use partial alphabetical authorship as a tool to identify the primary, middle, and supervisory authors in a given team. As Harriet Zuckerman [28] pointed out, listing a subset of authors in alphabetical order creates a clear distinction between those who are listed alphabetically and those who are not. For instance, if an article has twenty authors, and the six main contributors (the first four and the last two) are not listed in alphabetical order, while authors from the fifth to the eighteenth position are, a distinction is made: the sequence of authors in alphabetical order in the middle of the byline serves to distinguish the primary, middle and supervisory authors. In this paper, the term ‘primary authors’ thus refers to those authors appearing before an alphabetical sequence, ‘middle authors’ refers to those listed in the alphabetical sequence, and ‘supervisory authors’ refers to those listed after the alphabetical sequence.

The purpose of this study is to empirically explore the relative contribution of primary authors, middle authors and supervisory authors to research articles in the biomedical field. More specifically we provide answers to the following research questions:

How prevalent are alphabetically ordered middle authors in biomedical research?
What are the proportions of primary, middle and supervisory authors in the articles’ bylines?
How has the overall contribution of middle authors to the biomedical literature evolved over the last 35 years?

Methods

Data

This study is based on all biomedical research and clinical medicine articles published between 1980 and 2015, which were authored by 4 to 100 individuals, and indexed in Clarivate Analytics’ Web of Science (WoS). Access to the WoS data in a relational database format was provided by the Observatoire des sciences et des technologies (http://www.ost.uqam.ca). The discipline of the articles was determined by the NSF classification of the journal in which they are published. Because trends observed were almost identical in the two biomedical disciplines studied (Biomedical Research and Clinical Medicine), they are combined in the results presented below. We identified middle authors using the following three steps: 1) identifying alphabetical sequences, 2) correcting broken sequences, and 3) distinguishing intentional and incidental alphabetical sequences. While we used proprietary WoS data for this study, other investigations could be performed using non-proprietary data such as PubMed, which also provides an extensive coverage of biomedical literature.

Identifying middle authors

We used an approach similar to that of Waltman [29] to detect sequences of authors in alphabetical order by giving each author of a byline an alphabetical rank based on their last name, and then their initials. An alphabetically ordered sequence of authors is formed when a group of consecutive authors are listed in alphabetical order. Consider, for example, an article authored by Wilson, B., Smith, J., Albert, S., Carter, B., Miller, D., Ford, R., and Clark, P.; it includes a group of three authors (Albert, S., Carter, B., and Miller, D.) in alphabetical order starting from the 3rd position and ending at the 5th position.
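The detection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `alphabetical_runs`, the representation of a byline as (last name, initials) pairs, and the default minimum run length of 3 are all assumptions made for this example.

```python
def alphabetical_runs(byline, min_length=3):
    """Return (start, end) 1-based positions of maximal runs of consecutive
    authors listed in ascending alphabetical order.

    `byline` is a list of (last_name, initials) pairs; `min_length` is an
    assumed cut-off below which a run is ignored.
    """
    # Rank authors by last name first, then by initials, ignoring case.
    keys = [(last.casefold(), initials.casefold()) for last, initials in byline]
    runs = []
    start = 0
    for i in range(1, len(keys) + 1):
        # A run ends at the end of the byline or when the order decreases.
        if i == len(keys) or keys[i] < keys[i - 1]:
            if i - start >= min_length:
                runs.append((start + 1, i))
            start = i
    return runs


byline = [("Wilson", "B"), ("Smith", "J"), ("Albert", "S"), ("Carter", "B"),
          ("Miller", "D"), ("Ford", "R"), ("Clark", "P")]
print(alphabetical_runs(byline))  # [(3, 5)]
```

Applied to the example byline, the sketch reports a single run covering positions 3 to 5 (Albert, Carter, Miller).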

Correcting broken sequences

Depending solely on names and initials to identify alphabetical sequences has some limitations. Errors can occur because of special character conversion, compound names and names with prefixes, indexation errors, and human errors in the alphabetical ordering. In our dataset, spaces and hyphens are removed from last names (e.g., van Gogh becomes vanGogh), and special characters are converted into the basic Latin alphabet (e.g., Lübeck becomes Luebeck). Also, the prefixes of Dutch names (e.g., van, von) are not taken into account in the alphabetical ordering. It may also happen that the first of two last names of an author is treated as a second first name during the indexation process. Fig 1 shows an example where authors from the second to the second-to-last position have been ordered alphabetically. However, the sequence breaks at the 10th author (Starr Koslow Mautner) because her last name (Koslow) has been indexed as a second initial. There may also be cases of human error, for example when two names are inverted in a long list of otherwise alphabetically ordered authors. Finally, alphabetical ordering conventions differ by language and country, so different individuals may alphabetically order the same list of names in a different way. These conventions also contain rules regarding alphabetical ordering of special characters, which can create further errors since these characters are no longer present in the indexed names.

Fig 1. Example of a sequence break due to multiple last names.

To reduce as much as possible the occurrence of the errors mentioned above, we concatenated consecutive alphabetically ordered sequences that met one of the following conditions: (R = 8 and X ≤ Y₁ ≤ Z), or (R = 6 and X ≤ Y₂ ≤ Z), or (R = 6 and X ≤ Y₃ ≤ Z), where:

R is the combined length (r) of the alphabetical sequences preceding and following the break.
X is the first letter of the author name before the one causing the break.
Y₁ is the first letter of the author name causing the break.
Y₂ is the first letter of the author name causing the break after removing potential prefixes.
Y₃ is the last initial of the author name causing the break.
Z is the first letter of the author name after the one causing the break.

The value of R is important because the longer the consecutive sequences, the higher the probability that they actually constitute a single sequence that has been broken into two distinct parts. Therefore, to maximize the precision of the alphabetical sequence break detection, we manually verified a random sample of 100 broken alphabetical sequences for different values of R, and we selected the minimum value of R for which the proportion of false positives was 5% or lower. A total of 192,716 broken alphabetical sequences were fixed: 28,779, 77,332 and 86,605 sequences for which the (R = 8 and X ≤ Y₁ ≤ Z), (R = 6 and X ≤ Y₂ ≤ Z), and (R = 6 and X ≤ Y₃ ≤ Z) conditions were met, respectively. The resulting dataset comprises more than 6.7 million articles authored by a total of more than 44 million authors, among which 13 million alphabetical sequences were found.
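The three conditions listed in this paragraph can be written as a small Python predicate. This is only a sketch: the function name `should_merge` and its parameter names are hypothetical, and the stated values of R (8 and 6) are treated here as minimum thresholds, which is an assumption drawn from the description of R as a minimum value rather than something the text states outright.

```python
def should_merge(R, X, Y1, Y2, Y3, Z, strict_r=8, relaxed_r=6):
    """Decide whether two alphabetical sequences separated by one
    out-of-order author should be concatenated.

    R  : combined length of the sequences before and after the break
    X  : first letter of the name before the one causing the break
    Y1 : first letter of the name causing the break
    Y2 : first letter of that name after removing potential prefixes
    Y3 : last initial of that name
    Z  : first letter of the name after the one causing the break

    ASSUMPTION: R is compared with >= (minimum threshold), not ==.
    """
    def fits(letter):
        # The candidate letter must sit between its neighbours' first letters.
        return X.upper() <= letter.upper() <= Z.upper()

    return (
        (R >= strict_r and fits(Y1))
        or (R >= relaxed_r and fits(Y2))
        or (R >= relaxed_r and fits(Y3))
    )


# Hypothetical break: neighbours start with J and L, the offending name is
# indexed under M but its last initial is K.
print(should_merge(R=12, X="J", Y1="M", Y2="M", Y3="K", Z="L"))  # True
```

Taking letters rather than full names as input keeps the sketch independent of how prefixes and initials are extracted, which the text does not specify in detail.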

Probability of intentional vs. chance alphabetical order

There is always a possibility that a given sequence of authors in alphabetical order results from pure chance and is not intentional. Distinguishing intentional and chance alphabetical order is crucial since alphabetical sequences that occur randomly cannot be used to distinguish middle authors from the others. Thus, for each sequence found, we calculated P_i, which is the probability that the authors are intentionally listed in alphabetical order, and the complement of the probability P_c that authors are listed alphabetically by chance (P_i = 1 − P_c). P_i is determined by two variables: the number of authors in the sequence (r) and the team size (N). For example, there are 3,628,800 possible orderings of N = 10 authors, out of which 156,002 contain an alphabetical sequence of r = 5 authors. Thus, a sequence of r = 5 has a 156,002/3,628,800 = 4.3% probability of occurring by chance (P_c), and therefore a 95.7% chance of being intentional (P_i). Fig 2 shows the relation between N and P_i for different values of r. We see that for short alphabetical sequences of 3 or 4 authors P_i increases rapidly as the byline gets longer, while the P_i of sequences of 6 and 7 remains very high, even for articles with up to 100 authors. More details on the calculation of P_i and P_c can be found in the supporting information (S1 File).
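For readers who want to experiment with this kind of calculation, the following Python sketch estimates P_c under two explicit assumptions: every ordering of the N names is equally likely, and P_c is the share of orderings that contain at least one run of r or more consecutive names in ascending order. The authors' own procedure is described in S1 File and may differ in detail; the function names here are hypothetical.

```python
from fractions import Fraction
from math import factorial


def count_orderings_with_run(n, r):
    """Count orderings of n distinct names that contain at least one run of
    r or more consecutive names in ascending (alphabetical) order."""
    if r <= 1:
        return factorial(n)
    if r > n:
        return 0
    # avoid[(rank, run)] = number of prefixes with no run of length >= r,
    # where `rank` is the 0-based relative rank of the last name among the
    # names placed so far and `run` is the ascending run ending at it.
    avoid = {(0, 1): 1}
    for size in range(1, n):
        nxt = {}
        for (rank, run), ways in avoid.items():
            for new_rank in range(size + 1):
                # The new name extends the run only if it ranks above the last one.
                new_run = run + 1 if new_rank > rank else 1
                if new_run >= r:
                    continue  # this prefix now contains a forbidden run
                key = (new_rank, new_run)
                nxt[key] = nxt.get(key, 0) + ways
        avoid = nxt
    return factorial(n) - sum(avoid.values())


def chance_probability(n, r):
    """P_c under the assumptions above; P_i is then 1 - P_c."""
    return Fraction(count_orderings_with_run(n, r), factorial(n))


p_c = chance_probability(10, 5)
print(float(p_c), float(1 - p_c))
```

For N = 10 and r = 5 the sketch gives a P_c of about 4.3%, in line with the percentage quoted above. Because it counts orderings with dynamic programming rather than enumerating them, it remains practical for bylines of up to 100 authors.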