The abstract of the work, as an inverted index, which encodes information about the abstract's words and
their positions within the text. Like Microsoft Academic Graph, OpenAlex doesn't include plaintext abstracts due to legal constraints.
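Because the abstract ships as an inverted index, you have to reconstruct the plaintext yourself. Here's a minimal sketch in Python (the decoding logic follows from the format described above; no OpenAlex-specific tooling is assumed):

```python
def decode_abstract(inverted_index):
    """Rebuild a plaintext abstract from its inverted index.

    The inverted index maps each word to the list of positions
    at which it appears in the abstract.
    """
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word
    # Reassemble the words in positional order.
    return " ".join(positions[i] for i in sorted(positions))

# Toy example:
# decode_abstract({"Despite": [0], "growth": [2], "recent": [1]})
# -> "Despite recent growth"
```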
Information about the paid APC (article processing charge) for this work. The object contains:
value: Integer
currency: String
provenance: String — currently either openapc or doaj, but more will be added; see below for details.
value_usd: Integer — the APC converted into USD
You can find the listed APC price (when we know it) for a given work using apc_list. However, authors don't always pay the listed price;
often they get a discounted price from publishers. So it's useful to know the APC actually paid by authors, as distinct from the list price. This field is our effort to provide that.
Our best source for the actually paid price is the OpenAPC project. Where available, we use that data, and so apc_paid.provenance is openapc.
Where OpenAPC data is unavailable (and unfortunately this is common), we make our best guess by assuming the author paid the APC list price, and apc_paid.provenance is set to the source we got the list price from.
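As a sketch of how you might read this field (assuming the standard https://api.openalex.org/works endpoint and the requests library; the work ID is just an example):

```python
import requests

# Fetch a single work from the OpenAlex API.
work = requests.get("https://api.openalex.org/works/W2741809807").json()

apc_paid = work.get("apc_paid")
if apc_paid:
    # provenance is "openapc" when the price comes from the OpenAPC
    # project; otherwise it names wherever we got the list price.
    print(apc_paid["value"], apc_paid["currency"], apc_paid["provenance"])
    print("in USD:", apc_paid["value_usd"])
```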
Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you'll get fun values like "Spring" and "Inside cover."
List of dehydrated Concept objects.
Each Concept object in the list also has one additional property:
score (Float): The strength of the connection between the work and this concept (higher is stronger). This number is produced by AWS SageMaker, in the last layer of the machine learning model that assigns concepts.
Concepts with a score of at least 0.3 are assigned to the work. However, ancestors of an assigned concept are also added to the work, even if the ancestor scores are below 0.3.
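For instance, to separate directly assigned concepts from ancestors added for completeness, you can compare each score against the 0.3 cutoff described above (a sketch using the requests library; display_name is the standard label field on dehydrated Concept objects):

```python
import requests

work = requests.get("https://api.openalex.org/works/W2741809807").json()

for concept in work["concepts"]:
    # Scores of at least 0.3 mean the model assigned the concept directly;
    # lower scores mean it was added as an ancestor of an assigned concept.
    origin = "assigned" if concept["score"] >= 0.3 else "ancestor"
    print(f'{concept["display_name"]}: {concept["score"]:.2f} ({origin})')
```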
Works.cited_by_count for each of the last ten years, binned by year. To put it another way: for each of the last ten years, you can see how many times this work was cited in that year.
Citations more than ten years old aren't included. Years with zero citations are omitted, so you'll need to add those back in if you need them.
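If you need a continuous series, you can restore the missing zero-count years yourself. A minimal sketch (each counts_by_year entry carries year and cited_by_count keys):

```python
def fill_missing_years(counts_by_year, start_year, end_year):
    """Return a {year: citation_count} dict with zero-count years restored."""
    counts = {row["year"]: row["cited_by_count"] for row in counts_by_year}
    return {y: counts.get(y, 0) for y in range(start_year, end_year + 1)}

# e.g. fill_missing_years(work["counts_by_year"], 2015, 2024)
```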
The DOI for the work. This is the Canonical External ID for works.
Occasionally, a work has more than one DOI: for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version.
However, this field always has just one DOI, the DOI for the published work.
All the external identifiers that we know about for this work. IDs are expressed as URIs whenever possible. Possible ID types:
doi (String: The DOI. Same as Work.doi)
mag (Integer: The Microsoft Academic Graph ID)
openalex (String: The OpenAlex ID. Same as Work.id)
pmid (String: The PubMed identifier)
pmcid (String: The PubMed Central identifier)
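Not every work has every ID, so treat each key as optional when you read this object. For illustration, here's roughly what the ids object for a work might look like (values shown are examples):

```python
# Example ids object; absent ID types are simply missing, not null.
ids = {
    "openalex": "https://openalex.org/W2741809807",
    "doi": "https://doi.org/10.7717/peerj.4375",
    "mag": 2741809807,
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/29456894",
}

# Use .get() so missing ID types don't raise a KeyError.
pmcid = ids.get("pmcid")  # None here; this work has no pmcid entry
```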
True if we think this work is paratext.
In our context, paratext is stuff that's in a scholarly venue (like a journal) but is about the venue rather than being a scholarly work properly speaking. For example, a journal's front cover, back cover, or table of contents is paratext; a research article published in that journal is not.
True if we know this work has been retracted.
This field has high precision but low recall. In other words, if is_retracted is true, the article is definitely retracted.
But if is_retracted is false, the article might still be retracted; we just don't know. This is because, unfortunately, the open sources for retraction data aren't currently very comprehensive, and the more comprehensive ones aren't sufficiently open for us to use here.
The language of the work, in ISO 639-1 format. The language is automatically detected using the information we have about the work. We use the langdetect software library on the words in the work's abstract, or on the title if we do not have the abstract. The source code for this procedure is here. Keep in mind that this method is not perfect, and that in some cases the language of the title or abstract can differ from that of the body of the work. A rough sketch of the approach appears after the list below.
A few things to keep in mind about this:
We don't assign a language if we do not have enough words available to make an accurate guess.
We report the language of the metadata, not the full text. For example, if a work is in French, but the title and abstract are in English, we report the language as English.
In some cases, abstracts are in two different languages. Unfortunately, when this happens, what we report will not be accurate.
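The sketch promised above: roughly how the detection works, using the langdetect library. This approximates the procedure rather than reproducing the production code:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def guess_language(title, abstract):
    """Guess an ISO 639-1 code from a work's metadata.

    Prefers the abstract, falls back to the title, and returns
    None when there isn't enough text for a confident guess.
    """
    text = abstract or title
    if not text:
        return None
    try:
        return detect(text)  # e.g. "en", "fr"
    except LangDetectException:
        return None
```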
The license applied to this work at this host. Most toll-access works don't have an explicit license (they're under "all rights reserved" copyright), so this field generally has content only if is_oa is true.
This URL lists the groups of words and phrases (n-grams) that make up the work, as obtained from the Internet Archive. See The Ngram object and Get N-grams for background on n-grams, how we use them, and what this API call returns.
A Location object with the primary location of this work.
The primary_location is where you can find the best (closest to the version of record) copy of this work. For a peer-reviewed journal article,
this would be a full text published version, hosted by the publisher at the article's DOI URL.
The day when this work was published, formatted as an ISO 8601 date.
Where different publication dates exist, we select the earliest available date of electronic publication.
This date applies to the version found at Work.url. The other versions, found in Work.locations, may have been published at different (earlier) dates.
The year this work was published.
This year applies to the version found at Work.url. The other versions, found in Work.locations, may have been published in different (earlier) years.
OpenAlex IDs for works related to this work. Related works are computed algorithmically; the algorithm finds recent papers with the most concepts in common with the current paper.
The United Nations' 17 Sustainable Development Goals are a collection of goals at the heart of a global "shared blueprint for peace and prosperity for people and the planet." We tag works with their relevance to these goals using the OpenAlex SDG Classifier, an mBERT machine learning model developed by the Aurora Universities Network and trained on Elsevier data. The score represents the model's predicted probability that the work is relevant to a particular goal.
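As a sketch, here is how you might keep only the goals the model considers most relevant (assuming each entry in the work's sustainable_development_goals list carries display_name and score keys; the 0.5 cutoff is arbitrary):

```python
import requests

work = requests.get("https://api.openalex.org/works/W2741809807").json()

# Keep only goals whose predicted relevance clears an arbitrary threshold.
relevant = [
    sdg for sdg in work.get("sustainable_development_goals", [])
    if sdg["score"] > 0.5
]
for sdg in relevant:
    print(f'{sdg["display_name"]}: {sdg["score"]:.2f}')
```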
How often the ngram occurred in the work.
Caution: This data was taken directly from the General Index, and we have not tested term_frequency against actual articles.
You can read about their data extraction process on the Internet Archive website. If you compare term_frequency against actual articles, we would like to hear how it went!
The type of the work.
You can see all of the different types along with their counts in the OpenAlex API here: https://api.openalex.org/works?group_by=type.
Most works are type article. This includes what was formerly (and currently in type_crossref) labeled as journal-article, proceedings-article, and posted-content. We consider all of these to be article type works, and the distinctions between them to be more about where they are published or hosted:
Journal articles will have a primary_location.source.type of journal
Conference proceedings will have a primary_location.source.type of conference
Preprints or "posted content" will have a primary_location.version of submittedVersion
(Note that distinguishing between journals and conferences is a hard problem, one we often get wrong. We are working on improving this, but we also point out that the two have a lot of overlap in terms of their roles as hosts of research publications.)
So, here is how you can filter for only non-preprint articles:
https://api.openalex.org/works?filter=type:article,primary_location.version:!submittedVersion
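Or, equivalently, from Python (a sketch using the requests library; only the first page of results is fetched):

```python
import requests

params = {
    "filter": "type:article,primary_location.version:!submittedVersion",
    "per-page": 25,  # results are paged; this fetches the first page only
}
response = requests.get("https://api.openalex.org/works", params=params).json()

print(response["meta"]["count"], "matching works")
for work in response["results"]:
    print(work["id"], work["display_name"])
```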
Works that represent stuff that is about the venue (such as a journal)—rather than a scholarly work properly speaking—have type paratext. These include things like front-covers, back-covers, tables of contents, and the journal itself (e.g., https://openalex.org/W4232230324).
We also have types for letter, editorial, and erratum (corrections). Coverage is low on these but will improve.
Other work types follow the Crossref "type" controlled vocabulary—see type_crossref.
Legacy type information, using Crossref's "type" controlled vocabulary.
These are the work types that we used to use, before switching to our current system (see type).
You can see all possible values of Crossref's "type" controlled vocabulary via the Crossref api here: https://api.crossref.org/types.
Where possible, we just pass along Crossref's type value for each work. When that's impossible (e.g., the work isn't in Crossref), we do our best to figure out the type ourselves.
The last time anything in this Work object changed, expressed as an ISO 8601 date string (in UTC). This date is updated for any change at all, including increases in various counts.