
feat: published data schema update #2649

Open
nitrosx wants to merge 11 commits into master from published_data_schema_update

@nitrosx
Member

@nitrosx nitrosx commented Apr 5, 2026

Description

This PR adds fields to store proposals and samples in the published data record.

Motivation

Some data policies require that a DOI be assigned to the whole proposal as soon as it is approved. This PR makes it possible to assign and register a new DOI as soon as a proposal is accepted and saved in SciCat.
Also, increasingly, a DOI can be registered for a non-homogeneous group of records, which can contain proposals, samples and datasets alike.

Changes:

  • published data schema

Tests included

N/A

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

@nitrosx nitrosx requested a review from a team as a code owner April 5, 2026 20:46
Member

@HayenNico HayenNico left a comment

The code looks fine, but I think we should not make the datasetPids field optional at this time. The DOI registration at DataCite is currently hardcoded to use Dataset as the resourceTypeGeneral (see e.g. here in the v4 controller). This is not suitable if a PublishedData instance were to include only samples or proposals.
It would be fine to add links to sample and proposal documents under a publishedData instance so long as these do not have their own DOIs (standalone sample DOI publications should for example use IGSNs).

@nitrosx
Member Author

nitrosx commented Apr 14, 2026

The code looks fine, but I think we should not make the datasetPids field optional at this time. The DOI registration at DataCite is currently hardcoded to use Dataset as the resourceTypeGeneral (see e.g. here in the v4 controller). This is not suitable if a PublishedData instance were to include only samples or proposals. It would be fine to add links to sample and proposal documents under a publishedData instance so long as these do not have their own DOIs (standalone sample DOI publications should for example use IGSNs).

I'm fine with keeping datasetPids mandatory for now, but under our data policy we will have registered published data containing only a proposal until the datasets acquired under that proposal become public. I assume we would need to update the logic for resourceTypeGeneral if we want to accommodate this use case, or else register a second published data record once the datasets are public.
Thoughts?

@nitrosx
Member Author

nitrosx commented Apr 14, 2026

@HayenNico which resourceTypeGeneral would you pick for published data with proposal, datasets and potentially samples? Collection?
You can find the list here

@nitrosx nitrosx requested a review from HayenNico April 14, 2026 08:00
@Junjiequan
Member

From what I found in the list, they have a sample-ish (PhysicalObject) and a proposal-ish (Project) resource type.

@HayenNico
Member

@nitrosx @Junjiequan Which resourceTypeGeneral you choose depends a lot on what the main content of the resource you're publishing is. I think that for publishedData documents this should always be Dataset, that's what the name and schema communicate.
And your use case would to my understanding still ultimately be a publication of all datasets belonging to a proposal - just with the added requirement that the datasets do not exist at the point of publishedData creation, correct? Bundling the proposal with that data is not an issue in my opinion, the main resource is still a Dataset. Collection could be a suitable fallback, but you lose a layer of granularity and semantic annotation.

My concern is that the schema in this form could be used to register publishedData with no data. Standalone publications of samples or proposals should probably live in their own collections (e.g. publishedSample, publishedProposal) if we want to allow that. Samples should generally be PhysicalObject, for proposals StudyRegistration fits best.
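
To make the options above concrete, here is a minimal sketch of how a resourceTypeGeneral could be derived from what a record links to. It is not what SciCat currently does (the type is hardcoded to Dataset, as noted earlier). The interface, the function name `pickResourceTypeGeneral` and the precedence rule are all assumptions for illustration; only the four type names come from the DataCite vocabulary discussed in this thread.

```typescript
// Hypothetical shape: only the three identifier arrays discussed in this PR.
interface PublishedDataLinks {
  datasetPids?: string[];
  proposalIds?: string[];
  sampleIds?: string[];
}

// Subset of the DataCite resourceTypeGeneral vocabulary named in this thread.
type ResourceTypeGeneral =
  | "Dataset"
  | "PhysicalObject"
  | "StudyRegistration"
  | "Collection";

// Assumed rule, not SciCat's behaviour: datasets always win, a record with a
// single kind of item gets the matching type, a sample/proposal mix falls
// back to Collection, and an empty record gets no type at all.
function pickResourceTypeGeneral(
  links: PublishedDataLinks,
): ResourceTypeGeneral | undefined {
  const hasDatasets = (links.datasetPids ?? []).length > 0;
  const hasProposals = (links.proposalIds ?? []).length > 0;
  const hasSamples = (links.sampleIds ?? []).length > 0;

  if (hasDatasets) return "Dataset";
  if (hasSamples && !hasProposals) return "PhysicalObject";
  if (hasProposals && !hasSamples) return "StudyRegistration";
  if (hasSamples && hasProposals) return "Collection";
  return undefined; // nothing linked: no sensible type to register
}
```

With this rule, a proposal bundled with its datasets stays a Dataset, which matches the view that the main resource is still the data.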

Perhaps an easy workaround to support your use case would be to make datasetPids optional, but throw an error when trying to register a DOI for a publishedData instance with an empty list there? Or does your data policy explicitly require the DOI to be registered with DataCite immediately?

@nitrosx
Member Author

nitrosx commented Apr 14, 2026

@HayenNico great points. Let's discuss at the meeting.

As things stand now, we would like to create a published data record and register it as soon as the proposal is accepted.
Datasets will be added when they become public (end of embargo period or manually published). So the end objective is to have a published data record that contains the datasets, the related proposal and maybe samples.

Regarding published data, I have always interpreted it with a wider meaning: it is public and registered, whatever it contains. It can be a mix of datasets, proposals or samples.

I like your suggestion that we can make datasets, proposals and samples optional with the condition that when I make them public and register the published data record, the combined list of the three has to contain at least one element.
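
A minimal sketch of that condition, assuming the check runs at registration time rather than at creation. The interface and the names `countLinkedItems` and `assertRegistrable` are hypothetical and not part of the SciCat codebase.

```typescript
// Hypothetical shape: the three optional identifier arrays from this PR.
interface PublishedDataLinks {
  datasetPids?: string[];
  proposalIds?: string[];
  sampleIds?: string[];
}

// Total number of linked datasets, proposals and samples.
function countLinkedItems(links: PublishedDataLinks): number {
  return (
    (links.datasetPids ?? []).length +
    (links.proposalIds ?? []).length +
    (links.sampleIds ?? []).length
  );
}

// Guard to call before registering the DOI: all three arrays may be empty
// or missing while the record is a draft, but not when it is registered.
function assertRegistrable(links: PublishedDataLinks): void {
  if (countLinkedItems(links) === 0) {
    throw new Error(
      "Cannot register published data without at least one dataset, proposal or sample",
    );
  }
}
```

The stricter variant suggested earlier in the thread would test only `datasetPids` instead of the combined count.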

@HayenNico
Member

@nitrosx Following up on the discussion at the meeting: For your use case, I think you'd want resourceTypeGeneral to be Dataset for the final result, the question for me is what should be done in the intermediate stage where you only have the proposal.

DataCite supports different status levels for registered DOIs: Draft (DOI registered at DataCite, but not the global Handle system), Registered (registered globally, but not discoverable via DataCite search) and Findable (the published state). The current implementation in SciCat only uses the published (Findable) status; maybe adding some form of pre-registration with the Registered status would make sense? This is assuming that the initial publishedData with just the proposal is not really something you want incorporated in downstream applications like knowledge graphs (which depend on correct resource annotation with resourceTypeGeneral to be useful).
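
As a rough sketch of what such a pre-registration step would have to respect: the DataCite REST API names the states `draft`, `registered` and `findable`, and moves between them with the events `register`, `publish` and `hide`. The transition table below covers only those moves. The helper names and the example DOI prefix `10.1234` are hypothetical, and nothing here reflects how SciCat currently talks to DataCite.

```typescript
type DoiState = "draft" | "registered" | "findable";
type DoiEvent = "register" | "publish" | "hide";

// Allowed moves: a DOI never returns to draft once it has left it.
const transitions: Record<DoiState, Partial<Record<DoiEvent, DoiState>>> = {
  draft: { register: "registered", publish: "findable" },
  registered: { publish: "findable" },
  findable: { hide: "registered" },
};

// Returns the state reached by applying an event, or throws if the
// event is not listed for the current state.
function nextState(current: DoiState, event: DoiEvent): DoiState {
  const target = transitions[current][event];
  if (target === undefined) {
    throw new Error(`Event "${event}" is not allowed in state "${current}"`);
  }
  return target;
}

// JSON:API style body carrying the event for a DOI update request.
function buildEventPayload(doi: string, event: DoiEvent) {
  return { data: { type: "dois", attributes: { doi, event } } };
}

// Proposal accepted: pre-register. Datasets public later: publish.
const afterAcceptance = nextState("draft", "register");
const afterRelease = nextState(afterAcceptance, "publish");
const payload = buildEventPayload("10.1234/example", "register");
```

In this picture the proposal-only record would sit in `registered` until the first datasets become public.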

We need to have a look at the publishedData lifecycle in any case - as far as I remember, the datasetPids list is supposed to get locked down and be uneditable after DOI registration, which would block your workflow. Idk if that was actually implemented, but will need to be revisited if so.

@HayenNico
Member

HayenNico commented Apr 15, 2026

As an aside:
For samples specifically, we use standalone persistent identifiers called IGSNs in DAPHNE - these are DataCite DOIs with some narrower restrictions. The two main ones: resourceTypeGeneral must be PhysicalObject, and IGSNs must have a dedicated DOI-prefix space not used for other resources.
I had been considering an integration of IGSN with SciCat, and due to the second restriction they would have to be a separate entity publishedSample since they can't share the same DOI-prefix as publishedData. An additional aspect is that we'd want a separate DataCite schema configuration, since a site's standard descriptions of a sample and a dataset would most likely have different requirements.
Given the size of this project, I hadn't brought it up yet since I can't realistically dedicate enough time.
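
For illustration, the two IGSN restrictions mentioned above could be checked along these lines. This is only a sketch: `DoiCandidate`, `doiPrefix`, `checkIgsnConstraints` and the prefixes in the usage lines are hypothetical, and a real integration would need much more than this.

```typescript
// Hypothetical minimal description of a DOI about to be registered.
interface DoiCandidate {
  doi: string; // e.g. "10.1234/abc", prefix before the first slash
  resourceTypeGeneral: string;
}

// Extracts the DOI prefix (the part before the first "/").
function doiPrefix(doi: string): string {
  const slash = doi.indexOf("/");
  return slash === -1 ? doi : doi.slice(0, slash);
}

// Returns a list of violations of the two restrictions described above;
// an empty list means the candidate passes both checks.
function checkIgsnConstraints(
  candidate: DoiCandidate,
  igsnPrefix: string,
): string[] {
  const problems: string[] = [];
  if (candidate.resourceTypeGeneral !== "PhysicalObject") {
    problems.push("resourceTypeGeneral must be PhysicalObject");
  }
  if (doiPrefix(candidate.doi) !== igsnPrefix) {
    problems.push(`DOI must use the dedicated IGSN prefix ${igsnPrefix}`);
  }
  return problems;
}

// Usage with made-up prefixes: the first passes, the second fails twice.
const okSample = checkIgsnConstraints(
  { doi: "10.9999/sample-1", resourceTypeGeneral: "PhysicalObject" },
  "10.9999",
);
const badSample = checkIgsnConstraints(
  { doi: "10.1234/sample-1", resourceTypeGeneral: "Dataset" },
  "10.9999",
);
```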

@nitrosx
Member Author

nitrosx commented Apr 15, 2026

@HayenNico maybe we should do a brainstorming session and continue the discussion somewhere else.
Can you review this PR and see if your comments have been addressed?

Member

@HayenNico HayenNico left a comment

@nitrosx There's a few spots with minor inconsistencies for the handling of the different pid arrays. Looking good otherwise.
I'll pre-approve since the fixes are small.

" are part of the published data record.",
})
- @Prop({ type: [String], required: true })
+ @Prop({ type: [String], required: false })
Member

Should be required: true here to match ApiProperty information

" are part of this published data record.",
})
@Prop({ type: [String], required: false })
proposalIds: string[];
Member

Should be proposalIds?: string[] if optional;
could also make this required for the schema with a default empty array
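
The two options can be compared in plain TypeScript, without the NestJS or Mongoose decorators so the snippet runs on its own. The interface names and `withDefaults` are hypothetical; in the real schema the default would be configured on the field itself.

```typescript
// Option 1: the arrays are optional in the type, so every reader
// has to handle `undefined`.
interface PublishedDataInput {
  datasetPids: string[];
  proposalIds?: string[];
  sampleIds?: string[];
}

// Option 2: the stored document always carries all three arrays,
// with an empty array standing in for "none".
interface PublishedDataStored {
  datasetPids: string[];
  proposalIds: string[];
  sampleIds: string[];
}

// Fills in the empty-array defaults when a record is created.
function withDefaults(input: PublishedDataInput): PublishedDataStored {
  return {
    datasetPids: input.datasetPids,
    proposalIds: input.proposalIds ?? [],
    sampleIds: input.sampleIds ?? [],
  };
}

// Consumers of the stored form can iterate without undefined checks.
const stored = withDefaults({ datasetPids: ["pid-1"] });
const linkedCount =
  stored.datasetPids.length + stored.proposalIds.length + stored.sampleIds.length;
```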

" are part of this published data record.",
})
@Prop({ type: [String], required: false })
sampleIds: string[];
Member

Same as proposalIds

*/
@ApiProperty({
type: [String],
required: false,
Member

Should go back to required: true to match schema


3 participants