Improve validate processor #171

`DF.validate()` performs some basic checks but does not validate everything that Table Schema allows. In particular, it does not validate primary keys, and we have noted that this causes other bugs that are currently untraced (e.g. loading from a package with invalid primary keys and then dumping it again produces an incomplete package).

We need to explore one of:

- Support more features of Table Schema when validating rows, by enhancing the existing validator
- Use the resource validator in Frictionless (e.g. the primary key check here: https://github.com/frictionlessdata/frictionless-py/blob/1d8cc6cf2ad2521963fa82da8a78f368de4d1fd1/frictionless/resource.py#L934)
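To make the first option more concrete, here is a minimal sketch of a primary key uniqueness check written as a row generator. It is not the existing Dataflows validator and not the Frictionless implementation; `check_primary_key` and `DuplicatePrimaryKeyError` are hypothetical names, rows are assumed to be plain dicts, and the seen keys are held in an in-memory set for simplicity. Handling of missing or null key values is deliberately left out.

```python
from typing import Any, Dict, Iterable, Iterator, List, Set, Tuple, Union


class DuplicatePrimaryKeyError(ValueError):
    """Hypothetical error raised when two rows share a primary key."""


def check_primary_key(
    rows: Iterable[Dict[str, Any]],
    primary_key: Union[str, List[str]],
) -> Iterator[Dict[str, Any]]:
    """Yield rows unchanged, raising on the first duplicate primary key."""
    # Table Schema allows `primaryKey` to be a single field name or a list.
    fields = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    seen: Set[Tuple[Any, ...]] = set()  # in-memory state, for illustration only
    for index, row in enumerate(rows, start=1):
        key = tuple(row.get(field) for field in fields)
        if key in seen:
            raise DuplicatePrimaryKeyError(
                f"row {index}: duplicate primary key {key!r} for fields {fields}"
            )
        seen.add(key)
        yield row


if __name__ == "__main__":
    sample = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 1, "name": "c"}]
    try:
        for _ in check_primary_key(sample, "id"):
            pass
    except DuplicatePrimaryKeyError as error:
        print(error)
```

Because the check is a generator, it can wrap any row iterator without materializing the whole resource, at the cost of keeping every key seen so far.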

The problem with adopting Frictionless is that, AFAIK, it can't be adopted incrementally: the validation is built into the `Resource` class, and I can't tell just from reading the code where that leads (whether and how it complicates our code when we use different libraries for managing Frictionless Data specs). It also keeps state in memory (the data seen for primary keys and foreign keys), and I guess, based on other patterns in Dataflows, we would want to store that data outside the running Python process (e.g. using https://github.com/akariv/kvfile).
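As a rough illustration of keeping the seen primary keys outside process memory, the sketch below stores them in an on-disk SQLite file. The standard library `sqlite3` module is used here only as a stand-in for kvfile, whose API is not shown in this issue; `DiskBackedKeySet` is a hypothetical name, and encoding keys as JSON is an assumption made for the example.

```python
import json
import os
import sqlite3
import tempfile
from typing import Any, Optional, Sequence


class DiskBackedKeySet:
    """Hypothetical set of seen primary keys stored in an SQLite file."""

    def __init__(self, path: Optional[str] = None) -> None:
        # Only delete the file on close if this object created it.
        self._owns_file = path is None
        if path is None:
            handle, path = tempfile.mkstemp(suffix=".sqlite")
            os.close(handle)
        self.path = path
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")

    def add(self, key: Sequence[Any]) -> bool:
        """Record a key; return True if it is new, False if already seen."""
        # JSON keeps 1 and "1" distinct; default=str covers dates and similar.
        encoded = json.dumps(list(key), default=str)
        cursor = self._db.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (encoded,)
        )
        return cursor.rowcount == 1

    def close(self) -> None:
        self._db.close()
        if self._owns_file and os.path.exists(self.path):
            os.remove(self.path)


if __name__ == "__main__":
    keys = DiskBackedKeySet()
    for key in [(1, "a"), (2, "b"), (1, "a")]:
        if not keys.add(key):
            print(f"duplicate primary key: {key!r}")
    keys.close()
```

The same interface could be backed by kvfile or another key-value store; the point is only that the duplicate check needs a single "add and report whether it was new" operation, so the storage choice stays isolated from the validation logic.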
