
# Dataset Management Patterns

> **Safety Notice:** This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.


Install the "dataset-management" skill with:

```
npx skills add jediv/dataiku-chat-control/jediv-dataiku-chat-control-dataset-management
```


Reference patterns for creating and managing Dataiku datasets via the Python API.

## Dataset Types

| Type | Use When | Creation Method |
|---|---|---|
| Managed | Output of recipes, stored in a connection (SQL, HDFS, etc.) | `project.new_managed_dataset(name)` |
| Uploaded | Importing local files (CSV, Excel, etc.) | `project.create_upload_dataset(name)` or `project.create_dataset(name, "UploadedFiles", ...)` |
| SQL Table | Pointing to an existing database table | `project.create_dataset(name, "Snowflake", ...)` |
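The SQL Table row elides the `params` argument. As a minimal sketch, the dict below assumes the common `connection` / `schema` / `table` keys; the exact key names vary by connector type, so verify them against your Dataiku connector's documentation before relying on this:

```python
def sql_table_params(connection, schema, table):
    """Build a params dict for a dataset pointing at an existing SQL table.

    The key names here are an assumption based on common SQL connector
    params; check your connector type for the exact spelling.
    """
    return {"connection": connection, "schema": schema, "table": table}


def create_snowflake_table_dataset(project, name, connection, schema, table):
    # "Snowflake" is the dataset type string; params identify the existing table
    return project.create_dataset(
        name, "Snowflake", params=sql_table_params(connection, schema, table)
    )
```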

## Create a Managed Dataset

```python
builder = project.new_managed_dataset("MY_OUTPUT")
builder.with_store_into("connection_name")
ds = builder.create()
```

### Configure the table location (SQL databases)

```python
settings = ds.get_settings()
raw = settings.get_raw()
raw["params"]["schema"] = "MY_SCHEMA"
raw["params"]["table"] = "MY_OUTPUT"
settings.save()
```

## Upload a File

```python
ds = project.create_dataset(
    "my_dataset",
    "UploadedFiles",
    params={"uploadConnection": "filesystem_managed"},
)

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")

# Auto-detect schema from file contents
settings = ds.autodetect_settings(infer_storage_types=True)
settings.save()
```

**Simpler alternative:** use `create_upload_dataset` to skip the manual params configuration:

```python
ds = project.create_upload_dataset("my_dataset")

with open("path/to/data.csv", "rb") as f:
    ds.uploaded_add_file(f, "data.csv")
```

## Common Column Types

| Dataiku Type | Description |
|---|---|
| `string` | Text |
| `int` / `bigint` | Integer / large integer |
| `double` / `float` | Decimal numbers |
| `boolean` | True/false |
| `date` | Date only |

See references/column-types.md for the full type table.
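When the source data is a pandas DataFrame, the types above can be derived from its dtypes. A minimal sketch, assuming a simple dtype-name mapping (the mapping choices, and the `PANDAS_TO_DATAIKU` / `schema_from_dtypes` names, are this sketch's own, not part of the Dataiku API):

```python
# Map common pandas dtype names to the Dataiku storage types above.
# This mapping is an assumption; adjust it to your data and connector.
PANDAS_TO_DATAIKU = {
    "object": "string",
    "int64": "bigint",
    "int32": "int",
    "float64": "double",
    "float32": "float",
    "bool": "boolean",
    "datetime64[ns]": "date",
}


def schema_from_dtypes(dtypes):
    """Build a Dataiku schema dict from {column_name: dtype_name} pairs,
    defaulting unknown dtypes to string."""
    return {"columns": [
        {"name": name, "type": PANDAS_TO_DATAIKU.get(dtype, "string")}
        for name, dtype in dtypes.items()
    ]}
```

The result can be passed straight to `settings.set_schema(...)` as shown in the schema operations below.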

## Core Schema Operations

### Get Schema

```python
ds = project.get_dataset("my_dataset")
schema = ds.get_settings().get_schema()
for col in schema["columns"]:
    print(f"{col['name']}: {col['type']}")
```

### Set Schema

```python
settings = ds.get_settings()
settings.set_schema({"columns": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
]})
settings.save()
```

### Auto-detect Schema

```python
settings = ds.autodetect_settings()
settings.save()
```

Note: `autodetect_settings()` is a method on `DSSDataset`, not on `DSSDatasetSettings`. It returns a new settings object with the detected schema applied.

See references/schema-operations.md for join compatibility checks, helper functions, and advanced operations.

## SQL Schema Rule

Output datasets for SQL-based recipes MUST have their schemas set before building. Without a schema, Dataiku generates `CREATE TABLE () ...`, which fails.

For SQL databases (Snowflake, BigQuery), use UPPERCASE column names. Lowercase names get quoted in the generated DDL, causing "invalid identifier" errors.

```python
# Normalize column names to uppercase for SQL
raw = settings.get_raw()
for col in raw.get("schema", {}).get("columns", []):
    col["name"] = col["name"].upper()
settings.save()
```
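Putting the rule together, here is a sketch of the schema-first workflow: define an uppercase schema on the output dataset, save it, and only then build. The helper names are this sketch's own, and the `ds.build()` call at the end is an assumption (a build method exists on `DSSDataset` in recent dataikuapi versions, but check the job options for yours):

```python
def uppercase_columns(columns):
    """Return a copy of a column list with names uppercased (pure helper)."""
    return [dict(col, name=col["name"].upper()) for col in columns]


def set_sql_schema_then_build(ds, columns):
    # 1. Set the schema BEFORE building, so the generated CREATE TABLE is valid
    settings = ds.get_settings()
    settings.set_schema({"columns": uppercase_columns(columns)})
    settings.save()
    # 2. Only then trigger the build (build() on DSSDataset is an assumption
    #    here; verify availability and arguments for your dataikuapi version)
    return ds.build()
```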

## List Datasets in a Project

```python
datasets = project.list_datasets()
for ds in datasets:
    print(f"- {ds['name']} ({ds.get('type', 'unknown')})")
```

## Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Schema mismatch | Recipe output doesn't match the declared schema | Run `autodetect_settings()` |
| Join fails | Key type mismatch | Check types; cast if needed |
| Missing columns | Schema not updated | Rebuild the dataset and update the schema |
| Parse errors | Wrong type detection | Set the schema manually |
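The "Join fails" row can be checked up front before running a recipe. A minimal sketch that compares key column types across two schema dicts, treating the integer types (and the two floating types) as mutually join-compatible; the compatibility groups and helper names are assumptions of this sketch:

```python
# Type groups considered join-compatible with each other (assumed grouping).
_COMPATIBLE_GROUPS = [
    {"tinyint", "smallint", "int", "bigint"},
    {"float", "double"},
]


def types_join_compatible(t1, t2):
    """True if two Dataiku storage types can be joined without casting."""
    if t1 == t2:
        return True
    return any(t1 in group and t2 in group for group in _COMPATIBLE_GROUPS)


def check_join_keys(left_schema, right_schema, left_key, right_key):
    """Return (ok, message) for a proposed join between two schema dicts,
    each shaped like {"columns": [{"name": ..., "type": ...}, ...]}."""
    def col_type(schema, name):
        for col in schema["columns"]:
            if col["name"] == name:
                return col["type"]
        return None

    t1 = col_type(left_schema, left_key)
    t2 = col_type(right_schema, right_key)
    if t1 is None or t2 is None:
        return False, "key column missing from schema"
    if not types_join_compatible(t1, t2):
        return False, f"type mismatch: {t1} vs {t2}"
    return True, "ok"
```

The schema dicts come straight from the Get Schema snippet above, so this check costs only two settings reads.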

## Detailed References

- `references/column-types.md` — Full column type table with Python equivalents
- `references/schema-operations.md` — All schema operations, join compatibility checks, helper functions

## Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

## Related Skills

Related by shared tags or category signals. No summaries are provided by the upstream source; each needs review before use.

- recipe-patterns
- flow-management
- data-catalog
- troubleshooting