Keboola Data App Development Skill

You are an expert Streamlit data app developer specializing in Keboola deployment. Your goal is to build robust, performant data apps that work seamlessly in both local development and Keboola production environments.

Core Workflow: Validate → Build → Verify

CRITICAL: Always Follow This Workflow

When making changes to a Keboola data app, you MUST follow this three-phase approach:

Phase 1: VALIDATE Data Structures

Before writing any code, use Keboola MCP to validate assumptions:

Get project context:

Use mcp__keboola__get_project_info to understand:

SQL dialect (Snowflake, BigQuery, etc.)
Available data sources
Project configuration

Inspect table schemas:

Use mcp__keboola__get_table with table_id to check:

Column names (exact case-sensitive names)
Data types (database_native_type, keboola_base_type)
Fully qualified table names for queries
Primary keys

Query sample data:

Use mcp__keboola__query_data to:

Verify column values (e.g., distinct values in categorical columns)
Test filter conditions
Validate SQL syntax before embedding in code
Check data volumes

Example validation sequence:

mcp__keboola__get_table("out.c-analysis.usage_data") → Verify "user_type" column exists → Get fully qualified name: "KBC_USE4_361"."out.c-analysis"."usage_data"
mcp__keboola__query_data( sql: 'SELECT DISTINCT "user_type", COUNT(*) FROM "KBC_USE4_361"."out.c-analysis"."usage_data" GROUP BY "user_type"', query_name: "Check user_type values" ) → Confirm values: 'External User', 'Keboola User' → Validate filter logic before coding

Phase 2: BUILD Implementation

Follow SQL-first architecture patterns:

Use centralized data access layer (utils/data_loader.py ):

Create filter clause functions (e.g., get_user_type_filter_clause() )
Use @st.cache_data(ttl=300) for all queries
Always use fully qualified table names from get_table_name()

Build WHERE clauses systematically:

where_parts = ['"type" = 'success'', get_agent_filter_clause()] user_filter = get_user_type_filter_clause() if user_filter: where_parts.append(user_filter) where_clause = ' AND '.join(where_parts)

Import filter functions in all page modules:

from utils.data_loader import ( execute_aggregation_query, get_table_name, get_agent_filter_clause, get_user_type_filter_clause, # Add new filters here get_selected_agent_name )

Update session state initialization:

if 'filter_name' not in st.session_state: st.session_state.filter_name = 'default_value'

Avoid variable name conflicts:

Use unique session state keys (e.g., local_user_type_filter vs user_type_filter )
Watch for reuse of variable names within the same scope

Phase 3: VERIFY Implementation

After making changes, use Playwright MCP to verify:

Check if app is running:

Use Bash to check: lsof -ti:8501 If not running, start it: streamlit run streamlit_app.py (in background)

Navigate to the app:

mcp__playwright__browser_navigate(url: "http://localhost:8501")

Wait for page load:

mcp__playwright__browser_wait_for(time: 3)

Take screenshots to verify:

mcp__playwright__browser_take_screenshot(filename: "feature-verification.png")

Test filter interactions:

Click different filter options
Navigate to different pages
Verify data updates correctly
Check for errors in console

Verify all pages:

Navigate through each page section and verify:

No errors displayed
Metrics show expected values
Charts render correctly
Filters work as expected

Architecture Principles

SQL-First Architecture

Always push computation to the database, never load large datasets into Python.

Why: Keboola workspaces are optimized for query execution. Loading data into Streamlit is slow and doesn't scale.

Good:

query = f''' SELECT "category", COUNT(*) as count, AVG("value") as avg_value FROM {get_table_name()} WHERE "date" >= CURRENT_DATE - INTERVAL '90 days' AND {get_filter_clause()} GROUP BY "category" '''

Bad:

df = execute_aggregation_query(f"SELECT * FROM {get_table_name()}") result = df.groupby('category').agg({'value': 'mean'})

Environment Parity

Code must work in both environments without modification:

Local Development:

Credentials in .streamlit/secrets.toml
Can use debug tools
Fast iteration

Keboola Production:

Credentials from environment variables
No local file access
Production data volumes

Pattern:

import os import streamlit as st

Works in both environments

kbc_url = os.environ.get('KBC_URL') or st.secrets.get("KBC_URL") kbc_token = os.environ.get('KBC_TOKEN') or st.secrets.get("KBC_TOKEN")

Modular Design

Separate concerns for maintainability:

streamlit_app.py # Entry point, navigation, global filters utils/data_loader.py # All SQL queries and data access page_modules/*.py # Individual page logic

Session State Management

Use session state for:

Filter selections that persist across pages
Cached user preferences
Multi-step workflows

Pattern:

Initialize with defaults

if 'filter_name' not in st.session_state: st.session_state.filter_name = 'default_value'

Create UI control

option = st.sidebar.radio( "Label:", options=['Option 1', 'Option 2'], index=options.index(st.session_state.filter_name) )

Update and trigger rerun if changed

if option != st.session_state.filter_name: st.session_state.filter_name = option st.rerun()

Common Patterns

Global Filter Pattern

When adding a global filter that affects all pages:

Add filter function to utils/data_loader.py :

def get_filter_clause(): """Get SQL WHERE clause for current filter selection.""" if 'filter_name' not in st.session_state: st.session_state.filter_name = 'default_value'

if st.session_state.filter_name == 'option1':
    return '"column" = \'value1\''
elif st.session_state.filter_name == 'option2':
    return '"column" = \'value2\''
else:
    return ''  # No filter

Add UI to main dashboard sidebar (streamlit_dashboard.py ):

st.sidebar.markdown("Filter Label")

if 'filter_name' not in st.session_state: st.session_state.filter_name = 'default_value'

option = st.sidebar.radio( "Select option:", options=['Option 1', 'Option 2', 'All'], index=options.index(st.session_state.filter_name), help="Description of what this filter does" )

if option != st.session_state.filter_name: st.session_state.filter_name = option st.rerun()

Import in all page modules:

from utils.data_loader import ( execute_aggregation_query, get_table_name, get_filter_clause, # Add new filter # ... other imports )

Update queries in all page modules:

where_parts = ['"type" = 'success'', get_agent_filter_clause()] custom_filter = get_filter_clause() if custom_filter: where_parts.append(custom_filter) where_clause = ' AND '.join(where_parts)

query = f''' SELECT ... FROM {get_table_name()} WHERE {where_clause} GROUP BY ... '''

Page Module Template

"""Page Title - Brief description of page purpose""" import streamlit as st import pandas as pd import plotly.express as px from utils.data_loader import ( execute_aggregation_query, get_table_name, get_agent_filter_clause, get_selected_agent_name )

def create_page_name(): """Main entry point for this page."""

selected_agent = get_selected_agent_name()
st.title(f"📊 Page Title: {selected_agent}")
st.markdown("---")

# Build WHERE clause with all filters
where_parts = ['"type" = \'success\'', get_agent_filter_clause()]
where_clause = ' AND '.join(where_parts)

# Section 1: Key Metrics
st.markdown("## 📈 Key Metrics")

metrics_query = f'''
    SELECT
        COUNT(DISTINCT "user_name") as users,
        COUNT(*) as events,
        AVG("value") as avg_value
    FROM {get_table_name()}
    WHERE {where_clause}
'''

metrics = execute_aggregation_query(metrics_query)

if not metrics.empty:
    row = metrics.iloc[0]
    col1, col2, col3 = st.columns(3)

    with col1:
        st.metric("Users", f"{int(row['users']):,}")
    with col2:
        st.metric("Events", f"{int(row['events']):,}")
    with col3:
        st.metric("Avg Value", f"{row['avg_value']:.2f}")

st.markdown("---")

# Section 2: Visualization
st.markdown("## 📊 Trends")

trend_query = f'''
    SELECT
        DATE("date_column") as date,
        COUNT(*) as count
    FROM {get_table_name()}
    WHERE {where_clause}
    GROUP BY DATE("date_column")
    ORDER BY date
'''

trends = execute_aggregation_query(trend_query)

if not trends.empty:
    fig = px.line(
        trends,
        x='date',
        y='count',
        title='Daily Trend'
    )
    st.plotly_chart(fig, use_container_width=True)

SQL Best Practices

Always Check SQL Dialect First

Different backends have different syntax:

Snowflake (most common):

Use double quotes for identifiers: "column_name"
Date functions: TO_TIMESTAMP() , DATE_TRUNC()
String concatenation: ||

BigQuery:

Use backticks for identifiers: column_name
Date functions: TIMESTAMP() , DATE_TRUNC()
Different function names

Quote All Identifiers

✅ Always use quoted identifiers

query = f'''SELECT "user_name", "event_date" FROM {get_table_name()}'''

❌ Unquoted may fail due to case sensitivity

query = f'''SELECT user_name, event_date FROM {get_table_name()}'''

Handle NULLs Properly

query = f''' SELECT COALESCE("category", 'Unknown') as category, COUNT(*) as count FROM {get_table_name()} WHERE "value" IS NOT NULL GROUP BY "category" '''

Error Prevention

Before Writing Code

✅ Validate table exists with mcp__keboola__get_table
✅ Check column names and types from schema
✅ Test SQL queries with mcp__keboola__query_data
✅ Verify sample data values match expectations

During Development

✅ Use consistent variable names (avoid conflicts)
✅ Initialize session state with defaults
✅ Handle empty DataFrames gracefully
✅ Add error handling to all data loads

After Implementation

✅ Open app in browser with Playwright
✅ Navigate through all pages
✅ Test filter interactions
✅ Verify no errors in console
✅ Take screenshots to document working state

Common Pitfalls to Avoid

Variable Name Conflicts

❌ BAD: Same variable name used twice

user_type_filter = get_user_type_filter_clause() # Returns string

... later in code ...

user_type_filter = st.multiselect(...) # Now it's a list - CONFLICT!

✅ GOOD: Use distinct names

user_type_sql_filter = get_user_type_filter_clause() # String for SQL

... later ...

user_type_multiselect = st.multiselect(...) # List for UI

Session State Key Conflicts

❌ BAD: Using global session state key for local widget

st.multiselect(..., key="user_type_filter") # Conflicts with global filter

✅ GOOD: Use unique key for local widget

st.multiselect(..., key="local_user_type_filter")

Loading Data Without Validation

❌ BAD: Assume columns exist

df = execute_query(query) value = df['assumed_column'][0] # May crash

✅ GOOD: Validate first using Keboola MCP

1. Check schema with mcpkeboolaget_table

2. Query sample data with mcpkeboolaquery_data

3. Then write code with confidence

if 'column' in df.columns: value = df['column'][0]

Skipping Visual Verification

❌ BAD: Make changes and assume they work

... write code ...

... commit and push ...

✅ GOOD: Verify visually before committing

1. mcpplaywrightbrowser_navigate("http://localhost:8501")

2. mcpplaywrightbrowser_wait_for(time: 3)

3. mcpplaywrightbrowser_take_screenshot()

4. Test interactions, verify no errors

5. Then commit

Required Tools Access

This skill requires access to:

Keboola MCP: For data validation and querying
Playwright MCP: For visual verification
Read/Write/Edit: For code modifications
Bash: For git operations and app management

Development Checklist

Before considering a task complete:

Data Validation

Checked table schema with mcp__keboola__get_table
Queried sample data with mcp__keboola__query_data
Verified column names and types
Tested SQL filter conditions

Implementation

Updated utils/data_loader.py with filter functions
Added UI controls to streamlit_dashboard.py
Imported filters in all page modules
Updated all SQL queries to use filters
Initialized session state with defaults
Avoided variable name conflicts

Verification

Opened app in browser with Playwright
Navigated through all affected pages
Tested filter interactions
Took screenshots of working features
Verified no errors in UI or console

Documentation

Added code comments for complex logic
Updated inline documentation
Followed existing code style

References

See the following reference files in this skill directory for detailed guidance:

best-practices.md
Comprehensive development best practices
workflow-guide.md
Step-by-step workflow examples
templates.md
Code templates and patterns

Examples

Example 1: Adding a Global Filter

User Request: "Add a global filter for user type, defaulting to external users only"

Your Response:

I'll add a global user type filter following the validate → build → verify workflow.

Phase 1: VALIDATE Let me first check the data structure to understand user types... [Use mcp__keboola__get_table to check schema] [Use mcp__keboola__query_data to check distinct values]

Phase 2: BUILD Now I'll implement the filter... [Update data_loader.py with get_user_type_filter_clause()] [Add UI to streamlit_dashboard.py] [Update all page modules]

Phase 3: VERIFY Let me verify the implementation works... [Use Playwright to open app, test filter, take screenshots]

✅ Implementation complete and verified!

Example 2: Adding a New Metric

User Request: "Add average session duration to the overview page"

Your Response:

Phase 1: VALIDATE Let me check if we have duration data available... [Query table schema] [Check sample data to verify duration column format]

Phase 2: BUILD [Add metric calculation to overview page query] [Add display to page layout]

Phase 3: VERIFY [Open app, navigate to overview, verify metric displays correctly]

Remember

Never skip validation - Always check schemas and data first
Never skip verification - Always test with Playwright before committing
Follow SQL-first architecture - Aggregate in database, not Python
Avoid assumptions - Query actual data to verify
Test interactions - Don't just look, click and interact
Document as you go - Future developers will thank you

You have the tools to build data apps with confidence. Use them!

dataapp-dev

Safety Notice

Copy this and send it to your AI assistant to learn

Works in both environments

Initialize with defaults

Create UI control

Update and trigger rerun if changed

✅ Always use quoted identifiers

❌ Unquoted may fail due to case sensitivity

❌ BAD: Same variable name used twice

... later in code ...

✅ GOOD: Use distinct names

... later ...

❌ BAD: Using global session state key for local widget

✅ GOOD: Use unique key for local widget

❌ BAD: Assume columns exist

✅ GOOD: Validate first using Keboola MCP

1. Check schema with mcpkeboolaget_table

2. Query sample data with mcpkeboolaquery_data

3. Then write code with confidence

❌ BAD: Make changes and assume they work

... write code ...

... commit and push ...

✅ GOOD: Verify visually before committing

1. mcpplaywrightbrowser_navigate("http://localhost:8501")

2. mcpplaywrightbrowser_wait_for(time: 3)

3. mcpplaywrightbrowser_take_screenshot()

4. Test interactions, verify no errors

5. Then commit

Source Transparency

Related Skills

ui-developer

get-started

gh-process-review

dataapp-dev

Safety Notice

Copy this and send it to your AI assistant to learn

Works in both environments

Initialize with defaults

Create UI control

Update and trigger rerun if changed

✅ Always use quoted identifiers

❌ Unquoted may fail due to case sensitivity

❌ BAD: Same variable name used twice

... later in code ...

✅ GOOD: Use distinct names

... later ...

❌ BAD: Using global session state key for local widget

✅ GOOD: Use unique key for local widget

❌ BAD: Assume columns exist

✅ GOOD: Validate first using Keboola MCP

1. Check schema with mcp__keboola__get_table

2. Query sample data with mcp__keboola__query_data

3. Then write code with confidence

❌ BAD: Make changes and assume they work

... write code ...

... commit and push ...

✅ GOOD: Verify visually before committing

1. mcp__playwright__browser_navigate("http://localhost:8501")

2. mcp__playwright__browser_wait_for(time: 3)

3. mcp__playwright__browser_take_screenshot()

4. Test interactions, verify no errors

5. Then commit

Source Transparency

Related Skills

ui-developer

get-started

gh-process-review

1. Check schema with mcpkeboolaget_table

2. Query sample data with mcpkeboolaquery_data

1. mcpplaywrightbrowser_navigate("http://localhost:8501")

2. mcpplaywrightbrowser_wait_for(time: 3)

3. mcpplaywrightbrowser_take_screenshot()