Neo4j Cypher
When to Use
Use this skill when writing, reviewing, or debugging Cypher queries for Neo4j. Covers query patterns, performance optimization, fraud-detection domain queries, and Neo4j 5+ syntax.
Core Query Patterns
MATCH and Filtering
-- Basic node match with property filter MATCH (c:Customer {customerId: $customerId}) RETURN c
-- Relationship traversal MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account) WHERE a.status = 'active' RETURN c, a
-- Variable-length paths MATCH path = (a:Account)-[:TRANSACTION*1..5]->(b:Account) RETURN path
-- Multiple relationship types MATCH (c:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(pii) RETURN c, pii
OPTIONAL MATCH
Use when the related data may not exist. Without OPTIONAL, rows with no match are dropped entirely.
MATCH (c:Customer {customerId: $customerId}) OPTIONAL MATCH (c)-[:HAS_EMAIL]->(e:Email) OPTIONAL MATCH (c)-[:HAS_PHONE]->(p:Phone) RETURN c, collect(DISTINCT e) AS emails, collect(DISTINCT p) AS phones
WITH Chaining
Use WITH to pipe results between query stages, filter intermediate results, and control cardinality.
MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account) MATCH (a)-[:PERFORM]->(tx:Transaction) WITH c, a, count(tx) AS txCount, sum(tx.amount) AS totalAmount WHERE txCount > 10 RETURN c.customerId, a.accountNumber, txCount, totalAmount ORDER BY totalAmount DESC
Aggregation
-- Group and aggregate MATCH (a:Account)-[:PERFORM]->(tx:Transaction) WITH a, count(tx) AS txCount, sum(tx.amount) AS total, avg(tx.amount) AS avgAmount RETURN a.accountNumber, txCount, total, avgAmount ORDER BY total DESC LIMIT 20
-- collect() for lists — use DISTINCT to avoid duplicates from cartesian products MATCH (c:Customer)-[:HAS_ACCOUNT]->(a:Account) OPTIONAL MATCH (a)-[:PERFORM]->(tx:Transaction) RETURN c.customerId, collect(DISTINCT a.accountNumber) AS accounts, count(DISTINCT tx) AS transactionCount
UNWIND
Expand a list into rows. Useful for parameterized batch operations.
UNWIND $accountNumbers AS accNum MATCH (a:Account {accountNumber: accNum}) RETURN a
CASE Expressions
MATCH (tx:Transaction) RETURN tx.amount, CASE WHEN tx.amount > 10000 THEN 'high' WHEN tx.amount > 1000 THEN 'medium' ELSE 'low' END AS riskTier
Subqueries (CALL {})
MATCH (c:Customer) CALL (c) { MATCH (c)-[:HAS_ACCOUNT]->(a:Account)-[:PERFORM]->(tx:Transaction) RETURN sum(tx.amount) AS totalSpend } RETURN c.customerId, totalSpend ORDER BY totalSpend DESC
Write Operations
CREATE and MERGE
-- CREATE always creates new CREATE (c:Customer {customerId: $id, firstName: $first, lastName: $last})
-- MERGE finds or creates — always specify the minimal unique key MERGE (e:Email {address: $email}) ON CREATE SET e.createdAt = datetime() ON MATCH SET e.lastSeen = datetime()
-- MERGE relationship MATCH (c:Customer {customerId: $customerId}) MATCH (e:Email {address: $email}) MERGE (c)-[:HAS_EMAIL]->(e)
SET and REMOVE
MATCH (a:Account {accountNumber: $accNum}) SET a.status = 'frozen', a.frozenAt = datetime(), a:Frozen REMOVE a:Active
DELETE
-- Delete node and all its relationships MATCH (n:TempNode {id: $id}) DETACH DELETE n
-- Delete specific relationship MATCH (c:Customer)-[r:HAS_EMAIL]->(e:Email {address: $email}) DELETE r
Performance
Index Usage
Always create indexes on properties used in MATCH/WHERE lookups.
-- Property index (most common) CREATE INDEX customer_id FOR (c:Customer) ON (c.customerId)
-- Composite index CREATE INDEX account_lookup FOR (a:Account) ON (a.accountNumber, a.status)
-- Text index for CONTAINS/STARTS WITH CREATE TEXT INDEX email_text FOR (e:Email) ON (e.address)
-- Verify indexes SHOW INDEXES
EXPLAIN and PROFILE
-- EXPLAIN: shows plan without executing EXPLAIN MATCH (c:Customer {customerId: '123'})-[:HAS_ACCOUNT]->(a:Account) RETURN c, a
-- PROFILE: executes and shows actual rows/db hits per operator PROFILE MATCH (c:Customer {customerId: '123'})-[:HAS_ACCOUNT]->(a:Account) RETURN c, a
Look for:
-
NodeByLabelScan → missing index, add one
-
CartesianProduct → unconnected MATCH clauses, connect them or use WITH
-
Eager → query plan can't stream, may cause memory issues on large datasets
-
High db hits relative to result rows → inefficient traversal
Parameterized Queries
Always use parameters ($param ) instead of string interpolation. This enables query plan caching and prevents injection.
-- Good MATCH (c:Customer {customerId: $customerId}) RETURN c
-- Bad — no plan cache, injection risk MATCH (c:Customer {customerId: '${userInput}'}) RETURN c
Avoiding Cartesian Products
-- BAD: two unconnected MATCH clauses = cartesian product MATCH (a:Account) MATCH (b:Bank) RETURN a, b -- rows = |accounts| × |banks|
-- GOOD: connect through relationships MATCH (a:Account)-[:PERFORM]->(tx:Transaction)-[:BENEFITS_TO]->(b:Bank) RETURN a, tx, b
-- GOOD: if truly independent, use UNION or separate queries
Limit Early, Filter Early
-- Push WHERE as early as possible MATCH (c:Customer) WHERE c.nationality = $country -- filter before traversal MATCH (c)-[:HAS_ACCOUNT]->(a:Account)-[:PERFORM]->(tx:Transaction) WHERE tx.amount > $threshold RETURN c, a, tx
Fraud-Domain Patterns
Shared PII Detection (Synthetic Identity)
MATCH (c1:Customer)-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]->(pii)<-[:HAS_EMAIL|HAS_PHONE|HAS_SSN]-(c2:Customer) WHERE c1 <> c2 WITH c1, c2, collect(pii) AS sharedPII, count(pii) AS sharedCount WHERE sharedCount >= 2 RETURN c1.customerId, c2.customerId, sharedCount, [p IN sharedPII | labels(p)[0]] AS sharedTypes
Transaction Ring Detection
-- Circular fund flow (Neo4j 5.9+ quantified path patterns) MATCH path = (a:Account)-[:PERFORM]->(first_tx) ((tx_i)-[:BENEFITS_TO]->(a_i)-[:PERFORM]->(tx_j) WHERE tx_i.date < tx_j.date)* (last_tx)-[:BENEFITS_TO]->(a) WHERE size(apoc.coll.toSet([a] + a_i)) = size([a] + a_i) RETURN path
Fund Flow / Money Trail
-- Trace where money went from a specific account MATCH (source:Account {accountNumber: $accNum}) MATCH path = (source)-[:PERFORM]->(tx:Transaction)-[:BENEFITS_TO]->(dest) RETURN dest, tx.amount, tx.date, labels(dest)[0] AS destType ORDER BY tx.date DESC
Network Expansion (1-hop, 2-hop)
-- All entities within 2 hops of a customer MATCH path = (c:Customer {customerId: $customerId})-[*1..2]-(connected) RETURN path
Community Detection (with GDS)
-- Project graph CALL gds.graph.project('fraud-network', 'Customer', {LINKED: {type: 'LINKED', orientation: 'UNDIRECTED'}})
-- Run Weakly Connected Components CALL gds.wcc.stream('fraud-network') YIELD nodeId, componentId WITH componentId, collect(gds.util.asNode(nodeId).customerId) AS members WHERE size(members) > 1 RETURN componentId, members, size(members) AS clusterSize ORDER BY clusterSize DESC
-- Clean up projection CALL gds.graph.drop('fraud-network')
Centrality (PageRank / Betweenness)
CALL gds.pageRank.stream('fraud-network') YIELD nodeId, score RETURN gds.util.asNode(nodeId).customerId AS customerId, score ORDER BY score DESC LIMIT 20
Neo4j 5+ Features
Element IDs (replaces internal integer IDs)
-- Neo4j 5+: use elementId() instead of id() MATCH (n:Customer) WHERE elementId(n) = $elementId RETURN n
-- In Bloom scene actions MATCH (n) WHERE elementId(n) IN $nodes RETURN n
Quantified Path Patterns (5.9+)
-- Match paths of variable length with inline predicates MATCH path = (a:Account) (()-[:PERFORM]->(tx:Transaction)-[:BENEFITS_TO]->() WHERE tx.amount > 1000){2,5} (b:Account) RETURN path
Temporal Types
-- datetime(), date(), time(), duration() CREATE (tx:Transaction { timestamp: datetime(), settlementDate: date('2024-03-15'), processingTime: duration('PT2H30M') })
-- Filtering by date range MATCH (tx:Transaction) WHERE tx.timestamp >= datetime($startDate) AND tx.timestamp <= datetime($endDate) RETURN tx
-- Date arithmetic MATCH (tx:Transaction) WHERE tx.timestamp >= datetime() - duration({days: 30}) RETURN tx
COUNT {} and EXISTS {} Subqueries
-- Count subquery (Neo4j 5+) MATCH (c:Customer) WHERE COUNT { (c)-[:HAS_ACCOUNT]->(a:Account)-[:PERFORM]->(tx:Transaction) WHERE tx.amount > 10000 } > 5 RETURN c
-- EXISTS subquery MATCH (c:Customer) WHERE EXISTS { (c)-[:HAS_EMAIL]->(e:Email)<-[:HAS_EMAIL]-(other:Customer) WHERE c <> other } RETURN c
Anti-Patterns
- Collecting without DISTINCT
When multiple OPTIONAL MATCH clauses create cartesian products between collected lists:
-- BAD: emails × phones duplicates MATCH (c:Customer) OPTIONAL MATCH (c)-[:HAS_EMAIL]->(e:Email) OPTIONAL MATCH (c)-[:HAS_PHONE]->(p:Phone) RETURN c, collect(e) AS emails, collect(p) AS phones
-- GOOD: use DISTINCT RETURN c, collect(DISTINCT e) AS emails, collect(DISTINCT p) AS phones
- MERGE on Too Many Properties
-- BAD: if any property differs, creates duplicate MERGE (c:Customer {customerId: $id, firstName: $first, lastName: $last})
-- GOOD: merge on unique key, set other props MERGE (c:Customer {customerId: $id}) ON CREATE SET c.firstName = $first, c.lastName = $last
- Unbounded Variable-Length Paths
-- BAD: can explode on connected graphs MATCH path = (a)-[*]->(b) RETURN path
-- GOOD: always bound the length MATCH path = (a)-[*1..5]->(b) RETURN path
- Using Labels in WHERE Instead of MATCH
-- BAD: scans all nodes then filters MATCH (n) WHERE 'Customer' IN labels(n) RETURN n
-- GOOD: label in MATCH uses label index MATCH (n:Customer) RETURN n
- String Concatenation for Dynamic Queries
-- BAD: no plan caching, injection risk "MATCH (n {id: '" + userId + "'}) RETURN n"
-- GOOD: use parameters MATCH (n {id: $userId}) RETURN n
- Loading Too Much Data
-- BAD: returns everything MATCH (n) RETURN n
-- GOOD: limit and paginate MATCH (n:Customer) RETURN n ORDER BY n.customerId SKIP $offset LIMIT $pageSize