regex-log

Regex Log Parsing Skill

This skill provides a systematic approach for constructing complex regular expressions that extract and validate structured data from log files.

When to Use This Skill

This skill applies when:

Building regex patterns to extract data from log entries
Validating specific formats (IPv4 addresses, dates, timestamps) within logs
Handling requirements for first/last occurrence selection
Enforcing word boundary conditions
Combining multiple validation constraints in a single pattern

Approach: Decomposition Strategy

Build a complex log-parsing regex by decomposing the problem into sub-patterns:

Step 1: Identify All Requirements

Before writing any regex, create a complete list of requirements:

What data needs to be validated (present but not captured)?
What data needs to be captured?
What boundary conditions apply (word boundaries, line anchors)?
Are there positional requirements (first, last, nth occurrence)?
What constitutes an invalid match?

Step 2: Build Sub-Patterns Independently

Construct each validation pattern separately before combining:

IPv4 Address Pattern

For valid IPv4 addresses (0-255 per octet, no leading zeros except for 0 itself):

Octet pattern: (?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
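Order alternatives from most specific to least specific, because the engine tries alternatives left to right and can stop at a shorter one
Full IPv4 (note the escaped dot, \. , between octets): (?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])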
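
The octet and full-address patterns above can be assembled in code so the octet is written only once. This is a minimal sketch; the names OCTET, IPV4, and is_ipv4 are illustrative and not part of the skill itself.

```python
import re

# Define the octet once and reuse it, so the 0-255 logic lives in one place.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Three "octet + literal dot" groups followed by a final octet.
# The dot is escaped; an unescaped dot would match any character.
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"
IPV4_RE = re.compile(IPV4)


def is_ipv4(text: str) -> bool:
    """Return True only if the whole string is a valid dotted-quad address."""
    return IPV4_RE.fullmatch(text) is not None
```

Using fullmatch here checks a candidate string in isolation; inside a longer log line the same IPV4 fragment would be combined with anchors or boundaries instead.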

Date Pattern (YYYY-MM-DD)

For valid dates with proper month-day validation:

31-day months: (?:0[13578]|1[02])-(?:0[1-9]|[12][0-9]|3[01])
30-day months: (?:0[469]|11)-(?:0[1-9]|[12][0-9]|30)
February (up to 29, leap years not checked): 02-(?:0[1-9]|[12][0-9])
Combine with year: [0-9]{4}-(?:...combined month-day patterns...)
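
As a sketch of how the three month-day alternatives might be combined with the year, the snippet below joins them into one non-capturing group. The names MONTH_DAY, DATE, and is_date are assumptions made for illustration.

```python
import re

# One alternative per month length; each pairs the allowed months
# with the days those months can have.
MONTH_DAY = (
    r"(?:"
    r"(?:0[13578]|1[02])-(?:0[1-9]|[12][0-9]|3[01])"  # 31-day months
    r"|(?:0[469]|11)-(?:0[1-9]|[12][0-9]|30)"         # 30-day months
    r"|02-(?:0[1-9]|[12][0-9])"                       # February, days 01-29
    r")"
)
DATE = rf"[0-9]{{4}}-{MONTH_DAY}"
DATE_RE = re.compile(DATE)


def is_date(text: str) -> bool:
    """Return True if the whole string is a YYYY-MM-DD date accepted by DATE."""
    return DATE_RE.fullmatch(text) is not None
```

This pattern accepts 02-29 in every year; leap-year validation is not attempted here.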

Step 3: Apply Positional Requirements

Selecting Last Occurrence

To capture the last valid pattern in a line:

^.*<pattern>(?!.*<pattern>)

Use ^.* to greedily consume characters
Use negative lookahead (?!.*<pattern>) to ensure no pattern follows
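
The last-occurrence technique can be tried on a small example. This sketch uses a simplified date shape as a stand-in for a fully validated pattern; DATE, LAST_DATE_RE, and last_date are hypothetical names.

```python
import re

# Simplified stand-in for a validated date pattern (shape only).
DATE = r"\d{4}-\d{2}-\d{2}"

# Greedy ^.* pushes the match as far right as possible; the negative
# lookahead then asserts that no further date follows the captured one.
LAST_DATE_RE = re.compile(rf"^.*({DATE})(?!.*{DATE})")


def last_date(line: str):
    """Return the last date-shaped substring in the line, or None."""
    m = LAST_DATE_RE.search(line)
    return m.group(1) if m else None
```

For instance, last_date("2024-01-01 start, retry 2024-02-02, done 2024-03-03") selects the final date on the line.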

Selecting First Occurrence

To capture the first valid pattern:

^(?:(?!<pattern>).)*<pattern>

Or simply rely on the engine: a search returns the leftmost match by default.
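
Both first-occurrence approaches can be compared side by side. This is a sketch with a simplified date shape; the function names are illustrative only.

```python
import re

# Simplified stand-in for a validated date pattern (shape only).
DATE = r"\d{4}-\d{2}-\d{2}"

# "Tempered dot": consume a character only when a date does not start there,
# so the capture group can only land on the first date in the line.
FIRST_DATE_RE = re.compile(rf"^(?:(?!{DATE}).)*({DATE})")


def first_date_anchored(line: str):
    """First date via the explicit anchored pattern."""
    m = FIRST_DATE_RE.match(line)
    return m.group(1) if m else None


def first_date_search(line: str):
    """First date via the engine's default leftmost-match behaviour."""
    m = re.search(DATE, line)
    return m.group(0) if m else None
```

The anchored form is mainly useful when the whole expression must start at the beginning of the line, for example when it is combined with a leading lookahead.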

Step 4: Apply Validation Without Capture

To require presence of a pattern without capturing it:

Use lookahead: (?=.*<pattern>) at the start of the regex
This validates the line contains the pattern without affecting the capture
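
A small sketch of validation without capture: the line must contain something IP-shaped, but only the date is captured. IP and DATE are deliberately simplified stand-ins, and date_if_line_has_ip is a hypothetical helper.

```python
import re

# Simplified shapes; real use would substitute the validated sub-patterns.
IP = r"\d{1,3}(?:\.\d{1,3}){3}"
DATE = r"\d{4}-\d{2}-\d{2}"

# The lookahead checks that an IP appears somewhere on the line without
# consuming input, so the only capture group is the date.
LINE_RE = re.compile(rf"^(?=.*{IP}).*?({DATE})")


def date_if_line_has_ip(line: str):
    """Return the first date on the line, but only if the line contains an IP."""
    m = LINE_RE.search(line)
    return m.group(1) if m else None
```

Because the lookahead is zero-width, the IP may appear before or after the date in the line.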

Step 5: Apply Word Boundaries

For patterns that must not be adjacent to alphanumeric characters:

Use \b word boundaries: \b<pattern>\b
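Be aware that \b matches between a word character (letter, digit, or underscore) and a non-word character, so an underscore next to the pattern prevents a match; when only letters and digits should count, use lookarounds such as (?<![A-Za-z0-9]) and (?![A-Za-z0-9]) instead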
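
The difference between \b and explicit alphanumeric lookarounds shows up when the pattern sits next to an underscore. The sketch below uses a simplified IP shape; which behaviour is correct depends on the task's stated requirements.

```python
import re

# Simplified IP shape, used only to illustrate boundary handling.
IP = r"\d{1,3}(?:\.\d{1,3}){3}"

# \b treats letters, digits, and underscore as word characters.
WORD_BOUNDARY_RE = re.compile(rf"\b{IP}\b")

# Lookarounds that reject only adjacent letters and digits.
ALNUM_BOUNDARY_RE = re.compile(rf"(?<![A-Za-z0-9]){IP}(?![A-Za-z0-9])")


def find(pattern: re.Pattern, line: str):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(line)
    return m.group(0) if m else None
```

On "_10.0.0.7_" the \b version finds nothing because the underscores count as word characters, while the lookaround version returns the address.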

Verification Strategy

Create Comprehensive Test Cases

Organize tests by category:

Valid cases: Confirm expected matches

Minimum/maximum valid values (e.g., 0.0.0.0, 255.255.255.255)
Edge values for each component

Invalid format cases: Confirm rejection

Out-of-range values (e.g., 256.0.0.0)
Invalid formatting (leading zeros where prohibited)
Invalid months (00, 13) or days (32)

Boundary condition cases:

Pattern at start/end of line
Pattern adjacent to alphanumeric characters (should fail with word boundaries)
Pattern adjacent to punctuation (should pass with word boundaries)

Positional cases:

Multiple valid patterns in one line (verify correct one is captured)
Single pattern in line
No valid pattern in line

Test File Structure

Create a structured test file that:

Groups tests by category
Uses clear naming for each test case
Reports pass/fail status for each test
Summarizes overall results

Example structure:

test_cases = {
    "valid_ipv4": [...],
    "invalid_ipv4": [...],
    "valid_dates": [...],
    "invalid_dates": [...],
    "last_occurrence": [...],
    "boundary_conditions": [...],
}
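
One possible shape for such a test file is sketched below, with two categories filled in. The tuple layout (name, input, expected) and the run_tests helper are assumptions for illustration, not a prescribed format.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")


def is_ipv4(text: str) -> bool:
    return IPV4_RE.fullmatch(text) is not None


# category -> list of (test name, input, expected result)
test_cases = {
    "valid_ipv4": [
        ("min", "0.0.0.0", True),
        ("max", "255.255.255.255", True),
    ],
    "invalid_ipv4": [
        ("octet_256", "256.0.0.0", False),
        ("leading_zero", "01.2.3.4", False),
    ],
}


def run_tests(cases, check):
    """Run every case, print a pass/fail line per test, and return the totals."""
    passed = failed = 0
    for category, tests in cases.items():
        for name, text, expected in tests:
            ok = check(text) == expected
            status = "PASS" if ok else "FAIL"
            print(f"[{status}] {category}/{name}: {text!r}")
            if ok:
                passed += 1
            else:
                failed += 1
    print(f"{passed} passed, {failed} failed")
    return passed, failed


if __name__ == "__main__":
    run_tests(test_cases, is_ipv4)
```

Returning the totals makes it easy for a caller to fail a script when any case fails.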

Common Pitfalls

Incomplete First Attempt

Problem: Creating incomplete or truncated test files
Solution: Plan the full test structure before writing; validate file completeness before execution

Environment Assumptions

Problem: Assuming the python command exists when only python3 is available
Solution: Check the Python environment first or use python3 explicitly

Scattered Reasoning

Problem: Disorganized thought process leading to repeated work
Solution: Follow the decomposition strategy linearly; complete each sub-pattern before moving to the next

Duplicate Patterns Without Abstraction

Problem: Same regex pattern repeated multiple times, increasing error risk
Solution: Define complex sub-patterns once in reasoning, then reference them; in code, use variables

Missing Edge Cases

Problem: Focusing only on happy path validation
Solution: Explicitly test:
Boundary values (min/max for each component)
Invalid values just outside valid range
Empty and null cases
Patterns at different positions in the line

Order of Alternatives

Problem: Less specific alternatives matching before more specific ones
Solution: Order regex alternatives from most specific to least specific (e.g., 25[0-5] before 2[0-4][0-9] before [0-9])
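
The effect of alternative order is easy to demonstrate with Python's re module, which tries alternatives left to right. BAD_OCTET and GOOD_OCTET are illustrative names.

```python
import re

# Same alternatives, opposite order.
BAD_OCTET = r"(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
GOOD_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"


def matched_prefix(pattern: str, text: str):
    """Return what the pattern matches at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


# With the least specific alternative first, only one digit is consumed.
print(matched_prefix(BAD_OCTET, "255"))   # 2
print(matched_prefix(GOOD_OCTET, "255"))  # 255
```

When the pattern is forced to cover the whole string (for example with fullmatch), backtracking hides the problem, so the bug tends to appear only once the octet is embedded in a longer, unanchored expression.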

Greedy vs Non-Greedy Matching

Problem: Unexpected capture due to greedy quantifiers
Solution: Understand when to use .* vs .*?; for "last occurrence" patterns, greedy .* is typically correct
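
A short sketch of the difference, using a made-up id=NN field as the target: the greedy prefix lands on the last occurrence and the lazy prefix on the first.

```python
import re

line = "id=10 id=20 id=30"

# Greedy .* consumes the whole line, then backtracks to the last "id=".
greedy = re.search(r"^.*id=(\d+)", line).group(1)

# Lazy .*? expands one character at a time, so it stops at the first "id=".
lazy = re.search(r"^.*?id=(\d+)", line).group(1)

print(greedy)  # 30
print(lazy)    # 10
```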

Workflow Summary

List all requirements explicitly
Build and test sub-patterns independently
Combine sub-patterns with appropriate anchors and lookaheads
Create comprehensive test cases covering all categories
Run tests and verify all pass
Clean up test files after validation
