Regex Log Parsing Skill
This skill provides a systematic approach for constructing complex regular expressions that extract and validate structured data from log files.
When to Use This Skill
This skill applies when:
-
Building regex patterns to extract data from log entries
-
Validating specific formats (IPv4 addresses, dates, timestamps) within logs
-
Handling requirements for first/last occurrence selection
-
Enforcing word boundary conditions
-
Combining multiple validation constraints in a single pattern
Approach: Decomposition Strategy
Complex log parsing regex should be built by decomposing the problem into sub-patterns:
Step 1: Identify All Requirements
Before writing any regex, create a complete list of requirements:
-
What data needs to be validated (present but not captured)?
-
What data needs to be captured?
-
What boundary conditions apply (word boundaries, line anchors)?
-
Are there positional requirements (first, last, nth occurrence)?
-
What constitutes an invalid match?
Step 2: Build Sub-Patterns Independently
Construct each validation pattern separately before combining:
IPv4 Address Pattern
For valid IPv4 addresses (0-255 per octet, no leading zeros except for 0 itself):
-
Octet pattern: (?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
-
Order alternatives from most specific to least specific
-
Full IPv4: (?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]).){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
Date Pattern (YYYY-MM-DD)
For valid dates with proper month-day validation:
-
31-day months: (?:0[13578]|1[02])-(?:0[1-9]|[12][0-9]|3[01])
-
30-day months: (?:0[469]|11)-(?:0[1-9]|[12][0-9]|30)
-
February (up to 29): 02-(?:0[1-9]|1[0-9]|2[0-9])
-
Combine with year: [0-9]{4}-(?:...combined month-day patterns...)
Step 3: Apply Positional Requirements
Selecting Last Occurrence
To capture the last valid pattern in a line:
^.<pattern>(?!.<pattern>)
-
Use ^.* to greedily consume characters
-
Use negative lookahead (?!.*<pattern>) to ensure no pattern follows
Selecting First Occurrence
To capture the first valid pattern:
^(?:(?!<pattern>).)*<pattern>
Or simply rely on regex engines returning the first match by default.
Step 4: Apply Validation Without Capture
To require presence of a pattern without capturing it:
-
Use lookahead: (?=.*<pattern>) at the start of the regex
-
This validates the line contains the pattern without affecting the capture
Step 5: Apply Word Boundaries
For patterns that must not be adjacent to alphanumeric characters:
-
Use \b word boundaries: \b<pattern>\b
-
Be aware that \b matches between word and non-word characters
Verification Strategy
Create Comprehensive Test Cases
Organize tests by category:
Valid cases: Confirm expected matches
-
Minimum/maximum valid values (e.g., 0.0.0.0, 255.255.255.255)
-
Edge values for each component
Invalid format cases: Confirm rejection
-
Out-of-range values (e.g., 256.0.0.0)
-
Invalid formatting (leading zeros where prohibited)
-
Invalid months (00, 13) or days (32)
Boundary condition cases:
-
Pattern at start/end of line
-
Pattern adjacent to alphanumeric characters (should fail with word boundaries)
-
Pattern adjacent to punctuation (should pass with word boundaries)
Positional cases:
-
Multiple valid patterns in one line (verify correct one is captured)
-
Single pattern in line
-
No valid pattern in line
Test File Structure
Create a structured test file that:
-
Groups tests by category
-
Uses clear naming for each test case
-
Reports pass/fail status for each test
-
Summarizes overall results
Example structure:
test_cases = { "valid_ipv4": [...], "invalid_ipv4": [...], "valid_dates": [...], "invalid_dates": [...], "last_occurrence": [...], "boundary_conditions": [...] }
Common Pitfalls
- Incomplete First Attempt
-
Problem: Creating incomplete or truncated test files
-
Solution: Plan the full test structure before writing; validate file completeness before execution
- Environment Assumptions
-
Problem: Assuming python command exists when only python3 is available
-
Solution: Check the Python environment first or use python3 explicitly
- Scattered Reasoning
-
Problem: Disorganized thought process leading to repeated work
-
Solution: Follow the decomposition strategy linearly; complete each sub-pattern before moving to the next
- Duplicate Patterns Without Abstraction
-
Problem: Same regex pattern repeated multiple times, increasing error risk
-
Solution: Define complex sub-patterns once in reasoning, then reference them; in code, use variables
- Missing Edge Cases
-
Problem: Focusing only on happy path validation
-
Solution: Explicitly test:
-
Boundary values (min/max for each component)
-
Invalid values just outside valid range
-
Empty and null cases
-
Patterns at different positions in the line
- Order of Alternatives
-
Problem: Less specific alternatives matching before more specific ones
-
Solution: Order regex alternatives from most specific to least specific (e.g., 25[0-5] before 2[0-4][0-9] before [0-9] )
- Greedy vs Non-Greedy Matching
-
Problem: Unexpected capture due to greedy quantifiers
-
Solution: Understand when to use .* vs .? ; for "last occurrence" patterns, greedy . is typically correct
Workflow Summary
-
List all requirements explicitly
-
Build and test sub-patterns independently
-
Combine sub-patterns with appropriate anchors and lookaheads
-
Create comprehensive test cases covering all categories
-
Run tests and verify all pass
-
Clean up test files after validation