filter-js-from-html

Filter JavaScript from HTML

Overview

This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.

Critical Requirements Analysis

Before implementation, identify and prioritize these requirements:

Security completeness: All XSS vectors must be removed
Format preservation: Output must be identical to the input except where harmful content was removed
Clean content handling: Files without XSS content should remain completely unchanged

These requirements often conflict: comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.

Approach Selection

Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)

When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.

Advantages:

Preserves whitespace, attribute ordering, and quote styles exactly
Does not reconstruct or reformat HTML
Output matches input character-for-character except for removed content

Considerations:

Requires careful pattern construction to avoid partial matches
Must handle various encodings and obfuscation techniques
Test patterns against comprehensive XSS vector lists
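
As a minimal sketch of surgical removal, the snippet below deletes whole script elements with a single regular expression and leaves every other character untouched. The names SCRIPT_ELEMENT and strip_script_elements are illustrative. The pattern assumes each script tag is closed and has no '>' inside its attribute values, so a real filter would need more patterns than this one.

```python
import re

# Matches an entire <script ...>...</script> element, case-insensitively.
# Assumption: the closing tag is present; unclosed tags need separate handling.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_script_elements(html_text: str) -> str:
    """Remove script elements; all other characters are returned as-is."""
    return SCRIPT_ELEMENT.sub("", html_text)
```

Because the function only deletes matched spans, input with no script element comes back unchanged, which is the property the clean-content requirement asks for.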

Option 2: HTML Parser-Based Filtering

Use this approach when format preservation is less critical or when dealing with malformed HTML.

Considerations:

HTML parsers inherently reconstruct output, changing formatting
May normalize attribute quotes, whitespace, and tag casing
Better for malformed HTML that regex cannot reliably parse
If using this approach, verify that clean HTML files remain unchanged

Comprehensive XSS Vector Checklist

Before implementing, research and account for ALL of these attack categories:

Script Execution Tags

<script> tags (including variations with attributes)
<noscript> abuse cases

Event Handlers (Comprehensive List Required)

Common handlers:

onclick, onload, onerror, onmouseover, onfocus, onblur

Frequently missed handlers:

onlayoutcomplete, ontimeerror, onselectionchange
onrowsinserted, onrowsdelete, onrowexit, onrowenter
oncellchange, ondataavailable, ondatasetchanged, ondatasetcomplete
onbeforeupdate, onafterupdate, onerrorupdate
onfilterchange, onpropertychange, onreadystatechange
onbeforeprint, onafterprint, onbeforeunload
oncontextmenu, ondrag, ondragend, ondragenter, ondragleave
ondragover, ondragstart, ondrop
onhashchange, oninput, oninvalid, onpageshow, onpagehide
onpopstate, onresize, onstorage, onwheel

Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.
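
One possible complement to a researched list is to drop any attribute whose name starts with "on", which also covers handlers the list misses. The sketch below does that inside start tags only. TAG, ATTR and strip_event_handlers are hypothetical names, and the prefix rule is an assumption: it would also remove a custom attribute that merely begins with "on". An attribute glued directly to a preceding closing quote (no whitespace or slash before it) is not handled.

```python
import re

# A whole start tag; '>' is allowed inside quoted attribute values.
TAG = re.compile(r"""<[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>""")

# One attribute: separator (whitespace or '/'), name, optional value.
# Quoted values are consumed whole, so text inside them is never re-scanned.
ATTR = re.compile(
    r"""([\s/]+)([^\s=<>/"']+)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?"""
)


def _drop_handler(match: re.Match) -> str:
    # Delete the attribute (and its leading separator) if the name starts with "on".
    return "" if match.group(2).lower().startswith("on") else match.group(0)


def strip_event_handlers(html_text: str) -> str:
    """Remove on* attributes from start tags; text outside tags is untouched."""
    return TAG.sub(lambda m: ATTR.sub(_drop_handler, m.group(0)), html_text)
```

For example, strip_event_handlers('<svg/onload=alert(1)>') yields '<svg>', while a title attribute whose value happens to contain " onclick=x " is left alone because the value is matched as part of the title attribute.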

JavaScript URL Protocol

javascript: in href, src, action, formaction, data, poster attributes
Case variations: JavaScript:, JAVASCRIPT:, JaVaScRiPt:
Encoded variations (HTML character references): &#106;avascript:, &#x6A;avascript:
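
A rough way to catch the case and encoding variations above is to normalize an attribute value the way a browser roughly would before checking its scheme. The sketch below is an illustration, not a complete URL parser. The function name has_script_scheme is hypothetical, and the normalization steps (decode character references, remove tabs and newlines, strip leading control characters and spaces) are a simplification of browser URL parsing.

```python
import html
import re

_TAB_NEWLINE = re.compile(r"[\t\n\r]")     # removed anywhere in the URL
_LEADING_JUNK = re.compile(r"^[\x00-\x20]+")  # control chars and spaces at the start


def has_script_scheme(attr_value: str) -> bool:
    """Return True if the attribute value normalizes to a javascript: URL."""
    decoded = html.unescape(attr_value)        # character references -> characters
    decoded = _TAB_NEWLINE.sub("", decoded)    # "java\nscript:" -> "javascript:"
    decoded = _LEADING_JUNK.sub("", decoded)
    return decoded.lower().startswith("javascript:")
```

The check is applied to the decoded copy only; the original attribute text is what gets removed or kept, so clean attributes are never rewritten.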

Other Dangerous Protocols

vbscript: (IE legacy)
data: URIs with script content: data:text/html,<script>...</script>
data:text/html;base64,... encoded payloads
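
The data: cases can be sketched as a small classifier that looks at the media type and, conservatively, at the decoded payload. Everything named here (DATA_URI, ACTIVE_TYPES, is_dangerous_data_uri) is illustrative, the set of "active" media types is an assumption, and the input is assumed to be an already entity-decoded attribute value.

```python
import base64
import binascii
import re
from urllib.parse import unquote

# data:[<media type>][;param...][;base64],<payload>
DATA_URI = re.compile(r"data:([^,;]*)((?:;[^,;]*)*),(.*)", re.IGNORECASE | re.DOTALL)

# Assumption: media types a browser may render as an active document.
ACTIVE_TYPES = {"text/html", "application/xhtml+xml", "image/svg+xml"}


def is_dangerous_data_uri(url: str) -> bool:
    match = DATA_URI.fullmatch(url.strip())
    if match is None:
        return False                      # not a data: URI at all
    media_type = match.group(1).strip().lower()
    params = match.group(2).lower()
    payload = match.group(3)
    if media_type in ACTIVE_TYPES:
        return True
    if ";base64" in params:
        try:
            payload = base64.b64decode(payload).decode("utf-8", "replace")
        except (binascii.Error, ValueError):
            return True                   # undecodable payload: reject conservatively
    else:
        payload = unquote(payload)        # plain payloads may be percent-encoded
    return "<script" in payload.lower()
```

Flagging a script tag inside a non-active media type is deliberately stricter than what a browser would execute; whether that strictness is wanted depends on the task.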

CSS-Based Attacks

<style> tags with dangerous properties
-moz-binding (Firefox legacy)
expression() (IE legacy)
behavior: property (IE legacy)
@import with javascript or data URIs
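
The CSS items above could be detected with a handful of patterns applied after CSS comments are stripped, since comments can be used to split a keyword. This is only a sketch: CSS_THREATS and find_css_threats are hypothetical names, and CSS backslash escapes are not decoded here. Ordinary url(data:image/...) backgrounds are intentionally not flagged.

```python
import re

CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

CSS_THREATS = {
    "expression()": re.compile(r"expression\s*\(", re.I),
    "-moz-binding": re.compile(r"-moz-binding\s*:", re.I),
    # The lookbehind avoids matching scroll-behavior and similar properties.
    "behavior": re.compile(r"(?<![\w-])behavior\s*:", re.I),
    "script url()": re.compile(r"url\s*\(\s*['\"]?\s*(?:javascript|vbscript)\s*:", re.I),
    "@import script/data": re.compile(
        r"@import\s*(?:url\s*\(\s*)?['\"]?\s*(?:javascript|vbscript|data)\s*:", re.I
    ),
}


def find_css_threats(css: str) -> list:
    """Return the names of the threat patterns found in a CSS string."""
    normalized = CSS_COMMENT.sub("", css)   # "exp/**/ression(" -> "expression("
    return [name for name, pattern in CSS_THREATS.items() if pattern.search(normalized)]
```

The same function can be run on the contents of a style element or on a style attribute value.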

Meta Tag Attacks

<meta http-equiv="refresh" content="0;url=data:text/html,...">
<meta http-equiv="refresh" content="0;url=javascript:...">
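
To decide whether a meta refresh is dangerous, the URL has to be pulled out of the content value first. The sketch below shows one way to do that; refresh_target and is_dangerous_refresh are illustrative names, and the content value is assumed to be already entity-decoded.

```python
import re
from typing import Optional

# "<seconds> ; url=<target>" where the "url=" prefix and quotes are optional.
REFRESH = re.compile(r"\s*[\d.]+\s*[;,]\s*(?:url\s*=\s*)?(.*)", re.I | re.S)


def refresh_target(content: str) -> Optional[str]:
    """Return the URL part of a meta refresh content value, or None."""
    match = REFRESH.match(content)
    if match is None:
        return None                      # e.g. content="30" (reload only)
    return match.group(1).strip().strip("'\"")


def is_dangerous_refresh(content: str) -> bool:
    target = refresh_target(content)
    if target is None:
        return False
    # Conservative: drop all whitespace and control characters before the check.
    compact = re.sub(r"[\x00-\x20]", "", target).lower()
    return compact.startswith(("javascript:", "data:", "vbscript:"))
```

A plain redirect such as "5; url=https://example.com/" passes the check, so clean pages that use meta refresh are not affected.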

External Resource Loading

<link> tags with dangerous href values
<object> tags with data attributes
<embed> tags with src attributes
<applet> tags (legacy)
<iframe> with src or srcdoc containing scripts

SVG-Based Attacks

<svg onload="..."> and other SVG event handlers
<svg><script>...</script></svg>
SVG <use> with external references

Encoding and Obfuscation

HTML entity encoding: &lt;script&gt;
URL encoding: %3Cscript%3E
UTF-7 encoding attacks
Null byte injection: <scr\0ipt>
Unicode variations
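
Several of the encodings above can be unwound before pattern matching. The following sketch decodes HTML character references and percent-encoding repeatedly and drops null bytes; normalize_for_detection and max_rounds are hypothetical names. It is meant for deciding whether a candidate span (such as an attribute value) is dangerous, not for rewriting the document, because entity-encoded markup in ordinary text content is harmless and must stay as written. UTF-7 and other Unicode tricks are not covered.

```python
import html
from urllib.parse import unquote


def normalize_for_detection(text: str, max_rounds: int = 3) -> str:
    """Decode layered encodings so detection patterns see the plain form."""
    for _ in range(max_rounds):
        decoded = unquote(html.unescape(text)).replace("\x00", "")
        if decoded == text:
            break                         # nothing left to decode
        text = decoded
    return text.lower()
```

The round limit is an arbitrary bound chosen for the sketch; it stops pathological inputs from looping while still unwinding double encoding such as %253Cscript%253E.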

HTML Comment Exploits

Conditional comments (IE legacy) that wrap script content
Nested comment breaking

Verification Strategy

Test Categories (All Required)

XSS Attack Vectors

Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)
Test XSS polyglots that combine multiple techniques
Include lesser-known event handlers in tests
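
One way to check filtered output without reusing the filter's own patterns is to re-parse it with an independent parser and report anything executable that survived. The sketch below uses Python's standard html.parser for that; the XSSAudit class and audit function are illustrative, and the checks cover only script elements, on* attributes and script-scheme URLs.

```python
from html.parser import HTMLParser


class XSSAudit(HTMLParser):
    """Collects findings from start tags; tag and attribute names arrive lowercased."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.findings.append("<script> element")
        for name, value in attrs:
            if name.startswith("on"):
                self.findings.append(f"{name} handler on <{tag}>")
            elif value and "".join(value.split()).lower().startswith(
                ("javascript:", "vbscript:")
            ):
                # Attribute values are already entity-decoded by the parser.
                self.findings.append(f"script URL in {name} on <{tag}>")


def audit(html_text: str) -> list:
    parser = XSSAudit()
    parser.feed(html_text)
    parser.close()
    return parser.findings
```

Running audit over the filter's output for every vector in an external test suite, and expecting an empty list each time, gives a check that does not simply mirror the implementation.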

Format Preservation

Provide clean HTML files with varied formatting
Verify byte-for-byte identical output for clean files
Test various whitespace patterns, quote styles, attribute ordering
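
The byte-for-byte check can be sketched as a helper that runs a filter over clean samples and returns the ones it changed. The sanitize argument stands for whatever filter function is under test, and CLEAN_SAMPLES and modified_samples are hypothetical names.

```python
from typing import Callable, Iterable, List

# Hypothetical clean samples with deliberately uneven formatting.
CLEAN_SAMPLES = [
    "<p  class='a'   id=\"b\">text</p>\n",
    "<DIV>\r\n\t<Img SRC=logo.png ALT=Logo>\r\n</DIV>",
    "<!-- a comment -->\n<ul>\n  <li>one\n  <li>two\n</ul>\n",
]


def modified_samples(sanitize: Callable[[str], str], samples: Iterable[str]) -> List[str]:
    """Return the clean samples whose bytes differ after filtering (ideally none)."""
    return [s for s in samples if sanitize(s).encode("utf-8") != s.encode("utf-8")]
```

When the samples come from files, reading them in binary mode or with newline="" keeps line endings such as \r\n intact, so the comparison reflects what is actually on disk.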

Edge Cases

Malformed HTML
Mixed case tags and attributes
Attributes without quotes
Multiple encodings in same document

Testing Process

Research first: Before writing tests, search for:

OWASP XSS Prevention Cheat Sheet
XSS Filter Evasion Cheat Sheet
Known XSS polyglots
Browser-specific attack vectors

Create adversarial tests: Do not rely solely on self-created test cases

Use external comprehensive test suites
Include vectors that have bypassed filters historically

Test clean content preservation: Equal priority to security testing

Create diverse clean HTML samples
Verify no modifications occur
Check whitespace, comments, attribute order

Common Pitfalls

Incomplete Event Handler Lists

Mistake: Hardcoding only common event handlers like onclick, onload, onerror. Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.

Ignoring CSS Attack Vectors

Mistake: Focusing only on JavaScript while ignoring CSS-based XSS. Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.

Missing Protocol Handlers

Mistake: Only filtering the javascript: protocol. Solution: Also filter vbscript:, data: URIs with dangerous content, and handle encoded protocol names.

Format Alteration with Parsers

Mistake: Using HTML parsers when format preservation is required. Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.

Self-Validating Tests

Mistake: Creating test cases that match implementation capabilities rather than real attack vectors. Solution: Use external, adversarial test suites created by security researchers.

Quote and Encoding Handling

Mistake: Not handling HTML entities in attributes (&quot;, &#39;). Solution: Consider how encoded characters in attributes might bypass filters.

Forgetting Meta Refresh

Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs. Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.

Ignoring External Resources

Mistake: Not filtering <link>, <object>, <embed> tags. Solution: Evaluate whether these tags can load or execute dangerous content.

Implementation Checklist

Before considering the implementation complete:

Researched comprehensive XSS attack vector lists
Implemented filtering for ALL event handlers (not just common ones)
Handled script tags and noscript abuse
Filtered javascript:, vbscript:, and dangerous data: URIs
Addressed CSS-based attacks (style tags, expressions, bindings)
Handled meta refresh attacks
Considered link, object, embed, applet tags
Handled SVG-based attacks
Accounted for encoding variations
Tested with external XSS test suites
Verified clean HTML files remain unchanged
Tested format preservation (whitespace, quotes, ordering)
