filter-js-from-html

Filter JavaScript from HTML

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "filter-js-from-html" with this command: npx skills add letta-ai/skills/letta-ai-skills-filter-js-from-html

Filter JavaScript from HTML

Overview

This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.

Critical Requirements Analysis

Before implementation, identify and prioritize these requirements:

  • Security completeness: All XSS vectors must be removed

  • Format preservation: Output must be functionally identical to input except for harmful content removal

  • Clean content handling: Files without XSS content should remain completely unchanged

These requirements often conflict - comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.

Approach Selection

Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)

When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.

Advantages:

  • Preserves whitespace, attribute ordering, quote styles exactly

  • Does not reconstruct or reformat HTML

  • Output matches input character-for-character except for removed content

Considerations:

  • Requires careful pattern construction to avoid partial matches

  • Must handle various encodings and obfuscation techniques

  • Test patterns against comprehensive XSS vector lists

Option 2: HTML Parser-Based Filtering

When format preservation is less critical or when dealing with malformed HTML.

Considerations:

  • HTML parsers inherently reconstruct output, changing formatting

  • May normalize attribute quotes, whitespace, tag casing

  • Better for malformed HTML that regex cannot reliably parse

  • If using this approach, verify that clean HTML files remain unchanged

Comprehensive XSS Vector Checklist

Before implementing, research and account for ALL of these attack categories:

  1. Script Execution Tags
  • <script> tags (including variations with attributes)

  • <noscript> abuse cases

  1. Event Handlers (Comprehensive List Required)

Common handlers:

  • onclick , onload , onerror , onmouseover , onfocus , onblur

Frequently missed handlers:

  • onlayoutcomplete , ontimeerror , onselectionchange

  • onrowsinserted , onrowsdelete , onrowexit , onrowenter

  • oncellchange , ondataavailable , ondatasetchanged , ondatasetcomplete

  • onbeforeupdate , onafterupdate , onerrorupdate

  • onfilterchange , onpropertychange , onreadystatechange

  • onbeforeprint , onafterprint , onbeforeunload

  • oncontextmenu , ondrag , ondragend , ondragenter , ondragleave

  • ondragover , ondragstart , ondrop

  • onhashchange , oninput , oninvalid , onpageshow , onpagehide

  • onpopstate , onresize , onstorage , onwheel

Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.

  1. JavaScript URL Protocol
  • javascript: in href, src, action, formaction, data, poster attributes

  • Case variations: JavaScript: , JAVASCRIPT: , JaVaScRiPt:

  • Encoded variations: &#106;avascript: , java&#115;cript:

  1. Other Dangerous Protocols
  • vbscript: (IE legacy)

  • data: URIs with script content: data:text/html,<script>...</script>

  • data:text/html;base64,... encoded payloads

  1. CSS-Based Attacks
  • <style> tags with dangerous properties

  • -moz-binding (Firefox legacy)

  • expression() (IE legacy)

  • behavior: property

  • @import with javascript or data URIs

  1. Meta Tag Attacks
  • <meta http-equiv="refresh" content="0;url=data:text/html,...">

  • <meta http-equiv="refresh" content="0;url=javascript:...">

  1. External Resource Loading
  • <link> tags with dangerous href values

  • <object> tags with data attributes

  • <embed> tags with src attributes

  • <applet> tags (legacy)

  • <iframe> with src or srcdoc containing scripts

  1. SVG-Based Attacks
  • <svg onload="..."> and other SVG event handlers

  • <svg><script>...</script></svg>

  • SVG <use> with external references

  1. Encoding and Obfuscation
  • HTML entity encoding: &#60;script&#62;

  • URL encoding: %3Cscript%3E

  • UTF-7 encoding attacks

  • Null byte injection: <scr\0ipt>

  • Unicode variations

  1. HTML Comment Exploits
  • Conditional comments: <!--[if IE]><script>...<![endif]-->

  • Nested comment breaking

Verification Strategy

Test Categories (All Required)

XSS Attack Vectors

  • Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)

  • Test XSS polyglots that combine multiple techniques

  • Include lesser-known event handlers in tests

Format Preservation

  • Provide clean HTML files with varied formatting

  • Verify byte-for-byte identical output for clean files

  • Test various whitespace patterns, quote styles, attribute ordering

Edge Cases

  • Malformed HTML

  • Mixed case tags and attributes

  • Attributes without quotes

  • Multiple encodings in same document

Testing Process

Research first: Before writing tests, search for:

  • OWASP XSS Prevention Cheat Sheet

  • XSS Filter Evasion Cheat Sheet

  • Known XSS polyglots

  • Browser-specific attack vectors

Create adversarial tests: Do not rely solely on self-created test cases

  • Use external comprehensive test suites

  • Include vectors that have bypassed filters historically

Test clean content preservation: Equal priority to security testing

  • Create diverse clean HTML samples

  • Verify no modifications occur

  • Check whitespace, comments, attribute order

Common Pitfalls

  1. Incomplete Event Handler Lists

Mistake: Hardcoding only common event handlers like onclick , onload , onerror . Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.

  1. Ignoring CSS Attack Vectors

Mistake: Focusing only on JavaScript while ignoring CSS-based XSS. Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.

  1. Missing Protocol Handlers

Mistake: Only filtering javascript: protocol. Solution: Also filter vbscript: , data: URIs with dangerous content, and handle encoded protocol names.

  1. Format Alteration with Parsers

Mistake: Using HTML parsers when format preservation is required. Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.

  1. Self-Validating Tests

Mistake: Creating test cases that match implementation capabilities rather than real attack vectors. Solution: Use external, adversarial test suites created by security researchers.

  1. Quote and Encoding Handling

Mistake: Not handling HTML entities in attributes (&quot; , &#39; ). Solution: Consider how encoded characters in attributes might bypass filters.

  1. Forgetting Meta Refresh

Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs. Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.

  1. Ignoring External Resources

Mistake: Not filtering <link> , <object> , <embed> tags. Solution: Evaluate whether these tags can load or execute dangerous content.

Implementation Checklist

Before considering the implementation complete:

  • Researched comprehensive XSS attack vector lists

  • Implemented filtering for ALL event handlers (not just common ones)

  • Handled script tags and noscript abuse

  • Filtered javascript:, vbscript:, and dangerous data: URIs

  • Addressed CSS-based attacks (style tags, expressions, bindings)

  • Handled meta refresh attacks

  • Considered link, object, embed, applet tags

  • Handled SVG-based attacks

  • Accounted for encoding variations

  • Tested with external XSS test suites

  • Verified clean HTML files remain unchanged

  • Tested format preservation (whitespace, quotes, ordering)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

extracting-pdf-text

No summary provided by upstream source.

Repository SourceNeeds Review
General

video-processing

No summary provided by upstream source.

Repository SourceNeeds Review
General

google-workspace

No summary provided by upstream source.

Repository SourceNeeds Review
General

portfolio-optimization

No summary provided by upstream source.

Repository SourceNeeds Review