refactor(html_cleaner): adopt robust HTML parsing for content cleaning

Replaces the fragile regex-based HTML cleaning logic with a proper HTML parser using `golang.org/x/net/html`. The previous implementation was unreliable and could not correctly handle malformed tags, script content, or a wide range of HTML entities.

This new approach provides several key improvements:
- Skips the content of `
This commit is contained in:
2025-11-06 04:26:51 +01:00
parent 2790064ad5
commit e6977d3374
4 changed files with 52 additions and 34 deletions

1
go.mod
View File

@ -4,6 +4,7 @@ go 1.24.0
require (
github.com/fumiama/go-docx v0.0.0-20250506085032-0c30fd09304b
golang.org/x/net v0.46.0
golang.org/x/text v0.30.0
)