shared × unstructured — bad data written permanently, and it spreads
Encoding / Charset Corruption (Mojibake)
Bytes written in one charset and read back assuming another corrupt the stored text permanently.
01 the recipe
In the wild
compound of: File & Network Access | CWE-176 Unicode Handling | Version & Library Mismanagement | compound: CWE-172 Encoding Error
example.py
# SMELL: write bytes in one charset, read them back assuming another.
# (file / network access x version / library mismanagement)
name = "José"  # example value with a non-ASCII character
with open("names.csv", "w") as f:  # platform-default encoding (varies across hosts and versions)
    f.write(name)
...
name = open("names.csv", encoding="utf-8").read()  # decoded as UTF-8 -> mojibake, or a decode error
# every round-trip mangles the bytes further; the store is now corrupt.

# RIGHT: pin the encoding on both ends; never rely on the default.
with open("names.csv", "w", encoding="utf-8") as f:
    f.write(name)
with open("names.csv", encoding="utf-8") as f:
    name = f.read()

Relying on a default encoding that differs across versions and hosts means bytes written as one charset are decoded as another. The text is corrupted in place, permanently, and the damage compounds on each re-save.
// observed
mojibake: 'Jose' with an accent reads back garbled, then worse
pinned: utf-8 on both ends, bytes survive the round-trip
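The compounding can be reproduced deterministically by pinning a mismatched pair of encodings instead of relying on whatever the platform default happens to be. This is a minimal sketch, assuming the wrong reader uses cp1252 (Windows-1252); the `round_trip` helper and the temporary file are hypothetical and not part of the original example.

```python
import os
import tempfile


def round_trip(text, write_enc, read_enc):
    """Write text to a temp file as write_enc, then read it back as read_enc."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding=write_enc) as f:
            f.write(text)
        with open(path, encoding=read_enc) as f:
            return f.read()
    finally:
        os.remove(path)


name = "José"

# Assumed mismatch: written as UTF-8, read back as cp1252.
once = round_trip(name, "utf-8", "cp1252")    # 'JosÃ©'
# Re-saving the already-garbled text and misreading it again makes it worse.
twice = round_trip(once, "utf-8", "cp1252")   # 'JosÃƒÂ©'
# Same encoding on both ends: the text survives unchanged.
pinned = round_trip(name, "utf-8", "utf-8")   # 'José'

# ascii() escapes non-ASCII characters, so printing works on any terminal.
print(ascii(once), ascii(twice), ascii(pinned))
```

In this particular case a single mis-decode is still reversible with `once.encode("cp1252").decode("utf-8")`, because no byte was lost. That stops being true once the garbled text has been re-saved through a lossy step or decoded with `errors="replace"`.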
02 weakness catalog
Mapped weaknesses (CWE)
On its own, this defect is catalogued by MITRE as one or more of these weaknesses. The exploitable vulnerability usually appears only when it chains or combines with another.