Encoding Special Characters in XML | Baeldung (2024)

1. Introduction

In this article, we’re going to explore XML entities, what they are, and what they can do for us. In particular, we’ll see what entities exist as standard with XML and how we can define our own if necessary.

2. How Is XML Structured?

XML is a markup format for representing arbitrary data. It does this using a hierarchical structure of XML elements, each of which can have attributes. For example:

<part number="1976">
    <name>Windscreen Wiper</name>
</part>

This shows an element called “part” that has one attribute – “number” – and one nested element – “name”.
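To see this structure from code, here is a minimal sketch that parses the example and reads its attribute and nested element. Python and its standard `xml.etree.ElementTree` module are our choice for illustration; the article itself doesn't depend on any particular language.

```python
import xml.etree.ElementTree as ET

# The example document from above.
PART_XML = """<part number="1976">
    <name>Windscreen Wiper</name>
</part>"""

part = ET.fromstring(PART_XML)

# Attributes are read with get(); nested elements are located with find().
part_number = part.get("number")
part_name = part.find("name").text

print(part.tag, part_number, part_name)
```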

Notably, the XML language uses some special characters to manage this. For example, every tag starts with a less-than sign, “<”, and ends with a greater-than sign, “>”.

However, because these characters have a special meaning to XML, we can’t use them directly within our content. Doing so would be ambiguous at best and outright unparsable at worst.

For example, if we were to try to use XML to represent a simple math equation, we might write:

<math> 1 < x > 5 </math>

This attempts to compare x against the numbers 1 and 5. However, an XML processor treats every “<” as the start of markup, so it can’t know that we didn’t intend to have an element “x” in between the two numbers.
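We can check how a real parser reacts to this content. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` is ours, and other processors may report the problem differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The raw "<" inside the content is read as the start of markup.
broken = is_well_formed("<math> 1 < x > 5 </math>")

# Content without special characters parses without trouble.
plain = is_well_formed("<math> 1 </math>")

print(broken, plain)
```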

3. Standard XML Entities

XML solves this problem through the use of XML entities. These are special sequences that stand in for other characters.

A reference to an XML entity always starts with an ampersand character – “&” – and ends with a semicolon character – “;”. The name of the entity sits between these two characters. For example, the entity “&lt;” is used to represent the less-than character – “<”.

There’s a set of five standard entities that are necessary to represent the characters with special meaning to XML:

Entity    Character Represented
&amp;     Ampersand – &
&apos;    Apostrophe – '
&gt;      Greater-than sign – >
&lt;      Less-than sign – <
&quot;    Quotation mark – "
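As a quick check, any conforming parser should turn these five entities back into their characters without extra configuration. A minimal sketch with Python's standard `xml.etree.ElementTree` (the element name `chars` is made up for the example):

```python
import xml.etree.ElementTree as ET

# One reference to each of the five predefined entities, separated by spaces.
doc = "<chars>&amp; &apos; &gt; &lt; &quot;</chars>"

# The parser replaces each reference with the character it stands for.
decoded = ET.fromstring(doc).text

print(decoded)
```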

Knowing this, our above attempt to represent a math equation would become:

<math> 1 &lt; x &gt; 5 </math>

Now there’s no ambiguity in how to understand this.
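In practice, we rarely write these entities by hand. As a sketch, Python's standard `xml.sax.saxutils.escape` function replaces “&”, “<” and “>” for us, and accepts an optional mapping for further characters such as quotes. The `QUOTES` mapping and the sample strings are our own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = " 1 < x > 5 "

# escape() handles &, < and > by default.
escaped = escape(raw)

# Once escaped, the content can be embedded and parsed back unchanged.
document = f"<math>{escaped}</math>"
round_tripped = ET.fromstring(document).text

# Quotes are only escaped if we ask for it through the optional mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}
attr_safe = escape("5\" & 7' wipers", QUOTES)

print(escaped)
print(attr_safe)
```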

4. Character Entities

In addition to the above, XML also offers the ability to represent arbitrary Unicode characters. We do this by directly referencing the Unicode code point in decimal or hexadecimal form.

These follow the same syntax as other entity references, meaning that they’re prefixed with an “&” character and suffixed with a “;” character. Decimal code points are then prefixed with a “#” character and hexadecimal ones with “#x”.

For example, the character “÷” is the division sign. Unicode represents this as the code point U+00F7, which is 247 in decimal. As such, we can represent this in XML as “&#247;” or as “&#xF7;”.
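Both forms can be derived from the code point and checked against a parser. A small sketch in Python, where the helper name `char_refs` is ours:

```python
import xml.etree.ElementTree as ET


def char_refs(ch):
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = char_refs("÷")

# Both references decode to the same character when parsed.
decoded = [ET.fromstring(f"<m>{ref}</m>").text for ref in (decimal_ref, hex_ref)]

print(decimal_ref, hex_ref, decoded)
```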

This is especially useful if we aren’t using a Unicode character set to encode our XML documents – for example, if we’re using ISO-8859-1 instead – but still want to represent Unicode characters. It can also be useful to represent certain special characters, such as non-printing or combining characters so that a developer reading it can see they’re present.
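To illustrate the encoding case, the sketch below serializes a document as ISO-8859-1 with Python's `xml.etree.ElementTree`. The euro sign (U+20AC) doesn't exist in that character set, so the serializer falls back to a character reference. The element name and text are made up for the example.

```python
import xml.etree.ElementTree as ET

price = ET.Element("price")
price.text = "5 €"

# ISO-8859-1 has no euro sign, so it is written as a character reference.
latin1_bytes = ET.tostring(price, encoding="iso-8859-1")

# Parsing the bytes again restores the original character.
reparsed = ET.fromstring(latin1_bytes)

# The same fallback is available for plain strings.
manual = "€".encode("iso-8859-1", "xmlcharrefreplace")

print(latin1_bytes)
```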

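Finally, character references can make control characters visible, but only those that XML permits. In XML 1.0, the only characters below U+0020 that may appear, whether bare or as a reference, are tab (U+0009), line feed (U+000A), and carriage return (U+000D). XML 1.1 additionally allows U+0001 through U+001F when written as character references. U+0000, the Nul character, can’t be represented in either version, so a reference such as “&#0;” makes the document ill-formed.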
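One caveat is worth checking against a parser: a character reference must itself name a character that XML allows, and the Nul character is never allowed. The sketch below uses Python's `xml.etree.ElementTree`, which follows XML 1.0; the helper name `parses` is ours.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A tab is a legal XML character, so a reference to it is accepted.
tab_ok = parses("<c>&#x9;</c>")
tab_text = ET.fromstring("<c>&#x9;</c>").text

# U+0000 is not a legal XML character, even as a reference.
nul_ok = parses("<c>&#x0;</c>")

print(tab_ok, nul_ok)
```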

5. Custom Entities

It’s also possible for us to define our own XML entities. This lets us specify an entity name of our choosing and define the value that it’ll be replaced with. This can help if we have certain values that are repetitive and that we need to manage easily, but it does open up some potential security risks if used carelessly.

We need to use a Document Type Definition (DTD) to define custom entities. This is a section before the start of the XML document that can be used to define its structure – similar to an XSD. We do this with the “<!DOCTYPE name […]>” construct, where “name” is an arbitrary name for the DTD:

<!DOCTYPE example [
    ....
]>
<part number="1976">
    <name>Windscreen Wiper</name>
</part>

Inside this construct, we include the DTD definition. This can include, among other things, custom entity definitions – either as internal or external entities.

5.1. Internal Entities

An internal entity is defined directly in line, giving it a name and a value. Once this is done, an entity of this name can be used as-is and treated as any other entity. For example:

<!DOCTYPE example [
    <!ENTITY windscreen "Windscreen Wiper">
]>
<part number="1976">
    <name>&windscreen;</name>
</part>

Here, we’ve defined a custom entity named “windscreen” with a replacement value of “Windscreen Wiper”. We use this with “&windscreen;”. Our XML processor will replace this with the “Windscreen Wiper” value.
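We can confirm the replacement by parsing the document. This sketch uses Python's standard `xml.etree.ElementTree`, which expands internal entities declared in the document's own DTD:

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE example [
    <!ENTITY windscreen "Windscreen Wiper">
]>
<part number="1976">
    <name>&windscreen;</name>
</part>"""

part = ET.fromstring(DOC)

# The parser has already substituted the entity's replacement value.
name = part.find("name").text

print(name)
```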

5.2. External Entities

External entities work the same, but instead of providing the value directly in the DTD, we provide the location to find it. For example:

<!DOCTYPE example [
    <!ENTITY windscreen SYSTEM "http://example.com/parts/windscreen.txt">
]>
<part number="1976">
    <name>&windscreen;</name>
</part>

Here, we have defined a custom entity with the name “windscreen” and the replacement value of whatever is found at the URL “http://example.com/parts/windscreen.txt”. We can use this exactly as before, and an XML processor that is configured to resolve external entities will automatically fetch this external resource when needed.
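Whether the resource is actually fetched depends on the processor and its configuration. As one data point, Python's standard `xml.etree.ElementTree` does not resolve external entities and, as far as we know, rejects a document that references one. The helper name `try_parse` is ours.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE example [
    <!ENTITY windscreen SYSTEM "http://example.com/parts/windscreen.txt">
]>
<part number="1976">
    <name>&windscreen;</name>
</part>"""


def try_parse(xml_text):
    """Return the parsed root element, or None if the parser rejects the document."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        print("rejected:", err)
        return None


# No network request is made; the reference is simply refused.
result = try_parse(DOC)
```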

5.3. Potential Security Risks

Using custom entities can be powerful but can also open us to some potential security risks. In particular, if we’re processing XML documents that are provided by untrusted sources, then we need to be especially careful.

The most obvious attack here is XML External Entity (XXE) injection. This is where someone can craft an XML document that will maliciously load a resource the attacker shouldn’t have access to. For example:

<!DOCTYPE example [
    <!ENTITY windscreen SYSTEM "file:///etc/passwd">
]>
<part number="1976">
    <name>&windscreen;</name>
</part>

This XML document declares a custom entity whose replacement value is the contents of the system password file. Obviously, this isn’t something that should be possible, but if we’re not careful, then an attacker could do exactly this.
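One way to guard against this is to tell the parser explicitly not to load external entities. The sketch below does so with Python's standard `xml.sax` module and its `feature_external_ges` setting. To keep it self-contained, the "secret" is a temporary file we create ourselves rather than a real system file; the class and function names are ours.

```python
import io
import pathlib
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class TextCollector(ContentHandler):
    """Collects all character data seen in the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def text_without_external_entities(xml_bytes):
    """Parse the document while refusing to load external general entities."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a sensitive file such as /etc/passwd.
    secret_file = pathlib.Path(tmp) / "secret.txt"
    secret_file.write_text("TOP-SECRET", encoding="utf-8")

    attack = (
        "<!DOCTYPE example [\n"
        f'    <!ENTITY windscreen SYSTEM "{secret_file.as_uri()}">\n'
        "]>\n"
        '<part number="1976"><name>&windscreen;</name></part>'
    ).encode("utf-8")

    # The entity reference is skipped, so the file's contents never appear.
    leaked = text_without_external_entities(attack)

print(repr(leaked))
```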

Another potential attack is sometimes known as an XML Bomb. This is a DoS attack that uses the repetitive expansion of XML entities:

<!DOCTYPE test [
    <!ENTITY a0 "someLargeString">
    <!ENTITY a1 "&a0;&a0;&a0;&a0;&a0;&a0;&a0;&a0;&a0;&a0;">
    <!ENTITY a2 "&a1;&a1;&a1;&a1;&a1;&a1;&a1;&a1;&a1;&a1;">
    <!ENTITY a3 "&a2;&a2;&a2;&a2;&a2;&a2;&a2;&a2;&a2;&a2;">
    <!ENTITY a4 "&a3;&a3;&a3;&a3;&a3;&a3;&a3;&a3;&a3;&a3;">
]>
<document>&a4;</document>

Here, we have our “&a4;” entity. This expands to 10 instances of “&a3;”, each of which expands to 10 instances of “&a2;”, and so on. This results in our document including 10,000 instances of “someLargeString”, or 150,000 characters. If our attacker went even further, we could get significantly more: going 10 levels deep would give us 10,000,000,000 instances, which at 15 bytes each would be 150 GB (about 140 GiB) in size.
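The growth is easy to verify with a little arithmetic. The sketch below computes the expanded size for a given depth and, for the harmless four-level case only, parses a generated document to confirm the figure. The function names are ours, and we assume one byte per character.

```python
import xml.etree.ElementTree as ET

PAYLOAD = "someLargeString"  # 15 characters
FAN_OUT = 10                 # references per entity level


def expanded_size(levels):
    """Characters produced by the top-level entity, computed without expanding it."""
    return len(PAYLOAD) * FAN_OUT ** levels


def build_bomb(levels):
    """Generate a document shaped like the example, with the given nesting depth."""
    decls = [f'<!ENTITY a0 "{PAYLOAD}">']
    for i in range(1, levels + 1):
        refs = f"&a{i - 1};" * FAN_OUT
        decls.append(f'<!ENTITY a{i} "{refs}">')
    dtd = "\n".join(decls)
    return f"<!DOCTYPE test [\n{dtd}\n]>\n<document>&a{levels};</document>"


# Four levels is small enough to expand safely.
small = ET.fromstring(build_bomb(4))
small_length = len(small.text)

# Ten levels is only computed, never parsed.
ten_level_bytes = expanded_size(10)
ten_level_gib = ten_level_bytes / 2 ** 30

print(small_length, ten_level_bytes, round(ten_level_gib, 1))
```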

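In general, the most reliable way to avoid these risks is to disable DTD processing, and with it custom entities, in the XML processor. Many processors also offer finer-grained settings, such as refusing only external entities or capping how far entities may expand. However, disabling custom entities removes the benefits that are gained from them as well. If we’re processing XML documents from untrusted sources, then this risk is likely not worth the benefit, but for internal documents, custom entities might be beneficial.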
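How we switch this off depends on the processor. As one hedged sketch, in Python we can run a first pass with the standard `xml.parsers.expat` module that aborts as soon as a DOCTYPE declaration appears, and only then build the tree. The names `DoctypeForbidden` and `parse_untrusted` are ours, and the document is read twice, which may not suit very large inputs.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document contains a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Called by expat when a DOCTYPE starts, before any entity is expanded.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")


def parse_untrusted(xml_text):
    """Parse XML only if it declares no DTD, and therefore no custom entities."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    # Raises DoctypeForbidden on a DOCTYPE, or expat.ExpatError on malformed input.
    checker.Parse(xml_text, True)
    return ET.fromstring(xml_text)


safe = parse_untrusted('<part number="1976"><name>Windscreen Wiper</name></part>')
print(safe.find("name").text)
```

Documents that rely on the five standard entities or on character references are unaffected, since those need no DTD.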

6. Conclusion

In this article, we’ve seen how we can use XML entities in our XML documents to represent special characters. We’ve also learned how we can define our own entities if necessary, and what security risks that brings.


Author: Dean Jakubowski Ret
