Chase Mao's blog

Best Practice of Encoding URL from User Input

2022-05-06

URL Encoding Introduction

There are three kinds of characters in URIs defined in RFC 3986:

  1. Reserved characters are used to delimit components and subcomponents in URIs.

reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

If a reserved character has a special meaning in a given context but must appear in the URI as plain data, it must be percent-encoded. For example, searching for / on Google produces the URL https://www.google.com/search?q=%2F, where %2F is the percent-encoding of /.

  2. Unreserved characters are characters that are allowed in a URI but do not have a reserved purpose.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

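Unreserved characters should not be percent-encoded. RFC 3986 treats the encoded and unencoded forms of an unreserved character as equivalent and recommends that URI producers leave these characters as they are.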

  3. Other characters. These should first be converted to a UTF-8 byte sequence, and then each byte should be percent-encoded. Each byte is written as two hexadecimal digits prefixed with %.
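The three categories above can be checked with a few lines of JavaScript. This is only a sketch: encodeURIComponent() is used as a convenient way to see the percent-encoded form, and the sample strings are assumptions chosen for illustration.

```javascript
// 1. A reserved character used as plain data gets percent-encoded.
const reservedAsData = encodeURIComponent("/"); // "%2F"

// 2. Unreserved characters are left exactly as they are.
const unreserved = encodeURIComponent("Az09-._~"); // "Az09-._~"

// 3. Other characters: convert to UTF-8 bytes, then write each byte as %XX.
const sample = "\u00e9"; // the letter e with an acute accent
const utf8Bytes = Array.from(new TextEncoder().encode(sample)); // [0xC3, 0xA9]
const manuallyEncoded = utf8Bytes
  .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
  .join(""); // "%C3%A9"

// The built-in function produces the same result as the manual steps.
const builtIn = encodeURIComponent(sample); // "%C3%A9"

console.log(reservedAsData, unreserved, manuallyEncoded, builtIn);
```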

Best practice

There are two situations in which we need to handle the encoding of user input.

When user input is part of the URL

For example, when using Google, the search keyword appears in the URL of the results page, as in the https://www.google.com/search?q=%2F example above.

In this situation, the problem is that the encoding format is not explicitly specified in the standard, so different browsers may use different encoding formats, which can make a website incompatible with some browsers.

The best practice here is to use JavaScript to encode the URL first, and then submit it to the server without allowing the browser to intervene. Since JavaScript’s output is always consistent, it ensures that the server receives data in a unified format.

JavaScript provides two encoding functions.

  1. encodeURI() is meant for an entire URL. Besides letters, digits, and common symbols, it therefore also leaves unencoded the characters that have a special meaning in URLs, such as "; / ? : @ & = + $ , #". Every other character is output as its UTF-8 bytes, each written as two hexadecimal digits with a % prefix.

  2. encodeURIComponent() differs from encodeURI() in that it is meant for an individual component of a URL, not the entire URL. Therefore "; / ? : @ & = + $ , #", which encodeURI() leaves alone, are all encoded by encodeURIComponent().
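The difference between the two functions is easy to see by running both on the same input. A minimal sketch, with the input strings chosen here only for illustration:

```javascript
// The characters listed above, written without the separating spaces.
const specials = ";/?:@&=+$,#";

// encodeURI() treats them as URL syntax and leaves them untouched.
const withEncodeURI = encodeURI(specials);

// encodeURIComponent() treats them as data and encodes every one of them.
const withEncodeURIComponent = encodeURIComponent(specials);

// For characters outside ASCII, both functions emit percent-encoded UTF-8 bytes.
const nonAscii = "\u4f60\u597d"; // the two Chinese characters for "hello"
const nonAsciiViaURI = encodeURI(nonAscii);
const nonAsciiViaComponent = encodeURIComponent(nonAscii);

console.log(withEncodeURI);          // ;/?:@&=+$,#
console.log(withEncodeURIComponent); // %3B%2F%3F%3A%40%26%3D%2B%24%2C%23
console.log(nonAsciiViaURI);         // %E4%BD%A0%E5%A5%BD
```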

When user input is part of the URL, we can use encodeURIComponent() to encode that part and then combine it with the other parts to form the whole URL.
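As a sketch of this approach, the helper below encodes only the user-supplied part and then concatenates it with the fixed part of the URL. The function name buildSearchUrl is hypothetical, and the Google search URL is reused from the example above.

```javascript
// Hypothetical helper: builds a search URL from raw, unencoded user input.
function buildSearchUrl(userInput) {
  const base = "https://www.google.com/search"; // fixed part, written by us
  // Only the user-supplied component is encoded, never the whole URL.
  return base + "?q=" + encodeURIComponent(userInput);
}

const slashSearch = buildSearchUrl("/");
const trickySearch = buildSearchUrl("a&b=c");

console.log(slashSearch);  // https://www.google.com/search?q=%2F
console.log(trickySearch); // https://www.google.com/search?q=a%26b%3Dc
```

Because & and = inside the input are encoded, they cannot be mistaken for query delimiters when the URL is parsed later.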

When user input is a whole URL

For example, a user submits a form in which one of the fields is a website URL.

The question here is whether we need to encode the URL or can use it exactly as the user entered it.

To answer this question, we need to know more about URL encoding and decoding, in particular what happens when a URL is encoded or decoded more than once.

Section 2.4 of RFC 3986, When to Encode or Decode, explains that because % itself must be encoded as %25 when used as data, repeated encoding leads to unexpected results: a % encoded twice becomes %2525. Repeated decoding is unsafe for the same reason. After the first decoding, a literal % in the data would be treated by a second decoding as the start of a percent-encoded byte, which either corrupts the data or causes an error.
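Both failure modes can be reproduced in JavaScript. The sketch below uses encodeURIComponent() and decodeURIComponent() on sample strings chosen here for illustration.

```javascript
// Repeated encoding: the % produced by the first pass is encoded again.
const raw = "100%";
const encodedOnce = encodeURIComponent(raw);          // "100%25"
const encodedTwice = encodeURIComponent(encodedOnce); // "100%2525"

// Repeated decoding, case 1: the second pass fails on the bare %.
const decodedOnce = decodeURIComponent(encodedOnce);  // "100%"
let secondDecodeFailed = false;
try {
  decodeURIComponent(decodedOnce);
} catch (err) {
  secondDecodeFailed = err instanceof URIError;
}

// Repeated decoding, case 2: no error, but the data silently changes.
const literalText = "%2F";                       // the user typed these three characters
const stored = encodeURIComponent(literalText);  // "%252F"
const decodedCorrectly = decodeURIComponent(stored);        // "%2F"
const overDecoded = decodeURIComponent(decodedCorrectly);   // "/"

console.log(encodedTwice, secondDecodeFailed, decodedCorrectly, overDecoded);
```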

So if the whole URL from user input is already encoded, we must not encode it again. Alternatively, we can require the user to enter a raw (unencoded) URL and encode it ourselves with encodeURI().

The best practice is to require user input to be an already encoded URL, because a raw URL cannot always be encoded correctly when it uses reserved characters as plain text. Take the same example: https://www.google.com/search?q=%2F is the encoded form of https://www.google.com/search?q=/. Note that the last / here is part of the query, not a URL delimiter. Calling encodeURI("https://www.google.com/search?q=/") returns https://www.google.com/search?q=/ rather than https://www.google.com/search?q=%2F, because encodeURI() treats / as a delimiter.
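The limitation is easy to confirm. In this short sketch the variable names are illustrative; the URLs are the ones used in the paragraph above.

```javascript
// A raw URL in which the final "/" is meant as query data, not as a delimiter.
const rawUrl = "https://www.google.com/search?q=/";

// encodeURI() cannot know that intent, so it returns the URL unchanged.
const viaEncodeURI = encodeURI(rawUrl);

// The intended result is only reachable by encoding the component on its own.
const intended =
  "https://www.google.com/search?q=" + encodeURIComponent("/");

console.log(viaEncodeURI === rawUrl); // true
console.log(intended);                // https://www.google.com/search?q=%2F
```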

Handling an encoded URL from user input in the backend

Section 2.4 of RFC 3986, When to Encode or Decode, specifies when encoding and decoding should happen.

Encoding: The only time to encode is when assembling the URI components to produce the URI. Once the URI is produced, it will always maintain its percent-encoded form.

Decoding: Decoding starts by using the delimiters to split the URI into its components. Only then is each component decoded according to the percent-encoding rules.
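The order matters: splitting must happen before decoding. The sketch below shows why, using a hypothetical parseQuery helper and an example.com URL that are assumptions for illustration. The helper is deliberately simplified and ignores fragments and other details that a real URL library handles.

```javascript
// Hypothetical encoded URL: the value of q is the literal text "a&b=c".
const encodedUrl = "https://example.com/search?q=a%26b%3Dc&lang=en";

// Simplified query parser: split on the delimiters first, decode afterwards.
function parseQuery(url) {
  const queryStart = url.indexOf("?");
  const query = queryStart === -1 ? "" : url.slice(queryStart + 1);
  const params = {};
  for (const pair of query.split("&")) {
    if (pair === "") continue;
    const eq = pair.indexOf("=");
    const rawKey = eq === -1 ? pair : pair.slice(0, eq);
    const rawValue = eq === -1 ? "" : pair.slice(eq + 1);
    // Each piece is decoded only after it has been separated.
    params[decodeURIComponent(rawKey)] = decodeURIComponent(rawValue);
  }
  return params;
}

// Correct order: the encoded & and = stay inside the value of q.
const splitThenDecode = parseQuery(encodedUrl);

// Wrong order: decoding first turns the data into extra delimiters.
const decodeThenSplit = parseQuery(decodeURIComponent(encodedUrl));

console.log(splitThenDecode); // { q: 'a&b=c', lang: 'en' }
console.log(decodeThenSplit); // { q: 'a', b: 'c', lang: 'en' }
```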

So when we handle a URL in the backend, it should already be an encoded URL according to the rules above. To use it, for example to read or change its components, we must first parse and decode it.

Luckily, most modern programming languages provide a library for parsing and decoding URLs. In Go, we can use url.Parse() to parse a URL; if the URL is invalid, it returns an error. So it is always a good idea to check whether a URL from user input is valid before saving it to the database.
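To stay consistent with the JavaScript examples earlier in this post, here is the same validate-before-saving idea sketched with the standard URL class available in browsers and Node.js, rather than with Go. The helper name isValidHttpUrl and the restriction to http and https are assumptions for illustration, not requirements from the RFC.

```javascript
// Hypothetical helper: returns true only if the input parses as an http(s) URL.
function isValidHttpUrl(input) {
  try {
    // The URL constructor throws a TypeError when the input cannot be parsed.
    const parsed = new URL(input);
    // Assumption: only web URLs are acceptable for this form field.
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch (err) {
    return false;
  }
}

const accepted = isValidHttpUrl("https://www.google.com/search?q=%2F");
const rejected = isValidHttpUrl("not a url");

// After parsing, components can be read in decoded form.
const decodedQuery = new URL("https://www.google.com/search?q=%2F")
  .searchParams.get("q");

console.log(accepted, rejected, decodedQuery); // true false /
```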