Recently I introduced a solution for URL encoding in Java.
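    import java.net.IDN;
    import java.net.URI;
    import java.net.URL;

    public static String encode(String url) {
        try {
            // Parse the raw string into its components.
            URL u = new URL(url);
            // The multi-argument URI constructor percent-encodes illegal characters
            // in each component; IDN.toASCII converts the host to its ASCII form.
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            // toASCIIString also encodes any remaining non-ASCII characters.
            String correctEncodedURL = uri.toASCIIString();
            return correctEncodedURL;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }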
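To make the behaviour concrete, here is a minimal usage sketch. The class name `EncodeUsageSketch` is only an illustration; the `encode` method is the one above, repeated so the example compiles on its own.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

class EncodeUsageSketch {

    // Same logic as the encode method above, repeated so the sketch is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Double quotes are not legal in a query, so they are percent-encoded as %22.
        System.out.println(encode("http://urltest.lookout.net/?q=\"asdf\""));
        // A Chinese host name, written with Unicode escapes, is converted to Punycode.
        System.out.println(encode("http://\u4f60\u597d\u4f60\u597d.urltest.lookout.net/"));
    }
}
```

The first call shows the percent-encoding done by the `URI` constructor, the second the host conversion done by `IDN.toASCII`.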
Now I would like to introduce a set of URLs to test the code. Good test sets are provided in the ‘Web Platform Tests’ (wpt) repository. A comprehensive collection of information about the URL standard can be found at whatwg.org.
Based on the ‘Web Platform Tests’, I created a file that holds test URLs together with their expected outcomes. The test set is provided in the following form:
    {
      "in" : "http://你好你好.urltest.lookout.net/",
      "out" : "http://xn--6qqa088eba.urltest.lookout.net/"
    },
    {
      "in" : "http://urltest.lookout.net/?q=\"asdf\"",
      "out" : "http://urltest.lookout.net/?q=%22asdf%22"
    }
To test my URL encoding implementation, I use the following code:
    try (InputStream in = Thread.currentThread().getContextClassLoader()
            .getResourceAsStream("url-succeding-tests.json")) {
        ObjectMapper mapper = new ObjectMapper();
        // The test cases live in the "tests" array of the JSON file.
        JsonNode testdata = mapper.readValue(in, JsonNode.class).at("/tests");
        for (JsonNode test : testdata) {
            String url = test.at("/in").asText();
            String expected = test.at("/out").asText();
            String encodedUrl = URLUtil.encode(url);
            // assertEquals reports both values when a case fails.
            org.junit.Assert.assertEquals(expected, encodedUrl);
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
During my tests I also found some URLs that were not encoded correctly. I collected them in another JSON file.
Here are some failing examples:
    {
      "in" : "http://www.example.com/##asdf",
      "out" : "http://www.example.com/##asdf"
    },
    {
      "in" : "http://www.example.com/#a\nb\rc\td",
      "out" : "http://www.example.com/#abcd"
    },
    {
      "in" : "file:c:\\\\foo\\\\bar.html",
      "out" : "file:///C:/foo/bar.html"
    },
    {
      "in" : " File:c|////foo\\\\bar.html",
      "out" : "file:///C:////foo/bar.html"
    },
    {
      "in" : "http://look\uDB40\uDC20out.net/",
      "out" : "http://look%F3%A0%80%A0out.net/"
    },
    {
      "in" : "http://look־out.net/",
      "out" : "http://look%D6%BEout.net/"
    }
Here is how my encoding routine fails:
    In: http://www.example.com/##asdf
    Expect: http://www.example.com/##asdf
    Actual: http://www.example.com/#%23asdf

    In: http://www.example.com/#a b c d
    Expect: http://www.example.com/#abcd
    Actual: http://www.example.com/#a%0Ab%0Dc%09d

    In: file:c:\\foo\\bar.html
    Expect: file:///C:/foo/bar.html
    Actual: ERROR
    java.net.URISyntaxException: Relative path in absolute URI: file://c:%5C%5Cfoo%5C%5Cbar.html

    In: File:c|////foo\\bar.html
    Expect: file:///C:////foo/bar.html
    Actual: ERROR
    java.net.URISyntaxException: Relative path in absolute URI: file://c%7C////foo%5C%5Cbar.html

    In: http://lookout.net/
    Expect: http://look%F3%A0%80%A0out.net/
    Actual: ERROR
    java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code point was found in the inputlookout

    In: http://look־out.net/
    Expect: http://look%D6%BEout.net/
    Actual: ERROR
    java.lang.IllegalArgumentException: java.text.ParseException: The input does not conform to the rules for BiDi code points.look־out
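A report in this In/Expect/Actual form could be produced by a small loop like the following. This is only a sketch: the class `FailureReportSketch` and the helper `report` are hypothetical names, not the code that produced the output above, and the test cases are hard-coded instead of being read from the JSON file.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

class FailureReportSketch {

    // Same logic as the article's encode method, repeated so the sketch is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Hypothetical helper: formats one test case and reports ERROR
    // instead of letting the exception abort the run.
    static String report(String in, String expected) {
        String actual;
        try {
            actual = encode(in);
        } catch (RuntimeException e) {
            actual = "ERROR";
        }
        return "In: " + in + "\nExpect: " + expected + "\nActual: " + actual;
    }

    public static void main(String[] args) {
        // Two of the failing cases from the list above.
        System.out.println(report("http://www.example.com/##asdf",
                "http://www.example.com/##asdf"));
        System.out.println(report("http://look\u05BEout.net/",
                "http://look%D6%BEout.net/"));
    }
}
```

Catching the exception per case keeps one problematic URL from hiding the results of all the others.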
Conclusion
My URL encoding routine still needs some refinement. In particular, double encoding and the handling of URL fragments need further work. However, I’m already very happy with this standard Java solution. A more sophisticated approach can be found at https://github.com/smola/galimatias, which will also be the subject of future tests.
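As one possible refinement for the fragment case with embedded tab and newline characters, the input could be cleaned before it is parsed. The URL standard removes ASCII tab and newline characters from the input before parsing, and the sketch below mimics only that single step. The class `StripControlSketch` and the helper `stripTabAndNewline` are hypothetical names, and this is an assumption about how the routine might be extended, not a complete fix.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

class StripControlSketch {

    // Hypothetical pre-processing step: drop ASCII tab, line feed and carriage return.
    static String stripTabAndNewline(String url) {
        return url.replaceAll("[\\t\\n\\r]", "");
    }

    // Same logic as the article's encode method, repeated so the sketch is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "http://www.example.com/#a\nb\rc\td";
        // With the control characters removed, the fragment no longer needs escaping.
        System.out.println(encode(stripTabAndNewline(raw)));
    }
}
```

The other failing cases (the doubled `#`, the `file:` URLs and the host names rejected by `IDN.toASCII`) are not addressed by this step.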
Since this research is based on one of my Stack Overflow answers, you can find the relevant code in my overflow repository.