Recently I introduced a solution for URL encoding in Java.
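import java.net.IDN;
import java.net.URI;
import java.net.URL;

public static String encode(String url) {
    try {
        // java.net.URL parses leniently and splits the input into its components.
        URL u = new URL(url);
        // The multi-argument URI constructor percent-encodes illegal characters
        // in each component; IDN.toASCII converts the host name to Punycode.
        URI uri = new URI(u.getProtocol(),
                u.getUserInfo(),
                IDN.toASCII(u.getHost()),
                u.getPort(),
                u.getPath(),
                u.getQuery(),
                u.getRef());
        return uri.toASCIIString();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}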
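As a quick illustration of how the routine is called, here is a minimal, self-contained sketch. The class name EncodeUsage is made up for this example, and encode is repeated only so that the snippet compiles on its own; the two inputs are taken from the test set shown further down.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

// Hypothetical wrapper class, used only for this illustration.
class EncodeUsage {

    // Same routine as above, repeated so the example is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(),
                    IDN.toASCII(u.getHost()), u.getPort(),
                    u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII host name, written with Unicode escapes for the four CJK characters.
        System.out.println(encode("http://\u4F60\u597D\u4F60\u597D.urltest.lookout.net/"));
        // Query string containing double quotes, which are not legal in a URI.
        System.out.println(encode("http://urltest.lookout.net/?q=\"asdf\""));
    }
}
```

The host name goes through IDN.toASCII, while path, query, and fragment are quoted by the URI constructor, so each part of the URL is handled by the mechanism that fits it.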
Now I would like to introduce a set of URLs to test the code. Good test sets are provided in the ‘Web Platform Tests’ (wpt) repository. A comprehensive collection of information about the URL standard can be found at whatwg.org.
Based on the ‘Web Platform Tests’, I created a file that holds test URLs together with their expected outcome. The test set has the following form:
{
"in" : "http://你好你好.urltest.lookout.net/",
"out" : "http://xn--6qqa088eba.urltest.lookout.net/"
},
{
"in" : "http://urltest.lookout.net/?q=\"asdf\"",
"out" : "http://urltest.lookout.net/?q=%22asdf%22"
}
To test my URL encoding implementation, I use the following code:
import java.io.InputStream;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Load the test set from the classpath and check every in/out pair.
try (InputStream in = Thread.currentThread().getContextClassLoader()
        .getResourceAsStream("url-succeding-tests.json")) {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode testdata = mapper.readValue(in, JsonNode.class).at("/tests");
    for (JsonNode test : testdata) {
        String url = test.at("/in").asText();
        String expected = test.at("/out").asText();
        String encodedUrl = URLUtil.encode(url);
        // assertEquals reports both values when a test case fails.
        org.junit.Assert.assertEquals(expected, encodedUrl);
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}
During my tests I also found some URLs that were not encoded correctly. I collected them in another JSON file.
Here are some failing examples:
{
"in" : "http://www.example.com/##asdf",
"out" : "http://www.example.com/##asdf"
},{
"in" : "http://www.example.com/#a\nb\rc\td",
"out" : "http://www.example.com/#abcd"
}, {
"in" : "file:c:\\\\foo\\\\bar.html",
"out" : "file:///C:/foo/bar.html"
}, {
"in" : " File:c|////foo\\\\bar.html",
"out" : "file:///C:////foo/bar.html"
},{
"in" : "http://lookout.net/",
"out" : "http://look%F3%A0%80%A0out.net/"
}, {
"in" : "http://look־out.net/",
"out" : "http://look%D6%BEout.net/"
},
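To see how each of these inputs fails, a small helper can print the input, the expected value, and the actual result. The following is only a sketch of such a helper, not necessarily the code used for this post: the names FailureReport and report are made up for illustration, the test cases are hard-coded instead of being read from the JSON file, and encode is repeated so that the snippet compiles on its own.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

// Hypothetical reporting helper, used only for this illustration.
class FailureReport {

    // Same routine as above, repeated so the example is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(),
                    IDN.toASCII(u.getHost()), u.getPort(),
                    u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Builds a three-line report for one test case.
    // Any exception thrown by encode is reported as "ERROR".
    static String report(String in, String expected) {
        String actual;
        try {
            actual = encode(in);
        } catch (RuntimeException e) {
            actual = "ERROR";
        }
        return "In: " + in + "\nExpect: " + expected + "\nActual: " + actual;
    }

    public static void main(String[] args) {
        // Two of the failing cases from the list above.
        String[][] cases = {
            {"http://www.example.com/##asdf", "http://www.example.com/##asdf"},
            {"http://look\u05BEout.net/", "http://look%D6%BEout.net/"}
        };
        for (String[] c : cases) {
            System.out.println(report(c[0], c[1]));
        }
    }
}
```

Catching the exception inside report keeps one broken input from aborting the whole run, so every case in the set still gets its own entry.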
Here is how my encoding routine fails:
In: http://www.example.com/##asdf
Expect: http://www.example.com/##asdf
Actual: http://www.example.com/#%23asdf
In: http://www.example.com/#a
b
c d
Expect: http://www.example.com/#abcd
Actual: http://www.example.com/#a%0Ab%0Dc%09d
java.net.URISyntaxException: Relative path in absolute URI: file://c:%5C%5Cfoo%5C%5Cbar.html
In: file:c:\\foo\\bar.html
Expect: file:///C:/foo/bar.html
Actual: ERROR
java.net.URISyntaxException: Relative path in absolute URI: file://c%7C////foo%5C%5Cbar.html
In: File:c|////foo\\bar.html
Expect: file:///C:////foo/bar.html
Actual: ERROR
java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code point was found in the inputlookout
In: http://lookout.net/
Expect: http://look%F3%A0%80%A0out.net/
Actual: ERROR
java.lang.IllegalArgumentException: java.text.ParseException: The input does not conform to the rules for BiDi code points.look־out
In: http://look־out.net/
Expect: http://look%D6%BEout.net/
Actual: ERROR
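One possible refinement for the tab and newline case: the URL standard removes ASCII tabs and newlines from the input and strips leading and trailing C0 control characters and spaces before parsing, whereas the routine above percent-encodes them. A minimal sketch of such a pre-processing step follows, assuming it would run before encode. The names InputCleanup and stripTabsAndNewlines are hypothetical and not part of the code discussed in this post.

```java
// Hypothetical pre-processing helper, used only for this illustration.
class InputCleanup {

    // Removes every tab, line feed, and carriage return, then strips
    // leading and trailing characters in the range U+0000 to U+0020.
    static String stripTabsAndNewlines(String input) {
        return input
                .replaceAll("[\\t\\n\\r]", "")
                .replaceAll("^[\\x00-\\x20]+|[\\x00-\\x20]+$", "");
    }

    public static void main(String[] args) {
        // The failing fragment case from the report above.
        System.out.println(stripTabsAndNewlines("http://www.example.com/#a\nb\rc\td"));
    }
}
```

This only addresses whitespace and control characters. The doubled '#' in the fragment, the file URLs, and the host names rejected by IDN.toASCII would each need separate handling.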
Conclusion
My URL encoding routine still needs some refinement. In particular, cases of double encoding and the handling of URL fragments need further improvement. However, I’m already very happy with this standard Java solution. A more sophisticated approach can be found here: https://github.com/smola/galimatias and will also be the subject of future tests.
Since this research is based on one of my Stack Overflow answers, you can find the relevant code in my overflow repository.