URL encoding in Java – Explained

Recently I introduced a solution for URL encoding in Java.

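import java.net.IDN;
import java.net.URI;
import java.net.URL;

public static String encode(String url) {
    try {
      // Split the URL into its structural parts.
      URL u = new URL(url);
      // The multi-argument URI constructor quotes illegal characters per part;
      // IDN.toASCII converts the host name to Punycode.
      URI uri = new URI(u.getProtocol(), 
                        u.getUserInfo(), 
                        IDN.toASCII(u.getHost()), 
                        u.getPort(), 
                        u.getPath(),
                        u.getQuery(), 
                        u.getRef());
      // toASCIIString() percent-encodes the remaining non-ASCII characters.
      return uri.toASCIIString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
}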

 

Now I would like to introduce a set of URLs to test the code. Good test sets are provided by the ‘Web Platform Tests’ (wpt) repository. An accessible collection of information about the URL standard can be found at whatwg.org.

Based on the ‘Web Platform Tests’ I created a file that holds test URLs together with the expected outcome. The test set is provided in the following form:

{ 
  "in" : "http://你好你好.urltest.lookout.net/",
  "out" : "http://xn--6qqa088eba.urltest.lookout.net/" 
}, 
{ 
  "in" : "http://urltest.lookout.net/?q=\"asdf\"", 
  "out" : "http://urltest.lookout.net/?q=%22asdf%22" 
}

To test my URL encoding implementation I use the following code:

// Load the test set from the classpath and compare each encoded URL
// with its expected outcome.
try (InputStream in = Thread.currentThread().getContextClassLoader()
        .getResourceAsStream("url-succeding-tests.json")) {
  ObjectMapper mapper = new ObjectMapper();
  JsonNode testdata = mapper.readValue(in, JsonNode.class).at("/tests");
  for (JsonNode test : testdata) {
    String url = test.at("/in").asText();
    String expected = test.at("/out").asText();
    String encodedUrl = URLUtil.encode(url);
    // assertEquals reports both values when a test fails.
    org.junit.Assert.assertEquals(expected, encodedUrl);
  }
} catch (Exception e) {
  throw new RuntimeException(e);
}

During my tests I also found some URLs that were not encoded correctly. I collected them in another JSON file.

Here are some failing examples:

{
  "in" : "http://www.example.com/##asdf",
  "out" : "http://www.example.com/##asdf"
},{
  "in" : "http://www.example.com/#a\nb\rc\td",
  "out" : "http://www.example.com/#abcd"
}, {
  "in" : "file:c:\\\\foo\\\\bar.html",
  "out" : "file:///C:/foo/bar.html"
}, {
  "in" : "  File:c|////foo\\\\bar.html",
  "out" : "file:///C:////foo/bar.html"
},{
  "in" : "http://look󠀠out.net/",
  "out" : "http://look%F3%A0%80%A0out.net/"
}, {
  "in" : "http://look־out.net/",
  "out" : "http://look%D6%BEout.net/"
},

Here is how my encoding routine fails:

In:	http://www.example.com/##asdf
Expect:	http://www.example.com/##asdf
Actual:	http://www.example.com/#%23asdf

In:	http://www.example.com/#a
b
c	d
Expect:	http://www.example.com/#abcd
Actual:	http://www.example.com/#a%0Ab%0Dc%09d

In:	file:c:\\foo\\bar.html
Expect:	file:///C:/foo/bar.html
Actual:	ERROR

In:	  File:c|////foo\\bar.html
Expect:	file:///C:////foo/bar.html
Actual:	ERROR

java.net.URISyntaxException: Relative path in absolute URI: file://c:%5C%5Cfoo%5C%5Cbar.html
java.net.URISyntaxException: Relative path in absolute URI: file://c%7C////foo%5C%5Cbar.html
java.lang.IllegalArgumentException: java.text.ParseException: A prohibited code point was found in the inputlook󠀠out
In:	http://look󠀠out.net/
Expect:	http://look%F3%A0%80%A0out.net/
Actual:	ERROR

java.lang.IllegalArgumentException: java.text.ParseException: The input does not conform to the rules for BiDi code points.look־out
In:	http://look־out.net/
Expect:	http://look%D6%BEout.net/
Actual:	ERROR
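A report in this format could be produced along the following lines. This is only a minimal sketch: the class name EncodingReport and the helper report are hypothetical and not part of my repository, and the two inputs are taken from the failing examples above.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

public class EncodingReport {

    // Same approach as the encode routine from this post, but exceptions
    // are passed on so that the report can mark the case as ERROR.
    static String encode(String url) throws Exception {
        URL u = new URL(url);
        URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                          u.getPort(), u.getPath(), u.getQuery(), u.getRef());
        return uri.toASCIIString();
    }

    // Hypothetical helper: formats one test case as In / Expect / Actual.
    static String report(String in, String expected) {
        String actual;
        try {
            actual = encode(in);
        } catch (Exception e) {
            actual = "ERROR";
        }
        return "In:\t" + in + "\nExpect:\t" + expected + "\nActual:\t" + actual + "\n";
    }

    public static void main(String[] args) {
        System.out.println(report("http://www.example.com/##asdf",
                                  "http://www.example.com/##asdf"));
        System.out.println(report("file:c:\\\\foo\\\\bar.html",
                                  "file:///C:/foo/bar.html"));
    }
}
```

Catching the exception per test case keeps one broken URL from aborting the whole run.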

Conclusion

My URL encoding routine still needs some refinement. In particular, double encoding and the handling of URL fragments need further improvement. However, I’m already very happy with this standard Java solution. A more sophisticated approach can be found here: https://github.com/smola/galimatias and will also be the subject of future tests.
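To illustrate what double encoding means here, the following sketch runs the same routine on an unencoded URL and on one that is already percent-encoded. The class name DoubleEncodingDemo and the example URLs are made up for illustration. The multi-argument URI constructors always quote the ‘%’ character, so an existing escape such as %20 is escaped a second time.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

public class DoubleEncodingDemo {

    // The encode routine from this post, repeated so the example is self-contained.
    static String encode(String url) {
        try {
            URL u = new URL(url);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), IDN.toASCII(u.getHost()),
                              u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The space is encoded once, as intended.
        System.out.println(encode("http://www.example.com/first book.pdf"));
        // The '%' of the existing escape is quoted again, giving %2520.
        System.out.println(encode("http://www.example.com/first%20book.pdf"));
    }
}
```

The routine is therefore only safe for input that is known not to be encoded yet.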

Since this research is based on one of my Stack Overflow answers, you can find the relevant code in my overflow repository.

Software Development in the Year 2018

Interested in the current state of the art (in the real world)? Read this:

https://news.ycombinator.com/item?id=18442637

Favorite comment so far:

“We have absolutely no idea how to write code. I always wonder if it’s like this for other branches of engineering too? I wonder if engineers who designed my elevator or airplane had “ok it’s very surprising that it’s working, let’s not touch this” moments. Or chemical engineers synthesize medicines in way nobody but a rockstar guru understands but everyone changes all the time. I wonder if my cellphone is made by machines designed in early 1990s because nobody was able to figure out what that one cog is doing.

Software is a mess. I’ve seen some freakishly smart people capable of solving very hard problems writing code that literally changes the world at this very moment. But the code itself is, well, a castle of shit. Why? Is it because our tools (programming languages, compilers etc) are still stone age technology? Is it because software is inherently a harder problem than say machines or chemical processes for the human brain? Is it because software engineers are less educated than other engineers? ”

 

 

What to do with Apache logs?

Two simple things are easily achievable.

  1. Load the logs into a log file analyser (https://matomo.org/)
  2. Depersonalize the logs

An example setup on Ubuntu is shown below.

Configure two cron jobs that run daily:

0 1 * * * /scripts/import-logfiles.sh
0 2 * * * /scripts/depersonalize-apache-logs.sh

Use import-logfiles.sh to load all server requests into the Matomo database. Use depersonalize-apache-logs.sh to anonymize all logs older than seven days. Depersonalization is achieved by replacing the last two bytes of each IP address with 0.

Both scripts work on a default Ubuntu setup of apache2. The Apache log files are compressed and end with ‘gz’. They are placed in ‘/var/log/apache2’ and start with the prefix ‘localhost-access.’
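The scripts themselves are not shown here. The replacement step could be sketched as follows, written in Java for illustration only. The class name IpDepersonalizer and the sample log line are assumptions, and the sketch covers IPv4 addresses only.

```java
import java.util.regex.Pattern;

public class IpDepersonalizer {

    // Matches a dotted IPv4 address and captures its first two octets.
    private static final Pattern IPV4 =
        Pattern.compile("\\b(\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\.\\d{1,3}\\b");

    // Replaces the last two octets of every IPv4 address in the line with 0.
    static String depersonalize(String logLine) {
        return IPV4.matcher(logLine).replaceAll("$1.0.0");
    }

    public static void main(String[] args) {
        // Made-up log line in the common Apache access log layout.
        String line = "203.0.113.42 - - [13/Nov/2018:10:15:32 +0100] "
                    + "\"GET /index.html HTTP/1.1\" 200 1234";
        System.out.println(depersonalize(line));
    }
}
```

Note that the pattern replaces every IPv4-like token in a line, not just the client address at the start.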

Kölner Grünsystem (Cologne's green system) – Public reception at City Hall

This coming Tuesday, 13 November, a reception on the topic of the “Kölner Grünsystem” will take place at City Hall from 16:30, on the occasion of the European Year of Cultural Heritage 2018. To draw attention to the current threat to the Gleueler Wiese, the citizens' initiative (BI) “Grüngürtel für alle” will hold a silent vigil on site from 15:30.

Joint statement by NABU / BUND on participating in “StadtGrün naturnah”

Stop listening

Google Home devices and Apple HomePod both have voice commands to mute the microphone from across the room — “OK Google, mute the microphone” and “Hey Siri, stop listening” — but not Amazon Echo devices.

On Amazon Echo devices you have to push a mute button. However, this does not help with Amazon Echo Remote devices: the remote microphone (which is push-for-use and not always on) remains available even if the main unit has its microphone array disabled.

https://www.androidcentral.com/how-disable-microphone-amazon-echo

https://www.howtogeek.com/237397/how-to-stop-your-amazon-echo-from-listening-in/

When copiers don't copy…

…software is probably involved.

A very watchable and entertaining talk from 2014, for anyone who has not seen it yet…

“Copiers that spontaneously change numbers in a document: in August 2013 it came to light that nearly all Xerox scanner-copiers simply replace numbers and letters with other ones when scanning. Because users can hardly notice such errors, the bug is extremely dangerous and went undetected for a long time: it existed in the wild for more than eight years.”

https://www.youtube.com/watch?v=7FeqF1-Z1g0

URL encoding in Java

Here is how I encode URLs in Java.

  1. Split URL into structural parts. Use java.net.URL for it.
  2. Encode each part properly
  3. Use IDN.toASCII(putDomainNameHere) to Punycode encode the host name!
  4. Use java.net.URI.toASCIIString() to percent-encode the result. It normalizes Unicode to NFC first (NFKC would be better!). For more info see: How to encode properly this URL
    URL url = new URL("http://search.barnesandnoble.com/booksearch/first book.pdf");
    URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
    String correctEncodedURL = uri.toASCIIString();
    System.out.println(correctEncodedURL);

    Prints

    http://search.barnesandnoble.com/booksearch/first%20book.pdf
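The remark about NFC and NFKC in step 4 can be illustrated with a small sketch. The class name NormalizationDemo, both helper methods, and the example path are hypothetical. The path contains the ligature ‘ﬁ’ (U+FB01), which NFC leaves untouched while NFKC folds it into the two letters ‘fi’.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.text.Normalizer;

public class NormalizationDemo {

    // Default behaviour: toASCIIString() normalizes to NFC, then percent-encodes.
    static String encodeDefault(String host, String path) throws URISyntaxException {
        return new URI("http", host, path, null).toASCIIString();
    }

    // Hypothetical variant: apply NFKC to the path before handing it to URI.
    static String encodeNfkc(String host, String path) throws URISyntaxException {
        String normalized = Normalizer.normalize(path, Normalizer.Form.NFKC);
        return new URI("http", host, normalized, null).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        String path = "/\uFB01le.pdf"; // starts with the ligature U+FB01
        // The ligature survives NFC and is percent-encoded as UTF-8 bytes.
        System.out.println(encodeDefault("example.com", path));
        // NFKC replaces the ligature with plain "fi" before encoding.
        System.out.println(encodeNfkc("example.com", path));
    }
}
```

Whether NFKC is appropriate depends on the use case, since it changes the characters of the path rather than just their encoding.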