Dream job: IT

  •  Flexible work locations. Working from home is often possible without drawbacks.
  •  Diverse industries. You can gain insight into very different areas of life and business.
  •  Adequate pay. Even if you work for non-profit or public institutions, you usually get fair pay.
  •  Career changers welcome. Natural scientists in particular often enter the field from other careers. In fact, you will find all kinds of other biographies too, which can be very refreshing.
  •  Lively community. Someone is always inventing something new. The IT industry keeps following big hypes. If you don't let that stress you, you can constantly learn something new and keep gaining additional perspectives.

Unix tools introduced. Today: FHS

The Filesystem Hierarchy Standard (FHS) defines a standard layout to organize various kinds of application and OS related data in a predictable and common way [1].

Basic knowledge of the FHS will help you find application or OS related data more easily. If you are a developer, it also provides good orientation for organizing your own applications in a maintainable way, e.g. as an Ubuntu package.

/bin – essential user commands

/boot – OS boot loader

/dev – devices (“everything is a file” principle)

/etc – system configuration

/home – user data

/lib – essential shared libraries

/media – mount point for removable media

/mnt – mount point for temporarily mounted filesystems

/opt – add-on applications

/root – home of root

/run – run time variable data

/sbin – system binaries

/srv – data for services provided by the system

/tmp – temporary data

/proc – virtual filesystem for kernel and process information

/usr – secondary hierarchy

bin – Most user commands
lib – Libraries
local – Local hierarchy (empty after main installation)
sbin – Non-vital system binaries
share – Architecture-independent data

/var – variable data

cache – Application cache data
lib – Variable state information
local – Variable data for /usr/local
lock – Lock files
log – Log files and directories
opt – Variable data for /opt
run – Data relevant to running processes
spool – Application spool data
tmp – Temporary files preserved between system reboots
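
To make the "organizing your own applications" point concrete, here is a minimal sketch of how the directories listed above could map to the files of an application. The application name myapp, the class FhsLayout and the method pathsFor are made up for illustration; real packages may choose differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: where a hypothetical application could place its files under the FHS.
class FhsLayout {

    // Maps a kind of data to a directory suggested by the FHS list above.
    static Map<String, String> pathsFor(String appName) {
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("config", "/etc/" + appName);        // system configuration
        paths.put("state", "/var/lib/" + appName);     // variable state information
        paths.put("cache", "/var/cache/" + appName);   // application cache data
        paths.put("log", "/var/log/" + appName);       // log files and directories
        paths.put("install", "/opt/" + appName);       // add-on application
        return paths;
    }

    public static void main(String[] args) {
        // "myapp" is a made-up name used only for illustration.
        pathsFor("myapp").forEach((kind, path) -> System.out.println(kind + " -> " + path));
    }
}
```

Keeping configuration, state, cache and logs in separate trees means each one can be backed up, cleaned or mounted independently.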

Find more

What about – /init.d ?

What does the .d stand for in directory names?

FHS in Debian

 

A stopwatch in bash

Is it possible to implement a stopwatch in bash? Here is my attempt:

https://github.com/jschnasse/stopwatch/blob/master/stopwatch

The script uses some interesting features like:

  1. read -s -t.1 -n1 c reads exactly one character (-n1) into the variable c without echoing it (-s), waiting at most 0.1 seconds (-t.1) for user input.
  2. sleep .1 delays further processing for 0.1 seconds.
  3. secs=$(printf "%1d\n" ${input: 0 : -9}) builds a number from all but the last 9 characters of input; printf prints 0 if that string is empty. This separates the seconds from the nanoseconds.
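
The string slicing in item 3 can be illustrated outside of bash as well. The following Java sketch assumes that the input is a duration in nanoseconds written as a decimal string; the class NanoSplit and the method split are hypothetical names and not part of the stopwatch script.

```java
// Sketch: split a nanosecond value given as a string into seconds and nanoseconds,
// mirroring the "all but the last 9 characters" idea of the bash script.
class NanoSplit {

    // Returns {seconds, nanoseconds}. Inputs with fewer than 10 digits yield 0 seconds.
    static long[] split(String nanosString) {
        int cut = Math.max(0, nanosString.length() - 9);
        String secPart = nanosString.substring(0, cut);   // all but the last 9 characters
        String nanoPart = nanosString.substring(cut);     // the last (up to) 9 characters
        long secs = secPart.isEmpty() ? 0L : Long.parseLong(secPart);
        long nanos = nanoPart.isEmpty() ? 0L : Long.parseLong(nanoPart);
        return new long[] { secs, nanos };
    }

    public static void main(String[] args) {
        long[] parts = split("1500000000");
        System.out.println(parts[0] + " s, " + parts[1] + " ns"); // prints "1 s, 500000000 ns"
    }
}
```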

Direct Accessing XML with Java

Motivation

Processing of huge XML files can become cumbersome if your hardware is limited.

“Parsing a sample 20 MB XML document[1] containing Wikipedia document abstracts into a DOM tree using the Xerces library roughly consumes about 100 MB of RAM. Other document model implementations[2] such as Saxon’s TinyTree are more memory efficient; parsing the same document in Saxon consumes about 50 MB of memory. These numbers will vary with document contents, but generally the required memory scales linearly with document size, and is typically a single-digit multiple of the file size on disk.”

Probst, Martin. “Processing Arbitrarily Large XML using a Persistent DOM.” 2010. https://www.balisage.net/Proceedings/vol5/html/Probst01/BalisageVol5-Probst01.html

A good way to deal with huge files is to split them into smaller ones. But sometimes you don’t have that option.

This is where random access comes into play. While random access to binary files is well supported by standard Java tools, this is not true for higher-order text-based formats like XML.
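
For comparison, this is roughly what random access to a binary file looks like with the standard library. It is a minimal sketch; the class RandomAccessDemo and the helper readAt are illustrative names, and the temporary file only exists to make the example self-contained.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RandomAccessDemo {

    // Reads `length` bytes starting at `byteOffset` without reading the bytes before it.
    static byte[] readAt(Path file, long byteOffset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(byteOffset);
            byte[] buffer = new byte[length];
            raf.readFully(buffer);
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("random-access", ".bin");
        Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(readAt(tmp, 4, 3), StandardCharsets.US_ASCII)); // prints "456"
        Files.delete(tmp);
    }
}
```

Note that seek() works on byte positions, which is exactly why byte offsets matter for XML.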

The Plan

  1. Find proper access points by taking the XML structure into account.
  2. Translate character offsets to byte offsets (taking the encoding into account).

This sounds straightforward.

Existing Libraries

The StAX library offers streaming access to XML data without the need of loading a complete DOM model into memory. The library comes with an XMLStreamReader offering a method getLocation().getCharacterOffset() .
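
A minimal sketch of this approach, using the StAX implementation bundled with the JDK. The class StaxOffsets, the method offsetsOf and the sample XML are made up for illustration. Where exactly the reported offset points relative to the tag can depend on the StAX implementation, so the values should be checked before relying on them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxOffsets {

    // Collects the character offset reported by the parser at each matching start tag.
    static List<Integer> offsetsOf(String xml, String elementName) throws XMLStreamException {
        List<Integer> offsets = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return offsets;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<pages><page><title>A</title></page><page><title>B</title></page></pages>";
        System.out.println(offsetsOf(xml, "page")); // one offset per <page> element
    }
}
```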

Unfortunately, this only returns character offsets. To jump directly to a position in the file with standard Java I/O classes, we need byte offsets. UTF-8 encodes characters with a variable number of bytes. This means that we would have to reread the whole file from the beginning to calculate the byte offset from a character offset. This does not seem acceptable.
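
The following sketch shows why the translation is not a simple multiplication. It works on an in-memory string, counts characters as Java chars (UTF-16 code units), and uses the hypothetical names Utf8Offsets and byteOffset. For a real file, the same scan would have to run over everything before the offset.

```java
import java.nio.charset.StandardCharsets;

class Utf8Offsets {

    // Byte offset of the character at `charOffset` when `text` is encoded as UTF-8.
    // Every character before the offset has to be encoded (or at least inspected).
    static int byteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Three German umlauts (written as escapes) followed by a tag.
        String text = "\u00e4\u00f6\u00fc<page>";
        System.out.println(byteOffset(text, 3)); // prints 6: each umlaut takes two bytes
    }
}
```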

Solution

In the following, I will introduce a solution based on an XML parser generated with ANTLR4.

  1. We will use the parser to walk through the XML file. While the parser is doing its work, it will emit an offset whenever a certain criterion is fulfilled (in the example, we search for XML elements with the name ‘page’).
  2. I will use these offsets to access the XML file and to read portions of XML into a Java bean (the code below uses Jackson's XmlMapper).

The following works very well on a ~17 GB Wikipedia dump /20170501/dewiki-20170501-pages-articles-multistream.xml.bz2 . I still had to increase the heap size using -Xmx6g, but compared to a DOM approach this looks much more acceptable.

1. Get XML Grammar

cd /tmp
git clone https://github.com/antlr/grammars-v4

2. Generate Parser

cd /tmp/grammars-v4/xml/
mvn clean install

3. Copy Generated Java files to your Project

cp -r target/generated-sources/antlr4 /path/to/your/project/gen

4. Hook in with a Listener to collect character offsets

package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

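    // Records the start index of every element whose name equals searchForElement.
    public class MyXMLListener extends XMLParserBaseListener {
        @Override
        public void enterElement(XMLParser.ElementContext ctx) {
            // The first Name token of an element is its tag name.
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                // getStartIndex() is the index of the element's first character in the input stream.
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }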

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
                        "page");
        System.out.println("Offsets: " + offsets);
    }

}

5. Result

Prints:

Offsets: [2441, 10854, 30257, 51419 ….

6. Read from Offset Position

To test the code, I wrote a class that reads each Wikipedia page into a Java object

@JacksonXmlRootElement
class Page {
 public Page(){};
 public String title;
}

using basically this code

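// Requires jackson-dataformat-xml (XmlMapper, ObjectMapper, DeserializationFeature)
// plus java.io.Reader, java.nio.charset.StandardCharsets, java.nio.file.Files and java.nio.file.Paths.
private Page readPage(Integer offset, String filename) {
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        try (Reader in = Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
            // skip() counts characters, not bytes.
            in.skip(offset);
            ObjectMapper mapper = new XmlMapper();
            // Ignore all child elements of <page> that the Page bean does not declare.
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            return mapper.readValue(in, Page.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }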

Download

Find the complete example on GitHub.