Adventures in a Paperless world Part 1

4th Nov 2018 docker reverse engineering

This is a rather long story, because the process of setting this up has been interesting and has led me down many side paths. If you're just here to figure out how to set all of this up, I'm planning on putting up an abridged guide at some point in the future.

Despite how pervasive the Internet has become over the past few decades, most of the world is still obstinately relying on paper media for a surprising amount of communication. The problem with this approach is twofold: paper manufacturing is one of the most environmentally destructive industries on the planet, and I'm absolutely terrible at keeping track of where I put all those documents.

Most of my important things are collected in a white binder that is frighteningly close to bursting at the seams, and everything that I deem less useful (or I'm too lazy to take downstairs) ends up on random shelves around the house until my wife and I go on a round of mega-cleaning.

After receiving yet another power bill that I promptly abandoned on the coffee table, I wondered if perhaps there wasn't a better way to deal with this. As it turns out, there most definitely already was in the form of not one, but two very popular projects: Mayan EDMS and Paperless.

Despite having clearly different targets and depth of feature sets, they both offer a function that seemed right up my lazy alley: drop documents/pictures/PDFs in a specified folder and they will be picked up, go through an OCR step, and end up in a central searchable database. Magic! Plus I had recently set up my old desktop as a home server with Ubuntu Server and Docker, so this would have been a fantastic use case.

However, being quite an expert about myself, I knew for a fact that I would have to make this process as frictionless as possible for me to have any hope of keeping it up, especially considering that I would have to put in a significant initial time investment to digitize my existing paper archive. Turning on my computer, opening an image acquisition application, scanning the pages, clicking on a bunch of menus to export everything as PDF, then uploading it to a destination folder? Not gonna happen.

Printer automation

Paperless' readme recommends that you "set [your scanner] to 'scan to FTP' or something similar. It should be able to push scanned images to a server without you having to do anything."

Unfortunately my multifunction printer (a Brother HL-L2380DW) does not support anything of the sort. The best it offers is a host of "Scan to document/image/email" options that seem to pick up computers where the official drivers have been installed. I wondered if I could somehow plug into this functionality to automate the entire process.

I went to the Brother website to find Linux-compatible drivers for the scanner, and on the downloads page I saw an entry for a thing called "Scan-key-tool". The description read "With this tool, you can start a scan by the button on the machine." Amazing! Gotta love it when Linux gets some actual love from the manufacturer.

Dockerizing the printer tools

Not wanting to pollute my server with drivers and odd scripts, I set out to create a Docker container that would do the network scanning part and put the results in a volume shared with either a Mayan EDMS or a Paperless container. Lo and behold someone had, once again, already gone through the effort and created an image called docker-brscan4 specifically to do that. Out of curiosity, I explored the repository to see how they had set it up.

Analyzing the image

FROM ubuntu:16.04
MAINTAINER Ke Zhang <[email protected]>

RUN apt-get -y update && apt-get -y upgrade && apt-get -y clean
RUN apt-get -y install sane sane-utils ghostscript netpbm x11-common- && apt-get -y clean

Some basic package installs: sane (Scanner Access Now Easy) is a library for scanner management.

ADD drivers /opt/brother/docker_skey/drivers
RUN dpkg -i /opt/brother/docker_skey/drivers/*.deb

The drivers are actually stored in the repository itself. A bit ugly but, according to the author, this is allowed by Brother's license, and to be honest manufacturer websites are not exactly known for their reliability or automation friendliness.

Then came the interesting part.

ADD config /opt/brother/docker_skey/config
ADD scripts /opt/brother/docker_skey/scripts

RUN cfg=`ls /opt/brother/scanner/brscan-skey/brscan-skey-*.cfg`; ln -sfn /opt/brother/docker_skey/config/brscan-skey.cfg $cfg

ENV SCANNER_NAME="venus"
ENV SCANNER_MODEL="DCP-7065DN"
ENV SCANNER_IP_ADDRESS="192.168.1.16"

The config folder contains one configuration file that is linked over the installed one, and points to the scripts copied in the scripts directory.

password=
IMAGE="bash /opt/brother/docker_skey/scripts/scan2image.sh"
OCR="bash /opt/brother/docker_skey/scripts/scan2pdfc.sh"
EMAIL="bash /opt/brother/docker_skey/scripts/scan2pdfbw.sh"
FILE="bash /opt/brother/docker_skey/scripts/scan2pdf.sh"
SEMID=b
user=gaia

Apparently whoever set this up had decided to use the OCR and email options to act as black and white/color options for PDF scanning.

Finally the start.sh script is run as the container's CMD.

/usr/bin/brsaneconfig4 -a name=$SCANNER_NAME model=$SCANNER_MODEL ip=$SCANNER_IP_ADDRESS
/usr/bin/brscan-skey
while true; do
  sleep 300
done
exit 0

The brscan-skey binary clearly runs in the background as a sort of daemon, which forced the creator of the image to add an endless loop to prevent the container from exiting immediately: start.sh is executed by Docker as process number 1, which makes it the parent of the whole process tree, so the container stops as soon as that script returns.

The trouble begins

Anticipating some work to make the image a bit cleaner, I cloned the repository and created a bare-bones docker-compose.yml file. Docker-compose makes it very easy to organize and orchestrate your multi-container apps, and is also a very convenient solution for single-container services since it organizes all the parameters nicely in an easy-to-read format.

version: "3"
services:
  brscankey:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "54921:54921"
      - "54925:54925/udp"
    environment:
      - SCANNER_NAME=lscan
      - SCANNER_MODEL=HL-L2380DW
      - SCANNER_IP_ADDRESS=192.168.2.108
    entrypoint: "/opt/brother/docker_skey/scripts/start.sh"
    volumes:
      - '/tmp:/scans'

I also changed the user key from gaia to jack in the configuration file, then I created a container in the foreground with sudo docker-compose up. Somewhat expectedly, the container didn't output a single line of text: brsaneconfig4 didn't produce any output, and I didn't expect the skey daemon to really produce any output either.

Not keeping my hopes up, I went to the printer and tried the "Scan to document" function. I was greeted by a "Check connection" message, which meant that the printer didn't know/had not been told about the new supposed destination.

The pretty well-hidden FAQ page on Brother's website recommends changing the configuration to list the correct network adapter when this happens, as the skey tool defaults to eth0, but eth0 was the correct interface on the container, so something else was clearly wrong. Was the script even working at all?

Network peeking

When a network-related tool doesn't have anything useful to say, one way to figure out what's going on is to bust out good old tcpdump and take a look inside the pipes.

After running sudo tcpdump host 192.168.2.108 to monitor all traffic going to and coming from the printer, I restarted the container and immediately went "ah-ha!"

19:26:27.344208 IP 192.168.2.250.33654 > 192.168.2.108.snmp:  C="internal" SetRequest(427)  E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=IMAGE;HOST=:54925;APPNUM=1;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=OCR;HOST=:54925;APPNUM=3;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=EMAIL;HOST=:54925;APPNUM=2;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=FILE;HOST=:54925;APPNUM=5;DURATION=360;BRID=;"

The script was talking to the printer using SNMP, and the printer was responding back with an identical message (not captured in the box above). If they were definitely talking to each other, why wasn't the printer acknowledging this?

Out of ideas, I turned back to the Brother package. The tool is clearly supposed to work, so I tried to get it to be a bit more helpful about what was going on.

Tearing daemons apart

To take a look at the internals of the Brother tool, I extracted the contents of the .deb file with ar x brscan-skey-0.2.4-1.amd64.deb and then extracted the data.tar.gz file contained within.

$ tree
.
├── control.tar.gz
├── data.tar.gz
├── debian-binary
├── opt
│   └── brother
│       └── scanner
│           └── brscan-skey
│               ├── brscan-skey
│               ├── brscan-skey-0.2.4-0
│               ├── brscan-skey-0.2.4-1.sh
│               └── script
│                   └── brscan_scantoemail-0.2.4-0
└── usr
    └── share
        └── doc

8 directories, 7 files

The brscan-skey folder contained multiple executable files, so I used the fantastic file command to learn a bit more about them.

$ file *
brscan-skey:            POSIX shell script, ASCII text executable
brscan-skey-0.2.4-0:    ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.0, not stripped
brscan-skey-0.2.4-1.sh: POSIX shell script, ASCII text executable
script:                 directory

Two things immediately jumped out at me: first of all, the brscan-skey binary was actually just a shell script, likely wrapping the brscan-skey-0.2.4-0 binary. Secondly, the "real" binary was marked as "not stripped": the names of the internal symbols (variables, functions, etc.) had not been removed during the compilation process. This meant that if I had to resort to disassembling it, I would have obtained a much more readable code listing.

I checked the contents of the shell wrapper.

#! /bin/sh

if [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
  if [ "$2" = "2" ];then
    echo '   no option                :register all MFCs'
    echo '   -t (--terminate)         :terminate this tool'
    echo '   -a (--add MFC)           :register the specified MFC'
    echo '   -d (--delete) MFC        :exclude the specified MFC'
    echo '   -p (--passwd) PASSWORD   :set the password'
    echo '   -u (--username) USERNAMR :set the user name'
    echo '   -l (--list)              :list the available MFCs'
    echo '   -m (--mailto)            :mail address (scan to e-mail)'  
    echo '   --refresh                :refresh setting'
    echo '   --reset                  :reset the configuration file'
    echo '   --diagnosis              :print diagnosis data'
    echo '   -h --help                :help'
  fi
  echo '   Copyright 2007-2012 Brother Industries, Ltd'
  exit 0
fi

if [ "$1" = "-l" ] || [ "$1" = "--list" ]; then
    /opt/brother/scanner/brscan-skey/brscan-skey-0.2.4-0 $*
    exit 0
fi

if [ "$1" = "-f" ];then
    /opt/brother/scanner/brscan-skey/brscan-skey-0.2.4-0 $*
else
    /opt/brother/scanner/brscan-skey/brscan-skey-0.2.4-0 $*&
fi

Parameters! This thing had actual parameters! And beyond that, the last 5 lines told me that it wasn't, in fact, a real daemon: the script was merely starting it as a background process by appending & to the command. This meant that I could remove the while loop from start.sh and keep the skey tool as the main foreground process just by appending -f to the brscan-skey invocation. Improvements!
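As a rough sketch of what a start.sh without the sleep loop could look like (this is not the script shipped in docker-brscan4): the two commands and the -f flag come from the listings above, while the start_skey function name and the DRY_RUN switch are made up here so that the script can be tried outside the container.

```shell
#!/bin/sh
# Sketch of a start.sh without the endless sleep loop.
# start_skey and DRY_RUN are illustrative names, not part of docker-brscan4.

start_skey() {
    # Same registration call as the original start.sh.
    register="/usr/bin/brsaneconfig4 -a name=$SCANNER_NAME model=$SCANNER_MODEL ip=$SCANNER_IP_ADDRESS"
    # -f makes the wrapper skip the trailing "&", so the tool stays in the foreground.
    daemon="/usr/bin/brscan-skey -f"

    # With DRY_RUN=1 the commands are only printed, never executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$register"
        echo "exec $daemon"
        return 0
    fi

    # Deliberately unquoted so the strings split into command and arguments.
    $register
    # exec replaces this shell with the wrapper, so no keep-alive loop is needed.
    exec $daemon
}
```

Inside the container the last line of the script would simply call start_skey; running DRY_RUN=1 start_skey anywhere else prints the two commands it would have run.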

Back to the parameters, though. The --diagnosis flag seemed perfect for the purpose, but unfortunately it resulted in a useless listing of information about environment variables, loaded drivers, and the skey configuration file. The --help command wasn't very useful either: as the wrapper above shows, it only prints the option list when the second argument is 2.

$ ./brscan-skey --help
   Copyright 2007-2012 Brother Industries, Ltd

I was back at square one.

Remembering the detail about the symbols, I decided to take the last resort step and decompile the binary. I installed Snowman and fed the binary to it. Peeking around, I found the function that sent the SNMP data, and then I took a look at the main function. Bingo!

    eax9 = argv_analyse("--debug-mode", *reinterpret_cast<int32_t*>(&rdi), rsi, rcx8, 0x80, 83, 0);
    if (!(reinterpret_cast<uint1_t>(eax9 < 0) | reinterpret_cast<uint1_t>(eax9 == 0))) {
        rdx10 = reinterpret_cast<struct s0*>(reinterpret_cast<int64_t>(rbp7) + 0xfffffffffffffd50);
        set_debug_verbose_mode(*reinterpret_cast<int32_t*>(&rdi), rsi, rdx10, rcx8, 0x80, 83, *reinterpret_cast<int32_t*>(&rdi), rsi, rdx10, rcx8, 0x80, 83);
    }

Buried in a list of parameter checks, there it was! I immediately started the tool again with brscan-skey -f --debug-mode and... Nothing.

A quick second look at the code revealed that the parameter expects an actual argument, so I restarted it again with brscan-skey -f --debug-mode 1, which didn't say much, but --debug-mode 3 finally revealed something useful.

get_host_ip_address : FAIL

The daemon was unable to determine its own IP address, probably because whatever method it was trying to use was being thrown off by being inside a Docker container. Not wanting to just slap a network_mode: host on the container and call it a day, I looked through the decompiled code again to see what the IP detection function was doing.

Right at the beginning of get_host_ip_address, this line presented itself:

 get_inifile_value("ip_address", v8, rdx11, 0, r8, r9);

There it was. A configuration parameter not listed anywhere. I added it to the configuration file, and this time the SNMP message looked a bit different.
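For illustration, here is one way to add the setting from a script. It assumes the undocumented key uses the same key=value line format as the other entries in brscan-skey.cfg (such as user=gaia); set_cfg_value is a hypothetical helper, not something shipped with the Brother tools.

```shell
# set_cfg_value FILE KEY VALUE
# Replaces an existing "KEY=..." line, or appends one if the key is missing.
# Assumes VALUE contains no "|" or "&" characters (fine for an IP address).
set_cfg_value() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        # Go through a temp file and copy the content back, so that a
        # symlinked config file keeps pointing at the same target.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

With the addresses from this post, the call would be set_cfg_value brscan-skey.cfg ip_address 192.168.2.250, where the value is the address of the Docker host as seen by the printer.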

IP 192.168.2.250.34930 > 192.168.2.108.snmp:  C="internal" SetRequest(479)  E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=IMAGE;HOST=192.168.2.250:54925;APPNUM=1;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=OCR;HOST=192.168.2.250:54925;APPNUM=3;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=EMAIL;HOST=192.168.2.250:54925;APPNUM=2;DURATION=360;BRID=;" E:2435.2.3.9.2.11.1.1.0="TYPE=BR;BUTTON=SCAN;USER="jack";FUNC=FILE;HOST=192.168.2.250:54925;APPNUM=5;DURATION=360;BRID=;"

In retrospect, I should have probably taken the fact that there was no IP in the original message more seriously, but I was thrown off by the fact that the printer was seemingly acknowledging it, so I assumed that it was taking the source IP of the SNMP message as the implicit host.
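A small check like the following would have flagged the problem right away. It is only a sketch based on the two payloads captured above, and the function names are invented for this example: it splits one registration string on semicolons and looks at what comes before the port in the HOST field.

```shell
# skey_host PAYLOAD
# Prints the value of the HOST= field from one registration string.
skey_host() {
    printf '%s\n' "$1" | tr ';' '\n' | sed -n 's/^HOST=//p'
}

# host_has_ip PAYLOAD
# Succeeds only if the HOST field has something before the ":port" part.
host_has_ip() {
    case "$(skey_host "$1")" in
        :*|"") return 1 ;;
        *)     return 0 ;;
    esac
}
```

Fed the first capture, skey_host prints :54925 and host_has_ip fails; with the second one it prints 192.168.2.250:54925 and the check passes.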

I checked the printer, and this time an entry called jack appeared in the list for the "Scan to file" option. Success? Not quite yet.

The response to the UDP packets that the printer sends to trigger a scan was getting lost in the network. tcpdump revealed that the destination port was supposedly unreachable, and a bit of Googling led me to a Docker issue detailing how UDP can be problematic with network forwarding.

Not wanting to figure out even more networking issues, I finally gave up and set the container's network to host mode. In retrospect, if I had done this from the beginning I would have spared myself a lot of trouble, but I learned quite a bit along the way.

The final fix

Now everything was working perfectly, except for the fact that once the scan scripts were started by the skey tool... Nothing happened. The scanner wouldn't scan, nothing would error out, just... Nothing.

What made this even stranger was that if I ran the scan script manually right after skey started it, everything would work flawlessly, so at least I knew that I wasn't necessarily out of luck.

I initially thought it could have boiled down to a difference in environment variables between skey and the console, so I replaced the scanning script with a different script that would dump all the environment variables to a file, and compared the result with the standard environment. No difference.
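The comparison boils down to something like this sketch. The helper names and file paths are placeholders of mine; the idea is to call dump_env once from the stand-in script that skey launches and once from an interactive shell, then diff the two files.

```shell
# dump_env FILE
# Writes the current environment to FILE, sorted so that two dumps line up.
# The "_" variable is dropped because the shell changes it on every command.
dump_env() {
    env | grep -v '^_=' | sort > "$1"
}

# env_diff A B
# Prints the lines that differ between two dumps.
# Succeeds (exit status 0) when the two environments are identical.
env_diff() {
    diff "$1" "$2"
}
```

For example, dump_env /tmp/env-skey.txt inside the stand-in script, dump_env /tmp/env-console.txt from the console, and then env_diff /tmp/env-skey.txt /tmp/env-console.txt.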

Adding some extra manual logging, I noticed that the scanning tool was returning an "Invalid argument" error. The always amazing ArchWiki has a lot of information on this specific error, but nothing helped. It did, however, point me in the right direction.

I noticed that the scan scripts all had a sleep 0.01 call before launching the actual scan command. What if there actually was some sort of race condition? Perhaps the skey tool itself was trying to access the scanner right after running the script? Or maybe the scanner needed some time to enter the right mode?

Whatever the reason, I replaced the sleep value with 4 seconds, and the following scan was finally completed flawlessly, producing a PDF file in the mounted /scans folder. Multi-page mode also worked properly, with the scanner popping up a "More pages?" prompt and the computer side correctly merging all the scans together in a single document.
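A fixed sleep is what I actually used. If the cause really is a race, a slightly more defensive variant would be to wait and then retry a few times, as in the sketch below. This is an alternative technique, not what the scan scripts do, and scan_with_retry is a hypothetical helper.

```shell
# scan_with_retry ATTEMPTS DELAY COMMAND [ARGS...]
# Waits DELAY seconds, runs COMMAND, and repeats on failure,
# giving up after ATTEMPTS tries.
scan_with_retry() {
    attempts=$1 delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Give the scanner (or the skey tool) time to settle first.
        sleep "$delay"
        if "$@"; then
            return 0
        fi
        echo "attempt $n failed" >&2
        n=$((n + 1))
    done
    return 1
}
```

Inside a scan script this would wrap whatever scan command is already there, for example with 3 attempts and a 4 second delay.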

This whole process took me so long that I almost forgot the original purpose of it all, but I finally have a container that can make the scanning process for all my documents as frictionless as can be. Well, having a scanner with an Automatic Document Feeder would arguably be even better, but I think I can live with this!

For Part 2, I will actually be setting up Paperless or Mayan with Docker. I haven't quite decided which one of the two I want to use just yet, but I'm hoping that part will be significantly easier than this!