<h1 id="clitest-command-line-tester"><a href="https://petr-muller.github.io/tools/2018/05/02/clitest">clitest: command line tester</a> (2018-05-02)</h1>

<p>It’s like Python’s <a href="https://docs.python.org/3/library/doctest.html">doctest</a>
but for CLIs. Given a text file containing snippets of shell sessions (prompts
with commands and their expected outputs), clitest executes the snippets and
verifies the actual output matches the described output. It is itself written in
shell. The project lives on
<a href="https://github.com/aureliojargas/clitest">GitHub</a> and is available under MIT
license. It does not see much development anymore, but that is fairly
understandable given its simplicity. Nevertheless, the author continues to
maintain deliverables and documentation. The project has comprehensive
documentation in its
<a href="https://github.com/aureliojargas/clitest/blob/master/README.md">README</a>.</p>
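<p>For illustration, a minimal test file might look like the following sketch (the file name
is made up; the <code class="language-plaintext highlighter-rouge">$ </code> prompt is the marker clitest looks for by default):</p>

<pre><code># greet.txt: a hypothetical clitest test file
$ echo "Hello World"
Hello World
$ printf "%s\n" one two
one
two
</code></pre>

<p>Running <code class="language-plaintext highlighter-rouge">./clitest greet.txt</code> then executes each command and checks
its real output against the lines that follow it.</p>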
<h2 id="installation-and-usage">Installation and usage</h2>
<p>Installation and usage are trivial: the whole clitest is just a single shell
script with no external dependencies. Hence, it is easy to include copies of
clitest in repositories. I was curious how the little tool is implemented in
shell, and I discovered the code is nice and readable. I could not resist
trying to run <a href="https://www.shellcheck.net/">ShellCheck</a> on it, and it only spat
out a few style issues: good job!</p>
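<p>As a sketch of how lightweight the drop-in is (the raw URL is my assumption of where the
script lives in the repository), getting a copy into a project can be as simple as:</p>

<pre><code>$ curl -O https://raw.githubusercontent.com/aureliojargas/clitest/master/clitest
$ chmod +x clitest
$ ./clitest --help
</code></pre>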
<p>I tried to use clitest as a testing driver for
the <a href="https://github.com/petr-muller/pyff">pyff</a> project’s examples so that I only
need to provide the expected output for each example. It worked nicely, although the
way clitest executes the commands under test makes it a little sensitive
to <code class="language-plaintext highlighter-rouge">$CWD</code>, <code class="language-plaintext highlighter-rouge">$PATH</code> and the like. Fortunately, these are quite
straightforward to sort out. The test files for pyff are standard Python
source code files, so I needed to embed the clitest instructions into Python
comments (lines starting with the <code class="language-plaintext highlighter-rouge">#</code> character). I used the <code class="language-plaintext highlighter-rouge">--prefix</code> option
for this; it tunes how clitest searches for testable snippets in
files.</p>
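<p>Roughly, the idea looks like this; the file below is an illustrative sketch, not an actual
pyff example, and the exact <code class="language-plaintext highlighter-rouge">--prefix</code> semantics are documented in the clitest README:</p>

<pre><code># A Python example file with an embedded clitest session in its comments:
#
# $ python example.py
# 4
print(2 + 2)
</code></pre>

<p>With something along the lines of <code class="language-plaintext highlighter-rouge">./clitest --prefix '# ' examples/*.py</code>, clitest strips the
comment prefix, runs the embedded command and compares its output against the commented
expectation.</p>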
<p>I have encountered a minor pitfall where pyff does not have entirely
deterministic output. The output is correct, but the ordering of some elements
may differ between runs. clitest matches the whole output against the expected
one, so it is harder to use in these cases (I decided to make
pyff deterministic, even if the order does not matter). There is a clitest
feature which can alter the matching when using the <a href="https://github.com/aureliojargas/clitest/blob/master/README.md#alternative-syntax-inline-output">inline output
syntax</a>,
but my use case did not fit the inline output feature well.</p>
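<p>For completeness, the inline syntax puts the expected output on the same line as the command,
after a special marker (<code class="language-plaintext highlighter-rouge">#→</code> by default); the README also documents options after the
marker that switch to looser matching modes such as regular expressions. A contrived sketch:</p>

<pre><code>$ echo "Hello World"               #→ Hello World
$ basename /tmp/some/file.txt      #→ file.txt
</code></pre>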
<h2 id="when-would-i-use-it">When would I use it?</h2>
<p>I like clitest because it is a single file drop-in without dependencies. This
makes it easy to drop it into a repository where it can serve as a lightweight
integration test driver: all you need is to provide some input files, define how to
execute the program under test, and specify the expected output.</p>
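<p>As a sketch of that setup (the directory layout is hypothetical): keep one shell-session
file per scenario and point clitest at all of them, for example from a Makefile target:</p>

<pre><code># tests/ holds one session file per scenario, e.g. tests/help.txt, tests/convert.txt
$ ./clitest tests/*.txt
</code></pre>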
<p>A second nifty use case is mentioned in the clitest documentation: it can be
nicely used to test the code snippets in your documentation for correctness.
Often, you will write the documentation in Markdown or something similar, and
it will contain snippets showing how to execute the program and what the expected
output is. This way, you can feed your documentation straight into clitest, which
can, without further work, validate that your examples are still correct. You
may even do this automatically in CI, so your documentation is always
up-to-date.</p>

<h1 id="python-diff-started"><a href="https://petr-muller.github.io/projects/2018/04/06/python-diff-started">Python Diff</a> (2018-04-06)</h1>

<p><strong>GitHub repository:</strong> <a href="https://github.com/petr-muller/pyff">petr-muller/pyff</a></p>
<p>The idea of a syntactic/semantic-aware diff tool had been in my head since we needed
something similar for a project we were working on in the <a href="https://research.redhat.com/">Red Hat
Lab</a> together with the
<a href="http://www.fit.vutbr.cz/research/groups/verifit/">VeriFIT</a> research group. We
wanted to connect code differences (git commits or PRs) with test results and
build a “riskiness classifier”. The rough idea was something like <em>”whenever
people change I/O code in method M of class C, test T tends to break”</em>. We were
missing an analyzer that would easily give us, in a machine-readable format, what
actually changed in the code, beyond the changed lines that a simple <code class="language-plaintext highlighter-rouge">diff</code> tool
can give you. We somehow managed to build something ad-hoc for C code
differences and continued, but ever since I have thought a smart diff could be an
interesting project.</p>
<h2 id="comparing-abstract-syntax-trees">Comparing abstract syntax trees</h2>
<p>I decided to start in a simple way: take two versions of a Python file as
input, and work over their AST to detect differences. There is an <a href="https://docs.python.org/3/library/ast.html">AST
module</a> in Python standard library
that can parse Python code easily, but I remembered a talk on Pylint which
described <a href="https://github.com/PyCQA/astroid">Astroid</a> as an improved module
with more functionality (built for use in Pylint). I wanted to use it but
failed to find a current documentation link; for some reason I kept discovering
<code class="language-plaintext highlighter-rouge">www.astroid.org</code>, which has been dead for some time (I found the <a href="http://astroid.readthedocs.io">current
documentation</a> later).</p>
<p>So I decided to go with the “vanilla” <code class="language-plaintext highlighter-rouge">ast</code> module for a while. I discovered the
very helpful <a href="https://greentreesnakes.readthedocs.io/en/latest/">Green Tree Snakes - the missing Python AST
docs</a> documentation for it
and from there, the first steps were quite simple. I chose the approach of
driving the development by examples: I selected a git commit from a different
project, looked at the diff and asked myself <em>“what changed in that code?”</em>,
then went on to implement the necessary code.</p>
<p>I started with detecting added and removed imports, classes and top-level
methods in the module, followed by detecting simple changes of these entities
such as added/removed methods and changed implementations. The entities are
currently identified by name, which means renaming is not properly detected (it
will be reported as one class/method removed and another added). At the moment,
the only supported output is the natural language summary of the changes.</p>
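<p>A minimal sketch of the underlying idea (not actual pyff code): compare the sets of
top-level class and function names that the standard <code class="language-plaintext highlighter-rouge">ast</code> module reports for the two
versions of a file.</p>

<pre><code>import ast

def toplevel_names(source: str) -> set:
    """Names of top-level classes and functions in a Python source string."""
    tree = ast.parse(source)
    return {node.name
            for node in tree.body
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))}

def compare(old_source: str, new_source: str) -> dict:
    """Report top-level classes/functions added or removed in the new version."""
    old, new = toplevel_names(old_source), toplevel_names(new_source)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

OLD = "class Greeter:\n    pass\n\ndef helper():\n    pass\n"
NEW = OLD + "\ndef main():\n    pass\n"
print(compare(OLD, NEW))  # {'added': ['main'], 'removed': []}
</code></pre>

<p>Anything identified only by name like this will, of course, report a rename as one removal
plus one addition, which matches the limitation described above.</p>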
<p>After I had this MVP version of <code class="language-plaintext highlighter-rouge">pyff</code> ready, I went on to set up some necessary
project infrastructure: README, tests and some helper code.</p>
<h2 id="further-steps">Further steps</h2>
<p>I would like to implement a programmatic API and a machine-readable output format
(probably JSON), then follow up by implementing detection of further change types. I
will probably continue with the example-driven approach, but I would like to
implement some “smart” detection soon: something like recognizing that the
program was not semantically changed (for example, a simple variable rename) and
not reporting an implementation change in that case.</p>

<h1 id="sabbatical-month-2"><a href="https://petr-muller.github.io/personal/2018/04/03/sabbatical-month-2">Sabbatical, Month 2</a> (2018-04-03)</h1>

<p>I started working during my second university year (my first job was as a tester at Grisoft,
which later became AVG, which later merged with Avast) and I had been employed non-stop ever since. I
spent ten years with Red Hat, from where I moved to SAP Labs and had not taken any free time between
the jobs. So when I decided to leave SAP Labs, the idea of taking a few months off was irresistible. I
did not have any clear idea of what I would like to achieve during the sabbatical. People often take
a career break to pursue a specific project, to work on a dream project. I did not have any such
goal. I knew I would like to reduce my book backlog, to live more healthily (exercise and sleep more),
to get my driving license and, finally, to do more open-source coding for fun.</p>
<h1 id="python-projects">Python projects</h1>
<p>I am already quite proficient in Python, so I went on to build some projects I had in my head for
some time. The projects are not that technologically advanced, so I decided to intentionally try
some new tools while working on them, and learn in the process. I am trying to use
<a href="https://docs.pytest.org/en/latest/">pytest</a> for testing (along with some plugins) and
<a href="http://mypy-lang.org/">mypy</a> and <a href="https://www.pylint.org/">Pylint</a> for static analysis. I also
connected a few modern, cloudy, GitHub-connected tools to my repos to evaluate how they work - I tried
<a href="https://www.codacy.com/">Codacy</a>, <a href="https://codeclimate.com/">Code Climate</a>,
<a href="https://codecov.io/">Codecov</a>, <a href="https://coveralls.io/">Coveralls</a>, <a href="https://tidelift.com/">Dependency
CI/Tidelift</a> and <a href="https://travis-ci.org/">Travis CI</a>. I like the direction
in which these tools are going, and I might write a post about them in the future.</p>
<h2 id="python-diff"><a href="https://github.com/petr-muller/pyff">Python Diff</a></h2>
<p>This is a Python-based toy project that is supposed to compare two versions of a Python module (say,
a part of a Git commit) and determine the syntactic/semantic difference between them. I do not
have a clear idea about what exactly the detected differences could be, but I am starting with
simple stuff like added/removed classes or methods and will continue with detecting methods with changed
API or just implementation, etc. The initial goal is to provide some machine-readable artifact
describing the difference and then to try to build further tools on top of it. Possible directions for
this project might be a GitHub PR commenter bot posting human-readable summaries of changes (I could
learn more about cloud services, GitHub API and natural language generation in the process), or
combining the difference data with other sources like code coverage and perhaps some machine
learning, and building a tool to predict “riskiness” of a Python project PR.</p>
<p>Of all my current projects, I probably like Python Diff the best. After it moves forward a bit, I will write a
separate post about it, and I will surely consider giving a talk about it at some Python conference.</p>
<h2 id="vtes-game-log"><a href="https://github.com/petr-muller/vtes">VtES Game Log</a></h2>
<p>Last autumn, I started playing <a href="http://www.vekn.net/what-is-v-tes">Vampire: the Eternal Struggle</a>
(a nearly dead, old-school, multiplayer CCG) again after a period of hiatus. I play using various
decks against various players in various settings (friendlies in a pub, online, tournaments) so I
decided to build a simple tool for tracking my games and results, going from a local, CLI-oriented
utility to first an online REST API and then possibly a web application. The underlying data
structures are trivial, so I would like to learn more about writing REST APIs (possibly
experimenting with some API tools like <a href="https://apiary.io/">Apiary</a>,
<a href="http://dredd.org/en/latest/">Dredd</a> or <a href="https://www.3scale.net/">3scale</a>) and running them on some
cloud platform like AWS or Openshift.</p>
<h1 id="collaborations">Collaborations</h1>
<p>After I finished at SAP, I let some of my friends know, and we met to discuss possible
collaborations.</p>
<h2 id="engeto-testing-course">Engeto Testing Course</h2>
<p>My former Red Hat colleague Filip founded an <a href="https://engeto.cz">IT education startup</a> where one of
the things they offer is courses. They do not have a course focused on software testing yet, so we
agreed we would collaborate on creating one together. Until now, I have mostly been researching what
content to include in the course. With the research finally done, we are following up with an
outline, and we will start producing the actual content soon.</p>
<h2 id="perun">Perun</h2>
<p><a href="https://github.com/tfiedor/perun">Perun</a> is a very interesting pet project of my former colleague
from <a href="http://www.fit.vutbr.cz/research/groups/verifit/">VeriFIT</a>, Tomáš Fiedor. It is a long-term
performance tracking/control system which attaches your project’s performance metrics (performance
test results, profiling information and the like) to your Git revision tree, so you can track,
analyze and visualize them over time, possibly spotting performance regressions as they happen
during development. We had a nice talk with Tomáš about possible directions of this project, one of
which was taking it into the cloud and making a Git/GitHub-connected web application (like Travis CI
or Code Climate) out of it. Unfortunately, this was the one project for which I failed to allocate
sufficient time :(</p>
<h1 id="books">Books</h1>
<h2 id="agile-testing"><a href="http://a.co/bRIw1v0">Agile Testing</a></h2>
<p>I started reading this one back in January but had a hard time finishing it because it
became quite tedious to read (it tends to be repetitive and vague). I finally managed to finish it in
March. It is starting to show its age because it focuses so much on managing situations where a
tester’s company is not that friendly to Agile, is transitioning from Waterfall, or is in other non-ideal
situations that are not that common today. I liked the Agile Testing Quadrants and the last part about how
testers can contribute in different stages of an agile project.</p>
<h2 id="coders-at-work"><a href="http://www.codersatwork.com/">Coders at Work</a></h2>
<p>This is also a fat one (around 600 pages), but it reads well thanks to its interview format; at least
for me, it does. I am currently somewhere in the middle. It does not give the reader any directly
applicable material, but it is interesting to read because the interviewees have different
achievements and backgrounds. I especially like to compare different people’s answers to identical
questions.</p>
<h1 id="misc">Misc</h1>
<p>I was not doing just IT stuff. I finally obtained my driving license, which took quite a lot of
time, although I managed to pass both exams on the first try. I also spent a lot of time on
physiotherapy exercises to treat my <a href="https://en.wikipedia.org/wiki/Patellar_tendinitis">jumper’s
knee</a>. It healed nicely, but it seems to be back
after a few football matches :(</p>

<h1 id="radamsa-a-general-purpose-fuzzer"><a href="https://petr-muller.github.io/tools/2018/01/05/radamsa">Radamsa: A general-purpose fuzzer</a> (2018-01-05)</h1>

<p><strong>Radamsa</strong> is a general-purpose, black-box oriented mutating fuzzer. It is
written in Scheme and is available on its <a href="https://github.com/aoh/radamsa">GitHub
page</a> under the MIT license. While the project
is not entirely abandoned (there are occasional commits on the <code class="language-plaintext highlighter-rouge">develop</code> branch,
but the last commit on the <code class="language-plaintext highlighter-rouge">master</code> branch is a PR merge six months ago), there
does not seem to be much development happening anymore. The project is a side result
of the research done by <a href="https://www.ee.oulu.fi/roles/ouspg/FrontPage">Oulu University Secure Programming
Group</a>. The project has simple but
straightforward and informative documentation in the repository README file.</p>
<h2 id="basics">Basics</h2>
<p>Its documentation describes Radamsa as an “extremely black-box fuzzer”: it does
not need any information about either the input format or the internals of
the fuzzed program. The tool starts with a given sample input for an
application, on which it applies a mutation while trying to keep the general
format valid(-ish). Radamsa has its roots in research on the automatic
analysis of communication protocols.</p>
<p>Radamsa claims to be applicable, without any configuration, to programs
processing any format of input - binary or text. Quick experiments (see below)
show that Radamsa is quite successful. Although Radamsa’s output probably
cannot compare with that of format-specialized fuzzers (such as
<a href="https://embed.cs.utah.edu/csmith/">CSmith</a> for C programs), the applied
mutations go well beyond random garbage injection, leading to valuable testing
inputs for a program.</p>
<p>Intuitively, I would not expect many successful bug discoveries in proven,
battle-tested software, but the Radamsa README file contains an impressive list of
discovered CVEs, including ones in curl, libxslt and bzip2.</p>
<h2 id="installation-and-usage">Installation and usage</h2>
<p>The instructions say building Radamsa is a simple clone-and-run-make process
and I was able to build a single, dependency-free binary without any problem.
Mimicking a few examples from the documentation worked as expected: feeding my
name to Radamsa’s standard input yielded multiple mangled variants which I
imagine can wreak havoc on naive string processing routines.</p>
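<p>Roughly what that experiment looks like; the binary location and the exact flags are my
recollection of the README (<code class="language-plaintext highlighter-rouge">-n</code> asks for several mutations, <code class="language-plaintext highlighter-rouge">-o</code> writes them to
numbered files), so treat them as approximate:</p>

<pre><code>$ echo "Petr Muller" | bin/radamsa -n 5
$ bin/radamsa -n 100 -o fuzzed-%n.txt sample.txt
</code></pre>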
<p>The next experiment I tried was running Radamsa with a simple Python program as
an input. Again, the results were interesting: the applied modifications
varied a lot but kept the general structure of Python code. I saw lines removed
or duplicated and tokens changed (for example, integer literals changed to a
different value). I also encountered quite interesting, non-trivial mutations
like replacing a whole parenthesized expression with a recursive-ish
expression (think something like <code class="language-plaintext highlighter-rouge">a(a(a(a(a(b)))))</code>). Again, several tries convinced
me these inputs would be valuable when trying to fuzz something that
processes Python grammar.</p>
<p>As a last experiment, I tried an XML file; specifically, an xUnit result XML
file. Again, Radamsa mostly changed the file in a way that kept the overall
format, and the scale of applied mutations was similar to that of the Python input.</p>
<h2 id="when-would-i-use-it">When would I use it?</h2>
<p>Radamsa is extremely simple to start with - you only need the target system,
a few sample inputs and you are good to go. Set up Radamsa in a loop, feed its
output to the system under test and detect bugs. Of course, the black box
approach limits the rate with which Radamsa can penetrate deep into the
tested system, especially compared to smart fuzzers guided by the instrumented
target system, such as <a href="http://lcamtuf.coredump.cx/afl/">American Fuzzy Lop</a>.
You also need to have a reasonable way to detect error conditions in the
tested system, given an unknown input (but this holds for most fuzzers). Of
course, you can usually start with some simple criteria like “the target should
terminate and not crash”.</p>
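<p>A minimal sketch of such a loop, with the target command and the crash criterion as
placeholders (an exit status above 127 usually means the process died on a signal):</p>

<pre><code>#!/bin/sh
# Keep mutating a sample and feeding it to the target;
# archive every input that makes the target die on a signal.
while true; do
  bin/radamsa sample.xml > fuzzed.xml
  ./target-under-test fuzzed.xml
  if [ $? -gt 127 ]; then
    cp fuzzed.xml "crash-$(date +%s).xml"
  fi
done
</code></pre>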
<p>I will certainly include Radamsa in my toolbelt. The fact that I can
immediately, without any setup, run it for a few hours against pretty much
anything makes it useful in different situations, especially when
instrumentation or specialized format fuzzers are not available or worth the
effort to set up.</p>

<h1 id="randoop-automatic-unit-test-generation-for-java"><a href="https://petr-muller.github.io/tools/2017/12/25/randoop">Randoop: Automatic unit test generation for Java</a> (2017-12-25)</h1>

<p><strong>Randoop</strong> is an automatic unit test generator for Java (and .NET). Randoop is
written in Java and is available either from its <a href="https://randoop.github.io/randoop/">project
page</a> or <a href="https://github.com/randoop/randoop">GitHub
page</a>. It is available under the MIT license. As of
2017-12-24, the project seems to be quite alive, although most of the commits are
authored by a single developer (but the project accepts occasional PRs). Randoop
appears to be driven by a research group at the University of Washington, but the
overall quality of the project structure, supporting documentation, build
system and other project artifacts is excellent.</p>
<h2 id="basics">Basics</h2>
<p>According to its documentation, Randoop generates tests using feedback-directed random
test generation. It randomly (but smartly) generates sequences of constructor and method
invocations for the input classes. These sequences are executed, and the results are used to
create assertions. This means the tests can mostly only capture the actual behavior of
the tested class (possibly for future regression testing), not reveal many new bugs.
There is an exception to this, though – Randoop can detect when the class under test
does not conform to basic Java contracts (<code class="language-plaintext highlighter-rouge">Object.equals()</code> and the like) and several
other likely-buggy behaviors, such as <code class="language-plaintext highlighter-rouge">NullPointerException</code> being thrown when no <code class="language-plaintext highlighter-rouge">null</code>
values are passed as parameters to a method. The documentation states that it is possible to
add more contracts for checking.</p>
<h2 id="installation-and-usage">Installation and usage</h2>
<p>I cloned the Git repository and followed the manual to build Randoop from source
using Gradle. The build ran for about five minutes and produced a JAR file. I then tried
to execute Randoop on a little library I developed when working on the static analysis of C
programs, <a href="https://github.com/petr-muller/smg.git">smg</a>.</p>
<p>I started with generating tests for the simple <code class="language-plaintext highlighter-rouge">SMGRegion</code>
<a href="https://github.com/petr-muller/smg/blob/master/src/main/java/cz/afri/smg/objects/SMGRegion.java">class</a>.
After a little fiddling with parameters, Randoop ran for a while, generating 9
files of about 2MB each, with 4286 tests (so about 18MB total, which looks a bit
excessive for a ~60-line class). No “error-revealing” tests were generated,
just the regression tests. I tried to execute the tests, and they all
passed. Their total runtime was 0.105 seconds, which is good. I tried to
introduce a change in the tested class and rerun the tests, and now 2506 tests
failed as a result.</p>
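<p>For reference, the invocation was along these lines; the classpath placeholders are mine and
the flag spellings are from my reading of the Randoop manual, so double-check them against the
version you build:</p>

<pre><code>$ java -classpath my-classes:$RANDOOP_JAR randoop.main.Main gentests \
    --testclass=cz.afri.smg.objects.SMGRegion \
    --time-limit=60 \
    --junit-output-dir=generated-tests
</code></pre>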
<p>Afterwards, I tried to include all public classes, and the results were about the
same – around 4200 tests, none of them error-revealing (but then, Randoop can only find basic
Java contract violations).</p>
<p>The generated tests are straightforward (just constructor and method invocation
sequences) but quite long, with the usual appearance of generated code (numbered variable
names, etc.). I was able to investigate the failures quickly, but of course, the generated
code has no real semantic meaning that would hint to the programmer why the bug is
there, other than “this worked before”.</p>
<h2 id="when-would-i-use-it">When would I use it?</h2>
<p>Randoop seems quite useful to me. It is mature enough, well-documented and quite
easy to use. I also did not encounter any problems with the tool. Its error-revealing
mode could be run as part of CI, being basically a simple fuzzer for Java contracts (but
I think existing static analyzers could do the same job).</p>
<p>The usefulness of the generated tests is slightly more questionable. They could serve as
regression tests, as they can only alert you later when you, perhaps mistakenly, change
observable behavior. The good thing is Randoop can indeed create tests that you possibly
need but did not write. You could generate a testsuite at a particular point in time and keep
executing it: this way you would have a nice regression suite, but you would not test any
code added after the suite was generated. Regenerating the suite after each change seems
too expensive, but has some merit (of course, only if the original suite was run and
passed first). Perhaps some discard-few-old, generate-few-new strategy might be employed
there (I guess these strategies are probably discussed somewhere in the related
scientific papers, such as the authors’ <a href="https://homes.cs.washington.edu/~mernst/pubs/maintainable-tests-ase2011.pdf">Scaling Up Automated Test
Generation</a>
ASE’2011 paper).</p>
<p>I can also imagine situations where Randoop generates tests that capture “undefined”
behavior, like ordering or specific values that may change between executions. The user
manual briefly discusses this, and the tool provides a few techniques that can be applied to
prevent such behavior.</p>

<h1 id="set-up-this-thing"><a href="https://petr-muller.github.io/meta/2017/04/15/Hello-World">Set up this thing</a> (2017-04-15)</h1>

<p>Finally took some time and set up this thing. Hopefully more content will start
to appear here.</p>