Python Diff

GitHub repository: petr-muller/pyff

The idea of a syntax- and semantics-aware diff tool has been in my head ever since we needed something similar for a project we worked on in the Red Hat Lab together with the VeriFIT research group. We wanted to connect code differences (git commits or PRs) with test results and build a “riskiness classifier”. The rough idea was something like “whenever people change I/O code in method M of class C, test T tends to break”. We were missing an analyzer that would easily tell us, in a machine-readable format, what actually changed in the code, beyond the changed lines that a simple diff tool can give you. We somehow managed to build something ad hoc for C code differences and moved on, but ever since then I have thought that a smart diff could be an interesting project.

Comparing abstract syntax trees

I decided to start simply: take two versions of a Python file as input and work over their ASTs to detect differences. The Python standard library has an ast module that can parse Python code easily, but I remembered a talk on Pylint that described Astroid as an improved module with more functionality (built for use in Pylint). I wanted to use it but failed to find a current documentation link; for some reason I kept finding www.astroid.org, which has been dead for some time (I discovered the current documentation later).

So I decided to go with the “vanilla” ast module for a while. I discovered the very helpful documentation for it, Green Tree Snakes - the missing Python AST docs, and from there the first steps were quite simple. I chose to drive the development by examples: I selected a git commit from a different project, looked at the diff and asked myself “what changed in that code?”, then implemented the code needed to detect it.

I started by detecting added and removed imports, classes and high-level methods in the module, followed by detecting simple changes to these entities, such as added/removed methods and changed implementations. The entities are currently identified by name, which means renaming is not properly detected (it is reported as one class/method removed and another added). At the moment, the only supported output is a natural-language summary of the changes.
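To make the name-based approach concrete, here is a minimal sketch using only the standard ast module. It is not the actual pyff code: the function names top_level_names and diff_names are hypothetical, and it reads “high-level methods” as module-level functions. It collects the names of imports, classes and functions at module level in both versions and compares them as sets.

```python
import ast


def top_level_names(source):
    """Collect names of imports, classes and functions defined at module level."""
    names = {"imports": set(), "classes": set(), "functions": set()}
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names["imports"].update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            module = node.module or ""
            names["imports"].update(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.ClassDef):
            names["classes"].add(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names["functions"].add(node.name)
    return names


def diff_names(old_source, new_source):
    """Report added and removed names per entity kind, matching purely by name."""
    old, new = top_level_names(old_source), top_level_names(new_source)
    return {
        kind: {
            "added": sorted(new[kind] - old[kind]),
            "removed": sorted(old[kind] - new[kind]),
        }
        for kind in old
    }


OLD = "import os\n\nclass Parser:\n    pass\n\ndef main():\n    pass\n"
NEW = "import sys\n\nclass Reader:\n    pass\n\ndef main():\n    return 0\n"

changes = diff_names(OLD, NEW)
print(changes["classes"])  # the rename shows up as one removal plus one addition
```

Because matching is by name only, renaming Parser to Reader cannot be told apart from deleting one class and adding another, which is exactly the limitation described above. The changed body of main is also invisible at this level; detecting it needs a comparison of the function subtrees.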

After I had this MVP version of pyff ready, I went on to set up some necessary project infrastructure: README, tests and some helper code.

Further steps

I would like to implement a programmatic API and a machine-readable output format (probably JSON), and then continue with detecting further types of changes. I will probably stick with the example-driven approach, but I would like to implement some “smart” detection soon: something like recognizing that the program was not semantically changed (for example, by a simple variable rename) and not reporting an implementation change in that case.
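One possible way to approach the variable-rename case, sketched here under several assumptions, is to rewrite locally assigned names to canonical placeholders before comparing the dumped ASTs. None of this exists in pyff; the names Canonicalize, canonical_dump and report, and the JSON shape, are invented for illustration. The sketch handles a single function, leaves parameters untouched (renaming them changes the keyword-argument interface) and ignores nested scopes and global/nonlocal declarations.

```python
import ast
import json


class Canonicalize(ast.NodeTransformer):
    """Rename selected local variables to _v0, _v1, ... in order of first appearance."""

    def __init__(self, local_names):
        self.local_names = local_names
        self.mapping = {}

    def visit_Name(self, node):
        if node.id in self.local_names:
            node.id = self.mapping.setdefault(node.id, f"_v{len(self.mapping)}")
        return node


def canonical_dump(func):
    """Dump a function's AST with locally assigned names replaced by placeholders."""
    params = {n.arg for n in ast.walk(func) if isinstance(n, ast.arg)}
    assigned = {
        n.id
        for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    # Parameters are excluded so that their names still count as part of the interface.
    return ast.dump(Canonicalize(assigned - params).visit(func))


def report(old_source, new_source):
    """Return a JSON string saying whether the first function's implementation changed."""
    old_func = ast.parse(old_source).body[0]
    new_func = ast.parse(new_source).body[0]
    return json.dumps(
        {
            "function": new_func.name,
            "implementation_changed": canonical_dump(old_func) != canonical_dump(new_func),
        }
    )


OLD = "def area(w, h):\n    result = w * h\n    return result\n"
RENAMED = "def area(w, h):\n    size = w * h\n    return size\n"
CHANGED = "def area(w, h):\n    result = w + h\n    return result\n"

print(report(OLD, RENAMED))
print(report(OLD, CHANGED))
```

Only names that are assigned inside the function are rewritten, so references to builtins and globals keep their identity: swapping print for len is still reported as a change. This is a purely syntactic approximation of “not semantically changed”, and real code would need proper scope handling.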

Written on April 6, 2018