Python Static Analysis Tools

May 1, 2020 · Andrew Wyllie · 7 min read
Image credit: Andrew Wyllie

Lately I've been looking at tools to help improve code quality, especially with respect to security issues. While code review is a useful process, it is sometimes difficult to pinpoint code which may lead to vulnerabilities. Unless you have memorized all of the various attack vectors, you are probably going to miss something along the way. The idea that we might be able to automatically analyze the code as part of our CI/CD process seemed very intriguing.

I have been using Flake8 to do static analysis of Python code for some time now. It's fairly easy to configure to run with pytest, so I get unit tests and syntax checking at the same time. There is also a handy plugin for vim called vim-flake8 that checks your Python syntax on the fly. To be quite honest, all I was really doing was running the checker and fixing up a couple of little problems here and there. It's a nice way to implement a style guide without a bunch of engineers sitting around debating the merits of different code formatting models and whether the whitespace was going to be tabs or the less efficient, but more precise, spaces. Surely there's more that can be done with static testing though, something that can check my code for type mismatches and give back cryptic error messages that would make a FORTRAN77 compiler feel proud.

Security Testing

Whenever I utter the words 'Security Testing' I use the same voice as the guy in The Princess Bride in The Pit of Despair, because you know that no matter how thorough you are, the process never really ends. The package I looked at is called Bandit, which claims to be "a tool designed to find common security issues in Python code" and was developed as part of the OpenStack Security Project. Immediately I see the appeal: just install it in my CI/CD pipeline and let it do its thing. Bandit is designed to look for things like outdated encryption ciphers, insecure or deprecated functions, and things like loading pickle files. This is not to say that loading pickle files is never safe, but rather, if you are loading pickle files in a production environment you better be damn sure you know where they came from.

I downloaded and ran Bandit against some code I've been working on for a while and got the pickle error, as I was loading a pickle file from S3 on AWS. I was still pretty new to Python when I wrote the code and, at the time, I thought using pickle would be a good way to encode binary data before putting it on S3. It's not. If someone could write to the files on S3, they could load all types of code and data into my program. Fortunately, I have rarely (if ever) actually called this function, so crisis averted; I still need to fix the code though. Part of the challenge for me as a Data Engineer is that for data scientists it's totally reasonable to use pickle to dump and load data. Taking code from a data scientist and converting it to production code can lead to overlooking issues like this.
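
To make the pickle concern concrete, here is a minimal sketch of why unpickling untrusted bytes is risky, followed by a JSON round trip as one possible safer encoding for plain data. The names Payload and record are made up for this illustration, and the harmless eval call stands in for whatever an attacker would really run.

```python
import json
import pickle


class Payload:
    """Hypothetical stand-in for a malicious object stored in a pickle file."""

    def __reduce__(self):
        # pickle records "call this function with these arguments" and
        # performs that call during loading. A harmless eval is used here.
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Payload())

# Loading the bytes runs the recorded call; no Payload object comes back.
result = pickle.loads(untrusted_bytes)
print(result)

# For plain data, JSON only ever produces dicts, lists, strings and numbers.
record = {"id": 42, "tags": ["a", "b"]}
encoded = json.dumps(record).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))
print(decoded == record)
```

JSON is only one option and does not cover arbitrary binary blobs, so treat this as an illustration of the risk rather than a drop-in fix for the S3 code described above.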

Here are some pretty good reasons to install and use Bandit:

  • looking for vulnerabilities in code before pushing it to master - this is the obvious one
  • being more in tune with best practices
  • keeping old code up to date - Bandit will detect deprecated functions, especially functions with known vulnerabilities. Checking old code that no one looks at to make sure it is still safe is a pretty nice feature
  • checking dependencies. Yes, it's all open source and open for review, but when is the last time you actually reviewed all the source code for a package or dependency you are using?

That last one is a big deal. I ran Bandit against the XRay SDK from AWS and it picked up stuff like this:

Test results:

>> Issue: [B310:blacklist] Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.
   Severity: Medium   Confidence: High
   Location: aws_xray_sdk/core/plugins/ec2_plugin.py:23
   More Info: https://bandit.readthedocs.io/en/latest/blacklists/blacklist_calls.html#b310-urllib-urlopen
22	
23	        r = urlopen('http://169.254.169.254/latest/meta-data/instance-id', timeout=1)
24	        runtime_context['instance_id'] = r.read().decode('utf-8')

Bandit is flagging the urlopen call here: the code connects to a URL, and since urlopen will also accept file:/ or custom schemes, every call deserves a look. This is perfectly(?) safe in this case, as 169.254.169.254 is a link-local IP (see RFC 3927) which AWS uses to provide metadata about a running EC2 instance and is apparently also available from Lambda (which is news to me). That address could have been 104.126.73.169 or even worse 170.178.168.203 or some other random IP address on the internet! It's also nice that they provide a link to their documentation to explain what the problem is. Anyway, I spent a lot of time playing with Bandit and pointing it at dependencies I'm including in some of my projects (since my code did not produce any errors, HA!). Bandit has definitely earned its place on my CI/CD test stack. Keep in mind though that you should only use Bandit as a tool and that there are many other security (and compliance) issues that need to be addressed.
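
One way to act on a B310 finding in your own code is to check the scheme before calling urlopen. The sketch below is only an illustration: ALLOWED_SCHEMES, is_allowed_url and fetch are hypothetical names, and the allow-list of http and https is an assumption about what the application expects.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: this application only ever needs plain web URLs.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True only when the URL uses an explicitly permitted scheme."""
    return urlparse(url).scheme.lower() in ALLOWED_SCHEMES


def fetch(url: str, timeout: float = 1.0) -> bytes:
    """Open the URL only after the scheme check has passed."""
    if not is_allowed_url(url):
        raise ValueError(f"refusing to open URL with unexpected scheme: {url!r}")
    with urlopen(url, timeout=timeout) as response:
        return response.read()


print(is_allowed_url("http://169.254.169.254/latest/meta-data/instance-id"))
print(is_allowed_url("file:///etc/passwd"))
```

Because B310 is a blacklist check on the call itself, Bandit will most likely still report the urlopen line; once the call has been audited, a `# nosec` comment on that line is the usual way to silence it.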

Static Typing

Ok, so I took the plunge and looked at mypy. If you've never looked at static typing in Python, this may be a good place to start. This is my first experience with it in Python and it looks very promising, although it might screw a lot of people up as it looks a bit different. mypy requires Python 3.5 or later, which should not be a problem for anybody anymore, right?

The idea with static typing is to explicitly type the functions so what most people are familiar with as a dynamically typed function:

def just_add_beer(foo):
    return foo + ' with beer'

would be written as:

def just_add_beer(foo: str) -> str:
    return foo + ' with beer'

Simple enough and much more explicit. When mypy analyzes the code, it checks that every call to the function passes the correct type and reports an error if one does not. This is the kind of checking that can reveal some of those really hard to find bugs where you are passing an incorrect type and the function just merrily goes on working, assuming that you know what you are doing, or even worse, crashes in production. Unit tests don't necessarily pick up on these things either, as most people do not try passing incorrect types to their functions; they usually only test edge cases like "if I pass an int that's too big, what happens?" as opposed to "if I pass the word 'hello' as an int, what happens?"
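
Here is a small sketch of the "merrily goes on working" case. The function repeat is a made-up example: the second call runs without any error and quietly returns a number instead of a string, which is the sort of bug a type checker is meant to catch before the code ships.

```python
def repeat(word: str, times: int) -> str:
    """Return the word repeated the given number of times."""
    return word * times


good = repeat("beer", 2)
print(good)  # beerbeer

# Wrong type for the first argument: Python happily computes 3 * 2.
# Running mypy on this file reports an incompatible argument type here.
bad = repeat(3, 2)
print(bad)  # 6
```

Nothing fails at runtime, so a unit test that only exercises strings would never notice the second call.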

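Sadly, the Python interpreter itself does not seem to give a **** about the type annotations: at runtime the typed function behaves exactly like the untyped one. Passing an integer fails in both cases, but only because adding an int to a str is unsupported, not because anything checked the declared type: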

wyllie@dilex:~ $ python
>>> def add_beer(foo):
...     return foo + ' with beer'
... 
>>> add_beer('hello')
'hello with beer'
>>> add_beer(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in add_beer
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> add_beer('hello')
'hello with beer'
>>> 
>>> def just_add_beer(foo: str) -> str:
...     return foo + ' with beer'
... 
>>> just_add_beer('hello')
'hello with beer'
>>> just_add_beer(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in just_add_beer
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit
>>> exit()
(crap, that exit thing burns me every time).  

I don't know, I may have to do some more research on this technique and this project to see if it's worth stressing over - maybe I'm missing something like Perl's use strict...
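
For what it's worth, the interpreter stores the annotations as metadata and otherwise leaves them alone. If runtime enforcement is what you are after, it has to be added by hand. The sketch below is one hypothetical way to do that: enforce_types is a made-up helper (not part of mypy or the standard library) and it only handles plain classes such as str and int.

```python
import functools
import inspect
from typing import get_type_hints


def enforce_types(func):
    """Hypothetical helper: check simple class annotations at call time."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generics like List[int] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@enforce_types
def just_add_beer(foo: str) -> str:
    return foo + " with beer"


print(just_add_beer("hello"))

try:
    just_add_beer(1)
except TypeError as error:
    print(error)
```

This trades a static check for a runtime cost on every call, so it is closer to a debugging aid than a replacement for running mypy in the pipeline.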

If you use pytest, you can also install the pytest-mypy plugin, which simplifies adding mypy checks to your CI/CD pipeline.

The decision to use mypy is a bit more complicated than using flake8 or bandit, which will just run with no code changes (well, no code changes except fixing broken code that has been identified). mypy requires thinking about your code in a different, albeit more robust, way. Fortunately, mypy will ignore functions that are not explicitly typed this way, so you don't have to rewrite your whole codebase on the first day you use it, or even use it everywhere in your code. You might decide that only new code will be typed and then update older code as time permits - maybe add doc strings to all of your functions while you are at it.
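
A minimal sketch of what that gradual adoption can look like in a single file. Both function names are made up for the example: the untyped legacy function is left alone by mypy's default settings, while the newly typed one is checked.

```python
def legacy_label(name):
    # No annotations: by default mypy does not type check this body.
    return name + " with beer"


def typed_label(name: str) -> str:
    # Annotated: mypy checks this body and every call to this function.
    return legacy_label(name).upper()


print(typed_label("ale"))  # ALE WITH BEER
```

When you are ready to tighten things up, mypy's --check-untyped-defs option checks the bodies of unannotated functions as well, and --disallow-untyped-defs reports functions that have no annotations at all.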

Finally

These are just a few of the great tools out there. It's worth investing some time researching what is available and adding some of these to your workflow.