Automating Data Reduction via Whitelists

In a previous post, Build your own NSRL Server, I showed how to set up an NSRL server so you could filter whitelisted hashes out of md5deep output. I found I didn't like that method and never really used it. I had plenty of RAM, so I kept for-looping through my text file of whitelisted hashes. It was slow, but I ignored it because I usually kicked it off and worked on other things in parallel. Then I figured: why not set up an API? So I built one in about 12 lines of Python using Flask. I don't even know if you can call it an API, but we will, because I don't know what else to call it.
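The post doesn't show the server code inline, but a tiny Flask lookup service along these lines is a reasonable sketch. Everything here is an assumption, not the author's exact code: the whitelist file name (`whitelist.txt`), the route (`/md5/<hash>`), and the port are all placeholders. The key idea is loading the whole whitelist into a Python set once, so each lookup is an O(1) membership test.

```python
# A minimal sketch of a whitelist lookup service -- not the author's exact code.
# Assumes whitelist.txt (hypothetical name) holds one MD5 hash per line and
# fits in memory as a Python set.
import os
from flask import Flask, jsonify

app = Flask(__name__)

WHITELIST = set()
if os.path.exists("whitelist.txt"):
    with open("whitelist.txt") as f:
        WHITELIST = {line.strip().lower() for line in f}

@app.route("/md5/<md5_hash>")
def lookup(md5_hash):
    h = md5_hash.lower()
    # O(1) set membership check per query
    return jsonify({"md5_hash": h, "in_set": h in WHITELIST})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A `GET /md5/<hash>` then returns a small JSON document saying whether the hash was found, matching the example responses shown later in the post.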

You can get the code here: Whitelist API

Here is the workflow I more or less follow:

  • Open the file, edit it with your details, then start the server: python
  • I run the server portion of the code on a 4th Gen Intel NUC i3 running Ubuntu, with a 250 GB SSD and 16 GB of memory. The NUC is portable enough that I could (if I wanted) take it with me on-site. I also run Elasticsearch/Logstash/Kibana (ELK) on it; it works great as a mini-SIEM.
  • Hash the target system: md5deep.exe -z -r -l -o e -s "%SystemDrive%\*" > hashes.txt
  • This is automated via an IR collection script, but you could just as easily run it against a mounted image.
  • The md5deep output columns are: _file_size_  _hash_  _file path_
  • Filter the hashes and check the survivors against VirusTotal: python -f hashes.txt > hashes.csv
  • VT API Code

The output is a CSV file with four columns.


At this point people usually say, "Well, yeah, but you still have a lot of files to review." Sure, that's true, but I've found that this process filters out 75%+ of the files on a random system from the whitelist alone.

After filtering via the API, I am typically left with somewhere around 800 - 4,000 remaining hashes, which is pretty good considering my whitelist is 3.5 GB. I use the NSRL (only the MD5 hashes), Mandiant's Redline whitelist, and some lists I have built on my own.
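Building that combined whitelist could look something like the following: merge the MD5 column extracted from the NSRL set, the Redline whitelist, and any custom lists into one deduplicated file. This is a sketch under my own assumptions; the file names are placeholders, and it expects inputs already reduced to one hash per line.

```python
# Merge several one-hash-per-line lists into a single deduplicated whitelist.
# File names are placeholders, not the author's actual layout.
def merge_hash_lists(paths, out_path):
    hashes = set()
    for path in paths:
        with open(path) as f:
            for line in f:
                h = line.strip().lower()
                if len(h) == 32:  # keep only MD5-length entries
                    hashes.add(h)
    with open(out_path, "w") as out:
        out.writelines(h + "\n" for h in sorted(hashes))
    return len(hashes)

# merge_hash_lists(["nsrl_md5.txt", "redline.txt", "custom.txt"], "whitelist.txt")
```

Normalizing everything to lowercase here matters: the lookup set is only useful if the server and client agree on case.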

A real, recent example: I had an initial list of 27,323 hashes from a system. Filtering against my API got that number down to 4,174. I then bounced the remaining 4,174 hashes against VT, dropped everything with 0 hits, and kept only files with >= 1 hits plus files not previously submitted; that left 47. So we went from 27,323 down to 47 in about 20 - 25 minutes.
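The VT triage rule in that step can be sketched as a simple predicate. The field names here are an assumption based on the classic VT v2 /file/report response (`response_code` 0 means the file was never scanned or submitted; `positives` is the number of engines that flagged it), not code from the post.

```python
# Triage predicate for VirusTotal results (assumed v2-style report fields).
def keep_for_review(report):
    if report.get("response_code", 0) == 0:
        return True  # unknown to VT (never scanned/submitted) -> worth a look
    return report.get("positives", 0) >= 1  # flagged by at least one engine

reports = [
    {"response_code": 1, "positives": 0},   # known, clean -> drop
    {"response_code": 1, "positives": 27},  # flagged 27/54 -> keep
    {"response_code": 0},                   # never submitted -> keep
]
remaining = [r for r in reports if keep_for_review(r)]  # keeps 2 of the 3
```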

Of the 47, one was 27/54, one was 9/50, two were 1/55, and 43 were never scanned/submitted (unknown to VT). Of those 43 unknowns, 3 were PUP/malware in the user's Downloads folder.

As analysts, we don't have time to review 27,323 entries, or even 4,174. But if we can get the number down to 47, or even a couple hundred, that's a heck of a start. Add on the fact that it's all automated, and it's a win-win.

Is it always helpful? No. But it works when it does.

Example output if you query the API directly:

{ "in_set": true, "md5_hash": "392126e756571ebf112cb1c1cdedf926" }

{ "in_set": false, "md5_hash": "392126e756571ebf112cb1c1cde00000" }

If you have code suggestions, drop me an email. At some point I would like to make it more robust.