Apache Allura™ / Tickets / #7962 Better binary file detection

Dave Brondsema - 2015-08-17

We should test the https://pypi.python.org/pypi/binaryornot/ library and see if its better than our current use of the magic standard library.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Igor Bondarenko - 2015-08-18

Owner: Anonymous --> Igor Bondarenko

Labels: --> 42cc

Status: open --> in-progress
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Igor Bondarenko - 2015-10-21

Current algorithm for has_html_view looks something like this:

Guess mimetype based on filename only (mimetypes.guess_type)

If it is text/* assume we can display it

If not - check various lists of "viewable extensions", if one of them contains extension of given filename assume we can display it

If not - guess file type based on content (using python-magic) if it is text assume we can display it

The problem with the example above is that ".d" is a valid extension for D programming language source files. On step (1) it is detected as such (mimetype text/x-dsrc).

One way to fix this is always check content of the file if we think it is a text file and we can display it, but it will slow down things significantly. On my local machine doing content-based check is 184 times slower in the best case scenario (120ms vs. 0.65ms). Thus it will be slower for every viewable file (and most of them are, in a typical repo).

I have checked binaryornot and filemagic libraries, but they're working with the same speed as python-magic, which we're using now. Most of the time is probably spent accessing the filesystem and not actualy guessing file's type.

I can't think of any way to exclude false positives without a performance penalty. Any thoughts?

To reduce performance penalty we can check file content only for files with more than one dot in the filename (like 2.jpg.d), but it seems like very poor heuristic to me.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Dave Brondsema - 2015-10-21
  
  Good analysis, I guess the .d extension indeed was a misleading example.
  
  After has_html_view if it returns True, the next logic often is to read the contents so we can display the file. Or run a diff or something like that. In the cases where we're reading the file anyway, we could do a content-based check to confirm it is text. For simple file display that could work. Diff/commit logic might be trickier to sort out.
  
  I'm thinking its pretty good as-is for now, and we were mostly just confused by the .d extension.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Igor Bondarenko - 2015-10-22

status: in-progress --> open
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Igor Bondarenko - 2015-10-22

Ok, leaving as-is for now

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Apache Allura™

Forge software for hosting software projects

Milestone

Searches

Help

#7962 Better binary file detection

Discussion