#7962 Better binary file detection

unreleased
open
42cc (432)
General
nobody
2015-10-22
2015-08-12
No

Improve our binary/text file detection.

here is an example of a jpg with a ".d" extention that made it through the has_html_view function( allura.model.repository.Blob#has_html_view)

Performance should be a primary consideration because of the large number of calls on bigger commits.

Discussion

  • Igor Bondarenko

    Igor Bondarenko - 2015-08-18
    • Owner: Anonymous --> Igor Bondarenko
    • Labels: --> 42cc
    • Status: open --> in-progress
     
  • Igor Bondarenko

    Igor Bondarenko - 2015-10-21

    Current algorithm for has_html_view looks something like this:

    1. Guess mimetype based on filename only (mimetypes.guess_type)
    2. If it is text/* assume we can display it
    3. If not - check various lists of "viewable extensions", if one of them contains extension of given filename assume we can display it
    4. If not - guess file type based on content (using python-magic) if it is text assume we can display it

    The problem with the example above is that ".d" is a valid extension for D programming language source files. On step (1) it is detected as such (mimetype text/x-dsrc).

    One way to fix this is always check content of the file if we think it is a text file and we can display it, but it will slow down things significantly. On my local machine doing content-based check is 184 times slower in the best case scenario (120ms vs. 0.65ms). Thus it will be slower for every viewable file (and most of them are, in a typical repo).

    I have checked binaryornot and filemagic libraries, but they're working with the same speed as python-magic, which we're using now. Most of the time is probably spent accessing the filesystem and not actualy guessing file's type.

    I can't think of any way to exclude false positives without a performance penalty. Any thoughts?

    To reduce performance penalty we can check file content only for files with more than one dot in the filename (like 2.jpg.d), but it seems like very poor heuristic to me.

     
    • Dave Brondsema

      Dave Brondsema - 2015-10-21

      Good analysis, I guess the .d extension indeed was a misleading example.

      After has_html_view if it returns True, the next logic often is to read the contents so we can display the file. Or run a diff or something like that. In the cases where we're reading the file anyway, we could do a content-based check to confirm it is text. For simple file display that could work. Diff/commit logic might be trickier to sort out.

      I'm thinking its pretty good as-is for now, and we were mostly just confused by the .d extension.

       
  • Igor Bondarenko

    Igor Bondarenko - 2015-10-22
    • status: in-progress --> open
     
  • Igor Bondarenko

    Igor Bondarenko - 2015-10-22

    Ok, leaving as-is for now

     

Log in to post a comment.