Improve our binary/text file detection.
Here is an example of a JPG with a ".d" extension that made it through the has_html_view function (allura.model.repository.Blob#has_html_view).
Performance should be a primary consideration because of the large number of calls on bigger commits.
We should test the https://pypi.python.org/pypi/binaryornot/ library and see if it's better than our current use of the `magic` library.

The current algorithm for `has_html_view` looks something like this:

1. Guess the type from the filename (`mimetypes.guess_type`); if it is `text/*`, assume we can display it.
2. Otherwise check the content (`python-magic`); if it is text, assume we can display it.

The problem with the example above is that ".d" is a valid extension for D programming language source files. On step (1) it is detected as such (mimetype `text/x-dsrc`).

One way to fix this is to always check the content of the file when we think it is a text file that we can display, but that will slow things down significantly. On my local machine the content-based check is 184 times slower in the best-case scenario (120ms vs. 0.65ms). Thus it would be slower for every viewable file (and most files are viewable in a typical repo).
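To make the false positive concrete, here is a minimal sketch of the two-step flow described above. It is not the actual Allura code: the function signature and the `content_is_text` callable (a stand-in for the `python-magic` check) are assumptions for illustration, and the `.d` mapping is registered explicitly because the default `mimetypes` table depends on the system.

```python
import mimetypes
from typing import Callable

# Private MIME table so the example does not depend on the system's mime.types.
# The ".d" -> text/x-dsrc mapping mirrors what was observed in this ticket.
_types = mimetypes.MimeTypes()
_types.add_type("text/x-dsrc", ".d")


def has_html_view(filename: str, content_is_text: Callable[[], bool]) -> bool:
    """Hypothetical sketch of the two-step check, not the real implementation.

    content_is_text is a stand-in for the content-based check; it is only
    called when the filename alone does not settle the question.
    """
    # Step 1: cheap, filename-only guess.
    mime_type, _encoding = _types.guess_type(filename)
    if mime_type is not None and mime_type.startswith("text/"):
        return True
    # Step 2: expensive, content-based check.
    return content_is_text()


if __name__ == "__main__":
    # A JPEG named "2.jpg.d" is accepted at step 1; its content is never inspected.
    print(has_html_view("2.jpg.d", lambda: False))
```

Because step 1 returns early, the content check never gets a chance to reject the misnamed JPEG, which is exactly the false positive reported here.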
I have checked the `binaryornot` and `filemagic` libraries, but they run at the same speed as `python-magic`, which we're using now. Most of the time is probably spent accessing the filesystem rather than actually guessing the file's type.

I can't think of any way to exclude false positives without a performance penalty. Any thoughts?
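For anyone who wants to reproduce this kind of comparison, a rough timing sketch follows. It is an assumption-laden stand-in: it times a plain local file rather than a repository blob, and `by_content` is a trivial NUL-byte check, not `python-magic`, so the absolute numbers will differ from the ones quoted above.

```python
import mimetypes
import os
import tempfile
import timeit


def by_name(path: str):
    """Filename-only guess; never touches the file."""
    return mimetypes.guess_type(path)[0]


def by_content(path: str) -> bool:
    """Stand-in content check: open the file and look for a NUL byte."""
    with open(path, "rb") as f:
        head = f.read(1024)
    return b"\x00" not in head


def compare(path: str, number: int = 1000):
    """Return average seconds per call for (name-based, content-based) checks."""
    name_s = timeit.timeit(lambda: by_name(path), number=number)
    content_s = timeit.timeit(lambda: by_content(path), number=number)
    return name_s / number, content_s / number


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".d", delete=False) as tmp:
        tmp.write(b"void main() {}\n")
    try:
        name_avg, content_avg = compare(tmp.name)
        print(f"name-based: {name_avg:.6f}s  content-based: {content_avg:.6f}s")
    finally:
        os.remove(tmp.name)
```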
To reduce the performance penalty, we could check file content only for files with more than one dot in the filename (like `2.jpg.d`), but that seems like a very poor heuristic to me.

Good analysis, I guess the .d extension indeed was a misleading example.
After `has_html_view`, if it returns True, the next step is often to read the contents so we can display the file, or to run a diff or something like that. In the cases where we're reading the file anyway, we could do a content-based check to confirm it is text. For simple file display that could work. Diff/commit logic might be trickier to sort out.

I'm thinking it's pretty good as-is for now, and we were mostly just confused by the .d extension.
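If we revisit this later, the read-time confirmation for the simple display case could be sketched roughly as below. The names `confirm_text` and `render_blob` are hypothetical, and the NUL-byte plus UTF-8 check is a simple stand-in heuristic rather than what `python-magic` does.

```python
import codecs
from typing import Optional


def confirm_text(content: bytes, sample_size: int = 4096) -> bool:
    """Cheap confirmation on bytes that were already read for display."""
    sample = content[:sample_size]
    if b"\x00" in sample:
        return False
    # Incremental decoding tolerates a multi-byte character cut off by the
    # sample boundary, but still rejects invalid byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def render_blob(content: bytes, passed_has_html_view: bool) -> Optional[str]:
    """Return displayable text, or None if the blob should not be shown inline.

    passed_has_html_view is the result of the existing cheap check; the
    content is only inspected here because it has been read anyway.
    """
    if passed_has_html_view and confirm_text(content):
        return content.decode("utf-8", errors="replace")
    return None
```

The extra cost is limited to files we were about to read for display, so the per-commit calls to `has_html_view` stay as cheap as they are today.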
Ok, leaving as-is for now