#7757 UnicodeDecodeError when generating code snapshot on hg repo

Milestone: v1.3.1
Status: closed
Component: General
Updated: 2015-08-10
Created: 2014-10-10
Creator: Anonymous
Private: No

Originally created by: jwb1980

https://sourceforge.net/p/forge/site-support/8700/


[forge:site-support:#8700]


From IRC #sourceForge
download the source code of this project https://sourceforge.net/p/nhunspell/code/ci/default/tree/
3:55 When I try the snapshot Sourceforge says "We're having trouble finding that snapshot. Would you like to resubmit?"
3:55 TortoiseSVN gives me error 500 in my fork repository


Discussion

  • Dave Brondsema - 2014-10-10
    Traceback (most recent call last):
      File "/var/local/allura/Allura/allura/model/monq_model.py", line 265, in __call__
        self.result = func(*self.args, **self.kwargs)
      File "/var/local/allura/Allura/allura/tasks/repo_tasks.py", line 145, in tarball
        repo.tarball(revision, path)
      File "/var/local/allura/Allura/allura/model/repository.py", line 666, in tarball
        self._impl.tarball(revision, path)
      File "/var/local/env-allura/lib/python2.7/site-packages/TimerMiddleware-0.4.4-py2.7.egg/timermiddleware/__init__.py", line 117, in wrapper
        return self.run_and_log(func, inst, *args, **kwargs)
      File "/var/local/env-allura/lib/python2.7/site-packages/TimerMiddleware-0.4.4-py2.7.egg/timermiddleware/__init__.py", line 126, in run_and_log
        return func(*args, **kwargs)
      File "/var/local/env-allura/lib/python2.7/site-packages/ForgeHg-0.2.0-py2.7.egg/forgehg/model/hg.py", line 351, in tarball
        commands.archive(HgUI(), self._hg, path, rev=commit, prefix='')
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/commands.py", line 382, in archive
        matchfn, prefix, subrepos=opts.get('subrepos'))
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/archival.py", line 298, in archive
        write(f, 'x' in ff and 0755 or 0644, 'l' in ff, ctx[f].data)
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/archival.py", line 258, in write
        archiver.addfile(prefix + name, mode, islink, data)
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/archival.py", line 214, in addfile
        f = self.opener(name, "w", atomictemp=True)
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/scmutil.py", line 276, in __call__
        f = self.join(path)
      File "/var/local/env-allura/lib/python2.7/site-packages/mercurial-3.0-py2.7-linux-x86_64.egg/mercurial/scmutil.py", line 338, in join
        return os.path.join(self.base, path)
      File "/var/local/env-allura/lib64/python2.7/posixpath.py", line 71, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 27: ordinal not in range(128)
    
     
  • Dave Brondsema - 2014-10-10
    • summary: We're having trouble finding that snapshot. Would you like to resubmit SS8700 --> UnicodeDecodeError when generating code snapshot on hg repo
     
  • Dave Brondsema - 2015-06-18
    • labels: support --> support, sf-current
    • Description has changed
    • Component: allura-forge-classic --> General
     
  • Igor Bondarenko - 2015-06-19
    • labels: support, sf-current --> support, sf-current, 42cc
    • status: open --> in-progress
    • assigned_to: Igor Bondarenko
     
  • Dave Brondsema - 2015-06-29
    • labels: support, sf-current, 42cc --> support, sf-current, 42cc, sf-1
     
  • LXj - 2015-06-30

    So I made an interesting finding. The file in question is named de_DE_ö_frami.aff.

    I cloned the repo and ran some experiments in the same directory as the file. First I tried the usual tricks with .encode('utf-8') and the like, but they didn't help. But then it struck me:

    In [6]: path = os.listdir('.')[-11]
    
    In [7]: path
    Out[7]: 'de_DE_\xf6_frami.aff'
    
    In [8]: print path
    de_DE_�_frami.aff
    
    In [12]: "de_DE_ö_frami.aff" == 'de_DE_\xf6_frami.aff'
    Out[12]: False
    
    In [13]: os.listdir(u'.')[-11]
    Out[13]: 'de_DE_\xf6_frami.aff'
    
    In [14]: "de_DE_ö_frami.aff"
    Out[14]: 'de_DE_\xc3\xb6_frami.aff'
    

    So it seems that Python's os.listdir reports the filename incorrectly! I experimented with Cyrillic file names and found no problems:

    In [8]: os.listdir('.')[-8]
    Out[8]: '\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
    
    In [9]: print '\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
    привет
    
    In [10]: os.listdir(u'.')[-8]
    Out[10]: u'\u043f\u0440\u0438\u0432\u0435\u0442'
    
    In [12]: print os.listdir(u'.')[-8]
    привет
    

    Other random Unicode file names like "ᕕ┌◕ᗜ◕┐ᕗ" don't have this problem either:

    In [6]: os.listdir('.')[0]
    Out[6]: '(\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89\xe2\x95\xaf\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'
    
    In [7]: print os.listdir('.')[0]
    (╯°□°)╯︵ ┻━┻
    
    In [8]: os.listdir('.')[1]
    Out[8]: '\xe1\x95\x95\xe2\x94\x8c\xe2\x97\x95\xe1\x97\x9c\xe2\x97\x95\xe2\x94\x90\xe1\x95\x97'
    
    In [9]: print os.listdir('.')[1]
    ᕕ┌◕ᗜ◕┐ᕗ
    

    Conclusion: we have a strange, rare bug with Python's os module scrambling Unicode filenames.
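
    One way to cross-check the byte strings from the sessions above is to test whether each one is valid UTF-8. The sketch below is written in Python 3 syntax (the sessions above are Python 2), and the helper name `is_valid_utf8` is made up for illustration. It only shows that the byte 0xf6 is "ö" in Latin-1 and is not valid UTF-8, while the Cyrillic name decodes cleanly; it does not establish how the file name ended up stored that way.

```python
# Byte strings copied from the listdir output in the sessions above.
names = {
    'german': b'de_DE_\xf6_frami.aff',
    'cyrillic': b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82',
}


def is_valid_utf8(raw):
    """Return True if the byte string decodes as UTF-8 (hypothetical helper)."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Check every name, then decode the German one as Latin-1 for comparison.
results = {label: is_valid_utf8(raw) for label, raw in names.items()}
latin1_text = names['german'].decode('latin-1')

print(results)
print(latin1_text)
```

    If the German name is Latin-1 on disk, that would be consistent with os.listdir(u'.') handing it back as an undecoded str while the UTF-8 names come back as unicode.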

     

    Last edit: LXj 2015-06-30
  • Igor Bondarenko - 2015-07-02

    I saw this error on all repos with Unicode filenames while working on a Docker ticket. Generating a UTF-8 locale and setting it as the default fixed it for Docker:

    # Snapshot generation for SVN (and maybe other SCMs) might fail without this
    RUN locale-gen en_US.UTF-8
    ENV LANG en_US.UTF-8
    

    It seems like a deployment-specific thing to me. The server might be missing a locale that is needed to decode filenames properly. We'll investigate further to confirm.
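
    A quick way to see which encoding a given deployment picks up from its locale might look like the sketch below. The function name `locale_report` is made up for illustration, and the values printed depend entirely on the machine, so no expected output is shown.

```python
import locale
import os
import sys


def locale_report():
    """Collect the locale settings that influence filename decoding (hypothetical helper)."""
    return {
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'filesystem_encoding': sys.getfilesystemencoding(),
        # False: report the current setting without calling setlocale().
        'preferred_encoding': locale.getpreferredencoding(False),
    }


report = locale_report()
for key in sorted(report):
    print('%s = %s' % (key, report[key]))
```

    Running this on the server and inside the Docker container would show whether the two environments actually differ.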

     
  • Igor Bondarenko - 2015-07-04
    • status: in-progress --> review
     
  • Igor Bondarenko - 2015-07-04

    Closed #811. forgehg:ib/7757

    The problem was that we had the path to the archive directory as unicode. When mercurial concatenated it with the file name, which is a plain byte string and not unicode, Python tried to decode the file name as ASCII and failed. I've fixed it by encoding the path to the archive directory as a UTF-8 plain string.

    I could not fix the issue with browsing, though: https://sourceforge.net/p/nhunspell/code/ci/default/tree/NHunspell/UnitTests/de_DE_%C3%B6_frami.aff
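
    A minimal sketch of that failure and of the kind of fix described, not the actual forgehg patch. The archive directory `/tmp/snapshot` and the helper name `implicit_ascii_decode` are hypothetical, and the explicit `.decode('ascii')` stands in for the implicit conversion Python 2 performs when a byte string is concatenated with a unicode string:

```python
import os.path

# File name as it appears in the manifest: raw bytes, with 0xf6 for "ö".
manifest_name = b'NHunspell/UnitTests/de_DE_\xf6_frami.aff'

# Hypothetical archive directory, held as a text (unicode) string.
archive_dir = u'/tmp/snapshot'


def implicit_ascii_decode(byte_path):
    """Mimic posixpath.join's `path += '/' + b` when `path` is unicode."""
    return (b'/' + byte_path).decode('ascii')


try:
    implicit_ascii_decode(manifest_name)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The approach described above: make both sides byte strings before joining.
joined = os.path.join(archive_dir.encode('utf-8'), manifest_name)

print(error)
print(joined)
```

    The failing byte sits at offset 27 of '/' plus the file name, which matches the position reported in the traceback.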

    The error is:

     ManifestLookupError: NHunspell/UnitTests/de_DE_��_frami.aff@d1baa762529d: not found in manifest
    

    I did some digging:

    1. The string comes from the browser; we unquote it and convert it to unicode: u'/NHunspell/UnitTests/de_DE_\xf6_frami.aff'
    2. Then we encode it to pass to mercurial, and it looks like this: 'de_DE_\xc3\xb6_frami.aff'.
    3. But the mercurial manifest contains: 'NHunspell/UnitTests/de_DE_\xf6_frami.aff' (looks like (1), but str, not unicode)

    I've tried several places to fix it, but didn't succeed.
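
    The three steps above can be replayed in isolation. This is a sketch in Python 3 syntax (Allura at the time ran on Python 2), using only the file name part of the path. The Latin-1 comparison at the end is an illustration of why the lookup misses, not a proposed fix: a real repository could contain names in any encoding.

```python
from urllib.parse import unquote

# File name segment of the browse URL, and the name as the manifest stores it.
url_segment = 'de_DE_%C3%B6_frami.aff'
manifest_name = b'de_DE_\xf6_frami.aff'

# Step 1: unquote the URL segment into text (percent-escapes are read as UTF-8).
text_name = unquote(url_segment)

# Step 2: encode the text as UTF-8 before handing it to mercurial.
utf8_name = text_name.encode('utf-8')

# Step 3: compare with the manifest entry.
matches_as_utf8 = utf8_name == manifest_name

# For comparison only: the same text encoded as Latin-1.
latin1_name = text_name.encode('latin-1')
matches_as_latin1 = latin1_name == manifest_name

print(repr(utf8_name), matches_as_utf8)
print(repr(latin1_name), matches_as_latin1)
```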

     
  • Dave Brondsema - 2015-07-07
    • Reviewer: Dave Brondsema
     
  • Dave Brondsema - 2015-07-07

    Looks good, one step forward for the code snapshot. We can leave the ManifestLookupError issue when browsing for another day I guess.

     
  • Dave Brondsema - 2015-07-07
    • status: review --> closed
     
  • Dave Brondsema - 2015-07-13
    • labels: support, sf-current, 42cc, sf-1 --> support, 42cc, sf-1
     
  • Dave Brondsema - 2015-08-10
    • Milestone: unreleased --> v1.3.1
     
