#6534 Wiki importer for github

v1.1.0
closed
nobody
General
2015-08-20
2013-08-07
No

GitHub wikis are git repositories and can be cloned directly, for example: git clone https://github.com/OpenRefine/OpenRefine.wiki. Check the main repo API first to see if the repo has its wiki enabled. See https://sourceforge.net/p/googlecodewikiimporter/git/ for reference as an example of another wiki importer. It is a separate repo because it needs the "html2text" package to convert HTML to markdown, and that is a GPL library.
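
A minimal sketch of the check described above, assuming the JSON returned by GitHub's repository endpoint (GET /repos/{owner}/{repo}) has already been fetched and parsed into a dict. The has_wiki and full_name fields are part of that response; the function name wiki_clone_url is hypothetical and not part of Allura.

```python
def wiki_clone_url(repo_info):
    """Return the wiki's git clone URL, or None if the wiki is disabled.

    `repo_info` is the parsed JSON for a repository, as returned by
    GitHub's `GET /repos/{owner}/{repo}` endpoint.
    """
    if not repo_info.get('has_wiki'):
        return None
    # The wiki lives in a sibling repository named "<repo>.wiki".
    return 'https://github.com/%s.wiki' % repo_info['full_name']


repo_info = {'full_name': 'OpenRefine/OpenRefine', 'has_wiki': True}
print(wiki_clone_url(repo_info))
```

The flag probably only says that the wiki feature is switched on, so an importer would likely still need to handle a clone that fails because no wiki page was ever created.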

GitHub supports many markup types. Find a full list and determine the best way to convert each of them to markdown. My guess is that few formats will have tools available to convert them directly to markdown, so my likely recommendation would be to render them as HTML (using pypeline as a generic way to handle many of those formats) and then run html2text to get the result into markdown.

If html2text or any other GPL library is needed, this will have to be a separate repo from the main Allura repo. So please evaluate & test the conversion options first, before putting code into place.

A second phase to all this (i.e. do it separately, after the basic import is all working) would be to handle revision history. This would mean going through each commit in the wiki git repo, and converting & updating every file that changes. This may be very time consuming, so when we get to it, we may want it to be a checkbox option, so users only do it if they want it.

Related

Tickets: #6534

Discussion

  • Anton Kasyanov - 2013-08-13

    Created
    #415 [#6534] Wiki importer for github: basic (3cp)
    #416 [#6534] Wiki importer for github: revision history (3cp)

     

    Related

    Tickets: #6534

  • Anton Kasyanov - 2013-08-13
    • status: open --> in-progress
     
  • Igor Bondarenko - 2013-09-05

    Closed #415. je/42cc_6534

     
  • Igor Bondarenko - 2013-09-10

    Found some bugs in markup conversion. Created #435: [#6534] Wiki importer for github: convert markup properly (3cp)

     

    Related

    Tickets: #6534

  • Dave Brondsema - 2013-09-10

    Perhaps similar to [#6622]?

     

    Related

    Tickets: #6622

  • Igor Bondarenko - 2013-09-10

    Exactly that :) Will do that separately, then.

     
  • Igor Bondarenko - 2013-09-10
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-09-10

    Closed #416. Force-pushed je/42cc_6534

    We'll implement proper markup conversion in [#6622].

     

    Related

    Tickets: #6622

  • Dave Brondsema - 2013-09-13
    • QA: Dave Brondsema
     
  • Dave Brondsema - 2013-09-13
    • status: code-review --> in-progress
     
  • Dave Brondsema - 2013-09-13
    • Docstring for tool_option is copied from another method, should be its own.
    • Minor style nitpick: import git is a 3rd-party lib so should be in the 2nd section of imports, not the 1st section
    • the forgewiki/templates/wiki/page_history.html change doesn't seem right. It previously showed the revision date and now it shows the previous revision date, it seems.
    • the github project name needs to allow uppercase characters (e.g. OpenRefine/OpenRefine)
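
    A possible shape for the project-name check, accepting uppercase characters. The character rules below are an assumption about what GitHub allows (letters, digits and hyphens for the owner; additionally dots and underscores for the repository), and both names are hypothetical, not taken from the importer.

```python
import re

# Assumed rule: owner is letters, digits and hyphens; the repository
# name additionally allows dots and underscores. Case is not restricted.
GITHUB_PROJECT_RE = re.compile(r'^[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+$')


def is_valid_github_project(name):
    """Check an "owner/repo" string, accepting uppercase characters."""
    return GITHUB_PROJECT_RE.match(name) is not None


print(is_valid_github_project('OpenRefine/OpenRefine'))
```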

    formatting

    • render_any_markup returns an HTML string. If we're handling Markdown input, we should keep the Markdown and not render it at all (just special conversions later in [#6622]). For all others, we might want to run it through html2text so that it can be markdown instead of HTML, but so far things are looking pretty good just saving HTML in the wiki markdown content, and staying free from the html2text dependency is nice.
    • I tested with https://github.com/mxcl/homebrew/wiki/_pages and found a few issues:
    • there's a mediawiki page which isn't supported by pypeline at all: http://pypeline.sourceforge.net/tour.html#getting-started
      • for all the formats supported by github and not by pypeline, can you evaluate adding support to pypeline? Actually, for mediawiki, we already have a mediawiki2markdown function which we might want to use as a special case. http://pypeline.sourceforge.net/tour.html#extending-pypeline shows how to extend pypeline and we can do that in Allura, but I'd rather see the support added to pypeline itself, so if there are good conversion methods we can create, let's go ahead and add them to pypeline directly.
    • textile pages end up displaying as plain text because they have a tab in front of each line of HTML, and that indentation triggers markdown's preformatted mode. Can you figure out where that's coming from and make sure we don't get leading whitespace on lines?
    • links go back to github still. We should rewrite all links that match the wiki URL prefix. I think you've done this for the trac import already, so that technique can be re-used (perhaps factored out into a helper).
    • many page names have dashes instead of spaces in them. I haven't investigated this fully enough to know how we want to handle it.
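
    Two of the issues above (leading whitespace on rendered lines, and links that still point at github) could be handled roughly as follows. This is only a sketch: both function names and the Allura URL prefix in the usage line are hypothetical, and it is not the technique used by the trac importer.

```python
import re


def strip_leading_whitespace(html):
    """Remove indentation from every line so Markdown does not treat
    the rendered HTML as a preformatted block."""
    return '\n'.join(line.lstrip() for line in html.splitlines())


def rewrite_wiki_links(html, github_prefix, allura_prefix):
    """Point href values that start with the GitHub wiki URL prefix
    at the imported wiki instead; leave all other links alone."""
    pattern = re.compile(r'(href=["\'])' + re.escape(github_prefix))
    return pattern.sub(lambda m: m.group(1) + allura_prefix, html)


html = '\t<p><a href="https://github.com/mxcl/homebrew/wiki/Installation">Install</a></p>'
html = strip_leading_whitespace(html)
# '/p/test/wiki/' is a made-up destination prefix for illustration.
print(rewrite_wiki_links(html, 'https://github.com/mxcl/homebrew/wiki/', '/p/test/wiki/'))
```

    Note that stripping every line would also flatten indentation inside pre blocks, so a real fix would probably need to skip those.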
     

    Related

    Tickets: #6622

  • Dave Brondsema - 2013-09-13

    There are also gollum tags (e.g. links to other wiki pages) that can be in any source format. We'll need to handle those.

    That could be done as part of [#6622], but so far [#6622] is just for handling github markdown -> Allura markdown conversions. And gollum tags need to be handled for all formats of conversion (markdown included?)

     

    Related

    Tickets: #6622

  • Igor Bondarenko - 2013-09-16

    OK, here's what I found about the GitHub-supported formats that pypeline can't handle. We can add support for a couple of them to pypeline fairly easily (asciidoc, mediawiki), but the rest would require quite a lot of work.

    ASCIIDoc: .asciidoc

    http://asciidoc.org/asciidocapi.html

    Uses GPLv2

    Requires installing the asciidoc package system-wide. The API is distributed as a standalone Python script, so it would have to be included directly in the pypeline repo or installed manually.

    I think pypeline support can be done in 2-3 cp.

    Org Mode: .org

    There are a couple of org-mode parsing libraries for Python, but all of them seem to just parse org-mode files into a tree of Orgnode objects, with no support for converting that tree into HTML. I'm not familiar with this format at all, and I think adding such support might be a fairly heavy task.

    Pod: .pod

    Perl documentation system. It seems only a Perl tool exists for converting this. It should be possible to write a Python wrapper around the command-line tool and use it in pypeline, but this may take a while.

    RDoc: .rdoc

    Ruby documentation system. It seems only a Ruby tool exists for converting this. A wrapper for the command-line tool could be created here as well, I think.

    MediaWiki: .mediawiki, .wiki

    Uses GPLv3

    Can add support to pypeline using python-mediawiki. Allura's mediawiki2markdown already uses it.

    Also, can convert to Allura-markdown using mediawiki2markdown. Both cases shouldn't be hard to implement. 1-2 cp, I guess.
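
    For reference, the extensions listed above can be collected into a lookup so the importer can tell which pages pypeline would not be able to render. This is an illustrative sketch only; the names UNSUPPORTED_BY_PYPELINE and unsupported_format are hypothetical.

```python
import os

# Extensions taken from the list above: GitHub wiki formats that
# pypeline could not render when this comment was written.
UNSUPPORTED_BY_PYPELINE = {
    '.asciidoc': 'ASCIIDoc',
    '.org': 'Org Mode',
    '.pod': 'Pod',
    '.rdoc': 'RDoc',
    '.mediawiki': 'MediaWiki',
    '.wiki': 'MediaWiki',
}


def unsupported_format(filename):
    """Return the format name if pypeline cannot handle the file, else None."""
    ext = os.path.splitext(filename)[1].lower()
    return UNSUPPORTED_BY_PYPELINE.get(ext)


print(unsupported_format('Home.mediawiki'))
```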

     
  • Igor Bondarenko - 2013-09-16

    Created:

    • #438: [#6534] Wiki importer for github: small fixes (1cp)
    • #439: [#6534] Wiki importer for github: textile pages fix (1cp)
    • #440: [#6534] Wiki importer for github: handle links (2cp)
    • #441: [#6534] Wiki importer for github: handle gollum tags (4cp)
     

    Related

    Tickets: #6534

  • Dave Brondsema - 2013-09-16

    Using searches just on README files, to get a ballpark on popularity (e.g. https://github.com/search?q=path%3AREADME.asciidoc&type=Code&ref=searchresults) I get:

    • asciidoc: 1000
    • org: 3750
    • pod: 2200
    • rdoc: 135,000
    • mediawiki: 800

    Rdoc is definitely popular and mediawiki is not. However, since we have an easy approach for mediawiki (which may be more popular for wikis than readmes) let's just do mediawiki. Go with the mediawiki2markdown approach. Remember that depends on optional GPL libraries so keep this conversion optional too.

    Let's leave the rest for later, we'll see what demand is for them.

    If it's possible to list the supported formats on the import form's description text, that would be great.

    Which reminds me, a separate issue is that we need an individual tool importer for github wiki. That is, specifically, a GitHubWikiImportController set on the importer's controller attribute.

     
  • Igor Bondarenko - 2013-09-17

    Created:

    • #442: [#6534] Wiki importer for github: handle mediawiki (2cp)
    • #443: [#6534] Wiki importer for github: import into existing project (1cp)
     

    Related

    Tickets: #6534

  • Igor Bondarenko - 2013-09-17

    Closed #438. Force-pushed je/42cc_6534

     
  • Igor Bondarenko - 2013-09-18

    It's possible to embed images and link to images and files (external and internal) using gollum tags (see https://github.com/gollum/gollum/wiki#file-links). Should we handle those too? It would require importing all the images/files from the github wiki repo into Allura as attachments or something similar. Should we do this right now?

     
  • Igor Bondarenko - 2013-09-18

    Now I think we should go with html2text conversion of html rendered by h.render_any_markup() for all input formats. There are several reasons for that:

    • It simplifies handling of gollum tags. For all formats we could just convert them to the appropriate markdown tags. With the current approach we have to handle two cases: one where the wiki is in markdown and we keep it as-is, and another where the wiki is in some other markup and we keep the HTML.
    • We'll be able to convert gollum [[_TOC_]] tag directly into markdown [TOC] tag.
    • It will keep imported history cleaner (seeing diffs of generated html isn't very pleasant)

    Are you ok with that?
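
    The table-of-contents case mentioned above is a plain text substitution. A sketch, assuming the gollum tag reaches this step unescaped (convert_toc is a hypothetical name):

```python
def convert_toc(text):
    """Replace gollum's table-of-contents tag with the markdown [TOC] tag."""
    return text.replace('[[_TOC_]]', '[TOC]')


print(convert_toc('[[_TOC_]]\n\nIntroduction'))
```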

     
  • Dave Brondsema - 2013-09-18

    Yep, makes sense.

    Like I've mentioned before, html2text needs to stay optional, so keep that conversion (and everything that must happen after it) optional, depending on the presence of html2text. If it's not installed, you'll just get a simpler, less complete conversion.
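
    One common way to keep a dependency optional is to guard the import and fall back when it is missing. The sketch below assumes the HTML has already been rendered (for example by h.render_any_markup()); the function name to_wiki_text is hypothetical.

```python
try:
    import html2text  # optional, GPL-licensed
except ImportError:
    html2text = None


def to_wiki_text(rendered_html):
    """Convert rendered HTML to Markdown when html2text is available;
    otherwise fall back to storing the HTML unchanged."""
    if html2text is None:
        return rendered_html
    return html2text.html2text(rendered_html)


print(to_wiki_text('<p>Hello</p>'))
```

    Any later step that relies on markdown output (such as converting gollum tags) would need to check the same flag.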

     
  • Dave Brondsema - 2013-09-18

    What sort of internal files are possible? To files in the git repo? Or can you have images & files in the wiki itself? Have any examples?

    I think external references can be kept as-is.

     
  • Igor Bondarenko - 2013-09-19

    You can have any type of file in the wiki repo and reference it from a page. When an image is referenced from the page text it ends up embedded in the page, and when something else is referenced (e.g. a pdf) it is just displayed as a link to the file.

    I've created an example page showing these capabilities: https://github.com/jetmind/dot/wiki/Files

    Source looks like this:

    [[Example pdf|example.pdf|width=400px]]
    
    [[Link to image|/image.jpg|width=400px]]
    
    [[image.jpg|frame|alt=hello|width=400px]]
    
    [[http://eofdreams.com/data_images/dreams/image/image-07.jpg|width=400px]]
    

    Also, when embedding an image there are a couple of available options (show in a frame, resize, align, etc.). I'm not sure we can convert all of those.

    I think we can upload such files as attachments to the pages they are referenced from, and convert the link/embed tags to markdown format.
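
    A rough sketch of the tag conversion for the four examples above. The rule it uses (a tag whose first part ends in an image extension is an embed, otherwise "text|target" is a link) is an assumption inferred from those examples, not gollum's full syntax. The names are hypothetical, display options such as frame and width are dropped, and page links are not distinguished from file links.

```python
import re

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')
GOLLUM_TAG = re.compile(r'\[\[(.+?)\]\]')


def convert_tag(match):
    parts = match.group(1).split('|')
    first = parts[0].strip()
    if first.lower().endswith(IMAGE_EXTENSIONS):
        # Embedded image: remaining parts are options such as "alt=hello".
        options = dict(p.split('=', 1) for p in parts[1:] if '=' in p)
        return '![%s](%s)' % (options.get('alt', ''), first)
    if len(parts) > 1:
        # "[[Link text|target|options]]": a link to a file or image.
        return '[%s](%s)' % (first, parts[1].strip())
    # Anything else (e.g. a single-part tag) is left untouched.
    return match.group(0)


def convert_file_tags(text):
    """Convert gollum file/image tags in `text` to markdown syntax."""
    return GOLLUM_TAG.sub(convert_tag, text)


print(convert_file_tags('[[image.jpg|frame|alt=hello|width=400px]]'))
```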

    External references can be kept as-is, indeed.

     
  • Igor Bondarenko - 2013-09-19

    Closed #443. je/42cc_6534

     
  • Dave Brondsema - 2013-09-19

    Ok, I see. I made [#6673] for it. We should definitely track this need, but I don't want to try to do too much all at once on this ticket :)

     

    Related

    Tickets: #6673

  • Igor Bondarenko - 2013-09-20

    Closed #441, #440. je/42cc_6534

     
