#6622 Convert or handle Github markdown extensions

v1.1.0
closed
nobody
General
2015-08-20
2013-08-30
No

When importing github content (tickets, wiki, comments) we should deal with their special markup. For example, code blocks with optional language specification:

```javascript
function fancyAlert(arg) {
  if(arg) {
    $.facebox({div:'#foo'})
  }
}
```

should be converted to:

wzxhzdk:1

And strikethrough ~~example~~ should be converted to <s>example</s>. This we could possibly support directly in our Markdown renderer if we wanted to. That would also allow it to work for Markdown files in git repos (since we won't modify those during import).

Emoji I don't think we should handle (yet?)

Cross-reference syntax https://help.github.com/articles/github-flavored-markdown#references we may want to consider handling. See also Trac syntax [#6140] handling.

Converting markdown can be tricky to get right, so we have to be careful that we only convert the right content. Nested markup, escaped markup, etc.

Related

Tickets: #6140
Tickets: #6534
Tickets: #6535
Tickets: #6610
Tickets: #6622
Tickets: #6864

Discussion

  • Igor Bondarenko - 2013-09-10
    • status: open --> in-progress
     
  • Igor Bondarenko - 2013-09-10

    #435: [#6622] Wiki importer for github: convert markup properly (4cp)

     

    Related

    Tickets: #6622

  • Igor Bondarenko - 2013-10-01

    Do you want github-markdown syntax support for code blocks directly in Allura markdown renderer or just conversion during import?

     
  • Dave Brondsema

    Dave Brondsema - 2013-10-01

    I'd prefer conversion during import, so that our main rendering logic doesn't get any (more) complex.

     
  • Igor Bondarenko - 2013-10-14
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-10-14

    Closed #435. je/42cc_6622

     
  • Dave Brondsema

    Dave Brondsema - 2013-10-21
    • QA: Dave Brondsema
     
  • Dave Brondsema

    Dave Brondsema - 2013-10-25

    I have some concern about just using regexes, since a lot of false substitutions may be made (e.g. within code blocks or escaped text)

    pandoc can convert a lot of formats and does real parsing. It has github support too. pandoc -f markdown_github -t markdown_strict ~/dev/github-test.md works but loses some info like the language for a code block. Using -t markdown preserves it with a different syntax than Allura uses. It is the syntax that is supported via this extension: http://pythonhosted.org/Markdown/extensions/fenced_code_blocks.html (which also supports github syntax). Moreover, pandoc is GPL and would have to be run from the commandline (its written in haskell). Also it doesn't handle github's special shortlink syntaxes.

    I want to explore applying the regexes only within the right context, using a simple parser something like https://sourceforge.net/p/allura/pastebin/526ac1030910d42342ac23ef

     
  • Dave Brondsema

    Dave Brondsema - 2013-10-29
    • status: code-review --> in-progress
     
  • Dave Brondsema

    Dave Brondsema - 2013-10-29

    https://github.com/mxcl/homebrew/wiki/Tips-N%27-Tricks/ has a lot of text lost during conversion, and the ``` blocks don't get converted right. I wonder if this is due to the substitutions being simple regexes and not part of a parser? (see end of this comment)

    https://github.com/mxcl/homebrew/wiki/Formula-Cookbook gets converted pretty well, but the formatting specifier e.g. "ruby" doesn't get converted to ":::ruby"

    The ~~(.*)~~ regex is too greedy. Based on how github does strikeout, I think ~~(\S*?)~~ is better

    Minor: can \b replace (\s|^) and (\s|$)? I think that'd be cleaner.

    Whitespace handling at beginning & end of (\S+\s+)(#\d+) is different than all the other regexes.

    I think you should try a simple parser to handle pre-formatted blocks (indented and ~~~~) and escaped inline text with `` Here is a stub sample parser: https://sourceforge.net/p/allura/pastebin/526ac1030910d42342ac23ef as a starting point. It's untested/debugged so will need fixing. It also needs to be expanded to handle backticks. I'm thinking the regexes could be run within the handle_non_code method. The code block regex probably will need a different hook to run in. This seems quite complex to me, but the right way to do it. The Markdown package itself might also have a parser at the right level that can be hooked into and generate modified markdown again (I'm afraid it might be only suitable for generating HTML). Anyway, I think you should give this approach a shot at see how it goes and if it's going to work or not.

     
  • Igor Bondarenko - 2013-10-31

    Created #469: [#6622] Convert github markdown extension during import (followup:435) (3cp)

     

    Related

    Tickets: #6622

  • Igor Bondarenko - 2013-11-12
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-11-12

    Closed #469.

    Force-pushed je/42cc_6622 (rebase)

    Some notes:

    The ~~(.*)~~ regex is too greedy. Based on how github does strikeout, I think ~~(\S*?)~~ is better

    ~~(\S*)~~ is too restrictive, github allows spaces there. We've made that non-greedy though.

    Minor: can \b replace (\s|^) and (\s|$)? I think that'd be cleaner.

    Left it in two places (regex for #<ticket-num> and <commit-hash>). \b would broke rendering of links like user/project#ticket and user/project@hash if we change those.

     
  • Dave Brondsema

    Dave Brondsema - 2013-11-12

    I'm getting a failure on forgeimporters.github.tests.test_wiki:TestGitHubWikiImporter.test_convert_markup with all the latest requirements.

     
  • Dave Brondsema

    Dave Brondsema - 2013-11-12
     
  • Dave Brondsema

    Dave Brondsema - 2013-11-12

    Everything I've tested comes through fine. One minor thing we should look at is if we can support ``` without a blank line before it. E.g. at https://github.com/joyent/node/wiki/Installation#building-on-windows

    Github doesn't require a blank line before new types of blocks like code and lists. A little manual cleanup is probably ok, but since we're already handling ``` I think its worth trying to add the blank line in automatically. Not a big deal if its difficult, though.

     
  • Dave Brondsema

    Dave Brondsema - 2013-11-12
    • status: code-review --> in-progress
     
  • Igor Bondarenko - 2013-11-13

    Ow, seems like our internal CI skips that test (it requires html2text), that's why we missed that. Will fix. And handling of ``` without a blank line before it should be possible also.

     
  • Igor Bondarenko - 2013-11-13
    • status: in-progress --> code-review
     
  • Igor Bondarenko - 2013-11-13

    Closed #494. Pushed to je/42cc_6622.

    Fixed test and ``` and ~~~ blocks without blank line before.

    Note, there are some blocks in Mac OSX section that still not converted right. It's due to Allura markdown processor, which can't handle lists mixed with code blocks. I've tried a couple of options: indented code blocks, '~~~~~' code blocks, but it renders only list, or code block, not both.

     
  • Dave Brondsema

    Dave Brondsema - 2013-11-13
    • status: code-review --> closed
    • Milestone: forge-backlog --> forge-nov-15
     

Log in to post a comment.