When importing github content (tickets, wiki, comments) we should deal with their special markup. For example, code blocks with optional language specification:
```javascript function fancyAlert(arg) { if(arg) { $.facebox({div:'#foo'}) } } ```
should be converted to:
wzxhzdk:1
And strikethrough ~~example~~
should be converted to <s>example</s>
. This we could possibly support directly in our Markdown renderer if we wanted to. That would also allow it to work for Markdown files in git repos (since we won't modify those during import).
Emoji I don't think we should handle (yet?)
Cross-reference syntax https://help.github.com/articles/github-flavored-markdown#references we may want to consider handling. See also Trac syntax [#6140] handling.
Converting markdown can be tricky to get right, so we have to be careful that we only convert the right content. Nested markup, escaped markup, etc.
Tickets: #6140
Tickets: #6534
Tickets: #6535
Tickets: #6610
Tickets: #6622
Tickets: #6864
#435: [#6622] Wiki importer for github: convert markup properly (4cp)
Related
Tickets:
#6622Do you want github-markdown syntax support for code blocks directly in Allura markdown renderer or just conversion during import?
I'd prefer conversion during import, so that our main rendering logic doesn't get any (more) complex.
Closed #435.
je/42cc_6622
I have some concern about just using regexes, since a lot of false substitutions may be made (e.g. within code blocks or escaped text)
pandoc
can convert a lot of formats and does real parsing. It has github support too.pandoc -f markdown_github -t markdown_strict ~/dev/github-test.md
works but loses some info like the language for a code block. Using-t markdown
preserves it with a different syntax than Allura uses. It is the syntax that is supported via this extension: http://pythonhosted.org/Markdown/extensions/fenced_code_blocks.html (which also supports github syntax). Moreover,pandoc
is GPL and would have to be run from the commandline (its written in haskell). Also it doesn't handle github's special shortlink syntaxes.I want to explore applying the regexes only within the right context, using a simple parser something like https://sourceforge.net/p/allura/pastebin/526ac1030910d42342ac23ef
https://github.com/mxcl/homebrew/wiki/Tips-N%27-Tricks/ has a lot of text lost during conversion, and the
```
blocks don't get converted right. I wonder if this is due to the substitutions being simple regexes and not part of a parser? (see end of this comment)https://github.com/mxcl/homebrew/wiki/Formula-Cookbook gets converted pretty well, but the formatting specifier e.g. "ruby" doesn't get converted to ":::ruby"
The
~~(.*)~~
regex is too greedy. Based on how github does strikeout, I think~~(\S*?)~~
is betterMinor: can
\b
replace(\s|^)
and(\s|$)
? I think that'd be cleaner.Whitespace handling at beginning & end of
(\S+\s+)(#\d+)
is different than all the other regexes.I think you should try a simple parser to handle pre-formatted blocks (indented and
~~~~
) and escaped inline text with `` Here is a stub sample parser: https://sourceforge.net/p/allura/pastebin/526ac1030910d42342ac23ef as a starting point. It's untested/debugged so will need fixing. It also needs to be expanded to handle backticks. I'm thinking the regexes could be run within thehandle_non_code
method. The code block regex probably will need a different hook to run in. This seems quite complex to me, but the right way to do it. The Markdown package itself might also have a parser at the right level that can be hooked into and generate modified markdown again (I'm afraid it might be only suitable for generating HTML). Anyway, I think you should give this approach a shot at see how it goes and if it's going to work or not.Created #469: [#6622] Convert github markdown extension during import (followup:435) (3cp)
Related
Tickets:
#6622Closed #469.
Force-pushed
je/42cc_6622
(rebase)Some notes:
~~(\S*)~~
is too restrictive, github allows spaces there. We've made that non-greedy though.Left it in two places (regex for
#<ticket-num>
and<commit-hash>
).\b
would broke rendering of links likeuser/project#ticket
anduser/project@hash
if we change those.I'm getting a failure on
forgeimporters.github.tests.test_wiki:TestGitHubWikiImporter.test_convert_markup
with all the latest requirements.Everything I've tested comes through fine. One minor thing we should look at is if we can support ``` without a blank line before it. E.g. at https://github.com/joyent/node/wiki/Installation#building-on-windows
Github doesn't require a blank line before new types of blocks like code and lists. A little manual cleanup is probably ok, but since we're already handling ``` I think its worth trying to add the blank line in automatically. Not a big deal if its difficult, though.
Ow, seems like our internal CI skips that test (it requires html2text), that's why we missed that. Will fix. And handling of ``` without a blank line before it should be possible also.
Closed #494. Pushed to
je/42cc_6622
.Fixed test and ``` and ~~~ blocks without blank line before.
Note, there are some blocks in Mac OSX section that still not converted right. It's due to Allura markdown processor, which can't handle lists mixed with code blocks. I've tried a couple of options: indented code blocks, '~~~~~' code blocks, but it renders only list, or code block, not both.