#6464 Create tracker importer for Google Code using CSV and scraping

v1.0.1
closed
Tracker
2015-08-20
2013-07-15
Cory Johns
No

Since the Google Data API for Issues is deprecated and was already scheduled to be shut down (June 14th, 2013), we'll need to create an implementation that uses the CSV list and scraping so that the Google Code importer continues to work.

The importer should follow the framework discussed on the mailing list and integrate with the project importer from [#6456].

    The list of tickets and their metadata can be retrieved via the CSV export list, e.g., https://code.google.com/p/modwsgi/issues/csv but the ticket body and comments will need to be scraped from the web interface.  The description and comments can be retrieved from, e.g., https://code.google.com/p/modwsgi/issues/detail?id=22 by iterating over the items with `id="hc\d+"` or `class="issuedescription|issuecomment"`.
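
A minimal sketch of the two steps described above, using only the standard library and inline sample data instead of live requests. The CSV column name `ID`, the sample markup, and the helper names `list_ticket_ids`, `CommentScraper`, and `scrape_items` are assumptions for illustration, not the importer's actual code:

```python
import csv
import io
import re
from html.parser import HTMLParser

HC_ID = re.compile(r"^hc\d+$")  # element ids named in the description above
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

# Sample data standing in for the real CSV export and issue detail page.
SAMPLE_CSV = 'ID,Summary\n22,"Crash on startup"\n23,"Docs typo"\n'
SAMPLE_HTML = """
<div class="issuedescription" id="hc0"><pre>First line<br>of the description</pre></div>
<div class="issuecomment" id="hc1"><pre>A comment</pre></div>
<div id="other">ignored</div>
"""


def list_ticket_ids(csv_text):
    """Return ticket ids from CSV export text (the "ID" column is assumed)."""
    return [row["ID"] for row in csv.DictReader(io.StringIO(csv_text))]


class CommentScraper(HTMLParser):
    """Collect the text of every element whose id looks like hc0, hc1, ..."""

    def __init__(self):
        super().__init__()
        self.items = []       # (element id, text) pairs in page order
        self._current = None  # id of the hc element being read, if any
        self._depth = 0       # nesting depth inside that element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Void tags never get an end tag; keep <br> as a newline.
            if tag == "br" and self._current is not None:
                self._chunks.append("\n")
            return
        if self._current is not None:
            self._depth += 1
            return
        elem_id = dict(attrs).get("id") or ""
        if HC_ID.match(elem_id):
            self._current, self._depth, self._chunks = elem_id, 1, []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.items.append((self._current, "".join(self._chunks).strip()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def scrape_items(html):
    """Return (id, text) pairs for the description and comments in `html`."""
    scraper = CommentScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.items
```

In a real importer the CSV and HTML would be fetched from the URLs above; `list_ticket_ids(SAMPLE_CSV)` and `scrape_items(SAMPLE_HTML)` show the shape of the result without touching the network.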

The description and comments on issues don't support wiki syntax or HTML, so we can just convert them to text. User mapping will have the same issues, so whatever we end up doing in [#6461] will apply here.

Related

Tickets: #6456
Tickets: #6461

Discussion

  • Cory Johns - 2013-07-24
    • Milestone: forge-aug-09 --> forge-backlog
     
  • Dave Brondsema - 2013-07-24
    • Milestone: forge-aug-09 --> forge-backlog
     
  • Cory Johns - 2013-07-26
    • Milestone: forge-backlog --> forge-aug-09
     
  • Dave Brondsema - 2013-07-26
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -2,6 +2,6 @@
    
     The importer should follow the framework discussed on the [mailing list](http://mail-archives.apache.org/mod_mbox/incubator-allura-dev/201307.mbox/%3CCAEMb8zUg7Kem2aDxVzAqF3U4aKEj7jL3UO=UpX=2+NfY_P8kXQ@mail.gmail.com%3E) and integrate with the project importer from [#6456].
    
    -The list of tickets and their metadata can be retrieved via the CSV export list, e.g., https://code.google.com/p/modwsgi/issues/csv but the ticket body and comments will need to be scraped from the web interface.  The description and comments can be retrieved from, e.g., https://code.google.com/p/modwsgi/issues/detail?id=22 by iterating over the items with `id="hc\d+"` or `class="issuedescription|issuecomment"`.
    +        The list of tickets and their metadata can be retrieved via the CSV export list, e.g., https://code.google.com/p/modwsgi/issues/csv but the ticket body and comments will need to be scraped from the web interface.  The description and comments can be retrieved from, e.g., https://code.google.com/p/modwsgi/issues/detail?id=22 by iterating over the items with `id="hc\d+"` or `class="issuedescription|issuecomment"`.
    
     The description and comments on issues don't support wiki syntax or HTML, so we can just convert them to text.  User mapping will have the same issues, so whatever we end up doing in [#6461] will apply here.
    
    • Size: --> 2
     

  • Cory Johns - 2013-07-29
    • assigned_to: Cory Johns
     
  • Cory Johns - 2013-08-07
    • status: open --> code-review
     
  • Cory Johns - 2013-08-07

    Why did you add a bunch of space to that line?

     
  • Cory Johns - 2013-08-07

    Not sure why my comment didn't post when I changed the status (got a random 500), but here it is:

    allura:cj/6464

    I will probably add more tests with some actual HTML data but this is working and ready for review.

     
  • Dave Brondsema - 2013-08-07
    • QA: Dave Brondsema
     
  • Dave Brondsema - 2013-08-07
    • status: code-review --> in-progress
     
  • Dave Brondsema - 2013-08-07

    Failure against https://code.google.com/p/google-code-feed-gadget/issues/detail?id=1 and http://code.google.com/p/modwsgi/issues/detail?id=11

      File "/home/dbrondsema/dbrondsema-1019/forge/ForgeImporters/forgeimporters/google/tracker.py", line 53, in import_tool
        self.process_fields(ticket, issue)
      File "/home/dbrondsema/dbrondsema-1019/forge/ForgeImporters/forgeimporters/google/tracker.py", line 82, in process_fields
        owner=issue.get_issue_owner(),
      File "/home/dbrondsema/dbrondsema-1019/forge/ForgeImporters/forgeimporters/google/__init__.py", line 166, in get_issue_owner
        return UserLink(self.page.find(id='issuemeta').find('th', text=re.compile('Owner:')).findNext().a)
      File "/home/dbrondsema/dbrondsema-1019/forge/ForgeImporters/forgeimporters/google/__init__.py", line 185, in __init__
        self.name = tag.string.strip()
    AttributeError: 'NoneType' object has no attribute 'string'
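
    The traceback shows `tag` arriving as `None` in `UserLink.__init__`. One possible guard, sketched here with a hypothetical `SafeUserLink` class and a `FakeTag` stand-in for a parsed element (neither exists in the code base, and the real fix may differ), would treat a missing link as "no name":

```python
class FakeTag:
    """Stand-in for a parsed <a> element; only the .string attribute is modeled."""

    def __init__(self, string):
        self.string = string


class SafeUserLink:
    """Hypothetical variant of UserLink that tolerates a missing tag."""

    def __init__(self, tag):
        # `tag` may be None when the lookup found no <a> element,
        # and tag.string may itself be None for an empty element.
        text = getattr(tag, "string", None)
        self.name = text.strip() if text else None
```

    With this shape, `SafeUserLink(None).name` is `None` instead of raising `AttributeError`, and callers can decide how to map a ticket that has no owner.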
    

    Would we want to convert # of stars to # of upvotes?

    Fields for type, priority, opsys, component (more possible?) should be added as custom fields and converted.

    Need to use skip_mod_date (grep for examples) to preserve the mod_date you set.

    Need to disable notifications. googlecodewikiimporter does this already, and for the Trac importer I suggested looking at a way to make it happen for all importers.

    Need to call g.post_event('project_updated')

    Everything is done as the current user. Would it be better to do it as *anonymous? That's what some of our other importers do.

    Since GC tickets and comments are plain text, whitespace is significant and should be preserved. Special markdown chars also need to be escaped (e.g. http://code.google.com/p/modwsgi/issues/detail?id=1 and http://code.google.com/p/modwsgi/issues/detail?id=4#c5). To do so, use forgeblog.command.rssfeeds.plain2markdown(). That needs html2text, which is GPL'd, so make sure you handle the lack of html2text gracefully. (And if you refactor plain2markdown to a more generic place, make sure you update SF's forge-classic code reference to it.)
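
    A rough sketch of the idea, not the actual plain2markdown code: escape markdown-special characters with a regex, optionally keep runs of spaces, and import html2text only if it is installed. The function name `plain_to_markdown_sketch`, the character set, and the `&nbsp;` strategy are assumptions for illustration:

```python
import re

try:
    import html2text  # optional dependency; may be absent for licensing reasons
except ImportError:
    html2text = None  # the regex fallback below is used regardless

# Characters that markdown treats specially (an assumed, non-exhaustive set).
MD_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!])")
# Any space that directly follows another space.
EXTRA_SPACE = re.compile(r"(?<= ) ")


def plain_to_markdown_sketch(text, preserve_multiple_spaces=False):
    """Escape markdown-special characters in plain text.

    This sketch always uses the regex fallback. A fuller version could
    delegate to html2text when it is available; that call is left out
    because its exact API is not shown in this ticket.
    """
    escaped = MD_SPECIAL.sub(r"\\\1", text)
    if preserve_multiple_spaces:
        # Replace each repeated space so markdown neither collapses the run
        # nor reads four leading spaces as a code block.
        escaped = EXTRA_SPACE.sub("&nbsp;", escaped)
    return escaped
```

    Passing the flag by keyword, as in `plain_to_markdown_sketch(body, preserve_multiple_spaces=True)`, keeps the call readable at the import site.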

    Comments aren't posted on the Allura ticket in sequential order. They seem random.
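
    If the scraped comments carry their `hc\d+` ids, one way to restore posting order is to sort on the numeric suffix. This assumes the suffix reflects the original sequence, which this ticket does not confirm; `comment_order` is a hypothetical helper:

```python
import re


def comment_order(elem_id):
    """Return the numeric suffix of an id such as "hc12" as an int."""
    return int(re.search(r"\d+$", elem_id).group())


def sort_comments(comments):
    """Sort (element id, text) pairs by the number at the end of the id."""
    return sorted(comments, key=lambda pair: comment_order(pair[0]))
```

    Sorting numerically matters because a plain string sort would place "hc10" before "hc2".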

    Attachment on a comment didn't get imported (from modwsgi #1)

     
  • Cory Johns - 2013-08-13
    • status: in-progress --> code-review
     
  • Cory Johns - 2013-08-13

    Changes force-pushed to:
    allura:cj/6464

     
  • Dave Brondsema - 2013-08-14
    • status: code-review --> in-progress
     
  • Dave Brondsema - 2013-08-14

    Needs to be rebased, there are some significant conflicts with master.

    And then check to see if https://pypi.python.org/pypi/GoogleCodeWikiImporter needs corresponding changes too.

    Better to call h.plain2markdown(..., preserve_multiple_spaces=True) than h.plain2markdown(..., True)

    ForgeBlog/forgeblog/tests/test_commands.py:test_plain2markdown should be moved to Allura's helper test file. It would also be very good to have a test case for the \\ you added to md_chars_matcher_all (was it just a typo fix?).

    Our internal forge-classic repo needs changes to correspond to the plain2markdown/re_preserve_spaces changes.

     
  • Cory Johns - 2013-08-14
    • status: in-progress --> code-review
     
  • Cory Johns - 2013-08-14

    allura:cj/6464
    forge-classic:cj/6464
    googlecodewikiimporter:cj/6464

     
  • Dave Brondsema - 2013-08-15
    • status: code-review --> in-progress
     
  • Cory Johns - 2013-08-19
    • status: in-progress --> code-review
     
  • Cory Johns - 2013-08-19

    Force-pushed.
    allura:cj/6464
    forge-classic:cj/6464
    googlecodewikiimporter:cj/6464

     
  • Dave Brondsema - 2013-08-20
    • over-encoded summary line
    • voting needs to be enabled on the ticket for you to see the votes
    • close milestones if all the tickets are closed
    • sort the milestones before saving them, so they show up sorted
    • wiki import is broken - I'm checking on this
    • preserve original ticket #s
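
    The two milestone items above could look roughly like this. The ticket dict shape (`milestone`, `status`), the `complete` flag, and the `build_milestones` name are assumptions for illustration, not the tracker's real schema:

```python
def build_milestones(tickets):
    """Return milestones sorted by name, marked complete when all tickets are closed.

    `tickets` is an iterable of dicts with "milestone" and "status" keys
    (a hypothetical shape chosen for this sketch).
    """
    statuses_by_name = {}
    for ticket in tickets:
        name = ticket.get("milestone")
        if not name:
            continue  # tickets without a milestone do not create one
        statuses_by_name.setdefault(name, []).append(ticket["status"])
    return [
        {"name": name, "complete": all(s == "closed" for s in statuses)}
        for name, statuses in sorted(statuses_by_name.items())
    ]
```

    Sorting happens once, just before the list is saved, so the stored order matches what users see.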
     
  • Cory Johns - 2013-08-20

    Fixes pushed.

     
  • Dave Brondsema - 2013-08-21
    • status: code-review --> closed
    • Milestone: forge-aug-09 --> forge-aug-23
     
  • Cory Johns - 2013-08-21

    Oh man, that's nice to see. :-)

     
