ConfluenceConverter (file README.txt at debb4b2401fb)

Introduction
------------

ConfluenceConverter is a distribution of software that converts exported data
from Confluence wiki instances, provided in the form of an XML file, to a
collection of wiki pages and resources that can be imported into a MoinMoin
instance as a page package.

Migration Activities
--------------------

The following activities are involved in a migration from Confluence to
MoinMoin. First, the activities that can be performed from any location:

  * Export of Confluence content
  * Conversion of Confluence content to MoinMoin content
  * Confluence page identifier extraction and mapping to MoinMoin identifiers
  * Acquisition of Confluence user profile details

Then, the activities that are performed on the server:

  * Installation of MoinMoin
  * Initialisation of a MoinMoin wiki instance
  * Import of MoinMoin content into the new wiki instance
  * Installation of MoinMoin extensions
  * Initialisation of user profiles in MoinMoin
  * Installation of scripts and identifier mappings
  * Filesystem permission adjustments

Prerequisites
-------------

ConfluenceConverter requires a library called xmlread that can be found at the
following location:

http://hgweb.boddie.org.uk/xmlread

The xmlread.py file from the xmlread distribution can be copied into the
ConfluenceConverter directory.

ConfluenceConverter also requires access to the MoinMoin.wikiutil module found
in the MoinMoin distribution.

The moinsetup program is highly recommended for the installation of page
packages and the management of MoinMoin wiki instances:

http://moinmo.in/ScriptMarket/moinsetup

If moinsetup is not being used, the page package installer documentation
should be consulted:

http://moinmo.in/HelpOnPackageInstaller

To read Confluence user profiles on live Confluence sites using the
get_profiles.py program, the libxml2dom library is required:

http://hgweb.boddie.org.uk/libxml2dom

MoinMoin Prerequisites
----------------------

The page package installer does not preserve user information or the last
modified time when installing page revisions. This behaviour can be changed by
applying a patch to MoinMoin as follows while at the top level of the MoinMoin
source distribution:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-packages.diff

Here, CCDIR is the path to the top level of this source distribution where
this README.txt file is found.

When importing users, MoinMoin may be unable to handle user information
containing non-ASCII characters. Another patch to solve such problems can be
applied to MoinMoin as follows:

patch -p1 < $CCDIR/patches/patch-moin-1.9-MoinMoin-user.diff

Wiki Content Prerequisites
--------------------------

For the output of the converter, the following MoinMoin extensions are
required:

http://moinmo.in/ParserMarket/ImprovedTableParser
http://moinmo.in/ActionMarket/SubpageComments
http://moinmo.in/MacroMarket/Color2

A common dependency of various extensions is provided by MoinSupport:

http://hgweb.boddie.org.uk/MoinSupport

Additional Software
-------------------

PDF export support requires the ExportPDF action:

http://moinmo.in/ActionMarket/ExportPDF

This in turn requires Apache FOP for PDF production using XSL-FO:

http://xmlgraphics.apache.org/fop/

(On Debian systems, the fop package provides this tool.)

To produce XSL-FO from DocBook output, xsltproc is required from the libxslt
distribution:

http://xmlsoft.org/XSLT/

(On Debian systems, the xsltproc package provides this tool.)

And DocBook output requires the DocBook resources to be installed, described
in the following guide:

http://www.sagehill.net/docbookxsl/ToolsSetup.html

(On Debian systems, the docbook-xsl package provides these resources.)

Quick Start
-----------

(!) The acquisition of Confluence wiki content and its conversion can be
performed from any location, not necessarily on the server.

To obtain XML export archives from a Confluence wiki instance, the
exportspacexml.action resource is visited and the "Export" button selected.
For example, for the Mailman Wiki, the appropriate resource (with the COM
namespace selected) is as follows:

http://wiki.list.org/spaces/exportspacexml.action?key=COM

For your own instance, adjust the above URL accordingly. Alternatively, you
can find your way to the export page by selecting a namespace, then choosing
"Advanced" from the "Browse" menu, and then choosing "XML Export" from the
"Export" sidebar.
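
When several namespaces are to be exported, the export URLs can be generated
rather than typed. The following is a minimal standalone sketch in Python 3
and is not part of this distribution; the export_url function name is
hypothetical, and the URL layout is simply copied from the Mailman Wiki
example above, so it may differ on other Confluence versions.

```python
from urllib.parse import urlencode, urljoin

def export_url(site_root, space_key):
    """Return the XML export URL for one namespace of a Confluence site."""
    # Ensure a trailing slash so that urljoin appends to the site root
    # instead of replacing its last path component.
    base = site_root if site_root.endswith("/") else site_root + "/"
    query = urlencode({"key": space_key})
    return urljoin(base, "spaces/exportspacexml.action") + "?" + query

# Print one export URL per namespace of the example site.
for key in ["COM", "DEV", "DOC", "SEC"]:
    print(export_url("http://wiki.list.org", key))
```

Each URL still has to be visited in a browser with suitable privileges, since
the export itself is started by selecting the "Export" button.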

Given an XML export archive file for a Confluence wiki instance (in the
example below, the file is called COM-123456-789012.zip), the following
command can be used to prepare a page package for MoinMoin:

python convert.py COM-123456-789012.zip COM

In addition to the filename, a workspace name is required. Confluence appears
to require a workspace as a container for collections of pages, but this also
permits us to selectively import parts of a wiki into MoinMoin. If attachments
were included in the export from Confluence, these will be imported into the
page package.

The result of the above command will be a directory having the same name as
the chosen workspace, together with a zip archive for that directory's
contents. Thus, the above command would produce a directory called COM and an
archive called COM.zip.
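
Before uploading an archive to the server, it can be worth confirming that it
looks like a page package. The sketch below is a standalone Python 3 example,
not part of this distribution. It relies on MoinMoin page packages being zip
archives that contain a script file named MOIN_PACKAGE; the describe_package
function name is hypothetical.

```python
import zipfile

def describe_package(path):
    """Return (has_package_script, member_count) for a zip archive."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # MoinMoin looks for a member called MOIN_PACKAGE holding the
    # installation script for the package.
    return "MOIN_PACKAGE" in names, len(names)
```

For example, describe_package("COM.zip") would be expected to return True as
its first value for an archive produced by the conversion step.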

(!) The following step is performed on the server.

To import the result (although you may wish to process other namespaces
first), use moinsetup as follows:

python moinsetup.py -m install_page_package COM.zip

This requires a suitable moinsetup.cfg file in the working directory.

Importing Many Workspaces/Namespaces
------------------------------------

Where more than one namespace is to be imported, the page packages should be
merged so that the resulting history information is ordered correctly.

(!) This process can be performed from any location and the result uploaded to
the server for eventual import.

To merge packages, use a command of the following form:

python merge.py OUT COM.zip DEV.zip DOC.zip SEC.zip

A directory called OUT and a page package called OUT.zip will be produced. The
latter can then be imported into MoinMoin as described above.
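
To illustrate why merging matters, the following Python 3 sketch interleaves
the revisions of two namespaces by timestamp, so that a single import records
them in chronological order. It only illustrates the ordering idea and does
not show how merge.py works internally; the (timestamp, page, revision) tuples
and the merge_histories function are hypothetical.

```python
import heapq

def merge_histories(*histories):
    """Merge revision lists that are each already sorted by timestamp."""
    # heapq.merge interleaves the sorted inputs without re-sorting them.
    return list(heapq.merge(*histories, key=lambda entry: entry[0]))

# Hypothetical revision histories for two namespaces.
com = [(100, "COM/Home", 1), (300, "COM/Home", 2)]
dev = [(200, "DEV/Home", 1), (400, "DEV/Setup", 1)]

for timestamp, page, revision in merge_histories(com, dev):
    print(timestamp, page, revision)
```

Importing COM.zip and then DEV.zip separately would instead record all COM
revisions before any DEV revision, regardless of when the edits were made.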

Mappings from Identifiers to Pages
----------------------------------

Confluence uses numbers to label content revisions, and links to Confluence
sites sometimes use these numbers instead of a readable page name. MoinMoin,
meanwhile, only uses page names and has no external numeric identifier scheme.
Consequently, it is necessary to produce a mapping from Confluence identifiers
to MoinMoin page names. In addition to numeric identifiers, Confluence also
provides "tiny URLs" which are an alphanumeric encoding of the numeric
identifiers.

(!) This process can be performed with the converted content from any
location, with the generated files uploaded to the server for eventual
deployment.

To generate mappings for the Confluence content, use the mappings script as
follows:

tools/mappings.sh COM

Here, COM is a directory name containing converted Confluence content,
corresponding to a space name in the original Confluence wiki. More than one
space name can be used to generate a complete mapping for a site. For example:

tools/mappings.sh COM DEV DOC SEC

The following files are generated:

  * mapping-id-to-page.txt
  * mapping-tiny-to-id.txt
  * mapping-tiny-to-page.txt

The most useful of these is the first as it includes all the necessary
information provided by the arbitrary mapping from identifiers to page names.
The second mapping merely converts the "tiny URLs" to identifiers, which can
be done by applying an algorithm without any external knowledge of the wiki
structure. The third mapping is provided as a convenience, combining the "tiny
URL" conversion and the arbitrary mapping to page names.
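
The "tiny URL" conversion can be sketched as follows. This Python 3 example
assumes the commonly described Confluence scheme, which is not confirmed by
this distribution: the identifier is written as little-endian bytes, base64
encoded, stripped of padding and trailing "A" characters, and then has "/"
replaced by "-" and "+" replaced by "_". The function names are hypothetical,
and the results should be checked against mapping-tiny-to-id.txt before being
relied upon.

```python
import base64

def id_to_tiny(identifier):
    """Encode a numeric identifier as a tiny URL code (assumed scheme)."""
    raw = identifier.to_bytes(8, "little")
    text = base64.b64encode(raw).decode("ascii")
    # Drop the padding and the trailing "A" characters (all-zero bits).
    text = text.rstrip("=").rstrip("A")
    # Assumption: "/" is written as "-" and "+" as "_" in the URL.
    return text.replace("/", "-").replace("+", "_")

def tiny_to_id(tiny):
    """Decode a tiny URL code into a numeric identifier (assumed scheme)."""
    text = tiny.replace("-", "/").replace("_", "+")
    # Restore the stripped "A" characters up to the 11 characters that
    # encode 8 bytes, then add the single base64 padding character.
    text = text.ljust(11, "A") + "="
    return int.from_bytes(base64.b64decode(text), "little")

print(id_to_tiny(98319))
print(tiny_to_id(id_to_tiny(98319)))
```

Because the conversion is purely arithmetic, it needs no knowledge of the
wiki, which is why only the identifier-to-page mapping has to be deployed
alongside a program performing redirects.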

Translating Requests Using the Mappings
---------------------------------------

Where Web server facilities such as RewriteMap are available for use, the
first and third mapping files can be used directly. See the Apache
documentation for details of RewriteMap:

http://httpd.apache.org/docs/2.4/rewrite/rewritemap.html

Otherwise, it is more likely that the first file is used by a program that can
perform a redirect to the appropriate wiki page, and the "tiny URL" decoding
is also done by this program when deployed in a suitable location to receive
such requests. To support this, the following resources are provided:

  * scripts/redirect.py
  * config/mailmanwiki-redirect

The latter configuration file should be combined with the Web server
configuration file such that the appropriate aliases are able to capture
requests and invoke the redirect.py script before the main wiki aliases are
consulted. The script itself should be placed in a suitable filesystem
location, and the mapping-id-to-page.txt file should be placed alongside it,
or it should be placed in a different location and the MAPPING_ID_TO_PAGE
variable changed in the script to refer to this different location.
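
The lookup performed by such a program can be sketched as follows. This
Python 3 example is not the redirect.py script itself. It assumes that the
mapping file uses the plain-text RewriteMap layout of one whitespace-separated
key and value per line, which is suggested by the file being usable directly
with RewriteMap; the function names and the wiki_root parameter are
hypothetical.

```python
from urllib.parse import quote

def load_mapping(lines):
    """Parse "key value" lines, skipping blank lines and "#" comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping

def redirect_headers(mapping, identifier, wiki_root="/"):
    """Return CGI-style response headers for a Confluence identifier."""
    page = mapping.get(identifier)
    if page is None:
        return ["Status: 404 Not Found", "Content-Type: text/plain"]
    # quote leaves "/" intact, so subpage names remain path-like.
    return ["Status: 302 Found", "Location: " + wiki_root + quote(page)]

# Hypothetical mapping content for illustration.
mapping = load_mapping(["# id page", "123456 COM/Home"])
print("\n".join(redirect_headers(mapping, "123456")))
```

In a deployment, the lines would be read from the mapping-id-to-page.txt file
and the identifier taken from the request parameters.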

Supporting Confluence Action URLs
---------------------------------

Besides the "viewpage" action mapping identifiers to pages (covered by the
mapping described above), some other action URLs may be used in wiki content
and must either be translated or supported using redirects. Since external
sites may also employ such actions, a redirect strategy perhaps makes more
sense. To support this, the following resources are involved:

  * scripts/dashboard.py
  * scripts/redirect.py
  * scripts/search.py
  * config/mailmanwiki-redirect

The configuration file is also involved in identifier-to-page mapping, but in
this case it causes requests to the "dashboard", "doexportpage" and
"dosearchsite" actions to be directed to the dashboard.py, redirect.py and
search.py scripts respectively.

The dashboard.py script merely redirects requests to the root of the site,
thus assuming that the front page is configured to show dashboard-like
information.

The redirect.py script, apart from supporting identifier-to-page redirects,
also supports PDF page exports since the "doexportpage" action uses
identifiers to indicate which page is to be exported. In an environment that
uses .htaccess and mod_rewrite, the redirect.py script should also be deployed
under separate names (such as export.py and exportpdf.py) so that it can
discover whether it should be exporting a page instead of just showing it.

The search.py script redirects search requests in a suitable form to the
MoinMoin "fullsearch" action.
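
The translation of a search request can be sketched as follows. This Python 3
example is not the search.py script itself. It assumes that Confluence passes
the search terms in a "queryString" parameter and that the MoinMoin
"fullsearch" action reads them from a "value" parameter; both parameter names
should be checked against the sites involved. The search_redirect function
and its wiki_root parameter are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

def search_redirect(query_string, wiki_root="/"):
    """Build a MoinMoin search URL from a Confluence search query string."""
    params = parse_qs(query_string)
    # Take the first value of the assumed Confluence parameter, if any.
    terms = params.get("queryString", [""])[0]
    query = urlencode({"action": "fullsearch", "value": terms})
    return wiki_root + "?" + query

print(search_redirect("queryString=mailing+list"))
```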

Identifying and Migrating Users
-------------------------------

Confluence export archives do not contain user profile information, but page
versions are marked with user identifiers. Therefore, a list of user
identifiers can be obtained by running a script extracting these identifiers.
The following command writes to standard output the users involved with
editing the wiki in four different spaces (exported to four directories):

tools/users.sh COM DEV DOC SEC

This output can be edited and then passed to a program which fetches other
profile details as follows:

tools/users.sh COM DEV DOC SEC > users.txt

After editing...

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

If no users are to be removed in migration, the following command could be
issued:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
> profiles.txt

The get_profiles.py program needs to be told the URL of the original
Confluence site. Note that it accesses the site at a default rate of around
one request per second; a different delay between requests can be specified
using an additional argument.
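
The effect of the delay can be sketched as follows. This Python 3 example
shows only the general technique of pausing between requests and is not taken
from get_profiles.py; the fetch_all function, its parameters and the stand-in
fetch function are hypothetical.

```python
import time

def fetch_all(usernames, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(username) for each name, pausing between requests."""
    results = []
    for index, username in enumerate(usernames):
        # Pause before every request except the first one.
        if index:
            sleep(delay)
        results.append(fetch(username))
    return results

# A stand-in fetch function is used here instead of a real HTTP request.
print(fetch_all(["alice", "bob"], lambda name: name.upper(), delay=0.1))
```

With a few hundred users and the default delay, fetching all profiles takes
several minutes, which is worth bearing in mind when planning a migration.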

(!) The above steps can be performed from any location, but the command
pipelines below need to be run on the server due to the use of a program that
updates the deployed wiki.

The output of the get_profiles.py program can be passed to another program
which adds users to MoinMoin, and so the following commands can be used:

  cat profiles.txt \
| tools/addusers.py wiki

Alternatively, the users can be converted to profiles and immediately added
without creating a profiles file:

  cat users.txt \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

Or, using a single command without inspecting the users or profiles at all:

  tools/users.sh COM DEV DOC SEC \
| tools/get_profiles.py http://wiki.list.org/ \
| tools/addusers.py wiki

The addusers.py program needs to be told the directory containing the wiki
configuration.

Output Structure
----------------

The structure of a converted workspace is a directory hierarchy containing the
following directories:

  * pages     (a collection of directories defining each page or content item,
               corresponding to Page, Comment and BlogPost elements in the XML
               exported from Confluence)

  * versions  (a collection of files, each defining a revision or version of
               some content, corresponding to BodyContent elements in the XML
               exported from Confluence)

Each page directory contains the following things:

  * pagetype    (one of "Page", "Comment" or "BlogPost")

  * manifest    (a list of version entries in a format similar to the MoinMoin
                 page package manifest format)

  * attachments (a list of attachment version entries in a format similar to
                 the MoinMoin page package manifest format)

  * pagetitle   (an optional page title imposed on the page by another content
                 item)

  * children    (a list of child page names defined for the page)

  * comments    (a list of creation date plus comment page identifier pairs)

In the output structure, content items such as comments are represented as
pages, and each references a content version. Since comments will ultimately
be represented as subpages of some parent page, they will have a pagetitle
file in their directory with an appropriate subpage name written according to
the parent page's name and comment details.
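
The structure can be inspected with a short program. The following Python 3
sketch is not part of this distribution. It assumes that pagetype is a small
text file inside each page directory, as the list above suggests; the
page_types function name is hypothetical.

```python
import os

def page_types(workspace_dir):
    """Map each page directory name to the content of its pagetype file."""
    pages_dir = os.path.join(workspace_dir, "pages")
    result = {}
    for name in sorted(os.listdir(pages_dir)):
        type_file = os.path.join(pages_dir, name, "pagetype")
        # Skip anything that lacks a pagetype file.
        if os.path.isfile(type_file):
            with open(type_file) as handle:
                result[name] = handle.read().strip()
    return result
```

Calling page_types("COM") after a conversion would then give a quick count of
how many pages, comments and blog posts were found in the export.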

Troubleshooting
---------------

The page package import activity in particular can be a source of problems.
Generally, any error occurring when attempting to import a package is likely
to be due to insufficient privileges when writing to the pages directory of a
wiki or to its edit-log file.

The moinsetup software can generate scripts that set the ownership of wiki
files or apply ACLs (access control lists) to those files in order to make
access to wiki data more convenient. Where the ownership of the files must be
set (to www-data or nobody), the import step can be run as that user given
sufficient privileges. However, the easiest solution is to apply ACLs, thus
allowing the user who created the wiki to retain write access to it.
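
Write access can be checked before attempting an import. The following
Python 3 sketch is not part of this distribution and must be run as the user
who will perform the import. The unwritable function name is hypothetical,
and the example paths in the comment assume the usual MoinMoin layout with a
data directory inside the wiki directory.

```python
import os

def unwritable(paths):
    """Return those paths that the current user cannot write to."""
    # os.access also reports False for paths that do not exist.
    return [path for path in paths if not os.access(path, os.W_OK)]

# Example (paths are assumptions about the wiki layout):
# print(unwritable(["wiki/data/pages", "wiki/data/edit-log"]))
```

An empty result indicates that both locations are writable by the current
user, so any remaining import errors are likely to have another cause.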

Contact, Copyright and Licence Information
------------------------------------------

The current Web page for ConfluenceConverter at the time of release is:

http://hgweb.boddie.org.uk/ConfluenceConverter

Copyright and licence information can be found in the docs directory - see
docs/COPYING.txt and docs/LICENCE.txt for more information.