User:Tbayer (WMF)/Converting Google Docs to wikitext
- Export the document from Google Drive as ODT (File -> Download as -> OpenDocument Format (.odt))
- Install (LibreOffice and) the "Wiki Publisher" extension for LibreOffice (available e.g. in the Ubuntu software store)
- Export the document from LibreOffice as MediaWiki wikitext (File -> Export -> File type: MediaWiki)
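- The resulting wikitext file should have most of the formatting preserved, even tables. But there is an annoying bug/feature: a link that looks like this in GDocs comes out in the exported wikitext as a run of adjacent links to the same URL (rendered roughly as [1]looklikethis), with the space before the link moved into an otherwise empty link such as [http://www.example.com ]. To fix these, and also to remove extraneous blank lines, run this little script (Python needs to be installed):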
python gdocodtwikimultilfix.py gdocodtwiki.txt gdocodtwiki_fixed.txt
where gdocodtwikimultilfix.py is the following (save it locally as a text file in the directory where you want to do the conversion):
#!/usr/bin/python
# Short script to take wikitext generated by the "Wiki Publisher" extension
# for LibreOffice from an ODT file exported from Google Docs
# and fix duplicated external links, as well as remove extraneous blank lines
# By T. Bayer ([[user:HaeB]])
import sys
import re
import codecs


class gdocodtmultilfixerror(Exception):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return repr(self.value)


if len(sys.argv) != 3:
    raise gdocodtmultilfixerror('needs exactly two command line arguments: 1. input file (non-fixed wikitext, output of Wiki Publisher) 2. output file (fixed wikitext)')

# A URL is taken to run up to the next space
urlpattern = r'https?://[^\ ]*'
inputfilename = sys.argv[1]
outputfilename = sys.argv[2]
inputfile = codecs.open(inputfilename, mode='r', encoding='utf-8')
outputfile = codecs.open(outputfilename, mode='w', encoding='utf-8')
precedingline = '\n'
for line in inputfile:
    m = line
    urls = set(re.findall(urlpattern, m))
    for url in urls:
        urle = re.escape(url)
        # Somehow, the space before an external link gets moved into the link during the export process.
        # Replace this ('[http://www.example.com ]' --> ' ')
        old = r'([^\]])\['+urle+r' \]\['+urle
        new = r'\1 ['+url
        m = re.sub(old, new, m)
        # Collapse duplicated links:
        old = r'\['+urle+'( [^\\]]*)]\['+urle+' '
        new = u'['+url+r'\1'
        while re.search(old, m):
            m = re.sub(old, new, m)
    # Collapse multiple blank lines to one:
    if not (precedingline == '\n' and m == '\n'):
        outputfile.write(m)
    precedingline = m
inputfile.close()
outputfile.close()
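To see what the script does to a single line without running a whole conversion, its two steps can be pulled out into small functions. The following is only a sketch for Python 3, not part of the original script: the names fix_links and collapse_blank_lines and the sample line are made up for illustration, and the replacements are built with lambdas instead of replacement strings so that a backslash in a URL is not interpreted as a regex escape.

```python
import re

# Same assumption as the script: a URL runs up to the next space.
URL_PATTERN = r'https?://[^\ ]*'


def fix_links(line):
    """Apply the script's two link substitutions to one line of wikitext."""
    for url in set(re.findall(URL_PATTERN, line)):
        urle = re.escape(url)
        # Move the space out of an empty link: 'x[URL ][URL' -> 'x [URL'
        empty = r'([^\]])\[' + urle + r' \]\[' + urle
        line = re.sub(empty, lambda mo: mo.group(1) + ' [' + url, line)
        # Merge adjacent links to the same URL: '[URL a][URL b]' -> '[URL ab]'
        dup = r'\[' + urle + r'( [^\]]*)\]\[' + urle + ' '
        while re.search(dup, line):
            line = re.sub(dup, lambda mo: '[' + url + mo.group(1), line)
    return line


def collapse_blank_lines(lines):
    """Yield lines, dropping leading blank lines and repeats of a blank line."""
    preceding = '\n'
    for line in lines:
        if not (preceding == '\n' and line == '\n'):
            yield line
        preceding = line


# Hypothetical sample of what the exporter might produce for one link
sample = ('See[http://www.example.com ][http://www.example.com our ]'
          '[http://www.example.com site] now.\n')
print(fix_links(sample), end='')
# prints: See [http://www.example.com our site] now.
```

Note that the merged link text is simply concatenated, so a space between words survives only if the exporter kept it inside one of the link fragments.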
There may still be other formatting errors (e.g. bolded text that is not bolded in the original, or vice versa), but for longer documents this solution can save a lot of time compared to manual conversion.
One may consider turning off "smart quotes" in Google Docs ("Tools" -> "Preferences" -> uncheck "Use smart quotes").
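If the document was already written with smart quotes on, the curly quotes can instead be replaced in the exported wikitext afterwards. This is a minimal sketch, assuming that plain ASCII quotes are what is wanted; the function name straighten_quotes and the choice of the four characters to replace are illustrative and not part of the original page.

```python
# Map the four common curly quote characters to their ASCII equivalents.
SMART_QUOTES = {
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark / apostrophe
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
}
_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(_TABLE)


print(straighten_quotes('\u201cIt\u2019s done\u201d'))
# prints: "It's done"
```

Since two or more consecutive straight apostrophes are italic/bold markup in wikitext, it is worth checking the result for places where a converted apostrophe ends up next to another one.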
See also
- mw:User:JHernandez (WMF)/How to migrate from g00gle docs to wikitext
- en:Wikipedia:Tools#Importing (converting) content to Wikipedia (MediaWiki) format
- en:Wikipedia:Tools/Editing_tools#From_OpenOffice_and_LibreOffice
- Writer2MediaWiki for OpenOffice (doesn't seem to work with LibreOffice 4.1, though)
- https://github.com/rampradeepk/google-docs-to-wiki ?
- mw:Extension:Html2Wiki#Features