Script to generate URS from Wikipedia
A persons’ URS is a phrase that could be used instead of his/her usual name in all circumstances, which makes it absolutely clear who he/she is. A good URS for a person should meet the following criteria:
- Everyone familiar with the person will confidently recognise him/her from the URS.
- There is no possibility that the URS could also describe anyone other than the person.
- Even someone who isn’t familiar with the person will have some understanding of who he/she is from the URS.
#!/usr/bin/python """ Script to generate URS from the starting paragraph of Wikipedia articles about persons. by Pravin Paratey (pravinp -at- gmail.com) Current Implementation: ---------------------- 1. Extract first sentence 2. Clean wiki markup 3. Observing given data, and the data on wikipedia, shows that there is a pattern that is followed while writing wikipedia entries for persons. Replacing (was/is)(an/a/the/) with (/the) does the trick 4. Output sentence formed Ideally: -------- Ideally, the piece of code should identify the following concepts: 1. Name of person 2. Time period 3. Son/Daughter/Father/Mother of (in case of famous personality) 4. Renowned for How do we go about it? 1 and 2 - straight forward. Wikipedia gives cues through its markup 3 - straight forward. String matching using "son of", "daughter of", etc 4 - will need to match against a database. For 3, we only keep the "son of", "daughter of", "X of Y" if Y is a prominent person. An easy way of doing this is using incoming links on wikipedia OR to search for X and Y individually on google and noting the number of results. """ import re, sys, codecs def cleanUri(m): """ Cleans Uri wiki markup """ word = m.group(1) if '|' in word: word = word.split('|')[1] return word.strip() def dotRemove(m): """ Replaces . by # inside tags """ return m.group(0).replace('.', '#') def cleanMarkup(text): """ Removes 1. wiki markup 2. sanitize html entities 3. comments """ #text = re.sub(r"\[\[[\w\s\-,]+\|(\w+)\]\]", r"\1", text) text = re.sub(r"\[\[(.*?)\]\]", cleanUri, text) text = re.sub(r"\{\{.*?\}\}", r"", text) text = re.sub(r".*?<\/ref>", r"", text) text = re.sub(r"---.*?---", r"", text) text = re.sub(r"\[.*?\]", r"", text) text = text.replace("'''", "").replace("''", "'") text = text.replace("[[", "").replace("]]", "") text = text.replace("–", "-").replace("&", "&") return text def getFirstSentence(text): """ Returns the text until first instance of '.' It also makes sure that the '.' isn't part of a wiki link or name""" tmp = re.sub(r"\[\[.*?\]\]", dotRemove, text) tmp = re.sub(r"\[.*?\]", dotRemove, tmp) tmp = re.sub(r".*?<\/ref>", dotRemove, tmp) tmp = re.sub(r"---.*?---", dotRemove, tmp) tmp = re.sub(r"'''.*?'''", dotRemove, tmp) tmp = re.sub(r"''.*?''", dotRemove, tmp) index = tmp.find('.') if index == -1: return text else: return text[:index] def makeArticle(m): """ Changes a, an to the when appropriate """ retval = ', the' if len(m.group(2)) == 0: retval = ' ' return retval def extractURS(text): """ The function to call. Returns the URS """ text = getFirstSentence(text) text = cleanMarkup(text) text = re.sub(r",?\s+(was|is)\s+(an|the|a|)", makeArticle, text) return text if __name__ == '__main__': #fp = open(sys.argv[1]) fp = codecs.open("input.txt", "r", "utf-8") fp2 = codecs.open("output.txt", "w", "utf-8") fp2.write(codecs.BOM_UTF8.decode("utf-8")), # Add BOM for UTF-8 for line in fp: line = line.rstrip() if len(line) == 0 or line.startswith("#"): # For debugging continue urs = extractURS(line) fp2.write(urs + '\r\n') fp.close() fp2.close()
Example Inputs and Outputs
These are inputs from Wikipedia (Click on the article and then Edit). Ex Lala Lajpat Rai. The above script outputs the URS.
Example Input: ”‘B. S. Johnson “’ (Bryan Stanley Johnson) ([[5 February]],[[1933]] - [[13 November]],[[1973]]) was an English experimental novelist, poet, literary critic and film-maker. Script Output: B. S. Johnson (Bryan Stanley Johnson) (5 February,1933 - 13 November,1973), the English experimental novelist, poet, literary critic and film-maker
How are URS used?
URS can be directly substituted in a sentence containing that persons’ name. (Hover over Bhagat Singh to see this URS. ex. Bhagat Singh was executed by the British in 1931. This way, a person who had no idea who Bhagat Singh was, now has more context about the person.