Super quick Find & Replace
Today we are going to explore how to perform a fast search and replace over a file when the number of strings to find and replace is large. That is, you are given a set of source and target words like the ones in the table below,
Source (Find) | Target (Replace) |
---|---|
motives | oranges |
whale | watermelon |
idea | bulk |
… | … |
himself | itself |
And an input file like,
Chief among these motives was the overwhelming idea of the great whale himself. Such a portentous and mysterious monster roused all my curiosity.
The task is to find the best algorithm to replace every source word in the input text with its corresponding target word. For the above input, the output will be,
Chief among these oranges was the overwhelming bulk of the great watermelon itself. Such a portentous and mysterious monster roused all my curiosity.
As always, you can get the complete source code and the test files for this article from the sidebar.
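All three implementations below read the find/replace pairs from a plain text file, one whitespace-separated pair per line. For the table above, that pair list file would look something like:

motives oranges
whale watermelon
idea bulk
himself itself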
Naive solution
The easiest way to implement this is to search the text for every word in the source list and replace it with the corresponding word in the target list. Note that this scans the entire text once per source word, so the running time grows with the size of the word list. Example code (Python 2, as in the rest of this article) would look like,
#!/usr/bin/env python
import sys, time, hashlib

class Naive:
    """ Naive implementation of find/replace for a massive word list. """

    def __init__(self, filename):
        # Load the word -> replacement pairs, one whitespace-separated pair per line.
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()
        # One full pass over the text per source word. The trailing space keeps the
        # match from ending in the middle of a longer word; words followed by
        # punctuation or at the very end of the text are not replaced.
        for word in self._words.keys():
            text = text.replace(word + ' ', self._words[word] + ' ')
        return text

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)
    n = Naive(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        # Print an MD5 of the output (to compare implementations) and the elapsed time.
        print filename, hashlib.md5(n.replace_all(filename)).hexdigest(), time.time() - start
        #print n.replace_all(filename)
{: class="prettyprint linenums:true"}
Using Regular expressions
If the words to find are plain alphabetic words, you can use a regular expression (or even a string.split(), as sketched below) to find every word in the text and replace them all in a single pass. This is much faster than the naive approach when the word list is large, because the input text is scanned only once no matter how many pairs there are.
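The string.split() variant is not what we benchmark below, but a minimal sketch of the idea, splitting on whitespace, looking each token up in the pair dictionary and re-joining, might look like the following. Note that it loses the original whitespace and will miss words with punctuation stuck to them (e.g. "himself."):

def replace_all_split(words, text):
    # words: dict mapping source word -> replacement.
    # Tokens not in the dict are passed through unchanged.
    return ' '.join(words.get(token, token) for token in text.split())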
#!/usr/bin/env python
import sys, re, time, hashlib

class Regex:
    """ Regex implementation of find/replace for a massive word list. """

    def __init__(self, filename):
        # Load the word -> replacement pairs, one whitespace-separated pair per line.
        self._words = {}
        fp = open(filename)
        for line in fp:
            word, replacement = line.strip().split()
            self._words[word] = replacement
        fp.close()

    def replace_func(self, matchObj):
        # Called once per word in the text: return the replacement if the word
        # is in our list, otherwise return the word unchanged.
        key = matchObj.group(0)
        if key in self._words:
            return self._words[key]
        else:
            return key

    def replace_all(self, filename):
        fp = open(filename)
        text = fp.read()
        fp.close()
        # A single pass: the regex finds every run of letters and the callback
        # decides whether to replace it.
        return re.sub("[a-zA-Z]+", self.replace_func, text)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)
    r = Regex(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        print filename, hashlib.md5(r.replace_all(filename)).hexdigest(), time.time() - start
        #print r.replace_all(filename)
{: class="prettyprint linenums:true"}
Using a Trie
A trie is a data structure specifically tuned to this task: an ordered tree that looks like so,
*(Figure: a trie containing the words "apple", "ape" and "zoo".)*
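Since the figure may not be reproduced here, a rough sketch of that same trie, written out in the (replacement, children) tuple form used by the Trie class below, could look like the following; the replacement strings are made up purely to show the shape of the tree:

# Hypothetical replacements, just to illustrate the structure.
trie = (None, {
    'a': (None, {
        'p': (None, {
            'p': (None, {
                'l': (None, {
                    'e': ('APPLE-REPLACEMENT', {}),
                }),
            }),
            'e': ('APE-REPLACEMENT', {}),
        }),
    }),
    'z': (None, {
        'o': (None, {
            'o': ('ZOO-REPLACEMENT', {}),
        }),
    }),
})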
#!/usr/bin/env python
import sys, time, hashlib

class Trie:
    """
    Trie class

    We use a tuple (replacement, children-dict) to represent a node. That is, the pairs
    1. ant -> spider (replace every occurrence of ant with spider)
    2. any -> every
    will be represented as,
    (None, {
        'a': (None, {
            'n': (None, {
                't': ('spider', {}),
                'y': ('every', {})
            }),
        })
    })
    """

    def __init__(self, dict_file):
        self._trie = (None, {})
        fp = open(dict_file)
        for line in fp:
            words = line.strip().split()
            self.add(words[0], words[1])
        fp.close()

    def add(self, text, replacement, node=None):
        """ Adds a word/replacement pair to the trie. """
        if node is None: node = self._trie
        new_replacement, children, c = None, node[1], text[0]
        # The replacement is stored only on the node for the word's last character.
        if len(text) == 1: new_replacement = replacement
        if c not in children:
            children[c] = (new_replacement, {})
        elif len(text) == 1:  # Existing node becomes a word end: attach the replacement.
            (old_replacement, old_node) = children[c]
            children[c] = (new_replacement, old_node)
        if len(text) > 1:
            self.add(text[1:], replacement, children[c])

    def replace_all(self, text):
        output = []
        max_len, start = len(text), 0
        while start < max_len:
            # Walk the trie as far as the text allows, remembering the
            # replacement stored on the last node visited.
            (replacement, cur_node) = self._trie
            end = start
            while end < max_len and text[end] in cur_node:
                (replacement, cur_node) = cur_node[text[end]]
                end += 1
            # Replacement found AND the word has actually ended (end of text,
            # or a non-alphabetic character follows).
            if replacement is not None and (end == max_len or not text[end].isalpha()):
                output.append(replacement)
            else:
                output.append(text[start:end])
            # Emit the character we stopped on (the delimiter) and continue after it.
            if end != max_len: output.append(text[end])
            start = end + 1
        return ''.join(output)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python %s <find-replace-list> <file1> <file2>" % sys.argv[0]
        sys.exit(1)
    t = Trie(sys.argv[1])
    for filename in sys.argv[2:]:
        start = time.time()
        fp = open(filename)
        data = fp.read()
        fp.close()
        print filename, hashlib.md5(t.replace_all(data)).hexdigest(), time.time() - start
        #print t.replace_all(data)
{: class="prettyprint linenums:true"}
Results
# words | % found | Naive (seconds) | Regex (seconds) | Trie (seconds) |
---|---|---|---|---|
10 | 10% | 0.000076 | 0.000054 | 0.000190 |
 | 50% | 0.000050 | 0.000068 | 0.000159 |
 | 100% | 0.000049 | 0.000065 | 0.000211 |
100 | 10% | 0.000338 | 0.000680 | 0.000987 |
 | 50% | 0.000397 | 0.000287 | 0.000878 |
 | 100% | 0.000389 | 0.000268 | 0.000635 |
1000 | 10% | 0.020482 | 0.003205 | 0.008794 |
 | 50% | 0.020797 | 0.001782 | 0.007492 |
 | 100% | 0.022443 | 0.001600 | 0.006307 |
10,000 | 10% | 1.992886 | 0.017261 | 0.085929 |
 | 50% | 2.053409 | 0.015645 | 0.082155 |
 | 100% | 2.163350 | 0.017048 | 0.069294 |
100,000 | 10% | 213.5203 | 0.193326 | 0.910784 |
 | 50% | 226.2446 | 0.182818 | 0.843763 |
 | 100% | 254.2690 | 0.198825 | 0.755635 |
Huh? Isn’t the Trie supposed to be faster?
Yes, similar implementations of Naive, Regex and Trie in a language like C/C++ would normally have the Trie outperforming the other two. However, in Python, walking the trie character by character through dictionaries and tuples happens entirely in interpreted code, and that overhead makes it rather slow compared to the Regex implementation, whose scanning loop runs inside the re module's C code.
If I do get around to it, I will re-implement the above code in C++.
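In the meantime, staying in Python, one way to push even more of the work into the C regex engine, not benchmarked above, is to compile a single alternation of all the source words. How well this scales to a 100,000-word list depends on how the re module copes with a very large alternation, so treat it as a sketch rather than a drop-in winner; note also that the \b word boundaries are slightly stricter than the whitespace-based matching used earlier:

import re

def build_replacer(words):
    # words: dict mapping source word -> replacement.
    # Longer words first, so longer alternatives win over their prefixes.
    pattern = re.compile(r'\b(?:%s)\b' % '|'.join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)))
    return lambda text: pattern.sub(lambda m: words[m.group(0)], text)

# Example: build_replacer({'whale': 'watermelon'})('a whale.') -> 'a watermelon.'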
Note: Generating a fair input file
I have generated test files to measure the performance of the algorithms along the following dimensions (a sketch of such a generator follows the list),
- Number of (find, replace) pairs.
- Size of the input file.
- Percentage occurrence - the fraction of words in the input file that come from the "find" set. We will be testing for < 10%, 50% and > 90% occurrence.
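The generator that actually produced the test files ships with the downloadable source below; purely to illustrate the idea, a minimal sketch (hypothetical function and parameter names) might look like:

import random

def generate_input(find_words, filler_words, n_words, hit_ratio):
    # find_words and filler_words are lists of words. Build a text of n_words
    # tokens in which roughly hit_ratio of the tokens come from the "find" set
    # and the rest are filler words.
    tokens = []
    for _ in range(n_words):
        if random.random() < hit_ratio:
            tokens.append(random.choice(find_words))
        else:
            tokens.append(random.choice(filler_words))
    return ' '.join(tokens)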
Resources
Download the source code for this article (including the fair input file generator)