- Read the text file.
    with open("filename.txt") as f:
        text = f.read()
- Replace non-alphanumeric characters with whitespace (\W matches anything other than a letter, digit, or underscore).
    import re
    text = re.sub(r'\W', ' ', text)
- Change all characters to lowercase.
    text = text.lower()
- Split the text into a list of words.
    words = text.split()
- Display the number of words in the text file.
    print(len(words))
- Display the number of unique words in the text file.
    print(len(set(words)))
- Display the number of occurrences of each word.
    from collections import defaultdict
    # Missing keys start at 0, so each word can be incremented directly.
    wordsCount = defaultdict(int)
    for word in words:
        wordsCount[word] += 1
    for word, num in wordsCount.items():
        print(word, num)
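
The steps above can be gathered into one function, which makes them easy to try on a short string before pointing them at a file. This is only a minimal sketch: the function name `count_words` and the sample sentence are made up for illustration and are not part of the original post.

```python
import re
from collections import defaultdict


def count_words(text):
    """Return (total words, unique words, per-word counts) for a string."""
    # Replace every non-word character (not a letter, digit, or underscore) with a space.
    text = re.sub(r"\W", " ", text)
    # Lowercase the text and split it on whitespace.
    words = text.lower().split()
    # Count occurrences; missing keys start at 0.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return len(words), len(counts), dict(counts)


# Hypothetical sample input, used here instead of a file.
sample = "The cat saw the dog. The dog ran!"
total, unique, counts = count_words(sample)
print(total, unique)
for word, num in sorted(counts.items()):
    print(word, num)
```

To use it on a file, read the contents first (for example with `open("filename.txt")` as in the first step) and pass the resulting string to the function.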
Monday, June 1, 2009
Count the number of words using Python
This article presents a way to count the number of words, the number of unique words, and the number of occurrences of each word in a text file.
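
As an alternative sketch, the standard library's `collections.Counter` can produce all three numbers from a single object. The original post uses `defaultdict`, not `Counter`; the helper name `word_stats` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def word_stats(text):
    """Return the total word count and a Counter of word occurrences."""
    # Same cleanup as before: non-word characters become spaces, then lowercase and split.
    words = re.sub(r"\W", " ", text).lower().split()
    return len(words), Counter(words)


# Hypothetical sample input.
total, counts = word_stats("one fish two fish red fish blue fish")
print(total)                  # number of words
print(len(counts))            # number of unique words
print(counts.most_common(2))  # the two most frequent words with their counts
```

`Counter` is a `dict` subclass, so `counts.items()` can still be looped over exactly as with the `defaultdict` version.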
Labels: collections, count, defaultdict, defaultdict.items, file, len, number, open, Python, re, read, Set, words