Introduction to Python Programming - Word Counter Program

Introduction



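In the previous article, we looked at how to control program execution using if-elif statements, match statements, for and while loops, pass statements, and try-except statements. Here, we will put those pieces to work by building a small program.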

Word Counter Program



Building a word counter will allow us to put some of the concepts we have learned into practice.

Suppose you have a string, either declared literally or read from a file, and you want to find its unique words and count how often each one occurs. Let us create a variable called text.

text = "A very long text declared or read from a file. \
    The text may contain very long lines but you should not worry because you have python skills. " \
    "Let us begin the conquest"

If you have a file in the same directory/folder as your Python file, you can replace the above variable with:

    file = open("your_file_name.txt")
    text = file.read()
    file.close()
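
A with statement is a common alternative that closes the file automatically when its block ends. The sketch below is only illustrative: it first writes a small sample file into a temporary folder so that it can run on its own, and the name sample_words.txt is a made-up placeholder, not a file from this article.

```python
import os
import tempfile

# Create a small sample file so the example runs on its own.
# The folder and file name are placeholders for illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample_words.txt")

with open(path, "w", encoding="utf-8") as file:
    file.write("The quick brown fox.\nThe lazy dog.")

# The with statement closes the file when the block ends,
# so no explicit file.close() call is needed.
with open(path, encoding="utf-8") as file:
    text = file.read()

print(text)
print(file.closed)  # True
```

The rest of the program does not change, since text is still an ordinary string.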

String comparison in Python is case-sensitive, so words like "The" and "the" are not the same. To count them together, convert the text to lowercase.

text = text.lower()

To get the individual words in text, use the split() method, which splits on whitespace when called with no arguments. Optionally, you can sort the resulting text_split list using the list's sort() method.

text_split = text.split()

text_split.sort()
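
To see what split() and sort() do on a smaller input, here is a short sketch. The sample string and the variable names are made up for illustration.

```python
sample = "the  cat\nsat on\tthe mat"

# With no argument, split() breaks on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
words = sample.split()
print(words)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# sort() reorders the list in place and returns None.
words.sort()
print(words)  # ['cat', 'mat', 'on', 'sat', 'the', 'the']

# sorted() leaves its input alone and returns a new list.
descending = sorted(sample.split(), reverse=True)
print(descending)  # ['the', 'the', 'sat', 'on', 'mat', 'cat']
```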

View the unique words in the list using the set() function.


unique_words = set(text_split)

print(unique_words)

{'read', 'from', 'file.', 'because', 'lines', 'very', 'declared', 'you', 'should', 'text', 'a', 'conquest', 'worry', 'may', 'us', 'not', 'have', 'long', 'python', 'let', 'contain', 'skills.', 'the', 'begin', 'but', 'or'}

Notice that some words, like file and skills, end with a period (.). Let us remove those periods using the map() function.

# a map returns a "map object"
text_split = map(lambda x: x[: -1] if x.endswith(".") else x,  text_split)

# convert back to list
text_split = list(text_split)

map() takes a function and an iterable (such as a list) and applies the function to every element. The lambda you see above creates a small, one-time-use function without a name. We will see more of map and lambda in my next article on functions :)

The expression inside the lambda, x[: -1] if x.endswith(".") else x, is an example of an if expression (also called a conditional expression), which was covered in the previous article. If the word ends with a period, take a slice that excludes the last character ([: -1]). If not, return the word unchanged.

print(text_split)
['a', 'a', 'because', 'begin', 'but', 'conquest', 'contain', 'declared', 'file', 'from', 'have', 'let', 'lines', 'long', 'long', 'may', 'not', 'or', 'python', 'read', 'should', 'skills', 'text', 'text', 'the', 'the', 'us', 'very', 'very', 'worry', 'you', 'you']
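
The same cleanup can also be written without map and lambda. The sketch below is an alternative, not the approach used in this article: it uses list comprehensions together with the string methods rstrip() and strip(), on a made-up list of words. Note that rstrip(".") removes every trailing period, whereas the slice above removes exactly one character.

```python
import string

# A made-up list of words for illustration.
words = ["file.", "skills.", "long", "wait...", "(hello)", "don't"]

# Remove trailing periods only.
no_periods = [w.rstrip(".") for w in words]
print(no_periods)
# ['file', 'skills', 'long', 'wait', '(hello)', "don't"]

# Strip any leading or trailing ASCII punctuation, keeping
# punctuation inside a word such as the apostrophe in "don't".
cleaned = [w.strip(string.punctuation) for w in words]
print(cleaned)
# ['file', 'skills', 'long', 'wait', 'hello', "don't"]
```

Stripping all punctuation is usually the safer choice for real text, where words may also end with commas, question marks, or quotes.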

Next, create an empty dictionary to hold the words and their count. Let us call the variable summary .

summary = {}

Using a for loop, we will go through the text_split list. If a word is already in the dictionary, we increase its count by 1; otherwise, we add the word to the dictionary and set its count to 1. For example:

for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1

The expression word in summary checks whether word is already a key in the dictionary.

Here is a shorter way to re-write the code above using the dictionary's get() method.

summary2 = {}

for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1

summary and summary2 have the same content.

print(summary == summary2)
True

summary2.get(word, 0) + 1 gets the count associated with word. If word is not yet a key in the dictionary, get() returns zero (0) instead. Adding 1 then either sets summary2[word] to 1 or increments the count already stored in summary2[word].
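
The standard library also provides collections.Counter, which performs this kind of counting in a single call. The sketch below compares it with the get() loop on a short, made-up list of words. It is shown as an optional shortcut, not as a replacement for understanding the loop.

```python
from collections import Counter

# A short, made-up list of words for illustration.
words = ["a", "a", "long", "text", "long", "file"]

# Manual counting with get(), as in the loop above.
manual = {}
for word in words:
    manual[word] = manual.get(word, 0) + 1

# Counter builds the same word-to-count mapping in one call.
counted = Counter(words)

print(dict(counted) == manual)  # True
print(counted["a"])             # 2
print(counted["missing"])       # 0, a Counter returns 0 for unknown keys
```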

Finally, print the content of summary. The order of the entries on your machine may differ from what is shown here, but the words and their counts will be the same.

print(summary)
{'a': 2, 'because': 1, 'begin': 1, 'but': 1, 'conquest': 1, 'contain': 1, 'declared': 1, 'file': 1, 'from': 1, 'have': 1, 'let': 1, 'lines': 1, 'long': 2, 'may': 1, 'not': 1, 'or': 1, 'python': 1, 'read': 1, 'should': 1, 'skills': 1, 'text': 2, 'the': 2, 'us': 1, 'very': 2, 'worry': 1, 'you': 2}

To obtain a much prettier print, use the pp() function from the pprint module (available in Python 3.8 and later). We have not discussed modules yet, so do not worry about the import line. Just type it for now.

from pprint import pp

# remember that summary and summary2 have the same content
pp(summary2)
{'a': 2,
 'because': 1,
 'begin': 1,
 'but': 1,
 'conquest': 1,
 'contain': 1,
 'declared': 1,
 'file': 1,
 'from': 1,
 'have': 1,
 'let': 1,
 'lines': 1,
 'long': 2,
 'may': 1,
 'not': 1,
 'or': 1,
 'python': 1,
 'read': 1,
 'should': 1,
 'skills': 1,
 'text': 2,
 'the': 2,
 'us': 1,
 'very': 2,
 'worry': 1,
 'you': 2}
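
A common next step is to list the most frequent words first. The sketch below sorts the (word, count) pairs by count using sorted() with a key function. It uses a small, made-up dictionary in place of the full summary so that it can run on its own.

```python
# A small stand-in for the summary dictionary.
summary = {"a": 2, "because": 1, "long": 2, "text": 2, "begin": 1}

# items() yields (word, count) pairs. Sort them by count, highest
# first. sorted() is stable, so words with equal counts keep their
# original relative order.
by_count = sorted(summary.items(), key=lambda pair: pair[1], reverse=True)

for word, count in by_count:
    # Left-align the word in a 10-character column.
    print(f"{word:<10}{count}")
```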

Now, here is the full program.

# This could come from a file
text = "A very long text declared or read from a file. \
    The text may contain very long lines but you should not worry because you have python skills. " \
    "Let us begin the conquest"

# convert to lower case
text = text.lower()

# create a list of words
text_split = text.split()

# sort in alphabetical order
text_split.sort()

# get a set of the unique words
unique_words = set(text_split)

print(unique_words)

# go through each word and remove any period (.)
text_split = map(lambda x: x[: -1] if x.endswith(".") else x,  text_split)

# convert the map output back to a list
text_split = list(text_split)

print(text_split)

summary = {}

for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1


summary2 = {}

for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1


print(summary == summary2)

# print(summary)

from pprint import pp

pp(summary2)
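
If the text comes from a file, the program can be extended to report line and byte counts as well, similar to the wc command in a Linux shell. The following is a minimal sketch with a made-up two-line string. Treating the text as UTF-8 when measuring bytes is an assumption; a real file may use a different encoding.

```python
# A made-up two-line text standing in for the content of a file.
text = "first line of text\nsecond line\n"

# Count newline characters, which is how wc -l counts lines.
line_count = text.count("\n")

# Whitespace-separated words, as in the word counter above.
word_count = len(text.split())

# Size in bytes once encoded; UTF-8 is an assumption here.
byte_count = len(text.encode("utf-8"))

print(line_count, word_count, byte_count)  # 2 6 31
```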

Conclusion



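In this article, we built a word counter program. If you read the text from a file, you could also report the number of lines and the file size in bytes, much like the wc command in a Linux shell. The next article will cover functions. Thank you for reading.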
