Homework 2
Please study lec_05, lec_06, lec_07, and lec_08 for this homework
There are three parts in the homework:
Part 1. practice encode, decode, and unicode
Part 2. practice using regular expression to extract information
Part 3. practice Part-of-Speech tagging
The deadline is: 2021.03.06 10:00pm
Please run your code in Google Colab to avoid unnecessary confusion; the expected outputs are shown below each code cell.
Write your code in the code cell commented as "write your code here / fill in the function"
Please do not remove the comment "write your code here / fill in the function"
Please do not change variable names
In [ ]:
import nltk
nltk.download('book') # download the data used in the nltk book
In [ ]:
# read the polish text file
file = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(file, encoding='latin2') # tell the function that this file is encoded with latin2
f_lines = f.readlines()
for line in f_lines:
    line = line.strip() # remove the leading and the trailing characters (e.g., space, \n)
    print(line)
Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą "Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
- Check the Unicode representation of each line
In [ ]:
# write your code here
b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105\\n'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez\\n'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki\\n'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych\\n'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.\\n'
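One possible fill for the cell above (a sketch, reusing f_lines from the earlier cell; encoding with 'unicode_escape' is one way to reveal the code points):
# a minimal sketch: encode each line with 'unicode_escape' to expose the Unicode escapes
for line in f_lines:
    print(line.strip().encode('unicode_escape'))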
- Print letters that are not in the ASCII range
- hint: ord(the letter) > 127
- print the following information of the letters:
- letter
- the byte representation of the letter encoded with 'latin2'
- the byte representation of the letter encoded with 'ISO-8859-2'
- the byte representation of the letter encoded with 'utf8'
- the Unicode representation
- the name assigned to this letter
In [ ]:
# write your code here
ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
ś b'\xb6' b'\xb6' b'\xc5\x9b' U+015b -> LATIN SMALL LETTER S WITH ACUTE
Ś b'\xa6' b'\xa6' b'\xc5\x9a' U+015a -> LATIN CAPITAL LETTER S WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
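One possible sketch for this cell; it assumes the standard-library unicodedata module for the character names:
# a minimal sketch: for every non-ASCII character, print its encodings and Unicode name
import unicodedata
for line in f_lines:
    for c in line.strip():
        if ord(c) > 127:
            print(c, c.encode('latin2'), c.encode('ISO-8859-2'), c.encode('utf8'),
                  'U+%04x ->' % ord(c), unicodedata.name(c))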
- Extract words containing non-ASCII letters
- non-ASCII letters such as: ń, ó
- words containing non-ASCII letters, e.g.: państwowa
In [ ]:
from nltk import word_tokenize
special_letters = [ ] # define the list of special letters from the previous output
# write your code here
# note: "śląsk" appears twice because it contains two special characters, "ś" and "ą"; it's fine if you want to remove duplicated words.
państwowa nazwą niemców światowej śląsk śląsk zostały trafiły jagiellońskiej obejmują archiwaliów
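One possible sketch; the list of special letters is taken from the previous output, the duplicated "śląsk" falls out of checking each special letter separately, and lower-casing the line is an assumption based on the expected output:
# a minimal sketch: print every (lower-cased) token that contains a special letter
special_letters = ['ń', 'ą', 'ó', 'ś', 'Ś', 'ł']  # from the previous output
for line in f_lines:
    for word in word_tokenize(line.strip().lower()):
        for letter in special_letters:
            if letter in word:
                print(word, end=' ')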
In [ ]:
# get contents from a URL
from urllib import request
html = request.urlopen('http://nltk.org/').read().decode('utf8')
print(html)
In [ ]:
# use BeautifulSoup to pull text data from the HTML file
from bs4 import BeautifulSoup
raw_text = # write your code here
print(raw_text)
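A minimal sketch for pulling the visible text; the choice of the built-in 'html.parser' is an assumption:
# parse the HTML and keep only the visible text
raw_text = BeautifulSoup(html, 'html.parser').get_text()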
- Write a regular expression to extract digits from the text
- the order of elements in the output list doesn't matter
- hints:
- re.findall()
- think carefully about the patterns of these digits
- one or more digits
- digits separated by dot
In [ ]:
import re
In [ ]:
# write your code here
['3.5', '3.5', '50', '3', '3', '2', '1', '0', '6', '0001', '0', '2009', '2020', '13', '2020', '2.4.4']
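One possible pattern (a sketch): one or more digits, optionally followed by dot-separated digit groups, which also covers version strings like 2.4.4:
# a minimal sketch of the digit pattern
print(re.findall(r'\d+(?:\.\d+)*', raw_text))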
- Write a regular expression to extract upper-case sequences of letters from the text
- all letters in the sequence are in upper-case format
- the sequence should have at least two upper-case letters
- use set() function to remove repeated elements
- the order of elements in the output doesn't matter
In [ ]:
# write your code here
{'NLP', 'NLTK', 'HOWTO', 'NNP', 'PERSON', 'RB', 'IN', 'NN', 'VB', 'NB', 'FAQ', 'JJ', 'API', 'VBD', 'CD', 'OS'}
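One possible pattern (a sketch): runs of at least two consecutive upper-case ASCII letters, de-duplicated with set():
# a minimal sketch of the upper-case pattern
print(set(re.findall(r'[A-Z]{2,}', raw_text)))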
- Write a regular expression to substitute 'NLP' with 'Natural Language Processing' in the given sentence
- hint: re.sub()
In [ ]:
Out[12]:
sentence = 'NLTK provides wrappers for industrial-strength NLP libraries'
re.sub() # fill the function
'NLTK provides wrappers for industrial-strength Natural Language Processing libraries'
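A sketch of the substitution call:
# a minimal sketch: replace the literal 'NLP' in the sentence
re.sub('NLP', 'Natural Language Processing', sentence)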
- Write a regular expression to find words ending in ing
- use set() function to remove repeated elements
- the order of words in the output doesn't matter
In [ ]:
# write your code here
{'leading', 'categorizing', 'amazing', 'Processing', 'introducing', 'stemming', 'Installing', 'teaching', 'working', 'processing', 'reasoning', 'using', 'parsing', 'programming', 'writing', 'tagging', 'building', 'analyzing', 'morning'}
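One possible pattern (a sketch): word characters ending in ing at a word boundary, de-duplicated with set():
# a minimal sketch of the 'ing' pattern
print(set(re.findall(r'\w+ing\b', raw_text)))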
- Write a regular expression to extract the URL from the text
- hint: re.search()
In [ ]:
Out[14]:
# write your code here
<_sre.SRE_Match object; span=(1539, 1563), match='http://nltk.org/book_1ed'>
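One possible pattern (a sketch); the character class is an assumption about what this particular URL contains, and it relies on re.search() returning the first match in the text:
# a minimal sketch of the URL pattern
re.search(r'https?://[\w./-]+', raw_text)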
In [ ]:
text = "John's big idea isn't all that bad"
- For the given sentence, check the POS tag with tagset=None
In [ ]:
Out[16]:
# write your code here
[('John', 'NNP'),
("'s", 'POS'),
('big', 'JJ'),
('idea', 'NN'),
('is', 'VBZ'),
("n't", 'RB'),
('all', 'PDT'),
('that', 'DT'),
('bad', 'JJ')]
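A sketch for this cell: tokenize first, then tag with the default Penn Treebank tagset:
# a minimal sketch: POS tagging with the default tagset
nltk.pos_tag(word_tokenize(text))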
- Check the POS tag for the given sentence with tagset='universal'
In [ ]:
Out[40]:
# write your code here
[('John', 'NOUN'),
("'s", 'PRT'),
('big', 'ADJ'),
('idea', 'NOUN'),
('is', 'VERB'),
("n't", 'ADV'),
('all', 'DET'),
('that', 'DET'),
('bad', 'ADJ')]
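A sketch: the same call with the universal tagset:
# a minimal sketch: POS tagging with the universal tagset
nltk.pos_tag(word_tokenize(text), tagset='universal')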
In [ ]:
# check the specific meaning of each POS tag in the Penn Treebank tagset
nltk.help.upenn_tagset()
- Specify the training data and test data
- training data: brown corpus with news category
- testing data: brown corpus with humor category
In [ ]:
from nltk.corpus import brown
In [ ]:
train_sents = # write your code here
test_sents = # write your code here
print("%d training sentences, %d testing sentences" % (len(train_sents), len(test_sents)))4623 training sentences, 1053 testing sentences
- Train a default tagger using the news category of the brown corpus
- hint:
- find the most frequent tag of the news category
- train a DefaultTagger with the most frequent tag
- Get the most frequent tag for the brown corpus training data (categories='news')
In [ ]:
Out[20]:
# write your code here
'NN'
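One possible sketch: count the tags of all tagged words in the news category and take the most common one; the name most_freq_tag is introduced here for reuse in the next cell:
# a minimal sketch: the most frequent tag in the training categories
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
most_freq_tag = nltk.FreqDist(tags).max()
most_freq_tag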
- Train a DefaultTagger with the most frequent tag
In [ ]:
default_tagger = # write your code here
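A sketch, reusing most_freq_tag from the previous sketch:
# a minimal sketch: a tagger that assigns 'NN' to everything
default_tagger = nltk.DefaultTagger(most_freq_tag)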
- Apply the DefaultTagger to tag the given text
In [ ]:
Out[22]:
text = "John's big idea isn't all that bad"
default_tagger.tag() # fill in the tag function
[('John', 'NN'),
("'s", 'NN'),
('big', 'NN'),
('idea', 'NN'),
('is', 'NN'),
("n't", 'NN'),
('all', 'NN'),
('that', 'NN'),
('bad', 'NN')]
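One way to fill the tag call above (a sketch):
# a minimal sketch: tag the tokenized sentence
default_tagger.tag(word_tokenize(text))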
- Evaluate the DefaultTagger's performance on the testing data
In [ ]:
Out[23]:
default_tagger.evaluate() # fill in the evaluate function
0.11832219405392948
- Train a UnigramTagger with the training data (categories='news')
- Get the top-n most frequent words and the most frequent tag for each word (in the training data)
In [ ]:
def topn_frequent_tags(topn):
    """
    - get the topn most frequent words
    - for each frequent word, get its most frequent tag
    - return: dict{word} = tag
    """
    # write your code here
    return most_freq_wordtags
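One possible implementation, following the lookup-tagger recipe from the NLTK book (the local names fd and cfd are illustrative):
# a minimal sketch: top-n words by frequency, each mapped to its most frequent tag
def topn_frequent_tags(topn):
    fd = nltk.FreqDist(brown.words(categories='news'))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    most_freq_wordtags = dict((word, cfd[word].max()) for (word, _) in fd.most_common(topn))
    return most_freq_wordtags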
- The top-10 most frequent words and the corresponding tags in the training data
In [ ]:
Out[41]:
topn_frequent_tags(topn=10)
{',': ',',
'.': '.',
'The': 'AT',
'a': 'AT',
'and': 'CC',
'for': 'IN',
'in': 'IN',
'of': 'IN',
'the': 'AT',
'to': 'TO'}
- Build a UnigramTagger with the top-100 most frequent word tags
In [ ]:
unigram_tagger_top100 = # write your code here
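A sketch: a lookup tagger built from the top-100 word-tag table via UnigramTagger's model argument:
# a minimal sketch: a lookup tagger backed by the top-100 word->tag dictionary
unigram_tagger_top100 = nltk.UnigramTagger(model=topn_frequent_tags(topn=100))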
- Evaluate the performance of the unigram_tagger_top100 on test data
In [ ]:
Out[26]:
# write your code here
0.491126987785204
- Build a UnigramTagger with the top-1000 most frequent word tags
In [ ]:
unigram_tagger_top1000 = # write your code here
- Evaluate the performance of the unigram_tagger_top1000 on test data
In [ ]:
Out[28]:
# write your code here
0.6397787508642544
- Build a UnigramTagger with all the tagged training sentences
In [ ]:
unigram_tagger = # write your code here
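A sketch: train on all tagged training sentences instead of a lookup table:
# a minimal sketch: a unigram tagger trained on the full training data
unigram_tagger = nltk.UnigramTagger(train_sents)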
- Evaluate the performance of the unigram_tagger on test data
In [ ]:
Out[30]:
# write your code here
0.7951140815856188
- Build a BigramTagger with all the tagged training sentences and evaluate its performance on test data
In [ ]:
Out[31]:
t2 = # write your code here
t2.evaluate(test_sents)
0.11684719981562572
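A sketch; with no backoff, the bigram tagger returns None for unseen bigram contexts, which explains the low score:
# a minimal sketch: a bigram tagger with no backoff
t2 = nltk.BigramTagger(train_sents)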
- Apply the t2 tagger on the given sentence
In [ ]:
Out[32]:
t2.tag() # fill in the tag function
[('John', 'NP'),
("'s", None),
('big', None),
('idea', None),
('is', None),
("n't", None),
('all', None),
('that', None),
('bad', None)]
- Add the UnigramTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
Out[33]:
t2_t1 = # write your code here
t2_t1.evaluate() # fill in the evaluate function
0.805992164093109
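A sketch of the backoff chain:
# a minimal sketch: fall back to the unigram tagger for unseen bigram contexts
t2_t1 = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
t2_t1.evaluate(test_sents)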
- Apply the t2_t1 tagger on the given sentence
In [ ]:
Out[34]:
t2_t1.tag() # fill in the tag function
[('John', 'NP'),
("'s", None),
('big', 'JJ'),
('idea', 'NN'),
('is', 'BEZ'),
("n't", None),
('all', 'ABN'),
('that', 'DT'),
('bad', 'JJ')]
- Add the DefaultTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
Out[35]:
t2_default = # write your code here
t2_default.evaluate() # fill in the evaluate function
0.7252823231159253
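A sketch, with the 'NN' default tagger as the fallback instead:
# a minimal sketch: fall back to the default tagger for unseen bigram contexts
t2_default = nltk.BigramTagger(train_sents, backoff=default_tagger)
t2_default.evaluate(test_sents)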
- Apply the t2_default tagger on the given sentence
In [ ]:
Out[36]:
t2_default.tag() # fill in the tag function
[('John', 'NP'),
("'s", 'NN'),
('big', 'JJ'),
('idea', 'NN'),
('is', 'BEZ'),
("n't", 'NN'),
('all', 'ABN'),
('that', 'DT'),
('bad', 'NN')]
- Summarize the performance of your taggers
In [ ]:
print("The accuracy of DefaultTagger is %.3f" % default_tagger.evaluate(test_sents) )print("The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is %.3f" % unigram_tagger_top100.evaluate(test_sents))print("The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is %.3f" % unigram_tagger_top1000.evaluate(test_sents))print("The accuracy of UnigramTagger (trained with all tagged training data) is %.3f" % unigram_tagger.evaluate(test_sents))print("The accuracy of BigramTagger is %.3f" % t2.evaluate(test_sents))print("The accuracy of BigramTagger (with UnigramTagger as backoff) is %.3f" % t2_t1.evaluate(test_sents))print("The accuracy of BigramTagger (with DefaultTagger as backoff) is %.3f" % t2_default.evaluate(test_sents))The accuracy of DefaultTagger is 0.118 The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is 0.491 The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is 0.640 The accuracy of UnigramTagger (trained with all tagged training data) is 0.795 The accuracy of BigramTagger is 0.117 The accuracy of BigramTagger (with UnigramTagger as backoff) is 0.806 The accuracy of BigramTagger (with DefaultTagger as backoff) is 0.725
