
Homework 2

  • Please study lec_05, lec_06, lec_07, and lec_08 for this homework

  • There are three parts in the homework:

    Part 1. practice encoding, decoding, and Unicode
    Part 2. practice using regular expressions to extract information
    Part 3. practice part-of-speech (POS) tagging

  • The deadline is: 2021.03.06 10:00pm

  • Please run your code in Google Colab to avoid unnecessary confusion; the expected outputs are shown below each code cell.

    Write your code in the code cell commented as "write your code here / fill in the function"
    Please do not remove the comment "write your code here / fill in the function"
    Please do not change variable names

In [ ]:
import nltk
nltk.download('book') # download the data used in nltk book

Part 1: practice encoding, decoding, and Unicode

In [ ]:
# read the polish text file
file = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(file, encoding='latin2') # tell the function that this file is encoded with latin2
f_lines = f.readlines()
for line in f_lines:
    line = line.strip() # remove the leading and the trailing characters (e.g., space, \n)
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
  • Check the Unicode representation of each line
In [ ]:
# write your code here

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105\\n'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez\\n'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki\\n'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych\\n'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.\\n'
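
A minimal sketch that would reproduce the output above, using Python's unicode_escape codec (assumes f_lines from the earlier cell):

for line in f_lines:
    print(line.encode('unicode_escape'))  # non-ASCII characters become \uXXXX / \xXX escape sequences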
  • Print letters that are not in the ASCII range
    • hint: ord(the letter) > 127
    • print the following information of the letters:
      • letter
      • the byte representation of the letter encoded with 'latin2'
      • the byte representation of the letter encoded with 'ISO-8859-2' (an alias of latin2, so the bytes should match)
      • the byte representation of the letter encoded with 'utf8'
      • the Unicode representation
      • the name assigned to this letter
In [ ]:
# write your code here

ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
ś b'\xb6' b'\xb6' b'\xc5\x9b' U+015b -> LATIN SMALL LETTER S WITH ACUTE
Ś b'\xa6' b'\xa6' b'\xc5\x9a' U+015a -> LATIN CAPITAL LETTER S WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
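
One possible solution (a sketch; unicodedata is part of the standard library):

import unicodedata
for line in f_lines:
    for c in line:
        if ord(c) > 127:  # outside the ASCII range
            print('%s %r %r %r U+%04x -> %s' % (c, c.encode('latin2'),
                  c.encode('ISO-8859-2'), c.encode('utf8'),
                  ord(c), unicodedata.name(c)))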
  • Extract words containing non-ASCII letters
    • non-ASCII letters such as: ń, ó
    • words containing non-ASCII letters such as: państwowa
In [ ]:

from nltk import word_tokenize
special_letters = [ ]  # define the list of special letters from the previous output 
# write your code here
# note: "śląsk" appears twice because it contains two special letters ("ś" and "ą"); it's fine if you remove duplicated words.

państwowa
nazwą
niemców
światowej
śląsk
śląsk
zostały
trafiły
jagiellońskiej
obejmują
archiwaliów
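
A sketch of one approach, with special_letters filled in from the previous output (duplicates such as "śląsk" are kept, as noted above):

special_letters = ['ń', 'ą', 'ó', 'ś', 'Ś', 'ł']
for line in f_lines:
    for word in word_tokenize(line.strip().lower()):
        for letter in special_letters:
            if letter in word:
                print(word)  # printed once per special letter it contains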

Part 2: access a webpage and extract required information using regular expressions

In [ ]:
# get contents from a URL
from urllib import request
html = request.urlopen('http://nltk.org/').read().decode('utf8') 
print(html)

In [ ]:
# use BeautifulSoup to pull text data from the HTML file
from bs4 import BeautifulSoup
raw_text =  # write your code here
print(raw_text)
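
One common way to fill this in (a sketch; get_text() strips the HTML markup):

raw_text = BeautifulSoup(html, 'html.parser').get_text()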

  • Write a regular expression to extract digits from the text

    • the order of elements in the output list doesn't matter
  • hints:

    • re.findall()
    • think carefully about the patterns of these digits
      • one or more digits
      • digits separated by dot
In [ ]:
import re

In [ ]:
# write your code here

['3.5', '3.5', '50', '3', '3', '2', '1', '0', '6', '0001', '0', '2009', '2020', '13', '2020', '2.4.4']
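
A sketch that follows the hints (one or more digits, optionally joined by dots); the exact list depends on the current contents of the page:

re.findall(r'\d+(?:\.\d+)*', raw_text)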
  • Write a regular expression to extract upper-case sequences of letters from the text
    • all letters in the sequence are upper-case
    • each sequence must contain at least two upper-case letters
    • use set() function to remove repeated elements
    • the order of elements in the output doesn't matter
In [ ]:
# write your code here

{'NLP', 'NLTK', 'HOWTO', 'NNP', 'PERSON', 'RB', 'IN', 'NN', 'VB', 'NB', 'FAQ', 'JJ', 'API', 'VBD', 'CD', 'OS'}
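
A sketch matching runs of two or more consecutive upper-case letters, de-duplicated with set():

set(re.findall(r'[A-Z]{2,}', raw_text))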
  • Write a regular expression to substitute 'NLP' to 'Natural Language Processing' in the given sentence
    • re.sub()
In [ ]:
sentence = 'NLTK provides wrappers for industrial-strength NLP libraries'
re.sub() # fill the function

Out[12]:
'NLTK provides wrappers for industrial-strength Natural Language Processing libraries'
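
One way to fill in the call (a sketch):

re.sub(r'NLP', 'Natural Language Processing', sentence)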
  • Write a regular expression to find words ending in ing
    • use set() function to remove repeated elements
    • the order of words in the output doesn't matter
In [ ]:
# write your code here

{'leading', 'categorizing', 'amazing', 'Processing', 'introducing', 'stemming', 'Installing', 'teaching', 'working', 'processing', 'reasoning', 'using', 'parsing', 'programming', 'writing', 'tagging', 'building', 'analyzing', 'morning'}
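
A sketch (word characters ending in ing at a word boundary; again, the exact set depends on the live page):

set(re.findall(r'\w+ing\b', raw_text))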
  • Write a regular expression to extract the URL
    • re.search()
In [ ]:
# write your code here

Out[14]:
<_sre.SRE_Match object; span=(1539, 1563), match='http://nltk.org/book_1ed'>
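
A sketch using re.search; the character class here is one reasonable assumption, and the span depends on the page text:

re.search(r'https?://[\w./]+', raw_text)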

Part 3: practice POS tagging

In [ ]:
text = "John's big idea isn't all that bad" 

  • For the given sentence, check the POS tag with tagset=None
In [ ]:
# write your code here

Out[16]:
[('John', 'NNP'),
 ("'s", 'POS'),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'VBZ'),
 ("n't", 'RB'),
 ('all', 'PDT'),
 ('that', 'DT'),
 ('bad', 'JJ')]
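
A sketch using NLTK's standard tagger (pos_tag defaults to the Penn Treebank tagset when tagset=None):

nltk.pos_tag(nltk.word_tokenize(text))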
  • Check the POS tag for the given sentence with tagset = 'universal'
In [ ]:
# write your code here

Out[40]:
[('John', 'NOUN'),
 ("'s", 'PRT'),
 ('big', 'ADJ'),
 ('idea', 'NOUN'),
 ('is', 'VERB'),
 ("n't", 'ADV'),
 ('all', 'DET'),
 ('that', 'DET'),
 ('bad', 'ADJ')]
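
The same call, mapped to the universal tagset (a sketch):

nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')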
In [ ]:
# check the specific meaning of each POS tag in the Penn Treebank tagset
nltk.help.upenn_tagset()

  • Specify the training data and test data
    • training data: Brown corpus, news category
    • testing data: Brown corpus, humor category
In [ ]:
from nltk.corpus import brown

In [ ]:
train_sents = # write your code here
test_sents = # write your code here
print("%d training sentences, %d testing sentences" % (len(train_sents), len(test_sents)))

4623 training sentences, 1053 testing sentences
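
A sketch of one way to define the two splits:

train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='humor')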
  • Train a DefaultTagger using the news category of the Brown corpus
    • hint:
      • find the most frequent tag of the news category
      • train a DefaultTagger with the most frequent tag
  • Get the most frequent tag in the Brown corpus training data (categories = 'news')
In [ ]:
# write your code here

Out[20]:
'NN'
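
One possible solution (a sketch: count every tag in the training data and take the most common one):

tag_fd = nltk.FreqDist(tag for sent in train_sents for (word, tag) in sent)
tag_fd.max()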
  • Train a DefaultTagger with the most frequent tag
In [ ]:
default_tagger =  # write your code here
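
A sketch, plugging in the most frequent tag found above:

default_tagger = nltk.DefaultTagger('NN')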

  • Apply the DefaultTagger to tag the given text
In [ ]:
text = "John's big idea isn't all that bad" 
default_tagger.tag() # fill in the tag function

Out[22]:
[('John', 'NN'),
 ("'s", 'NN'),
 ('big', 'NN'),
 ('idea', 'NN'),
 ('is', 'NN'),
 ("n't", 'NN'),
 ('all', 'NN'),
 ('that', 'NN'),
 ('bad', 'NN')]
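
The call being filled in might look like this (a sketch):

default_tagger.tag(nltk.word_tokenize(text))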
  • Evaluate the DefaultTagger's performance on testing data
In [ ]:
default_tagger.evaluate() # fill in the evaluate function

Out[23]:
0.11832219405392948
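
A sketch of the evaluation call:

default_tagger.evaluate(test_sents)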
  • Train a UnigramTagger with the training data (categories='news')
  • Get the top-n most frequent words and the most frequent tag for each word (in the training data)
In [ ]:
def topn_frequent_tags(topn):
    """
    - get topn frequent words
    - for each frequent word, get its most frequent tag
    - return: dict{word}=tag 
    """
    # write your code here
    
    return most_freq_wordtags
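
One way to fill in the body (a sketch; FreqDist ranks the words, and ConditionalFreqDist gives each word's most frequent tag):

def topn_frequent_tags(topn):
    # frequency of each word in the training data
    word_fd = nltk.FreqDist(word for sent in train_sents for (word, tag) in sent)
    # for each word, the distribution of tags it receives
    cfd = nltk.ConditionalFreqDist((word, tag) for sent in train_sents for (word, tag) in sent)
    topn_words = [word for (word, count) in word_fd.most_common(topn)]
    most_freq_wordtags = dict((word, cfd[word].max()) for word in topn_words)
    return most_freq_wordtags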

  • The top-10 most frequent words and their corresponding tags in the training data
In [ ]:
topn_frequent_tags(topn=10)

Out[41]:
{',': ',',
 '.': '.',
 'The': 'AT',
 'a': 'AT',
 'and': 'CC',
 'for': 'IN',
 'in': 'IN',
 'of': 'IN',
 'the': 'AT',
 'to': 'TO'}
  • Build a UnigramTagger with the top-100 most frequent word tags
In [ ]:
unigram_tagger_top100 = # write your code here
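
A sketch using the model= argument of UnigramTagger together with the helper above:

unigram_tagger_top100 = nltk.UnigramTagger(model=topn_frequent_tags(topn=100))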

  • Evaluate the performance of the unigram_tagger_top100 on test data
In [ ]:
# write your code here

Out[26]:
0.491126987785204
  • Build a UnigramTagger with the top-1000 most frequent word tags
In [ ]:
unigram_tagger_top1000 = # write your code here
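
The same pattern for the top-1000 model (a sketch):

unigram_tagger_top1000 = nltk.UnigramTagger(model=topn_frequent_tags(topn=1000))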

  • Evaluate the performance of the unigram_tagger_top1000 on test data
In [ ]:
# write your code here

Out[28]:
0.6397787508642544
  • Build a UnigramTagger with all the tagged training sentences
In [ ]:
unigram_tagger = # write your code here
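
A sketch training on all tagged sentences:

unigram_tagger = nltk.UnigramTagger(train_sents)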

  • Evaluate the performance of the unigram_tagger on test data
In [ ]:
# write your code here

Out[30]:
0.7951140815856188
  • Build a BigramTagger with all the tagged training sentences and evaluate its performance on test data
In [ ]:
t2 = # write your code here
t2.evaluate(test_sents)

Out[31]:
0.11684719981562572
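
A sketch for the bigram tagger; without a backoff it returns None for unseen contexts, which explains the low score:

t2 = nltk.BigramTagger(train_sents)
t2.evaluate(test_sents)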
  • Apply the t2 tagger to the given sentence
In [ ]:
t2.tag() # fill in the tag function

Out[32]:
[('John', 'NP'),
 ("'s", None),
 ('big', None),
 ('idea', None),
 ('is', None),
 ("n't", None),
 ('all', None),
 ('that', None),
 ('bad', None)]
  • Add the UnigramTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
t2_t1 = # write your code here
t2_t1.evaluate() # fill in the evaluate function

Out[33]:
0.805992164093109
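
A sketch wiring the UnigramTagger trained above in as the backoff:

t2_t1 = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
t2_t1.evaluate(test_sents)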
  • Apply the t2_t1 tagger to the given sentence
In [ ]:
t2_t1.tag() # fill in the tag function

Out[34]:
[('John', 'NP'),
 ("'s", None),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'BEZ'),
 ("n't", None),
 ('all', 'ABN'),
 ('that', 'DT'),
 ('bad', 'JJ')]
  • Add the DefaultTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
t2_default = # write your code here
t2_default.evaluate() # fill in the evaluate function

Out[35]:
0.7252823231159253
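
A sketch with the DefaultTagger as the backoff:

t2_default = nltk.BigramTagger(train_sents, backoff=default_tagger)
t2_default.evaluate(test_sents)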
  • Apply the t2_default tagger to the given sentence
In [ ]:
t2_default.tag() # fill in the tag function

Out[36]:
[('John', 'NP'),
 ("'s", 'NN'),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'BEZ'),
 ("n't", 'NN'),
 ('all', 'ABN'),
 ('that', 'DT'),
 ('bad', 'NN')]
  • Summarize the performance of your taggers
In [ ]:

print("The accuracy of DefaultTagger is %.3f" % default_tagger.evaluate(test_sents) )
print("The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is %.3f" % unigram_tagger_top100.evaluate(test_sents))
print("The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is %.3f" % unigram_tagger_top1000.evaluate(test_sents))
print("The accuracy of UnigramTagger (trained with all tagged training data) is %.3f" % unigram_tagger.evaluate(test_sents))
print("The accuracy of BigramTagger is %.3f" % t2.evaluate(test_sents))
print("The accuracy of BigramTagger (with UnigramTagger as backoff) is %.3f" % t2_t1.evaluate(test_sents))
print("The accuracy of BigramTagger (with DefaultTagger as backoff) is %.3f" % t2_default.evaluate(test_sents))

The accuracy of DefaultTagger is 0.118
The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is 0.491
The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is 0.640
The accuracy of UnigramTagger (trained with all tagged training data) is 0.795
The accuracy of BigramTagger is 0.117
The accuracy of BigramTagger (with UnigramTagger as backoff) is 0.806
The accuracy of BigramTagger (with DefaultTagger as backoff) is 0.725