Homework 2
Please study lec_05, lec_06, lec_07, and lec_08 for this homework
There are three parts in the homework:
Part 1. practice encode, decode, and unicode
Part 2. practice using regular expression to extract information
Part 3. practice Part-of-Speech tagging
The deadline is: 2021.03.06 10:00pm
Please run your code in Google Colab to avoid unnecessary confusion; the expected outputs are shown below each code cell.
Write your code in the code cell commented as "write your code here / fill in the function"
Please do not remove the comment "write your code here / fill in the function"
Please do not change variable names
In [ ]:
import nltk
nltk.download('book') # download the data used in the nltk book
In [ ]:
# read the polish text file
file = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(file, encoding='latin2') # tell the function that this file is encoded with latin2
f_lines = f.readlines()
for line in f_lines:
    line = line.strip() # remove the leading and the trailing characters (e.g., space, \n)
    print(line)
Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą "Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
- Check the Unicode representation of each line
In [ ]:
# write your code here
b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105\\n'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez\\n'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki\\n'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych\\n'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.\\n'
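One possible fill for the cell above (a sketch, reusing f_lines from the earlier cell; encoding with 'unicode_escape' is one way to reveal the code points):
# a minimal sketch: encode each line with 'unicode_escape' to expose the Unicode escapes
for line in f_lines:
    print(line.strip().encode('unicode_escape'))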
- Print letters that are not in the ASCII range
- hint: ord(the letter) > 127
- print the following information of the letters:
- letter
- the byte representation of the letter encoded with 'latin2'
- the byte representation of the letter encoded with 'ISO-8859-2'
- the byte representation of the letter encoded with 'utf8'
- the Unicode representation
- the name assigned to this letter
In [ ]:
# write your code here
ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
ś b'\xb6' b'\xb6' b'\xc5\x9b' U+015b -> LATIN SMALL LETTER S WITH ACUTE
Ś b'\xa6' b'\xa6' b'\xc5\x9a' U+015a -> LATIN CAPITAL LETTER S WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ł b'\xb3' b'\xb3' b'\xc5\x82' U+0142 -> LATIN SMALL LETTER L WITH STROKE
ń b'\xf1' b'\xf1' b'\xc5\x84' U+0144 -> LATIN SMALL LETTER N WITH ACUTE
ą b'\xb1' b'\xb1' b'\xc4\x85' U+0105 -> LATIN SMALL LETTER A WITH OGONEK
ó b'\xf3' b'\xf3' b'\xc3\xb3' U+00f3 -> LATIN SMALL LETTER O WITH ACUTE
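One possible sketch for this cell; it assumes the standard-library unicodedata module for the character names:
# a minimal sketch: for every non-ASCII character, print its encodings and Unicode name
import unicodedata
for line in f_lines:
    for c in line.strip():
        if ord(c) > 127:
            print(c, c.encode('latin2'), c.encode('ISO-8859-2'), c.encode('utf8'),
                  'U+%04x ->' % ord(c), unicodedata.name(c))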
- Extract words containing non-ASCII letters
- non-ASCII letters such as: ń, ó
- words containing non-ASCII letters, e.g.: państwowa
In [ ]:
from nltk import word_tokenize
special_letters = [ ] # define the list of special letters from the previous output
# write your code here
# note: "śląsk" appears twice because it contains two special characters, "ś" and "ą"; it's fine if you want to remove duplicated words.
państwowa nazwą niemców światowej śląsk śląsk zostały trafiły jagiellońskiej obejmują archiwaliów
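One possible sketch; the list of special letters is taken from the previous output, the duplicated "śląsk" falls out of checking each special letter separately, and lower-casing the line is an assumption based on the expected output:
# a minimal sketch: print every (lower-cased) token that contains a special letter
special_letters = ['ń', 'ą', 'ó', 'ś', 'Ś', 'ł']  # from the previous output
for line in f_lines:
    for word in word_tokenize(line.strip().lower()):
        for letter in special_letters:
            if letter in word:
                print(word, end=' ')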
In [ ]:
# get contents from a URL
from urllib import request
html = request.urlopen('http://nltk.org/').read().decode('utf8')
print(html)
In [ ]:
# use BeautifulSoup to pull text data from the HTML file
from bs4 import BeautifulSoup
raw_text = # write your code here
print(raw_text)
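A minimal sketch for pulling the visible text; the choice of the built-in 'html.parser' is an assumption:
# parse the HTML and keep only the visible text
raw_text = BeautifulSoup(html, 'html.parser').get_text()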
- Write a regular expression to extract digits from the text
- the order of elements in the output list doesn't matter
- hints:
- re.findall()
- think carefully about the patterns of these digits
- one or more digits
- digits separated by dot
In [ ]:
import re
In [ ]:
# write your code here
['3.5', '3.5', '50', '3', '3', '2', '1', '0', '6', '0001', '0', '2009', '2020', '13', '2020', '2.4.4']
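One possible pattern (a sketch): one or more digits, optionally followed by dot-separated digit groups, which also covers version strings like 2.4.4:
# a minimal sketch of the digit pattern
print(re.findall(r'\d+(?:\.\d+)*', raw_text))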
- Write a regular expression to extract upper-case sequences of letters from the text
- all letters in the sequence are in upper-case format
- the sequence should have at least two upper-case letters
- use set() function to remove repeated elements
- the order of elements in the output doesn't matter
In [ ]:
# write your code here
{'NLP', 'NLTK', 'HOWTO', 'NNP', 'PERSON', 'RB', 'IN', 'NN', 'VB', 'NB', 'FAQ', 'JJ', 'API', 'VBD', 'CD', 'OS'}
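One possible pattern (a sketch): runs of at least two consecutive upper-case ASCII letters, de-duplicated with set():
# a minimal sketch of the upper-case pattern
print(set(re.findall(r'[A-Z]{2,}', raw_text)))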
- Write a regular expression to substitute 'NLP' with 'Natural Language Processing' in the given sentence
- hint: re.sub()
In [ ]:
Out[12]:
sentence = 'NLTK provides wrappers for industrial-strength NLP libraries'
re.sub() # fill the function
'NLTK provides wrappers for industrial-strength Natural Language Processing libraries'
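A sketch of the substitution call:
# a minimal sketch: replace the literal 'NLP' in the sentence
re.sub('NLP', 'Natural Language Processing', sentence)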
- Write a regular expression to find words ending in ing
- use set() function to remove repeated elements
- the order of words in the output doesn't matter
In [ ]:
# write your code here
{'leading', 'categorizing', 'amazing', 'Processing', 'introducing', 'stemming', 'Installing', 'teaching', 'working', 'processing', 'reasoning', 'using', 'parsing', 'programming', 'writing', 'tagging', 'building', 'analyzing', 'morning'}
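One possible pattern (a sketch): word characters ending in ing at a word boundary, de-duplicated with set():
# a minimal sketch of the 'ing' pattern
print(set(re.findall(r'\w+ing\b', raw_text)))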
- Write a regular expression to extract the URL from the text
- hint: re.search()
In [ ]:
Out[14]:
# write your code here
<_sre.SRE_Match object; span=(1539, 1563), match='http://nltk.org/book_1ed'>
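One possible pattern (a sketch); the character class is an assumption about what this particular URL contains, and it relies on re.search() returning the first match in the text:
# a minimal sketch of the URL pattern
re.search(r'https?://[\w./-]+', raw_text)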
In [ ]:
text = "John's big idea isn't all that bad"
- For the given sentence, check the POS tag with tagset=None
In [ ]:
Out[16]:
# write your code here
[('John', 'NNP'),
("'s", 'POS'),
('big', 'JJ'),
('idea', 'NN'),
('is', 'VBZ'),
("n't", 'RB'),
('all', 'PDT'),
('that', 'DT'),
('bad', 'JJ')]
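A sketch for this cell: tokenize first, then tag with the default Penn Treebank tagset:
# a minimal sketch: POS tagging with the default tagset
nltk.pos_tag(word_tokenize(text))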
- Check the POS tag for the given sentence with tagset='universal'
In [ ]:
Out[40]:
# write your code here
[('John', 'NOUN'),
("'s", 'PRT'),
('big', 'ADJ'),
('idea', 'NOUN'),
('is', 'VERB'),
("n't", 'ADV'),
('all', 'DET'),
('that', 'DET'),
('bad', 'ADJ')]
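A sketch: the same call with the universal tagset:
# a minimal sketch: POS tagging with the universal tagset
nltk.pos_tag(word_tokenize(text), tagset='universal')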
In [ ]:
# check the specific meaning of each POS tag in the Penn Treebank tagset
nltk.help.upenn_tagset()
- Specify the training data and test data
- training data: brown corpus with news category
- testing data: brown corpus with humor category
In [ ]:
from nltk.corpus import brown
In [ ]:
train_sents = # write your code here
test_sents = # write your code here
print("%d training sentences, %d testing sentences" % (len(train_sents), len(test_sents)))4623 training sentences, 1053 testing sentences
- Train a default tagger using the news category of the brown corpus
- hint:
- find the most frequent tag of the news category
- train a DefaultTagger with the most frequent tag
- Get the most frequent tag for the brown corpus training data (categories='news')
In [ ]:
Out[20]:
# write your code here
'NN'
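One possible sketch: count the tags of all tagged words in the news category and take the most common one; the name most_freq_tag is introduced here for reuse in the next cell:
# a minimal sketch: the most frequent tag in the training categories
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
most_freq_tag = nltk.FreqDist(tags).max()
most_freq_tag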
- Train a DefaultTagger with the most frequent tag
In [ ]:
default_tagger = # write your code here
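A sketch, reusing most_freq_tag from the previous sketch:
# a minimal sketch: a tagger that assigns 'NN' to everything
default_tagger = nltk.DefaultTagger(most_freq_tag)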
- Apply the DefaultTagger to tag the given text
In [ ]:
Out[22]:
text = "John's big idea isn't all that bad"
default_tagger.tag() # fill in the tag function
[('John', 'NN'),
("'s", 'NN'),
('big', 'NN'),
('idea', 'NN'),
('is', 'NN'),
("n't", 'NN'),
('all', 'NN'),
('that', 'NN'),
('bad', 'NN')]
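One way to fill the tag call above (a sketch):
# a minimal sketch: tag the tokenized sentence
default_tagger.tag(word_tokenize(text))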
- Evaluate the DefaultTagger's performance on the testing data
In [ ]:
Out[23]:
default_tagger.evaluate() # fill in the evaluate function
0.11832219405392948
- Train a UnigramTagger with the training data (categories='news')
- Get the top-n most frequent words and the most frequent tag for each word (in the training data)
In [ ]:
def topn_frequent_tags(topn):
    """
    - get the topn most frequent words
    - for each frequent word, get its most frequent tag
    - return: dict{word} = tag
    """
    # write your code here
    return most_freq_wordtags
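One possible implementation, following the lookup-tagger recipe from the NLTK book (the local names fd and cfd are illustrative):
# a minimal sketch: top-n words by frequency, each mapped to its most frequent tag
def topn_frequent_tags(topn):
    fd = nltk.FreqDist(brown.words(categories='news'))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    most_freq_wordtags = dict((word, cfd[word].max()) for (word, _) in fd.most_common(topn))
    return most_freq_wordtags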
- The top-10 most frequent words and the corresponding tags in the training data
In [ ]:
Out[41]:
topn_frequent_tags(topn=10)
{',': ',',
'.': '.',
'The': 'AT',
'a': 'AT',
'and': 'CC',
'for': 'IN',
'in': 'IN',
'of': 'IN',
'the': 'AT',
'to': 'TO'}
- Build a UnigramTagger with the top-100 most frequent word tags
In [ ]:
unigram_tagger_top100 = # write your code here
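A sketch: a lookup tagger built from the top-100 word-tag table via UnigramTagger's model argument:
# a minimal sketch: a lookup tagger backed by the top-100 word->tag dictionary
unigram_tagger_top100 = nltk.UnigramTagger(model=topn_frequent_tags(topn=100))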
- Evaluate the performance of the unigram_tagger_top100 on test data
In [ ]:
Out[26]:
# write your code here
0.491126987785204
- Build a UnigramTagger with the top-1000 most frequent word tags
In [ ]:
unigram_tagger_top1000 = # write your code here
- Evaluate the performance of the unigram_tagger_top1000 on test data
In [ ]:
Out[28]:
# write your code here
0.6397787508642544
- Build a UnigramTagger with all the tagged training sentences
In [ ]:
unigram_tagger = # write your code here
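A sketch: train on all tagged training sentences instead of a lookup table:
# a minimal sketch: a unigram tagger trained on the full training data
unigram_tagger = nltk.UnigramTagger(train_sents)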
- Evaluate the performance of the unigram_tagger on test data
In [ ]:
Out[30]:
# write your code here
0.7951140815856188
- Build a BigramTagger with all the tagged training sentences and evaluate its performance on test data
In [ ]:
Out[31]:
t2 = # write your code here
t2.evaluate(test_sents)
0.11684719981562572
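A sketch; with no backoff, the bigram tagger returns None for unseen bigram contexts, which explains the low score:
# a minimal sketch: a bigram tagger with no backoff
t2 = nltk.BigramTagger(train_sents)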
- Apply the t2 tagger on the given sentence
In [ ]:
Out[32]:
t2.tag() # fill in the tag function
[('John', 'NP'),
("'s", None),
('big', None),
('idea', None),
('is', None),
("n't", None),
('all', None),
('that', None),
('bad', None)]
- Add the UnigramTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
Out[33]:
t2_t1 = # write your code here
t2_t1.evaluate() # fill in the evaluate function
0.805992164093109
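A sketch of the backoff chain:
# a minimal sketch: fall back to the unigram tagger for unseen bigram contexts
t2_t1 = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
t2_t1.evaluate(test_sents)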
- Apply the t2_t1 tagger on the given sentence
In [ ]:
Out[34]:
t2_t1.tag() # fill in the tag function
[('John', 'NP'),
("'s", None),
('big', 'JJ'),
('idea', 'NN'),
('is', 'BEZ'),
("n't", None),
('all', 'ABN'),
('that', 'DT'),
('bad', 'JJ')]
- Add the DefaultTagger as a backoff for the BigramTagger and evaluate its performance on test data
In [ ]:
Out[35]:
t2_default = # write your code here
t2_default.evaluate() # fill in the evaluate function
0.7252823231159253
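A sketch, with the 'NN' default tagger as the fallback instead:
# a minimal sketch: fall back to the default tagger for unseen bigram contexts
t2_default = nltk.BigramTagger(train_sents, backoff=default_tagger)
t2_default.evaluate(test_sents)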
- Apply the t2_default tagger on the given sentence
In [ ]:
Out[36]:
t2_default.tag() # fill in the tag function
[('John', 'NP'),
("'s", 'NN'),
('big', 'JJ'),
('idea', 'NN'),
('is', 'BEZ'),
("n't", 'NN'),
('all', 'ABN'),
('that', 'DT'),
('bad', 'NN')]
- Summarize the performance of your taggers
In [ ]:
print("The accuracy of DefaultTagger is %.3f" % default_tagger.evaluate(test_sents) )print("The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is %.3f" % unigram_tagger_top100.evaluate(test_sents))print("The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is %.3f" % unigram_tagger_top1000.evaluate(test_sents))print("The accuracy of UnigramTagger (trained with all tagged training data) is %.3f" % unigram_tagger.evaluate(test_sents))print("The accuracy of BigramTagger is %.3f" % t2.evaluate(test_sents))print("The accuracy of BigramTagger (with UnigramTagger as backoff) is %.3f" % t2_t1.evaluate(test_sents))print("The accuracy of BigramTagger (with DefaultTagger as backoff) is %.3f" % t2_default.evaluate(test_sents))The accuracy of DefaultTagger is 0.118 The accuracy of UnigramTagger (trained with the top-100 most frequent word tags) is 0.491 The accuracy of UnigramTagger (trained with the top-1000 most frequent word tags) is 0.640 The accuracy of UnigramTagger (trained with all tagged training data) is 0.795 The accuracy of BigramTagger is 0.117 The accuracy of BigramTagger (with UnigramTagger as backoff) is 0.806 The accuracy of BigramTagger (with DefaultTagger as backoff) is 0.725
