• Size: 0.50M
    File type: .pdf
    Coins: 1
    Downloads: 0
    Published: 2021-03-28
  • Language: Other
  • Tags: Other

Resource description


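Answers to the end-of-chapter exercises of Introduction to Machine Learning, English edition, very complete. It took a long time to find, so I hope everyone will support it!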
Introduction

1. Imagine you have two possibilities: You can fax a document, that is, send the image, or you can use an optical character reader (OCR) and send the text file. Discuss the advantages and disadvantages of the two approaches in a comparative manner. When would one be preferable over the other?
The text file is typically shorter than the image file, but a faxed document can also contain diagrams, pictures, etc. After using an OCR, we lose properties such as font and size (unless we also recognize and transmit such information), or the personal touch if it is handwritten text. OCR may not be perfect, and for ambiguous cases, the OCR should identify those image blocks and transmit them as they are. A fax machine is cheaper and easier to find than a computer with a scanner and OCR software.
OCR is good if we have high-volume, good-quality documents; for documents of a few pages with a small amount of text, it is better to transmit the image.

2. Let us say we are building an OCR and for each character, we store the bitmap of that character as a template that we match with the read character pixel by pixel. Explain when such a system would fail. Why are barcode readers still used?
Such a system allows only one template per character and cannot distinguish characters from multiple fonts, for example. There are standardized fonts such as OCR-A and OCR-B, the fonts you typically see in vouchers and banking slips, which are used with OCR software, and you may have already noticed how the characters in these fonts have been slightly changed to minimize the similarities between them. Barcode readers are still used because reading barcodes is still a better (cheaper, more reliable, more available) technology than reading characters.

3. Assume we are given the task to build a system that can distinguish junk e-mail. What is in a junk e-mail that lets us know that it is junk? How can the computer detect junk through a syntactic analysis?
What would you like the computer to do if it detects a junk e-mail: delete it automatically, move it to a different file, or just highlight it on the screen?
Typically, spam filters check for the existence or absence of words and symbols. Words such as "opportunity", "viagra", and "dollars", as well as characters such as '$', increase the probability that the e-mail is spam. These probabilities are learned from a training set of example past e-mails that the user has previously marked as spam. (One very frequently used method for spam filtering is the naive Bayes' classifier, which we discuss in Section 5.7.)
Spam filters do not work with 100 percent reliability and frequently make errors in classification. If a junk mail is not filtered and is shown to the user, this is not good, but it is not as bad as filtering a good mail as spam. Therefore, mail messages that the system considers as spam should not be automatically deleted but kept aside so that the user can see them if he/she wants to, especially in the early stages of using the spam filter, when the system has not yet been trained sufficiently.
Note that filtering spam will probably never be solved completely, as the spammers keep finding novel ways to outdo the filters: They use the digit '0' instead of the letter 'O' and the digit '1' instead of the letter 'l' to pass the word tests, add pieces of text from regular messages for the mail to be considered not spam, or send it as an image rather than as text (and lately distort the image by small random amounts so that it is not always the same image). Still, spam filtering is probably one of the best application areas of machine learning, where learning systems can adapt to changes in the ways spam messages are generated.

4. Let us say you are given the task of building an automated taxi. Define the constraints. What are the inputs? What is the output? How can you communicate with the passenger? Do you need to communicate with the other automated taxis, that is, do you need a "language"?
An automated taxi should be able to pick up a passenger and drive him/her to a destination. It should have some positioning system (GPS/GIS) and should have other sensors (cameras) to be able to sense cars, pedestrians, obstacles, etc. on the road. The output should be the sequence of actions to reach the destination in the smallest time with the minimum inconvenience to the passenger. The automated taxi needs to communicate with the passenger to receive commands and may also need to interact with other automated taxis to exchange information about road traffic or scheduling, load balancing, etc.

5. In basket analysis, we want to find the dependence between two items X and Y. Given a database of customer transactions, how can you find these dependencies? How would you generalize this to more than two items?
This is discussed in Section 3.9.

6. How can you predict the next command to be typed by the user? Or the next page to be downloaded over the web? When would such a prediction be useful? When would it be annoying?
These are also other applications of basket analysis. The result of any statistical estimation has the risk of being wrong. That is, such dependencies should always be taken as advice which the user can then adopt or refuse. Assuming them to be true and taking automatic action accordingly would be annoying.

Supervised learning

1. Write the computer program that finds S and G from a given training set.
The Matlab code given in ex2_1.m does not consider multiple possible generalizations of S or specializations of G and therefore may not work for small datasets. An example run is given in figure 2.1.
Figure 2.1: '+' and 'o' are the positive and negative examples. C, S, and G are the actual concept, the most specific hypothesis, and the most general hypothesis.

2. Imagine you are given the training instances one at a time, instead of all at once. How can you incrementally adjust S and G in such a case? (Hint: See the candidate elimination algorithm in Mitchell 1997.)
The candidate elimination algorithm proposed by Mitchell starts with S as the null set and G as containing the whole input space. At each instance x, S and G are updated as follows (Mitchell, 1997; p. 33):
• If x is a positive example, remove any $g \in G$ that does not cover x and expand any $s \in S$ that does not cover x.
• If x is a negative example, remove any $s \in S$ that covers x and restrict any $g \in G$ that does cover x.
The important point is that when we are restricting a $g \in G$ (specialization) or expanding an $s \in S$ (generalization), there may be more than one way of doing it, and this creates multiple hypotheses in S or G. For example, in figure 2.2, if we see a negative example at (20000, 2000) after two positive examples, $G = (-\infty < x < \infty, -\infty < y < \infty)$ splits in two: $G_1 = (-\infty < x < 20000, -\infty < y < \infty)$ and $G_2 = (-\infty < x < \infty, -\infty < y < 2000)$. These are two different ways of specializing G so that it no longer includes the negative example.
Figure 2.2: There are two specializations of G (horizontal axis: price).

3. Why is it better to use the average of S and G as the final hypothesis?
If there is noise, instances may be slightly changed; in such a case, using the hypothesis halfway between S and G makes it robust to such small perturbations.

4. Let us say our hypothesis class is a circle instead of a rectangle. What are the parameters? How can the parameters of a circle hypothesis be calculated in such a case? What if it is an ellipse? Why does it make more sense to use an ellipse instead of a circle? How can you generalize your code to K > 2 classes?
In the case of a circle, the parameters are the center and the radius (see figure 2.3).
We then need to find the tightest circle that includes all the positive examples as S, and G will be the largest circle that includes all the positive examples and no negative example.
Figure 2.3: Hypothesis class is a circle with two parameters, the coordinates of its center and its radius.
It makes more sense to use an ellipse because the two axes need not have the same scale, and an ellipse has two separate parameters for the widths along the two axes rather than a single radius.
When there are K > 2 classes, we need a separate circle/ellipse for each class. For each class $C_i$, there will be one hypothesis which takes all elements of $C_i$ as positive examples and instances of all $C_j$, $j \neq i$, as negative examples.

5. Imagine our hypothesis is not one rectangle but a union of two (or m > 1) rectangles. What is the advantage of such a hypothesis class? Show that any class can be represented by such a hypothesis class with large enough m.
In the case when there is a single rectangle, all the positive instances should form one single group; by increasing the number of rectangles, we get flexibility. With two rectangles, for example (see figure 2.4), the positive instances can form two, possibly disjoint, clusters in the input space. Note that each rectangle corresponds to a conjunction on the two input attributes, and having multiple rectangles corresponds to a disjunction. Any logical formula can be written as a disjunction of conjunctions. In the worst case (m = N), we can have a separate rectangle for each positive instance.
Figure 2.4: Hypothesis class is a union of two rectangles.

6. If we have a supervisor who can provide us with the label for any x, where should we choose x to learn with fewer queries?
The region of ambiguity is between S and G.
It would be best to be given queries there, so that we can make this region of doubt smaller. If a given instance there turns out to be positive, this means we can make S larger up to that instance; if it is negative, this means we can shrink G up until there.

7. In equation 2.12, we summed up the squares of the differences between the actual value and the estimated value. This error function is the one most frequently used, but it is one of several possible error functions. Because it sums up the squares of the differences, it is not robust to outliers. What would be a better error function to implement robust regression?
As we see in Chapter 4, the squared error corresponds to assuming that there is Gaussian noise. If the noise comes from a distribution with long tails, then summing up squared differences causes a few far-away points, i.e., outliers, to corrupt the fitted line. To decrease the effect of outliers, we can sum up the absolute values of the differences instead of squaring them:
$$E(g|X) = \sum_{t=1}^{N} |r^t - g(x^t)|$$
but note that we lose differentiability. Support vector regression, which we discuss in Chapter 10, uses an error function (see equation 10.61 and figure 10.13) which uses the absolute difference and also has a term that neglects the error due to very small differences.

8. Derive equation 2.16.
We take the derivative of the sum of squared errors with respect to the two parameters, set them equal to 0, and solve these two equations in two unknowns:
$$E(w_1, w_0|X) = \sum_{t=1}^{N} \left[ r^t - (w_1 x^t + w_0) \right]^2$$
$$\frac{\partial E}{\partial w_0} = 0 \;\Rightarrow\; \sum_t \left[ r^t - (w_1 x^t + w_0) \right] = 0$$
$$\sum_t r^t = w_1 \sum_t x^t + N w_0$$
$$w_0 = \sum_t r^t / N - w_1 \sum_t x^t / N = \bar{r} - w_1 \bar{x}$$
$$\frac{\partial E}{\partial w_1} = 0 \;\Rightarrow\; \sum_t \left[ r^t - (w_1 x^t + w_0) \right] x^t = 0$$
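The two zero-derivative conditions above can be checked numerically. Substituting $w_0 = \bar{r} - w_1 \bar{x}$ into the condition for $w_1$ gives $w_1 = (\sum_t x^t r^t - N \bar{x} \bar{r}) / (\sum_t (x^t)^2 - N \bar{x}^2)$. The following is a minimal Python sketch under that substitution, not the manual's own code; the function names `fit_line`, `squared_error`, and `absolute_error` are hypothetical and chosen only for illustration. It also evaluates the two error functions from exercise 7 so their sensitivity to an outlier can be compared.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], r: Sequence[float]) -> Tuple[float, float]:
    """Return (w1, w0) minimizing the sum of squared errors of r ~ w1 * x + w0."""
    n = len(x)
    if n != len(r) or n < 2:
        raise ValueError("x and r must have the same length (at least 2)")
    x_bar = sum(x) / n
    r_bar = sum(r) / n
    # Denominator: sum_t (x^t)^2 - N * x_bar^2; it is zero when all x are equal.
    denom = sum(xt * xt for xt in x) - n * x_bar * x_bar
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Slope from the dE/dw1 = 0 condition after substituting w0 = r_bar - w1 * x_bar.
    w1 = (sum(xt * rt for xt, rt in zip(x, r)) - n * x_bar * r_bar) / denom
    # Intercept from the dE/dw0 = 0 condition.
    w0 = r_bar - w1 * x_bar
    return w1, w0


def squared_error(w1: float, w0: float, x: Sequence[float], r: Sequence[float]) -> float:
    """Sum of squared differences between r^t and w1 * x^t + w0."""
    return sum((rt - (w1 * xt + w0)) ** 2 for xt, rt in zip(x, r))


def absolute_error(w1: float, w0: float, x: Sequence[float], r: Sequence[float]) -> float:
    """Sum of absolute differences between r^t and w1 * x^t + w0."""
    return sum(abs(rt - (w1 * xt + w0)) for xt, rt in zip(x, r))


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    rs = [1.0, 3.0, 5.0, 7.0]  # points lying exactly on r = 2x + 1
    print(fit_line(xs, rs))
```

For a fixed line, a single point whose residual is 10 adds 100 to the squared error but only 10 to the absolute error, which is the sense in which the absolute-value error is less dominated by outliers.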
