# Machine Learning Assignment

## Transcribed Text

### Instructions for the assignment

In this assignment, we have programming and writing questions. Q4 is a bonus for both undergraduate and graduate students.

To switch a cell between code and markdown, see the menu: Cell -> Cell Type. You can type LaTeX equations directly in a markdown cell. Typing your answers in LaTeX/Markdown is required for all written questions; handwritten answers will not be accepted. If a question requires a picture, you can use the syntax `<img src="" style="width: 300px;" />` to include it within your IPython notebook.

### Using the autograder

You will find two assignments on Gradescope that correspond to HW2: "HW2 - Programming" and "HW2 - Non-programming". You will submit your code for the autograder on "HW2 - Programming" in the following format:

- kmeans.py
- cleandata.py
- gmm.py
- semisupervised.py

All you have to do is copy your implementations of the classes `Kmeans`, `GMM`, `CleanData`, and `SemiSupervised` into the respective files. We provided you the different .py files with the libraries already imported; please DO NOT remove those lines, and add your code after them. Note that these are the only libraries you are allowed to use for the homework. You may make as many submissions as you like until the deadline. Additionally, note that the autograder tests each function separately, so it can serve as a useful debugging tool if you are not sure which part of your implementation has an issue.

For the "HW2 - Non-programming" part, you will download your Jupyter notebook as HTML and submit it on Gradescope. To download the notebook as HTML, click on "File" in the top-left corner of this page and select "Download as > HTML". The non-programming part corresponds to Q2, Q3.3 (both your response/code and the images generated by your implementation) and Q4.2.

### 0. Set up

This notebook is tested under Python 3.6.8, and the corresponding packages can be downloaded from miniconda.
You may also want to get familiar with several packages: jupyter notebook, numpy, matplotlib.

Please implement the functions that contain `raise NotImplementedError`, and after you finish coding, delete or comment out the `raise NotImplementedError` line.

In [6]:

```python
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

%matplotlib inline
import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm

print('Version information')
print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))

# Set random seed so output is all same
np.random.seed(1)

# Load image
import imageio
```

Output:

```
Version information
python: 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
matplotlib: 2.2.2
numpy: 1.14.3
```

### 1. KMeans Clustering [5 + 30 + 10 + 5 pts]

KMeans tries to solve the following optimization problem:

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2$$

where one needs to partition the $N$ observations into $K$ clusters $S = \{S_1, S_2, \ldots, S_K\}$, and each cluster $S_i$ has $\mu_i$ as its center.

#### 1.1 Pairwise distance [5 pts]

In this section, you are asked to implement the `pairwise_dist` function. Given $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times D}$, obtain the pairwise distance matrix $\text{dist} \in \mathbb{R}^{N \times M}$ using the Euclidean distance metric, where $\text{dist}_{i,j} = \|X_i - Y_j\|_2$.

DO NOT USE A FOR LOOP in your implementation -- loops are slow and will make your code too slow to pass our grader. Use array broadcasting instead.

#### 1.2 KMeans implementation [30 pts]

In this section, you are asked to implement `_init_centers` [5 pts], `_update_assignment` [10 pts], `_update_centers` [10 pts] and the `_get_loss` function [5 pts].
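As a reference for the broadcasting idea in 1.1, one common vectorized approach (a sketch under stated assumptions, not necessarily the intended solution; the function name here is illustrative) expands $\|x_i - y_j\|^2 = \|x_i\|^2 + \|y_j\|^2 - 2\, x_i \cdot y_j$:

```python
import numpy as np

def pairwise_dist_sketch(x, y):
    """Euclidean distances between rows of x (N x D) and rows of y (M x D), no loops."""
    # ||x_i - y_j||^2 = ||x_i||^2 + ||y_j||^2 - 2 * x_i . y_j, via broadcasting
    sq = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2.0 * x @ y.T
    # clip tiny negatives caused by floating-point round-off before the sqrt
    return np.sqrt(np.maximum(sq, 0.0))
```

The `[:, None]` / `[None, :]` indexing broadcasts the squared norms into an N x M grid, so no explicit loop is needed.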
For the function signature, please see the corresponding docstrings.

#### 1.3 Find the optimal number of clusters [10 pts]

In this section, you are asked to implement the `find_optimal_num_clusters` function. You will use the elbow method to find the optimal number of clusters.

In [2]:

```python
class KMeans(object):

    def __init__(self):  # No need to implement
        pass

    def pairwise_dist(self, x, y):  # [5 pts]
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist2[i, j] is the euclidean distance between
                x[i, :] and y[j, :]
        """
        raise NotImplementedError

    def _init_centers(self, points, K, **kwargs):  # [5 pts]
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            kwargs: any additional arguments you want
        Return:
            centers: K x D numpy array, the centers.
        """
        raise NotImplementedError

    def _update_assignment(self, centers, points):  # [10 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            points: NxD numpy array, the observations
        Return:
            cluster_idx: numpy array of length N, the cluster assignment for each point

        Hint: You could call the pairwise_dist() function.
        """
        raise NotImplementedError

    def _update_centers(self, old_centers, cluster_idx, points):  # [10 pts]
        """
        Args:
            old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
        """
        raise NotImplementedError

    def _get_loss(self, centers, cluster_idx, points):  # [5 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            loss: a single float number, which is the objective function of KMeans.
        """
        raise NotImplementedError

    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations (Hint: You could change it when debugging)
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
            kwargs: any additional arguments you want
        Return:
            cluster assignments: Nx1 int numpy array
            cluster centers: K x D numpy array, the centers
            loss: final loss value of the objective function of KMeans
        """
        centers = self._init_centers(points, K, **kwargs)
        for it in range(max_iters):
            cluster_idx = self._update_assignment(centers, points)
            centers = self._update_centers(centers, cluster_idx, points)
            loss = self._get_loss(centers, cluster_idx, points)
            K = centers.shape[0]
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            if verbose:
                print('iter %d, loss: %.4f' % (it, loss))
        return cluster_idx, centers, loss

    def find_optimal_num_clusters(self, data, max_K=15):  # [10 pts]
        """Plots loss values for different numbers of clusters in K-Means

        Args:
            data: input data array
            max_K: number of clusters
        Return:
            None (plot loss values against number of clusters)
        """
        np.random.seed(1)
        raise NotImplementedError
```

#### 1.4 COVID-19 clustering [5 pts]

In this section, we are going to use our KMeans algorithm to cluster the COVID-19 dataset.
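For intuition, the two alternating updates in 1.2 can be sketched as follows. This is a minimal sketch, not the required class methods: the function names are illustrative, and it assumes no cluster ends up empty.

```python
import numpy as np

def assign_clusters(centers, points):
    """Assign each point to its nearest center (the assignment step of KMeans)."""
    # distance from every point to every center via broadcasting: N x K
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def recompute_centers(cluster_idx, points, K):
    """Move each center to the mean of its assigned points (the update step).

    Note: looping over the K clusters is cheap; the 'no for loop' requirement
    in this assignment applies to pairwise_dist.
    """
    return np.array([points[cluster_idx == k].mean(axis=0) for k in range(K)])
```

Iterating these two steps until the loss stops changing is exactly the loop implemented in the `__call__` method above.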
The size of the dataset is 187 x 3; it includes the number of confirmed cases and the death toll for COVID-19 up to May 20th, 2020. The three columns are:

- Countries
- The number of confirmed cases
- Death toll

We are going to do the clustering task for just two columns: the number of confirmed cases and the death toll. The reason we have countries in our dataset is for you to associate the names of countries with each cluster.

In [14]:

```python
# Helper function for checking the implementation of the pairwise_dist function.
# Please DO NOT change this function
# TEST CASE
x = np.random.randn(2, 2)
y = np.random.randn(3, 2)

print("*** Expected Answer ***")
print("""==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]""")

print("\n*** My Answer ***")
print("==x==")
print(x)
print("==y==")
print(y)
print("==dist==")
print(KMeans().pairwise_dist(x, y))
```

Output:

```
*** Expected Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]

*** My Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]
```

In [15]:

```python
# Helper function for reading the .csv file. You don't need to change this function
def read_file():
    data = np.zeros((187, 2))
    countries = []
    cnt = 0
    with open(r'covid19_confirmed_deaths_052020.csv', 'r') as f:
        for line in f:
            country, confirmed, death = line.split(',')
            data[cnt, :] = [confirmed, death]
            countries.append(country)
            cnt += 1
    return data, countries
```

In [16]:

```python
# Helper function for visualizing cluster results. You don't have to modify it
# If there are more than ten countries in the cluster, we are only going to show
# the first 10 countries as examples.
def visualize(cluster_idx, centers, K, name_list):
    num_list = [np.sum(np.array(cluster_idx) == i) for i in range(0, K)]
    x = list(range(len(num_list)))
    total_width, n = 0.8, 2
    width = total_width / n
    plt.figure(figsize=(10, 5))
    plt.title('Visualization for ' + str(K) + ' clusters', fontdict={'fontsize': 18})
    plt.bar(x, num_list, width=width, label='number', tick_label=name_list, fc='orchid')
    plt.legend()
    for i in range(0, K):
        print('{0}: Average confirmed: {1:.2f}, Average Deathtoll: {2:.2f}.'.format(
            name_list[i], centers[i][0], centers[i][1]))
        data = list(np.array(countries)[np.where(cluster_idx == i)])
        print('Total number of countries in {0}: {1}'.format(name_list[i], len(data)))
        if len(data) > 10:
            data = data[:10]
        print(('{} ' * len(data)).format(*data))
        print('\n')
    plt.show()
```

In [9]:

```python
# Helper function for running the algorithm for K=2 and K=5. You don't have to modify it
name_list0 = ['Cluster 1', 'Cluster 2']
name_list1 = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']

data, countries = read_file()
cluster_idx2, centers2, loss2 = KMeans()(data, 2)
visualize(cluster_idx2, centers2, 2, name_list0)
cluster_idx5, centers5, loss5 = KMeans()(data, 5)
visualize(cluster_idx5, centers5, 5, name_list1)
```

Output:

```
Cluster 1: Average confirmed: 18495.95, Average Deathtoll: 1260.88.
Total number of countries in Cluster 1: 186
Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria

Cluster 2: Average confirmed: 1551853.00, Average Deathtoll: 93439.00.
Total number of countries in Cluster 2: 1
US

Cluster 1: Average confirmed: 3126.38, Average Deathtoll: 106.68.
Total number of countries in Cluster 1: 157
Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria

Cluster 2: Average confirmed: 238005.43, Average Deathtoll: 21997.29.
Total number of countries in Cluster 2: 7
Brazil France Germany Italy Russia Spain UnitedKingdom

Cluster 3: Average confirmed: 110274.17, Average Deathtoll: 4776.17.
Total number of countries in Cluster 3: 6
Canada China India Iran Peru Turkey

Cluster 4: Average confirmed: 1551853.00, Average Deathtoll: 93439.00.
Total number of countries in Cluster 4: 1
US

Cluster 5: Average confirmed: 38857.69, Average Deathtoll: 2196.06.
Total number of countries in Cluster 5: 16
Bangladesh Belarus Belgium Chile Ecuador Ireland Mexico Netherlands Pakistan Portugal
```

In [9]:

```python
# Helper function for visualizing the elbow method result. You don't have to modify it
KMeans().find_optimal_num_clusters(data)
```

Out[9]:

```
[2807829821020.6094,
 460767789772.7204,
 87862229057.59686,
 41815322446.12942,
 25909441342.445744,
 22273270978.877415,
 14721982198.41217,
 13070933147.453726,
 12974926155.699297,
 12953444351.317476,
 12943241394.102552,
 12941966467.803501,
 12941490396.39252,
 12941384242.601124,
 12941334190.063885]
```

### 2. EM Algorithm [20 pts]

#### 2.1 Performing the EM algorithm [20 pts for CS 4641; 10 points for CS 7641]

The EM algorithm is a widely used approach to learning in the presence of unobserved variables. Consider the general framework of the EM algorithm: given a joint distribution $P(x, z \mid \theta)$ over observed variables $x$ and a hidden variable $z$ with distribution $q(z)$, governed by a parameter $\theta$, the goal is to maximize the likelihood function, which satisfies:

$$\log P(x \mid \theta) = \log\Big(\sum_z P(x, z \mid \theta)\Big) = \log\Big(\sum_z q(z)\,\frac{P(x, z \mid \theta)}{q(z)}\Big) \geq \sum_z q(z) \log \frac{P(x, z \mid \theta)}{q(z)} = \sum_z q(z) \log \frac{P(z \mid x, \theta)\, P(x \mid \theta)}{q(z)}$$

The inequality is based on Jensen's inequality.

**2.1.1.** Please provide a brief description of how to perform the E step in the above equation.

**2.1.2.** Please provide a brief description of how to perform the M step in the above equation.
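As a reminder, stated here only for reference, the two steps that questions 2.1.1 and 2.1.2 ask about can be written as:

```latex
% E step: with \theta^{old} fixed, choose q(z) to make the lower bound tight,
% which sets q equal to the posterior over the hidden variable:
q(z) \;=\; P(z \mid x, \theta^{old})

% M step: with q(z) fixed, maximize the lower bound over \theta:
\theta^{new} \;=\; \arg\max_{\theta} \sum_{z} q(z) \log P(x, z \mid \theta)
```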
Suppose

$$F(q, \theta) = \sum_z q(z) \log \frac{P(x, z \mid \theta)}{q(z)}$$

During the lecture, the expression $\sum_z q(z) \log \frac{P(x, z \mid \theta)}{q(z)}$ was explained through the sum of entropy and log-likelihood; the same expression can also be explained using KL-divergence.

**2.1.3.** Please derive that from the above equation (Hint: use the product rule for joint probability) and explain what happens to the KL term in the E step.

#### 2.2 EM algorithm in a coin toss problem [10 pts for CS 7641; 10 points bonus for CS 4641]

Suppose we have a bunch of coins $C$ consisting of three kinds of coins. Mathematically, it obeys a mixed Bernoulli distribution:

$$X \sim F = \pi_1 F_1 + \pi_2 F_2 + (1 - \pi_1 - \pi_2) F_3$$

where $\pi_i \in [0, 1]$, and $F_1 = \mathrm{Ber}(p_1)$, $F_2 = \mathrm{Ber}(p_2)$, $F_3 = \mathrm{Ber}(p_3)$. That is to say, each coin belongs to $F_1$, $F_2$ or $F_3$. Here $\mathrm{Ber}(p)$ means the coin gives 1 (head) with probability $p$ and 0 (tail) with probability $1 - p$. We initialized the parameters as $p_1 = \frac{1}{2}$, $p_2 = \frac{5}{6}$, $p_3 = \frac{1}{3}$, $\pi_1 = \frac{1}{4}$, $\pi_2 = \frac{1}{2}$.

Now, we draw 3 coins $X_1$, $X_2$, $X_3$ independently from $C$ and run 6 independent trials for each of them. The results are:

| | $X_1$ | $X_2$ | $X_3$ |
|---|---|---|---|
| Trial 1 | 0 | 1 | 1 |
| Trial 2 | 0 | 1 | 1 |
| Trial 3 | 1 | 0 | 1 |
| Trial 4 | 0 | 1 | 1 |
| Trial 5 | 1 | 0 | 1 |
| Trial 6 | 1 | 0 | 0 |

**2.2.1.** Use the EM algorithm for one step to update $F(p_1 = \frac{1}{2}, p_2 = \frac{5}{6}, p_3 = \frac{1}{3}, \pi_1 = \frac{1}{4}, \pi_2 = \frac{1}{2})$ to $F'(p_1', p_2', p_3', \pi_1', \pi_2')$. Write down your EM algorithm and show the values of $p_1', p_2', p_3', \pi_1', \pi_2'$. (Round the answers to three decimal places.) (Hint: $\theta^{new} = \arg\max_{\theta} \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$)

**2.2.2.** Can you explain why we are getting these values in 2.2.1? What will the values be if we run more EM steps? (Hint: for example, why are the values increasing/decreasing? Will the values be stable if we run more steps? No need to calculate the actual numbers.)

### 3. GMM Implementation [40+10 pts]

GMM uses MLE to optimize its parameters. It approximates the distribution of the data using a set of Gaussian distributions.
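Tying back to Q2.2.1 above, the one-step EM update for that Bernoulli mixture can be sketched numerically. This is an illustrative sketch, not a worked answer: the initial values below ($p_1 = \frac{1}{2}$, $p_2 = \frac{5}{6}$, $p_3 = \frac{1}{3}$, $\pi_1 = \frac{1}{4}$, $\pi_2 = \frac{1}{2}$) are one reading of the problem statement and should be double-checked; the head counts (3, 3, 5 out of 6 trials) come from the trial table.

```python
import numpy as np

# heads observed for X1, X2, X3 over n = 6 trials each (from the trial table)
heads = np.array([3, 3, 5])
n = 6

p = np.array([1/2, 5/6, 1/3])    # initial p1, p2, p3 (assumed reading of the problem)
pi = np.array([1/4, 1/2, 1/4])   # pi1, pi2, and pi3 = 1 - pi1 - pi2

# E step: responsibility of component k for coin i,
# gamma[i, k] proportional to pi_k * p_k^h_i * (1 - p_k)^(n - h_i)
lik = pi * p**heads[:, None] * (1 - p)**(n - heads[:, None])
gamma = lik / lik.sum(axis=1, keepdims=True)

# M step: responsibility-weighted head frequency and updated mixture weights
p_new = (gamma * heads[:, None]).sum(axis=0) / (n * gamma.sum(axis=0))
pi_new = gamma.mean(axis=0)

print(np.round(p_new, 3), np.round(pi_new, 3))
```

The same E/M structure (responsibilities via Bayes' rule, then weighted-average parameter updates) reappears in the GMM below, with Gaussian densities in place of Bernoulli likelihoods.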
Given $N$ samples $X = [x_1, x_2, \ldots, x_N]$, we are asked to find $K$ diagonal Gaussian distributions to model the data $X$:

$$\max_{\{\mu_k, \sigma_k\}_{k=1}^{K}} \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi(k)\, \mathcal{N}(x_i \mid \mu_k, \sigma_k)\right)$$

For undergraduate students: you may assume the covariance matrix is a diagonal matrix, which means the features are independent (i.e., the red intensity of a pixel is independent of its blue intensity, etc.).

For graduate students: please assume a full covariance matrix.

Hints:

1. Here $\pi(\cdot)$ is the prior of the latent variable; it is also called the mixture coefficient. To make it simple, we assume $\pi(k) = \frac{1}{K}, \forall k = 1, 2, \ldots, K$.
2. As we create our model, we will need to use a multivariate Gaussian, since our pixels are 3-dimensional vectors corresponding to red, green, and blue color intensities. It means that for each image, you need to convert it into an N x 3 matrix, where N is the number of pixels and 3 is the number of features. The following example from a machine learning textbook may be helpful.
3. In this question, each pixel has three features, which are R, G, and B.
4. In the EM steps, gamma means $\tau(z_k)$ from our GMM slides, which is called the responsibility. If we have $K$ components, each data point (pixel) will have $K$ responsibility values. The $\tau(z_k)$ matrix has size $n \times 1$; for this homework, you will work with $\tau(z)$, which has size $n \times k$, meaning you have all the responsibility values in one matrix.
5. For the E step, we already get the log-likelihood in the `_ll_joint()` function. In the formula from our slides,

$$\tau(z_k) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)},$$

`_ll_joint()` gives you the numerator terms in log space. Thus, you should be able to finish the E step with just a few lines of code by using `_ll_joint()` and the `softmax()` function defined below.

#### 3.1 Helper functions

To facilitate some of the operations in the GMM implementation, we would like you to implement the following two helper functions.
In these functions, "logit" refers to an input array of size $N \times D$.

**softmax**: Given $\text{logit} \in \mathbb{R}^{N \times D}$, calculate $\text{prob} \in \mathbb{R}^{N \times D}$, where $\text{prob}_{i,j} = \frac{\exp(\text{logit}_{i,j})}{\sum_{d=1}^{D} \exp(\text{logit}_{i,d})}$. Note that it is possible for $\text{logit}_{i,j}$ to be very large, making $\exp(\cdot)$ of it explode. To keep the computation numerically stable, you may need to subtract the maximum of each row of `logits`. As in calculating pairwise distances, DO NOT USE A FOR LOOP.

**logsumexp**: Given $\text{logit} \in \mathbb{R}^{N \times D}$, calculate $s \in \mathbb{R}^{N}$, where $s_i = \log\left(\sum_{j=1}^{D} \exp(\text{logit}_{i,j})\right)$. Again, pay attention to the numerical problem. You may want to use a similar trick as in the softmax function. DO NOT USE A FOR LOOP.

#### 3.2 GMM implementation [40 pts]

In [17]:

```python
class GMM(object):

    def __init__(self):  # No need to implement
        pass

    def softmax(self, logits):  # [5pts]
        """
        Args:
            logits: N x D numpy array
        """
        raise NotImplementedError

    def logsumexp(self, logits):  # [5pts]
        """
        Args:
            logits: N x D numpy array
        Return:
            s: N x 1 array where s[i,0] = logsumexp(logits[i,:])
        """
        raise NotImplementedError

    def _init_components(self, points, K, **kwargs):  # [5pts]
        """
        Args:
            points: NxD numpy array, the observations
            K: number of components
            kwargs: any other args you want
        Return:
            pi: numpy array of length K, prior
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
                You will have a KxDxD numpy array for the full covariance matrix case.
        """
        raise NotImplementedError

    def _ll_joint(self, points, pi, mu, sigma, **kwargs):  # [10pts]
        """
        Args:
            points: NxD numpy array, the observations
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
                You will have a KxDxD numpy array for the full covariance matrix case.
        Return:
            ll(log-likelihood): NxK array, where ll(i, j) = log pi(j) +
                log NormalPDF(points_i | mu[j], sigma[j])

        Hint for undergraduates: Assume that each dimension of our multivariate
        gaussian is independent. This allows you to treat it as a product of
        univariate gaussians.
        """
        raise NotImplementedError

    def _E_step(self, points, pi, mu, sigma, **kwargs):  # [5pts]
        """
        Args:
            points: NxD numpy array, the observations
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
                You will have a KxDxD numpy array for the full covariance matrix case.
        Return:
            gamma(tau): NxK array, the posterior distribution (a.k.a. the soft
                cluster assignment) for each observation.

        Hint: You should be able to do this with just a few lines of code by using
        _ll_joint() and softmax() defined above.
        """
        raise NotImplementedError

    def _M_step(self, points, gamma, **kwargs):  # [10pts]
        """
        Args:
            points: NxD numpy array, the observations
            gamma(tau): NxK array, the posterior distribution (a.k.a. the soft
                cluster assignment) for each observation.
        Return:
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
                You will have a KxDxD numpy array for the full covariance matrix case.

        Hint: There are formulas in the slides.
        """
        raise NotImplementedError

    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            kwargs: any additional arguments you want
        Return:
            gamma(tau): NxK array, the posterior distribution (a.k.a. the soft
                cluster assignment) for each observation.
            (pi, mu, sigma): (1xK np array, KxD numpy array, KxDxD numpy array)

        Hint: You do not need to change it. For each iteration, we process the
        E and M steps, then update the loss.
        """
        pi, mu, sigma = self._init_components(points, K, **kwargs)
        pbar = tqdm(range(max_iters))
        for it in pbar:
            # E-step
            gamma = self._E_step(points, pi, mu, sigma)

            # M-step
            pi, mu, sigma = self._M_step(points, gamma)

            # calculate the negative log-likelihood of observation
            joint_ll = self._ll_joint(points, pi, mu, sigma)
            loss = -np.sum(self.logsumexp(joint_ll))
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            pbar.set_description('iter %d, loss: %.4f' % (it, loss))
        return gamma, (pi, mu, sigma)
```

#### 3.3 Japanese art and pixel clustering [10 pts]

Ukiyo-e is a Japanese art genre predominant from the 17th through 19th centuries. In order to produce the intricate prints that came to represent the genre, artists carved wood blocks with the patterns for each color in a design. Paint would be applied to the block and later transferred to the print to form the image.

In this section, you will use your GMM implementation to do pixel clustering and estimate how many wood blocks were likely used to produce a single print. (Hint: you can justify your answer based on visual inspection of the resulting images or on a different metric of your choosing.)

You do NOT need to submit your code for this question to the autograder.
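As a reference for the numerical-stability trick described in 3.1, the max-subtraction idea can be sketched as follows (a sketch with illustrative names; the class methods themselves are left for you to implement):

```python
import numpy as np

def softmax_sketch(logits):
    """Row-wise softmax with the max-subtraction trick for numerical stability."""
    # subtracting the row maximum leaves the result unchanged but keeps exp() finite
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

def logsumexp_sketch(logits):
    """Row-wise log-sum-exp using the same shift; returns an N x 1 array."""
    m = np.max(logits, axis=1, keepdims=True)
    return m + np.log(np.sum(np.exp(logits - m), axis=1, keepdims=True))
```

With helpers like these, the E step reduces to a softmax over the joint log-likelihood matrix, as the hint in 3.2 suggests.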
Instead, you should include whatever images/information you find relevant in the report.

In [11]:

```python
# helper function for performing pixel clustering. You don't have to modify it
def cluster_pixels_gmm(image, K):
    """Clusters pixels in the input image

    Args:
        image: input image of shape(H, W, 3)
        K: number of components
    Return:
        clustered_img: image of shape(H, W, 3) after pixel clustering
    """
    im_height, im_width, im_channel = image.shape
    flat_img = np.reshape(image, [-1, im_channel]).astype(np.float32)
    gamma, (pi, mu, sigma) = GMM()(flat_img, K=K, max_iters=100)
    cluster_ids = np.argmax(gamma, axis=1)
    centers = mu
    gmm_img = np.reshape(centers[cluster_ids], (im_height, im_width, im_channel))
    return gmm_img

# helper function for plotting images. You don't have to modify it
def plot_images(img_list, title_list, figsize=(20, 10)):
    assert len(img_list) == len(title_list)
    fig, axes = plt.subplots(1, len(title_list), figsize=figsize)
    for i, ax in enumerate(axes):
        ax.imshow(img_list[i] / 255.0)
        ax.set_title(title_list[i])
        ax.axis('off')
```

In [12]:

```python
# pick 2 of the images in this list:
url0 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Kuniyoshi_Utagawa%2C_Suikoden_Series_4.jpg/320px-Kuniyoshi_Utagawa%2C_Suikoden_Series_4.jpg'
url1 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Shibai_Ukie_by_Masanobu_Okumura.jpg/640px-Shibai_Ukie_by_Masanobu_Okumura.jpg'
url2 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Flickr_-_%E2%80%A6trialsanderrors_-_Utamaro%2C_Kushi_%28Comb%29%2C_ca._1785.jpg/388px-Flickr_-_%E2%80%A6trialsanderrors_-_Utamaro%2C_Kushi_%28Comb%29%2C_ca._1785.jpg'
url3 = 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Geisha_Playing_the_Hand-Game_Kitsune-ken_%28%E7%8B%90%E6%8B%B3%29%2C_a_Japanese_rock-paper-scissors_variant%2C_by_Kikukawa_Eizan_%28%E8%8F%8A%E5%B7%9D%E8%8B%B1%E5%B1%B1%29.jpg/640px-Geisha_Playing_the_Hand-Game_Kitsune-ken_%28%E7%8B%90%E6%8B%B3%29%2C_a_Japanese_rock-paper-scissors_variant%2C_by_Kikukawa_Eizan_%28%E8%8F%8A%E5%B7%9D%E8%8B%B1%E5%B1%B1%29.jpg'

# example of loading image from url0
image = imageio.imread(imageio.core.urlopen(url0).read())

# this is for you to implement
def find_n_woodblocks(image, min_clusters=5, max_clusters=15):
    """Uses the GMM implementation to estimate the number of wood blocks

    Args:
        image: input image of shape(H, W, 3)
        K: number of components
    Return:
        plot: comparison between the original image and the image after pixel
            clustering (you can use the helper function)
        optional: any other information/metric/plot you think is necessary.
    """
    raise NotImplementedError
```

Output:

```
iter 99, loss: 3846331.6228: 100%|██████████| 100/100 [01:08<00:00, 1.46it/s]
iter 99, loss: 3436594.8308: 100%|██████████| 100/100 [02:16<00:00, 1.37s/it]
iter 99, loss: 2409300.5302: 100%|██████████| 100/100 [00:50<00:00, 1.99it/s]
iter 99, loss: 2228146.3128: 100%|██████████| 100/100 [01:33<00:00, 1.06it/s]
```

### 4. (Bonus for All) Messy, messy data and semi-supervised learning [30 pts]

(Preamble: This part of the assignment was designed to expose you to interesting topics we did not cover in class, while allowing you to do minimal work by reusing most of your previous implementations with some modifications.)

Two students at Georgia Tech want to improve the safety of composite Lithium-ion batteries by leveraging data obtained from quality control tests and machine learning. They ordered several battery specimens, rated as safe or unsafe, from various manufacturers. They proceeded to measure the chemical stability, mechanical resistance and charging rate of each specimen. When the campus shutdown was announced in Spring 2020, the students rushed to the lab to try and collect the hard disks where the data had been stored.
After settling back in their hometowns, they compiled the dataset and formatted it such that each row corresponds to the characterization results of a specimen, organized as follows:

Chemical stability, mechanical resistance, charging rate, [safe/unsafe]

They soon realized they have two major problems:

- They only have the safe/unsafe labels for characterization tests performed on batteries from one manufacturer (20% of the data), while the labels are missing for all specimens from other manufacturers.
- Due to a number of corrupt files, several of the labeled tests (30%) are missing some characterization data, which they marked as NaN in their dataset.

The students are aware that the few data points with complete information do not reflect the overall variance of the data. They realize they can exclude neither the remaining unlabeled data nor the messy labeled data. Your job is to assist the students in cleaning their data and implementing a semi-supervised learning framework to help them create a general classifier.

To help you with this task, the students shared four datasets:

- Labeled_materials_complete.txt: contains the complete material characterization data and corresponding labels (safe = 1 and unsafe = 0);
- Labeled_materials_incomplete.txt: contains partial material characterization data and corresponding labels (safe = 1 and unsafe = 0);
- Unlabeled_materials.txt: contains only complete material characterization results;
- Independent_materials.txt: a labeled dataset the students obtained from a previous student in the laboratory, which you can use to test your model after training.

#### 4.1 Data cleaning with k-NN [10 pts]

The first step in this task is to clean the Labeled_materials_incomplete dataset by filling in the missing values with probable ones derived from the complete data. A useful approach to this type of problem is a k-nearest neighbors (k-NN) algorithm.
For this application, the method consists of replacing the missing value of a given point with the mean of the closest k neighbors to that point.

In [54]:

```python
class CleanData(object):

    def __init__(self):  # No need to implement
        pass

    def pairwise_dist(self, x, y):  # [0pts] - copy from kmeans
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist2[i, j] is the euclidean distance between
                x[i, :] and y[j, :]
        """
        raise NotImplementedError

    def __call__(self, incomplete_points, complete_points, K, **kwargs):  # [10pts]
        """
        Args:
            incomplete_points: N_incomplete x (D+1) numpy array, the incomplete labeled observations
            complete_points: N_complete x (D+1) numpy array, the complete labeled observations
            K: integer, corresponding to the number of nearest neighbors you want to base your calculation on
            kwargs: any other args you want
        Return:
            clean_points: (N_incomplete + N_complete) x (D+1) numpy array, containing
                both the complete points and the recently filled points

        Hints:
            (1) You want to find the k-nearest neighbors within each class separately;
            (2) There are missing values in all of the features. It might be more
                convenient to address one feature at a time.
        """
        raise NotImplementedError
```

In [68]:

```python
complete_data = np.array([[1., 2., 3., 1], [7., 8., 9., 0], [16., 17., 18., 1], [22., 23., 24., 0]])
incomplete_data = np.array([[1., np.nan, 3., 1], [7., np.nan, 9., 0], [np.nan, 17., 18., 1], [np.nan, 23., 24., 0]])
clean_data = CleanData()(incomplete_data, complete_data, 2)

print("*** Expected Answer - k = 2 ***")
print("""==complete data==
[[ 1.  2.  3.  1.]
 [ 7.  8.  9.  0.]
 [16. 17. 18.  1.]
 [22. 23. 24.  0.]]
==incomplete data==
[[ 1. nan  3.  1.]
 [ 7. nan  9.  0.]
 [nan 17. 18.  1.]
 [nan 23. 24.  0.]]
==clean_data==
[[ 1.   2.   3.   1. ]
 [ 7.   8.   9.   0. ]
 [16.  17.  18.   1. ]
 [22.  23.  24.   0. ]
 [14.5 23.  24.   0. ]
 [ 7.  15.5  9.   0. ]
 [ 8.5 17.  18.   1. ]
 [ 1.   9.5  3.   1. ]]""")

print("\n*** My Answer - k = 2 ***")
print(clean_data)
```

Output:

```
*** Expected Answer - k = 2 ***
==complete data==
[[ 1.  2.  3.  1.]
 [ 7.  8.  9.  0.]
 [16. 17. 18.  1.]
 [22. 23. 24.  0.]]
==incomplete data==
[[ 1. nan  3.  1.]
 [ 7. nan  9.  0.]
 [nan 17. 18.  1.]
 [nan 23. 24.  0.]]
==clean_data==
[[ 1.   2.   3.   1. ]
 [ 7.   8.   9.   0. ]
 [16.  17.  18.   1. ]
 [22.  23.  24.   0. ]
 [14.5 23.  24.   0. ]
 [ 7.  15.5  9.   0. ]
 [ 8.5 17.  18.   1. ]
 [ 1.   9.5  3.   1. ]]

*** My Answer - k = 2 ***
[[ 1.   2.   3.   1. ]
 [ 7.   8.   9.   0. ]
 [16.  17.  18.   1. ]
 [22.  23.  24.   0. ]
 [14.5 23.  24.   0. ]
 [ 7.  15.5  9.   0. ]
 [ 8.5 17.  18.   1. ]
 [ 1.   9.5  3.   1. ]]
```

#### 4.2 Getting acquainted with semi-supervised learning approaches [5 pts]

You will implement a version of the algorithm presented in Table 1 of the paper "Text Classification from Labeled and Unlabeled Documents using EM" by Nigam et al. (2000). While you are recommended to read the whole paper, this assignment focuses on sections 1-5.2 and 6.1. Write a brief summary of three interesting highlights of the paper (50-word maximum).

#### 4.3 Implementing the EM algorithm [10 pts]

In your implementation of the EM algorithm proposed by Nigam et al. (2000) in Table 1, you will use a Gaussian Naive Bayes (GNB) classifier as opposed to a naive Bayes (NB) classifier. (Hint: Using a GNB in place of an NB will enable you to reuse most of the implementation you developed for GMM in this assignment. In fact, you can successfully solve the problem by simply modifying the `__call__` method.)

In [76]:

```python
class SemiSupervised(object):

    def __init__(self):  # No need to implement
        pass

    def softmax(self, logits):  # [0 pts] - can use same as for GMM
        """
        Args:
            logits: N x D numpy array
        """
        raise NotImplementedError

    def logsumexp(self, logits):  # [0 pts] - can use same as for GMM
        """
        Args:
            logits: N x D numpy array
        Return:
            s: N x 1 array where s[i,0] = logsumexp(logits[i,:])
        """
        raise NotImplementedError

    def _init_components(self, points, K, **kwargs):  # [5 pts] - modify from GMM
        """
        Args:
            points: Nx(D+1) numpy array, the observations
            K: number of components
            kwargs: any other args you want
        Return:
            pi: numpy array of length K, prior
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.

        Hint: The paper describes how you should initialize your algorithm.
        """
        raise NotImplementedError

    def _ll_joint(self, points, pi, mu, sigma, **kwargs):  # [0 pts] - can use same as for GMM
        """
        Args:
            points: NxD numpy array, the observations
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
        Return:
            ll(log-likelihood): NxK array, where ll(i, j) = log pi(j) +
                log NormalPDF(points_i | mu[j], sigma[j])

        Hint: Assume that the three properties of the lithium-ion batteries
        (multivariate gaussian) are independent. This allows you to treat it as
        a product of univariate gaussians.
        """
        raise NotImplementedError

    def _E_step(self, points, pi, mu, sigma, **kwargs):  # [0 pts] - can use same as for GMM
        """
        Args:
            points: NxD numpy array, the observations
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.
        Return:
            gamma: NxK array, the posterior distribution (a.k.a. the soft cluster
                assignment) for each observation.

        Hint: You should be able to do this with just a few lines of code by using
        _ll_joint() and softmax() defined above.
        """
        raise NotImplementedError

    def _M_step(self, points, gamma, **kwargs):  # [0 pts] - can use same as for GMM
        """
        Args:
            points: NxD numpy array, the observations
            gamma: NxK array, the posterior distribution (a.k.a. the soft cluster
                assignment) for each observation.
        Return:
            pi: np array of length K, the prior of each component
            mu: KxD numpy array, the center for each gaussian.
            sigma: KxDxD numpy array, the diagonal standard deviation of each gaussian.

        Hint: There are formulas in the slides.
        """
        raise NotImplementedError

    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, **kwargs):  # [5 pts] - modify from GMM
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            kwargs: any additional arguments you want
        Return:
            gamma: NxK array, the posterior distribution (a.k.a. the soft cluster
                assignment) for each observation.
            (pi, mu, sigma): (1xK np array, KxD numpy array, KxD numpy array), mu and sigma.
        """
        raise NotImplementedError
```

#### 4.4 Demonstrating the performance of the algorithm [5 pts]

Compare the classification error of the Gaussian Naive Bayes (GNB) classifier you implemented following the Nigam et al. (2000) approach to the performance of a GNB classifier trained using only the labeled data. Since you have not covered supervised learning in class, you are allowed to use the scikit-learn library for training the GNB classifier based only on labeled data: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
