Question

Transcribed Text

Introduction to Machine Learning Programming Project

For the seventh and final programming assignment, you will implement a reinforcement learner and apply it to the racetrack problem. The racetrack problem is a standard control problem. The goal is to control the movement of a race car along a pre-defined racetrack. You want your race car to get from the starting line to the finish line in a minimum amount of time. In this version of the problem, your agent will be the only racer on the track, so it is more like a time trial than a full competitive race.

More formally, at each time step the state of the agent can be encoded in four variables: x_t and y_t are the x and y coordinates corresponding to the location of the car at time step t (we treat the racetrack as being laid out on a Cartesian grid). The variables ẋ_t and ẏ_t represent the x and y components of the car's velocity at time t. The control variables for the car are a_x and a_y, which represent the x and y components of an acceleration vector to be applied at the current time step. The system is governed by an approximation of the standard laws of kinematics:

x_t ≡ x position
y_t ≡ y position
ẋ_t = x_t − x_{t−1} ≡ x speed
ẏ_t = y_t − y_{t−1} ≡ y speed
a_t^x = ẍ_t = ẋ_t − ẋ_{t−1} ≡ x acceleration
a_t^y = ÿ_t = ẏ_t − ẏ_{t−1} ≡ y acceleration

At any given time step, your car only has active control over the values of a_x and a_y and must use these control variables to influence the car's state. This essentially gives you the ability to accelerate, decelerate, and turn. There is a further restriction that the set of values that may be assigned to a control variable is −1, 0, and 1; that is, a_{x,y} ∈ {−1, 0, 1}. The velocity of your car at any given time is limited to ẋ_t, ẏ_t ∈ [−5, +5]. Any attempt to accelerate or decelerate beyond these limits will be ignored.

As an example, if at time t = 0 your car is at location (2, 2) with velocity (1, 0), it is essentially moving towards the east. If you apply an acceleration of (1, 1), then at time step t = 1 your position will be (4, 3) and your velocity will be (2, 1). At each time step, your acceleration is applied to your velocity before your position is updated. Strictly speaking, this is not a good model of Newtonian physics, but we're dealing with integer position and velocity values, so some simplifying abstractions need to be made.

If this were the extent of the problem, it would be quite easy to solve for any given track, so to make things a bit more interesting we are going to add in a small amount of non-determinism. To do this, we assign a probability to each attempted action for each possible outcome. For this assignment, we specify that for any attempt to accelerate there is a 20% chance that the attempt will simply fail, and the velocity will remain unchanged at the next time step. Thus, at each time step, the probability of accelerating as specified is 80% and the probability of having no acceleration is 20%.

In this problem, there is an additional requirement that you stay on the track; crashing into the wall is bad. You will experiment with two different versions of how "bad" it is to crash. The first variant says that, if the car crashes into a wall, it is placed at the nearest position on the track to the place where it crashed, and its velocity is set to (0, 0). The second, harsher variant says that when a car crashes, its position is set back to the original starting position and its velocity is zeroed. Essentially, in this latter case, if you crash you have to start over from the beginning.
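For illustration only (this code is not part of the original assignment), a minimal Python sketch of the transition model described above: acceleration components in {−1, 0, 1}, a 20% chance the acceleration is ignored, per-component velocity limits of ±5 enforced by discarding out-of-range updates, and walls/crashes left out for brevity.

import random

V_MIN, V_MAX = -5, 5     # per-component velocity limits
P_SUCCEED = 0.8          # probability that the chosen acceleration is applied

def step(x, y, vx, vy, ax, ay):
    # One simplified kinematic time step (walls and crashes ignored).
    # ax, ay are each in {-1, 0, 1}. With probability 0.2 the acceleration
    # fails and the velocity is unchanged; an update that would push a
    # velocity component outside [-5, 5] is ignored for that component.
    # The (possibly updated) velocity is then added to the position.
    if random.random() < P_SUCCEED:
        if V_MIN <= vx + ax <= V_MAX:
            vx += ax
        if V_MIN <= vy + ay <= V_MAX:
            vy += ay
    return x + vx, y + vy, vx, vy

With the numbers from the assignment's example, step(2, 2, 1, 0, 1, 1) yields position (4, 3) and velocity (2, 1) whenever the acceleration succeeds.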
You should implement both variants in your program so that you can experiment with what effects they have on the strategies your car learns. Since you have a limited time to run experiments, however, you are only expected to do a side-by-side comparison of the two definitions of a "crash" on the R-shaped track (in the file R-track.txt). For the other tracks, use the version where crashing stops the car but leaves it in the location where it crashed.

The cost function is 1 for each move, except when you reach the finish line. The finish line locations are absorbing states with cost 0. Since you are attempting to minimize cost, this translates to attempting to complete the race in as few time steps as possible. You do not need to stop on a finish state; it is sufficient to cross the finish line, so the "wall" behind the finish line really isn't a wall.

Here are the specific steps that need to be followed.

• Implement the racetrack problem as described above, including a racing simulator that takes a track specification (described below).
• Implement the Value Iteration algorithm and either the Q-learning or SARSA algorithm, and apply them to the racetrack problem (an illustrative sketch of the tabular Q-learning update appears after this list).
• For extra credit (for up to 10 additional points), implement both Q-learning and SARSA and apply them to the racetrack problem.
• For extra credit (for up to 20 additional points), implement TD-learning with function approximation (i.e., with a feedforward neural network) and apply it to the racetrack problem.
• Test your algorithms on two of the three provided tracks in the data files described below. One of the tracks must be the R-track, but you can choose between the L-track and the O-track. The files are named L-track.txt, O-track.txt, and R-track.txt and can be found within Blackboard. Test your algorithms on each of the two crash scenarios. It is up to you to determine appropriate design parameters such as learning rate, discount factor, exploration strategy, etc.
• Use the data files for the tracks provided, represented in ASCII. The first line lists the size of the track as a comma-delimited pair ⟨rows, cols⟩. The rest of the file is a grid of the specified dimensions with one character at each point. The legend for the files is:
  – S – This square is on the starting line.
  – F – This square is on the finish line.
  – . – This square is open racetrack.
  – # – This square is off the racetrack (i.e., a wall).
  As a simple example:
  5,5
  FF###
  ..###
  ..###
  ....S
  ....S
• Run experiments, keeping track of the number of training iterations and the number of steps the race car needs to take before finding the finish line. Make sure you run at least 10 experiments for each track (preferably more) to get reasonable statistics. Also generate data so you can plot learning curves.
• Write a paper that incorporates the following elements, summarizing the results of your experiments. Make sure you explain the experimental setup, the tuning process, and the final parameters used for each algorithm.
  1. Title and author names
  2. A brief, one-paragraph abstract summarizing the results of the experiments
  3. Problem statement, including hypothesis (how do you expect the algorithms to do?)
  4. Description of algorithms implemented
  5. Description of your experimental approach
  6. Presentation of the results of your experiments
  7. A discussion of the behavior of your algorithms, combined with any conclusions you can draw
  8. Summary
  9. References
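As a further illustration (not part of the assignment text), the tabular Q-learning backup the assignment asks for can be written in a few lines of Python. The names Q, alpha, gamma, and epsilon below are placeholders rather than anything taken from the graded solution; Q is assumed to be a dict keyed by (state, action) pairs holding estimated cost-to-go values.

import random

def q_learning_update(Q, state, action, cost, next_state, actions,
                      alpha=0.2, gamma=0.9):
    # One tabular Q-learning backup in the cost-minimization setting.
    # The target uses the best (minimum-cost) action in the next state,
    # which is what distinguishes Q-learning from SARSA; SARSA would use
    # the action actually taken in the next state instead of the minimum.
    best_next = min(Q[(next_state, a)] for a in actions)
    target = cost + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise pick the lowest-cost action.
    if random.random() < epsilon:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(state, a)])

Note that the solution preview below instead works with a reward of −1 per step and maximizes value; the two formulations are equivalent.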

Solution Preview


import numpy as np
import random
import os
import time
import pyprind
import matplotlib.pyplot as plt


class Racetrack:
   
    """
    A class that implements a reinforcement learning and applying it to the racetrack problem.
    The class implements diiferent algorithms that solves the problem:
    1) Value Iteration
    2) Q-learning
    3) SARSA
    4) TD- with function approximation
    """
   
    def __init__(self, v_min=-5, v_max=5, gamma=0.9, action_probab=0.8, acceleration = (-1,0,1), learning_rate=0.2):
      
       self.actions = [(i,j) for j in acceleration for i in acceleration] # a list of all possible actions
       self.gamma = gamma # discount rate
       self.v_min = v_min # minimum velocity
       self.v_max = v_max # maximum velocity
       self.velocities = np.arange(v_min, v_max+1, 1) # list of all possible velocities
       self.action_probab = action_probab # the probability of accelerating
       self.learning_rate = learning_rate # learning rate (for Q-learning and SARSA)
       self.threshold = 0.02 # if the change of Q-value is below the threshold, we can assume that it is stabilized
      
       self.number_of_iterations = 50
      
       # keep track if the car gets stuck
       self.y_stuck = 0
       self.x_stuck = 0
       self.stuck_counter = 0
      
       self.print_racetrack_ = False
       self.number_of_steps = []
      
      
    def load_track(self):
       """
       method that reads the reactrack
       racetrack is stored as 2D numpy array
       """
       self.track = []
       # open the file
       with open(self.track_path) as file:
            track_lines = file.readlines()[1:] # skip the first line
            # iterate over all lines
            for line in track_lines:
                line = line.strip('\n')
                self.track.append(list(line))
       self.track = np.asarray(self.track)
   
    def start_position(self):
       """
       method that randomly selects a starting position
       """
       start_positions = list(zip(*np.where(self.track == 'S')))
       self.y, self.x = random.choice(start_positions)
   
    def final_positions(self):
       """
       method that creates a list of final positions
       """
       positions = list(zip(*np.where(self.track == 'F')))
       self.final = np.asarray(positions)
   
    def update_velocities(self, action):
       """
       method that updates velocities
       """
       v_y_temp = self.v_y + action[0]
       v_x_temp = self.v_x + action[1]
      
       # velocity of the car is limited
       # update the velocity only if it is within limit
       if abs(v_x_temp) <= self.v_max:
            self.v_x = v_x_temp
       if abs(v_y_temp) <= self.v_max:
            self.v_y = v_y_temp
            
    def within_track(self):
       """
       function that checks if the current coordinates of the car are within the environment
       """
       if ((self.y >= self.track.shape[0] or self.x >= self.track.shape[1]) or
            (self.y<0 or self.x<0)):
            return False
       return True
      
      
    def update_state(self, action, probability):
       """
       method that updates the state of the environment, i.e. updates the position and velocity of the car
       """
      
       # the probability of accelerating is 0.8
       if np.random.uniform() < probability:
            self.update_velocities(action) # update velocity
      
      
       y_temp, x_temp = self.y, self.x
      
       # update position
       self.x += self.v_x
       self.y += self.v_y
      
       """"
       prevent the car to go through the wall, so that if "#" character (wall) is between
       the current and the next position of the car, do not update position of the car
       """
       if self.within_track() and self.track[self.y, self.x] != "#":
            if self.v_y == 0:
                if "#" in self.track[y_temp, min(self.x, x_temp):max(self.x, x_temp)].ravel():
                   self.x = x_temp
                   self.v_y, self.v_x = 0, 0
                  
            elif self.v_x == 0:
                if "#" in self.track[min(self.y, y_temp):max(self.y, y_temp), self.x].ravel():
                   self.y = y_temp
                   self.v_y, self.v_x = 0, 0
                  
            elif self.v_x == self.v_y:
                if "#" in self.track[min(self.y, y_temp):max(self.y, y_temp), min(self.x, x_temp):max(self.x, x_temp)]:
                   self.x, self.y = x_temp, y_temp
                   self.v_y, self.v_x = 0, 0
            else:
                if "#" in self.track[min(self.y, y_temp):max(self.y, y_temp), min(self.x, x_temp):max(self.x, x_temp)].ravel():
                   self.x, self.y = x_temp, y_temp
                   self.v_y, self.v_x = 0, 0
               
       # if the car crashes into the wall, call method return_to_track
       if not self.within_track() or self.track[self.y, self.x] == "#":
            self.return_to_track()

      
    def return_to_track(self):
       """
       method that returns the car to the racetrack when it crashes into the wall
       there are two scenarios:
       1) return the car to the starting position
       2) return the car to the nearest open cell (where it crashed)
      
       """
      
       # return the car to the nearest open cell (just before where it crashed)
       if self.start_from == "nearest_position":
            # go back to the position before crash
            self.x += -self.v_x
            self.y += -self.v_y
            
            
             # interleave unit moves: 1 = step in x, 0 = step in y, so the crashed
             # move can be retraced one cell at a time until the wall is reached
             L = []
            for k in range(abs(self.v_x)):
                L.append(1)
            for k in range(abs(self.v_y)):
                L.insert(2*k+1, 0)
            
            for i in L:
                if i:
                   self.x += np.sign(self.v_x)
                   if self.within_track():
                        if self.track[self.y, self.x] == "#":
                            self.x += -np.sign(self.v_x)
                            break
                else:
                   self.y += np.sign(self.v_y)
                   if self.within_track():
                        if self.track[self.y, self.x] == "#":
                            self.y += -np.sign(self.v_y)
                            break

       elif self.start_from == "starting_position":
            self.start_position()
      
       # set car velocity to zero
       self.v_y, self.v_x = 0, 0
      
    def is_stuck(self):
       """
       check if the car has gotten stuck:
       if the car has not moved for 4 steps, return True, else return False
       """
       if (self.y_stuck == self.y and self.x_stuck == self.x):
            self.stuck_counter += 1
            self.y_stuck = self.y
            self.x_stuck = self.x
            if self.stuck_counter >= 4:
                return True
       else:
            self.stuck_counter = 0
            self.y_stuck = self.y
            self.x_stuck = self.x
      
       return False
   
               
    def value_iteration_train(self):
       """
       method that implements Value iteration algorithm
       """
       num_of_iterations = 50
       print("Algorithm: Value Iteration")
       print("Number of iterations:", self.episodes)
       print("\nProgress:\n")
      
       # initialize a progress bar object that allows visualization of the computation's progress
       bar = pyprind.ProgBar(self.episodes)
      
       for iteration in range(self.episodes):
            # iterate over all possible states
            for y in range(self.track.shape[0]):
                for x in range(self.track.shape[1]):
                   for v_y in self.velocities:
                        for v_x in self.velocities:
                            if self.track[y, x] == '#':
                               self.V[y, x, v_y, v_x] = -10
                               continue
                           
                            self.y, self.x, self.v_y, self.v_x = y, x, v_y, v_x
                           
                            for a_index, a in enumerate(self.actions):
                               if self.track[y, x] == "F":
                                    self.reward = 0
                               else:
                                    self.reward = -1
                              
                               self.y, self.x, self.v_y, self.v_x = y, x, v_y, v_x
                               # update state
                               self.update_state(a, 1)
                               new_state = self.V[self.y, self.x, self.v_y, self.v_x]
                              
                               self.y, self.x, self.v_y, self.v_x = y, x, v_y, v_x
                               self.update_state((0, 0), 1)
                               new_state_failed = self.V[self.y, self.x, self.v_y, self.v_x]

                               expected_value = self.action_probab*new_state +\
                               (1-self.action_probab)*new_state_failed
                               self.Q[y, x, v_y, v_x, a_index] = self.reward + self.gamma*expected_value                     

                            self.V[y, x, v_y, v_x] = np.max(self.Q[y, x, v_y, v_x])
                           
            self.Q[self.final[:, 0], self.final[:, 1], :, :, :] = 0
            self.V[self.final[:, 0], self.final[:, 1], :, :] = 0...
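Purely as an illustrative sketch (not part of the original solution preview), the class above could be driven roughly as follows. The attributes track_path, start_from, episodes, V, and Q appear in the preview but their setup code is not shown, so their initialization here is an assumption.

# Hypothetical usage sketch; the full solution presumably performs this setup internally.
rt = Racetrack(gamma=0.9, learning_rate=0.2)
rt.track_path = "L-track.txt"        # assumed path to a track file
rt.start_from = "nearest_position"   # crash variant: restart at the crash site
rt.episodes = 50                     # number of value-iteration sweeps (assumed attribute)

rt.load_track()
rt.final_positions()

rows, cols = rt.track.shape
n_vel = len(rt.velocities)           # 11 velocity values per component (-5 .. 5)
n_act = len(rt.actions)              # 9 possible acceleration vectors
rt.V = np.zeros((rows, cols, n_vel, n_vel))
rt.Q = np.zeros((rows, cols, n_vel, n_vel, n_act))

rt.value_iteration_train()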