Question
Build search for the top 1000 IMDB movies. Users should be able to search for different aspects of the movie (e.g. director name) and get back the set of movies related to it. For instance, in my implementation:
1. Searching for “spielberg” returned the following list: ["Schindler's List","Saving Private Ryan","Raiders of the Lost Ark","Indiana Jones and the Last Crusade","Jurassic Park","Catch Me If You Can","Jaws","E.T. the Extra-Terrestrial","Empire of the Sun","The Color Purple","Minority Report","Close Encounters of the Third Kind","Bridge of Spies","Indiana Jones and the Temple of Doom","Munich"]
2. Searching for "spielberg hanks" returns: ["Saving Private Ryan","Catch Me If You Can","Bridge of Spies"]. This is the list of all movies associated with both Spielberg and Hanks.
Components
1. Crawl: You will need to crawl the IMDB listing pages (there are multiple pages) and all the movies linked off of the listing pages.
2. Parse: You will need to parse the pages to extract the right information.
3. Search database: “Index” the information in some way. Given the scale of the problem is not large, you can use a simple in-memory data structure.
4. Expose a simple API that returns the movie names given a search term.
Non Goals
This is a very open-ended problem. The goal is not to build a comprehensive solution. Feel free to take decisions on the scope of what you want to accomplish. You can take as much or as little time on the exercise but my recommendation would be spending 3-5 hours at the most. Some decision you can make on the scope:
• Relevance and ranking are not in scope.
• Your data structure should be optimized for lookup time (take into account the number of lookups). But you don't have to worry about memory scale at this point.
• There is no UX needed. A simple API would do.
Feel free to make your own decisions on the scope.
Helpful tips
• Most languages provide libraries for extracting information from web pages using CSS selectors (e.g. BeautifulSoup for python)
• Chrome provides you the ability to look at the CSS selector for a given element on the page
Deliverables
• Code for this exercise along with instructions on how to run it.
• Thoughts on some simplifying assumptions you made and why.
• Some thoughts on If you had more time, how you would improve the solution from the perspective of software architecture, scale, performance or quality.
Solution Preview
These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice. Unethical use is strictly forbidden.
movies_result = list()movie_count = 0
for i in range(1, 21):
print("Processing page: {0} \n".format(i))
url_get = requests.get(main_url.format(i))
html_soup = BeautifulSoup(url_get.content, 'html5lib')
movie_list = html_soup.select('.lister-item')
for movie in movie_list:
movie_count += 1
div = movie.select('.col-title')[0]
span = div.select('span')[0]
a_tag = div.select('a')[0]
link = a_tag.get('href')
title = a_tag.text
movie_dict = dict()
movie_dict["title"] = title
movie_dict["url"] = "http://www.imdb.com" + link
movie_dict["rank"] = movie_count
print("")
print(u"Processing {0}: {1}".format(movie_count, movie_dict["title"]))
url_get = requests.get(movie_dict["url"])...
By purchasing this solution you'll be able to access the following files:
Solution.zip.