Web Scraping with Python and BeautifulSoup

Published: 2019-05-11


To source data for data science projects, you’ll often rely on databases, APIs, or ready-made CSV data sets.

The problem is that you can’t always find a data set on your topic, databases are not kept current, and APIs are either expensive or have usage limits.

If the data you’re looking for is on a web page, however, then the solution to all these problems is web scraping.

In this tutorial we’ll learn to scrape multiple web pages with Python using requests and BeautifulSoup. We’ll then perform some simple analysis using pandas and matplotlib.

You should already have some basic understanding of HTML, a good grasp of Python’s basics, and a rough idea of what web scraping is. If you are not comfortable with these, work through an introductory tutorial on those topics first.

Scraping data for over 2000 movies

We want to analyze the distributions of IMDB and Metascore movie ratings to see if we find anything interesting. To do this, we’ll first scrape data for over 2000 movies.

It’s essential to identify the goal of our scraping right from the beginning. Writing a scraping script can take a lot of time, especially if we want to scrape more than one web page. We want to avoid spending hours writing a script which scrapes data we won’t actually need.

Working out which pages to scrape

Once we’ve established our goal, we then need to identify an efficient set of pages to scrape.

We want to find a combination of pages that requires a relatively small number of requests. A request is what happens whenever we access a web page: we ‘request’ the content of a page from the server. The more requests we make, the longer our script will need to run, and the greater the strain on the server.

One way to get all the data we need is to compile a list of movie names, and use it to access the web page of each movie on both the IMDB and Metacritic websites.

Since we want to get over 2000 ratings from both IMDB and Metacritic, we’ll have to make at least 4000 requests. If we make one request per second, our script will need a little over an hour to make 4000 requests. Because of this, it’s worth trying to identify more efficient ways of obtaining our data.

If we explore the IMDB website, we can discover a way to halve the number of requests. Metacritic scores are shown on the IMDB movie page, so we can scrape both ratings with a single request:

If we investigate the IMDB site further, we can discover the page shown below. It contains all the data we need for 50 movies. Given our aim, this means we’ll only have to make about 40 requests, which is 100 times fewer than our first option. Let’s explore this last option further.

Identifying the URL structure

Our challenge now is to make sure we understand the logic of the URL as the pages we want to scrape change. If we can’t understand this logic well enough to implement it in code, we’ll reach a dead end.

If you go to IMDB’s advanced search page, you can browse movies by year:

Let’s browse by year 2017, sort the movies on the first page by number of votes, then switch to the next page. We’ll arrive at a results page with the following URL:

In the image above, you can see that the URL has several parameters after the question mark:

  • release_date – Shows only the movies released in a specific year.
  • sort – Sorts the movies on the page. sort=num_votes,desc translates to sorting by number of votes in descending order.
  • page – Specifies the page number.
  • ref_ – Takes us to the next or the previous page. The reference is the page we are currently on. adv_nxt and adv_prv are two possible values. They translate to advance to the next page, and advance to the previous page, respectively.

If you navigate through those pages and observe the URL, you will notice that only the values of the parameters change. This means we can write a script to match the logic of the changes and make far fewer requests to scrape our data.

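For example, here is a minimal sketch of how such a URL could be assembled by filling in only the parameters that vary (the specific values below are illustrative):

base_url = 'http://www.imdb.com/search/title'
# Only release_date and page need to change between requests
url = base_url + '?release_date={}&sort=num_votes,desc&page={}'.format(2017, 1)
print(url)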

Let’s start writing the script by requesting the content of this single web page: http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1. In the following code cell we will:

  • Import the get() function from the requests module.
  • Assign the address of the web page to a variable named url.
  • Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
  • Print a small part of response’s content by accessing its .text attribute (response is now a Response object).
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
print(response.text[:500])

Understanding the HTML structure of a single page

As you can see from the first line of response.text, the server sent us an HTML document. This document describes the overall structure of that web page, along with its specific content (which is what makes that particular page unique).

All the pages we want to scrape have the same overall structure. This implies that they also have the same overall HTML structure. So, to write our script, it will suffice to understand the HTML structure of only one page. To do that, we’ll use the browser’s Developer Tools.

In your browser, right-click on a web page element that interests you, and then click Inspect. This will take you right to the HTML line that corresponds to that element:

Right-click on the movie’s name, and then left-click Inspect. The HTML line highlighted in gray corresponds to what the user sees on the web page as the movie’s name.

You can do this in the DevTools of any major browser.

Notice that all of the information for each movie, including the poster, is contained in a div tag.

There are a lot of HTML lines nested within each div tag. You can explore them by clicking the little gray arrows on the left of the HTML lines corresponding to each div. Within these nested tags we’ll find the information we need, like a movie’s rating.

There are 50 movies shown per page, so there should be a div container for each. Let’s extract all these 50 containers by parsing the HTML document from our earlier request.

Using BeautifulSoup to parse the HTML content

To parse our HTML document and extract the 50 div containers, we’ll use a Python module called BeautifulSoup, the most common web scraping module for Python.

In the following code cell we will:

  • Import the BeautifulSoup class creator from the package bs4.
  • Parse response.text by creating a BeautifulSoup object, and assign this object to html_soup. The 'html.parser' argument indicates that we want to do the parsing using Python’s built-in HTML parser.
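
A minimal sketch of that cell, following the description above (the output of the type check is shown below):

from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)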
bs4.BeautifulSoup

Before extracting the 50 div containers, we need to figure out what distinguishes them from other div elements on that page. Often, the distinctive mark resides in the class attribute. If you inspect the HTML lines of the containers of interest, you’ll notice that the class attribute has two values: lister-item and mode-advanced. This combination is unique to these div containers. We can see that’s true by doing a quick search (Ctrl + F). We have 50 such containers, so we expect to see only 50 matches:

Now let’s use the find_all() method to extract all the div containers that have a class attribute of lister-item mode-advanced:

movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50

find_all() returned a ResultSet object which is a list containing all the 50 divs we are interested in.

Now we’ll select only the first container, and extract, in turn, each item of interest:

  • The name of the movie.
  • The year of release.
  • The IMDB rating.
  • The Metascore.
  • The number of votes.

We can access the first container, which contains information about a single movie, by using list notation on movie_containers.

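A sketch of that step, using the movie_containers list created above (the rendered content of the container is shown below):

first_movie = movie_containers[0]
first_movie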

1.Logan(2017)

R|137 min|Action, Drama, Sci-Fi

8.3
8.3/10
77 Metascore

In the near future, a weary Logan cares for an ailing Professor X somewhere on the Mexican border. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.

Director:James Mangold| Stars:Hugh Jackman, Patrick Stewart, Dafne Keen, Boyd Holbrook

Votes:320,428| Gross:$226.26M

As you can see, the HTML content of one container is very long. To find out the HTML line specific to each data point, we’ll use DevTools once again.

The name of the movie

We begin with the movie’s name, and locate its corresponding HTML line by using DevTools. You can see that the name is contained within an anchor tag (<a>). This tag is nested within a header tag (<h3>). The <h3> tag is nested within a <div> tag. This <div> is the third of the divs nested in the container of the first movie. We stored the content of this container in the first_movie variable.

first_movie is a Tag object, and the various HTML tags within it are stored as its attributes. We can access them just like we would access any attribute of a Python object. However, using a tag name as an attribute will only select the first tag by that name. If we run first_movie.div, we only get the content of the first div tag:

first_movie.div

Accessing the first anchor tag (<a>) doesn’t take us to the movie’s name. The first <a> is somewhere within the second div:

However, accessing the first <h3> tag brings us very close:

first_movie.h3

1.Logan(2017)

From here, we can use attribute notation to access the first <a> inside the <h3> tag:

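A one-line sketch of that lookup:

first_movie.h3.a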

Now it’s all just a matter of accessing the text from within that <a> tag:

first_name = first_movie.h3.a.text
first_name

The year of the movie’s release

We move on with extracting the year. This data is stored within the <span> tag below the <a> that contains the name.

Dot notation will only access the first span element. We’ll search by the distinctive mark of the second <span>. We’ll use the find() method, which is almost the same as find_all(), except that it only returns the first match. In fact, find() is equivalent to find_all(limit = 1). The limit argument restricts the output to the first match.

The distinguishing mark consists of the values lister-item-year text-muted unbold assigned to the class attribute. So we look for the first <span> with these values within the <h3> tag:

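A sketch of that search, reusing the first_movie container from above (the matching span is shown below):

first_year = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold')
first_year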

(2017)

From here, we just access the text using attribute notation:

first_year = first_year.text
first_year

We could easily clean that output and convert it to an integer. But if you explore more pages, you will notice that for some movies the year takes unpredictable values like (2017)(I) or (2015)(V). It’s more efficient to do the cleaning after the scraping, when we’ll know all the year values.

The IMDB rating

We now focus on extracting the IMDB rating of the first movie.

There are a couple of ways to do that, but we’ll first try the easiest one. If you inspect the IMDB rating using DevTools, you’ll notice that the rating is contained within a <strong> tag.

Let’s use attribute notation, and hope that the first <strong> will also be the one that contains the rating.

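A quick check (its rendered content is shown below):

first_movie.strong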

8.3

Great! We’ll access the text, convert it to the float type, and assign it to the variable first_imdb:

first_imdb = float(first_movie.strong.text)
first_imdb

The Metascore

If we inspect the Metascore using DevTools, we’ll notice that we can find it within a span tag.

Attribute notation clearly isn’t a solution. There are many <span> tags before that. You can see one right above the <strong> tag. We’d better use the distinctive values of the class attribute (metascore favorable).

Note that if you copy-paste those values from DevTools, there will be two whitespace characters between metascore and favorable. Make sure there is only one whitespace character when you pass the values as an argument to the class_ parameter. Otherwise, find() won’t find anything.
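
A sketch of that lookup and conversion, again using the first_movie container (the variable name first_mscore is illustrative):

first_mscore = first_movie.find('span', class_ = 'metascore favorable')
first_mscore = int(first_mscore.text)
print(first_mscore)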

The favorable value indicates a high Metascore and sets the rating’s background color to green. The other two possible values are unfavorable and mixed. What is specific to all Metascore ratings, though, is only the metascore value. This is the one we are going to use when we write the script for the entire page.


The number of votes

The number of votes is contained within a <span> tag. Its distinctive mark is a name attribute with the value nv.

The name attribute is different from the class attribute. Using BeautifulSoup we can access elements by any attribute. The find() and find_all() functions have a parameter named attrs. To this we can pass the attributes and values we are searching for as a dictionary:

first_votes = first_movie.find('span', attrs = {'name': 'nv'})
first_votes
320,428

We could use .text notation to access the <span> tag’s content. It would be better, though, if we accessed the value of the data-value attribute. This way we can convert the extracted data point to an int without having to strip a comma.

You can treat a Tag object just like a dictionary. The HTML attributes are the dictionary’s keys. The values of the HTML attributes are the values of the dictionary’s keys. This is how we can access the value of the data-value attribute:

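For example:

first_votes['data-value']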

Let’s convert that value to an integer, and assign it to first_votes:

first_votes = int(first_votes['data-value'])

That’s it! We’re now in a position to easily write a script for scraping a single page.

The script for a single page

Before piecing together what we’ve done so far, we have to make sure that we’ll extract the data only from the containers that have a Metascore.

We need to add a condition to skip movies without a Metascore.

Using DevTools again, we see that the Metascore section is contained within a <div> tag. The class attribute has two values: inline-block and ratings-metascore. The distinctive one is clearly ratings-metascore.

We can use find() to search each movie container for a div having that distinct mark. When find() doesn’t find anything, it returns a None object. We can use this result in an if statement to control whether a movie is scraped.

Let’s search the page for a movie container that doesn’t have a Metascore, and see what find() returns.

Important: when I ran the following code, the eighth container didn’t have a Metascore. However, this is a moving target, because the number of votes constantly changes for each movie. To get the same outputs as in the next demonstrative code cell, you should look for a container that doesn’t have a Metascore at the time you’re running the code.
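
A sketch of that check (index 7 matched the eighth container when this was written; the variable name is illustrative, and you should adjust the index to a container without a Metascore when you run it):

eighth_movie_mscore = movie_containers[7].find('div', class_ = 'ratings-metascore')
type(eighth_movie_mscore)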

Now let’s put together the code above, and compress it as much as possible, but only insofar as it’s still easily readable. In the next code block we:

  • Declare some list variables to have something to store the extracted data in.
  • Loop through each container in movie_containers (the variable which contains all the 50 movie containers).
  • Extract the data points of interest only if the container has a Metascore.
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:

    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:

        # The name
        name = container.h3.a.text
        names.append(name)

        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)

        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))

        # The number of votes
        vote = container.find('span', attrs = {'name': 'nv'})['data-value']
        votes.append(int(vote))

Let’s check the data collected so far. pandas makes it easy for us to see whether we’ve scraped our data successfully.
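
A sketch of that check, collecting the lists into a DataFrame (the variable name test_df is illustrative; the info() output and the table follow below):

import pandas as pd

test_df = pd.DataFrame({'movie': names,
                        'year': years,
                        'imdb': imdb_ratings,
                        'metascore': metascores,
                        'votes': votes})
print(test_df.info())
test_df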

RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
imdb         32 non-null float64
metascore    32 non-null int64
movie        32 non-null object
votes        32 non-null int64
year         32 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.3+ KB
None

    imdb  metascore  movie                                              votes   year
0    8.3         77  Logan                                             320428  (2017)
1    8.1         67  Guardians of the Galaxy Vol. 2                    175443  (2017)
2    8.1         76  Wonder Woman                                      152067  (2017)
3    7.7         75  John Wick: Chapter 2                              140784  (2017)
4    7.5         65  Beauty and the Beast                              137713  (2017)
5    7.8         84  Get Out                                           136435  (I) (2017)
6    6.8         62  Kong: Skull Island                                112012  (2017)
7    7.0         56  The Fate of the Furious                            97690  (2017)
8    6.8         65  Alien: Covenant                                    88697  (2017)
9    6.7         54  Life                                               80897  (I) (2017)
10   7.0         39  Pirates of the Caribbean: Dead Men Tell No Tales   77268  (2017)
11   6.6         52  Ghost in the Shell                                 68521  (2017)
12   7.4         75  The LEGO Batman Movie                              61263  (2017)
13   5.2         42  xXx: Return of Xander Cage                         50697  (2017)
14   4.6         33  Fifty Shades Darker                                50022  (2017)
15   7.4         67  T2 Trainspotting                                   48134  (2017)
16   6.3         44  Power Rangers                                      44733  (2017)
17   5.8         34  The Mummy                                          34171  (2017)
18   6.4         50  The Boss Baby                                      32976  (2017)
19   6.6         43  A Dog’s Purpose                                    29528  (2017)
20   4.5         25  Rings                                              20913  (2017)
21   5.8         37  Baywatch                                           20147  (2017)
22   6.4         33  The Space Between Us                               19044  (I) (2017)
23   5.3         28  Transformers: The Last Knight                      17727  (2017)
24   6.1         56  War Machine                                        16740  (2017)
25   5.7         37  Fist Fight                                         16445  (2017)
26   7.7         60  Gifted                                             14819  (2017)
27   7.0         75  I Don’t Feel at Home in This World Anymore         14281  (2017)
28   5.5         34  Sleepless                                          13776  (III) (2017)
29   6.3         55  The Discovery                                      13207  (2017)
30   6.4         58  Before I Fall                                      13016  (2017)
31   8.5         26  The Ottoman Lieutenant                             12868  (2017)

Everything went just as expected!

As a side note, if you run the code from a country where English is not the main language, it’s very likely that you’ll get some of the movie names translated into the main language of that country.

Most likely, this happens because the server infers your location from your IP address. Even if you are located in a country where English is the main language, you may still get translated content. This may happen if you’re using a VPN while you’re making the GET requests.

If you run into this issue, pass the following values to the headers parameter of the get() function:

headers = {"Accept-Language": "en-US, en;q=0.5"}

This tells the server something like “I want the linguistic content in American English (en-US). If en-US is not available, then other types of English (en) would be fine too (but not as much as en-US).” The q parameter indicates the degree to which we prefer a certain language. If not specified, the value is set to 1 by default, as in the case of en-US. You can read more about this in the documentation for the Accept-Language header.

Now let’s start building the script for all the pages we want to scrape.

The script for multiple pages

Scraping multiple pages is a bit more challenging. We’ll build upon our one-page script by doing three more things:

  1. Making all the requests we want from within the loop.
  2. Controlling the loop’s rate to avoid bombarding the server with requests.
  3. Monitoring the loop while it runs.

We’ll scrape the first 4 pages of each year in the interval 2000-2017. 4 pages for each of the 18 years makes for a total of 72 pages. Each page has 50 movies, so we’ll scrape data for 3600 movies at most. But not all the movies have a Metascore, so the number will be lower than that. Even so, we are still very likely to get data for over 2000 movies.

Changing the URL’s parameters

As shown earlier, the URLs follow a certain logic as the web pages change.

As we are making the requests, we’ll only have to vary the values of two URL parameters: the release_date parameter and page. Let’s prepare the values we’ll need for the forthcoming loop. In the next code cell we will:

  • Create a list called pages, and populate it with the strings corresponding to the first 4 pages.
  • Create a list called years_url and populate it with the strings corresponding to the years 2000-2017.
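
A minimal sketch of that cell, building both lists as strings so they can be concatenated into URLs:

pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2000, 2018)]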

Controlling the crawl-rate

Controlling the rate of crawling is beneficial for us, and for the website we are scraping. If we avoid hammering the server with tens of requests per second, then we are much less likely to get our IP address banned. We also avoid disrupting the activity of the website we scrape by allowing the server to respond to other users’ requests too.

We’ll control the loop’s rate by using the sleep() function from Python’s time module. sleep() will pause the execution of the loop for a specified number of seconds.

To mimic human behavior, we’ll vary the amount of waiting time between requests by using the randint() function from Python’s random module. randint() randomly generates integers within a specified interval.

For now, let’s just import these two functions to prevent overcrowding in the code cell containing our main loop.

from time import sleep
from random import randint

Monitoring the loop as it’s still going

Given that we’re scraping 72 pages, it would be nice if we could find a way to monitor the scraping process as it’s still going. This feature is definitely optional, but it can be very helpful in the testing and debugging process. Also, the greater the number of pages, the more helpful the monitoring becomes. If you are going to scrape hundreds or thousands of web pages in a single code run, I would say that this feature becomes a must.

For our script, we’ll make use of this feature, and monitor the following parameters:

  • The frequency (speed) of requests, so we make sure our program is not overloading the server.
  • The number of requests, so we can halt the loop in case the number of expected requests is exceeded.
  • The status codes of our requests, so we make sure the server is sending back the proper responses.

To get a frequency value we’ll divide the number of requests by the time elapsed since the first request. This is similar to computing the speed of a car – we divide the distance by the time taken to cover that distance. Let’s experiment with this monitoring technique at a small scale first. In the following code cell we will:

  • Set a starting time using the time() function from the time module, and assign the value to start_time.
  • Assign 0 to the variable requests, which we’ll use to count the number of requests.
  • Start a loop, and then with each iteration:
    • Simulate a request.
    • Increment the number of requests by 1.
    • Pause the loop for a time interval between 1 and 3 seconds.
    • Calculate the elapsed time since the first request, and assign the value to elapsed_time.
    • Print the number of requests and the frequency.
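
A sketch of that experiment, reusing the sleep() and randint() imports from above (the printed output is shown below):

from time import time

start_time = time()
requests = 0

for _ in range(5):
    # A request would go here
    requests += 1
    sleep(randint(1, 3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))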
Request: 1; Frequency: 0.49947650463238624 requests/s
Request: 2; Frequency: 0.4996998027377252 requests/s
Request: 3; Frequency: 0.5995400143227362 requests/s
Request: 4; Frequency: 0.4997272043465967 requests/s
Request: 5; Frequency: 0.4543451628627026 requests/s

Since we’re going to make 72 requests, our work will look a bit untidy as the output accumulates. To avoid that, we’ll clear the output after each iteration, and replace it with information about the most recent request. To do that we’ll use the clear_output() function from IPython’s core.display module. We’ll set the wait parameter of clear_output() to True to wait with replacing the current output until some new output appears.

from IPython.core.display import clear_output

start_time = time()
requests = 0

for _ in range(5):
    # A request would go here
    requests += 1
    sleep(randint(1, 3))
    current_time = time()
    elapsed_time = current_time - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
Request: 5; Frequency: 0.6240351700607663 requests/s

The output above is the output you will see once the loop has run. Here’s what it looks like while it’s running.

To monitor the status code we’ll set the program to warn us if there’s something off. A successful request is indicated by a status code of 200. We’ll use the warn() function from the warnings module to throw a warning if the status code is not 200.
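
A minimal illustration of warn() (the message text is arbitrary; the resulting warning is shown below):

from warnings import warn

warn('Warning Simulation')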

/Users/joshuadevlin/.virtualenvs/everday-ds/lib/python3.4/site-packages/ipykernel/__main__.py:3: UserWarning: Warning Simulation  app.launch_new_instance()

We chose a warning over breaking the loop because there’s a good possibility we’ll scrape enough data, even if some of the requests fail. We will only break the loop if the number of requests is greater than expected.

Piecing everything together

Now let’s piece together everything we’ve done so far! In the following code cell, we start by:

  • Redeclaring the lists variables so they become empty again.
  • Preparing the monitoring of the loop.

Then, we’ll:

  • Loop through the years_url list to vary the release_date parameter of the URL.
  • For each element in years_url, loop through the pages list to vary the page parameter of the URL.
  • Make the GET requests within the pages loop (and give the headers parameter the right value to make sure we get only English content).
  • Pause the loop for a time interval between 8 and 15 seconds.
  • Monitor each request as discussed before.
  • Throw a warning for non-200 status codes.
  • Break the loop if the number of requests is greater than expected.
  • Convert the response’s HTML content to a BeautifulSoup object.
  • Extract all movie containers from this BeautifulSoup object.
  • Loop through all these containers.
  • Extract the data if a container has a Metascore.
# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For every year in the interval 2000-2017
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url +
                       '&sort=num_votes,desc&page=' + page, headers = headers)

        # Pause the loop
        sleep(randint(8, 15))

        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected
        if requests > 72:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)

                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                # Scrape the number of votes
                vote = container.find('span', attrs = {'name': 'nv'})['data-value']
                votes.append(int(vote))
Request:72; Frequency: 0.07928964663062842 requests/s

Nice! The scraping seems to have worked perfectly. The script ran for about 16 minutes.

Now let’s merge the data into a pandas DataFrame to examine what we’ve managed to scrape. If everything is as expected, we can move on with cleaning the data to get it ready for analysis.

Examining the scraped data

In the next code block we:

  • Merge the data into a pandas DataFrame.
  • Print some information about the newly created DataFrame.
  • Show the first 10 entries.
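
A sketch of that cell, using the movie_ratings name that the rest of the tutorial relies on (the info() output and the first 10 rows follow below):

import pandas as pd

movie_ratings = pd.DataFrame({'movie': names,
                              'year': years,
                              'imdb': imdb_ratings,
                              'metascore': metascores,
                              'votes': votes})
print(movie_ratings.info())
movie_ratings.head(10)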
RangeIndex: 2862 entries, 0 to 2861
Data columns (total 5 columns):
imdb         2862 non-null float64
metascore    2862 non-null int64
movie        2862 non-null object
votes        2862 non-null int64
year         2862 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 111.9+ KB
None

   imdb  metascore  movie                      votes    year
0   8.5         67  Gladiator                1061075  (2000)
1   8.5         80  Memento                   909835  (2000)
2   8.3         55  Snatch                    643588  (2000)
3   8.4         68  Requiem for a Dream       617747  (2000)
4   7.4         64  X-Men                     485485  (2000)
5   7.7         73  Cast Away                 422251  (2000)
6   7.6         64  American Psycho           383669  (2000)
7   7.2         62  Unbreakable               273907  (2000)
8   7.0         73  Meet the Parents          272023  (2000)
9   6.1         59  Mission: Impossible II    256789  (2000)

The output of info() shows we collected data for well over 2000 movies. We can also see that there are no null values in our dataset whatsoever.

I have checked the ratings of these first 10 movies against the IMDB’s website. They were all correct. You may want to do the same thing yourself.

We can safely proceed with cleaning the data.

Cleaning the scraped data

We’ll clean the scraped data with two goals in mind: plotting the distribution of IMDB and Metascore ratings, and sharing the dataset. Consequently, our data cleaning will consist of:

  • Reordering the columns.
  • Cleaning the year column and converting the values to integers.
  • Checking the extreme rating values to determine whether all the ratings are within the expected intervals.
  • Normalizing one of the rating types (or both) so the two distributions can be plotted on a common scale.

Let’s start by reordering the columns:

movie_ratings = movie_ratings[['movie', 'year', 'imdb', 'metascore', 'votes']]
movie_ratings.head()
   movie                 year    imdb  metascore    votes
0  Gladiator            (2000)    8.5         67  1061075
1  Memento              (2000)    8.5         80   909835
2  Snatch               (2000)    8.3         55   643588
3  Requiem for a Dream  (2000)    8.4         68   617747
4  X-Men                (2000)    7.4         64   485485

Now let’s convert all the values in the year column to integers.

Right now all the values are of the object type. To avoid ValueErrors upon conversion, we want the values to be composed only of the digits 0 to 9.

Let’s examine the unique values of the year column. This helps us to get an idea of what we could do to make the conversions we want. To see all the unique values, we’ll use the unique() method:

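A sketch of that check (the array of unique values is shown below):

movie_ratings['year'].unique()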

array(['(2000)', '(I) (2000)', '(2001)', '(I) (2001)', '(2002)',       '(I) (2002)', '(2003)', '(I) (2003)', '(2004)', '(I) (2004)',       '(2005)', '(I) (2005)', '(2006)', '(I) (2006)', '(2007)',       '(I) (2007)', '(2008)', '(I) (2008)', '(2009)', '(I) (2009)',       '(II) (2009)', '(2010)', '(I) (2010)', '(II) (2010)', '(2011)',       '(I) (2011)', '(IV) (2011)', '(2012)', '(I) (2012)', '(II) (2012)',       '(2013)', '(I) (2013)', '(II) (2013)', '(2014)', '(I) (2014)',       '(II) (2014)', '(III) (2014)', '(2015)', '(I) (2015)',       '(II) (2015)', '(VI) (2015)', '(III) (2015)', '(2016)',       '(II) (2016)', '(I) (2016)', '(IX) (2016)', '(V) (2016)', '(2017)',       '(I) (2017)', '(III) (2017)', '(IV) (2017)'], dtype=object)

Counting from the end toward the beginning, we can see that the year is always located from the fifth character to the second. We’ll use the .str accessor to select only that interval. We’ll also convert the result to an integer using the astype() method:

movie_ratings.loc[:, 'year'] = movie_ratings['year'].str[-5:-1].astype(int)

Let’s visualize the first 3 values of the year column for a quick check. We can also see the type of the values on the last line of the output:

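A quick look at those values (output below):

movie_ratings['year'].head(3)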

0    2000
1    2000
2    2000
Name: year, dtype: int64

Now we’ll check the minimum and maximum values of each type of rating. We can do this very quickly by using pandas’ describe() method. When applied to a DataFrame, this method returns various descriptive statistics for each numerical column of the DataFrame. In the next line of code we select only those rows that describe the minimum and maximum values, and only those columns which describe IMDB ratings and Metascores.

movie_ratings.describe().loc[['min', 'max'], ['imdb', 'metascore']]
      imdb  metascore
min    1.6        7.0
max    9.0      100.0

There are no unexpected outliers.

From the values above, you can see that the two ratings have different scales. To be able to plot the two distributions on a single graph, we’ll have to bring them to the same scale. Let’s normalize the imdb column to a 100-points scale.

We’ll multiply each IMDB rating by 10, and then we’ll do a quick check by looking at the first 3 rows:

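A sketch of that step, storing the rescaled rating in a new n_imdb column (the first 3 rows are shown below):

movie_ratings['n_imdb'] = movie_ratings['imdb'] * 10
movie_ratings.head(3)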

   movie      year  imdb  metascore    votes  n_imdb
0  Gladiator  2000   8.5         67  1061075    85.0
1  Memento    2000   8.5         80   909835    85.0
2  Snatch     2000   8.3         55   643588    83.0

Nice! We are now in a position to save this dataset locally, so we can share it with others more easily. I have already shared it publicly, and there are plenty of other places where you can share a dataset as well.

So let’s save it:

movie_ratings.to_csv('movie_ratings.csv')

As a side note, I strongly recommend saving the scraped dataset before exiting (or restarting) your notebook kernel. This way you will only have to import the dataset when you resume working, and don’t have to run the scraping script again. This becomes extremely useful if you scrape hundreds or thousands of web pages.

Finally, let’s plot the distributions!

Plotting and analyzing the distributions

In the following code cell we:

  • Import the matplotlib.pyplot submodule.
  • Run the Jupyter magic %matplotlib to activate Jupyter’s matplotlib mode and add inline to have our graphs displayed inside the notebook.
  • Create a figure object with 3 axes.
  • Plot the distribution of each unnormalized rating on an individual ax.
  • Plot the normalized distributions of the two ratings on the same ax.
  • Hide the top and right spines of all the three axes.
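
A sketch of that cell, following the list above (figure size, bin counts, and titles are illustrative choices):

import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 4))
ax1, ax2, ax3 = axes

# Unnormalized distributions on individual axes
ax1.hist(movie_ratings['imdb'], bins=10, range=(0, 10))
ax1.set_title('IMDB rating')

ax2.hist(movie_ratings['metascore'], bins=10, range=(0, 100))
ax2.set_title('Metascore')

# Normalized distributions on the same ax
ax3.hist(movie_ratings['n_imdb'], bins=10, range=(0, 100), histtype='step', label='IMDB (normalized)')
ax3.hist(movie_ratings['metascore'], bins=10, range=(0, 100), histtype='step', label='Metascore')
ax3.set_title('The two ratings on a common scale')
ax3.legend(loc='upper left')

# Hide the top and right spines of all three axes
for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.show()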

Starting with the IMDB distribution, we can see that most ratings are between 6 and 8. There are few movies with a rating greater than 8, and even fewer with a rating smaller than 4. This indicates that both very good movies and very bad movies are rare.

The distribution of Metascore ratings resembles a normal distribution – most ratings are average, peaking at a value of approximately 50. From this peak, the frequencies gradually decrease toward extreme rating values. According to this distribution, there are indeed fewer very good and very bad movies, but not as few as the IMDB ratings indicate.

On the comparative graph, it’s clearer that the IMDB distribution is highly skewed toward the higher part of the average ratings, while the Metascore ratings seem to have a much more balanced distribution.

What might be the reason for that skew in the IMDB distribution? One hypothesis is that many users tend to have a binary method of assessing movies. If they like the movie, they give it a 10. If they don’t like the movie, they give it a very small rating, or they don’t bother to rate the movie. This is an interesting problem that’s worth exploring in more detail.

Next steps

We’ve come a long way from requesting the content of a single web page to analyzing ratings for over 2000 movies. You should now know how to scrape many web pages with the same HTML and URL structure.

  • Scrape data for different time and page intervals.
  • Scrape additional data about the movies.
  • Find a different website to scrape something that interests you. For example, you could scrape price data for a product you care about to see how prices vary over time.

Translated from the original English tutorial.

Reprinted from: http://ssqwd.baihongyu.com/
