Original article:
www.kdnuggets.com/2021/10/simple-text-scraping-parsing-processing-python-library.html
Photo by Peter Lawrence on Unsplash
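Looking for a library to help scrape, parse, process, and extract metadata from news articles? Newspaper can help. Newspaper describes itself as a library for "[n]ews, full-text, and article metadata extraction in Python 3". I say this with the utmost respect: Newspaper is a quick and simple text parsing and processing library. It is not foolproof, and it will not always do everything you need for every article. It does, however, usually get the job done well, and it does so quickly.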
Let's get started and see how quickly and easily you can put this library to use. If you are using Python 3, install it with:
pip install newspaper3k
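Once installed, Newspaper is very easy to use.
Let's import the library, define the web article we want to work with, and download it. We will use the recent KDnuggets article Avoid These Five Behaviors That Make You Look Like A Data Novice for these steps.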
from newspaper import Article
kdn_article = Article(url="https://www.kdnuggets.com/2021/10/avoid-five-behaviors-data-novice.html", language='en')
kdn_article.download()
Now let's see what we downloaded.
# Print out the raw article
print(kdn_article.html)
<!DOCTYPE html>
<html lang="en-US">
<head profile="https://gmpg.org/xfn/11">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name=viewport content="width=device-width, initial-scale=1">
...
<!-- Dynamic page generated in 1.581 seconds. -->
<!-- Cached page generated by WP-Super-Cache on 2021-10-28 20:02:56 -->
<!-- Compression = gzip -->
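This is the article's entire HTML page. That isn't very useful on its own, so let's take our first processing step and use Newspaper to parse the article. Once that's done, we print the parsed article text.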
# Parse the article, print parsed content
kdn_article.parse()
print(kdn_article.text)
If you are new to the Data Science industry or a well-versed veteran in all things data and analytics, there are always key pitfalls that each of us can easily slide into if we are not careful. These behaviors not only make us appear like novices, but they can risk our position as a trustworthy, likable data partner with stakeholder.
...
Original. Reposted with permission.
Bio: Tessa Xie is a Data scientist in the AV industry, and ex-McKinsey, and 3x Top Medium Writer. Tessa is also at the tip of the data spear by day, a writer by night, and a painter, diver, and much more on the weekends.
Related:
That looks better. Parsing stripped out the HTML that has nothing to do with the article and extracted the useful text from what remained.
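Note that the parsed text above still ends with a few site-specific trailer lines ("Original. Reposted with permission.", "Bio: ...", "Related:"). If you only want the article body, one option is to cut the text at the first such line. The following is a minimal sketch in plain Python. The helper name `strip_trailer` and its marker list are hypothetical and chosen for this one KDnuggets article; they are not part of Newspaper.

```python
def strip_trailer(text, markers=("Original. Reposted", "Bio:", "Related:")):
    """Return text up to (not including) the first line that starts with a marker."""
    kept = []
    for line in text.splitlines():
        # str.startswith accepts a tuple and matches any of its entries
        if line.strip().startswith(markers):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


# A shortened stand-in for kdn_article.text
sample = (
    "These behaviors not only make us appear like novices.\n"
    "\n"
    "Original. Reposted with permission.\n"
    "\n"
    "Bio: Tessa Xie is a Data scientist in the AV industry.\n"
    "\n"
    "Related:"
)
print(strip_trailer(sample))
```

With a real article you would pass `kdn_article.text` instead of `sample`. Other sites will need different markers, so treat the list as something to adjust per source.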
Let's see what metadata we can extract from the parsed article.
# Article title
print(kdn_article.title)
Avoid These Five Behaviors That Make You Look Like A Data Novice
# Article's top image
print(kdn_article.top_image)
https://www.kdnuggets.com/wp-content/uploads/avoid-five-behaviors-data-novice.jpg
# All article images
print(kdn_article.images)
{'https://www.kdnuggets.com/wp-content/uploads/envelope.png',
'https://www.kdnuggets.com/wp-content/uploads/tripled-my-income-data-science-18-months-small.jpg',
'https://www.kdnuggets.com/wp-content/uploads/avoid-five-behaviors-data-novice.jpg',
'https://www.kdnuggets.com/images/in_c48.png',
'https://www.kdnuggets.com/images/fb_c48.png',
'https://www.kdnuggets.com/images/tw_c48.png',
'https://www.kdnuggets.com/images/menu-30.png',
'https://www.kdnuggets.com/images/search-icon.png'}
# Article author
print(kdn_article.authors)
# Article publication date
print(kdn_article.publish_date)
[]
None
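As you can see above, some metadata was extracted easily, while for the publication date and authors Newspaper came up empty. This is what I meant at the start of the article: the library is not magic, and if an article's formatting does not suit Newspaper's pattern matching, that metadata will not be identified or extracted.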
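Because these fields can come back empty, downstream code should not assume they are populated. Here is a small sketch of one way to handle that. The function `describe_metadata` and its fallback strings are hypothetical, and it assumes the publication date is a `datetime` object when extraction succeeds.

```python
from datetime import datetime


def describe_metadata(authors, publish_date):
    """Format author and date metadata, tolerating empty results."""
    author_text = ", ".join(authors) if authors else "unknown author"
    if isinstance(publish_date, datetime):
        date_text = publish_date.strftime("%Y-%m-%d")
    else:
        date_text = "unknown date"
    return f"{author_text} ({date_text})"


# The values returned for the KDnuggets article above: [] and None
print(describe_metadata([], None))

# Made-up values, only to show the populated case
print(describe_metadata(["Jane Doe"], datetime(2021, 1, 2)))
```

In practice you would call it as `describe_metadata(kdn_article.authors, kdn_article.publish_date)`.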
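Learn about the other tasks you can perform with a parsed article here.
Now for the more interesting part. Once an article has been downloaded and parsed, it can also be processed with Newspaper's built-in NLP functionality, as follows: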
# Perform higher level processing on article
kdn_article.nlp()
Below are some of the things we can do with the processed article.
# Article keywords
print(kdn_article.keywords)
['avoid', 'data', 'dont', 'instead', 'things', 'insights', 'novice', 'work', 'quality', 'understand', 'behaviors', 'stakeholders', 'look', 'sample']
# Article summary
print(kdn_article.summary)
If you are new to the Data Science industry or a well-versed veteran in all things data and analytics, there are always key pitfalls that each of us can easily slide into if we are not careful.
There are noticeable differences between people who are new to the data world and those who truly understand how to handle data and be helpful data partners.
So as a data expert, you should know better than trusting data quality at face value.
But in reality, unless you are an ML engineer, you rarely need 10-layer neural networks in your day-to-day data work.
Make sure to QC your data and sanity-check your insights, and always caveat findings when data quality or the sample size is a concern.
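This is undoubtedly more interesting than the metadata extraction from the parsed article above, though both the processed and the parsed data extracted from an article are certainly useful.
Keep in mind that there are many ways to summarize an article automatically, using a wide range of libraries and tools. Newspaper, however, offers one that produces reasonable results in a single line of code, with no parameters to tune. Compare this with my earlier article, Getting Started with Automated Text Summarization, which implements a similar extractive summarization process in Python using a simple word frequency approach, and you will see how much more code it takes to get comparable results.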
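To give a feel for what a hand-rolled word frequency approach involves, here is a minimal sketch in plain Python. It is neither Newspaper's internal algorithm nor the code from the earlier article. The function name `summarize`, the tiny stopword list, and the regex-based sentence splitting are all simplifying assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences by summed word frequency."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOPWORDS)

    # Rank sentences by score, then restore document order for readability
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = (
    "Data quality matters. "
    "Always check data quality before sharing insights. "
    "The weather was nice."
)
print(summarize(text, max_sentences=1))
```

Even this stripped-down version needs tokenizing, stopword handling, scoring, and ranking, which is the effort that the single `nlp()` call saves you.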
Only interested in Newspaper's summarization functionality? Here is a quick, self-contained example:
from newspaper import Article
cnn_article = Article(url="https://www.cnn.com/2021/10/28/tech/facebook-mark-zuckerberg-keynote-announcements/index.html", language='en')
cnn_article.download()
cnn_article.parse()
cnn_article.nlp()
print(cnn_article.summary)
The company formerly known as Facebook also said in a press release that it plans to begin trading under the stock ticker "MVRS" on December 1.
Facebook is one of the most used products in the history of the world," Zuckerberg said on Thursday.
"Today we're seen as a social media company," he added, "but in our DNA, we are a company that builds technology to connect people.
But on Zuckerberg's personal Facebook page , his job title was changed to: "Founder and CEO at Meta."
When asked by The Verge if he would remain CEO at Facebook in the next 5 years, he said: "Probably.
That's all there is to it.
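If you plan to run the same download, parse, and NLP steps on many URLs, it may be convenient to wrap them in a function. The sketch below uses only the `Article` calls and attributes already shown in this article. The function name `summarize_url` and its `article_cls` parameter are hypothetical additions; the parameter exists only so that the helper can be tried with a stand-in class, without network access.

```python
def summarize_url(url, article_cls=None, language="en"):
    """Download, parse, and process one article; return selected fields."""
    if article_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class
        from newspaper import Article as article_cls
    article = article_cls(url=url, language=language)
    article.download()
    article.parse()
    article.nlp()
    return {
        "title": article.title,
        "keywords": article.keywords,
        "summary": article.summary,
    }


# Example call (needs network access and newspaper3k installed):
# result = summarize_url("https://www.kdnuggets.com/2021/10/avoid-five-behaviors-data-novice.html")
# print(result["summary"])
```

This sketch does no error handling; a failed download or an unparseable page would need to be dealt with by the caller.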
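Learn more about what you can do with a processed article here.
Newspaper is not perfect and has its limitations, but you can see how quick and easy it is to call and put to use, and it remains useful even when you run into some of those limitations. Personally, I have written my own code for many of the tasks in the steps above, and have used several different libraries for others, usually with considerably more effort.
There is actually much more you can do with this library, and I encourage you to investigate the possibilities. I hope you can use Newspaper in your own projects.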