Let's set up a virtual environment:
# Install virtualenv (if you don't have it)
## Installing | Windows
pip install --upgrade virtualenv
## Installing | Linux | Mac
pip3 install --upgrade virtualenv
# Create the virtual environment | Windows | Linux | Mac
virtualenv pyp-env
# Activate pyp-env | Windows
pyp-env\Scripts\activate
# Activate pyp-env | Linux | Mac
source pyp-env/bin/activate
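After activation it can be handy to confirm from Python that the interpreter really belongs to the environment. The following is only a sketch using the standard `sys` module; the helper name `in_virtualenv` is made up for this illustration and is not part of virtualenv or Pyppeteer.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Legacy virtualenv releases set sys.real_prefix; current virtualenv and
    # venv instead make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints False after you ran the activate command, the shell is most likely still using the system interpreter.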
Install Pyppeteer and its dependencies:
# Installing using pip | Windows
C:\> python -m pip install pyppeteer
# Installing using pip | Linux | Mac
$ python3 -m pip install pyppeteer
# Installing from a GitHub repository | Windows
C:\> python -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev
# Installing from a GitHub repository | Linux | Mac
$ python3 -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev
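To check that the installation landed in the active environment, you can ask Python whether the package can be located. This is a minimal sketch built on the standard `importlib.util.find_spec`; the helper name `is_installed` is hypothetical and chosen only for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pyppeteer"):
        print("pyppeteer is available")
    else:
        print("pyppeteer is missing; run: python -m pip install pyppeteer")
```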
It is also worth noting that Pyppeteer is asynchronous by default: its API is built on asyncio, so your script or application can keep handling other requests while the browser is busy. This can improve performance when most of the time is spent waiting, for example on page loads.
Import the required libraries:
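The benefit of the asynchronous style can be illustrated without a browser. In the sketch below, `fake_page_load` is a made-up stand-in for a slow call such as `page.goto()`, and `load_all` is an equally hypothetical helper; only `asyncio.sleep`, `asyncio.gather` and `asyncio.run` are real standard-library calls.

```python
import asyncio
import time


async def fake_page_load(name: str, seconds: float) -> str:
    """Stand-in for a slow browser operation such as page.goto()."""
    await asyncio.sleep(seconds)  # yields control while "waiting on the network"
    return name


async def load_all(names, seconds: float = 0.2):
    """Start every fake load at once and collect the results in input order."""
    return await asyncio.gather(*(fake_page_load(n, seconds) for n in names))


if __name__ == "__main__":
    started = time.perf_counter()
    results = asyncio.run(load_all(["a", "b", "c"]))
    elapsed = time.perf_counter() - started
    print(results, "in about", round(elapsed, 2), "seconds")
```

Because the three waits overlap, the total time stays close to a single wait rather than the sum of all three. Real page loads are limited by the network and the browser, so the gain there will vary.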
import asyncio
from pyppeteer import launch
Let's create an asynchronous function that opens a website and takes a screenshot:
import asyncio
from pyppeteer import launch


async def main():
    # launch the Chromium browser in the background
    browser = await launch()
    # open a new tab in the browser
    page = await browser.newPage()
    # navigate the new tab to the URL
    await page.goto("https://www.python.org/")
    # take a screenshot of the page and save it
    await page.screenshot({"path": "python.png"})
    # close the browser
    await browser.close()


print("Starting...")
# asyncio.run() creates the event loop, runs main() and closes the loop
asyncio.run(main())
print("Screenshot has been taken")
Once you see "Screenshot has been taken", a new image named "python.png" should appear in the current directory, showing the python.org home page.
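If you want the script itself to confirm that the screenshot was written, one option is to check that the file starts with the PNG signature. This is a sketch using only the standard library; the helper name `looks_like_png` is an assumption made for this example, and it verifies the file header only, not the image content.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def looks_like_png(path) -> bool:
    """Return True if the file exists and begins with the PNG signature."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


if __name__ == "__main__":
    print("python.png is a PNG:", looks_like_png("python.png"))
```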
Suppose we need to collect articles from Educative.io/edpresso. The page renders its content interactively, based on what you type into the search box. To watch this happen, launch the browser in non-headless mode:
# launch the browser in non-headless mode
browser = await launch({"headless": False})
# It is also a good idea to start the browser window maximized.
# Here is how to do that:
browser = await launch({"headless": False, "args": ["--start-maximized"]})
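When you switch between headless and visible runs often, it may be convenient to build the options dictionary in one place. The helper below, `build_launch_options`, is hypothetical and not part of Pyppeteer; it only assembles the same `headless` and `args` keys used above.

```python
from typing import Any, Dict


def build_launch_options(headless: bool = True, maximized: bool = False) -> Dict[str, Any]:
    """Assemble the options dictionary that is passed to pyppeteer's launch()."""
    options: Dict[str, Any] = {"headless": headless, "args": []}
    if maximized:
        # Chromium flag that opens the window maximized
        options["args"].append("--start-maximized")
    return options


if __name__ == "__main__":
    # Inside a coroutine this would be used as:
    #     browser = await launch(build_launch_options(headless=False, maximized=True))
    print(build_launch_options(headless=False, maximized=True))
```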
Next, define the coroutine, open the page and locate the search box:
import asyncio
from typing import List
from pyppeteer import launch


async def get_article_titles(keywords: List[str]):
    # launch the browser in non-headless mode, maximized
    browser = await launch({"headless": False, "args": ["--start-maximized"]})
    # create a new page
    page = await browser.newPage()
    # set the page viewport to 1600x900
    await page.setViewport({"width": 1600, "height": 900})
    # navigate to the page
    await page.goto("https://www.educative.io/edpresso")
    # locate the search box (raw string, so the CSS escapes such as \: stay intact)
    entry_box = await page.querySelector(
        r"#__next > div.ed-grid > div.ed-grid-main > div > div.flex.flex-row.items-center.justify-around.bg-gray-50.dark\:bg-dark.lg\:py-0.lg\:px-6 > div > div.w-full.p-0.m-0.flex.flex-col.lg\:w-1\/2.lg\:py-0.lg\:px-5 > div.pt-6.px-4.pb-0.lg\:sticky.lg\:p-0 > div > div > div.w-full.dark\:bg-dark.h-12.flex-auto.text-sm.font-normal.rounded-sm.cursor-text.inline-flex.items-center.hover\:bg-alphas-black06.dark\:hover\:bg-gray-A900.border.border-solid.overflow-hidden.focus-within\:ring-1.border-gray-400.dark\:border-gray-900.focus-within\:border-primary.dark\:focus-within\:border-primary-light.focus-within\:ring-primary.dark\:focus-within\:ring-primary-light > input"
    )
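The selector above contains class names such as `dark:bg-dark` and `lg:w-1/2`, whose `:` and `/` characters must be backslash-escaped inside a CSS selector. The sketch below shows one way to produce such escapes; `escape_css_class` and `class_selector` are hypothetical helpers written for this example. It is a simplified escape that does not cover every CSS rule (for instance, class names starting with a digit).

```python
import re


def escape_css_class(name: str) -> str:
    """Backslash-escape every character that is not a letter, digit, '-' or '_'."""
    return re.sub(r"([^A-Za-z0-9_-])", r"\\\1", name)


def class_selector(tag: str, *class_names: str) -> str:
    """Build a selector such as div.a.b from a tag and raw class names."""
    return tag + "".join("." + escape_css_class(c) for c in class_names)


if __name__ == "__main__":
    print(escape_css_class("lg:w-1/2"))
    print(class_selector("div", "w-full", "dark:bg-dark"))
```

Selectors this long tend to break as soon as the site changes its markup, so a shorter selector is usually easier to maintain when the page offers one.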
For a single keyword, type it into the search box, wait for the results, print the titles and clear the box:
    # type the keyword into the search box
    await entry_box.type(keyword)
    # wait 4 seconds for the search results to load
    await page.waitFor(4000)
    # extract the article titles
    topics = await page.querySelectorAll("h2")
    for topic in topics:
        title = await topic.getProperty("textContent")
        # print the article title
        print(await title.jsonValue())
    # clear the input box, one Backspace per typed character
    for _ in range(len(keyword)):
        await page.keyboard.press("Backspace")
To repeat this for every keyword, wrap the steps in a loop:
    for keyword in keywords:
        # type the keyword into the search box
        await entry_box.type(keyword)
        # wait 4 seconds for the search results to load
        await page.waitFor(4000)
        # extract the article titles
        topics = await page.querySelectorAll("h2")
        # print the article titles
        for topic in topics:
            title = await topic.getProperty("textContent")
            print(await title.jsonValue())
        # clear the input box, one Backspace per typed character
        for _ in range(len(keyword)):
            await page.keyboard.press("Backspace")
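The loop waits a fixed four seconds after each keyword, which is either too long or too short depending on the network. An alternative is to poll a condition until it holds or a timeout expires. The sketch below shows the idea with plain asyncio; `wait_until` is a hypothetical helper, not a Pyppeteer API, and the `check` coroutine is whatever readiness test you supply.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until(
    check: Callable[[], Awaitable[bool]],
    timeout: float = 4.0,
    interval: float = 0.1,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if await check():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)  # let other tasks run between polls
```

With Pyppeteer, `check` could be a coroutine that returns True once `await page.querySelectorAll("h2")` yields at least one element. Pyppeteer also provides `page.waitForSelector`, which covers the common case of waiting for an element to appear.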
Now that you have written the individual parts of the algorithm, it is time to put the whole script together. The finished source code should look like this:
import asyncio
from typing import List
from pyppeteer import launch


async def get_article_titles(keywords: List[str]):
    # launch the browser in non-headless mode, maximized
    browser = await launch({"headless": False, "args": ["--start-maximized"]})
    # create a new page
    page = await browser.newPage()
    # set the page viewport to 1600x900
    await page.setViewport({"width": 1600, "height": 900})
    # navigate to the page
    await page.goto("https://www.educative.io/edpresso")
    # locate the search box (raw string, so the CSS escapes such as \: stay intact)
    entry_box = await page.querySelector(
        r"#__next > div.ed-grid > div.ed-grid-main > div > div.flex.flex-row.items-center.justify-around.bg-gray-50.dark\:bg-dark.lg\:py-0.lg\:px-6 > div > div.w-full.p-0.m-0.flex.flex-col.lg\:w-1\/2.lg\:py-0.lg\:px-5 > div.pt-6.px-4.pb-0.lg\:sticky.lg\:p-0 > div > div > div.w-full.dark\:bg-dark.h-12.flex-auto.text-sm.font-normal.rounded-sm.cursor-text.inline-flex.items-center.hover\:bg-alphas-black06.dark\:hover\:bg-gray-A900.border.border-solid.overflow-hidden.focus-within\:ring-1.border-gray-400.dark\:border-gray-900.focus-within\:border-primary.dark\:focus-within\:border-primary-light.focus-within\:ring-primary.dark\:focus-within\:ring-primary-light > input"
    )
    for keyword in keywords:
        print("====================== {} ======================".format(keyword))
        # type the keyword into the search box
        await entry_box.type(keyword)
        # wait 4 seconds for the search results to load
        await page.waitFor(4000)
        # extract the article titles
        topics = await page.querySelectorAll("h2")
        for topic in topics:
            title = await topic.getProperty("textContent")
            # print the article title
            print(await title.jsonValue())
        # clear the input box, one Backspace per typed character
        for _ in range(len(keyword)):
            await page.keyboard.press("Backspace")
    # close the browser once all keywords have been processed
    await browser.close()


print("Starting...")
# asyncio.run() creates the event loop, runs the coroutine and closes the loop
asyncio.run(get_article_titles(["python", "opensource", "opencv"]))
print("Finished extracting articles titles")
It is time to see the script in action. Run it the way you would normally run a Python script, as shown here:
$ python3 app.py
Starting...
====================== python ======================
What is the contextlib module?
What is the difference between String find() and index() method?
Installing pip3 in Ubuntu
What is a private heap space?
......
====================== opensource ======================
Knative
How to use ASP.NET Core
What is apache Hadoop?
What is OpenJDK?
What is Azure Data Studio?
.....
====================== opencv ======================
What is OpenCV in Python?
Eye Blink detection using OpenCV, Python, and Dlib
How to capture a frame from real-time camera video using OpenCV
Finished extracting articles titles