{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Web scraping for PcDeMaNo\n", "Scraping is often the only way to get values from websites that don't provide an API. Finding the right values can be tricky, but this example should help you get started. The workflow is very similar to the one the [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/) uses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the needed modules." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# POLLUTION" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "URL = 'http://gestiona.madrid.org/azul_internet/html/web/DatosEstacionAccion.icm?ESTADO_MENU=2&idEstacion=12'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The website is retrieved with `requests` and parsed with `BeautifulSoup`." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "raw_html = requests.get(URL).text\n", "data = BeautifulSoup(raw_html, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have the complete content of the page. [CSS selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) can be used to identify the element that holds the value. There are several ways to get at the part in question. Because `BeautifulSoup` returns the matches as a list, we only need to identify the position in the list." 
] }, { "cell_type": "code", "execution_count": 56, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 : \n", "\"Comunidad\n", "\n", "2 : \r\n", "  \r\n", " \n", "3 : \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\r\n", " D.G. del Medio Ambiente | Consejería de Medio Ambiente, Administración Local y Ordenación del Territorio\r\n", "
\r\n", " Área de Calidad Atmosférica - Red de Calidad del Aire\r\n", " \n", "
\n", "
\n", "Inicio\n", " > \n", "Datos de la Red\n", "
\n", "\n", "4 : \r\n", " D.G. del Medio Ambiente | Consejería de Medio Ambiente, Administración Local y Ordenación del Territorio\r\n", " \n", "5 : \r\n", " Área de Calidad Atmosférica - Red de Calidad del Aire\r\n", " \n", "6 : \n", "\n", "7 : \n", "\n", "8 : \n", "Inicio\n", " > \n", "Datos de la Red\n", "\n", "9 : \n", "\n", "\"Portal\n", "\n", "\n", "10 : \n", "11 :  \n", "12 : \n", "\n", " Estación de Majadahonda\n", " \n", "\n", "13 : \n", "\n", " Ultima media horaria a las\n", " 13:00\n", " \n", "\n", "14 : \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Contaminantes
\n", " TIN\n", " \n", " (\n", " ºC\n", " )\n", " \n", "\n", " 17.2\n", " \n", "
\n", " NO\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", " 7\n", " \n", "
\n", " NO2\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", " 7\n", " \n", "
\n", " PM10\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", " 4\n", " \n", "
\n", " NOX\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", " ***\n", " N\n", "
\n", " O3\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", " 75\n", " \n", "
\n", "\n", "15 : Contaminantes \n", "16 : \n", " TIN\n", " \n", " (\n", " ºC\n", " )\n", " \n", "\n", "17 : \n", " 17.2\n", " \n", " \n", "18 : \n", " NO\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", "19 : \n", " 7\n", " \n", " \n", "20 : \n", " NO2\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", "21 : \n", " 7\n", " \n", " \n", "22 : \n", " PM10\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", "23 : \n", " 4\n", " \n", " \n", "24 : \n", " NOX\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", "25 : \n", " ***\n", " N\n", " \n", "26 : \n", " O3\n", " \n", " (\n", " µg/m3\n", " )\n", " \n", "\n", "27 : \n", " 75\n", " \n", " \n", "28 : \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Meteorología
\n", " VV\n", " \n", " (\n", " m/s\n", " )\n", " \n", "\n", " 4.3\n", " \n", "
\n", " DV\n", " \n", " (\n", " Grd\n", " )\n", " \n", "\n", " 71\n", " \n", "
\n", " Tmp\n", " \n", " (\n", " ºC\n", " )\n", " \n", "\n", " 14.7\n", " \n", "
\n", " HR\n", " \n", " (\n", " %\n", " )\n", " \n", "\n", " 61\n", " \n", "
\n", " Pre\n", " \n", " (\n", " mbar\n", " )\n", " \n", "\n", " 938\n", " \n", "
\n", " RS\n", " \n", " (\n", " W/m2\n", " )\n", " \n", "\n", " 864\n", " \n", "
\n", " Llu\n", " \n", " (\n", " l/m2\n", " )\n", " \n", "\n", " 0.0\n", " \n", "
\n", "\n", "29 : Meteorología \n", "30 : \n", " VV\n", " \n", " (\n", " m/s\n", " )\n", " \n", "\n", "31 : \n", " 4.3\n", " \n", " \n", "32 : \n", " DV\n", " \n", " (\n", " Grd\n", " )\n", " \n", "\n", "33 : \n", " 71\n", " \n", " \n", "34 : \n", " Tmp\n", " \n", " (\n", " ºC\n", " )\n", " \n", "\n", "35 : \n", " 14.7\n", " \n", " \n", "36 : \n", " HR\n", " \n", " (\n", " %\n", " )\n", " \n", "\n", "37 : \n", " 61\n", " \n", " \n", "38 : \n", " Pre\n", " \n", " (\n", " mbar\n", " )\n", " \n", "\n", "39 : \n", " 938\n", " \n", " \n", "40 : \n", " RS\n", " \n", " (\n", " W/m2\n", " )\n", " \n", "\n", "41 : \n", " 864\n", " \n", " \n", "42 : \n", " Llu\n", " \n", " (\n", " l/m2\n", " )\n", " \n", "\n", "43 : \n", " 0.0\n", " \n", " \n", "44 :  \n", "45 : \n", "46 :  \n", "47 : \n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\"IconoCopyright © Comunidad de Madrid.\n", "Aviso Legal |\n", "Privacidad |\n", "Contacto |\n", "Accesibilidad\n", "
\n", "\n", "48 : \n", "\"Icono\n", "49 : \n", "50 : Copyright © Comunidad de Madrid.\n" ] } ], "source": [ "# Print the first 50 <td> cells with a 1-based index to find the position of the value.\n", "for i, cell in enumerate(data.select('td')[:50], start=1):\n", "    print(i, \":\", cell)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The CSS pseudo-class `nth-of-type(x)` selects the element that is the `x`-th of its type among its siblings, counting from 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# POLLEN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make your selector as robust as possible, look for something unique to the element, such as an `id` or a URL." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "URL = 'http://www.madrid.org/cs/Satellite?cid=1265185300196&language=es&pagename=PortalSalud%2FPage%2FPTSA_pintarContenidoFinal&vest=1265185299945'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The website is retrieved with `requests` and parsed with `BeautifulSoup`." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "raw_html = requests.get(URL).text\n", "data = BeautifulSoup(raw_html, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have the complete content of the page. [CSS selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) can be used to identify the element that holds the value. There are several ways to get at the part in question. Because `BeautifulSoup` returns the matches as a list, we only need to identify the position in the list." 
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 : \n", "Consejería de Sanidad\n", "\n", "2 : Estás en\n", "3 : Datos del día 12 de abril\n", "4 : Niveles \n", "5 : MEDIOS\n", "6 : de polen de \n", "7 : Plátano\n", "8 : \n", "9 : con \n", "10 : 163\n", "11 : \n", "12 : granos de polen por metro cúbico de aire\n", "13 : con un máximo de \n", "14 : 676\n", "15 : \n", "16 : granos en Alcalá de Henares\n", "17 : con un mínimo de \n", "18 : 1 \n", "19 : granos de polen en \n", "20 : Las Rozas\n", "21 : Niveles BAJOS de polen de Gramíneas con \n", "22 : 2 \n", "23 : granos de polen por metro cúbico
\n", "con un máximo de 5 granos en
\n", "24 : Getafe\n", "25 : con un mínimo de \n", "26 : 0 \n", "27 : granos de polen en Leganés\n", "28 : Niveles NULOS de polen de \n", "29 : Plantago\n", "30 : con 0 granos de polen por metro cúbico\n", "31 : Las escalas para cada tipo de polen atienden únicamente a criterios aerobiológicos\n", "32 : Las escalas para cada tipo de polen atienden únicamente a criterios aerobiológicos\n", "33 : Las escalas para cada tipo de polen atienden únicamente a criterios \n", "34 : aerobiológicos\n", "35 : Los niveles de concentración se expresan como granos de polen por metro cúbico de aire y corresponden a los datos de concentración medios para toda la Red Palinocam\n", "36 : Los \n", "37 : niveles de concentración se expresan como granos de polen por metro cúbico de aire y corresponden a los datos de concentración medios para toda la Red \n", "38 : Palinocam\n", "39 : MAPA POLEN\n", "40 : MAPA POLEN\n" ] } ], "source": [ "# Print the first 40 <span> elements with a 1-based index to find the position of the value.\n", "for i, span in enumerate(data.select('span')[:40], start=1):\n", "    print(i, \":\", span)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value extraction is handled with `value_template` by the [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/). The next two steps are shown here only to make every manual step visible.\n", "\n", "We only need the actual text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a string and can be manipulated. We focus on the number." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the number of current platforms/components available in Home Assistant, as listed in the [Component overview](https://home-assistant.io/components/).\n", "\n", "The details you identified here can be reused to configure the `select` option of the [`scrape` sensor](https://home-assistant.io/components/sensor.scrape/). The most efficient way to do this is to apply `nth-of-type(x)` to your selector." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }