{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "\"\"\"\n", "Created on Tue Jan 12 19:40:03 2021\n", "\n", "@author: Usuario\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install the BeautifulSoup library, open the Anaconda Prompt and run:
\n", " pip install beautifulsoup4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the required libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "from urllib.request import urlopen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a BeautifulSoup object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can open the URL in a browser to see what the page we want to scrape looks like" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "url = \"http://olympus.realpython.org/profiles/dionysus\"\n", "# Open the URL\n", "page = urlopen(url)\n", "# Read the response and decode it\n", "html = page.read().decode(\"utf-8\")\n", "# Create the BeautifulSoup object\n", "soup = BeautifulSoup(html, \"html.parser\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "Profile: Dionysus\n", "\n", "\n", "
\n", "

\n", "\n", "

Name: Dionysus

\n", "

\n", "Hometown: Mount Olympus\n", "

\n", "Favorite animal: Leopard
\n", "
\n", "Favorite Color: Wine\n", "
\n", "\n", "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " WARNING:\n", "The variable soup may not appear in the variable explorer.
\n", "In the top-right corner menu, uncheck: Exclude callables and modules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the BeautifulSoup object ####" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The methods provided by BeautifulSoup objects\n", "let us perform a range of operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract all the text, stripping out the HTML tags" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "url http://olympus.realpython.org/profiles/dionysus\n" ] } ], "source": [ "print(f\"url {url}\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Profile: Dionysus\n", "\n", "\n", "\n", "\n", "\n", "Name: Dionysus\n", "\n", "Hometown: Mount Olympus\n", "\n", "Favorite animal: Leopard \n", "\n", "Favorite Color: Wine\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "print(soup.get_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could remove the newline characters, although this also joins all the lines together" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Profile: DionysusName: DionysusHometown: Mount OlympusFavorite animal: Leopard Favorite Color: Wine\n" ] } ], "source": [ "noblanklines = soup.get_text().replace(\"\\n\", \"\")\n", "print(noblanklines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or search within the text" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "109" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "findtext = soup.get_text().find('Wine')\n", "findtext # Returns the index of the first occurrence, or -1 if not found" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes it is useful to keep the HTML tags so that we can\n", "search for specific elements, such as images" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[<img src=\"/static/dionysus.jpg\"/>, <img src=\"/static/grapes.png\"/>]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all(\"img\")\n", "# Returns a list of every <img> tag in the document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can unpack the tags into separate variables" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "image1, image2 = soup.find_all(\"img\")\n", "# Each variable is a Tag object with its own methods and properties" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can ask for the HTML tag type of an object\n", "with the .name property" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'img'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image1.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The HTML attributes of a tag can be accessed\n", "by putting the attribute name in square brackets, as with a dictionary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Tag with a single attribute
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Tag with two attributes
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get the source path of each image on the page" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/static/dionysus.jpg'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image1[\"src\"]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/static/grapes.png'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image2[\"src\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some tags can be accessed directly as properties of the soup object" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For example, get the title tag of the page with the .title property" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<title>Profile: Dionysus</title>" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we view the page source at the URL, we can see that the title tag is written in exactly the same way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the output above with the <title> tag in the page source." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the .string property, we get only the text content of the tag" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Profile: Dionysus'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also search for tags that have specific attribute values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find_all(\"img\", src=\"/static/dionysus.jpg\")\n", "# Returns the 'img' tags whose 'src' attribute\n", "# matches the given value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<br>\n", "This approach is most useful when we want to scrape specific parts of a page or the elements inside particular tags.<br>\n", "If you spend some time browsing websites and viewing their source code, you will notice that many have extremely complicated HTML structures.<br>\n", "Often we are only interested in specific parts of a page.<br>\n", "By taking some time to examine the HTML document, we can identify tags with unique attributes that can be used to extract the data we need.<br>\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Exercise: write a program that fetches the full HTML of a URL\n", "# URL: http://olympus.realpython.org/profiles\n", "# Print a list of all the links by finding the <a> tags\n", "# and reading the value of each 'href' attribute" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }