{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "\"\"\"\n", "Created on Tue Jan 12 17:43:43 2021\n", "\n", "@author: borja\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we open the following page:
\n", "http://olympus.realpython.org/profiles/aphrodite" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "url = \"http://olympus.realpython.org/profiles/aphrodite\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "page = urlopen(url)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A continuacion sacamos la estructura del HTML" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "html_bytes = page.read()\n", "html = html_bytes.decode(\"utf-8\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Profile: Aphrodite\n", "\n", "\n", "
\n", "

\n", "\n", "

Name: Aphrodite

\n", "

\n", "Favorite animal: Dove\n", "

\n", "Favorite color: Red\n", "

\n", "Hometown: Mount Olympus\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we extract different parts of the page using plain string operations.
\n", "We extract text by marking where it starts and where it ends. Keep in mind that every line ends with a newline character (\\n), which also counts toward the starting position of the text to extract.\n", "We look for the title." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title_index = html.find(\"<title>\")\n", "title_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute where the content of the title begins" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "start_index = title_index + len(\"<title>\")\n", "start_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And where it ends, at the closing tag (the one with the slash, \"/\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "39" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "end_index = html.find(\"</title>\")\n", "end_index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We extract the title" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Profile: Aphrodite'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title = html[start_index:end_index]\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We try this approach on a new page. If we repeat the same steps for Poseidon, we run into problems." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n\\nProfile: Poseidon'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = \"http://olympus.realpython.org/profiles/poseidon\"\n", "page = urlopen(url)\n", "html = page.read().decode(\"utf-8\")\n", "start_index = html.find(\"<title>\") + len(\"<title>\")\n", "end_index = html.find(\"</title>\")\n", "title = html[start_index:end_index]\n", "title" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The title comes out wrong." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the page" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Profile: Poseidon\n", "\n", "\n", "
\n", "

\n", "\n", "

Name: Poseidon

\n", "

\n", "Favorite animal: Dolphin\n", "

\n", "Favorite color: Blue\n", "

\n", "Hometown: Sea\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "print(html)\n", "# Note that there is a space after the word title in the opening tag." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "html.find(\"<title>\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len('<title>')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n" ] } ], "source": [ "# Searching for the exact opening title tag returns -1 because that string does not exist.\n", "# So start_index is -1 + 7 = 6\n", "print(start_index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The character at index 6 of the html string is a newline character (\\n)\n", "right before the opening angle bracket (<) of the `<head>` tag.\n", "This means that html[start_index:end_index] returns all the HTML\n", "starting with that newline and ending just before the `</title>` tag.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid this kind of problem, we use regular expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We import the library" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at an example.
\n", "We use findall() to find every piece of text within a string
\n", "that matches a given regular expression" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ac']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"ac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first argument of this function is the regular expression to match.
\n", "The asterisk (*) stands for zero or more repetitions of whatever comes right before it.
\n", "The second argument is the string to test.
\n", "Here we search for ab*c in the string ac: the pattern matches an a, then zero or more b characters, then a c, so it matches ac here and would match abc in abcd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More examples" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abc']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"abcd\")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ac']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"acc\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ac']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"acd\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"adc\")" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abc', 'ac']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"abcac\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"abdc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, matching is case sensitive" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"ABC\")" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "But we can change that with the re.IGNORECASE flag" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ABC']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"ab*c\", \"ABC\", re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pattern .* inside a regular expression stands for any character (except a newline, by default) repeated
\n", "any number of times, including zero.
\n", "For example, \"a.*c\" can be used to find every substring that starts with \"a\"
\n", "and ends with \"c\", regardless of which letter or letters are in between:
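
A minimal sketch of how .* behaves, using made-up strings that do not come from the pages above: the dot matches any character except a newline, and .* takes as many characters as it can.

```python
import re

# Hypothetical sample strings, chosen only for illustration.
greedy = re.findall('a.*c', 'abcxc')            # .* runs up to the last c
with_symbols = re.findall('a.*c', 'a-1 2-c')    # digits, spaces and dashes all match the dot
no_match = re.findall('a.*c', 'ab')             # no c after the a, so nothing matches

print(greedy)
print(with_symbols)
print(no_match)
```

Because the match stretches to the last c, the string abcxc yields one long match rather than two short ones.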
\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abc']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"a.*c\", \"abc\")" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['abbc']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"a.*c\", \"abbc\")" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ac']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"a.*c\", \"ac\")" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['acc']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(\"a.*c\", \"acc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "Often, you use re.search() to look for a particular pattern inside a string.
\n", "This function is somewhat more complicated than re.findall() because it returns a
\n", "Match object that stores different groups of data (or None when nothing matches).
\n", "This is because a match can contain sub-matches, the groups captured by parentheses in the pattern,
\n", "and the Match object gives access to all of them. Note that re.search() reports only the first match in the string.
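
As a rough sketch of what the returned object offers (the sample strings and the parenthesized group are illustrative assumptions, not part of this notebook): re.search() yields a Match object for the first hit, or None when there is none, so checking for None before calling .group() avoids an AttributeError.

```python
import re

# Hypothetical sample text; the parentheses capture the run of b characters.
match = re.search('a(b*)c', 'xxABBCxx', re.IGNORECASE)

if match is not None:
    whole = match.group()    # the entire matched text
    inner = match.group(1)   # only what the parentheses captured
    span = match.span()      # (start, end) indices of the match
    print(whole, inner, span)

# When the pattern is absent, re.search returns None instead of a Match.
missing = re.search('a(b*)c', 'xyz')
print(missing)
```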
\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ABC'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match_results = re.search(\"ab*c\", \"ABC\", re.IGNORECASE)\n", "match_results.group()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling .group() on a Match object returns the first and most inclusive result, that is, the full text matched by the pattern" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "re.sub() lets you replace the text in a string that matches a
\n", "regular expression with new text. In the example, everything from the first < to the last > is replaced:\n", "`<replaced> if it's in <tags>` becomes ELEPHANTS\n", "because .* looks for the longest possible string to replace" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Everything is ELEPHANTS.'" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = \"Everything is <replaced> if it's in <tags>.\"\n", "string = re.sub(\"<.*>\", \"ELEPHANTS\", string)\n", "string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case only one substitution is made, on the single longest match.
\n", "Let's see what happens with a different replacement word." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Everything is tac.'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = \"Everything is <replaced> if it's in <tags>.\"\n", "string = re.sub(\"<.*>\", \"tac\", string)\n", "string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to replace every occurrence of text between < and >, we add a question mark.
\n", "The pattern .*? looks for the shortest possible match, unlike .* which looks for the longest one.\n", "As a result, it does not stretch to the longest match, and each tag is replaced separately" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Everything is ELEPHANTS if it's in ELEPHANTS.\"" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = \"Everything is <replaced> if it's in <tags>.\"\n", "string = re.sub(\"<.*?>\", \"ELEPHANTS\", string)\n", "string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that covered, let's try the following page.
\n", "http://olympus.realpython.org/profiles/dionysus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "from urllib.request import urlopen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the page." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "url = \"http://olympus.realpython.org/profiles/dionysus\"\n", "page = urlopen(url)\n", "html = page.read().decode(\"utf-8\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Profile: Dionysus\n", "\n", "\n", "
\n", "

\n", "\n", "

Name: Dionysus

\n", "

\n", "Hometown: Mount Olympus\n", "

\n", "Favorite animal: Leopard
\n", "
\n", "Favorite Color: Wine\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that it has the same problem.
\n", "We solve it with what we have just covered" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "pattern = \"<title.*?>.*?</title.*?>\"\n", "match_results = re.search(pattern, html, re.IGNORECASE)\n", "title = match_results.group()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We display the title" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Profile: Dionysus\n" ] } ], "source": [ "print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We remove the tags" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Profile: Dionysus\n" ] } ], "source": [ "title = re.sub(\"<.*?>\", \"\", title)\n", "# Display the title\n", "print(title)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EXERCISE 1
\n", "Write a program that fetches the full HTML from the following URL
\n", "url = \"http://olympus.realpython.org/profiles/dionysus\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }