{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -*- coding: utf-8 -*-\n", "\"\"\"\n", "Created on Tue Jan 12 20:11:10 2021\n", "\n", "@author: Usuario\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MechanicalSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Installation: `pip install MechanicalSoup`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting MechanicalSoup\n", " Downloading MechanicalSoup-1.1.0-py3-none-any.whl (19 kB)\n", "Requirement already satisfied: lxml in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from MechanicalSoup) (4.6.3)\n", "Requirement already satisfied: beautifulsoup4>=4.7 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from MechanicalSoup) (4.10.0)\n", "Requirement already satisfied: requests>=2.22.0 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from MechanicalSoup) (2.26.0)\n", "Requirement already satisfied: soupsieve>1.2 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from beautifulsoup4>=4.7->MechanicalSoup) (2.2.1)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from requests>=2.22.0->MechanicalSoup) (3.2)\n", "Requirement already satisfied: charset-normalizer~=2.0.0 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from requests>=2.22.0->MechanicalSoup) (2.0.4)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from requests>=2.22.0->MechanicalSoup) (1.26.7)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from requests>=2.22.0->MechanicalSoup) (2021.10.8)\n", "Installing collected packages: MechanicalSoup\n", "Successfully installed 
MechanicalSoup-1.1.0\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install MechanicalSoup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import mechanicalsoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Browser object" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "browser = mechanicalsoup.Browser()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can request a web page" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "url = \"http://olympus.realpython.org/login\"\n", "page = browser.get(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The call returns an object that stores the response for the requested URL" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "In this case, 200 is the status code returned by the request.
\n", " It means the request completed successfully.
\n", "Other common codes:
\n", " 404: the URL does not exist
\n", " 500: a server error occurred while handling the request
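
The status codes listed above can also be looked up programmatically. A minimal sketch, assuming only the standard library enum `http.HTTPStatus`; the helper name `describe_status` is hypothetical and not part of MechanicalSoup:

```python
from http import HTTPStatus


def describe_status(code):
    '''Return a short label for an HTTP status code such as page.status_code.'''
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one the standard library knows about
        return f'{code}: unknown status code'
    return f'{code}: {status.phrase}'


for code in (200, 404, 500):
    print(describe_status(code))
```

With a real response, this could be called as `describe_status(page.status_code)`.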
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MechanicalSoup uses the BeautifulSoup library to parse the HTML it retrieves" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(page.soup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can view the HTML through the `.soup` attribute" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\n", "Log In\n", "\n", "\n", "
\n", "

\n", "

Please log in to access Mount Olympus:

\n", "

\n", "
\n", "Username:
\n", "Password:

\n", "\n", "
\n", "
\n", "\n", "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "page.soup\n", "# Note that this page contains a <form> with <input> elements\n",
"# for entering a username and a password" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submitting a form with MechanicalSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "If we open the URL, we see a form asking for a username and a password.
\n", "Credentials: username `zeus`, password `ThunderDude`
\n", "After a successful login, we are redirected to the /profiles page
\n", "
\n", "# Completing and submitting forms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "The important part of the HTML is the login form,
\n", " which sits inside the `<form>` tags. It is made up of
\n", " 3 `<input>` elements: one named 'user', another named 'pwd', and the last one is the submit button
\n", "
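
The form structure described above can be explored offline. A rough sketch, assuming a simplified copy of the login form (the `LOGIN_HTML` string is a hypothetical approximation for illustration, not the actual page source) and the `html.parser` backend bundled with Python:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of the login form (an assumption, not the real page source)
LOGIN_HTML = '''
<form action='/login' method='post'>
  <input type='text' name='user'>
  <input type='password' name='pwd'>
  <input type='submit' value='Submit'>
</form>
'''

soup = BeautifulSoup(LOGIN_HTML, 'html.parser')
form = soup.select('form')[0]

# Collect the type and name of every <input> inside the form, in document order
fields = [(tag.get('type'), tag.get('name')) for tag in form.select('input')]
for field_type, field_name in fields:
    print(field_type, field_name)
```

The submit button has no `name` attribute in this sketch, so its name is reported as `None`.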
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the Browser object" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "browser = mechanicalsoup.Browser()\n", "# Set the URL\n", "url = \"http://olympus.realpython.org/login\"\n", "# Request the page\n", "login_page = browser.get(url)\n", "# Get the parsed HTML\n", "login_html = login_page.soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select the `<form>` tag.\n",
"`.select()` returns a list of the matching elements on the page." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "form = login_html.select(\"form\")[0]\n", "# Fill in the 'user' field\n", "form.select(\"input\")[0][\"value\"] = \"zeus\"\n", "# Fill in the 'pwd' field\n", "form.select(\"input\")[1][\"value\"] = \"ThunderDude\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submit the request with two arguments: the form object and the URL" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(form)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Response [200]>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profiles_page = browser.submit(form, login_page.url)\n", "# Check the status of the request\n", "profiles_page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check which URL we land on once the form has been submitted" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'http://olympus.realpython.org/profiles'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profiles_page.url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "Attackers automate this kind of code
\n", "to try many username and password combinations and break in by brute force.
\n", "Many websites guard against this by limiting the number of login attempts.
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting the link to each profile on the page" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, it helps to look at the structure of the HTML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "profiles_page.soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the profiles are inside anchor (`<a>`) elements" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "links = profiles_page.soup.select(\"a\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterate over each link and print its 'href' attribute" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for link in links:\n", "    address = link[\"href\"]\n", "    text = link.text\n", "    print(f\"{text}: {address}\")\n", "\n", "# The URLs stored in each href attribute are relative URLs\n", "# We can build the full URL by combining the base URL with each relative one\n", "\n", "base_url = \"http://olympus.realpython.org\"\n", "for link in links:\n", "    address = base_url + link[\"href\"]\n", "    text = link.text\n", "    print(f\"{text}: {address}\")\n", "\n", "\n", "# Exercise\n", "# Use MechanicalSoup to fill in the form at\n", "#     http://olympus.realpython.org/login\n", "# After submitting the form, display the title of the current page\n", "#     to confirm that we have been redirected" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }