{ "cells": [ { "cell_type": "markdown", "id": "55b73472-831c-45c7-acb1-8b7ed5ceb21a", "metadata": {}, "source": [ "# Scraping Meneame.net" ] }, { "cell_type": "code", "execution_count": 1, "id": "bae700ff-2ef8-493c-9ccb-bcb2f76c10a4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from datetime import datetime\n", "import time\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "pd.options.display.float_format = '{:.2f}'.format  # Disable scientific notation in pandas\n", "np.set_printoptions(suppress=True)  # Disable scientific notation in NumPy\n", "pd.set_option('display.max_columns', None)  # Show all columns when displaying a DataFrame" ] }, { "cell_type": "code", "execution_count": 2, "id": "6476875d-a047-4a46-bae0-b3f9a05466a0", "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests" ] }, { "cell_type": "code", "execution_count": 3, "id": "59abd17c-9c4c-496a-bfca-e28a8ce4c862", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\r\\n403 Forbidden\\r\\n\\r\\n

403 Forbidden

\\r\\n'\n" ] } ], "source": [ "url = 'https://www.meneame.net'\n", "r = requests.get(url)\n", "print(r.content[:100])" ] }, { "cell_type": "code", "execution_count": 4, "id": "e9ceed05-d1d7-407a-a9c0-afcf479c4c01", "metadata": {}, "outputs": [], "source": [ "from selenium import webdriver\n", "from selenium.webdriver.common.by import By" ] }, { "cell_type": "code", "execution_count": null, "id": "2a430ef7-cb14-4af3-ae50-ffe44fa4f1bb", "metadata": {}, "outputs": [], "source": [ "options = webdriver.ChromeOptions()\n", "options.add_argument('--remote-debugging-port=9515')\n", "#driver = webdriver.Chrome('./chromedriver.exe', options=options)\n", "driver = webdriver.Chrome('/usr/bin/chromedriver', options=options)\n", "url_ra = 'https://www.meneame.net/'\n", "driver.implicitly_wait(3)\n", "driver.get(url_ra)" ] }, { "cell_type": "code", "execution_count": 10, "id": "02ae2590-0767-41fc-ba3e-b22c4977d8cc", "metadata": {}, "outputs": [], "source": [ "# Click the button that accepts cookies\n", "driver.find_element(By.XPATH, '//*[@id=\"qc-cmp2-ui\"]/div[2]/div/button[2]').click()" ] }, { "cell_type": "markdown", "id": "778684b2-db71-4dca-a061-c5f121dc6400", "metadata": {}, "source": [ "# Exercise:\n", "\n", "From the first page of news stories, extract:\n", "- Total number of stories\n", "- Story titles\n", "- Story summaries\n", "- Total visits and comments for each story\n", "- URL of each story" ] }, { "cell_type": "markdown", "id": "304c8043-1051-4335-81b9-0915ac8cc314", "metadata": {}, "source": [ "Advanced extraction:\n", "- Extract the statistics for the first story. The statistics are found under the three-dot menu at the bottom right of the screen. Each story has an id, which is used to fetch the JSON with that story's statistics.\n", "\n", "-> https://www.meneame.net/backend/karma-story.json?id=[story id without the l- prefix]\n", "\n", "On this page, the story id appears in the element with id = \"users-typing-container\"." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }