{ "cells": [ { "cell_type": "markdown", "id": "89048c14-8cc6-4949-ab73-9c974c688037", "metadata": {}, "source": [ "# Instalación de NLTK" ] }, { "cell_type": "code", "execution_count": 1, "id": "4dc6ada2-1fab-411a-808b-1ef836abc5b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: nltk in /home/mydoctor/anaconda3/lib/python3.8/site-packages (3.7)\n", "Requirement already satisfied: tqdm in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from nltk) (4.63.0)\n", "Requirement already satisfied: joblib in /home/mydoctor/.local/lib/python3.8/site-packages (from nltk) (1.1.0)\n", "Requirement already satisfied: click in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from nltk) (8.0.4)\n", "Requirement already satisfied: regex>=2021.8.3 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from nltk) (2022.3.15)\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install nltk" ] }, { "cell_type": "code", "execution_count": 1, "id": "2c1643c6-4347-4aad-bd49-fab7b9559847", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instalaciónn de paquetes\n", "import nltk\n", "nltk.download()" ] }, { "cell_type": "code", "execution_count": null, "id": "d63f922d-dad3-4cbf-ae18-c4e057041b32", "metadata": {}, "outputs": [], "source": [ "# Instalar todos los paquetes." 
] }, { "cell_type": "markdown", "id": "24f6b5b4-018d-4541-8a7f-b7e32f58b705", "metadata": {}, "source": [ "### Ejemplo Tokenización" ] }, { "cell_type": "code", "execution_count": 2, "id": "44bae85d-a633-4bad-8a61-dde23886901f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Pablito', 'clavó', 'un', 'clavito', 'cuantos', 'clavitos', 'clava', 'pablito']\n" ] } ], "source": [ "from nltk.tokenize import WordPunctTokenizer \n", "texto_pablito = \"Pablito clavó un clavito cuantos clavitos clava pablito\"\n", "pablito_tokenizado = WordPunctTokenizer().tokenize(texto_pablito)\n", "print(pablito_tokenizado)" ] }, { "cell_type": "markdown", "id": "4c40f171-e7a4-41a9-98f2-d6b40aaa204c", "metadata": {}, "source": [ "Agunas palabras tienen el mismo significado o muy parecido y nos gustaría tokenizarlas como la misma. Por ejemplo: clavó y clava, clavito y clavitos y Pablito y pablito. Si bien el sustantivo propio con minúscula corresponde a un error, podemos encontrarnos con cosas de este tipo en un corpus de texto real." 
] }, { "cell_type": "markdown", "id": "47005275-3b45-4f51-bf90-4e58b1e7ad74", "metadata": { "tags": [] }, "source": [ "### Ejemplo de steeming" ] }, { "cell_type": "code", "execution_count": 3, "id": "a43065aa-0471-468a-afdf-59836bba44dd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['pablit', 'clav', 'un', 'clavit', 'cuant', 'clavit', 'clav', 'pablit']\n" ] } ], "source": [ "from nltk.stem import SnowballStemmer\n", "stemmer = SnowballStemmer('spanish')\n", "stemmed_text = [stemmer.stem(i) for i in pablito_tokenizado]\n", "\n", "print(stemmed_text)" ] }, { "cell_type": "markdown", "id": "993ca0c0-89ba-4407-8ccb-d03ebf2d6d65", "metadata": {}, "source": [ "### Ejemplo de lemmatización" ] }, { "cell_type": "markdown", "id": "dbabb729-348a-403c-8330-7d1a97a485e6", "metadata": {}, "source": [ "La lemmatization es un poco más compleja y busca la mejor palabra de origen de las que tenemos" ] }, { "cell_type": "code", "execution_count": 4, "id": "da2b745d-e19d-4595-924e-d7ae5d1301e1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package wordnet to /home/mydoctor/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['Pablito', 'clavó', 'un', 'clavito', 'cuantos', 'clavitos', 'clava', 'pablito']\n" ] } ], "source": [ "from nltk.stem import WordNetLemmatizer\n", "nltk.download('wordnet')\n", "wnl = WordNetLemmatizer()\n", "lemmatized_text = [wnl.lemmatize(i) for i in pablito_tokenizado]\n", "print(lemmatized_text)" ] }, { "cell_type": "markdown", "id": "d7c89b81-fe3d-43e5-8cb2-b4b2cc56ecdc", "metadata": {}, "source": [ "### Ejemplo de lemmatización con Stanza" ] }, { "cell_type": "code", "execution_count": null, "id": "c0940ad7-b73f-4cff-a238-6953a10736b7", "metadata": {}, "outputs": [], "source": [ "import stanza\n", "stanza.download(\"es\")\n", "nlp = stanza.Pipeline(lang='es', 
processors='tokenize,mwt,pos,lemma')\n", "texto_pablito = \"Pablito clavó un clavito cuantos clavitos clava pablito\"\n", "doc = nlp(texto_pablito)\n", "print(*[f'Word: {word.text+\" \"}\tLemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')" ] }, { "cell_type": "code", "execution_count": null, "id": "3c1ae2f5-d1aa-458d-b8a9-5f78c8db6b29", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }