{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduccion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otra técnica para el análisis de textos, es el **Topic Modeling**. El objetivo del Top Modeling es encontrar los 'temas' presentes en el corpus. Se puede utilizar en buscadores, automatización de atención al cliente, ...\n", "\n", "Cada documento en el corpus estará formado por al menos un tema. En este notebook, realizaremos el top modeling a través de **Latent Dirichlet Allocation (LDA)**.\n", "El LDA es un aprendizaje no supervisado a través de una nube de palabras. A través de él podemos encontrar, temas ocultos y clasificar los documentos en base a los temas obtenidos entre otros.\n", "\n", "https://es.wikipedia.org/wiki/Latent_Dirichlet_Allocation \n", "https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2\n", "\n", "Para realizar un top modeling, necesitamos:\n", "* Document Term Matrix (corpus)\n", "* Los términos (topics) que queremos usar.\n", "\n", "Una vez aplicada el top modeling, es necesario interpretar los resultados para ver si tienen sentido. En el caso de que no lo tengan, se pueden variar el número de temas, los términos en el document-term matrix, los parámetros del modelo o incluso probar un modelo diferente." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Topic Modeling - Prueba #1 (Todo el texto)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: gensim in /home/mydoctor/anaconda3/lib/python3.8/site-packages (4.1.2)\n", "Requirement already satisfied: scipy>=0.18.1 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (1.7.3)\n", "Requirement already satisfied: numpy>=1.17.0 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (1.21.5)\n", "Requirement already satisfied: smart-open>=1.8.1 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (5.1.0)\n" ] } ], "source": [ "# Importar los módulos LDA con gensim\n", "#!pip install gensim" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaah aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aaah aah abc \\\n", "ali 0 0 0 0 0 0 1 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 1 0 0 0 0 0 0 \n", "bo 0 1 1 1 0 0 0 \n", "dave 0 0 0 0 1 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 0 0 3 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 0 \n", "\n", " abcs ability abject ... zee zen zeppelin zero zillion \\\n", "ali 0 0 0 ... 0 0 0 0 0 \n", "anthony 0 0 0 ... 0 0 0 0 0 \n", "bill 1 0 0 ... 0 0 0 1 1 \n", "bo 0 1 0 ... 0 0 0 1 0 \n", "dave 0 0 0 ... 0 0 0 0 0 \n", "hasan 0 0 0 ... 2 1 0 1 0 \n", "jim 0 0 0 ... 0 0 0 0 0 \n", "joe 0 0 0 ... 0 0 0 0 0 \n", "john 0 0 0 ... 0 0 0 0 0 \n", "louis 0 0 0 ... 0 0 0 2 0 \n", "mike 0 0 0 ... 0 0 2 1 0 \n", "ricky 0 1 1 ... 0 0 0 0 0 \n", "\n", " zombie zombies zoning zoo éclair \n", "ali 1 0 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 7468 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cargamos el document-term matrix generado previamente\n", "import pandas as pd\n", "import pickle\n", "\n", "datos = pd.read_pickle('dtm_stop.pkl')\n", "datos" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from gensim import matutils, models\n", "import scipy.sparse\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ali anthony bill bo dave hasan jim joe john louis \\\n", "aaaaah 0 0 1 0 0 0 0 0 0 0 \n", "aaaaahhhhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaaaauuugghhhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaaahhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaah 0 0 0 0 1 0 0 0 0 0 \n", "\n", " mike ricky \n", "aaaaah 0 0 \n", "aaaaahhhhhhh 0 0 \n", "aaaaauuugghhhhhh 0 0 \n", "aaaahhhhh 0 0 \n", "aaah 0 0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Uno de los requerimientos para el LDA es un term-document matrix transpuesto\n", "tdm = datos.transpose()\n", "tdm.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Cambiamos el formato de la matriz a 'gensim'\n", "# Pasos necesarios df --> matriz dispersa --> corpus gensim\n", "matriz_dispersa = scipy.sparse.csr_matrix(tdm)\n", "corpus = matutils.Sparse2Corpus(matriz_dispersa)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Gensim necesita de un diccionario con todos los términos y su ubicación en el corpus.\n", "# Recuperamos la matriz generada en el script 2\n", "cv = pickle.load(open(\"cv_stop.pkl\", \"rb\"))\n", "id2word = dict((v, k) for k, v in cv.vocabulary_.items())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ya tenemos el corpus y el diccionario palabra:ubicación, necesitamos especificar otros 2 parámetros:\n", "- El total de temas y\n", "- El total de iteraciones en el entrenamiento. \n", "\n", "Probamos con 2 temas y veremos si el resultado tiene sentido." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.006*\"cause\" + 0.006*\"shit\" + 0.005*\"fucking\" + 0.005*\"really\" + 0.005*\"good\" + 0.005*\"went\" + 0.005*\"hes\" + 0.004*\"thing\" + 0.004*\"didnt\" + 0.004*\"day\"'),\n", " (1,\n", " '0.009*\"fucking\" + 0.006*\"fuck\" + 0.006*\"say\" + 0.006*\"shit\" + 0.005*\"want\" + 0.005*\"going\" + 0.005*\"theyre\" + 0.004*\"love\" + 0.004*\"did\" + 0.004*\"hes\"')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.005*\"little\" + 0.005*\"love\" + 0.004*\"bo\" + 0.004*\"stuff\" + 0.004*\"old\" + 0.004*\"clinton\" + 0.004*\"say\" + 0.004*\"way\" + 0.004*\"repeat\" + 0.004*\"hey\"'),\n", " (1,\n", " '0.008*\"shit\" + 0.007*\"fucking\" + 0.006*\"fuck\" + 0.005*\"say\" + 0.005*\"theyre\" + 0.005*\"didnt\" + 0.005*\"want\" + 0.005*\"going\" + 0.005*\"cause\" + 0.005*\"hes\"'),\n", " (2,\n", " '0.012*\"fucking\" + 0.010*\"went\" + 0.006*\"going\" + 0.006*\"day\" + 0.006*\"say\" + 0.006*\"thing\" + 0.005*\"theyre\" + 0.005*\"ive\" + 0.005*\"hes\" + 0.005*\"goes\"')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# LDA for num_topics = 3\n", "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.007*\"id\" + 0.007*\"say\" + 0.006*\"says\" + 0.005*\"didnt\" + 0.005*\"went\" + 0.005*\"goes\" + 0.005*\"mean\" + 0.005*\"fucking\" + 0.005*\"going\" + 0.005*\"cause\"'),\n", " (1,\n", " '0.009*\"fuck\" + 0.009*\"shit\" + 0.009*\"fucking\" + 
0.007*\"theyre\" + 0.005*\"cause\" + 0.005*\"gotta\" + 0.005*\"theres\" + 0.005*\"man\" + 0.005*\"lot\" + 0.004*\"wanna\"'),\n", " (2,\n", " '0.008*\"dad\" + 0.006*\"going\" + 0.006*\"say\" + 0.005*\"hey\" + 0.005*\"shes\" + 0.005*\"mom\" + 0.005*\"want\" + 0.004*\"love\" + 0.004*\"did\" + 0.004*\"look\"'),\n", " (3,\n", " '0.010*\"fucking\" + 0.008*\"shit\" + 0.007*\"fuck\" + 0.006*\"thing\" + 0.006*\"good\" + 0.005*\"hes\" + 0.005*\"want\" + 0.005*\"didnt\" + 0.005*\"day\" + 0.005*\"say\"')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# LDA for num_topics = 4\n", "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we get, for each topic, is a list of words together with the probability of each word appearing in that topic.\n", "But the results are poor. We have tried to improve them by varying the model parameters; let's now try changing the terms we use." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Topic Modeling - Attempt #2 (Nouns only)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common trick is to use only nouns, only adjectives, etc.\n", "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html -> the Penn Treebank tag set, used to look up the tag we filter on for nouns " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Create a function to extract the nouns from a text\n", "from nltk import word_tokenize, pos_tag\n", "\n", "def sustantivos(texto):\n", "    '''Given a string of text, tokenize it and return only the nouns.'''\n", "    # This is where we keep only the nouns: Penn Treebank noun tags start with 'NN'.\n", "    es_sustantivo = lambda pos: pos[:2] == 'NN'\n", "    \n", "    tokenizado = word_tokenize(texto)\n", "    todo_sustantivos = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sustantivo(pos)] \n", "    return ' '.join(todo_sustantivos)" ] },
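{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can run the function on a hypothetical sentence of our own (not from the transcripts) and confirm that only the nouns survive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The tokenizer and POS tagger models must be available; download them if needed\n", "import nltk\n", "nltk.download('punkt')\n", "nltk.download('averaged_perceptron_tagger')\n", "\n", "# Should return something like 'comedian joke family audience'\n", "sustantivos('the comedian told a joke about his family and the audience laughed')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": {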
" ], "text/plain": [ " transcripcion\n", "ali ladies and gentlemen please welcome to the sta...\n", "anthony thank you thank you thank you san francisco th...\n", "bill all right thank you thank you very much thank...\n", "bo bo what old macdonald had a farm e i e i o and...\n", "dave this is dave he tells dirty jokes for a living...\n", "hasan whats up davis whats up im home i had to bri...\n", "jim ladies and gentlemen please welcome to the ...\n", "joe ladies and gentlemen welcome joe rogan wha...\n", "john all right petunia wish me luck out there you w...\n", "louis introfade the music out lets roll hold there l...\n", "mike wow hey thank you thanks thank you guys hey se...\n", "ricky hello hello how you doing great thank you wow ..." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Leemos los datos limpios generados previamente\n", "datos_limpios = pd.read_pickle('datos_limpios.pkl')\n", "datos_limpios" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /home/mydoctor/nltk_data...\n", "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", "[nltk_data] date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Descargamos la librería para poder normalizar las palabras, según su contexto y análisis morfológico.\n", "import nltk\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " transcripcion\n", "ali ladies gentlemen stage ali hi thank hello na s...\n", "anthony thank thank people i em i francisco city world...\n", "bill thank thank pleasure georgia area oasis i june...\n", "bo macdonald farm e i o farm pig e i i snort macd...\n", "dave jokes living stare work profound train thought...\n", "hasan whats davis whats home i netflix la york i son...\n", "jim ladies gentlemen stage mr jim jefferies thank ...\n", "joe ladies gentlemen joe fuck thanks phone fuckfac...\n", "john petunia thats hello hello chicago thank crowd ...\n", "louis music lets lights lights thank i i place place...\n", "mike wow hey thanks look insane years everyone i id...\n", "ricky hello thank fuck thank im gon youre weve money..." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extraemos los sustantivos\n", "datos_sustantivos = pd.DataFrame(datos_limpios ['transcripcion'].apply(sustantivos))\n", "datos_sustantivos" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mydoctor/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aah abc abcs ability \\\n", "ali 0 0 0 0 1 0 0 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 0 0 0 0 0 1 0 \n", "bo 1 1 1 0 0 0 1 \n", "dave 0 0 0 0 0 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 3 0 0 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 1 \n", "\n", " abortion abortions abuse ... yummy ze zealand zee zeppelin \\\n", "ali 0 0 0 ... 0 0 0 0 0 \n", "anthony 2 0 0 ... 0 0 10 0 0 \n", "bill 0 0 0 ... 0 1 0 0 0 \n", "bo 0 0 0 ... 0 0 0 0 0 \n", "dave 0 1 0 ... 0 0 0 0 0 \n", "hasan 0 0 0 ... 0 0 0 1 0 \n", "jim 0 0 0 ... 0 0 0 0 0 \n", "joe 0 0 1 ... 0 0 0 0 0 \n", "john 0 0 0 ... 0 0 0 0 0 \n", "louis 0 0 0 ... 0 0 0 0 0 \n", "mike 0 0 0 ... 0 0 0 0 2 \n", "ricky 0 0 0 ... 1 0 0 0 0 \n", "\n", " zillion zombie zombies zoo éclair \n", "ali 0 1 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 4635 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creamos un nuevo corpus sólo con los sustantivos\n", "from sklearn.feature_extraction import text\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Quitamos las stopwords, puesto que vamos a generar un nuevo corpus\n", "add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',\n", " 'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']\n", "stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)\n", "\n", "# Corpus sólo con sustantivos\n", "cvs = CountVectorizer(stop_words=stop_words)\n", "datos_cvs = cvs.fit_transform(datos_sustantivos['transcripcion'])\n", "datos_dtms = pd.DataFrame(datos_cvs.toarray(), columns=cvs.get_feature_names())\n", "datos_dtms.index = datos_sustantivos.index\n", "datos_dtms" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Generar el corpus gensim\n", "corpuss = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtms.transpose()))\n", "\n", "# Generar el diccionario de vocabulario\n", "id2words = dict((v, k) for k, v in cvs.vocabulary_.items())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.010*\"thing\" + 0.009*\"day\" + 0.008*\"life\" + 0.007*\"way\" + 0.007*\"cause\" + 0.006*\"kids\" + 0.006*\"hes\" + 0.005*\"mom\" + 0.005*\"joke\" + 0.005*\"lot\"'),\n", " (1,\n", " '0.009*\"shit\" + 0.008*\"man\" + 0.008*\"day\" + 0.008*\"thing\" + 0.007*\"fuck\" + 0.007*\"hes\" + 0.007*\"life\" + 0.006*\"way\" + 0.006*\"cause\" + 0.006*\"guy\"')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Empezamos por 2 temas\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=2, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.013*\"dad\" + 0.007*\"life\" + 0.007*\"shes\" + 0.006*\"mom\" + 0.006*\"parents\" + 0.006*\"school\" + 0.005*\"girl\" + 0.005*\"home\" + 0.005*\"hes\" + 0.004*\"night\"'),\n", " (1,\n", " '0.008*\"day\" + 0.008*\"thing\" + 0.007*\"way\" + 0.007*\"shit\" + 0.007*\"man\" + 0.007*\"hes\" + 0.006*\"years\" + 0.006*\"guy\" + 0.006*\"joke\" + 
0.006*\"cause\"'),\n", " (2,\n", " '0.011*\"thing\" + 0.010*\"life\" + 0.010*\"day\" + 0.010*\"cause\" + 0.010*\"shit\" + 0.009*\"fuck\" + 0.009*\"man\" + 0.008*\"women\" + 0.008*\"lot\" + 0.007*\"hes\"')]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 3\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=3, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.013*\"shit\" + 0.009*\"guy\" + 0.009*\"gon\" + 0.009*\"dude\" + 0.008*\"thing\" + 0.008*\"hes\" + 0.007*\"fuck\" + 0.006*\"man\" + 0.006*\"life\" + 0.006*\"day\"'),\n", " (1,\n", " '0.008*\"man\" + 0.008*\"life\" + 0.008*\"dad\" + 0.007*\"day\" + 0.007*\"way\" + 0.007*\"hes\" + 0.007*\"shes\" + 0.007*\"thing\" + 0.006*\"fuck\" + 0.006*\"house\"'),\n", " (2,\n", " '0.009*\"cause\" + 0.009*\"day\" + 0.008*\"thing\" + 0.008*\"man\" + 0.008*\"guy\" + 0.007*\"fuck\" + 0.007*\"women\" + 0.007*\"shit\" + 0.006*\"way\" + 0.006*\"lot\"'),\n", " (3,\n", " '0.011*\"thing\" + 0.011*\"day\" + 0.011*\"life\" + 0.010*\"lot\" + 0.010*\"shit\" + 0.008*\"cause\" + 0.008*\"women\" + 0.007*\"hes\" + 0.007*\"joke\" + 0.006*\"gon\"')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 4\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=4, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.009*\"fuck\" + 0.009*\"thing\" + 0.009*\"man\" + 0.008*\"things\" + 0.007*\"day\" + 0.007*\"hes\" + 0.007*\"kids\" + 0.007*\"life\" + 0.007*\"theyre\" + 0.007*\"years\"'),\n", " (1,\n", " '0.010*\"man\" + 0.009*\"shit\" + 0.009*\"dad\" + 0.008*\"fuck\" + 0.007*\"hes\" + 0.007*\"life\" + 0.006*\"way\" + 0.005*\"stuff\" + 0.005*\"night\" + 0.005*\"lot\"'),\n", " (2,\n", " '0.011*\"day\" + 0.010*\"cause\" + 0.009*\"thing\" + 0.008*\"way\" + 0.007*\"guy\" + 0.006*\"house\" + 0.006*\"night\" + 0.005*\"kind\" + 0.005*\"women\" + 0.005*\"man\"'),\n", " (3,\n", " '0.015*\"shit\" + 0.012*\"life\" + 0.011*\"thing\" + 0.010*\"hes\" + 0.009*\"gon\" + 0.009*\"cause\" + 0.008*\"guy\" + 0.008*\"day\" + 0.008*\"dude\" + 0.008*\"lot\"'),\n", " (4,\n", " '0.015*\"joke\" + 0.013*\"day\" + 0.008*\"thing\" + 0.008*\"anthony\" + 0.008*\"school\" + 0.007*\"family\" + 0.007*\"jokes\" + 0.007*\"grandma\" + 0.006*\"lot\" + 0.006*\"baby\"')]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 5\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=5, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic Modeling - Attempt #3 (Nouns and Adjectives)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Function to extract the nouns and adjectives\n", "def sust_adj(texto):\n", "    '''Given a text, tokenize it and return only the nouns and adjectives.'''\n", "    es_sust_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'\n", "    tokenizado = word_tokenize(texto)\n", "    todo_sust_adj = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sust_adj(pos)] \n", "    return ' '.join(todo_sust_adj)" ] },
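{ "cell_type": "markdown", "metadata": {}, "source": [ "As with the noun extractor, we can check the function on a hypothetical sentence of our own; this time the adjectives survive too:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Should return something like 'funny comedian great joke'\n", "sust_adj('the funny comedian told a great joke')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": {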
" ], "text/plain": [ " transcripcion\n", "ali ladies gentlemen welcome stage ali wong hi wel...\n", "anthony thank san francisco thank good people surprise...\n", "bill right thank thank pleasure greater atlanta geo...\n", "bo old macdonald farm e i i o farm pig e i i snor...\n", "dave dirty jokes living stare most hard work profou...\n", "hasan whats davis whats im home i netflix special la...\n", "jim ladies gentlemen welcome stage mr jim jefferie...\n", "joe ladies gentlemen joe fuck san francisco thanks...\n", "john right petunia august thats good right hello he...\n", "louis music lets lights lights thank much i i i nice...\n", "mike wow hey thanks hey seattle nice look crazy ins...\n", "ricky hello great thank fuck thank lovely welcome im..." ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Aplicamos la función a los datos limpios\n", "datos_sust_adj = pd.DataFrame(datos_limpios['transcripcion'].apply(sust_adj))\n", "datos_sust_adj" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mydoctor/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaah aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aah abc abcs \\\n", "ali 0 0 0 0 0 1 0 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 1 0 0 0 0 0 1 \n", "bo 0 1 1 1 0 0 0 \n", "dave 0 0 0 0 0 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 0 3 0 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 0 \n", "\n", " ability abject able ... ze zealand zee zeppelin zero \\\n", "ali 0 0 2 ... 0 0 0 0 0 \n", "anthony 0 0 0 ... 0 10 0 0 0 \n", "bill 0 0 1 ... 1 0 0 0 0 \n", "bo 1 0 0 ... 0 0 0 0 1 \n", "dave 0 0 0 ... 0 0 0 0 0 \n", "hasan 0 0 1 ... 0 0 2 0 0 \n", "jim 0 0 1 ... 0 0 0 0 0 \n", "joe 0 0 2 ... 0 0 0 0 0 \n", "john 0 0 3 ... 0 0 0 0 0 \n", "louis 0 0 1 ... 0 0 0 0 0 \n", "mike 0 0 0 ... 0 0 0 2 0 \n", "ricky 1 1 2 ... 0 0 0 0 0 \n", "\n", " zillion zombie zombies zoo éclair \n", "ali 0 1 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 5587 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creación del nuevo corpus, ahora sólo con sustantivos y adjetivos. Además eliminamos las stop words con max_df superior a 0.8\n", "cvna = CountVectorizer(stop_words=stop_words, max_df=.8)\n", "datos_cvna = cvna.fit_transform(datos_sust_adj['transcripcion'])\n", "datos_dtmna = pd.DataFrame(datos_cvna.toarray(), columns=cvna.get_feature_names())\n", "datos_dtmna.index = datos_sust_adj.index\n", "datos_dtmna" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Creación del corpus gensim\n", "corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtmna.transpose()))\n", "\n", "# Diccionario de vocabulario\n", "id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.004*\"joke\" + 0.003*\"comedy\" + 0.003*\"bo\" + 0.003*\"friend\" + 0.002*\"mad\" + 0.002*\"mom\" + 0.002*\"jenny\" + 0.002*\"repeat\" + 0.002*\"jokes\" + 0.002*\"gay\"'),\n", " (1,\n", " '0.004*\"mom\" + 0.004*\"parents\" + 0.003*\"joke\" + 0.003*\"ass\" + 0.003*\"hasan\" + 0.003*\"dog\" + 0.003*\"clinton\" + 0.002*\"guns\" + 0.002*\"class\" + 0.002*\"youve\"')]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 2\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.009*\"joke\" + 0.005*\"gun\" + 0.004*\"guns\" + 0.004*\"hell\" + 0.003*\"jokes\" + 0.003*\"ass\" + 0.003*\"anthony\" + 0.003*\"party\" + 0.003*\"son\" + 0.003*\"class\"'),\n", " (1,\n", " '0.006*\"mom\" + 0.004*\"parents\" + 0.003*\"bo\" + 0.003*\"friend\" + 0.003*\"hasan\" + 0.003*\"jenny\" + 0.003*\"clinton\" + 0.003*\"comedy\" + 0.003*\"door\" + 0.003*\"love\"'),\n", " (2,\n", " '0.006*\"ahah\" + 0.005*\"tit\" + 0.005*\"gay\" + 0.004*\"nigga\" + 0.004*\"young\" + 0.003*\"ok\" + 0.003*\"murder\" + 0.003*\"son\" + 0.003*\"ha\" + 0.003*\"oj\"')]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 3\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=3, 
id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.006*\"ok\" + 0.005*\"bo\" + 0.004*\"repeat\" + 0.004*\"ahah\" + 0.004*\"gay\" + 0.004*\"eye\" + 0.004*\"young\" + 0.003*\"tit\" + 0.003*\"husband\" + 0.003*\"um\"'),\n", " (1,\n", " '0.011*\"joke\" + 0.005*\"guns\" + 0.005*\"jokes\" + 0.004*\"anthony\" + 0.004*\"ass\" + 0.004*\"party\" + 0.003*\"gun\" + 0.003*\"cunt\" + 0.003*\"girlfriend\" + 0.003*\"twitter\"'),\n", " (2,\n", " '0.005*\"mom\" + 0.005*\"hasan\" + 0.004*\"parents\" + 0.004*\"door\" + 0.003*\"dick\" + 0.003*\"stupid\" + 0.003*\"religion\" + 0.003*\"brown\" + 0.003*\"jesus\" + 0.003*\"gun\"'),\n", " (3,\n", " '0.006*\"clinton\" + 0.006*\"jenny\" + 0.005*\"parents\" + 0.005*\"friend\" + 0.005*\"mom\" + 0.004*\"cow\" + 0.004*\"wife\" + 0.004*\"john\" + 0.003*\"accident\" + 0.003*\"idea\"')]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 4\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identifying the topics of each document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the 10 topic models we have extracted, the one that seems most meaningful is the 4-topic model from the nouns-and-adjectives attempt. Let's now refine it by training with more passes." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.007*\"gun\" + 0.006*\"guns\" + 0.005*\"ass\" + 0.004*\"son\" + 0.004*\"class\" + 0.004*\"girlfriend\" + 0.003*\"hell\" + 0.003*\"business\" + 0.003*\"cunt\" + 0.003*\"dog\"'),\n", " (1,\n", " '0.005*\"joke\" + 0.005*\"jenny\" + 0.003*\"jenner\" + 0.003*\"texas\" + 0.003*\"door\" + 0.003*\"jokes\" + 0.003*\"nuts\" + 0.003*\"dead\" + 0.003*\"stupid\" + 0.003*\"sort\"'),\n", " (2,\n", " '0.005*\"joke\" + 0.005*\"ok\" + 0.004*\"bo\" + 0.004*\"repeat\" + 0.004*\"ahah\" + 0.003*\"gay\" + 0.003*\"eye\" + 0.003*\"mad\" + 0.003*\"young\" + 0.003*\"anthony\"'),\n", " (3,\n", " '0.009*\"mom\" + 0.007*\"parents\" + 0.006*\"hasan\" + 0.006*\"clinton\" + 0.004*\"cow\" + 0.004*\"york\" + 0.004*\"brown\" + 0.004*\"wife\" + 0.003*\"birthday\" + 0.003*\"bike\"')]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Final LDA model (for now)\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=180)\n", "ldana.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These 4 topics look fairly 'decent':\n", "* Topic 0: family\n", "* Topic 1: husband\n", "* Topic 2: business\n", "* Topic 3: grandma, profanity" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2, 'ali'),\n", " (2, 'anthony'),\n", " (0, 'bill'),\n", " (2, 'bo'),\n", " (2, 'dave'),\n", " (3, 'hasan'),\n", " (0, 'jim'),\n", " (1, 'joe'),\n", " (3, 'john'),\n", " (2, 'louis'),\n", " (1, 'mike'),\n", " (1, 'ricky')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check which topic each transcript is assigned to\n", "corpus_transformado = ldana[corpusna]\n", "list(zip([a for [(a,b)] in corpus_transformado], datos_dtmna.index))" ] },
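{ "cell_type": "markdown", "metadata": {}, "source": [ "The list above keeps only the dominant topic per transcript. A hypothetical follow-up (a sketch, not part of the original analysis) prints the full topic mixture of each document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Full topic distribution per transcript; minimum_probability=0 keeps every topic\n", "for bow, comedian in zip(corpusna, datos_dtmna.index):\n", "    print(comedian, ldana.get_document_topics(bow, minimum_probability=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, {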
"cell_type": "markdown", "metadata": {}, "source": [ "1. Prueba a modificar los parámetros para obtener unos mejores resultados.\n", "2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }