{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduccion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otra técnica para el análisis de textos, es el **Topic Modeling**. El objetivo del Top Modeling es encontrar los 'temas' presentes en el corpus. Se puede utilizar en buscadores, automatización de atención al cliente, ...\n", "\n", "Cada documento en el corpus estará formado por al menos un tema. En este notebook, realizaremos el top modeling a través de **Latent Dirichlet Allocation (LDA)**.\n", "El LDA es un aprendizaje no supervisado a través de una nube de palabras. A través de él podemos encontrar, temas ocultos y clasificar los documentos en base a los temas obtenidos entre otros.\n", "\n", "https://es.wikipedia.org/wiki/Latent_Dirichlet_Allocation \n", "https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2\n", "\n", "Para realizar un top modeling, necesitamos:\n", "* Document Term Matrix (corpus)\n", "* Los términos (topics) que queremos usar.\n", "\n", "Una vez aplicada el top modeling, es necesario interpretar los resultados para ver si tienen sentido. En el caso de que no lo tengan, se pueden variar el número de temas, los términos en el document-term matrix, los parámetros del modelo o incluso probar un modelo diferente." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Topic Modeling - Prueba #1 (Todo el texto)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: gensim in /home/mydoctor/anaconda3/lib/python3.8/site-packages (4.1.2)\n", "Requirement already satisfied: scipy>=0.18.1 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (1.7.3)\n", "Requirement already satisfied: numpy>=1.17.0 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (1.21.5)\n", "Requirement already satisfied: smart-open>=1.8.1 in /home/mydoctor/anaconda3/lib/python3.8/site-packages (from gensim) (5.1.0)\n" ] } ], "source": [ "# Importar los módulos LDA con gensim\n", "#!pip install gensim" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaah aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aaah aah abc \\\n", "ali 0 0 0 0 0 0 1 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 1 0 0 0 0 0 0 \n", "bo 0 1 1 1 0 0 0 \n", "dave 0 0 0 0 1 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 0 0 3 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 0 \n", "\n", " abcs ability abject ... zee zen zeppelin zero zillion \\\n", "ali 0 0 0 ... 0 0 0 0 0 \n", "anthony 0 0 0 ... 0 0 0 0 0 \n", "bill 1 0 0 ... 0 0 0 1 1 \n", "bo 0 1 0 ... 0 0 0 1 0 \n", "dave 0 0 0 ... 0 0 0 0 0 \n", "hasan 0 0 0 ... 2 1 0 1 0 \n", "jim 0 0 0 ... 0 0 0 0 0 \n", "joe 0 0 0 ... 0 0 0 0 0 \n", "john 0 0 0 ... 0 0 0 0 0 \n", "louis 0 0 0 ... 0 0 0 2 0 \n", "mike 0 0 0 ... 0 0 2 1 0 \n", "ricky 0 1 1 ... 0 0 0 0 0 \n", "\n", " zombie zombies zoning zoo éclair \n", "ali 1 0 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 7468 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cargamos el document-term matrix generado previamente\n", "import pandas as pd\n", "import pickle\n", "\n", "datos = pd.read_pickle('dtm_stop.pkl')\n", "datos" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from gensim import matutils, models\n", "import scipy.sparse\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ali anthony bill bo dave hasan jim joe john louis \\\n", "aaaaah 0 0 1 0 0 0 0 0 0 0 \n", "aaaaahhhhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaaaauuugghhhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaaahhhhh 0 0 0 1 0 0 0 0 0 0 \n", "aaah 0 0 0 0 1 0 0 0 0 0 \n", "\n", " mike ricky \n", "aaaaah 0 0 \n", "aaaaahhhhhhh 0 0 \n", "aaaaauuugghhhhhh 0 0 \n", "aaaahhhhh 0 0 \n", "aaah 0 0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Uno de los requerimientos para el LDA es un term-document matrix transpuesto\n", "tdm = datos.transpose()\n", "tdm.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Cambiamos el formato de la matriz a 'gensim'\n", "# Pasos necesarios df --> matriz dispersa --> corpus gensim\n", "matriz_dispersa = scipy.sparse.csr_matrix(tdm)\n", "corpus = matutils.Sparse2Corpus(matriz_dispersa)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Gensim necesita de un diccionario con todos los términos y su ubicación en el corpus.\n", "# Recuperamos la matriz generada en el script 2\n", "cv = pickle.load(open(\"cv_stop.pkl\", \"rb\"))\n", "id2word = dict((v, k) for k, v in cv.vocabulary_.items())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ya tenemos el corpus y el diccionario palabra:ubicación, necesitamos especificar otros 2 parámetros:\n", "- El total de temas y\n", "- El total de iteraciones en el entrenamiento. \n", "\n", "Probamos con 2 temas y veremos si el resultado tiene sentido." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.006*\"cause\" + 0.006*\"shit\" + 0.005*\"fucking\" + 0.005*\"really\" + 0.005*\"good\" + 0.005*\"went\" + 0.005*\"hes\" + 0.004*\"thing\" + 0.004*\"didnt\" + 0.004*\"day\"'),\n", " (1,\n", " '0.009*\"fucking\" + 0.006*\"fuck\" + 0.006*\"say\" + 0.006*\"shit\" + 0.005*\"want\" + 0.005*\"going\" + 0.005*\"theyre\" + 0.004*\"love\" + 0.004*\"did\" + 0.004*\"hes\"')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.005*\"little\" + 0.005*\"love\" + 0.004*\"bo\" + 0.004*\"stuff\" + 0.004*\"old\" + 0.004*\"clinton\" + 0.004*\"say\" + 0.004*\"way\" + 0.004*\"repeat\" + 0.004*\"hey\"'),\n", " (1,\n", " '0.008*\"shit\" + 0.007*\"fucking\" + 0.006*\"fuck\" + 0.005*\"say\" + 0.005*\"theyre\" + 0.005*\"didnt\" + 0.005*\"want\" + 0.005*\"going\" + 0.005*\"cause\" + 0.005*\"hes\"'),\n", " (2,\n", " '0.012*\"fucking\" + 0.010*\"went\" + 0.006*\"going\" + 0.006*\"day\" + 0.006*\"say\" + 0.006*\"thing\" + 0.005*\"theyre\" + 0.005*\"ive\" + 0.005*\"hes\" + 0.005*\"goes\"')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# LDA for num_topics = 3\n", "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.007*\"id\" + 0.007*\"say\" + 0.006*\"says\" + 0.005*\"didnt\" + 0.005*\"went\" + 0.005*\"goes\" + 0.005*\"mean\" + 0.005*\"fucking\" + 0.005*\"going\" + 0.005*\"cause\"'),\n", " (1,\n", " '0.009*\"fuck\" + 0.009*\"shit\" + 0.009*\"fucking\" + 
0.007*\"theyre\" + 0.005*\"cause\" + 0.005*\"gotta\" + 0.005*\"theres\" + 0.005*\"man\" + 0.005*\"lot\" + 0.004*\"wanna\"'),\n", " (2,\n", " '0.008*\"dad\" + 0.006*\"going\" + 0.006*\"say\" + 0.005*\"hey\" + 0.005*\"shes\" + 0.005*\"mom\" + 0.005*\"want\" + 0.004*\"love\" + 0.004*\"did\" + 0.004*\"look\"'),\n", " (3,\n", " '0.010*\"fucking\" + 0.008*\"shit\" + 0.007*\"fuck\" + 0.006*\"thing\" + 0.006*\"good\" + 0.005*\"hes\" + 0.005*\"want\" + 0.005*\"didnt\" + 0.005*\"day\" + 0.005*\"say\"')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# LDA for num_topics = 4\n", "lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=40)\n", "lda.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we get, for each topic, is a list of words together with the probability of each word appearing in that topic.\n", "But the results are poor. We have tried to improve them by varying the model parameters; let's now try changing the terms we use." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Topic Modeling - Attempt #2 (Nouns only)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common trick is to use only nouns, only adjectives, etc.\n", "https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html -> the Penn Treebank tag set, used to look up the tag we filter on for nouns " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Create a function to extract the nouns from a text\n", "from nltk import word_tokenize, pos_tag\n", "\n", "def sustantivos(texto):\n", "    '''Given a string of text, tokenize it and return only the nouns.'''\n", "    # This is where we keep only the nouns: Penn Treebank noun tags start with 'NN'.\n", "    es_sustantivo = lambda pos: pos[:2] == 'NN'\n", "    \n", "    tokenizado = word_tokenize(texto)\n", "    todo_sustantivos = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sustantivo(pos)] \n", "    return ' '.join(todo_sustantivos)" ] },
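{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can run the function on a hypothetical sentence of our own (not from the transcripts) and confirm that only the nouns survive:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The tokenizer and POS tagger models must be available; download them if needed\n", "import nltk\n", "nltk.download('punkt')\n", "nltk.download('averaged_perceptron_tagger')\n", "\n", "# Should return something like 'comedian joke family audience'\n", "sustantivos('the comedian told a joke about his family and the audience laughed')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": {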
" ], "text/plain": [ " transcripcion\n", "ali ladies and gentlemen please welcome to the sta...\n", "anthony thank you thank you thank you san francisco th...\n", "bill all right thank you thank you very much thank...\n", "bo bo what old macdonald had a farm e i e i o and...\n", "dave this is dave he tells dirty jokes for a living...\n", "hasan whats up davis whats up im home i had to bri...\n", "jim ladies and gentlemen please welcome to the ...\n", "joe ladies and gentlemen welcome joe rogan wha...\n", "john all right petunia wish me luck out there you w...\n", "louis introfade the music out lets roll hold there l...\n", "mike wow hey thank you thanks thank you guys hey se...\n", "ricky hello hello how you doing great thank you wow ..." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Leemos los datos limpios generados previamente\n", "datos_limpios = pd.read_pickle('datos_limpios.pkl')\n", "datos_limpios" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /home/mydoctor/nltk_data...\n", "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", "[nltk_data] date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Descargamos la librería para poder normalizar las palabras, según su contexto y análisis morfológico.\n", "import nltk\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " transcripcion\n", "ali ladies gentlemen stage ali hi thank hello na s...\n", "anthony thank thank people i em i francisco city world...\n", "bill thank thank pleasure georgia area oasis i june...\n", "bo macdonald farm e i o farm pig e i i snort macd...\n", "dave jokes living stare work profound train thought...\n", "hasan whats davis whats home i netflix la york i son...\n", "jim ladies gentlemen stage mr jim jefferies thank ...\n", "joe ladies gentlemen joe fuck thanks phone fuckfac...\n", "john petunia thats hello hello chicago thank crowd ...\n", "louis music lets lights lights thank i i place place...\n", "mike wow hey thanks look insane years everyone i id...\n", "ricky hello thank fuck thank im gon youre weve money..." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extraemos los sustantivos\n", "datos_sustantivos = pd.DataFrame(datos_limpios ['transcripcion'].apply(sustantivos))\n", "datos_sustantivos" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mydoctor/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aah abc abcs ability \\\n", "ali 0 0 0 0 1 0 0 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 0 0 0 0 0 1 0 \n", "bo 1 1 1 0 0 0 1 \n", "dave 0 0 0 0 0 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 3 0 0 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 1 \n", "\n", " abortion abortions abuse ... yummy ze zealand zee zeppelin \\\n", "ali 0 0 0 ... 0 0 0 0 0 \n", "anthony 2 0 0 ... 0 0 10 0 0 \n", "bill 0 0 0 ... 0 1 0 0 0 \n", "bo 0 0 0 ... 0 0 0 0 0 \n", "dave 0 1 0 ... 0 0 0 0 0 \n", "hasan 0 0 0 ... 0 0 0 1 0 \n", "jim 0 0 0 ... 0 0 0 0 0 \n", "joe 0 0 1 ... 0 0 0 0 0 \n", "john 0 0 0 ... 0 0 0 0 0 \n", "louis 0 0 0 ... 0 0 0 0 0 \n", "mike 0 0 0 ... 0 0 0 0 2 \n", "ricky 0 0 0 ... 1 0 0 0 0 \n", "\n", " zillion zombie zombies zoo éclair \n", "ali 0 1 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 4635 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creamos un nuevo corpus sólo con los sustantivos\n", "from sklearn.feature_extraction import text\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# Quitamos las stopwords, puesto que vamos a generar un nuevo corpus\n", "add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',\n", " 'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']\n", "stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)\n", "\n", "# Corpus sólo con sustantivos\n", "cvs = CountVectorizer(stop_words=stop_words)\n", "datos_cvs = cvs.fit_transform(datos_sustantivos['transcripcion'])\n", "datos_dtms = pd.DataFrame(datos_cvs.toarray(), columns=cvs.get_feature_names())\n", "datos_dtms.index = datos_sustantivos.index\n", "datos_dtms" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Generar el corpus gensim\n", "corpuss = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtms.transpose()))\n", "\n", "# Generar el diccionario de vocabulario\n", "id2words = dict((v, k) for k, v in cvs.vocabulary_.items())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.010*\"thing\" + 0.009*\"day\" + 0.008*\"life\" + 0.007*\"way\" + 0.007*\"cause\" + 0.006*\"kids\" + 0.006*\"hes\" + 0.005*\"mom\" + 0.005*\"joke\" + 0.005*\"lot\"'),\n", " (1,\n", " '0.009*\"shit\" + 0.008*\"man\" + 0.008*\"day\" + 0.008*\"thing\" + 0.007*\"fuck\" + 0.007*\"hes\" + 0.007*\"life\" + 0.006*\"way\" + 0.006*\"cause\" + 0.006*\"guy\"')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Empezamos por 2 temas\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=2, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.013*\"dad\" + 0.007*\"life\" + 0.007*\"shes\" + 0.006*\"mom\" + 0.006*\"parents\" + 0.006*\"school\" + 0.005*\"girl\" + 0.005*\"home\" + 0.005*\"hes\" + 0.004*\"night\"'),\n", " (1,\n", " '0.008*\"day\" + 0.008*\"thing\" + 0.007*\"way\" + 0.007*\"shit\" + 0.007*\"man\" + 0.007*\"hes\" + 0.006*\"years\" + 0.006*\"guy\" + 0.006*\"joke\" + 
0.006*\"cause\"'),\n", " (2,\n", " '0.011*\"thing\" + 0.010*\"life\" + 0.010*\"day\" + 0.010*\"cause\" + 0.010*\"shit\" + 0.009*\"fuck\" + 0.009*\"man\" + 0.008*\"women\" + 0.008*\"lot\" + 0.007*\"hes\"')]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 3\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=3, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.013*\"shit\" + 0.009*\"guy\" + 0.009*\"gon\" + 0.009*\"dude\" + 0.008*\"thing\" + 0.008*\"hes\" + 0.007*\"fuck\" + 0.006*\"man\" + 0.006*\"life\" + 0.006*\"day\"'),\n", " (1,\n", " '0.008*\"man\" + 0.008*\"life\" + 0.008*\"dad\" + 0.007*\"day\" + 0.007*\"way\" + 0.007*\"hes\" + 0.007*\"shes\" + 0.007*\"thing\" + 0.006*\"fuck\" + 0.006*\"house\"'),\n", " (2,\n", " '0.009*\"cause\" + 0.009*\"day\" + 0.008*\"thing\" + 0.008*\"man\" + 0.008*\"guy\" + 0.007*\"fuck\" + 0.007*\"women\" + 0.007*\"shit\" + 0.006*\"way\" + 0.006*\"lot\"'),\n", " (3,\n", " '0.011*\"thing\" + 0.011*\"day\" + 0.011*\"life\" + 0.010*\"lot\" + 0.010*\"shit\" + 0.008*\"cause\" + 0.008*\"women\" + 0.007*\"hes\" + 0.007*\"joke\" + 0.006*\"gon\"')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 4\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=4, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.009*\"fuck\" + 0.009*\"thing\" + 0.009*\"man\" + 0.008*\"things\" + 0.007*\"day\" + 0.007*\"hes\" + 0.007*\"kids\" + 0.007*\"life\" + 0.007*\"theyre\" + 0.007*\"years\"'),\n", " (1,\n", " '0.010*\"man\" + 0.009*\"shit\" + 0.009*\"dad\" + 0.008*\"fuck\" + 0.007*\"hes\" + 0.007*\"life\" + 0.006*\"way\" + 0.005*\"stuff\" + 0.005*\"night\" + 0.005*\"lot\"'),\n", " (2,\n", " '0.011*\"day\" + 0.010*\"cause\" + 0.009*\"thing\" + 0.008*\"way\" + 0.007*\"guy\" + 0.006*\"house\" + 0.006*\"night\" + 0.005*\"kind\" + 0.005*\"women\" + 0.005*\"man\"'),\n", " (3,\n", " '0.015*\"shit\" + 0.012*\"life\" + 0.011*\"thing\" + 0.010*\"hes\" + 0.009*\"gon\" + 0.009*\"cause\" + 0.008*\"guy\" + 0.008*\"day\" + 0.008*\"dude\" + 0.008*\"lot\"'),\n", " (4,\n", " '0.015*\"joke\" + 0.013*\"day\" + 0.008*\"thing\" + 0.008*\"anthony\" + 0.008*\"school\" + 0.007*\"family\" + 0.007*\"jokes\" + 0.007*\"grandma\" + 0.006*\"lot\" + 0.006*\"baby\"')]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 5\n", "ldas = models.LdaModel(corpus=corpuss, num_topics=5, id2word=id2words, passes=10)\n", "ldas.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic Modeling - Attempt #3 (Nouns and Adjectives)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Function to extract the nouns and adjectives\n", "def sust_adj(texto):\n", "    '''Given a text, tokenize it and return only the nouns and adjectives.'''\n", "    es_sust_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'\n", "    tokenizado = word_tokenize(texto)\n", "    todo_sust_adj = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sust_adj(pos)] \n", "    return ' '.join(todo_sust_adj)" ] },
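{ "cell_type": "markdown", "metadata": {}, "source": [ "As with the noun extractor, we can check the function on a hypothetical sentence of our own; this time the adjectives survive too:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Should return something like 'funny comedian great joke'\n", "sust_adj('the funny comedian told a great joke')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": {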
" ], "text/plain": [ " transcripcion\n", "ali ladies gentlemen welcome stage ali wong hi wel...\n", "anthony thank san francisco thank good people surprise...\n", "bill right thank thank pleasure greater atlanta geo...\n", "bo old macdonald farm e i i o farm pig e i i snor...\n", "dave dirty jokes living stare most hard work profou...\n", "hasan whats davis whats im home i netflix special la...\n", "jim ladies gentlemen welcome stage mr jim jefferie...\n", "joe ladies gentlemen joe fuck san francisco thanks...\n", "john right petunia august thats good right hello he...\n", "louis music lets lights lights thank much i i i nice...\n", "mike wow hey thanks hey seattle nice look crazy ins...\n", "ricky hello great thank fuck thank lovely welcome im..." ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Aplicamos la función a los datos limpios\n", "datos_sust_adj = pd.DataFrame(datos_limpios['transcripcion'].apply(sust_adj))\n", "datos_sust_adj" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mydoctor/anaconda3/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " aaaaah aaaaahhhhhhh aaaaauuugghhhhhh aaaahhhhh aah abc abcs \\\n", "ali 0 0 0 0 0 1 0 \n", "anthony 0 0 0 0 0 0 0 \n", "bill 1 0 0 0 0 0 1 \n", "bo 0 1 1 1 0 0 0 \n", "dave 0 0 0 0 0 0 0 \n", "hasan 0 0 0 0 0 0 0 \n", "jim 0 0 0 0 0 0 0 \n", "joe 0 0 0 0 0 0 0 \n", "john 0 0 0 0 0 0 0 \n", "louis 0 0 0 0 3 0 0 \n", "mike 0 0 0 0 0 0 0 \n", "ricky 0 0 0 0 0 0 0 \n", "\n", " ability abject able ... ze zealand zee zeppelin zero \\\n", "ali 0 0 2 ... 0 0 0 0 0 \n", "anthony 0 0 0 ... 0 10 0 0 0 \n", "bill 0 0 1 ... 1 0 0 0 0 \n", "bo 1 0 0 ... 0 0 0 0 1 \n", "dave 0 0 0 ... 0 0 0 0 0 \n", "hasan 0 0 1 ... 0 0 2 0 0 \n", "jim 0 0 1 ... 0 0 0 0 0 \n", "joe 0 0 2 ... 0 0 0 0 0 \n", "john 0 0 3 ... 0 0 0 0 0 \n", "louis 0 0 1 ... 0 0 0 0 0 \n", "mike 0 0 0 ... 0 0 0 2 0 \n", "ricky 1 1 2 ... 0 0 0 0 0 \n", "\n", " zillion zombie zombies zoo éclair \n", "ali 0 1 0 0 0 \n", "anthony 0 0 0 0 0 \n", "bill 1 1 1 0 0 \n", "bo 0 0 0 0 0 \n", "dave 0 0 0 0 0 \n", "hasan 0 0 0 0 0 \n", "jim 0 0 0 0 0 \n", "joe 0 0 0 0 0 \n", "john 0 0 0 0 1 \n", "louis 0 0 0 0 0 \n", "mike 0 0 0 0 0 \n", "ricky 0 0 0 1 0 \n", "\n", "[12 rows x 5587 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creación del nuevo corpus, ahora sólo con sustantivos y adjetivos. Además eliminamos las stop words con max_df superior a 0.8\n", "cvna = CountVectorizer(stop_words=stop_words, max_df=.8)\n", "datos_cvna = cvna.fit_transform(datos_sust_adj['transcripcion'])\n", "datos_dtmna = pd.DataFrame(datos_cvna.toarray(), columns=cvna.get_feature_names())\n", "datos_dtmna.index = datos_sust_adj.index\n", "datos_dtmna" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Creación del corpus gensim\n", "corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtmna.transpose()))\n", "\n", "# Diccionario de vocabulario\n", "id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.004*\"joke\" + 0.003*\"comedy\" + 0.003*\"bo\" + 0.003*\"friend\" + 0.002*\"mad\" + 0.002*\"mom\" + 0.002*\"jenny\" + 0.002*\"repeat\" + 0.002*\"jokes\" + 0.002*\"gay\"'),\n", " (1,\n", " '0.004*\"mom\" + 0.004*\"parents\" + 0.003*\"joke\" + 0.003*\"ass\" + 0.003*\"hasan\" + 0.003*\"dog\" + 0.003*\"clinton\" + 0.002*\"guns\" + 0.002*\"class\" + 0.002*\"youve\"')]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 2\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.009*\"joke\" + 0.005*\"gun\" + 0.004*\"guns\" + 0.004*\"hell\" + 0.003*\"jokes\" + 0.003*\"ass\" + 0.003*\"anthony\" + 0.003*\"party\" + 0.003*\"son\" + 0.003*\"class\"'),\n", " (1,\n", " '0.006*\"mom\" + 0.004*\"parents\" + 0.003*\"bo\" + 0.003*\"friend\" + 0.003*\"hasan\" + 0.003*\"jenny\" + 0.003*\"clinton\" + 0.003*\"comedy\" + 0.003*\"door\" + 0.003*\"love\"'),\n", " (2,\n", " '0.006*\"ahah\" + 0.005*\"tit\" + 0.005*\"gay\" + 0.004*\"nigga\" + 0.004*\"young\" + 0.003*\"ok\" + 0.003*\"murder\" + 0.003*\"son\" + 0.003*\"ha\" + 0.003*\"oj\"')]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 3\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=3, 
id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.006*\"ok\" + 0.005*\"bo\" + 0.004*\"repeat\" + 0.004*\"ahah\" + 0.004*\"gay\" + 0.004*\"eye\" + 0.004*\"young\" + 0.003*\"tit\" + 0.003*\"husband\" + 0.003*\"um\"'),\n", " (1,\n", " '0.011*\"joke\" + 0.005*\"guns\" + 0.005*\"jokes\" + 0.004*\"anthony\" + 0.004*\"ass\" + 0.004*\"party\" + 0.003*\"gun\" + 0.003*\"cunt\" + 0.003*\"girlfriend\" + 0.003*\"twitter\"'),\n", " (2,\n", " '0.005*\"mom\" + 0.005*\"hasan\" + 0.004*\"parents\" + 0.004*\"door\" + 0.003*\"dick\" + 0.003*\"stupid\" + 0.003*\"religion\" + 0.003*\"brown\" + 0.003*\"jesus\" + 0.003*\"gun\"'),\n", " (3,\n", " '0.006*\"clinton\" + 0.006*\"jenny\" + 0.005*\"parents\" + 0.005*\"friend\" + 0.005*\"mom\" + 0.004*\"cow\" + 0.004*\"wife\" + 0.004*\"john\" + 0.003*\"accident\" + 0.003*\"idea\"')]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# topics = 4\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)\n", "ldana.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identifying the topics of each document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of the 10 topic models we have extracted, the one that seems most meaningful is the 4-topic model from the nouns-and-adjectives attempt. Let's now refine it by training with more passes." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.007*\"gun\" + 0.006*\"guns\" + 0.005*\"ass\" + 0.004*\"son\" + 0.004*\"class\" + 0.004*\"girlfriend\" + 0.003*\"hell\" + 0.003*\"business\" + 0.003*\"cunt\" + 0.003*\"dog\"'),\n", " (1,\n", " '0.005*\"joke\" + 0.005*\"jenny\" + 0.003*\"jenner\" + 0.003*\"texas\" + 0.003*\"door\" + 0.003*\"jokes\" + 0.003*\"nuts\" + 0.003*\"dead\" + 0.003*\"stupid\" + 0.003*\"sort\"'),\n", " (2,\n", " '0.005*\"joke\" + 0.005*\"ok\" + 0.004*\"bo\" + 0.004*\"repeat\" + 0.004*\"ahah\" + 0.003*\"gay\" + 0.003*\"eye\" + 0.003*\"mad\" + 0.003*\"young\" + 0.003*\"anthony\"'),\n", " (3,\n", " '0.009*\"mom\" + 0.007*\"parents\" + 0.006*\"hasan\" + 0.006*\"clinton\" + 0.004*\"cow\" + 0.004*\"york\" + 0.004*\"brown\" + 0.004*\"wife\" + 0.003*\"birthday\" + 0.003*\"bike\"')]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Final LDA model (for now)\n", "ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=180)\n", "ldana.print_topics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These 4 topics look fairly 'decent':\n", "* Topic 0: family\n", "* Topic 1: husband\n", "* Topic 2: business\n", "* Topic 3: grandma, profanity" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2, 'ali'),\n", " (2, 'anthony'),\n", " (0, 'bill'),\n", " (2, 'bo'),\n", " (2, 'dave'),\n", " (3, 'hasan'),\n", " (0, 'jim'),\n", " (1, 'joe'),\n", " (3, 'john'),\n", " (2, 'louis'),\n", " (1, 'mike'),\n", " (1, 'ricky')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check which topic each transcript is assigned to\n", "corpus_transformado = ldana[corpusna]\n", "list(zip([a for [(a,b)] in corpus_transformado], datos_dtmna.index))" ] },
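{ "cell_type": "markdown", "metadata": {}, "source": [ "The list above keeps only the dominant topic per transcript. A hypothetical follow-up (a sketch, not part of the original analysis) prints the full topic mixture of each document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Full topic distribution per transcript; minimum_probability=0 keeps every topic\n", "for bow, comedian in zip(corpusna, datos_dtmna.index):\n", "    print(comedian, ldana.get_document_topics(bow, minimum_probability=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, {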
"cell_type": "markdown", "metadata": {}, "source": [ "1. Prueba a modificar los parámetros para obtener unos mejores resultados.\n", "2. Create a new topic model that includes terms from a different [part of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and see if you can get better topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }