{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NLP (Natural Language Processing) with Python\n", " \n", "**Requirements: You will need to install the NLTK library and download the stopwords corpus. The Anaconda distribution and Google Colab include the NLTK package by default. If NLTK is not installed, run the following cell.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Run this cell only if NLTK is not installed\n", "# Uncomment the following line to install the library:\n", "\n", "#!conda install nltk " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NLTK Downloader\n", "---------------------------------------------------------------------------\n", " d) Download l) List u) Update c) Config h) Help q) Quit\n", "---------------------------------------------------------------------------\n", "Downloader> d\n", "\n", "Download which package (l=list; x=cancel)?\n", " Identifier> stopwords\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " Downloading package stopwords to C:\\Users\\Julen\n", " Montes\\AppData\\Roaming\\nltk_data...\n", " Unzipping corpora\\stopwords.zip.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "---------------------------------------------------------------------------\n", " d) Download l) List u) Update c) Config h) Help q) Quit\n", "---------------------------------------------------------------------------\n", "Downloader> q\n" ] } ], "source": [ "nltk.download_shell() \n", "# In the interactive shell, choose d) Download\n", "# and enter the identifier: stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise we will use a dataset from [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). This dataset is in the **datos** folder. It is in English and contains more than 5,000 SMS messages. For more information about the dataset, see the **readme** file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we check the total number of messages in the dataset. We use rstrip() to strip trailing whitespace (including newlines and carriage returns) from each line:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5574\n" ] } ], "source": [ "mensajes = [line.rstrip() for line in open('datos/SMSSpamCollection')]\n", "print(len(mensajes))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(mensajes)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ham\\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',\n", " 'ham\\tOk lar... Joking wif u oni...',\n", " \"spam\\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\",\n", " 'ham\\tU dun say so early hor... U c already then say...',\n", " \"ham\\tNah I don't think he goes to usf, he lives around here though\",\n", " \"spam\\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv\",\n", " 'ham\\tEven my brother is not like to speak with me. They treat me like aids patent.',\n", " \"ham\\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. 
Press *9 to copy your friends Callertune\",\n", " 'spam\\tWINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',\n", " 'spam\\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',\n", " \"ham\\tI'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.\",\n", " 'spam\\tSIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info',\n", " 'spam\\tURGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18',\n", " \"ham\\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\",\n", " 'ham\\tI HAVE A DATE ON SUNDAY WITH WILL!!',\n", " 'spam\\tXXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL',\n", " \"ham\\tOh k...i'm watching here:)\",\n", " 'ham\\tEh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.',\n", " 'ham\\tFine if that\\x92s the way u feel. That\\x92s the way its gota b',\n", " 'spam\\tEngland v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes[0:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A collection of texts is usually called a \"corpus\". We can print the messages along with their index in the list using **enumerate**:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n", "1 ham\tOk lar... Joking wif u oni...\n", "2 spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", "3 ham\tU dun say so early hor... U c already then say...\n", "4 ham\tNah I don't think he goes to usf, he lives around here though\n", "5 spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv\n", "6 ham\tEven my brother is not like to speak with me. They treat me like aids patent.\n", "7 ham\tAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\n", "8 spam\tWINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n", "9 spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030\n", "10 ham\tI'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.\n", "11 spam\tSIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info\n", "12 spam\tURGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18\n", "13 ham\tI've been searching for the right words to thank you for this breather. 
I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\n", "14 ham\tI HAVE A DATE ON SUNDAY WITH WILL!!\n", "15 spam\tXXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL\n", "16 ham\tOh k...i'm watching here:)\n", "17 ham\tEh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.\n", "18 ham\tFine if that’s the way u feel. That’s the way its gota b\n", "19 spam\tEngland v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+\n" ] } ], "source": [ "# Use a separate loop variable so the list `mensajes` is not overwritten\n", "for num_mensaje, mensaje in enumerate(mensajes[:20]):\n", "    print(num_mensaje, mensaje)\n", "    #print('\\n')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset uses a tab (`\\t`) as its separator (it is a TSV file). The first column indicates whether the message is spam or not (ham), and the second column contains the body of the SMS.\n", "\n", "Using machine learning, we will train a model that learns to tell automatically whether an SMS is spam or not. We can then use the model to classify SMS messages that have no class label.\n", "\n", "The process we follow is illustrated in the official scikit-learn documentation:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>clase</th><th>mensajes</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>ham</td><td>Go until jurong point, crazy.. Available only ...</td></tr>\n",
"    <tr><th>1</th><td>ham</td><td>Ok lar... Joking wif u oni...</td></tr>\n",
"    <tr><th>2</th><td>spam</td><td>Free entry in 2 a wkly comp to win FA Cup fina...</td></tr>\n",
"    <tr><th>3</th><td>ham</td><td>U dun say so early hor... U c already then say...</td></tr>\n",
"    <tr><th>4</th><td>ham</td><td>Nah I don't think he goes to usf, he lives aro...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " clase mensajes\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes = pd.read_csv('datos/SMSSpamCollection', sep='\\t',\n", "                       names=[\"clase\", \"mensajes\"])\n", "mensajes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial exploratory analysis" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>clase</th><th>mensajes</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>5572</td><td>5572</td></tr>\n",
"    <tr><th>unique</th><td>2</td><td>5169</td></tr>\n",
"    <tr><th>top</th><td>ham</td><td>Sorry, I'll call later</td></tr>\n",
"    <tr><th>freq</th><td>4825</td><td>30</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " clase mensajes\n", "count 5572 5572\n", "unique 2 5169\n", "top ham Sorry, I'll call later\n", "freq 4825 30" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We group the data by class and look at what describe() returns for each group." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th colspan=\"4\">mensajes</th></tr>\n",
"    <tr><th></th><th>count</th><th>unique</th><th>top</th><th>freq</th></tr>\n",
"    <tr><th>clase</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>ham</th><td>4825</td><td>4516</td><td>Sorry, I'll call later</td><td>30</td></tr>\n",
"    <tr><th>spam</th><td>747</td><td>653</td><td>Please call our customer service representativ...</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " mensajes \n", " count unique top freq\n", "clase \n", "ham 4825 4516 Sorry, I'll call later 30\n", "spam 747 653 Please call our customer service representativ... 4" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.groupby('clase').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we carry out an exploratory analysis to get to know the data we are working with. The better we understand the data, the better equipped we are for [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering).\n", "\n", "Enriching the data with well-chosen features can noticeably improve the predictive power of our model on a given dataset." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>clase</th><th>mensajes</th><th>tamaño</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>ham</td><td>Go until jurong point, crazy.. Available only ...</td><td>111</td></tr>\n",
"    <tr><th>1</th><td>ham</td><td>Ok lar... Joking wif u oni...</td><td>29</td></tr>\n",
"    <tr><th>2</th><td>spam</td><td>Free entry in 2 a wkly comp to win FA Cup fina...</td><td>155</td></tr>\n",
"    <tr><th>3</th><td>ham</td><td>U dun say so early hor... U c already then say...</td><td>49</td></tr>\n",
"    <tr><th>4</th><td>ham</td><td>Nah I don't think he goes to usf, he lives aro...</td><td>61</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " clase mensajes tamaño\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes['tamaño'] = mensajes['mensajes'].apply(len)\n", "mensajes.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>clase</th><th>mensajes</th><th>tamaño</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1085</th><td>ham</td><td>For me the love should start with attraction.i...</td><td>910</td></tr>\n",
"    <tr><th>1863</th><td>ham</td><td>The last thing i ever wanted to do was hurt yo...</td><td>790</td></tr>\n",
"    <tr><th>2434</th><td>ham</td><td>Indians r poor but India is not a poor country...</td><td>629</td></tr>\n",
"    <tr><th>1579</th><td>ham</td><td>How to Make a girl Happy? It's not at all diff...</td><td>611</td></tr>\n",
"    <tr><th>2158</th><td>ham</td><td>Sad story of a Man - Last week was my b'day. M...</td><td>588</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>3376</th><td>ham</td><td>:)</td><td>3</td></tr>\n",
"    <tr><th>5357</th><td>ham</td><td>Ok</td><td>2</td></tr>\n",
"    <tr><th>4498</th><td>ham</td><td>Ok</td><td>2</td></tr>\n",
"    <tr><th>1925</th><td>ham</td><td>Ok</td><td>2</td></tr>\n",
"    <tr><th>3051</th><td>ham</td><td>Ok</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5572 rows × 3 columns</p>\n",
"</div>" ], "text/plain": [ " clase mensajes tamaño\n", "1085 ham For me the love should start with attraction.i... 910\n", "1863 ham The last thing i ever wanted to do was hurt yo... 790\n", "2434 ham Indians r poor but India is not a poor country... 629\n", "1579 ham How to Make a girl Happy? It's not at all diff... 611\n", "2158 ham Sad story of a Man - Last week was my b'day. M... 588\n", "... ... ... ...\n", "3376 ham :) 3\n", "5357 ham Ok 2\n", "4498 ham Ok 2\n", "1925 ham Ok 2\n", "3051 ham Ok 2\n", "\n", "[5572 rows x 3 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.sort_values('tamaño', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing the data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAtQAAAHSCAYAAADMnFxwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAdL0lEQVR4nO3dfbCmZ10f8O+PhPJqhzDZ0JgXNzIrmjAScE1p6YuCNpEogXZil6lMxkFDp8FC64xuGKfiH5nJHwLaaaGGlxoRjSugpAStIYqOM23CAhFIQoYdsyZL0mTFWsA6wYRf/zh3wlM82X12r73Pc86ez2fmzHPf13Pf9/mdPdfsfnPleq6rujsAAMDxedKqCwAAgK1MoAYAgAECNQAADBCoAQBggEANAAADBGoAABhw6qoLGHH66af3zp07V10GAAAnuU984hN/3t071ntvSwfqnTt3Zv/+/asuAwCAk1xV/dkTvWfKBwAADBCoAQBggEANAAADBGoAABggUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAANOXXUB29HOvTc9fnzw2ktXWAkAAKOMUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAbMHqir6pSq+lRVfXg6f3ZV3VxVn59eT1u49uqqOlBVd1fVxXPXBgAAozZihPoNSe5aON+b5Jbu3pXkluk8VXV+kj1JLkhySZK3V9UpG1AfAAAct1kDdVWdneTSJO9aaL4syfXT8fVJXrnQfkN3P9zd9yQ5kOSiOesDAIBRc49Q/0KSn0rytYW253T3A0kyvZ4xtZ+V5L6F6w5NbQAAsGnNFqir6geTPNTdn1j2lnXaep3nXllV+6tq/+HDh4dqBACAUXOOUL8kySuq6mCSG5K8tKp+NcmDVXVmkkyvD03XH0pyzsL9Zye5/xsf2t3Xdffu7t69Y8eOGcsHAICjmy1Qd/fV3X12d+/M2ocNf7+7fyTJjUmumC67IsmHpuMbk+ypqqdU1XlJdiW5ba76AADgRDh1Bd/z2iT7quq1Se5NcnmSdPcdVbUvyZ1JHklyVXc/uoL6AABgaRsSqLv7Y0k+Nh1/McnLnuC6a5JcsxE1AQDAiWCnRAAAGCBQAwDAAIEaAAAGCNQAADBAoAYAgAECNQAADBCoAQBggEANAAADBGoAABggUAMAwACBGgAABpy66gL4up17b3r8+OC1l66wEgAAlmWEGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADBAoAYAgAECNQAADBCoAQBggEANAAADBGoAABggUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYMBsgbqqnlpVt1XVn1TVHVX1c1P7m6vqC1V1+/T18oV7rq6qA1V1d1VdPFdtAABwopw647MfTvLS7v5KVT05yR9X1e9M772tu39+8eKqOj/JniQXJPnmJB+tqm/r7kdnrBEAAIbMNkLda74ynT55+uoj3HJZkhu6++HuvifJgSQXzVUfAACcCLPOoa6qU6rq9iQPJbm5u2+d3np9VX26qt5TVadNbWcluW/h9kNT2zc+88qq2l9V+w8fPjxn+QAAcFSzBurufrS7L0xydpKLqur5Sd6R5LlJLkzyQJK3TJfXeo9Y55nXdffu7t69Y8eOWeoGAIBlbcgqH939l0k+luSS7n5wCtpfS/LOfH1ax6Ek5yzcdnaS+zeiPgAAOF5zrvKxo6qeNR0/Lcn3JflcVZ25cNmrknx2Or4xyZ6qekpVnZdkV5Lb5qoPAABOhDlX+TgzyfVVdUrWgvu+7v5wVb23qi7
M2nSOg0lelyTdfUdV7UtyZ5JHklxlhQ8AADa72QJ1d386yQvXaX/NEe65Jsk1c9UEAAAnmp0SAQBggEANAAADBGoAABggUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYcOqqC9gudu69adUlAAAwAyPUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAbY2GXFbPgCALC1GaEGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADBAoAYAgAECNQAADBCoAQBggEANAAADZgvUVfXUqrqtqv6kqu6oqp+b2p9dVTdX1een19MW7rm6qg5U1d1VdfFctQEAwIky5wj1w0le2t0vSHJhkkuq6sVJ9ia5pbt3JbllOk9VnZ9kT5ILklyS5O1VdcqM9QEAwLDZAnWv+cp0+uTpq5NcluT6qf36JK+cji9LckN3P9zd9yQ5kOSiueoDAIATYdY51FV1SlXdnuShJDd3961JntPdDyTJ9HrGdPlZSe5buP3Q1PaNz7yyqvZX1f7Dhw/PWT4AABzVrIG6ux/t7guTnJ3koqp6/hEur/Uesc4zr+vu3d29e8eOHSeoUgAAOD4bsspHd/9lko9lbW70g1V1ZpJMrw9Nlx1Kcs7CbWcnuX8j6gMAgOM15yofO6rqWdPx05J8X5LPJbkxyRXTZVck+dB0fGOSPVX1lKo6L8muJLfNVd9WsnPvTY9/AQCwuZw647PPTHL9tFLHk5Ls6+4PV9X/SLKvql6b5N4klydJd99RVfuS3JnkkSRXdfejM9YHAADDZgvU3f3pJC9cp/2LSV72BPdck+SauWoCAIATzU6JAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADBAoAYAgAECNQAADBCoAQBggEANAAADBGoAABggUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAacuuoCWN/OvTetugQAAJZghBoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAgNkCdVWdU1V/UFV3VdUdVfWGqf3NVfWFqrp9+nr5wj1XV9WBqrq7qi6eqzYAADhR5twp8ZEkP9ndn6yqb0ryiaq6eXrvbd3984sXV9X5SfYkuSDJNyf5aFV9W3c/OmONAAAwZLYR6u5+oLs/OR1/OcldSc46wi2XJbmhux/u7nuSHEhy0Vz1AQDAibAhc6irameSFya5dWp6fVV9uqreU1WnTW1nJblv4bZDWSeAV9WVVbW/qvYfPnx4zrIBAOCoZg/UVfXMJB9I8sbu/lKSdyR5bpILkzyQ5C2PXbrO7f23Grqv6+7d3b17x44d8xQNAABLmjVQV9WTsxam39fdH0yS7n6wux/t7q8leWe+Pq3jUJJzFm4/O8n9c9YHAACjZvtQYlVVkncnuau737rQfmZ3PzCdvirJZ6fjG5P8WlW9NWsfStyV5La56uPksnPvTY8fH7z20hVWAgBsN3Ou8vGSJK9J8pmqun1qe1OSV1fVhVmbznEwyeuSpLvvqKp9Se7M2gohV1nhAwCAzW62QN3df5z150V/5Aj3XJPkmrlqAgCAE81OiQAAMECgBgCAAUsF6qp6/tyFAADAVrTsCPV/qarbqurfVNWz5iwIAAC2kqUCdXf/oyT/KmvrRO+vql+rqu+ftTIAANgClp5D3d2fT/IzSX46yT9N8h+r6nNV9c/nKg4AADa7ZedQf2dVvS3JXUlemuSHuvs7puO3zVgfAABsasuuQ/2fsrZN+Ju6+68fa+z
u+6vqZ2apDAAAtoBlA/XLk/z1YzsXVtWTkjy1u/9vd793tuoAAGCTW3YO9UeTPG3h/OlTGwAAbGvLBuqndvdXHjuZjp8+T0kAALB1LDvl46+q6kXd/ckkqarvSvLXR7kHVmLn3psePz547aUrrAQA2A6WDdRvTPKbVXX/dH5mkn85S0UAALCFLBWou/vjVfXtSZ6XpJJ8rrv/ZtbKAABgC1h2hDpJvjvJzumeF1ZVuvtXZqkKAAC2iKUCdVW9N8lzk9ye5NGpuZMI1AAAbGvLjlDvTnJ+d/ecxQAAwFaz7LJ5n03y9+YsBAAAtqJlR6hPT3JnVd2W5OHHGrv7FbNUBQAAW8SygfrNcxYBAABb1bLL5v1hVX1Lkl3d/dGqenqSU+YtDQAANr+l5lBX1Y8neX+SX5qazkry2zPVBAAAW8ayUz6uSnJRkluTpLs/X1VnzFYVS7HFNgDA6i27ysfD3f3Vx06q6tSsrUMNAADb2rKB+g+r6k1JnlZV35/kN5P8t/nKAgCArWHZQL03yeEkn0nyuiQfSfIzcxUFAABbxbKrfHwtyTunL1Zocd40AACrt1Sgrqp7ss6c6e7+1hNeEQAAbCHLrvKxe+H4qUkuT/LsE18OAABsLUvNoe7uLy58faG7fyHJS+ctDQAANr9lp3y8aOH0SVkbsf6mWSoCAIAtZNkpH29ZOH4kycEkP3zCqwEAgC1m2VU+vnfuQgAAYCtadsrHvz/S+9391hNTDgAAbC3HssrHdye5cTr/oSR/lOS+OYoCAICtYtlAfXqSF3X3l5Okqt6c5De7+8fmKgwAALaCZbcePzfJVxfOv5pk5wmvBgAAtphlR6jfm+S2qvqtrO2Y+KokvzJbVQAAsEUsu8rHNVX1O0n+8dT0o939qfnKAgCArWHZKR9J8vQkX+ruX0xyqKrOm6kmAADYMpYK1FX1s0l+OsnVU9OTk/zqUe45p6r+oKruqqo7quoNU/uzq+rmqvr89Hrawj1XV9WBqrq7qi4+vh8JAAA2zrIj1K9K8ookf5Uk3X1/jr71+CNJfrK7vyPJi5NcVVXnJ9mb5Jbu3pXkluk803t7klyQ5JIkb6+qU47txwEAgI21bKD+and31j6QmKp6xtFu6O4HuvuT0/GXk9yV5KwklyW5frrs+iSvnI4vS3JDdz/c3fckOZDkoiXrAwCAlVg2UO+rql9K8qyq+vEkH03yzmW/SVXtTPLCJLcmeU53P5Cshe4kZ0yXnZX/f6OYQ1MbAABsWkdd5aOqKslvJPn2JF9K8rwk/6G7b17mG1TVM5N8IMkbu/tLa49b/9J12nqd512Z5MokOffcc5cpAQAAZnPUQN3dXVW/3d3flWSpEP2Yqnpy1sL0+7r7g1Pzg1V1Znc/UFVnJnloaj+U5JyF289Ocv869VyX5Lok2b17998K3AAAsJGWnfLxP6vqu4/lwdPI9ruT3NXdb11468YkV0zHVyT50EL7nqp6yrQk364ktx3L9wQAgI227E6J35vkX1fVwayt9FFZG7z+ziPc85Ikr0nymaq6fWp7U5JrszYn+7VJ7k1yedYedkdV7UtyZ9ZWCLmqux89th8HAAA21hEDdVWd2933JvmBY31wd/9x1p8XnSQve4J7rklyzbF+LwAAWJWjjVD/dpIXdfefVdUHuvtfbEBNAACwZRxtDvXiCPO3zlkIAABsRUcL1P0ExwAAQI4+5eMFVfWlrI1UP206Tr7+ocS/O2t1AACwyR0xUHf3KRtVCAAAbEXLrkMNAACsQ6AGAIABAjUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADBAoAYAgAECNQAADBCoAQBggEANAAADBGoAABggUAMAwACBGgAABgjUAAAw4NRVF8C8du696fHjg9deusJKAABOTgL1SUJwBgBYDVM+AABggEANAAA
DBGoAABggUAMAwACBGgAABgjUAAAwQKAGAIABAjUAAAwQqAEAYMBsgbqq3lNVD1XVZxfa3lxVX6iq26evly+8d3VVHaiqu6vq4rnqAgCAE2nOEepfTnLJOu1v6+4Lp6+PJElVnZ9kT5ILpnveXlWnzFgbAACcELMF6u7+oyR/seTllyW5obsf7u57khxIctFctQEAwImyijnUr6+qT09TQk6b2s5Kct/CNYemNgAA2NQ2OlC/I8lzk1yY5IEkb5naa51re70HVNWVVbW/qvYfPnx4liIBAGBZGxqou/vB7n60u7+W5J35+rSOQ0nOWbj07CT3P8Ezruvu3d29e8eOHfMWDAAAR7Ghgbqqzlw4fVWSx1YAuTHJnqp6SlWdl2RXkts2sjYAADgep8714Kr69STfk+T0qjqU5GeTfE9VXZi16RwHk7wuSbr7jqral+TOJI8kuaq7H52rNgAAOFFmC9Td/ep1mt99hOuvSXLNXPUAAMAc7JQIAAADBGoAABgw25QPVmfn3ptWXQIAwLZhhBoAAAYI1AAAMMCUj21kcSrIwWsvXWElAAAnDyPUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAA6zywd9iNRAAgOUZoQYAgAECNQAADBCoAQBggDnUbCnmdwMAm40RagAAGCBQAwDAAFM+tqnFqROJ6RMAAMfLCDUAAAwQqAEAYIBADQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADDg1FUXAMdr596bVl0CAIBAzea0GJYPXnvpCisBADgyUz4AAGCAQA0AAAMEagAAGGAONUl8wA8A4HgZoQYAgAGzBeqqek9VPVRVn11oe3ZV3VxVn59eT1t47+qqOlBVd1fVxXPVBQAAJ9KcI9S/nOSSb2jbm+SW7t6V5JbpPFV1fpI9SS6Y7nl7VZ0yY20AAHBCzBaou/uPkvzFNzRfluT66fj6JK9caL+hux/u7nuSHEhy0Vy1AQDAibLRc6if090PJMn0esbUflaS+xauOzS1AQDAprZZPpRY67T1uhdWXVlV+6tq/+HDh2cuCwAAjmyjA/WDVXVmkkyvD03th5Kcs3Dd2UnuX+8B3X1dd+/u7t07duyYtVgAADiajQ7UNya5Yjq+IsmHFtr3VNVTquq8JLuS3LbBtQEAwDGbbWOXqvr1JN+T5PSqOpTkZ5Ncm2RfVb02yb1JLk+S7r6jqvYluTPJI0mu6u5H56oNAABOlNkCdXe/+gneetkTXH9NkmvmqgcAAOawWT6UCAAAW5JADQAAA2ab8sHJZ+femx4/PnjtpSusBABg8zBCDQAAAwRqAAAYIFADAMAAgRoAAAYI1AAAMECgBgCAAZbNY9jcy+ktPh8AYLMRqJmNdasBgO1AoGalhG4AYKszhxoAAAYI1AAAMECgBgCAAeZQs2lYzQMA2IoEao5IyAUAODJTPgAAYIBADQAAA0z5YENYbxoAOFkZoQYAgAECNQAADDDlgw1n5RAA4GRihBoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAVT44qdlQBgCYmxFqAAAYIFADAMAAUz44oWzaAgBsN0aoAQBggEANAAADBGoAABhgDvWMzCcGADj5GaEGAIABRqg5LkbfAQDWGKEGAIABAjUAAAwQqAEAYMBK5lBX1cEkX07yaJJHunt3VT07yW8k2ZnkYJIf7u7/vYr6AABgWascof7e7r6wu3dP53uT3NLdu5LcMp0DAMCmtpmmfFyW5Prp+Pokr1xdKQAAsJxVBepO8ntV9YmqunJqe053P5Ak0+sZK6oNAACWtqp1qF/S3fdX1RlJbq6qzy174xTAr0ySc889d676AABgKSsZoe7u+6fXh5L8VpKLkjxYVWcmyfT60BPce1137+7u3Tt27NiokgEAYF0bHqir6hlV9U2PHSf5Z0k+m+TGJFdMl12R5EMbXRs
AAByrVUz5eE6S36qqx77/r3X371bVx5Psq6rXJrk3yeUrqA0AAI7Jhgfq7v7TJC9Yp/2LSV620fUAAMCIzbRsHgAAbDkCNQAADBCoAQBggEANAAADBGoAABggUAMAwIBVbT1+0tq596ZVlwAAwAYyQg0AAAMEagAAGCBQAwDAAHOoTwDzpgEAti8j1AAAMECgBgCAAaZ8HCfTPAAASIxQAwDAEIEaAAAGmPLBtrE4TefgtZeusBIA4GRihBoAAAYI1AAAMECgBgCAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAG2CmRbc8OigDACCPUAAAwQKAGAIABAjUAAAwQqAEAYIBADQAAA6zywba0uLIHAMAIgRoWfGPQtoweAHA0pnwAAMAAgRoAAAaY8gFLWmZHRbsuAsD2Y4QaAAAGCNQAADDAlA84Didq2b2RaSSmlwDA5iBQwxFst/Wqn+jn3YjA7j8QTgx/jgAbb9MF6qq6JMkvJjklybu6+9oVlwQbbisF+bkDnIAIwGa3qQJ1VZ2S5D8n+f4kh5J8vKpu7O47V1sZHLtVjvbO7Yl+NuEXgO1oUwXqJBclOdDdf5okVXVDksuSCNSc9JYZlV4myC5aZs71KkfDj/XnWRX/ocCx0F9g+9lsgfqsJPctnB9K8vdXVAtseXN8ePJ47tnIqSDH+kHPE/V9j+f+Rcs860T9mW6GGnhix/pn7HeyNfg9Hb+t8GdX3b3qGh5XVZcnubi7f2w6f02Si7r7JxauuTLJldPp85LcvcFlnp7kzzf4e7I16BusR79gPfoF69EvNrdv6e4d672x2UaoDyU5Z+H87CT3L17Q3dcluW4ji1pUVfu7e/eqvj+bl77BevQL1qNfsB79YuvabBu7fDzJrqo6r6r+TpI9SW5ccU0AAPCENtUIdXc/UlWvT/Lfs7Zs3nu6+44VlwUAAE9oUwXqJOnujyT5yKrrOIKVTTdh09M3WI9+wXr0C9ajX2xRm+pDiQAAsNVstjnUAACwpQjUx6CqLqmqu6vqQFXtXXU9bJyqOqeq/qCq7qqqO6rqDVP7s6vq5qr6/PR62sI9V0995e6qunh11TOnqjqlqj5VVR+ezvUJUlXPqqr3V9Xnpr83/oG+QVX9u+nfkM9W1a9X1VP1i5ODQL2khW3RfyDJ+UleXVXnr7YqNtAjSX6yu78jyYuTXDX9/vcmuaW7dyW5ZTrP9N6eJBckuSTJ26c+xMnnDUnuWjjXJ0iSX0zyu9397UlekLU+om9sY1V1VpJ/m2R3dz8/a4sv7Il+cVIQqJf3+Lbo3f3VJI9ti8420N0PdPcnp+MvZ+0fx7Oy1geuny67Pskrp+PLktzQ3Q939z1JDmStD3ESqaqzk1ya5F0LzfrENldVfzfJP0ny7iTp7q92919G32BtMYinVdWpSZ6etb029IuTgEC9vPW2RT9rRbWwQlW1M8kLk9ya5Dnd/UCyFrqTnDFdpr9sD7+Q5KeSfG2hTZ/gW5McTvJfp+lA76qqZ0Tf2Na6+wtJfj7JvUkeSPJ/uvv3ol+cFATq5dU6bZZI2Waq6plJPpDkjd39pSNduk6b/nISqaofTPJQd39i2VvWadMnTk6nJnlRknd09wuT/FWm/43/BPSNbWCaG31ZkvOSfHOSZ1TVjxzplnXa9ItNSqBe3lG3RefkVlVPzlqYfl93f3BqfrCqzpzePzPJQ1O7/nLye0mSV1TVwaxNAXtpVf1q9AnWfteHuvvW6fz9WQvY+sb29n1J7unuw939N0k+mOQfRr84KQjUy7Mt+jZWVZW1+ZB3dfdbF966MckV0/EVST600L6nqp5SVecl2ZXkto2ql/l199XdfXZ378za3we/390/En1i2+vu/5Xkvqp63tT0siR3Rt/Y7u5N8uKqevr0b8rLsvZ5HP3iJLDpdkrcrGyLvu29JMlrknymqm6f2t6U5Nok+6rqtVn7y/LyJOnuO6pqX9b+EX0kyVXd/ei
GV80q6BMkyU8ked80APOnSX40a4NY+sY21d23VtX7k3wya7/nT2VtZ8RnRr/Y8uyUCAAAA0z5AACAAQI1AAAMEKgBAGCAQA0AAAMEagAAGCBQAwDAAIEaAAAGCNQAADDg/wHzP0Va1XAc1AAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,8))\n", "mensajes['tamaño'].plot(bins=200, kind='hist') " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtQAAAHSCAYAAADMnFxwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAZR0lEQVR4nO3de4yld33f8c8XLwEMRdjy2nVskzWVZTAoCGeDaFylCY4FCRSTVrRGAW0JiZvWJZBGCmuKSv6hstSUQC+0Mbc4QKHGEOzGlGI2IVGlYme5CDDGtYuJMXbw5gpNkI3h2z/mbDpZze6emd+c28zrJa3mPM95zjnfnXnkfev4N+ep7g4AALA1j1r0AAAAsMoENQAADBDUAAAwQFADAMAAQQ0AAAMENQAADNiz6AFGnHHGGb1v375FjwEAwA73qU996o+6e+9G9610UO/bty+HDx9e9BgAAOxwVfUHx7vPkg8AABggqAEAYICgBgCAAYIaAAAGCGoAABggqAEAYICgBgCAAYIaAAAGCGoAABggqAEAYICgBgCAAYIaAAAGCGoAABggqAEAYICgBgCAATML6qp6Z1U9WFVfWLfv31TVl6rqc1X1m1X1pHX3XV1Vd1fVnVX1vFnNBQAA22mW71D/epLnH7PvliTP6O7vT/K/k1ydJFV1UZIrkjx98pi3VtUpM5wNAAC2xcyCurt/L8mfHLPvY939yGTzk0nOndy+PMn7u/uh7r4nyd1Jnj2r2QAAYLsscg31Tyf575Pb5yT56rr77pvsAwCApbaQoK6qf5nkkSTvPbprg8P6OI+9sqoOV9XhI0eOzGrEmdh38OZFjwAAwDabe1BX1YEkL0zyU919NJrvS3LeusPOTXL/Ro/v7mu7e39379+7d+9shwUAgJOYa1BX1fOTvDbJi7r7L9fddVOSK6rqMVV1fpILktw2z9kAAGAr9szqiavqfUl+JMkZVXVfkjdk7VM9HpPklqpKkk9298919+1VdX2SL2ZtKchV3f2dWc0GAADbZWZB3d0v3WD3O05w/BuTvHFW8wAAwCy4UiIAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADZhbUVfXOqnqwqr6wbt/pVXVLVd01+Xrauvuurqq7q+rOqnrerOYCAIDtNMt3qH89yfOP2XcwyaHuviDJocl2quqiJFckefrkMW+tqlNmOBsAAGyLmQV1d/9ekj85Zvf
lSa6b3L4uyYvX7X9/dz/U3fckuTvJs2c1GwAAbJd5r6E+q7sfSJLJ1zMn+89J8tV1x9032QcAAEttWX4psTbY1xseWHVlVR2uqsNHjhyZ8VgAAHBi8w7qr1fV2Uky+frgZP99Sc5bd9y5Se7f6Am6+9ru3t/d+/fu3TvTYQEA4GTmHdQ3JTkwuX0gyY3r9l9RVY+pqvOTXJDktjnPBgAAmzbLj817X5L/leTCqrqvql6Z5Jokl1XVXUkum2ynu29Pcn2SLyb5aJKruvs7s5ptkfYdvHnRIwAAsI32zOqJu/ulx7nr0uMc/8Ykb5zVPAAAMAvL8kuJAACwkgQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQLsO/gzYseAQCAbSKoAQBggKAGAIABghoAAAYIagAAGCCoAQBggKAGAIABghoAAAYIagAAGCCoAQBggKAGAIABghoAAAYIagAAGCCoAQBgwEKCuqp+oapur6ovVNX7quqxVXV6Vd1SVXdNvp62iNkAAGAz5h7UVXVOkp9Psr+7n5HklCRXJDmY5FB3X5Dk0GQbAACW2qKWfOxJ8riq2pPk1CT3J7k8yXWT+69L8uLFjAYAANObe1B399eS/EqSe5M8kOTPu/tjSc7q7gcmxzyQ5Mx5zwYAAJu1iCUfp2Xt3ejzk3xvksdX1cs28fgrq+pwVR0+cuTIrMacuX0Hb170CAAAbINFLPn4sST3dPeR7v52kg8l+aEkX6+qs5Nk8vXBjR7c3dd29/7u3r937965DQ0AABtZRFDfm+Q5VXVqVVWSS5PckeSmJAcmxxxIcuMCZgMAgE1ZxBrqW5PckOTTST4/meHaJNckuayq7kpy2WR7V7IcBABgdexZxIt29xuSvOGY3Q9l7d1qAABYGa6UCAAAAwQ1AAAMENQLdOxaaWunAQBWj6AGAIABghoAAAYIagAAGCCoAQBggKAGAIABgnrBfLIHAMBqE9QAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQAADBAUC+JfQdvXvQIAABsgaAGAIABghoAAAZMFdRV9YxZDwIAAKto2neo/3NV3VZV/6yqnjTLgQAAYJVMFdTd/XeS/FSS85Icrqr/UlWXzXQyAABYAVOvoe7uu5K8Pslrk/zdJP+uqr5UVX9/VsMBAMCym3YN9fdX1a8muSPJc5P8ve5+2uT2r85wPgAAWGp7pjzuPyR5W5LXdfe3ju7s7vur6vUzmQwAAFbAtEH9E0m+1d3fSZKqelSSx3b3X3b3u2c2HQAALLlp11B/PMnj1m2fOtnHNnPFRACA1TJtUD+2u//v0Y3J7VNnMxIAAKyOaYP6L6rq4qMbVfUDSb51guMBAGBXmHYN9WuSfKCq7p9sn53kH81kIgAAWCFTBXV3/35VPTXJhUkqyZe6+9sznYy/5uja6q9c84IFTwIAwHrTvkOdJD+YZN/kMc+qqnT3b8xkKgAAWBFTBXVVvTvJ30ry2STfmezuJIIaAIBdbdp3qPcnuai7e5bDAADAqpn2Uz6+kORvznIQAABYRdO+Q31Gki9W1W1JHjq6s7tfNJOpAABgRUwb1L88yyEAAGBVTfuxeb9bVd+X5ILu/nhVnZrklNmOBgAAy2+qNdRV9bNJbkj
ya5Nd5yT58IxmAgCAlTHtLyVeleSSJN9Iku6+K8mZsxoKAABWxbRB/VB3P3x0o6r2ZO1zqJmho1dH3Ox9AADMz7RB/btV9bokj6uqy5J8IMl/m91YAACwGqYN6oNJjiT5fJJ/kuQjSV4/q6EAAGBVTPspH99N8rbJHwAAYGKqoK6qe7LBmunufsq2TwQAACtk2gu77F93+7FJXpLk9O0fBwAAVstUa6i7+4/X/flad785yXNnOxoAACy/aZd8XLxu81FZe8f6b8xkIgAAWCHTLvn4t+tuP5LkK0n+4bZPAwAAK2baT/n40VkPAgAAq2jaJR//4kT3d/ebNvOiVfWkJG9P8oysfXrITye5M8l/TbIvk3fAu/tPN/O8AAAwb9Ne2GV/kn+a5JzJn59LclHW1lFvZS31W5J8tLufmuSZSe7I2sVjDnX3BUkOTbYBAGCpTbuG+owkF3f3N5Okqn45yQe6+2c2+4JV9cQkP5zkHydJdz+c5OGqujzJj0wOuy7JJ5K8drPPDwAA8zTtO9RPTvLwuu2Hs7Y0YyuekrXLmL+rqj5TVW+vqscnOau7H0iSydczt/j8AAAwN9O+Q/3uJLdV1W9mbc3zTyb5jYHXvDjJq7r71qp6SzaxvKOqrkxyZZI8+clP3uIIAACwPaa9sMsbk7wiyZ8m+bMkr+juf73F17wvyX3dfetk+4asBfbXq+rsJJl8ffA4s1zb3fu7e//evXu3OAIAAGyPaZd8JMmpSb7R3W9Jcl9Vnb+VF+zuP0zy1aq6cLLr0iRfTHJTkgOTfQeS3LiV5wcAgHma9mPz3pC1T/q4MMm7kjw6yXuSXLLF131VkvdW1fck+XLW3v1+VJLrq+qVSe5N8pItPjcAAMzNtGuofzLJs5J8Okm6+/6q2vKlx7v7s1kL9GNdutXnBACARZh2ycfD3d1Z+4XETD6VAwAAdr1pg/r6qvq1JE+qqp9N8vEkb5vdWAAAsBpOGtRVVVm7JPgNST6YtXXU/6q7//2MZ9vV9h28edEjrCTfNwBg3k66hrq7u6o+3N0/kOSWOcwEAAArY9olH5+sqh+c6SQAALCCpg3qH81aVP+fqvpcVX2+qj43y8GYjiUOG/N9AQDm5YRLPqrqyd19b5Ifn9M8AACwUk62hvrDSS7u7j+oqg929z+Yw0wAALAyTrbko9bdfsosBwEAgFV0sqDu49wGAABy8iUfz6yqb2TtnerHTW5nst3d/cSZTgcAAEvuhEHd3afMaxAAAFhF035sHgAAsAFBDQAAAwQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENQraN/Bmxc9wlLz/QEA5klQAwDAAEENAAADBDUAAAwQ1AAAMEBQAwDAAEENAAADBPWK8ZFwAADLRVADAMAAQQ0AAAME9QqwzAMAYHkJagAAGCCoAQBggKAGAIABghoAAAYIagAAGCCoAQBggKAGAIABghoAAAYIagAAGCCoAQBggKBmR3B5dgBgUQQ1AAAMENQAADBAUAMAwABBDQAAAwQ1AAAMENS70E79RIyd+vcCAJaboAYAgAGCGgAABghqAAAYIKgBAGCAoAYAgAGCGgAABghqAAAYIKgBAGCAoAYAgAGCGgAABgjqFXb0UtsuuQ0AsDiCGgAABghqAAAYsLCgrqpTquozVfVbk+3Tq+qWqrpr8vW0Rc0GAADTWuQ71K9Ocse67YNJDnX3BUkOTbYBAGCpLSSoq+rcJC9I8vZ1uy9Pct3k9nVJXjznsQAAYNMW9Q71m5P8UpLvrtt3Vnc/kCSTr2cuYC4AANiUuQd1Vb0wyYPd/aktPv7KqjpcVYePHDmyzdMBAMDmLOId6kuSvKiqvpLk/UmeW1XvSfL1qjo7SSZfH9zowd19bXfv7+79e/fundfMAACwobkHdXdf3d3ndve+JFck+e3uflmSm5IcmBx2IMmN854NAAA2a5k+h/qaJJdV1V1JLptsAwDAUtuzyBfv7k8k+cTk9h8nuXS
R8wAAwGYt0zvUAACwcgQ1AAAMENQ73L6DNy96BACAHU1QAwDAAEENAAADBPUOYnkHAMD8CWoAABggqAEAYICgBgCAAYJ6h7KeGgBgPgQ1AAAMENQAADBAUO8Cln8AAMyOoAYAgAGCGgAABghqAAAYIKgBAGCAoAYAgAGCGgAABghqVpaPAwQAloGgBgCAAYIaAAAGCOpdYt/Bmy2RAACYAUENAAADBDUAAAwQ1AAAMEBQ81essQYA2DxBDQAAAwQ1AAAMENQsLUtQAIBVIKgBAGCAoAYAgAGCGgAABghqVo611QDAMhHUAAAwQFADAMAAQb1DbGUZhKUTAADjBDUAAAwQ1AAAMEBQAwDAAEG9w6xfF73VNdLWVgMATE9QAwDAAEENAAADBDUbWtZlH0fnWtb5AIDdR1ADAMAAQQ0AAAMENQAADBDUAAAwQFADAMAAQQ0AAAME9S61zB87d7zZlnlmAGD3EtQAADBAUAMAwABBzV+zflnF8W5PczwAwG4hqAEAYICgBgCAAYIaAAAGCOpdbrPrnldtnfS+gzev3MwAwGoR1AAAMEBQAwDAgLkHdVWdV1W/U1V3VNXtVfXqyf7Tq+qWqrpr8vW0ec/G1llWAQDsVot4h/qRJL/Y3U9L8pwkV1XVRUkOJjnU3RckOTTZBgCApTb3oO7uB7r705Pb30xyR5Jzklye5LrJYdclefG8ZwMAgM1a6BrqqtqX5FlJbk1yVnc/kKxFd5IzFzgaAABMZWFBXVVPSPLBJK/p7m9s4nFXVtXhqjp85MiR2Q3ITG1mzfWJjrV2GwBYtIUEdVU9Omsx/d7u/tBk99er6uzJ/WcneXCjx3b3td29v7v37927dz4DAwDAcSziUz4qyTuS3NHdb1p3101JDkxuH0hy47xnAwCAzVrEO9SXJHl5kudW1Wcnf34iyTVJLququ5JcNtneMZZ5aYLZAAC2bs+8X7C7/2eSOs7dl85zFgAAGOVKiQAAMEBQAwDAAEHNCY2sYV7UYwEA5klQAwDAAEENAAADBDVbstUrHZ7sqoeWegAAq0ZQAwDAAEENAAADBDUAAAwQ1Aw7dt3ztOugj66Ztm4aAFhlghoAAAYIagAAGCComdrRpRmzWKJh2QcAsKoENQAADBDUAAAwQFCza1hWAgDMgqAGAIABghoAAAYIagAAGCCoZ2SWHzE3b9v1d9gJ3wsAgGMJagAAGCCoAQBggKBmWx27rMMyDwBgpxPUAAAwQFADAMAAQQ0AAAME9QxYRwwAsHsIagAAGCCoAQBggKAGAIABghoAAAYIagAAGCCoAQBggKDeJrvxo/E283fejd8fAGB3ENQAADBAUAMAwABBvUWWMKwuPzsAYDsJagAAGCCoAQBggKAGAIABgnobWZsLALD7CGoAABggqAEAYICgHnCyJR6WgKyG4/2c/PwAgGkIagAAGCCoAQBggKAGAIABgnobrF9ra90tq845DACbI6gBAGCAoAYAgAGCGo5hyQMAsBmCGgAABghqAAAYIKgHWR6wWo79eW325+fnPXujPyMAmDdBDQAAAwQ1AAAMENQAADBAULMrbLQOd5orXJ7scev3Hd0/7WNOtH9ai1oDvpm/43a+BqwC5y7sPoIaAAAGCGoAABiwdEFdVc+vqjur6u6qOrjoediZNrtk4WTLPDY6bqMlINMu/Th2Ccmxz7HRvmnmO5Fpjp32mGn+zlt9/mmP3ez90/4sZ/G/82f1PZrmebb6fJY1APx/SxXUVXVKkv+Y5MeTXJTkpVV10WKnAgCA41uqoE7y7CR3d/eXu/vhJO9PcvmCZwIAgONatqA+J8lX123fN9kHAABLqbp70TP8lap6SZLndffPTLZfnuTZ3f2qdcdcmeTKyeaFSe6c85hnJPmjOb8mq8G5wUacF2zEecFGnBfL7fu6e+9Gd+yZ9yQncV+S89Ztn5vk/vUHdPe1Sa6d51DrVdXh7t6/qNdneTk32Ijzgo04L9iI82J1Ldu
Sj99PckFVnV9V35PkiiQ3LXgmAAA4rqV6h7q7H6mqf57kfyQ5Jck7u/v2BY8FAADHtVRBnSTd/ZEkH1n0HCewsOUmLD3nBhtxXrAR5wUbcV6sqKX6pUQAAFg1y7aGGgAAVoqg3gSXRd+9quq8qvqdqrqjqm6vqldP9p9eVbdU1V2Tr6ete8zVk3Plzqp63uKmZ5aq6pSq+kxV/dZk2zlBqupJVXVDVX1p8t+Nv+3coKp+YfJvyBeq6n1V9Vjnxc4gqKfksui73iNJfrG7n5bkOUmumvz8DyY51N0XJDk02c7kviuSPD3J85O8dXIOsfO8Oskd67adEyTJW5J8tLufmuSZWTtHnBu7WFWdk+Tnk+zv7mdk7cMXrojzYkcQ1NNzWfRdrLsf6O5PT25/M2v/OJ6TtXPguslh1yV58eT25Une390Pdfc9Se7O2jnEDlJV5yZ5QZK3r9vtnNjlquqJSX44yTuSpLsf7u4/i3ODtQ+DeFxV7UlyatauteG82AEE9fRcFp0kSVXtS/KsJLcmOau7H0jWojvJmZPDnC+7w5uT/FKS767b55zgKUmOJHnXZDnQ26vq8XFu7Grd/bUkv5Lk3iQPJPnz7v5YnBc7gqCeXm2wz0ek7DJV9YQkH0zymu7+xokO3WCf82UHqaoXJnmwuz817UM22Oec2Jn2JLk4yX/q7mcl+YtM/jf+cTg3doHJ2ujLk5yf5HuTPL6qXnaih2ywz3mxpAT19E56WXR2tqp6dNZi+r3d/aHJ7q9X1dmT+89O8uBkv/Nl57skyYuq6itZWwL23Kp6T5wTrP2s7+vuWyfbN2QtsJ0bu9uPJbmnu49097eTfCjJD8V5sSMI6um5LPouVlWVtfWQd3T3m9bddVOSA5PbB5LcuG7/FVX1mKo6P8kFSW6b17zMXndf3d3ndve+rP334Le7+2VxTux63f2HSb5aVRdOdl2a5Itxbux29yZ5TlWdOvk35dKs/T6O82IHWLorJS4rl0Xf9S5J8vIkn6+qz072vS7JNUmur6pXZu0/li9Jku6+vaquz9o/oo8kuaq7vzP3qVkE5wRJ8qok7528AfPlJK/I2ptYzo1dqrtvraobknw6az/nz2TtyohPiPNi5blSIgAADLDkAwAABghqAAAYIKgBAGCAoAYAgAGCGgAABghqAAAYIKgBAGCAoAYAgAH/D/9c0Y+8I6CMAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,8))\n", "mensajes['tamaño'].plot.hist(bins=1000) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Podemos jugar con el argumento bin que nos permite definir la granularidad o resolución del eje X. Para estos datos bins representa la longitud de los mensajes ¿Qué pasa cuando bins se acerca a 1000? Tenemos registros (mensajes)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " clase mensajes tamaño\n", "1085 ham For me the love should start with attraction.i... 910\n", "1863 ham The last thing i ever wanted to do was hurt yo... 790\n", "2434 ham Indians r poor but India is not a poor country... 629\n", "1579 ham How to Make a girl Happy? It's not at all diff... 611\n", "2158 ham Sad story of a Man - Last week was my b'day. M... 588\n", "... ... ... ...\n", "3376 ham :) 3\n", "5357 ham Ok 2\n", "4498 ham Ok 2\n", "1925 ham Ok 2\n", "3051 ham Ok 2\n", "\n", "[5572 rows x 3 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.sort_values('tamaño', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Buscamenos el mensaje más extenso con 910 caracteres." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..\"" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes[mensajes['tamaño'] == 910]['mensajes'].iloc[0]" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Olvidándonos del contenido del mensaje, nos centramos en la idea que ver si la longitud del mensaje influye en si es spam o no." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ], dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAuIAAAF8CAYAAACKZ96RAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAgf0lEQVR4nO3dfbRlZ10f8O+PRCMv8hIyiUkmOFEjGlBRxkC1VmqISRsXia6FhqoExaa1UbG1hURdRbsaO7QVBC2uRt5CBWLAF0YRNMYiSwXCgLwlISaQkAwJySAv4kujCb/+cfbIdbhh5r6d595zP5+1Zp1znr33Ob995txnf+9zn7N3dXcAAID5esDoAgAAYDsSxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQZyFUFW3VtWTR9cBAHCkBHEAABhAEAcAgAEEcRbJ46rqvVX1qar6tar6oqp6RFX9TlUdqKpPTPd3Htygqt5cVf+1qv60qv6qqn67qh5ZVa+qqr+sqndU1a6B+wTAClTVc6rqI1X16aq6sarOrKqfqarXTceGT1fVu6rq65Zsc0lVfXBadn1VfeeSZc+oqj+pqhdU1Ser6kNV9U1T++1VdXdVXThmb9nqBHEWyXcnOSfJqUm+NskzMvuMvzzJlyZ5VJK/TfJLh2x3QZLvT3Jyki9P8tZpm2OT3JDkuRtfOgBrVVWPTvIjSb6xu784ydlJbp0Wn5fktZn17a9O8ltV9QXTsg8m+ZYkD0vys0l+tapOXPLUT0jy3iSPnLa9Msk3JvmKJN+X5Jeq6iEbt2csKkGcRfKi7r6juz+e5LeTPK67/6K7f727/6a7P53ksiTfesh2L+/uD3b3p5K8MckHu/sPuvvezDrtr5/rXgCwWvclOSbJ6VX1Bd19a3d/cFr2zu5+XXf/fZLnJ/miJE9Mku5+7XT8+Ex3/1qSm5KcseR5b+nul3f3fUl+LckpSf5Ld9/T3b+f5O8yC+WwIoI4i+SjS+7/TZKHVNWDqup/V9WHq+ovk7wlycOr6qgl69615P7fLvPYKAfAFtDdNyf58SQ/k+Tuqrqyqk6aFt++ZL3PJNmf5KQkqaqnV9W7p6knn0zy2CTHLXnqQ48L6W7HCtZMEGfR/USSRyd5Qnc/NMk/m9prXEkAbJTufnV3/9PMpiR2kudNi045uE5VPSDJziR3VNWXJvmVzKa0PLK7H57k/XGcYA4EcRbdF2c2UvHJqjo25nsDLKyqenRVfVtVHZPk/2XW/983LX58VX1XVR2d2aj5PUneluTBmQX2A9Nz/EBmI+Kw4QRxFt0vJHlgko9l1uG+aWg1AGykY5LsyazP/2iS45P85LTs9Um+J8knMvuC/nd199939/VJfj6zL+rfleRrkvzJnOtmm6ruHl0DAMCGqaqfSfIV3f19o2uBpYyIAwDAAII4AAAMYGoKAAAMYEQcAAAGEMQBAGCAo0cXcDjHHXdc79q1a3QZAIf1zne+82PdvWN0HYvOcQHYSj7fsWHTB/Fdu3Zl3759o8sAOKyq+vDoGrYDxwVgK/l8xwZTUwAAYABBHAAABhDEAQBgAEEcAAA
GEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYICjRxcwT7suecPntN2659wBlQAArD9ZZ2sxIg4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOwLqpqpdV1d1V9f5llv3HquqqOm5J26VVdXNV3VhVZ8+3WoCxBHEA1tMrkpxzaGNVnZLkrCS3LWk7PckFSR4zbfPiqjpqPmUCjHfYIL5eoxtV9fiqet+07EVVVeu3GwBsBt39liQfX2bRC5I8O0kvaTsvyZXdfU9335Lk5iRnbHyVAJvDkYyIvyLrM7rxy0kuSnLa9O9znhOAxVNVT0nyke5+zyGLTk5y+5LH+6c2gG3hsEF8PUY3qurEJA/t7rd2dyd5ZZLz11o8AJtbVT0oyU8l+c/LLV6mrZdpS1VdVFX7qmrfgQMH1rNEgGFWNUd8FaMbJ0/3D20HYLF9eZJTk7ynqm5NsjPJu6rqSzI7FpyyZN2dSe5Y7km6+/Lu3t3du3fs2LHBJQPMx4ovcb9kdOPbl1u8TFt/nvb7e42LMpvGkkc96lErLRGATaK735fk+IOPpzC+u7s/VlV7k7y6qp6f5KTMpi1eO6RQgAFWMyK+mtGN/dP9Q9uXZeQDYGuqqtckeWuSR1fV/qp65v2t293XJbkqyfVJ3pTk4u6+bz6VAoy34hHx1YxudPd9VfXpqnpikrcneXqSX1yPHQBg8+jupx1m+a5DHl+W5LKNrAlgszqS0xeu1+jGDyd5SWZf4PxgkjeusXYAANiyDjsivl6jG929L8ljV1gfAAAsJFfWBACAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHYN1U1cuq6u6qev+Stv9RVR+oqvdW1W9W1cOXLLu0qm6uqhur6uwhRQMMIogDsJ5ekeScQ9quTvLY7v7aJH+e5NIkqarTk1yQ5DHTNi+uqqPmVyrAWIcN4us1ulFVj6+q903LXlRVte57A8BQ3f2WJB8/pO33u/ve6eHbkuyc7p+X5Mruvqe7b0lyc5Iz5lYswGBHMiL+iqzP6MYvJ7koyWnTv0OfE4DF94NJ3jjdPznJ7UuW7Z/aALaFwwbx9RjdqKoTkzy0u9/a3Z3klUnOX6d9AGALqKqfSnJvklcdbFpmtb6fbS+qqn1Vte/AgQMbVSLAXK3HHPEjGd04ebp/aPuydLgAi6WqLkzyHUm+dxqQSWbHglOWrLYzyR3Lbd/dl3f37u7evWPHjo0tFmBO1hTEVzC6ccSjHokOF2CRVNU5SZ6T5Cnd/TdLFu1NckFVHVNVp2Y2bfHaETUCjHD0ajdcMrpx5hGMbuzPZ6evLG0HYIFU1WuSPCnJcVW1P8lzM/se0TFJrp6+p/+27v633X1dVV2V5PrMBnUu7u77xlQOMH+rCuJLRje+dZnRjVdX1fOTnJRpdKO776uqT1fVE5O8PcnTk/zi2koHYLPp7qct0/zSz7P+ZUku27iKADavwwbxdRzd+OHMzsDywMzmlL8xAACwTR02iK/X6EZ370vy2BVVBwAAC8qVNQEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABjh6dAEAAKzMrkveMLoE1oERcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAd
g3VTVy6rq7qp6/5K2Y6vq6qq6abp9xJJll1bVzVV1Y1WdPaZqgDEOG8TXq1OtqsdX1fumZS+qqlr/3QFgsFckOeeQtkuSXNPdpyW5Znqcqjo9yQVJHjNt8+KqOmp+pQKMdSQj4q/I+nSqv5zkoiSnTf8OfU4AtrjufkuSjx/SfF6SK6b7VyQ5f0n7ld19T3ffkuTmJGfMo06AzeCwQXw9OtWqOjHJQ7v7rd3dSV65ZBsAFtsJ3X1nkky3x0/tJye5fcl6+6e2z1FVF1XVvqrad+DAgQ0tFmBeVjtHfKWd6snT/UPbAdi+lpui2Mut2N2Xd/fu7t69Y8eODS4LYD7W+8ua99epHnFnmxj5AFgwd01/Gc10e/fUvj/JKUvW25nkjjnXBjDMaoP4SjvV/dP9Q9uXZeQDYKHsTXLhdP/CJK9f0n5BVR1TVadm9v2hawfUBzDEaoP4ijrVafrKp6vqidPZUp6+ZBsAFkRVvSbJW5M8uqr2V9Uzk+xJclZV3ZTkrOlxuvu6JFcluT7Jm5Jc3N33jakcYP6OPtwKU6f6pCTHVdX+JM/NrBO9aupgb0vy1GTWqVbVwU713vzjTvWHMzsDywOTvHH6N9yuS96wbPute86dcyUAW193P+1+Fp15P+tfluSyjasIYPM6bBBfr061u/cleeyKqgMAgAXlypoAADCAIA4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOAAADCOIAADCAIA4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOAAADCOIAADCAIA4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOAAADCOIAADCAIA4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOAAADCOIAADCAIA4AAAMI4gAAMIAgDgAAAwjiAAAwgCAOAAADCOIAADCAIA4AAAMI4gAAMIAgDsBcVNW/r6rrqur9VfWaqvqiqjq2qq6uqpum20eMrhNgXtYUxFfaqVbVpVV1c1XdWFVnr718ALaCqjo5yY8l2d3dj01yVJILklyS5JruPi3JNdNjgG1h1UF8pZ1qVZ0+LX9MknOSvLiqjlpb+QBsIUcneWBVHZ3kQUnuSHJekium5VckOX9MaQDzt9apKSvpVM9LcmV339PdtyS5OckZa3x9ALaA7v5Ikv+Z5LYkdyb5VHf/fpITuvvOaZ07kxy/3PZVdVFV7auqfQcOHJhX2QAbatVBfBWd6slJbl/yFPunts+hwwVYLNM0xfOSnJrkpCQPrqrvO9Ltu/vy7t7d3bt37NixUWUCzNVapqastFOtZdp6uRV1uAAL58lJbunuA93990l+I8k3Jbmrqk5Mkun27oE1AszVWqamrLRT3Z/klCXb78xsKgsAi++2JE+sqgdVVSU5M8kNSfYmuXBa58Ikrx9UH8DcrSWIr7RT3Zvkgqo6pqpOTXJakmvX8PoAbBHd/fYkr0vyriTvy+z4c3mSPUnOqqqbkpw1PQbYFo5e7Ybd/faqOtip3pvkzzLrVB+S5KqqemZmYf2p0/rXVdVVSa6f1r+4u+9bY/0AbBHd/dwkzz2k+Z7MBnIAtp1VB/Fk5Z1qd1+W5LK1vCYAACwCV9YEAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABhAEAcAgAEEcQAAGEAQBwCAAQRxAAAYQBAHAIABjh5dAAAAG2fXJW9Ytv3WPefOuRIOZUQcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBmIuqenhVva6qPlBVN1TVP6mqY6vq6qq6abp9xOg6AeZlTUF8pZ1qVV1aVTdX1Y1VdfbaywdgC3lhkjd191cl+bokNyS5JMk13X1akmumxwDbwlpHxI+4U62q05NckOQxSc5J8uKqOmqNrw/
AFlBVD03yz5K8NEm6+++6+5NJzktyxbTaFUnOH1EfwAirDuKr6FTPS3Jld9/T3bckuTnJGat9fQC2lC9LciDJy6vqz6rqJVX14CQndPedSTLdHr/cxlV1UVXtq6p9Bw4cmF/VABtoLSPiK+1UT05y+5Lt909tn0OHC7Bwjk7yDUl+ubu/PslfZwXTULr78u7e3d27d+zYsVE1AszV0Wvc9huS/Gh3v72qXpjP36nWMm293IrdfXmSy5Nk9+7dy64DwJayP8n+7n779Ph1mR0z7qqqE7v7zqo6McndwyqEwXZd8oZl22/dc+6cK2Fe1hLEV9qp7k9yypLtdya5Yw2vv6GW+2HwgwCwOt390aq6vaoe3d03JjkzyfXTvwuT7JluXz+wTIC5WvXUlO7+aJLbq+rRU9PBTnVvZp1p8o871b1JLqiqY6rq1CSnJbl2ta8PwJbzo0leVVXvTfK4JD+XWQA/q6puSnLW9BhgW1jLiHjy2U71C5N8KMkPZBbur6qqZya5LclTk6S7r6uqqzIL6/cmubi771vj6wOwRXT3u5PsXmbRmXMuBWBTWFMQX2mn2t2XJblsLa8JAACLwJU1AQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGCAo0cXAADA/dt1yRtGl8AGMSIOAAADCOIAADCAIA4AAAOYIw4AMGfmfZMYEQcAgCEEcQAAGEAQBwCAAQRxAAAYQBAHAIABnDUFAGCDODsKn48RcQAAGEAQBwCAAQRxAAAYQBAHAIABBHEAABjAWVNW4P6++XzrnnPnXAkAAFudEXEA5qaqjqqqP6uq35keH1tVV1fVTdPtI0bXCDAvaw7iK+lUq+rSqrq5qm6sqrPX+toAbDnPSnLDkseXJLmmu09Lcs30GGBbWI8R8SPqVKvq9CQXJHlMknOSvLiqjlqH1wdgC6iqnUnOTfKSJc3nJbliun9FkvPnXBbAMGsK4ivsVM9LcmV339PdtyS5OckZa3l9ALaUX0jy7CSfWdJ2QnffmSTT7fED6gIYYq0j4r+QI+9UT05y+5L19k9tACy4qvqOJHd39ztXuf1FVbWvqvYdOHBgnasDGGPVQXwVnWot09b389w6XIDF8s1JnlJVtya5Msm3VdWvJrmrqk5Mkun27uU27u7Lu3t3d+/esWPHvGoG2FBrGRFfaae6P8kpS7bfmeSO5Z5YhwuwWLr70u7e2d27Mvu+0B929/cl2Zvkwmm1C5O8flCJAHO36iC+ik51b5ILquqYqjo1yWlJrl115QAsgj1Jzqqqm5KcNT0G2BY24oI+e5JcVVXPTHJbkqcmSXdfV1VXJbk+yb1JLu7u+zbg9QHYxLr7zUnePN3/iyRnjqwHYJR1CeJH2ql292VJLluP1wQAgK3MlTUBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAYQxAEAYABBHAAABhDEAQBgAEEcAAAGEMQBAGAAQRwAAAY4enQBJLsuecOy7bfuOXfOlQAAMC9GxAEAYAAj4gAA6+D+/sIN98eIOAAADCCIAwDAAKamrIOV/CnKFzABAEiMiAMAwBCCOAAADCCIAwDAAII4AAAMIIgDAMAAgjgAAAwgiAMAwADOIz5nLn8LbEdVdUqSVyb5kiSfSXJ5d7+wqo5N8mtJdiW5Ncl3d/cnRtUJME+rHhGvqlOq6v9W1Q1VdV1VPWtqP7aqrq6qm6bbRyzZ5tKqurmqbqyqs9djBwDYEu5N8hPd/dVJnpjk4qo6PcklSa7p7tOSXDM9BtgW1jIifrBTfVdVfXGSd1bV1UmekVmnuqeqLsmsU33O1OFekOQxSU5K8gdV9ZXdfd/adgGAza6770xy53T/01V1Q5KTk5y
X5EnTalckeXOS5wwoEZb9q7UrYrORVj0i3t13dve7pvufTrK0U71iWu2KJOdP989LcmV339PdtyS5OckZq319ALamqtqV5OuTvD3JCVNIPxjWjx9YGsBcrcuXNY+wUz05ye1LNts/tS33fBdV1b6q2nfgwIH1KBGATaCqHpLk15P8eHf/5Qq2c1wAFs6ag/gKOtVapq2XW7G7L+/u3d29e8eOHWstEYBNoKq+ILPjxau6+zem5ruq6sRp+YlJ7l5uW8cFYBGtKYivsFPdn+SUJZvvTHLHWl4fgK2hqirJS5Pc0N3PX7Job5ILp/sXJnn9vGsDGGUtZ01Zaae6N8kFVXVMVZ2a5LQk16729QHYUr45yfcn+baqevf0718m2ZPkrKq6KclZ02OAbWEtZ0052Km+r6rePbX9ZGad6FVV9cwktyV5apJ093VVdVWS6zM748rFzpgCsD109x9n+SmKSXLmPGsB2CxWHcRX06l292VJLlvtawIAwKJwiXsAABhAEAcAgAHWMkd801ruylgAALCZGBEHAIABBHEAABhgIaemAADby/1NS711z7mb8nkhMSIOAABDCOIAADCAIA4AAAOYIw4AbEqbeX62UyWzHoyIAwDAAII4AAAMIIgDAMAA5ogDANuK+d1sFkbEAQBgAEEcAAAGEMQBAGAAc8QBgGUtN5d6M5zDeyXMB79/i/D/u9UZEQcAgAEEcQAAGMDUFADYhOY5bWCt0zc286XoYTMzIg4AAAMI4gAAMIAgDgAAA5gjDgAMt5J56k5JyKIQxDcx5/cEAFhcgjgAMDdGs+GzzBEHAIABjIgDwBpstWmERqT5fFby+djMn/Otwog4AAAMYEQcAOZkq42er5XR9+1pu33O10IQ32J8uAEAFoMgDgADrcf5s+c5IGOUG9bP3IN4VZ2T5IVJjkryku7eM+8aFo0vVgBbnWMDsB3NNYhX1VFJ/leSs5LsT/KOqtrb3dfPs47tbCWjKZth5AVYfI4NwHY17xHxM5Lc3N0fSpKqujLJeUl0tluIEXhgnc3l2LDWKRUr6c/mPX3DdBFG2KjP3Wb4Pty8aph3ED85ye1LHu9P8oQ518Ay5vnDtNVstbmXW+mXn83wV5fNUAOODcD2NO8gXsu09eesVHVRkoumh39VVTeu8HWOS/KxFW6zldnfDVTPm9cr3a8V7e8mqHetjqvnjf88r/J9/NJ1LmO7OOyxYR2OC2u2CX62tltfvxzvwRZ9D9b683PI9kPegzXsw/0eG+YdxPcnOWXJ451J7jh0pe6+PMnlq32RqtrX3btXu/1WY38Xm/1lGzjssWGtx4VF4GfDe5B4D5LFeg/mfWXNdyQ5rapOraovTHJBkr1zrgGAzcWxAdiW5joi3t33VtWPJPm9zE5R9bLuvm6eNQCwuTg2ANvV3M8j3t2/m+R3N/hlttufL+3vYrO/LLw5HRu2Oj8b3oPEe5As0HtQ3Z/zXUkAAGCDzXuOOAAAEEEcAACGEMQBAGCAuX9Zc71V1VdldinkkzO7AMQdSfZ29w1DC9tAVVWZXRJ66T5f2ws64d/+2l8Atq9FPk5s6S9rVtVzkjwtyZWZXRAimV0I4oIkV3b3nlG1bZSq+vYkL05yU5KPTM07k3xFkn/X3b8/qraNYH+T2F/YdqrqYUkuTXJ+kh1T891JXp9kT3d/ckxl87XIAexIbff3YNGPE1s9iP95ksd0998f0v6FSa7r7tPGVLZxquqGJP+iu289pP3UJL/b3V89pLANYn//od3+wjZSVb+X5A+TXNHdH53aviTJhUme3N1njaxvHhY9gB0J78HiHye2+tSUzyQ5KcmHD2k/cVq2iI7OZ0f/l/pIki+Ycy3zYH9n7C9sL7u6+3lLG6ZA/ryq+sFBNc3bCzP7pePWpY0HA1iSLR3AjpD3YMGPE1s9iP94kmuq6qYkt09tj8rsN8UfGVXUBntZkndU1ZX57D6fktl0nJcOq2rj2F/7C9vRh6vq2Zm
NiN+VJFV1QpJn5LM/K4tuoQPYEfIeLPhxYktPTUmSqnpAPjt3qjL7wL6ju+8bWtgGqqrTkzwl/3if93b39UML2yD21/7CdlNVj0hySWYnIzghs7nBdyXZm+R53f3xgeXNRVVdmuS7M/se2KEB7Kru/m+japsX78HMIh8ntnwQB4BFV1Xfktmg0/u2w7zggxY5gB2pqvrqfPbscNvyPVhkgvgWs92+SW9/k9hf2Haq6truPmO6/0NJLk7yW0m+PclvL+JZwWA5i36ccEGfreeqJJ9I8qTufmR3PzLJP0/yySSvHVnYBrG/9he2o6Xzf/9Nkm/v7p/NLIh/75iS5quqHlZVe6rqA1X1F9O/G6a2h4+ubx6q6pwl9x9WVS+pqvdW1aun7wxsBwt9nDAivsVU1Y3d/eiVLtuq7O+RLduqttv+wpGqqvckeVJmA2a/1927lyz7s+7++lG1zcvnOYXjM5KcuU1O4fiu7v6G6f5Lknw0ya8k+a4k39rd5w8sby4W/ThhRHzr+XBVPXvpb8JVdcJ0caNF/Ca9/bW/sB09LMk7k+xLcuwUQFNVD8lsnvB2sKu7n3cwhCezUzhO03IeNbCuUXZ3909394e7+wVJdo0uaE4W+jghiG8935PkkUn+qKo+UVUfT/LmJMdm9s3qRXPo/n4is/19ZLbH/m63/99F3184It29q7u/rLtPnW4PhtHPJPnOkbXN0UIHsCN0fFX9h6r6iSQPna6yedB2yXALfZwwNWULqqqvyuzKWm/r7r9a0n5Od79pXGXzUVX/p7u/f3QdG6GqnpDkA939qap6UGanL/uGJNcl+bnu/tTQAtdZza6C+7QkH+nuP6iq703yTUmuT3L5oVfNBbaPQ07hePzUfPAUjnu6+xOjapuXqnruIU0v7u4D019I/nt3P31EXfO2yLlHEN9iqurHMvv2/A1JHpfkWd39+mnZP8wlWxRVtXeZ5m/LbN5guvsp861oY1XVdUm+rrvvrarLk/x1kl9PcubU/l1DC1xnVfWqzC5Y8cAkn0ry4CS/mdn+VndfOLA8YJOqqh/o7pePrmOk7fIeLHru2epX1tyO/nWSx3f3X1XVriSvq6pd3f3CLOa8wZ2ZjY6+JLMLWlSSb0zy8yOL2kAP6O57p/u7l3Qwf1xV7x5U00b6mu7+2qo6OrMrxZ3U3fdV1a8mec/g2oDN62eTLHwIPYzt8h4sdO4RxLeeow7+Waa7b62qJ2X2ofzSLMAHchm7kzwryU8l+U/d/e6q+tvu/qPBdW2U9y8Z5XhPVe3u7n1V9ZVJFnGaxgOm6SkPTvKgzL6g9vEkx2T7XL4ZWEZVvff+FmV2tdGF5z1IsuC5RxDfej5aVY/r7ncnyfQb4nckeVmSrxla2Qbo7s8keUFVvXa6vSuL/bn9oSQvrKqfTvKxJG+tqtsz+2LSDw2tbGO8NMkHkhyV2S9br62qDyV5YmaXdAa2rxOSnJ3ZOaSXqiR/Ov9yhvAeLHjuMUd8i6mqnUnuXXo6pyXLvrm7/2RAWXNTVecm+ebu/snRtWykqvriJF+W2S8d+7v7rsElbZiqOilJuvuOml2k48lJbuvua4cWBgxVVS9N8vLu/uNllr26u//VgLLmynuw+LlHEAcAgAG2yzkoAQBgUxHEAQBgAEEcAAAGEMQBAGAAQRwAAAb4//lVdP4EPGTgAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "mensajes.hist(column='tamaño', by='clase', bins=50,figsize=(12,6))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A través del análisis exploratorio inicial, hemos obtenido una conclusión interesante, la tendencia a que un mensaje sea considerado spam aumenta con el tamaño del mensaje." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocesado del texto" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Los algoritmos de clasificación, implican convertir la conversión del set de datos en algún tipo de dataframe numérico (conversión del corpus a formato vector). La manera más sencilla es a través de una aproximación del tipo [bag-of-words](http://en.wikipedia.org/wiki/Bag-of-words_model) donde una palabra se representa por un número.\n", "\n", "Convertiremos por tanto mensajes en bruto (estado actual) en vectores (secuencias de números).\n", "\n", "Como primer paso separaremos a través de una funcion, cada mensaje en una lista de palabras. Posteriormente eliminaremos las palabras muy comunes (stopwords como 'the', 'a', ...) a través de la librería NLTK (https://www.nltk.org/book/). 
En este caso de uso usaremos las funciones básicas de la librería.\n", "\n", "Stopwords: https://es.wikipedia.org/wiki/Palabra_vac%C3%ADa\n", "\n", "Generamos una función que procese un mensaje y posteriormente a través de **apply()** lo procesaremos para todo el DataFrame.\n", "\n", "Eliminamos los signos de puntuación, para ello podemos usar el método **string**:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['E',\n", " 'j',\n", " 'e',\n", " 'm',\n", " 'p',\n", " 'l',\n", " 'o',\n", " ' ',\n", " 'm',\n", " 'e',\n", " 'n',\n", " 's',\n", " 'a',\n", " 'j',\n", " 'e',\n", " ' ',\n", " 'A',\n", " 't',\n", " 'e',\n", " 'n',\n", " 'c',\n", " 'i',\n", " 'ó',\n", " 'n',\n", " ' ',\n", " 't',\n", " 'i',\n", " 'e',\n", " 'n',\n", " 'e',\n", " ' ',\n", " 'u',\n", " 'n',\n", " ' ',\n", " 'p',\n", " 'u',\n", " 'n',\n", " 't',\n", " 'o']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import string\n", "\n", "mens = 'Ejemplo mensaje! 
Atención: tiene un punto..'\n", "\n", "# Keep only the characters that are not punctuation symbols\n", "nopunc = [char for char in mens if char not in string.punctuation]\n", "\n", "nopunc" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string.punctuation  # constant holding all ASCII punctuation characters" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ejemplo mensaje Atención tiene un punto'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the characters back into a single string.\n", "nopunc = ''.join(nopunc)\n", "nopunc" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Ejemplo mensaje! Atención: tiene un punto..'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the punctuation is removed, we remove the stopwords. In this example the dataset is in English, so we need to remove the English stopwords. The stopwords for each language can be found in the NLTK documentation." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /home/mydoctor/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "nltk.download('stopwords')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['i',\n", " 'me',\n", " 'my',\n", " 'myself',\n", " 'we',\n", " 'our',\n", " 'ours',\n", " 'ourselves',\n", " 'you',\n", " \"you're\",\n", " \"you've\",\n", " \"you'll\",\n", " \"you'd\",\n", " 'your',\n", " 'yours',\n", " 'yourself',\n", " 'yourselves',\n", " 'he',\n", " 'him',\n", " 'his',\n", " 'himself',\n", " 'she',\n", " \"she's\",\n", " 'her',\n", " 'hers',\n", " 'herself',\n", " 'it',\n", " \"it's\",\n", " 'its',\n", " 'itself',\n", " 'they',\n", " 'them',\n", " 'their',\n", " 'theirs',\n", " 'themselves',\n", " 'what',\n", " 'which',\n", " 'who',\n", " 'whom',\n", " 'this',\n", " 'that',\n", " \"that'll\",\n", " 'these',\n", " 'those',\n", " 'am',\n", " 'is',\n", " 'are',\n", " 'was',\n", " 'were',\n", " 'be',\n", " 'been',\n", " 'being',\n", " 'have',\n", " 'has',\n", " 'had',\n", " 'having',\n", " 'do',\n", " 'does',\n", " 'did',\n", " 'doing',\n", " 'a',\n", " 'an',\n", " 'the',\n", " 'and',\n", " 'but',\n", " 'if',\n", " 'or',\n", " 'because',\n", " 'as',\n", " 'until',\n", " 'while',\n", " 'of',\n", " 'at',\n", " 'by',\n", " 'for',\n", " 'with',\n", " 'about',\n", " 'against',\n", " 'between',\n", " 'into',\n", " 'through',\n", " 'during',\n", " 'before',\n", " 'after',\n", " 'above',\n", " 'below',\n", " 'to',\n", " 'from',\n", " 'up',\n", " 'down',\n", " 'in',\n", " 'out',\n", " 'on',\n", " 'off',\n", " 'over',\n", " 'under',\n", " 
'again',\n", " 'further',\n", " 'then',\n", " 'once',\n", " 'here',\n", " 'there',\n", " 'when',\n", " 'where',\n", " 'why',\n", " 'how',\n", " 'all',\n", " 'any',\n", " 'both',\n", " 'each',\n", " 'few',\n", " 'more',\n", " 'most',\n", " 'other',\n", " 'some',\n", " 'such',\n", " 'no',\n", " 'nor',\n", " 'not',\n", " 'only',\n", " 'own',\n", " 'same',\n", " 'so',\n", " 'than',\n", " 'too',\n", " 'very',\n", " 's',\n", " 't',\n", " 'can',\n", " 'will',\n", " 'just',\n", " 'don',\n", " \"don't\",\n", " 'should',\n", " \"should've\",\n", " 'now',\n", " 'd',\n", " 'll',\n", " 'm',\n", " 'o',\n", " 're',\n", " 've',\n", " 'y',\n", " 'ain',\n", " 'aren',\n", " \"aren't\",\n", " 'couldn',\n", " \"couldn't\",\n", " 'didn',\n", " \"didn't\",\n", " 'doesn',\n", " \"doesn't\",\n", " 'hadn',\n", " \"hadn't\",\n", " 'hasn',\n", " \"hasn't\",\n", " 'haven',\n", " \"haven't\",\n", " 'isn',\n", " \"isn't\",\n", " 'ma',\n", " 'mightn',\n", " \"mightn't\",\n", " 'mustn',\n", " \"mustn't\",\n", " 'needn',\n", " \"needn't\",\n", " 'shan',\n", " \"shan't\",\n", " 'shouldn',\n", " \"shouldn't\",\n", " 'wasn',\n", " \"wasn't\",\n", " 'weren',\n", " \"weren't\",\n", " 'won',\n", " \"won't\",\n", " 'wouldn',\n", " \"wouldn't\"]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stopwords.words('english')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Spanish stopwords are:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['de',\n", " 'la',\n", " 'que',\n", " 'el',\n", " 'en',\n", " 'y',\n", " 'a',\n", " 'los',\n", " 'del',\n", " 'se',\n", " 'las',\n", " 'por',\n", " 'un',\n", " 'para',\n", " 'con',\n", " 'no',\n", " 'una',\n", " 'su',\n", " 'al',\n", " 'lo',\n", " 'como',\n", " 'más',\n", " 'pero',\n", " 'sus',\n", " 'le',\n", " 'ya',\n", " 'o',\n", " 'este',\n", " 'sí',\n", " 'porque',\n", " 'esta',\n", " 'entre',\n", " 
'cuando',\n", " 'muy',\n", " 'sin',\n", " 'sobre',\n", " 'también',\n", " 'me',\n", " 'hasta',\n", " 'hay',\n", " 'donde',\n", " 'quien',\n", " 'desde',\n", " 'todo',\n", " 'nos',\n", " 'durante',\n", " 'todos',\n", " 'uno',\n", " 'les',\n", " 'ni',\n", " 'contra',\n", " 'otros',\n", " 'ese',\n", " 'eso',\n", " 'ante',\n", " 'ellos',\n", " 'e',\n", " 'esto',\n", " 'mí',\n", " 'antes',\n", " 'algunos',\n", " 'qué',\n", " 'unos',\n", " 'yo',\n", " 'otro',\n", " 'otras',\n", " 'otra',\n", " 'él',\n", " 'tanto',\n", " 'esa',\n", " 'estos',\n", " 'mucho',\n", " 'quienes',\n", " 'nada',\n", " 'muchos',\n", " 'cual',\n", " 'poco',\n", " 'ella',\n", " 'estar',\n", " 'estas',\n", " 'algunas',\n", " 'algo',\n", " 'nosotros',\n", " 'mi',\n", " 'mis',\n", " 'tú',\n", " 'te',\n", " 'ti',\n", " 'tu',\n", " 'tus',\n", " 'ellas',\n", " 'nosotras',\n", " 'vosotros',\n", " 'vosotras',\n", " 'os',\n", " 'mío',\n", " 'mía',\n", " 'míos',\n", " 'mías',\n", " 'tuyo',\n", " 'tuya',\n", " 'tuyos',\n", " 'tuyas',\n", " 'suyo',\n", " 'suya',\n", " 'suyos',\n", " 'suyas',\n", " 'nuestro',\n", " 'nuestra',\n", " 'nuestros',\n", " 'nuestras',\n", " 'vuestro',\n", " 'vuestra',\n", " 'vuestros',\n", " 'vuestras',\n", " 'esos',\n", " 'esas',\n", " 'estoy',\n", " 'estás',\n", " 'está',\n", " 'estamos',\n", " 'estáis',\n", " 'están',\n", " 'esté',\n", " 'estés',\n", " 'estemos',\n", " 'estéis',\n", " 'estén',\n", " 'estaré',\n", " 'estarás',\n", " 'estará',\n", " 'estaremos',\n", " 'estaréis',\n", " 'estarán',\n", " 'estaría',\n", " 'estarías',\n", " 'estaríamos',\n", " 'estaríais',\n", " 'estarían',\n", " 'estaba',\n", " 'estabas',\n", " 'estábamos',\n", " 'estabais',\n", " 'estaban',\n", " 'estuve',\n", " 'estuviste',\n", " 'estuvo',\n", " 'estuvimos',\n", " 'estuvisteis',\n", " 'estuvieron',\n", " 'estuviera',\n", " 'estuvieras',\n", " 'estuviéramos',\n", " 'estuvierais',\n", " 'estuvieran',\n", " 'estuviese',\n", " 'estuvieses',\n", " 'estuviésemos',\n", " 'estuvieseis',\n", " 'estuviesen',\n", 
" 'estando',\n", " 'estado',\n", " 'estada',\n", " 'estados',\n", " 'estadas',\n", " 'estad',\n", " 'he',\n", " 'has',\n", " 'ha',\n", " 'hemos',\n", " 'habéis',\n", " 'han',\n", " 'haya',\n", " 'hayas',\n", " 'hayamos',\n", " 'hayáis',\n", " 'hayan',\n", " 'habré',\n", " 'habrás',\n", " 'habrá',\n", " 'habremos',\n", " 'habréis',\n", " 'habrán',\n", " 'habría',\n", " 'habrías',\n", " 'habríamos',\n", " 'habríais',\n", " 'habrían',\n", " 'había',\n", " 'habías',\n", " 'habíamos',\n", " 'habíais',\n", " 'habían',\n", " 'hube',\n", " 'hubiste',\n", " 'hubo',\n", " 'hubimos',\n", " 'hubisteis',\n", " 'hubieron',\n", " 'hubiera',\n", " 'hubieras',\n", " 'hubiéramos',\n", " 'hubierais',\n", " 'hubieran',\n", " 'hubiese',\n", " 'hubieses',\n", " 'hubiésemos',\n", " 'hubieseis',\n", " 'hubiesen',\n", " 'habiendo',\n", " 'habido',\n", " 'habida',\n", " 'habidos',\n", " 'habidas',\n", " 'soy',\n", " 'eres',\n", " 'es',\n", " 'somos',\n", " 'sois',\n", " 'son',\n", " 'sea',\n", " 'seas',\n", " 'seamos',\n", " 'seáis',\n", " 'sean',\n", " 'seré',\n", " 'serás',\n", " 'será',\n", " 'seremos',\n", " 'seréis',\n", " 'serán',\n", " 'sería',\n", " 'serías',\n", " 'seríamos',\n", " 'seríais',\n", " 'serían',\n", " 'era',\n", " 'eras',\n", " 'éramos',\n", " 'erais',\n", " 'eran',\n", " 'fui',\n", " 'fuiste',\n", " 'fue',\n", " 'fuimos',\n", " 'fuisteis',\n", " 'fueron',\n", " 'fuera',\n", " 'fueras',\n", " 'fuéramos',\n", " 'fuerais',\n", " 'fueran',\n", " 'fuese',\n", " 'fueses',\n", " 'fuésemos',\n", " 'fueseis',\n", " 'fuesen',\n", " 'sintiendo',\n", " 'sentido',\n", " 'sentida',\n", " 'sentidos',\n", " 'sentidas',\n", " 'siente',\n", " 'sentid',\n", " 'tengo',\n", " 'tienes',\n", " 'tiene',\n", " 'tenemos',\n", " 'tenéis',\n", " 'tienen',\n", " 'tenga',\n", " 'tengas',\n", " 'tengamos',\n", " 'tengáis',\n", " 'tengan',\n", " 'tendré',\n", " 'tendrás',\n", " 'tendrá',\n", " 'tendremos',\n", " 'tendréis',\n", " 'tendrán',\n", " 'tendría',\n", " 'tendrías',\n", " 'tendríamos',\n", 
" 'tendríais',\n", " 'tendrían',\n", " 'tenía',\n", " 'tenías',\n", " 'teníamos',\n", " 'teníais',\n", " 'tenían',\n", " 'tuve',\n", " 'tuviste',\n", " 'tuvo',\n", " 'tuvimos',\n", " 'tuvisteis',\n", " 'tuvieron',\n", " 'tuviera',\n", " 'tuvieras',\n", " 'tuviéramos',\n", " 'tuvierais',\n", " 'tuvieran',\n", " 'tuviese',\n", " 'tuvieses',\n", " 'tuviésemos',\n", " 'tuvieseis',\n", " 'tuviesen',\n", " 'teniendo',\n", " 'tenido',\n", " 'tenida',\n", " 'tenidos',\n", " 'tenidas',\n", " 'tened']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stopwords.words('spanish')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Ejemplo', 'mensaje', 'Atención', 'tiene', 'un', 'punto']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nopunc.split()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Eliminamos stopwords\n", "clean_mens = [word for word in nopunc.split() if word.lower() not in stopwords.words('spanish')]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Ejemplo', 'mensaje', 'Atención', 'punto']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clean_mens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Este ejemplo está desarrollado para texto en castellano, pero el conjunto de datos está en inglés. Automatizamos el proceso para ejecutarlo sobre el total de datos en inglés." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def procesado_texto(mens):\n", " \"\"\"\n", " Acepta una cadena de texto, y ejecuta:\n", " 1. Elimina todos los símbolos de puntuación\n", " 2. Elimina las stopwords\n", " 3. 
Devuelve una lista de texto limpio\n", " \"\"\"\n", " # Comprobar caracteres para eliminar cualquier símbolo de puntuación\n", " nopunc = [char for char in mens if char not in string.punctuation]\n", "\n", " # Unir los caracteres para generar un string de nuevo.\n", " nopunc = ''.join(nopunc)\n", " \n", " # Eliminar las stopwords (en este caso de uso, inglesas)\n", " return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " clase mensajes tamaño\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Para procesar el set de datos, necesitamos 'tokenizar' los mensajes (convertir un conjunto de textos, en una lista de 'tokens' que son las palabras que nos interesan).\n", "\n", "Let's see an example output on on column:\n", "\n", "**Atención:**\n", "Podemos obtener 'warnings' debido a símbolos que no hemos tenido en cuenta o que no están en Unicode (como el símbolo de € o libra)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [Go, jurong, point, crazy, Available, bugis, n...\n", "1 [Ok, lar, Joking, wif, u, oni]\n", "2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...\n", "3 [U, dun, say, early, hor, U, c, already, say]\n", "4 [Nah, dont, think, goes, usf, lives, around, t...\n", "5 [FreeMsg, Hey, darling, 3, weeks, word, back, ...\n", "6 [Even, brother, like, speak, treat, like, aids...\n", "7 [per, request, Melle, Melle, Oru, Minnaminungi...\n", "8 [WINNER, valued, network, customer, selected, ...\n", "9 [mobile, 11, months, U, R, entitled, Update, l...\n", "Name: mensajes, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Comprobamos que funciona\n", "mensajes['mensajes'].head(10).apply(procesado_texto)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " clase mensajes tamaño\n", "0 ham Go until jurong point, crazy.. Available only ... 111\n", "1 ham Ok lar... Joking wif u oni... 29\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155\n", "3 ham U dun say so early hor... U c already then say... 49\n", "4 ham Nah I don't think he goes to usf, he lives aro... 61" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mensajes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Continuando con la Normalización\n", "\n", "Existen diferentes maneras para continuar normalizando textos. Una de ellas es el [Stemming](https://es.wikipedia.org/wiki/Stemming) otra de ellas podría ser la caracterización de cada palabra en función de si es un sustantivo, adjetivo, verbo, ...(http://www.nltk.org/book/ch05.html).\n", "\n", "NLTK tiene numerosas herramientas (que están muy bien documentadas). Tenemos que tener en cuenta, que en ocasiones el formato de las palabras y textos pueden estár abreviados o no están correctamente construidas a nivel sintáctico. Por ejemplo:\n", " \n", "_'Nah dawg, IDK! Wut time u headin to da club?'_\n", " \n", "vs.\n", "\n", "_'No dog, I don't know! What time are you heading to the club?'_\n", " \n", "Para esos casos será necesario hacer uso de los métodos avanzados disponibles en [NLTK book online](http://www.nltk.org/book/).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectorización" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hasta ahora, tenemos los mensajes como una lista de tokens (también conocidas como [lemas](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)) y tenemos que convertir esos mensajes en un vector que los algoritmos de SciKit learn puedan usar.\n", "\n", "Convertiremos ahora cada mensaje (representado como una lista de tokens (lemas)), en un vector. \n", "Pasos:\n", "1. 
Contar cuántas veces aparece cada palabra en cada mensaje (frecuencia):\n", "\n", "2. Ponderar las apariciones, de manera que los tokens frecuentes 'pesen' menos (inversa de la frecuencia)\n", "\n", "3. Normalizar los vectores\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "El resultado que queremos obtener es una matriz de este tipo:\n", "\n", "\n", "\n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
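| | Message 1 | Message 2 | ... | Message N |\n",
"|---|---|---|---|---|\n",
"| **Word 1 count** | 0 | 1 | ... | 0 |\n",
"| **Word 2 count** | 0 | 0 | ... | 0 |\n",
"| **...** | 1 | 2 | ... | 0 |\n",
"| **Word N count** | 0 | 1 | ... | 1 |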
\n", "\n", "In this matrix each row is one of the (unique) tokens found and each column is one of the messages in the dataset. We will use **CountVectorizer**, included in scikit-learn, which produces the transposed layout: one row per message and one column per token.\n", "\n", "Because most tokens do not appear in most messages, we get a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix), in which the most common value is 0." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of arguments and parameters that can be passed to the CountVectorizer. In this case we will just specify the **analyzer** to be our own previously defined function:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11425\n" ] } ], "source": [ "# This step can take a while...\n", "nube_palabras = CountVectorizer(analyzer = procesado_texto).fit(mensajes['mensajes'])\n", "\n", "# Number of distinct tokens in the bag-of-words vocabulary\n", "print(len(nube_palabras.vocabulary_))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "jurong 7555\n", "point 8917\n", "crazy 5769\n", "Available 1110\n", "bugis 5218\n", "n 8336\n", "great 6937\n", "world 11163\n", "la 7668\n", "e 6217\n", "buffet 5217\n", "Cine 1483\n", "got 6906\n", "amore 4653\n", "wat 10965\n", "Ok 3064\n", "lar 7701\n", "Joking 2451\n", "wif 11072\n", "u 10698\n", "oni 8590\n", "Free 1941\n", "entry 6331\n", "2 423\n", "wkly 11123\n", "comp 5619\n", "win 11084\n", "FA 1833\n", "Cup 1551\n", "final 6557\n", "tkts 10512\n", "21st 443\n", "May 2804\n", "2005 430\n", "Text 3953\n", "87121 871\n", "receive 9252\n", "questionstd 9159\n", "txt 10686\n", "rateTCs 9200\n", "apply 4731\n", 
"08452810075over18s 73\n", "U 4068\n", "dun 6204\n", "say 9554\n", "early 6222\n", "hor 7186\n", "c 5261\n", "already 4629\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A small sample of the vocabulary (token -> column index)\n", "pd.Series(nube_palabras.vocabulary_)[1:50]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take the bag-of-words counts of a single message as a vector..." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U dun say so early hor... U c already then say...\n" ] } ], "source": [ "mensaje4 = mensajes['mensajes'][3]\n", "print(mensaje4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In vector form we get..." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 4068)\t2\n", " (0, 4629)\t1\n", " (0, 5261)\t1\n", " (0, 6204)\t1\n", " (0, 6222)\t1\n", " (0, 7186)\t1\n", " (0, 9554)\t2\n", "\n", "\n", "Dimensiones: (1, 11425)\n" ] } ], "source": [ "vector4 = nube_palabras.transform([mensaje4])\n", "print(vector4)\n", "print('\\n')\n", "print('Dimensiones: ',vector4.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that mensaje4 contains 7 unique words (after removing the stop words). Two of them appear twice and the rest only once. Let's check which terms appear twice." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U\n", "say\n" ] } ], "source": [ "print(nube_palabras.get_feature_names_out()[4068])\n", "print(nube_palabras.get_feature_names_out()[9554])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we apply **.transform** to the whole set of messages to obtain the bag-of-words counts as a sparse matrix." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "mensajes_nube_palabras = nube_palabras.transform(mensajes['mensajes'])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dimensiones de la matriz dispersa: (5572, 11425)\n", "Total de elementos NO nulos: 50548\n" ] } ], "source": [ "print('Dimensiones de la matriz dispersa: ', mensajes_nube_palabras.shape)\n", "print('Total de elementos NO nulos: ', mensajes_nube_palabras.nnz)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dispersion: 0.07940295412668218\n" ] } ], "source": [ "# Percentage of non-zero entries in the matrix\n", "dispersion = (100.0 * mensajes_nube_palabras.nnz / (mensajes_nube_palabras.shape[0] * mensajes_nube_palabras.shape[1]))\n", "print('dispersion: {}'.format(dispersion))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the bag-of-words matrix, we need to weight and normalize the counts. The goal is to measure how important each term is to a message relative to the whole corpus, which can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`." 
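, "\n", "\n",
"To make the weighting concrete, here is a minimal sketch that reproduces `TfidfTransformer` by hand on a tiny made-up count matrix. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), for which idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of messages and df(t) the number of messages containing term t. The array `counts` and the other variable names are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 3 messages (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # messages that contain each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency

weighted = counts * idf                     # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2 norm per row

# Same computation with scikit-learn, using its default parameters
sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```

With these defaults a term that occurs in every message gets an idf of exactly 1, and rarer terms get larger values."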
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " (0, 9554)\t0.5385626262927564\n", " (0, 7186)\t0.4389365653379857\n", " (0, 6222)\t0.3187216892949149\n", " (0, 6204)\t0.29953799723697416\n", " (0, 5261)\t0.29729957405868723\n", " (0, 4629)\t0.26619801906087187\n", " (0, 4068)\t0.40832589933384067\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "\n", "tfidf_transformer = TfidfTransformer().fit(mensajes_nube_palabras)\n", "tfidf4 = tfidf_transformer.transform(vector4)\n", "print(tfidf4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the IDF (inverse document frequency) of the words \"u\" and \"university\"?" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.2800524267409408\n", "8.527076498901426\n" ] } ], "source": [ "print(tfidf_transformer.idf_[nube_palabras.vocabulary_['u']])\n", "print(tfidf_transformer.idf_[nube_palabras.vocabulary_['university']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now transform the whole bag-of-words matrix into a TF-IDF corpus in one go:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(5572, 11425)\n" ] } ], "source": [ "mensajes_tfidf = tfidf_transformer.transform(mensajes_nube_palabras)\n", "print(mensajes_tfidf.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the messages are represented as vectors, we can train our spam/ham classifier. Almost any classification algorithm could be used. Here we will use the [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier; see also [these lecture notes](http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf)." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "modelo_deteccion_spam = MultinomialNB().fit(mensajes_tfidf, mensajes['clase'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the model, so let's see how it classifies message 4:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicho: ham\n", "Esperado: ham\n" ] } ], "source": [ "print('Predicho:', modelo_deteccion_spam.predict(tfidf4)[0])\n", "print('Esperado:', mensajes.clase[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have our message classifier.\n", "\n", "## Model evaluation\n", "Next we want to check how well the model performs. Keep in mind that we cannot use the same data to train and to test the model: scores measured on the training messages would overestimate how it behaves on new ones. Since we did not split the data at the start and trained on all the messages, we cannot assess the model's performance yet." 
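, "\n", "\n",
"A held-out split is the usual way around this. The sketch below shows one on made-up data; `stratify` and `random_state` are optional arguments of `train_test_split` that are assumptions here (this notebook does not use them): the first keeps the ham/spam proportion the same in both parts, the second makes the split reproducible. The lists `textos` and `clases` are illustrative.

```python
from sklearn.model_selection import train_test_split

# Made-up, imbalanced data: 80 ham and 20 spam messages
textos = [f'ham message {i}' for i in range(80)] + [f'spam message {i}' for i in range(20)]
clases = ['ham'] * 80 + ['spam'] * 20

# Hold out 20% for testing, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    textos, clases, test_size=0.2, stratify=clases, random_state=42)

print(len(X_train), len(X_test), y_test.count('spam'))
```

Stratifying matters most when one class is rare, as spam is in this dataset, because a purely random split can leave the test set with too few spam messages to judge recall."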
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Test Split" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4457 1115 5572\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "msg_train, msg_test, clase_train, clase_test = train_test_split(mensajes['mensajes'], mensajes['clase'], test_size=0.2)\n", "\n", "print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we chose a test set size of 20% (1115 messages out of 5572)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Pipeline\n", "\n", "A [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) chains the preprocessing steps and the final estimator into a single object, so that exactly the same transformations are applied when training and when predicting.\n", "\n", "The result of the whole process is a single model object that can be saved and loaded again later.\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "pipeline = Pipeline([\n", "    ('nube', CountVectorizer(analyzer=procesado_texto)), # strings to token integer counts\n", "    ('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores\n", "    ('clasificador', MultinomialNB()), # train a multinomial Naive Bayes classifier on the TF-IDF vectors\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now pass in the raw text messages and the pipeline takes care of the preprocessing for us:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('nube',\n", " CountVectorizer(analyzer=)),\n", " ('tfidf', TfidfTransformer()),\n", " ('clasficador', 
RandomForestClassifier())])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.fit(msg_train,clase_train)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report\n", "predicciones = pipeline.predict(msg_test)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.97 1.00 0.99 974\n", " spam 1.00 0.82 0.90 141\n", "\n", " accuracy 0.98 1115\n", " macro avg 0.99 0.91 0.94 1115\n", "weighted avg 0.98 0.98 0.98 1115\n", "\n" ] } ], "source": [ "print(classification_report(clase_test,predicciones))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Swapping in a different classifier is straightforward with a pipeline. The following example uses a Random Forest (RF) classifier." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "pipeline = Pipeline([\n", "    ('nube', CountVectorizer(analyzer=procesado_texto)),\n", "    ('tfidf', TfidfTransformer()),\n", "    ('clasficador', RandomForestClassifier()),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you check the performance of the RF classifier? Is it better or worse than that of NB?" 
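, "\n", "\n",
"One possible way to approach the question, sketched on a tiny made-up corpus: fit the same pipeline once per classifier and compare a metric on held-out messages. The toy messages, the default `CountVectorizer()` analyzer and the variable names are assumptions for illustration; in the notebook you would pass `msg_train`, `clase_train`, `msg_test` and `clase_test`, and keep `analyzer=procesado_texto`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and test messages (not the SMS dataset)
train_msgs = ['free prize call now', 'win cash free entry', 'see you at lunch',
              'call me when you get home', 'free entry win prize', 'are you coming home']
train_lbls = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']
test_msgs = ['win a free prize', 'see you at home']
test_lbls = ['spam', 'ham']

resultados = {}
for nombre, clf in [('NB', MultinomialNB()),
                    ('RF', RandomForestClassifier(random_state=0))]:
    # Same preprocessing steps for both models; only the last step changes
    modelo = Pipeline([
        ('nube', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clasificador', clf),
    ])
    modelo.fit(train_msgs, train_lbls)
    resultados[nombre] = accuracy_score(test_lbls, modelo.predict(test_msgs))

print(resultados)
```

On data this small the numbers mean little; on the SMS dataset, `classification_report` gives a fuller comparison, in particular of the spam recall."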
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further resources\n", "\n", "[NLTK Book Online](http://www.nltk.org/book/)\n", "\n", "[Kaggle Walkthrough](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words)\n", "\n", "[SciKit Learn's Tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }