Dataset available here
The dataset is labeled: every text message carries one of two labels, spam or ham (legitimate). It is distributed as a text file.
Dataset size: 5574 labeled messages.
Some examples follow:
| Label | Message |
|---|---|
| ham | What you doing?how are you? |
| ham | Ok lar... Joking wif u oni... |
| ham | dun say so early hor... U c already then say... |
| ham | MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H* |
| ham | Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor. |
| spam | FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop |
| spam | Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B |
| spam | URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU |
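As a rough sketch of how such a file could be read, the snippet below assumes each line holds a label, a tab character, and then the message text. That layout is an assumption, not something stated here, and `parse_messages` is a hypothetical helper name. Two rows from the table above stand in for the file contents.

```python
from collections import Counter


def parse_messages(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair.

    The tab-separated layout is an assumption about the text file.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message are kept.
        label, _, text = line.partition("\t")
        pairs.append((label, text))
    return pairs


# Two rows from the example table, written in the assumed file layout.
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n",
]

messages = parse_messages(sample_lines)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

Because `parse_messages` accepts any iterable of lines, an open file handle can be passed in place of `sample_lines` to read the full dataset.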
Classify the messages as spam or ham using Support Vector Classification (SVC)
The program was developed in a Google Colab notebook, and I have uploaded a copy of it.
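For orientation, here is a minimal sketch of a Support Vector Classification pipeline in scikit-learn. The TF-IDF features and the linear kernel are assumptions made for illustration; the notebook's actual preprocessing and hyperparameters may differ. The handful of training messages is taken from the example table and is far too small for a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training set drawn from the example table; a real run would use
# the full dataset with a held-out test split.
train_texts = [
    "What you doing?how are you?",
    "Ok lar... Joking wif u oni...",
    "dun say so early hor... U c already then say...",
    "Sunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B",
    "URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus "
    "Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! "
    "Call 0871-872-9758 BOX95QU",
]
train_labels = ["ham", "ham", "ham", "spam", "spam"]

# Assumed setup: TF-IDF word features feeding a linear-kernel SVC.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

# Predicting on the training texts only shows the API shape,
# not how well the model generalizes.
predictions = model.predict(train_texts)
print(list(predictions))
```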
Best result found:
| Metric | Score (%) |
|---|---|
| Spam Caught | 83.89 |
| Blocked Ham | 0.00 |
| Accuracy | 97.85 |
| Matthews correlation coefficient (MCC) | 90.48 |
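The four metrics can be computed from a confusion matrix. The sketch below assumes the usual definitions with spam as the positive class: Spam Caught is the share of spam correctly flagged, Blocked Ham is the share of ham wrongly flagged, and MCC is scaled by 100 to match the table. The function name `spam_metrics` and the counts in the example call are hypothetical, not the counts behind the scores above.

```python
import math


def spam_metrics(tp, fn, fp, tn):
    """Return the four scores as percentages, with spam as the positive class.

    tp: spam flagged as spam      fn: spam that slipped through
    fp: ham flagged as spam       tn: ham correctly let through
    """
    spam_caught = tp / (tp + fn)
    blocked_ham = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "spam_caught": 100 * spam_caught,
        "blocked_ham": 100 * blocked_ham,
        "accuracy": 100 * accuracy,
        "mcc": 100 * mcc,
    }


# Hypothetical counts for illustration only.
scores = spam_metrics(tp=8, fn=2, fp=0, tn=90)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

A Blocked Ham score of zero means no legitimate message was flagged, so every error in that case is a spam message that went undetected.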
Results are discussed in this report.