Recent Developments in Social Spam Detection and Combating Techniques
Spam refers to unwanted or unsolicited messages sent or received electronically via email, instant messaging, blogs, newsgroups, social media, web search, and mobile phones, for purposes such as advertising, phishing, and malware distribution. As the definition suggests, spam is often malicious and generally represents a viable but fraudulent source of income for some individuals or organizations. The attacker who sends such messages is generally referred to as a “spammer.” Although it was initially limited to email, spam has since spread to virtually every electronic platform.
What kinds of spam exist today?
- Email spam: Also known as junk mail, this is the sending of unwanted messages, frequently containing commercial content, in large quantities to an indiscriminate set of recipients.
- Instant messaging (IM) spam: Although it is subtler than its email counterpart, it annoys users of instant messengers such as Skype©, Yahoo!®, and Messenger with unsolicited messages from advertisers and others.
- Spam on newsgroups and forums: Repeated, irrelevant postings in Usenet newsgroups and Internet forums.
- Mobile phone spam: This form of spam uses the short message service (SMS) as its modus operandi. Victims are sometimes tricked into fake subscriptions and scams and then charged for premium services.
- Spamdexing: refers to search engine spam or the practice of manipulating the search engine ranking and relevance algorithm to promote a particular website or web page.
- Splogs and Wikis: Spam on blogs, also known as Splog, refers to comments unrelated to the discussion topic. These comments usually embed URL links to commercial sites. Some Splogs are written as detailed announcements for the websites they promote; others have no original content and feature nonsense or content stolen from legitimate websites. Similar attacks are also seen on Wikis and guestbooks that accept comments from general users.
- Spam on video sites: Social networking websites like YouTube are also infested with spam that usually involves comments and links to some pornographic or dating site or some unrelated videos. Sometimes these comments are automatically generated through Bots.
- Spam in online game messaging: This consists of floods of messages, requests to join a particular group, violations of copyright terms and conditions, etc.
- SPIT, or spam over Internet telephony: This uses voice over Internet Protocol (VoIP) to send spam. Typically, a pre-recorded message is played when the recipient answers a spam call. This platform is an attractive target for spammers since VoIP is cheap and easily anonymized.
Types of spam and spamming techniques
Types of spam
- Malicious Links: Links that mislead a user or otherwise harm the user or their computer.
- Fake Profiles: Spammers may create fake profiles that would otherwise appear legitimate to avoid detection and lure non-spammers into befriending them.
- Mass mailings: Also known as spam bombs, these are sets of comments posted several times with identical text, which lets the tags associated with the comments trend quickly on social networks.
- Scam Reviews: These reviews claim that a product is original and good, even though the reviewer may not have used it.
- Clickjacking: Also known as UI-redressing, spammers trick users into clicking on invisible targets (e.g., buttons) belonging to a different page. This form of spam can be seen mainly on blogs and forums.
- Malicious browser extensions via drive-by downloads: This form of attack downloads malware from the Internet without the user’s knowledge. This type of spam usually arrives as malicious links and can be found on blogs, website bookmarks, reviews, etc.
- URL Shorteners: In this spam attack, the shortened URL obfuscates the actual URL and redirects to its configured destinations without the user’s consent. This type of spam is more frequent in social networks, microblogs, reviews, etc.
Spam detection techniques
There are three main strategies for dealing with spam:
- Detection-based techniques: These try to identify and remove spam from the system.
- Degradation-based strategies: These attempt to lower the ranking of spam in a list of messages.
- Prevention-based strategies: These aim to hinder the ability of spammers to contribute to the system by altering interfaces or limiting user actions.
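To make the difference between the three strategies concrete, here is a minimal sketch. It assumes each message already carries a spam score from some upstream classifier; the `Message` class, the thresholds, and the per-sender limit are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    text: str
    spam_score: float  # assumed output of some upstream classifier, in [0, 1]


def detect(messages, threshold=0.9):
    """Detection: remove messages whose score is at or above the threshold."""
    return [m for m in messages if m.spam_score < threshold]


def degrade(messages):
    """Degradation: keep every message, but push likely spam to the bottom."""
    return sorted(messages, key=lambda m: m.spam_score)


def prevent(sender, sent_counts, limit=3):
    """Prevention: refuse a post once a sender has reached a (hypothetical) limit."""
    if sent_counts.get(sender, 0) >= limit:
        return False
    sent_counts[sender] = sent_counts.get(sender, 0) + 1
    return True


inbox = [
    Message("alice", "Lunch at noon?", 0.05),
    Message("promo-bot", "WIN A FREE PHONE http://example.test", 0.97),
    Message("bob", "Check out my new shop", 0.60),
]
```

Detection changes what users see, degradation changes only the order, and prevention acts before a message is ever stored.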
Latest developments in antispam techniques
SocialFilter is a collaborative spam filtering system that uses the social trust embedded in online social networks (OSN) to evaluate the reliability of spam reporters. It is a graph-based approach built on the OSN graph. SocialFilter aims to aggregate the reports of multiple spam detectors, thereby democratizing spam mitigation. Each SocialFilter node, which is managed by a human administrator, sends spammer reports to a centralized repository.
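The idea of weighting spammer reports by how much the reporter is trusted can be sketched as follows. This is only an illustration of trust-weighted aggregation, not SocialFilter's actual algorithm: the report format, the trust values, and the weighted-average rule are all assumptions.

```python
from collections import defaultdict


def aggregate_reports(reports, reporter_trust):
    """Combine spammer reports into one score per suspect.

    reports: iterable of (reporter, suspect, confidence) tuples,
             with confidence in [0, 1] (hypothetical format).
    reporter_trust: dict mapping reporter -> trust in [0, 1].
    Returns the trust-weighted average confidence for each suspect.
    """
    weighted = defaultdict(float)
    total_trust = defaultdict(float)
    for reporter, suspect, confidence in reports:
        trust = reporter_trust.get(reporter, 0.0)  # unknown reporters carry no weight
        weighted[suspect] += trust * confidence
        total_trust[suspect] += trust
    # Suspects reported only by zero-trust nodes get no score at all.
    return {s: weighted[s] / total_trust[s] for s in total_trust if total_trust[s] > 0}


reporter_trust = {"node-a": 0.9, "node-b": 0.1}
reports = [
    ("node-a", "host-1", 1.0),
    ("node-b", "host-1", 0.0),
    ("node-c", "host-3", 1.0),  # node-c is unknown, so this report is ignored
]
scores = aggregate_reports(reports, reporter_trust)
```

Here the highly trusted `node-a` dominates the score for `host-1`, while the report from the unknown `node-c` has no effect.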
Blogs are a platform where people express their emotions, share information, and communicate with each other. With their growing popularity, blogs are now being used to drive blog search engine traffic or for promotional purposes. These types of blogs are called Splogs. However, most existing Splog detection techniques are content-based, which is less effective given the dynamic nature of blogs.
Currently, three antispam techniques are used to combat Splogs:
- Detection-based techniques that take a deterministic approach, evaluated on Technorati query data, with a detection accuracy above 60%.
- Classification-based techniques that operate on social network comments, with a detection accuracy above 60%.
- Detection-based techniques that cluster social graphs built from posts on commercial blog sites. This technique is considered the most efficient at detecting spammers.
Microblog spam refers to spamming on microblogging platforms, such as Twitter, where there is a limitation on the size of the tweet. To detect microblog spam, there are several techniques available, including:
- Deterministic approaches that identify spam through case studies.
- Classification-based approaches built on the Social Honeypot framework.
- Degradation techniques that focus on Collusion Rank and PageRank.
There are also antispam techniques that combine case studies and classification, such as the social-graph-based Mr. SPA. Others cluster spam campaigns and label them with a Random Forest (RF) classifier, use a Lasso formulation integrated with a graph regularization term, apply Random Forest classification with adjusted features, or use an ELM-based classifier with defined features.
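As a rough illustration of the PageRank side of the degradation techniques mentioned above, the sketch below runs standard power-iteration PageRank over a small follower graph. It is plain PageRank, not Collusion Rank, and the graph, account names, and parameters are made up for the example.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed 'follower -> followed' graph.

    follows: dict mapping an account to the list of accounts it follows.
    """
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling account (follows nobody): spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "spam" follows everyone, but nobody follows it back.
follows = {"a": ["c"], "b": ["c"], "c": ["a"], "spam": ["a", "b", "c"]}
ranks = pagerank(follows)
# Degradation step: list accounts from most to least reputable.
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

An account that follows many users but is followed by none ends up at the bottom of the ordering, so its posts can be shown last rather than removed.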
Social bookmarking has evolved from traditional bookmarking to a platform where users can add, edit, or modify a website or web page for future access. These sites allow users to bookmark different web pages and share their opinions on articles, images, and videos. However, many website owners use social bookmarking sites to browse interesting articles and include links. This exposes websites to spammers through backlinks, as spammers create attractive spam bookmarks that are chosen by unsuspecting users.
To combat bookmarking spam, there are various antispam techniques available, including:
- Clustering and classification techniques based on Self-Organizing Maps (SOM) clustering and association discovery.
- Probabilistic feature extraction and aggregation.
- GraphLab Create and Probabilistic Soft Logic for feature extraction.
- Gradient-Boosted Decision Tree classifier for classification.
Social network spam
Research on social network spam has found that much of it is generated by bots. These bots are known as Displayer, Bragger, Poster, and Whisperer. The features used to detect them on social networks include:
- FF Ratio: The ratio of friend requests to the existing number of friends.
- URL ratio: The ratio of URLs in a message to the number of words.
- Friend choice: The similarity between the spammer and the victim’s friend lists.
- Messages sent: The number of messages a user sends in a given time frame.
- Friend number: The number of friends a user has.
In other words, these features feed classification-based techniques that model spam bots and spam profiles and classify them at run time. Clustering techniques focus on Markov clustering on social graphs and the SOM learning algorithm, while detection and removal techniques are based on incremental clustering followed by classification. Classification and monitoring techniques focus on the social network-based Social Spam Guard, and unsupervised detection techniques focus on a HITS-based framework.
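Several of the profile features listed above reduce to simple ratios and counts. The following sketch shows one way they could be computed before being handed to a classifier; the function names, the URL pattern, and the handling of zero friends are assumptions made for illustration.

```python
import re

# Simplified URL pattern, assumed for this example.
URL_PATTERN = re.compile(r"https?://\S+")


def ff_ratio(requests_sent, friend_count):
    """Friend requests sent divided by current friends (at least 1 to avoid dividing by zero)."""
    return requests_sent / max(friend_count, 1)


def url_ratio(message):
    """Number of URLs in a message divided by its number of whitespace-separated words."""
    words = message.split()
    if not words:
        return 0.0
    return len(URL_PATTERN.findall(message)) / len(words)


def messages_in_window(timestamps, start, end):
    """Count messages with start <= timestamp < end."""
    return sum(1 for t in timestamps if start <= t < end)


def feature_vector(profile):
    """Build [FF ratio, mean URL ratio, messages sent, friend number] from a dict profile."""
    messages = profile["messages"]
    mean_url = sum(url_ratio(m) for m in messages) / len(messages) if messages else 0.0
    return [
        ff_ratio(profile["requests_sent"], profile["friend_count"]),
        mean_url,
        len(messages),
        profile["friend_count"],
    ]
```

A profile that sends many friend requests, has few friends, and fills its messages with links produces a high FF ratio and a high URL ratio, which is the pattern a classifier would learn to flag.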
Review spam is a type of spam that appears as reviews on e-commerce websites. Positive reviews can boost a company’s business, but negative reviews can harm it. Some spammers intentionally post reviews to damage the reputation of a product or company, and bots can also generate these reviews. In 2013, a model was designed to generate synthetic reviews. Because such reviews are difficult to detect using existing methods, a novel defense method was proposed that detects the difference in semantic flows between fake and truthful reviews.
Currently, there are several antispam techniques for detecting review spam, including:
- Classification-based techniques focused on linear kernel SVM and n-gram-based methods.
- Rule-based techniques.
- Time-sensitive feature-based techniques.
- Combined techniques based on frameworks for classification and clustering.
- Classification-based techniques focused on generating and analyzing synthetic reviews.
- Loopy Belief Propagation (LBP) network-based techniques.
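For the n-gram-based methods in the list above, the feature extraction step is easy to sketch. The example below builds word unigram and bigram counts and applies a linear decision function of the kind a trained linear-kernel SVM produces. The weights are hypothetical placeholders; training the SVM itself is not shown.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count all word n-grams for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def linear_score(features, weights, bias=0.0):
    """Linear decision function: bias + sum of weight * count.

    A positive score would be read as 'fake', a negative one as 'truthful',
    assuming the weights came from a trained linear model.
    """
    return bias + sum(weights.get(f, 0.0) * c for f, c in features.items())
```

In practice the weights would be learned from labeled reviews; the sketch only shows how a review is turned into features and scored.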
Location search spam
According to cybersecurity research, spammers can infiltrate and disrupt a valid search system by associating unrelated tags with documents or even randomly infusing documents with terms related to a particular location. A methodology for detecting spam on a location-based social networking website, Foursquare, was developed to address this issue. Foursquare allows users to leave tips about various places and attractions, which other users can access. However, spammers post irrelevant content as tips, such as business promotion, which misleads users interested in learning about a particular place.
The study analyzes tip spammers, aiming to develop automated tools for detecting users who post spam tips. Location search antispam techniques rely on classification and clustering: Random Forest and Decision Tree classification, and EM clustering for categorization.
Comment spam is prevalent on social media platforms, particularly on YouTube and news sites. A data mining approach has been proposed to filter spam comments on YouTube forums to combat this cyberattack. Unlike content analysis for spam detection, this approach exploits comment behavior to identify spammers. The methodology takes advantage of YouTube’s hasSpamHint feature that accompanies user comments. Here are the steps involved:
- Retrieve comments marked as hasSpamHint for a given video.
- Extract the user IDs behind the suspected spam comments to gather information about user comment activity.
- Derive attributes such as the comment text, timestamp, VideoID of the commented video, and the value of the hasSpamHint binary variable from the usage log in discussion forums.
- Calculate the values of variables that indicate the spam intent of the user.
- Assign a score to the user to identify them as a spammer or not.
- Apply a specific rule derived from manual data inspection to mark any user who meets the rule’s conditions (with at least five comments) as a spammer.
Antispam techniques of this kind combine rule-based detection and NLP-induced topic similarity between posts and comments, followed by classification.
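The steps above can be sketched as a small rule-based scorer. The `Comment` fields mirror the attributes named in the steps, but the class itself, the score (the fraction of a user's comments carrying the spam hint), and the 0.5 ratio threshold are assumptions. The source only states that the rule requires at least five comments.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Comment:
    user_id: str
    video_id: str
    text: str
    timestamp: int
    has_spam_hint: bool  # stands in for the hasSpamHint value


def user_scores(comments):
    """Return {user_id: (flagged_ratio, total_comments)} from a comment log."""
    total = defaultdict(int)
    flagged = defaultdict(int)
    for c in comments:
        total[c.user_id] += 1
        if c.has_spam_hint:
            flagged[c.user_id] += 1
    return {u: (flagged[u] / total[u], total[u]) for u in total}


def flag_spammers(comments, min_comments=5, min_ratio=0.5):
    """Mark users with enough comments and a high enough flagged ratio.

    min_ratio is a hypothetical threshold chosen for this example.
    """
    return {
        user
        for user, (ratio, count) in user_scores(comments).items()
        if count >= min_comments and ratio >= min_ratio
    }
```

Requiring a minimum number of comments keeps a user from being labeled a spammer on the basis of one or two flagged comments.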
Cross-media spam detection is an anti-spam methodology that detects spam across different platforms. It aims to identify spam rapidly in all social networks and to increase detection accuracy by drawing on a larger data set. While a single effective strategy cannot be applied to all forms and platforms of spam, this technique is an innovative cross-platform framework for detecting social spam.
This technique is divided into three main components:
- Mapping and assembly: converts a specific social network object into the framework-defined standard model for that object.
- Pre-filtering: uses blacklists, hashing, and similarity comparison to check incoming objects against known spam objects.
- Classification: uses supervised machine learning techniques to classify incoming and associated objects.
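The first two components can be illustrated with a short sketch: a standard message model that platform-specific objects are mapped into, and a pre-filter that checks a blacklist, an exact hash, and a similarity score in turn. The class names, the field names of the raw post, and the 0.9 similarity threshold are all hypothetical; the classification stage is not shown.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class StandardMessage:
    source: str
    author: str
    text: str


def map_post(source, raw):
    """Mapping: convert a platform-specific dict (assumed keys) to the standard model."""
    return StandardMessage(source=source, author=raw["user"], text=raw["text"])


def normalize(text):
    return " ".join(text.lower().split())


def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class PreFilter:
    def __init__(self, blacklisted_authors, known_spam_texts, threshold=0.9):
        self.blacklist = set(blacklisted_authors)
        self.known_texts = [normalize(t) for t in known_spam_texts]
        self.known_hashes = {fingerprint(t) for t in known_spam_texts}
        self.threshold = threshold

    def check(self, message):
        """Return why a message matched known spam, or None to pass it on to classification."""
        if message.author in self.blacklist:
            return "blacklist"
        if fingerprint(message.text) in self.known_hashes:
            return "hash"
        text = normalize(message.text)
        for known in self.known_texts:
            if SequenceMatcher(None, text, known).ratio() >= self.threshold:
                return "similarity"
        return None
```

The checks run from cheapest to most expensive, so only messages that pass all three reach the supervised classifier.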
Spam is a widespread problem on the Internet, and antispam techniques have been developed and implemented across various platforms with varying degrees of success. These techniques broadly fall into deterministic, probabilistic, or graph-based algorithms, but each category has significant variation. Probabilistic approaches are most commonly used in modern techniques, as the characteristics of social networks differ significantly from those of standard documents or web pages. However, the fight against spam is a never-ending game as spammers develop new methods to evade detection.
Therefore, constant vigilance and the development of new and better spam-fighting techniques are essential to combat spam effectively.
admin is a senior staff writer for Government Technology. She previously wrote for PYMNTS and The Bay State Banner, and holds a B.A. in creative writing from Carnegie Mellon. She’s based outside Boston.