Moderation & Risk Scoring Specification

This document outlines the precise technical rules governing the automated systems for content moderation and risk score calculation. All user-submitted content and user accounts are evaluated according to these exact specifications.

Part 1: Content-Level Moderation

Each piece of user-submitted content (profiles, posts, comments) is first processed by the Content Moderation Function. This function returns two outputs: the modified (filtered) content and a numerical Content Score.

Stage 1.1: Severe Violation Checks

Content is checked against high-priority violation tiers. A match causes the content to be removed immediately and assigns a fixed Content Score of 5.0.

Rule 1.1.1 (Tier 1 Words): A case-insensitive, whole-word search is performed against the Tier 1 Word List. If a match is found, the function immediately returns the string [content removed due to severe violation] and a fixed Content Score of 5.0.
Rule 1.1.2 (Tier 2 Phrases): If no Tier 1 match is found, a case-insensitive, whole-phrase search is performed against the Tier 2 Phrase List. If a match is found, the function immediately returns the string [content removed due to spam/scam policy] and a fixed Content Score of 5.0.
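
The two checks above can be sketched in Python as follows. This is a minimal illustration, not the production implementation: the list contents are hypothetical placeholders (the real Tier 1 and Tier 2 lists are not given in this specification), the function names are invented for the example, and "whole-word" is assumed to mean a match delimited by regex word boundaries.

```python
import re

# Hypothetical placeholder lists; the real lists are not part of this document.
TIER1_WORDS = ["badword"]
TIER2_PHRASES = ["free money now"]

TIER1_MESSAGE = "[content removed due to severe violation]"
TIER2_MESSAGE = "[content removed due to spam/scam policy]"
SEVERE_SCORE = 5.0


def contains_whole(text, terms):
    """Case-insensitive whole-word / whole-phrase search (assumes \\b boundaries)."""
    if not terms:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def check_severe(text):
    """Return (replacement_text, score) on a Stage 1.1 match, otherwise None."""
    # Rule 1.1.1 is checked first; Rule 1.1.2 only runs if Tier 1 did not match.
    if contains_whole(text, TIER1_WORDS):
        return TIER1_MESSAGE, SEVERE_SCORE
    if contains_whole(text, TIER2_PHRASES):
        return TIER2_MESSAGE, SEVERE_SCORE
    return None
```

A return value of None means the content passed Stage 1.1 and should continue to the scored checks. With word boundaries, a longer word that merely contains a listed term (for example "badwords") is not treated as a match.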

Stage 1.2: Scored Violations & Filtering

If the content passes Stage 1.1, its Content Score starts at 0.0 and is incremented based on the following rules:

Rule 1.2.1 (Tier 3 Words): Each case-insensitive, whole-word match from the Tier 3 Word List is replaced with a run of asterisks (*) of the same length as the matched word. The Content Score is incremented by +2.0 for each match.
Rule 1.2.2 (External Links): Each detected URL is replaced with [link removed]. The Content Score is incremented by +2.0 for each match.
Rule 1.2.3 (Excessive Capitalization): If the content contains more than 15 alphabetic characters and more than 70% of those alphabetic characters are uppercase, the Content Score is incremented once by a fixed value of +0.5. The content is not modified.
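
A rough Python sketch of the three scored rules is shown below. Several details are assumptions made for the example only: the Tier 3 list is a hypothetical placeholder, the URL pattern is deliberately simple (an http/https or www prefix followed by non-space characters), the capitalization check is measured on the original text before any replacement, and the rules are applied in the order they are listed.

```python
import re

# Hypothetical placeholder list; the real Tier 3 Word List is not given here.
TIER3_WORDS = ["darn", "heck"]

# Assumption: a simple URL detector, not the detector used in production.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TIER3_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in TIER3_WORDS) + r")\b",
    re.IGNORECASE,
)


def score_and_filter(text):
    """Apply Rules 1.2.1 to 1.2.3; return (filtered_text, content_score)."""
    score = 0.0

    # Rule 1.2.3: evaluated on the original text (assumed ordering).
    letters = [c for c in text if c.isalpha()]
    if len(letters) > 15:
        upper = sum(1 for c in letters if c.isupper())
        if upper / len(letters) > 0.70:
            score += 0.5

    # Rule 1.2.1: mask each Tier 3 word with asterisks of equal length.
    text, word_hits = TIER3_PATTERN.subn(lambda m: "*" * len(m.group(0)), text)
    score += 2.0 * word_hits

    # Rule 1.2.2: replace each detected URL.
    text, link_hits = URL_PATTERN.subn("[link removed]", text)
    score += 2.0 * link_hits

    return text, score
```

Because each match adds +2.0 with no cap stated at this stage, content with three or more matches ends up with a Content Score above 5.0.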

Part 2: Entity-Level Risk Score Calculation

After individual pieces of content are scored, the system calculates a final Risk Score for posts, comments, and users. This score is used for sorting and review in the administrative dashboard.

Rule 2.1: Post & Comment Risk Score

The Risk Score for an individual post or comment is determined by its Content Score, adjusted for the author's account age.

Step 1: Calculate the base_score for the content using the Content Moderation Function.
Step 2: Determine the author's account age in days.
Step 3: Apply an age-based multiplier to the base_score to get the final risk_score:
- If account age is less than 7 days, risk_score = base_score * 1.5.
- Otherwise, risk_score = base_score.
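
The three steps above reduce to a single conditional. The sketch below assumes base_score has already been produced by the Content Moderation Function; the function and constant names are illustrative only.

```python
NEW_ACCOUNT_DAYS = 7          # threshold from Step 3
NEW_ACCOUNT_MULTIPLIER = 1.5  # multiplier from Step 3


def post_risk_score(base_score, account_age_days):
    """Rule 2.1: scale the Content Score for accounts younger than 7 days."""
    if account_age_days < NEW_ACCOUNT_DAYS:
        return base_score * NEW_ACCOUNT_MULTIPLIER
    return base_score
```

For example, post_risk_score(2.0, 3) gives 3.0, while post_risk_score(2.0, 7) gives 2.0. Rule 2.1 states no cap, so a base_score of 5.0 from a brand-new account yields 7.5.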

Rule 2.2: User Risk Score

The Risk Score for a user is a weighted sum of the Content Scores of their profile, posts, and comments, designed to reflect their overall behavior.

Step 1 (Profile Score): Calculate the Content Score of the user's profile text. This is the profile_score.
Step 2 (Post Score): Calculate the Content Score for each of the user's posts and take the mean. This is the average_post_score. If the user has no posts, this value is 0.
Step 3 (Comment Score): Calculate the Content Score for each of the user's comments and take the mean. This is the average_comment_score. If the user has no comments, this value is 0.
Step 4 (Combine Scores): A preliminary content_risk_score is calculated using the following formula:
content_risk_score = (profile_score * 1) + (average_post_score * 3) + (average_comment_score * 1)
Step 5 (Apply Age Multiplier): The final user_risk_score is determined by applying a multiplier to the content_risk_score based on account age:
- If account age is less than 7 days: user_risk_score = content_risk_score * 1.5
- If account age is at least 7 days but less than 30 days: user_risk_score = content_risk_score * 1.2
- Otherwise: user_risk_score = content_risk_score
Step 6 (Final Capping): The final calculated user_risk_score is capped at a maximum value of 5.0.
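
Steps 1 through 6 can be sketched as one function. The sketch assumes the individual Content Scores have already been computed and are passed in as plain numbers and lists; the function names are illustrative, and the age tiers are checked in order so that the 1.5 multiplier takes precedence for accounts younger than 7 days.

```python
def average(scores):
    """Mean of a list of scores, or 0.0 when the list is empty (Steps 2 and 3)."""
    return sum(scores) / len(scores) if scores else 0.0


def user_risk_score(profile_score, post_scores, comment_scores, account_age_days):
    """Rule 2.2: weighted combination, age multiplier, then cap at 5.0."""
    # Step 4: weights of 1, 3, and 1 for profile, posts, and comments.
    content_risk_score = (
        profile_score * 1
        + average(post_scores) * 3
        + average(comment_scores) * 1
    )

    # Step 5: first matching age tier wins.
    if account_age_days < 7:
        multiplier = 1.5
    elif account_age_days < 30:
        multiplier = 1.2
    else:
        multiplier = 1.0

    # Step 6: cap the final score.
    return min(content_risk_score * multiplier, 5.0)
```

As a worked example, a 40-day-old account with a profile_score of 0.5, post scores of 2.0 and 0.0, and no comments gets 0.5 + (1.0 * 3) + 0 = 3.5, with no multiplier applied.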

Risk Level Classification

Finally, a risk score is translated into a human-readable label based on these thresholds:

HIGH: Score is 5.0 or greater.
MEDIUM: Score is at least 3.0 and less than 5.0.
LOW: Score is at least 1.0 and less than 3.0.
NONE: Score is less than 1.0.
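
The thresholds map to a simple cascade of comparisons. The sketch below assumes the bands are contiguous, so a score that falls between the listed ranges (such as 4.995) is assigned to the lower label; the function name is illustrative.

```python
def risk_level(score):
    """Translate a numeric risk score into a label using the thresholds above."""
    if score >= 5.0:
        return "HIGH"
    if score >= 3.0:
        return "MEDIUM"
    if score >= 1.0:
        return "LOW"
    return "NONE"
```

Checking from the highest threshold downward means each comparison only needs a lower bound.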