Detecting *bad* words in text

Does anyone have a good way to parse through the words of a text a user has typed in and look for bad words in it?
Thanks,
Gerry
Gerry -

There is no good way to do this. Not because it's OutSystems, but because there is no good way to do it, period.

The word "assume", for example, presents all sorts of issues for such things. Do you want "assume" to be banned? Or do you want the filter to be defeated just by taking out a space between the next word?

J.Ja
Hi Gerry,

As Justin has put it, it is no overstatement to say that this is a truly complex problem, whatever the development language / platform. Most likely you'll become yet another victim of the clbuttic Scunthorpe problem.

That being said, you can indeed detect bad words if you pay attention to word boundaries (which you can easily do using regular expressions). But this means such a filter can be easily circumvented by alternate spellings (e.g. b-a-d instead of bad), so you shouldn't rely on it to be certain about the presence or absence of bad words in a text.
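
To illustrate both points, here's a minimal sketch in Java ("bad" and the sample strings are just made-up placeholders): anchoring the match at word boundaries avoids the substring false positives, but an alternate spelling still slips through.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        // "bad" stands in for a real blocked word; all strings here are made-up examples.
        Pattern naive = Pattern.compile("bad");          // plain substring match
        Pattern bounded = Pattern.compile("\\bbad\\b");  // anchored at word boundaries

        String falsePositive = "Sinbad sailed away";     // contains "bad" only as a substring
        String evasion = "that was b-a-d";               // alternate spelling

        System.out.println(naive.matcher(falsePositive).find());   // true  -> Scunthorpe-style false positive
        System.out.println(bounded.matcher(falsePositive).find()); // false -> word boundaries fix that
        System.out.println(bounded.matcher(evasion).find());       // false -> but the filter is evaded
    }
}
```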

Besides, because some words are only bad depending on the context, you shouldn't perform automatic substitutions or other silent actions using their apparent presence as a criterion.

Cheers,
Miguel
Thanks. I know that this is difficult and it's impossible to catch every case, but I need some initial scattergun method in this application I'm writing to do a basic test. My thought is to use the Text extension to split out the words and then compare them one by one to a list of bad words in a database table. I'm worried about performance and wonder if there is a better way.
Thanks!
Hi Gerry,

You can also use regular expressions of the form \bword\b, where \b is a regular expression anchor meaning the match must occur at a word boundary. This anchor is supported by all major regular expression implementations, including .NET, JavaScript and Java.
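
For instance, a small Java sketch (the word list is purely illustrative) that folds a list of bad words into a single \b(?:word1|word2)\b pattern and checks a text against it:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BadWordRegex {
    // Placeholder list; in practice this would come from wherever you keep your bad words.
    private static final List<String> BAD_WORDS = List.of("darn", "heck");

    // One pattern of the form \b(?:word1|word2|...)\b, matched case-insensitively.
    private static final Pattern BAD_WORD_PATTERN = Pattern.compile(
            "\\b(?:" + BAD_WORDS.stream()
                    .map(Pattern::quote)              // treat any regex metacharacters literally
                    .collect(Collectors.joining("|")) + ")\\b",
            Pattern.CASE_INSENSITIVE);

    public static boolean containsBadWord(String text) {
        return BAD_WORD_PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBadWord("What the HECK?")); // true
        System.out.println(containsBadWord("Checking in"));    // false: "heck" is embedded in another word
    }
}
```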

Whether throwing regular expressions at the text is faster than splitting the text into words and matching them one by one, or splitting it into words, putting them in a hash set and checking the set for the presence of certain words... I'm guessing that will depend on the size of the texts you will parse versus the number of bad words you want to search for.

If you don't have many bad words to match against, using regular expressions might be faster.
If you have many words but the text won't be gigantic, splitting the text might be better.
If both can grow a lot, splitting the text into words and putting them in a hash set should be better (see the sketch below).
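
As a rough sketch of that last option (in Java, with a placeholder word list standing in for your database table): split the text into tokens and look each one up in a hash set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class BadWordSet {
    // Placeholder list; in the real application these would be loaded from the database table.
    private static final Set<String> BAD_WORDS = new HashSet<>(Arrays.asList("darn", "heck"));

    public static boolean containsBadWord(String text) {
        // Lower-case the text and split on runs of non-letter/non-digit characters,
        // then look each token up in the set (constant time per lookup on average).
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (BAD_WORDS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsBadWord("What the heck?")); // true
        System.out.println(containsBadWord("Scunthorpe"));     // false: no whole-word match
    }
}
```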

In any case you should test whether this poses a real performance issue for you, and test against some real user input.

Cheers,
Miguel