7 changes: 7 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -199,3 +199,10 @@
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

-------------------------------------------------------------------------------

This project also includes code under the MIT License:

Portions of this software are © 2015 Rob Renaud, licensed under the MIT License.
See LICENSES/MIT.txt for the full license text.
21 changes: 21 additions & 0 deletions LICENSES/MIT.md
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2015 Rob Renaud

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
5 changes: 5 additions & 0 deletions assets/bad.txt
@@ -0,0 +1,5 @@
zxcvwerjasc
nmnjcviburili,<>
zxcvnadtruqe
ertrjiloifdfyyoiu
grty iuewdiivjh
128,538 changes: 128,538 additions & 0 deletions assets/big.txt

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions assets/good.txt
@@ -0,0 +1,6 @@
rob
two models
some long sentence, might suck?
Project Gutenberg
a b c
HTTP GET
70 changes: 70 additions & 0 deletions cmd/README.md
@@ -0,0 +1,70 @@
# Go gibberish

This program is a Go-powered version of <https://github.com/rrenaud/Gibberish-Detector>.

It uses a training file to build a model, which is then used to check whether a string is likely to be gibberish or not.

## How it works

In the words of rrenaud, the creator of the original Python algorithm:

> It uses a 2 character markov chain.
>
> The markov chain first 'trains' or 'studies' a few MB of English text, recording how often characters appear next to each other. Eg, given the text "Rob likes hacking" it sees Ro, ob, o[space], [space]l, ... It just counts these pairs. After it has finished reading through the training data, it normalizes the counts. Then each character has a probability distribution of 27 followup character (26 letters + space) following the given initial.
>
>So then given a string, it measures the probability of generating that string according to the summary by just multiplying out the probabilities of the adjacent pairs of characters in that string. EG, for that "Rob likes hacking" string, it would compute prob['r']['o'] * prob['o']['b'] * prob['b'][' '] ... This probability then measures the amount of 'surprise' assigned to this string according the data the model observed when training. If there is funny business with the input string, it will pass through some pairs with very low counts in the training phase, and hence have low probability/high surprise.
>
>I then look at the amount of surprise per character for a few known good strings, and a few known bad strings, and pick a threshold between the most surprising good string and the least surprising bad string. Then I use that threshold whenever to classify any new piece of text.
>
>Peter Norvig, the director of Research at Google, has this nice talk about "The unreasonable effectiveness of data" here, <http://www.youtube.com/watch?v=9vR8Vddf7-s>. This insight is really not to try to do something complicated, just write a small program that utilizes a bunch of data and you can do cool things.
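
The idea in the quote above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the code from this repository: the names `alphabet`, `normalize`, `train` and `score` are hypothetical, and starting every pair count at 1 is an assumed smoothing choice so that unseen pairs keep a small non-zero probability.

```go
package main

import (
	"math"
	"strings"
)

// alphabet holds the 27 characters mentioned in the quote (26 letters + space).
const alphabet = "abcdefghijklmnopqrstuvwxyz "

// normalize lowercases s and drops every rune outside the alphabet.
func normalize(s string) []rune {
	var out []rune
	for _, r := range strings.ToLower(s) {
		if strings.ContainsRune(alphabet, r) {
			out = append(out, r)
		}
	}
	return out
}

// train counts adjacent character pairs in corpus and returns a matrix of
// log-probabilities. Counts start at 1 (assumed smoothing, see lead-in).
func train(corpus string) [][]float64 {
	n := len(alphabet)
	counts := make([][]float64, n)
	for i := range counts {
		counts[i] = make([]float64, n)
		for j := range counts[i] {
			counts[i][j] = 1
		}
	}
	runes := normalize(corpus)
	for i := 0; i+1 < len(runes); i++ {
		a := strings.IndexRune(alphabet, runes[i])
		b := strings.IndexRune(alphabet, runes[i+1])
		counts[a][b]++
	}
	// Turn each row into a probability distribution, stored as logs.
	for i := range counts {
		total := 0.0
		for _, c := range counts[i] {
			total += c
		}
		for j := range counts[i] {
			counts[i][j] = math.Log(counts[i][j] / total)
		}
	}
	return counts
}

// score returns the average per-transition probability of s under the model.
func score(s string, logProb [][]float64) float64 {
	runes := normalize(s)
	sum, n := 0.0, 0
	for i := 0; i+1 < len(runes); i++ {
		a := strings.IndexRune(alphabet, runes[i])
		b := strings.IndexRune(alphabet, runes[i+1])
		sum += logProb[a][b]
		n++
	}
	if n == 0 {
		return 1 // no transitions to score
	}
	return math.Exp(sum / float64(n))
}
```

With a model trained on ordinary English text, `score` should come out higher for English-like input than for keyboard mashing, which is what makes a single threshold usable.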

## How to use it

Train the model by calling `training.TrainModel`, then call `gibberish.IsGibberish` to check whether a string is gibberish.
In case you decide to us

```go
package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/grafana/clusterurl/pkg/consts"
	"github.com/grafana/clusterurl/pkg/gibberish"
	"github.com/grafana/clusterurl/pkg/persistence"
	"github.com/grafana/clusterurl/pkg/training"
)

var performTraining bool

func main() {
	flag.BoolVar(&performTraining, "train", false, "train")
	flag.Parse()

	if performTraining {
		// Train the model and exit.
		err := training.TrainModel(consts.AcceptedCharacters, "big.txt", "good.txt", "bad.txt", "knowledge.json")
		if err != nil {
			log.Fatal(err)
		}
		return
	}

	data, err := persistence.LoadKnowledgeBase("knowledge.json")
	if err != nil {
		log.Fatal(err)
	}

	reader := bufio.NewReader(os.Stdin)
	for {
		fmt.Print("Insert something to check: ")
		input, err := reader.ReadString('\n')
		if err != nil {
			// EOF or read failure: stop instead of looping forever.
			break
		}
		input = strings.TrimSpace(input)
		fmt.Printf("Input: %s: is gibberish? %v\n\n", input, gibberish.IsGibberish(input, data))
	}
}
```

## Credits

Thanks once again to [rrenaud](https://github.com/rrenaud) for the original algorithm.

A huge thank you goes to [domef](https://github.com/domef) as well, for helping me translate the algorithm.

58 changes: 58 additions & 0 deletions cmd/main.go
@@ -0,0 +1,58 @@
// Originally from go-gibberish
// Copyright (c) 2015 Rob Renaud
// Licensed under the MIT License. See LICENSES/MIT.txt.
//
// Modifications copyright (c) 2025 Grafana Labs
// Licensed under the Apache License, Version 2.0.

package main

import (
"bufio"
"flag"
"fmt"
"log"
"os"
"strings"

"github.com/grafana/clusterurl/pkg/consts"
"github.com/grafana/clusterurl/pkg/gibberish"
"github.com/grafana/clusterurl/pkg/persistence"
"github.com/grafana/clusterurl/pkg/training"
)

var (
performTraining bool
)

func main() {
	flag.BoolVar(&performTraining, "train", false, "train")
	flag.Parse()

	if performTraining {
		// Train the model and exit.
		err := training.TrainModel(consts.AcceptedCharacters, "assets/big.txt", "assets/good.txt", "assets/bad.txt", "pkg/clusterurl/model.json")
		if err != nil {
			log.Fatal(err)
		}
		return
	}

	data, err := persistence.LoadKnowledgeBase("pkg/clusterurl/model.json")
	if err != nil {
		log.Fatal(err)
	}

	reader := bufio.NewReader(os.Stdin)
	for {
		fmt.Print("Insert something to check: ")
		input, err := reader.ReadString('\n')
		if err != nil {
			// EOF or read failure: stop instead of looping forever.
			break
		}
		input = strings.TrimSpace(input)
		fmt.Printf("Input: %s: is gibberish? %v\n\n", input, gibberish.IsGibberish(input, data))
	}
}
8 changes: 5 additions & 3 deletions go.mod
@@ -3,10 +3,12 @@ module github.com/grafana/clusterurl
go 1.20

require (
github.com/AlessandroPomponio/go-gibberish v0.0.0-20191004143433-a2d4156f0396 // indirect
github.com/hashicorp/golang-lru/v2 v2.0.7
github.com/stretchr/testify v1.8.4
)

require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/stretchr/testify v1.8.4 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
3 changes: 1 addition & 2 deletions go.sum
@@ -1,5 +1,3 @@
github.com/AlessandroPomponio/go-gibberish v0.0.0-20191004143433-a2d4156f0396 h1:cKIHT8I2mrmw/VgdyNeACP/AvetK8AgGsiRfOC3ZjmQ=
github.com/AlessandroPomponio/go-gibberish v0.0.0-20191004143433-a2d4156f0396/go.mod h1:2VCDG9kHYQ5vfYUqeoB7foVlcvIvB7rp9LxTELLD1qU=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/hashicorp/golang-lru/v2 v2.0.7 h1:a+bsQ5rvGLjzHuww6tVxozPZFVghXaHOwFs4luLUK2k=
@@ -8,6 +6,7 @@ github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZb
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
125 changes: 125 additions & 0 deletions pkg/analysis/analysis.go
@@ -0,0 +1,125 @@
// Originally from go-gibberish
// Copyright (c) 2015 Rob Renaud
// Licensed under the MIT License. See LICENSES/MIT.txt.
//
// Modifications copyright (c) 2025 Grafana Labs
// Licensed under the Apache License, Version 2.0.

// Package analysis contains the functions needed to
// analyze lines.
package analysis

import (
"fmt"
"math"
"strings"

"github.com/grafana/clusterurl/pkg/consts"
"github.com/grafana/clusterurl/pkg/structs"
)

// AverageTransitionProbability returns the geometric mean of the
// transition probabilities of the digraphs in line, treating the
// entries of the occurrences matrix as log-probabilities. It returns
// -1 and an error if a rune has no entry in position.
func AverageTransitionProbability(line string, occurrences [][]float64, position map[rune]int) (float64, error) {
	logProb := 0.0
	transitionCt := 0.0

	for _, pair := range GetDigraphs(line) {
		firstPosition, firstRuneFound := position[pair.First]
		if !firstRuneFound {
			return -1, fmt.Errorf("AverageTransitionProbability: unable to find the position of the rune %s", string(pair.First))
		}

		secondPosition, secondRuneFound := position[pair.Second]
		if !secondRuneFound {
			// Report the rune that was actually missing (the second one).
			return -1, fmt.Errorf("AverageTransitionProbability: unable to find the position of the rune %s", string(pair.Second))
		}

		logProb += occurrences[firstPosition][secondPosition]
		transitionCt++
	}

	// Avoid dividing by zero when the line has no digraphs.
	if transitionCt == 0 {
		transitionCt = 1
	}

	return math.Exp(logProb / transitionCt), nil
}

// GetDigraphs returns pairs of adjacent runes, after
// normalizing the input line.
func GetDigraphs(line string) []structs.Digraph {

runes := Normalize(line)
if len(runes) == 0 {
return []structs.Digraph{}
}

digraphs := make([]structs.Digraph, len(runes)-1)
for i := 0; i < len(runes)-1; i++ {
digraphs[i] = structs.Digraph{First: runes[i], Second: runes[i+1]}
}

return digraphs

}

// Normalize lowercases the line and returns only the
// runes that are in the accepted characters. This helps
// keep the model relatively small by ignoring
// punctuation, symbols, etc.
func Normalize(line string) []rune {

line = strings.ToLower(line)
result := make([]rune, 0, len(line))

for _, r := range line {

if strings.ContainsRune(consts.AcceptedCharacters, r) {
result = append(result, r)
}

}

return result

}

// MaxForSlice returns the maximum value in a
// float64 slice.
func MaxForSlice(slice []float64) float64 {

max := -math.MaxFloat64
for _, item := range slice {

if item > max {
max = item
}

}

return max

}

// MinForSlice returns the minimum value in
// a float64 slice.
func MinForSlice(slice []float64) float64 {

min := math.MaxFloat64
for _, item := range slice {

if item < min {
min = item
}

}

return min

}
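
Read as a formula, the value returned by `AverageTransitionProbability` earlier in this file is a geometric mean. This assumes the entries of `occurrences` are natural-log transition probabilities, which the sum-then-`math.Exp` structure suggests but this diff does not state outright. The symbols $c_i$ (the normalized runes of the line) and $P$ are notation introduced here for illustration.

```latex
\[
\exp\!\left( \frac{1}{n-1} \sum_{i=1}^{n-1} \log P(c_{i+1} \mid c_i) \right)
= \left( \prod_{i=1}^{n-1} P(c_{i+1} \mid c_i) \right)^{\frac{1}{n-1}}
\]
```

For $n \le 1$ there are no transitions, so the function divides by 1 instead and returns $\exp(0) = 1$.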
4 changes: 2 additions & 2 deletions clusterurl/clusterurl.go → pkg/clusterurl/clusterurl.go
@@ -6,8 +6,8 @@ import (
"fmt"
"os"

"github.com/AlessandroPomponio/go-gibberish/gibberish"
"github.com/AlessandroPomponio/go-gibberish/structs"
"github.com/grafana/clusterurl/pkg/gibberish"
"github.com/grafana/clusterurl/pkg/structs"
lru "github.com/hashicorp/golang-lru/v2"
)

File renamed without changes.
File renamed without changes.
File renamed without changes.
16 changes: 16 additions & 0 deletions pkg/consts/accepted_characters.go
@@ -0,0 +1,16 @@
// Originally from go-gibberish
// Copyright (c) 2015 Rob Renaud
// Licensed under the MIT License. See LICENSES/MIT.txt.
//
// Modifications copyright (c) 2025 Grafana Labs
// Licensed under the Apache License, Version 2.0.

// Package consts contains constants.
package consts

const (

// AcceptedCharacters is a string with all the letters
// in the English alphabet, plus a space.
AcceptedCharacters = "abcdefghijklmnopqrstuvwxyz "
)
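
`AverageTransitionProbability` takes a `position map[rune]int` that maps each rune to its row and column in the matrix. How that map is built is not shown in this diff; one plausible sketch, assuming positions simply follow the order of `AcceptedCharacters`, is below. The helper name `buildPositions` is hypothetical.

```go
package main

// buildPositions maps every rune in accepted to its index in the string,
// counting runes rather than bytes so it also works for non-ASCII input.
func buildPositions(accepted string) map[rune]int {
	positions := make(map[rune]int, len(accepted))
	i := 0
	for _, r := range accepted {
		positions[r] = i
		i++
	}
	return positions
}
```

Under that assumption, `'a'` maps to 0 and the space character to 26 for the 27-character alphabet above.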
22 changes: 22 additions & 0 deletions pkg/gibberish/gibberish.go
@@ -0,0 +1,22 @@
// Originally from go-gibberish
// Copyright (c) 2015 Rob Renaud
// Licensed under the MIT License. See LICENSES/MIT.txt.
//
// Modifications copyright (c) 2025 Grafana Labs
// Licensed under the Apache License, Version 2.0.

// Package gibberish contains methods to tell whether
// the input is gibberish or not.
package gibberish

import (
"github.com/grafana/clusterurl/pkg/analysis"
"github.com/grafana/clusterurl/pkg/structs"
)

// IsGibberish returns true if the input string is likely
// to be gibberish. It returns false if the input cannot
// be scored.
func IsGibberish(input string, data *structs.GibberishData) bool {
	value, err := analysis.AverageTransitionProbability(input, data.Occurrences, data.Positions)
	return err == nil && value <= data.Threshold
}
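
`IsGibberish` compares the score against `data.Threshold`. The README describes that threshold as lying between the most surprising good string and the least surprising bad string; the sketch below shows one way to pick it. Taking the midpoint is an assumption, and `pickThreshold` is a hypothetical helper, not a function from this package.

```go
package main

// pickThreshold returns the midpoint between the lowest score among the
// known-good samples and the highest score among the known-bad samples.
// The boolean is false when either slice is empty or the two groups
// overlap, in which case no clean cut-off exists.
func pickThreshold(goodScores, badScores []float64) (float64, bool) {
	if len(goodScores) == 0 || len(badScores) == 0 {
		return 0, false
	}
	minGood := goodScores[0]
	for _, s := range goodScores[1:] {
		if s < minGood {
			minGood = s
		}
	}
	maxBad := badScores[0]
	for _, s := range badScores[1:] {
		if s > maxBad {
			maxBad = s
		}
	}
	if minGood <= maxBad {
		return 0, false
	}
	return (minGood + maxBad) / 2, true
}
```

Reporting overlap explicitly, rather than returning a threshold anyway, makes it visible when the training corpus or the sample files cannot separate the two groups.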