2020-11-13 12:37:47

Tags: handwritten algorithms series, Python

I'm currently doing some work related to anomaly detection, and I participated in the development of the CBLOF algorithm inside pyod. So I'm very familiar with both CBLOF and its pyod implementation.

This is part of my handwritten-algorithms series of articles. If you're interested, study the code; if you're short on time, you can also skip the code. Even if you don't want to read the code, I still suggest reading the article: because I implemented the code myself, I understood these algorithms before writing about them, and the implementation lets me plot and print the intermediate steps, as in this article.

Anomaly detection is generally an unsupervised learning task: there are no labels, and a variety of algorithms are used to separate normal points from outliers.

Of course, if labels are available, we can solve anomaly detection with a classification algorithm. Although that also detects anomalies, such an algorithm is still a classification algorithm; only unsupervised methods are generally considered anomaly detection algorithms.

Anomaly detection algorithms range from traditional statistical methods, to machine learning, to deep learning; there is something for everyone.

Many anomaly detection algorithms are built on top of existing machine learning algorithms. For example, the KNN-based anomaly detector looks at the distances from each point to its k nearest neighbors, then flags the few points with the largest distances as anomalies.
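As a quick illustration (this is my sketch, not code from the article), a minimal KNN-style anomaly score can be computed with scikit-learn's `NearestNeighbors`, using the distance to the k-th nearest neighbor as the score:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
# 100 normal points around the origin, plus 1 obvious outlier at (8, 8)
X = np.vstack([rng.randn(100, 2), [[8.0, 8.0]]])

k = 5
# k + 1 neighbors because each point counts itself as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]  # distance to the k-th nearest neighbor

print(np.argmax(scores))  # the outlier (index 100) has the largest score
```

The few points with the highest `scores` are the anomalies.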

CBLOF, which we're going to study today, is also an anomaly detection algorithm built on another machine learning algorithm. The "based" part is the B in the name CBLOF; since it is built on a clustering algorithm, it is Cluster-Based. The three letters LOF stand for Local Outlier Factor. Put together, CBLOF is the Cluster-Based Local Outlier Factor.

Its basic intuition is this: data can concentrate in several different places, forming clusters. The closer a point is to a large cluster, the more likely it is to be normal, and vice versa. So inside CBLOF we plug in a clustering algorithm to obtain these clusters. Typically we use the k-means algorithm, with plain Euclidean distance. That gives us the four clusters in the picture above. We can clearly see that C2 and C4 are two large clusters, while the remaining C1 and C3 are two small ones. The closer a point is to the center of C2 or C4 (the k-means center, also called a centroid), the more normal it is; the farther away, the more anomalous.

Now let's walk through the algorithm in detail with a practical example.

We use sales and profit data here. You can find it here:

https://pyod.readthedocs.io/en/latest/_modules/pyod/models/cblof.html

As we can see, most of the points are concentrated in a small region at the middle left, with only a few points scattered across other areas.

As mentioned, the first step of this algorithm is clustering, so let's cluster the points.

Before clustering, let's normalize the data first. This is a very necessary step, especially when the variables have different ranges.
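As a sketch (the article's own preprocessing code is omitted), min-max normalization rescales every column to the [0, 1] range, so that sales and profit contribute comparably to the Euclidean distances used by k-means:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins)

# toy sales/profit rows standing in for the real data
data = np.array([[100.0, 5.0],
                 [500.0, 20.0],
                 [300.0, 50.0]])
scaled = min_max_scale(data)
print(scaled.min(axis=0), scaled.max(axis=0))  # each column now spans [0, 1]
```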

(Due to length limits, most of the code is not pasted here; the code address is given at the end of the article.)

Then we simply run k-means clustering.

```
from sklearn.cluster import KMeans

km = KMeans(n_clusters=8)
km.fit(X)
```

After clustering, we get the figure below, with each cluster distinguished by color.

Let me ask you a question: which cluster is the largest?

I hope you got it right: the blue cluster, which looks smallest, is actually the largest. The smallest cluster is the purple one; although it covers a large area, it contains only one point. So here, size is measured by the number of members in a cluster.
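The snippets below refer to a `df_cluster_sizes` table that the omitted code builds. Assuming the usual pandas approach, it can be constructed from the k-means labels like this (my sketch, not the article's exact code):

```python
import numpy as np
import pandas as pd

# example labels standing in for km.labels_
labels = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3])

# count members per cluster; value_counts sorts largest first,
# which is exactly the ordering the partition step needs
df_cluster_sizes = (
    pd.Series(labels)
    .value_counts()
    .rename_axis('cluster')
    .reset_index(name='size')
)
print(df_cluster_sizes)
```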

What about the other clusters: which ones are large, and which ones are small?

Let's first define $|C|$ as the size of a cluster $C$. Sorting the clusters by size, we have:

$|C_{1}| \ge |C_{2}| \ge |C_{3}| \ge \dots \ge |C_{k}|$

Now we split the clusters into large and small ones according to two principles. The first is the absolute-majority principle: starting from the largest cluster, we add up their sizes one by one until the running total reaches an absolute majority of the total size. The "absolute majority" is a tunable parameter $\alpha$, ranging from 0.5 to 1, usually 0.9.

The second principle is the sudden drop, measured as a multiple $\beta$. The default value is 5, meaning a sudden 5-fold drop in size between consecutive clusters.
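Putting the two principles together, and following the definitions in He et al.'s original paper, the boundary index $b$ between large and small clusters satisfies:

```latex
\sum_{i=1}^{b} |C_i| \ge \alpha \cdot n
\qquad \text{and} \qquad
\frac{|C_b|}{|C_{b+1}|} \ge \beta
```

where $n$ is the total number of points; $C_1, \dots, C_b$ are then the large clusters and the rest are the small ones.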

**pyod** handles it this way: the best case is to find a boundary that satisfies both conditions. The next best is a boundary that satisfies only the absolute-majority principle. The fallback is a boundary that satisfies only the sudden-drop principle.

Here is my simple implementation .

```
large_clusters = []
small_clusters = []
found_b = False
count = 0
clusters = df_cluster_sizes['cluster'].values
n_clusters = len(clusters)
sizes = df_cluster_sizes['size'].values
for i in range(n_clusters):
    print(f"----------- iteration {i} --------------")
    satisfy_alpha = False
    satisfy_beta = False
    if found_b:
        # the boundary has been found; everything after it is a small cluster
        small_clusters.append(clusters[i])
        continue
    count += sizes[i]
    print(count)
    if count > n_points_in_large_clusters:
        satisfy_alpha = True
    if i < n_clusters - 1:  # guard: no next cluster on the last iteration
        print(sizes[i] / sizes[i + 1])
        if sizes[i] / sizes[i + 1] > beta:
            print("beta")
            satisfy_beta = True
    print(satisfy_alpha, satisfy_beta)
    if satisfy_alpha and satisfy_beta:
        found_b = True
    large_clusters.append(clusters[i])
```

We color the large and small clusters separately: points in the large clusters are more likely to be normal, and points in the small clusters are more likely to be outliers.

Next we compute the CBLOF factor, that is, the distance from each point to the nearest large cluster.

If a point belongs to a large cluster, we directly compute its distance to that cluster's center. If it does not, we compute its distance to the center of every large cluster and take the smallest one as its factor.

```
def get_distance(a, b):
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def decision_function(X, labels):
    n = len(labels)
    distances = []
    for i in range(n):
        p = X[i]
        label = labels[i]
        if label in large_clusters:
            # point belongs to a large cluster: distance to its own center
            center = km.cluster_centers_[label]
            d = get_distance(p, center)
        else:
            # point belongs to a small cluster: distance to the
            # nearest large-cluster center
            d = None
            for center in large_cluster_centers:
                d_temp = get_distance(p, center)
                if d is None:
                    d = d_temp
                elif d_temp < d:
                    d = d_temp
        distances.append(d)
    distances = np.array(distances)
    return distances

distances = decision_function(X, km.labels_)
```

Here we need to introduce the concept of *contamination*, the main parameter of anomaly detection algorithms. It simply means the expected proportion of anomalies. A common value is 1%, and we use that default here.

With contamination at 1%, we can use numpy's percentile function to find the 1% of points with the largest distances and label them as anomalies.

```
threshold=np.percentile(distances, 99)
print(f"threshold is {threshold}")
anomaly_labels = (distances>threshold)*1
```

With that, we have our outliers. I printed my results and compared the plot against the one produced by CBLOF inside pyod: the two are identical.

(The code above is adapted from Susan Li's article.)

https://github.com/EricWebsmith/machine_learning_from_scrach

If you like this series, you can star my GitHub project. Every algorithm is implemented on GitHub first, and when I have time I write an article about it on CSDN. At the moment there are quite a few implemented algorithms still waiting for articles, including XGBoost, CatBoost, Zhou's Isolation Forest, and more. For the Isolation Forest algorithm in particular, as with this article, I put a lot of work into the figures.

He, Zengyou, Xiaofei Xu, and Shengchun Deng. “Discovering cluster-based local outliers.” Pattern Recognition Letters 24.9-10 (2003): 1641-1650.

https://github.com/yzhao062/pyod

Susan Li, "Anomaly Detection for Dummies", on towardsdatascience.com
