使用亚马逊 Neptune 图数据库查找可疑登录

huangapple go评论45阅读模式
英文:

Finding suspicious logins using amazon neptune graph database

问题

我有一个加载在Amazon Neptune中的图形结构,如下所示:

问题描述: 我想找到对帐户进行可疑登录的情况。

如何实现: 理想情况下,一个帐户应该只能从一个设备访问。如果它被从多个设备访问,那么其中一个登录就是可疑的。但是当客户获得新设备时,这个规则就不再成立了。为了提高准确性,我们可以说,如果后续访问的设备距离之前访问的设备至少1公里,那么这就变得可疑。

我正在尝试使用Gremlin查询语言编写相同的查询,但还没有走得太远,正在寻找一个查询,以查找帐户和设备,其中帐户被从多台设备访问,而这些设备之间至少相隔100公里。

编辑:

使用以下命令创建的示例图形:

g.addV("TEST_LOCATION").property(id, "location1").property("latitude", 10.855641295031102).property("longitude", 76.62664361108531).next()
g.addV("TEST_DEVICE").property(id, "device1").next()
g.addE("TEST_ACCESSED").from(__.V('location1')).to(__.V('device1')).property(id, "accessed1").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2023658452").property("accountNumber", "2023658452").next()
g.addE("TEST_LOGGED_IN").from(__.V('device1')).to(__.V('2023658452')).property(id, "login1").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location2").property("latitude", 10.856369018807927).property("longitude", 76.62525939933491).next()
g.addV("TEST_DEVICE").property(id, "device2").next()
g.addE("TEST_ACCESSED").from(__.V('location2')).to(__.V('device2')).property(id, "accessed2").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()
g.addE("TEST_LOGGED_IN").from(__.V('device2')).to(__.V('2023658452')).property(id, "login2").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()


g.addV("TEST_LOCATION").property(id, "location3").property("latitude", 12.9578933).property("longitude", 77.7446321).next()
g.addV("TEST_DEVICE").property(id, "device3").next()
g.addE("TEST_ACCESSED").from(__.V('location3')).to(__.V('device3')).property(id, "accessed3").property("loggedInAt", datetime('2022-11-29T11:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2025487996").property("accountNumber", "2025487996").next()
g.addE("TEST_LOGGED_IN").from(__.V('device3')).to(__.V('2025487996')).property(id, "login3").property("loggedInAt", datetime('2022-11-29T11:41:02')).next()


g.addV("TEST_LOCATION").property(id, "location4").property("latitude", 37.4219983).property("longitude", -122.084).next()
g.addV("TEST_DEVICE").property(id, "device4").next()
g.addE("TEST_ACCESSED").from(__.V('location4')).to(__.V('device4')).property(id, "accessed4").property("loggedInAt", datetime('2022-11-29T12:41:02')).next()
g.addE("TEST_LOGGED_IN").from(__.V('device4')).to(__.V('2025487996')).property(id, "login4").property("loggedInAt", datetime('2022-11-29T12:41:02')).next()


g.addV("TEST_LOCATION").property(id, "location5").property("latitude", 10.85569589).property("longitude", 76.626646584).next()
g.addV("TEST_DEVICE").property(id, "device5").next()
g.addE("TEST_ACCESSED").from(__.V('location5')).to(__.V('device5')).property(id, "accessed5").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_LOCATION").property(id, "location6").property("latitude", 10.85569965).property("longitude", 76.626646745).next()
g.addE("TEST_ACCESSED").from(__.V('location6')).to(__.V('device5')).property(id, "accessed6").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2004589564").property("accountNumber", "2004589564").next()
g.addE("TEST_LOGGED_IN").from(__.V('device5')).to(__.V('2004589564')).property(id, "login5").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addE("TEST_LOGGED_IN").from(__.V('device5')).to(__.V('2004589564')).property(id, "login6").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location7").property("latitude", 10.85569951).property("longitude", 76.626646951).next()
g.addV("TEST_DEVICE").property(id, "device6").next()
g.addE("TEST_ACCESSED").from(__.V('location7')).to(__.V('device6')).property(id, "accessed7").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()
g.addE("TEST_LOGGED_IN").from(__.V('device6')).to(__.V('2004589564')).property(id, "login7").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()

期望的输出:

帐号号码:2025487996
从设备访

<details>
<summary>英文:</summary>

I have a below structure of graph loaded in amazon neptune.
[![Below is graph description][1]][1]


  [1]: https://i.stack.imgur.com/QxiQL.jpg

**In words graph can be described as:** Device is getting accessed from location and Device is logging into the account.

**Problem Statement:** I want to find suspecious logins to the account.

**How?:** Ideally one account should get accessed from only one device. If it is getting accessed from more than 1 device then one of the login is suspicious. But this becomes false when customer gets a new device. To increase accuracy we can say that if device which is accessed later is atleast 1km away from device which was accessed earlier. Then it becomes suspicious.

I am trying to write the query for the same in gremlin query language but haven&#39;t gone very far and **looking for a query:** to find account and devices where account was accessed from more than one device and both the devices are atleast 100km away from each other.


**EDIT:**

Sample graph created using below commands:

g.addV("TEST_LOCATION").property(id, "location1").property("latitude", 10.855641295031102).property("longitude", 76.62664361108531).next()
g.addV("TEST_DEVICE").property(id, "device1").next()
g.addE("TEST_ACCESSED").from(.V('location1')).to(.V('device1')).property(id, "accessed1").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2023658452").property("accountNumber", "2023658452").next()
g.addE("TEST_LOGGED_IN").from(.V('device1')).to(.V('2023658452')).property(id, "login1").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location2").property("latitude", 10.856369018807927).property("longitude", 76.62525939933491).next()
g.addV("TEST_DEVICE").property(id, "device2").next()
g.addE("TEST_ACCESSED").from(.V('location2')).to(.V('device2')).property(id, "accessed2").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()
g.addE("TEST_LOGGED_IN").from(.V('device2')).to(.V('2023658452')).property(id, "login2").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location3").property("latitude", 12.9578933).property("longitude", 77.7446321).next()
g.addV("TEST_DEVICE").property(id, "device3").next()
g.addE("TEST_ACCESSED").from(.V('location3')).to(.V('device3')).property(id, "accessed3").property("loggedInAt", datetime('2022-11-29T11:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2025487996").property("accountNumber", "2025487996").next()
g.addE("TEST_LOGGED_IN").from(.V('device3')).to(.V('2025487996')).property(id, "login3").property("loggedInAt", datetime('2022-11-29T11:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location4").property("latitude", 37.4219983).property("longitude", -122.084).next()
g.addV("TEST_DEVICE").property(id, "device4").next()
g.addE("TEST_ACCESSED").from(.V('location4')).to(.V('device4')).property(id, "accessed4").property("loggedInAt", datetime('2022-11-29T12:41:02')).next()
g.addE("TEST_LOGGED_IN").from(.V('device4')).to(.V('2025487996')).property(id, "login4").property("loggedInAt", datetime('2022-11-29T12:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location5").property("latitude", 10.85569589).property("longitude", 76.626646584).next()
g.addV("TEST_DEVICE").property(id, "device5").next()
g.addE("TEST_ACCESSED").from(.V('location5')).to(.V('device5')).property(id, "accessed5").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_LOCATION").property(id, "location6").property("latitude", 10.85569965).property("longitude", 76.626646745).next()
g.addE("TEST_ACCESSED").from(.V('location6')).to(.V('device5')).property(id, "accessed6").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addV("TEST_ACCOUNT").property(id, "2004589564").property("accountNumber", "2004589564").next()
g.addE("TEST_LOGGED_IN").from(.V('device5')).to(.V('2004589564')).property(id, "login5").property("loggedInAt", datetime('2022-11-29T09:41:02')).next()
g.addE("TEST_LOGGED_IN").from(.V('device5')).to(.V('2004589564')).property(id, "login6").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()

g.addV("TEST_LOCATION").property(id, "location7").property("latitude", 10.85569951).property("longitude", 76.626646951).next()
g.addV("TEST_DEVICE").property(id, "device6").next()
g.addE("TEST_ACCESSED").from(.V('location7')).to(.V('device6')).property(id, "accessed7").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()
g.addE("TEST_LOGGED_IN").from(.V('device6')).to(.V('2004589564')).property(id, "login7").property("loggedInAt", datetime('2022-11-29T10:41:02')).next()



Expected Output:

Account No: 2025487996
Accessed from device: device3 From location: location3
Accessed from device: device4 From location: location4
Distance between these devices is: 14040 km


**Explanation of output:** Account was accessed by multiple devices from different locations. We wanted only those devices and accounts which were accessed from locations with distance 100 km away from each other.

</details>


# 答案1
**得分**: 1

感谢添加示例图表。这使得编写答案变得更容易。

首先,查找访问帐户的设备和位置就像这样简单:

```java
gremlin> g.V('2025487996').in('TEST_LOGGED_IN').in('TEST_ACCESSED').path().by(id)

==> [2025487996, device4, location4]
==> [2025487996, device3, location3]

有了这个基础,我们可以提取坐标并将结果格式化得更漂亮一些。

gremlin> g.V('2025487996').as('ac').
......1>   in('TEST_LOGGED_IN').as('dev').
......2>   in('TEST_ACCESSED').as('loc').
......3>   project('account','device','location').
......4>     by(select('ac').id()).
......5>     by(select('dev').id()).
......6>     by(values('latitude','longitude').fold())

==> [account:2025487996, device:device4, location:[37.4219983,-122.084]]
==> [account:2025487996, device:device3, location:[12.9578933,77.7446321]]

由于Gremlin目前没有本地支持地理空间的函数,您有几个选择:

  1. 在应用程序中执行距离过滤。
  2. 使用Haversine大圆距离 Gremlin查询示例,并将点注入其中并根据结果进行过滤。
  3. 在应用程序中执行Haversine计算。
  4. 使用Geohash等存储点,然后在应用程序中使用Geohash(或等效库)执行计算。

Apache TinkerPop的路线图(日期待定)中有一项任务是改进Gremlin的地理空间支持。在那之前,最好的替代方法将是使用Haversine查询进行计算,或者在应用程序中进行计算,只使用图来查找关系。

如果决定尝试在Gremlin中执行所有操作,可以稍微更改查询的初始部分,以使其更容易与Haversine距离示例匹配。

gremlin> g.V('2025487996').as('ac').
......1>   in('TEST_LOGGED_IN').as('dev').
......2>   in('TEST_ACCESSED').as('loc').
......3>   group().
......4>     by(select('dev').id()).
......5>     by(project('lat','lon').
......6>         by('latitude').
......7>         by('longitude'))

==> [device3:[lat:12.9578933,lon:77.7446321],device4:[lat:37.4219983,lon:-122.084]]

最后,

gremlin> g.V('2025487996').as('ac').
......1>   in('TEST_LOGGED_IN').as('dev').
......2>   in('TEST_ACCESSED').as('loc').
......3>   group().
......4>     by(select('dev').id()).
......5>     by(project('lat','lon').
......6>         by('latitude').
......7>         by('longitude')).
......8>   project('src','dst').
......9>     by(limit(local,1).select(values).unfold()).
.....10>     by(tail(local,1).select(values).unfold())

==> [src:[lat:12.9578933,lon:77.7446321],dst:[lat:37.4219983,lon:-122.084]]

这创建了一个src/dst对,可以嵌入到Haversine查询中。

因此,我们现在可以剪切并粘贴Haversine查询的相关部分。

gremlin> g.withSideEffect("rdeg", 0.017453293).
......1>   withSideEffect("gcmiles", 3956).
......2>   V('2025487996').as('ac').
......3>   in('TEST_LOGGED_IN').as('dev').
......4>   in('TEST_ACCESSED').as('loc').
......5>   group().
......6>     by(select('dev').id()).
......7>     by(project('lat','lon').
......8>         by('latitude').
......9>         by('longitude')).
.....10>   project('src','dst').
.....11>     by(limit(local,1).select(values).unfold()).
.....12>     by(tail(local,1).select(values).unfold()).
.....13>   as('grp').
.....14>   project('ladiff','lgdiff','lat1','lon1','lat2','lon2').
.....15>     by(project('la1','la2').
.....16>          by(select('grp').select('src').select('lat')).
.....17>          by(select('grp').select('dst').select('lat')).
.....18>        math('(la2 - la1) * rdeg')).
.....19>     by(project('lg1','lg2').
.....20>          by(select('grp').select('src').select('lon')).
.....21>          by(select('grp').select('dst').select('lon')).
.....22>        math('(lg2 - lg1) * rdeg')).
.....23>     by(select('grp').select('src').select('lat')).
.....24>     by(select('grp').select('src').select('lon')).
.....25>     by(select('grp').select('dst').select('lat')).
.....26>     by(select('grp').select('dst').select('lon')).
.....27>   math('(sin(ladiff/2))^2 + cos(lat1*rdeg) * cos(lat2*rdeg) * (sin(lgdiff/2))^2').
.....28>   math('gcmiles * (2 * asin(sqrt(_))')

==> 8719.402153569563

8,719英里约合14,032公里。

英文:

Thanks for adding the sample graph. That makes writing an answer much easier.

To begin with, finding the devices and locations that accessed the account is as simple as:

gremlin&gt; g.V(&#39;2025487996&#39;).in(&#39;TEST_LOGGED_IN&#39;).in(&#39;TEST_ACCESSED&#39;).path().by(id)

==&gt;[2025487996,device4,location4]
==&gt;[2025487996,device3,location3]    

With that foundation we can then extract the coordinates and format the results a bit more nicely.

gremlin&gt; g.V(&#39;2025487996&#39;).as(&#39;ac&#39;).
......1&gt;   in(&#39;TEST_LOGGED_IN&#39;).as(&#39;dev&#39;).
......2&gt;   in(&#39;TEST_ACCESSED&#39;).as(&#39;loc&#39;).
......3&gt;   project(&#39;account&#39;,&#39;device&#39;,&#39;location&#39;).
......4&gt;     by(select(&#39;ac&#39;).id()).
......5&gt;     by(select(&#39;dev&#39;).id()).
......6&gt;     by(values(&#39;latitude&#39;,&#39;longitude&#39;).fold()) 

==&gt;[account:2025487996,device:device4,location:[37.4219983,-122.084]]
==&gt;[account:2025487996,device:device3,location:[12.9578933,77.7446321]]  

As there are currently no GeoSpatial functions native to Gremlin, you have a few choices here.

  1. Do the distance filtering in the application.
  2. Use the Haversine Great Circle Distance Gremlin query example and inject the points into that and filter based on the results.
  3. Do the Haversine calculations in the application.
  4. Store the points using something like a Geohash and do the calculations in the application using a GeoHash (or equivalent) library.

On the roadmap for Apache TinkerPop (date TBD) there is an item to add better GeoSpatial support to Gremlin. Until then, the best alternatives are going to be to do the calculation using the Harversine query, or do them in the application and just use the graph to find the relationships.

If you decide to try doing it all in Gremlin, changing the initial part of the query a little will make it fit more easily with the Haversine distance example.

gremlin&gt; g.V(&#39;2025487996&#39;).as(&#39;ac&#39;).
......1&gt;   in(&#39;TEST_LOGGED_IN&#39;).as(&#39;dev&#39;).
......2&gt;   in(&#39;TEST_ACCESSED&#39;).as(&#39;loc&#39;).
......3&gt;   group().
......4&gt;     by(select(&#39;dev&#39;).id()).
......5&gt;     by(project(&#39;lat&#39;,&#39;lon&#39;).
......6&gt;         by(&#39;latitude&#39;).
......7&gt;         by(&#39;longitude&#39;)) 

==&gt;[device3:[lat:12.9578933,lon:77.7446321],device4:[lat:37.4219983,lon:-122.084]] 

Lastly

gremlin&gt; g.V(&#39;2025487996&#39;).as(&#39;ac&#39;).
......1&gt;   in(&#39;TEST_LOGGED_IN&#39;).as(&#39;dev&#39;).
......2&gt;   in(&#39;TEST_ACCESSED&#39;).as(&#39;loc&#39;).
......3&gt;   group().
......4&gt;     by(select(&#39;dev&#39;).id()).
......5&gt;     by(project(&#39;lat&#39;,&#39;lon&#39;).
......6&gt;         by(&#39;latitude&#39;).
......7&gt;         by(&#39;longitude&#39;)).
......8&gt;   project(&#39;src&#39;,&#39;dst&#39;).
......9&gt;     by(limit(local,1).select(values).unfold()).
.....10&gt;     by(tail(local,1).select(values).unfold()) 

==&gt;[src:[lat:12.9578933,lon:77.7446321],dst:[lat:37.4219983,lon:-122.084]]    

which creates a src/dst pair that can be plugged into the Harversine query.

So......we can now cut and paste in the relevant parts of the Harversine query.

gremlin&gt; g.withSideEffect(&quot;rdeg&quot;, 0.017453293).
......1&gt;   withSideEffect(&quot;gcmiles&quot;,3956). 
......2&gt;   V(&#39;2025487996&#39;).as(&#39;ac&#39;).
......3&gt;   in(&#39;TEST_LOGGED_IN&#39;).as(&#39;dev&#39;).
......4&gt;   in(&#39;TEST_ACCESSED&#39;).as(&#39;loc&#39;).
......5&gt;   group().
......6&gt;     by(select(&#39;dev&#39;).id()).
......7&gt;     by(project(&#39;lat&#39;,&#39;lon&#39;).
......8&gt;         by(&#39;latitude&#39;).
......9&gt;         by(&#39;longitude&#39;)).
.....10&gt;   project(&#39;src&#39;,&#39;dst&#39;).
.....11&gt;     by(limit(local,1).select(values).unfold()).
.....12&gt;     by(tail(local,1).select(values).unfold()).
.....13&gt;   as(&#39;grp&#39;).
.....14&gt;   project(&#39;ladiff&#39;,&#39;lgdiff&#39;,&#39;lat1&#39;,&#39;lon1&#39;,&#39;lat2&#39;,&#39;lon2&#39;).
.....15&gt;     by(project(&#39;la1&#39;,&#39;la2&#39;).
.....16&gt;          by(select(&#39;grp&#39;).select(&#39;src&#39;).select(&#39;lat&#39;)).
.....17&gt;          by(select(&#39;grp&#39;).select(&#39;dst&#39;).select(&#39;lat&#39;)).
.....18&gt;        math(&#39;(la2 - la1) * rdeg&#39;)).
.....19&gt;     by(project(&#39;lg1&#39;,&#39;lg2&#39;).
.....20&gt;          by(select(&#39;grp&#39;).select(&#39;src&#39;).select(&#39;lon&#39;)).
.....21&gt;          by(select(&#39;grp&#39;).select(&#39;dst&#39;).select(&#39;lon&#39;)).
.....22&gt;        math(&#39;(lg2 - lg1) * rdeg&#39;)).
.....23&gt;     by(select(&#39;grp&#39;).select(&#39;src&#39;).select(&#39;lat&#39;)).
.....24&gt;     by(select(&#39;grp&#39;).select(&#39;src&#39;).select(&#39;lon&#39;)).
.....25&gt;     by(select(&#39;grp&#39;).select(&#39;dst&#39;).select(&#39;lat&#39;)).
.....26&gt;     by(select(&#39;grp&#39;).select(&#39;dst&#39;).select(&#39;lon&#39;)).
.....27&gt;   math(&#39;(sin(ladiff/2))^2 + cos(lat1*rdeg) * cos(lat2*rdeg) * (sin(lgdiff/2))^2&#39;).
.....28&gt;   math(&#39;gcmiles * (2 * asin(sqrt(_)))&#39;)  

==&gt;8719.402153569563

8,719 miles is approximately 14,032 kilometers.

huangapple
  • 本文由 发表于 2023年2月8日 14:32:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75382112.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定