This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Sync Maven data to graph #2048

Closed
1 task
msrb opened this issue Jan 25, 2018 · 15 comments

Comments

@msrb
Collaborator

msrb commented Jan 25, 2018

Description

There were times when the data ingestion pipeline was broken or certain parts of it were disabled. During those times, we analyzed plenty of packages and stored the results in S3, but never ingested the data into the graph database. With #1085 implemented, we want to sync all the missing data from S3 to the graph.

Acceptance criteria

  • all Maven data from the production S3 buckets is available in the graph
@abs51295

Fixed a typo @msrb :)

@msrb
Collaborator Author

msrb commented Jan 29, 2018

@sivaavkd description updated, is it better now?

@tuxdna
Collaborator

tuxdna commented Jan 30, 2018

Carrying forward from #1085 (comment)

Some intermittent failures were encountered in the Graph layer. They seem to happen only rarely:


g.V().has('ecosystem','maven').has('name','org.apache.tomcat:tomcat-servlet-api').properties('tokens','libio_usedby').drop().iterate();pkg = g.V().has('ecosystem','maven').has('name', 'org.apache.tomcat:tomcat-servlet-api').tryNext().orElseGet{graph.addVertex('ecosystem', 'maven', 'name', 'org.apache.tomcat:tomcat-servlet-api', 'vertex_label', 'Package')};pkg.property('last_updated', 1517286212.43);pkg.property('tokens', 'org'); pkg.property('tokens', 'apache'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'servlet'); pkg.property('tokens', 'api');pkg.property('latest_version', '9.0.0.M17');pkg.property('libio_latest_release', '1500854400.0');pkg.property('libio_usedby', 'keycloak/keycloak:1359');pkg.property('libio_usedby', 'SungardAS/enhanced-snapshots:35');pkg.property('libio_usedby', 'cf-unik/unik:1239');pkg.property('libio_usedby', 'entando/entando-components:15');pkg.property('libio_usedby', 'indeedeng/proctor:198');pkg.property('libio_usedby', 'magro/memcached-session-manager:552');pkg.property('libio_usedby', 'nysenate/OpenLegislation:163');pkg.property('libio_usedby', 'Red5/red5-server:1254');pkg.property('libio_usedby', 'aspose-words/Aspose.Words-for-Java:52');pkg.property('libio_usedby', 'google/identity-toolkit-java-client:32');pkg.property('libio_dependents_projects', '65');pkg.property('libio_dependents_repos', '2.05K');pkg.property('libio_total_releases', '124');pkg.property('libio_latest_version', '8.5.19');g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.19').property('gh_release_date', 
1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M25').property('gh_release_date',1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M19').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.
5.16').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14').property('gh_release_date',1492041600.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.15').property('gh_release_date',1493942400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M21').property('gh_release_date',1493856000.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M20').property('gh_release_date',1491955200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M22').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.13').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M1').properties('licenses','cve_ids','declared_licenses').drop().iterate();ver = g.V().has('pecosystem', 'maven').has('pname', 'org.apache.tomcat:tomcat-servlet-api').has('version', '9.0.0.M1').tryNext().orElseGet{graph.addVertex('pecosystem','maven', 'pname','org.apache.tomcat:tomcat-servlet-api', 'version', '9.0.0.M1', 'vertex_label', 'Version')};ver.property('last_updated',1517286212.43);ver.property('description','javax.servlet package');ver.property('cm_num_files',112);ver.property('cm_avg_cyclomatic_complexity', 1.23);ver.property('cm_loc',42622);ver.property('licenses', 'ASL 2.0'); ver.property('licenses', 'CDDL');ver.property('cve_ids', 'CVE-2017-6056:5.0'); ver.property('cve_ids', 'CVE-2016-8735:7.5'); ver.property('cve_ids', 'CVE-2016-6816:6.8'); ver.property('cve_ids', 'CVE-2016-6325:7.2'); ver.property('cve_ids', 'CVE-2016-5425:7.2'); ver.property('cve_ids', 'CVE-2016-3092:7.8'); ver.property('cve_ids', 
'CVE-2016-0763:6.5'); ver.property('cve_ids', 'CVE-2016-0714:6.
5'); ver.property('cve_ids', 'CVE-2016-0706:4.0'); ver.property('cve_ids', 'CVE-2015-5351:6.8'); ver.property('cve_ids', 'CVE-2015-5346:6.8'); ver.property('cve_ids', 'CVE-2015-5345:5.0');ver.property('declared_licenses', 'Apache License'); ver.property('declared_licenses', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0');lic = g.V().has('lname', 'Apache License').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', 'Apache License', 'last_updated',1517286212.43)}; g.V(ver).out('has_declared_license').has('lname', 'Apache License').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};lic = g.V().has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0', 'last_updated',1517286212.43)}; g.V(ver).out('has_declared_license').has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};edge_c = g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M1').in('has_version').tryNext().orElseGet{pkg.addEdge('has_version', ver)};
ERROR:data_importer:The import failed: 'status'
ERROR:data_importer:Traceback for latest failure in import call: Traceback (most recent call last):
  File "/src/data_importer.py", line 101, in _import_keys_from_s3_http
    if resp['status']['code'] == 200:
KeyError: 'status'

g.V().has('ecosystem','maven').has('name','org.apache.tomcat:tomcat-servlet-api').properties('tokens','libio_usedby').drop().iterate();pkg = g.V().has('ecosystem','maven').has('name', 'org.apache.tomcat:tomcat-servlet-api').tryNext().orElseGet{graph.addVertex('ecosystem', 'maven', 'name', 'org.apache.tomcat:tomcat-servlet-api', 'vertex_label', 'Package')};pkg.property('last_updated', 1517286213.63);pkg.property('tokens', 'org'); pkg.property('tokens', 'apache'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'servlet'); pkg.property('tokens', 'api');pkg.property('latest_version', '9.0.0.M17');pkg.property('libio_latest_release', '1500854400.0');pkg.property('libio_usedby', 'keycloak/keycloak:1359');pkg.property('libio_usedby', 'SungardAS/enhanced-snapshots:35');pkg.property('libio_usedby', 'cf-unik/unik:1239');pkg.property('libio_usedby', 'entando/entando-components:15');pkg.property('libio_usedby', 'indeedeng/proctor:198');pkg.property('libio_usedby', 'magro/memcached-session-manager:552');pkg.property('libio_usedby', 'nysenate/OpenLegislation:163');pkg.property('libio_usedby', 'Red5/red5-server:1254');pkg.property('libio_usedby', 'aspose-words/Aspose.Words-for-Java:52');pkg.property('libio_usedby', 'google/identity-toolkit-java-client:32');pkg.property('libio_dependents_projects', '65');pkg.property('libio_dependents_repos', '2.05K');pkg.property('libio_total_releases', '124');pkg.property('libio_latest_version', '8.5.19');g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.19').property('gh_release_date', 
1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M25').property('gh_release_date',1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M19').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.
5.16').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14').property('gh_release_date',1492041600.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.15').property('gh_release_date',1493942400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M21').property('gh_release_date',1493856000.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M20').property('gh_release_date',1491955200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M22').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.13').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M11').properties('licenses','cve_ids','declared_licenses').drop().iterate();ver = g.V().has('pecosystem', 'maven').has('pname', 'org.apache.tomcat:tomcat-servlet-api').has('version', '9.0.0.M11').tryNext().orElseGet{graph.addVertex('pecosystem','maven', 'pname','org.apache.tomcat:tomcat-servlet-api', 'version', '9.0.0.M11', 'vertex_label', 'Version')};ver.property('last_updated',1517286213.63);ver.property('description','javax.servlet package');ver.property('cm_num_files',114);ver.property('cm_avg_cyclomatic_complexity', 1.23);ver.property('cm_loc',42800);ver.property('licenses', 'ASL 2.0'); ver.property('licenses', 'CDDL');ver.property('cve_ids', 'CVE-2017-6056:5.0'); ver.property('cve_ids', 'CVE-2016-8747:5.0'); ver.property('cve_ids', 'CVE-2016-8735:7.5'); ver.property('cve_ids', 'CVE-2016-6816:6.8'); ver.property('cve_ids', 'CVE-2016-6325:7.2'); ver.property('cve_ids', 'CVE-2016-5425:7.2');ver.property('declared_licenses', 
'Apache License'); ver.property('declared_licenses'
, ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0');lic = g.V().has('lname', 'Apache License').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', 'Apache License', 'last_updated',1517286213.63)}; g.V(ver).out('has_declared_license').has('lname', 'Apache License').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};lic = g.V().has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0', 'last_updated',1517286213.63)}; g.V(ver).out('has_declared_license').has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};edge_c = g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M11').in('has_version').tryNext().orElseGet{pkg.addEdge('has_version', ver)};
ERROR:data_importer:The import failed: HTTPConnectionPool(host='172.30.80.86', port=8182): Read timed out. (read timeout=30)
ERROR:data_importer:Traceback for latest failure in import call: Traceback (most recent call last):
  File "/src/data_importer.py", line 98, in _import_keys_from_s3_http
    data=json.dumps(payload), timeout=30)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
ReadTimeout: HTTPConnectionPool(host='172.30.80.86', port=8182): Read timed out. (read timeout=30)
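
The first traceback above shows `resp['status']['code']` raising `KeyError` when the response body has no `status` key. A minimal sketch of a more defensive check follows, assuming `resp` is the already-decoded JSON body. The helper names are hypothetical and are not part of `data_importer.py`:

```python
def gremlin_status_code(resp):
    """Return the status code from a decoded response body, or None.

    Assumption: a successful response looks like {"status": {"code": 200}, ...},
    as implied by the check in the traceback. Anything else yields None.
    """
    if not isinstance(resp, dict):
        return None
    status = resp.get("status")
    if not isinstance(status, dict):
        return None
    return status.get("code")


def import_succeeded(resp):
    """True only when the body carries an explicit 200 status code."""
    return gremlin_status_code(resp) == 200
```

With a check like this, an error body without `status` is reported as a failed import instead of crashing the importer with `KeyError`.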

@tuxdna
Collaborator

tuxdna commented Jan 30, 2018

Figured out the cause of the above error from the Gremlin server logs:

39989555 [gremlin-server-worker-1] WARN  org.apache.tinkerpop.gremlin.server.handler.HttpGremlinEndpointHandler  - Invalid request - responding with 500 Internal Server Error and startup failed:
Script3023.groovy: 1: expecting ''', found '\n' @ line 1, column 4116.
   d_licenses', ' Version 2.0 and
                                 ^

1 error

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script3023.groovy: 1: expecting ''', found '\n' @ line 1, column 4116.
   d_licenses', ' Version 2.0 and
                                 ^

1 error

	at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:310)
	at org.codehaus.groovy.control.ErrorCollector.addFatalError(ErrorCollector.java:150)
	at org.codehaus.groovy.control.ErrorCollector.addError(ErrorCollector.java:120)
	at org.codehaus.groovy.control.ErrorCollector.addError(ErrorCollector.java:132)
	at org.codehaus.groovy.control.SourceUnit.addError(SourceUnit.java:360)
	at org.codehaus.groovy.antlr.AntlrParserPlugin.transformCSTIntoAST(AntlrParserPlugin.java:140)
	at org.codehaus.groovy.antlr.AntlrParserPlugin.parseCST(AntlrParserPlugin.java:111)
	at org.codehaus.groovy.control.SourceUnit.parse(SourceUnit.java:237)
	at org.codehaus.groovy.control.CompilationUnit$1.call(CompilationUnit.java:167)
	at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:931)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:593)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:569)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:546)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:298)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:268)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:254)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:211)
	at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.getScriptClass(GremlinGroovyScriptEngine.java:527)
	at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:446)
	at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
	at org.apache.tinkerpop.gremlin.groovy.engine.ScriptEngines.eval(ScriptEngines.java:119)
	at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$2(GremlinExecutor.java:287)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	

Essentially, the queries are not formed correctly: the existing code should account for special characters such as newlines. The relevant code is here:

https://github.com/fabric8-analytics/fabric8-analytics-data-model/blame/51ae4e1ecbd03e352a1a8e63a003bf2d7264a1a1/src/graph_populator.py#L112-L124

                drop_props.append('declared_licenses')


                prp_version += " ".join(["ver.property('declared_licenses', '{}');".format
                                         (dl) for dl in declared_licenses])
                # Create License Node and edge from EPV
                for lic in declared_licenses:
                    prp_version += "lic = g.V().has('lname', '{lic}').tryNext().orElseGet{{" \
                                   "graph.addVertex('vertex_label', 'License', 'lname', '{lic}', " \
                                   "'last_updated',{last_updated})}}; g.V(ver).out(" \
                                   "'has_declared_license').has('lname', '{lic}').tryNext()." \
                                   "orElseGet{{ver.addEdge('has_declared_license', lic)}};".format(
                                       lic=lic, last_updated=str(time.time())
                                   )

This happens for the following package version:

has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14')

@tuxdna
Collaborator

tuxdna commented Jan 30, 2018

metadata.json for the above EPV contains:

"declared_license": "Apache License, Version 2.0 and\n        Common Development And Distribution License (CDDL) Version 1.0",

Notice the newline.
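
A raw newline inside a single-quoted Groovy string is what triggers the `expecting ''', found '\n'` compilation error. One possible fix is to escape each value before interpolating it into the query. This is a sketch only: `groovy_quote` is a hypothetical helper, not a function that exists in `graph_populator.py`:

```python
def groovy_quote(value):
    """Render a Python string as a single-quoted Groovy string literal.

    Backslashes are escaped first so that the escapes added afterwards
    are not escaped a second time.
    """
    escaped = (
        value.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return "'" + escaped + "'"


# The value quoted above from metadata.json, with its embedded newline.
declared = (
    "Apache License, Version 2.0 and\n"
    "        Common Development And Distribution License (CDDL) Version 1.0"
)
statement = "ver.property('declared_licenses', {});".format(groovy_quote(declared))
```

The resulting `statement` stays on one line, so the Groovy parser sees a complete string literal. Sending values as script bindings instead of formatting them into the query text, where the server setup allows it, would avoid the need for escaping altogether.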

@tuxdna
Collaborator

tuxdna commented Feb 2, 2018

@tuxdna tuxdna self-assigned this Feb 5, 2018
@tuxdna
Collaborator

tuxdna commented Feb 12, 2018

The last time I checked the progress of the Maven graph sync, it had reached the package org.ops4j.pax.exam:pax-exam-spi, which is ranked 104495 out of 131460 Maven packages in total (in lexicographic order by name). So about 79% of all Maven packages had been synced.

@msrb
Collaborator Author

msrb commented Feb 12, 2018

@miteshvp can we query graph for exact numbers of Maven packages/components please?

@tuxdna
Collaborator

tuxdna commented Feb 13, 2018

In the first pass, fewer than 11402 packages remain to be synced to the Maven graph. So the sync is at least this far along:

100 * (1 - 11402 / 131460) ≈ 91.33 %
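
The same arithmetic as a small sketch. Both counts are the ones quoted in this thread; the function name is only illustrative:

```python
TOTAL_MAVEN_PACKAGES = 131460  # total quoted earlier in this thread
REMAINING_UPPER_BOUND = 11402  # upper bound on packages still to sync


def synced_percent(remaining, total):
    """Percentage of packages already synced, given how many remain."""
    # float() keeps the division exact on Python 2 as well as Python 3.
    return 100.0 * (1.0 - float(remaining) / total)


percent = synced_percent(REMAINING_UPPER_BOUND, TOTAL_MAVEN_PACKAGES)
print("at least {:.2f} % synced".format(percent))
```

Because 11402 is an upper bound on the remaining packages, the result of roughly 91.3 % is a lower bound on progress.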

Some packages may have been skipped due to gateway timeouts. We can sync those once the first pass is complete.

@tuxdna
Collaborator

tuxdna commented Feb 14, 2018

The first pass is complete. Many of the Maven packages failed to sync to the graph, for several reasons:

  • service unavailable
  • gateway timeout
  • bugs in data model importer

I have now rescheduled the sync of those pending packages.
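
For the transient failures (service unavailable, gateway timeout), a second pass could also retry each package a few times with backoff. A minimal sketch follows, assuming a hypothetical `sync_one(key)` callable that raises an exception on failure. It does not reflect how the actual re-sync was scheduled:

```python
import time


def sync_with_retry(sync_one, keys, max_attempts=3, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call sync_one(key) for every key, retrying failures with backoff.

    Returns the keys that still failed after max_attempts tries, so they
    can be inspected or rescheduled later.
    """
    still_failing = []
    for key in keys:
        for attempt in range(1, max_attempts + 1):
            try:
                sync_one(key)
                break  # synced; move on to the next key
            except retry_on:
                if attempt == max_attempts:
                    still_failing.append(key)
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return still_failing
```

In practice `retry_on` would be narrowed to the transient error types. Retrying does not help with importer bugs such as the unescaped newline, which fail the same way on every attempt.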

@msrb
Collaborator Author

msrb commented Feb 14, 2018

Thanks Saleem 👍

I've created #2256 for improving test coverage in data-importer.

@msrb msrb modified the milestones: Sprint 144, Sprint 145 Feb 14, 2018
@miteshvp

Late reply, but here it is: there are a total of 137389 Maven packages in our graph.
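
The thread does not say how this number was obtained. One way such a count could be requested is sketched below, assuming package vertices carry the `ecosystem` and `vertex_label` properties seen in the import queries above, and that the Gremlin server's HTTP endpoint accepts a JSON body with `gremlin` and `bindings` fields. The function name is hypothetical:

```python
import json


def count_packages_payload(ecosystem):
    """Build a JSON request body that counts Package vertices of one ecosystem."""
    query = (
        "g.V().has('ecosystem', ecosystem)"
        ".has('vertex_label', 'Package').count()"
    )
    # The ecosystem is passed as a binding, not formatted into the script text.
    return json.dumps({"gremlin": query, "bindings": {"ecosystem": ecosystem}})


payload = count_packages_payload("maven")
```

The payload would then be POSTed to the Gremlin endpoint in the same way the importer sends its queries.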

@tuxdna
Collaborator

tuxdna commented Feb 21, 2018

@miteshvp How did you figure this out?

@msrb
Collaborator Author

msrb commented Feb 22, 2018

Maven is in graph, closing. Thanks @tuxdna 😉
