2023-03-30 Infrastructure Working Group Agenda and Meeting Notes

March 30, 2023

When:
Thursday, March 30, 2023
9:00-10:00 AM PDT / 12:00-1:00 PM EDT
Zoom. Please register in advance for this meeting:
https://zoom.us/meeting/register/tJwpcO2spjkpE9a1HXBeyBxz7TM_Dvo8Ne8j

 

Attendees:

  • Sean Bohan (openIDL)

  • Nathan Southern (openIDL)

  • Peter Antley (AAIS)

  • Ken Sayers (AAIS)

  • Surya Lanka (Chainyard)

  • Yanko Zhelyazkov (Senofi)

  • Allen Thompson (Hanover)

  • Adnan Choudhury (Chainyard)

  • Aashish Shrestha (Chainyard)

  • Jeff Braswell (openIDL)

  • Faheem Zakaria (Hanover)

 

 

Agenda Items:

  • Antitrust

  • ND POC: Learnings, What Worked, What Didn't, What Was Changed

  • Next topics

  • AOB

Minutes:

  • openIDL ND POC Changes

    • 3 areas: infrastructure, Fabric, and application points of view

  • Infrastructure

    • added code to deploy the Lambda function for report processing and the upload UI

    • S3 bucket for the CloudFront distribution and AWS Certificate Manager

    • upgraded Kubernetes to 1.23 (1.22 was being deprecated)

    • test environment updated

    • added cert-manager for SSL certs for all endpoints

    • updated self-managed node groups to EKS managed node groups

      • self-managed groups used launch configurations, which were being deprecated in December - had to move them to EKS managed node groups (launch templates, not launch configurations)

    • added Jenkins pipeline to deploy the upload UI app config

    • Upload UI was part of ND POC

  • KS

    • some things are specific to ND; may take them out of the base code and treat them as a reference

  • Surya

    • Fabric

    • in the HLF deployment, upgraded the images from 2.2.3 to 2.2.9

    • picked up the bug fixes that are part of 2.2.9

    • fixed the chaincode to run after a peer restart

      • previously, whenever there was a peer restart, the chaincode had to be upgraded to make it work; now, when the peer restarts and comes back up, it can run chaincode without issues - fixed

    • not related to the ND POC, but in general

    • Application

    • ND POC - needed to deploy the incidence manager to the analytics node

    • with those changes done, deployed the data manager to the analytics node

    • new params in the application component configs

    • automated generation of app component config files through Ansible templates

    • Ansible takes the app config files and loads them into one based on node type

    • integrated through the Jenkins pipeline

    • during node setup, or when there are new changes, can just trigger the pipeline to add or update app config files in the cluster
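The per-node-type config assembly described above can be sketched in a few lines (a minimal illustration with hypothetical keys and node types; the actual flow uses Ansible templates driven by a Jenkins pipeline, not Python):

```python
# Sketch of merging a base app config with node-type overrides, mimicking
# what the Ansible templates generate per node. All keys, node types, and
# component names here are illustrative, not the real openIDL schema.
BASE_CONFIG = {"log_level": "info"}

NODE_OVERRIDES = {
    "analytics": {"components": ["data-manager", "report-processor"]},
    "carrier": {"components": ["upload-ui"]},
}

def render_config(node_type: str) -> dict:
    """Load the app config files into one based on node type."""
    return {**BASE_CONFIG, **NODE_OVERRIDES.get(node_type, {})}
```

Unknown node types simply fall back to the base config, which mirrors the idea of one shared template plus per-node overlays.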

  • Why was the upgrade performed? For specific bugs, or just in case? And what problems occurred?

  • Surya

    • with respect to bugs, it is necessary to keep the code up to date; we were using the 2.2.3 version of Fabric, after which there are patch releases .4 through .10

    • keep the code up to date with respect to the 2.2 versions

  • JB

    • was the latest version of Fabric related to the smaller max PDC size?

  • Surya - the limit was already there due to CouchDB

    • 2.5 has support

  • Adnan

    • re: PDC size, Fabric has a default value; it is static, not configurable

    • do not want to go over the max

    • a larger dataset in the PDC reduces performance due to larger transactions

  • Aashish

    • CouchDB 4 is planning to make the limit a set limit (8 MB)

    • future versions would encounter problems
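As a rough illustration of the size concern, a write path could guard against the document limit before storing results (the 8 MB figure is taken from the discussion; the function name is a hypothetical stand-in, not openIDL code):

```python
import json

# Assumed CouchDB document size limit from the discussion (8 MB).
MAX_DOC_BYTES = 8 * 1024 * 1024

def fits_in_couchdb(doc: dict) -> bool:
    """True if the JSON-serialized document stays under the limit."""
    return len(json.dumps(doc).encode("utf-8")) < MAX_DOC_BYTES
```

A guard like this is why oversized PDC payloads have to be batched rather than written as a single document.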

  • KS - concern about the latest version?

  • Yanko - target specific fixes; be cautious

    • new versions may introduce other issues

    • wondering what the specific bugs were

  • KS - we don't want to get 2-3 versions behind; catching up becomes an impossible lift; stay within a certain number of Fabric versions

  • YZ - minor fixes and features

    • major versions require regression testing, etc.

  • AS - did not update the minor version, just the latest patch version

  • KS - 2.2.3 to 2.2.9

  • AS - the next would be 2.4; the feature we wanted, deleting PDC data by triggering functions, is in 2.5

  • KS - carriers don't want data lingering in the PDC

  • results of extractions, once in the report, should not remain in the PDC

  • KS - didn't test for some sizing

    • some unnecessary logging

    • there were things to fix inside, and it required beefing up machines

    • ND is an outlier - passing 100k rows, different than stat reporting

    • more than running-out-of-memory issues, there were problems with timeouts, restarting pods, etc.

  • Aashish

    • haven't touched performance tests

    • code optimization around loops too

  • Adnan

    • most taken care of by resource restructuring

    • specifically, when running the EP in Mongo, the resources were not enough

    • after beefing up resources, some loops and logs needed to be reprogrammed

    • longer timeouts, etc.

    • 1MM rows of data, give proper time, etc. 

  • KS

    • had to batch stuff in PDC

  • AC

    • data was going over the config value (max size) for PDCs

    • chopped it up per the config value and saved the results in the PDC one by one

    • each batch is then taken for further processing by the next set of processes
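The batching AC describes, chopping results by a configured value and saving them to the PDC one at a time, can be sketched as follows (a Python stand-in with a hypothetical `put_private_data` callback and key names; real chaincode would be Go or Node):

```python
def save_to_pdc(records: list, batch_size: int, put_private_data) -> int:
    """Chop `records` into batches no larger than the configured value and
    save each batch under its own key, one by one, for later processing.
    `put_private_data` is a hypothetical stand-in for the chaincode's
    private-data write. Returns the number of batches written."""
    count = 0
    for i in range(0, len(records), batch_size):
        put_private_data(f"result_batch_{count}", records[i:i + batch_size])
        count += 1
    return count
```

Keeping each write under the configured limit is what avoids the PDC/CouchDB size issues discussed above, at the cost of more transactions.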

  • YZ

    • performance problems were due to the chaincode and how it processed data

  • AC

    • due to the unique nature of the data itself, had to recalibrate a few things

    • comparison of data needed to be efficient

  • YZ

    • report processor - the report being created

    • what were the problems on the network side - did it need a fix or extra effort to make it work?

  • AC

    • saw timeouts

    • transaction timeouts

    • needed to increase the resources where we saw them

    • made sure the node with the peer was not overloaded

  • SL

    • document memory issue

    • processing larger data in the PDC caused CouchDB issues with peers (a size issue)

    • some app components were getting killed due to "out of memory"

    • multiple cases where processing takes longer and the component gets killed due to OOM

  • YZ

    • was there analysis on why it was running out of memory?

  • AC - found the issue

    • openIDL has one status DB saving the status of each data call

    • if a data call fails for some reason, the scheduler comes in and tries to finish the job, coming back every 30 or 45 minutes

    • in the test environment

    • some data calls did not complete - saw the test environment doing transactions even though no data calls were running
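The scheduler behavior described here can be sketched as a simple sweep over the status DB (field names and statuses are assumptions for illustration; per the notes, the real scheduler runs every 30-45 minutes):

```python
def retry_unfinished(status_db: dict, run_job) -> list:
    """Sweep the status DB and retry every data call that has not
    completed. This sweep is why the test environment kept generating
    transactions even when no new data calls were issued. `status_db`
    maps data-call id -> status; returns the ids that were retried."""
    retried = []
    for call_id, status in status_db.items():
        if status != "completed":
            run_job(call_id)
            retried.append(call_id)
    return retried
```

A stuck entry in the status DB therefore gets retried on every pass until its status is updated, which matches the observed background transaction traffic.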

  • YZ - the issue was on the application side, not Fabric

  • JB

    • Resource sizing for the HDS in AWS is distinct from the resource requirements for Fabric nodes, is it not?

    • If the HDS AWS resource for a carrier is different than the Fabric node resource, the Fabric node resource would not need to be so large?

    • the different resource sizing requirements were for the application, not the node itself

  • AS - for a typical data call, is the size of data less than ND?

  • KS - the EP would be hundreds of records; small results at extraction

  • PA - we will be somewhere in between; it comes back as JSON, formatted in the data layer as a string of JSONs

    • under a 1000-element JSON

  • AS - will cut down a lot of processing

  • KS - stat reports are similar; we will have other situations

    • the MS POC might have a similar result set

  • PA - much bigger; not just drivers, the whole state

  • AS - number of carriers? per state?

  • PA - 100+ separate entities recognized by NAIC

    • 200 by AAIS

    • fewer than 3k total

  • KS - the way we expect to see it unfold: loading on behalf of carriers into a multi-tenant node

  • PA - load testing to see how much we can fit

    • lots of carriers in a single node

    • 200+ carriers

    • a lot of small mutuals; carrier node 4 has 100 carriers in the same table

    • for a data call with aggregate info, won't have all the primary keys

  • KS - individual nodes for fewer than 20 carriers for a while

  • AS - multi-tenant node size - for MS?

  • KS - stat reporting will stretch the multi-tenant node

  • PA - didn't have any of these carriers working with us

  • KS - putting data in was a huge win

  • KS - cool thing

    • adding an HDS to the analytics node made it possible

    • it wasn't considered in the original design

    • it allowed the DOT to load data that could be used by the reporting processor

    • the ability to quickly pivot and create different reports was a big win as well

    • data quality issues

    • bad VIN situation - vehicles not on the list because they were not insurable

  • KS - how do we merge?

  • YZ - create tickets in GitHub with the problem, the solution, and how it was approached; once approved, whoever has addressed it can create a branch to get it reviewed and merged

  • KS

    • how do we decide

  • YZ - focus on the application-side issues; a few problems

    • processing data, etc.

    • network side - we are going with the operator, so we probably won't use those changes; not relevant to the new openIDL

    • ex: "performance problem with processing data during a data call" as the title of a GitHub issue

  • AS - can move the things on Trello tracking this over as issues in GitHub

  • JB - any reason to make a repo for ND app-specific stuff, as a baseline concept?

  • KS - are we refactoring a repo? roadmap, backlog 

  • YZ - merge the fixes into main, then refactor later

  • KS - grab Trello items, pick what applies, merge and refactor

  • PA - Mason and PA working on ETL, are we making a new repo? stay where we are?

  • KS - ETL will stick around - keep working there

 

Action items: