2023-03-30 Infrastructure Working Group Agenda and Meeting Notes
March 30, 2023
When:
Thursday, March 30, 2023
9:00am-10:00am PDT / 12:00pm-1:00pm EDT
Zoom: Please register in advance for this meeting:
https://zoom.us/meeting/register/tJwpcO2spjkpE9a1HXBeyBxz7TM_Dvo8Ne8j
Attendees:
- Sean Bohan (openIDL)
- Nathan Southern (openIDL)
- Peter Antley (AAIS)
- Ken Sayers (AAIS)
- Surya Lanka (Chainyard)
- Yanko Zhelyazkov (Senofi)
- Allen Thompson (Hanover)
- Adnan Choudhury (Chainyard)
- Aashish Shrestha (Chainyard)
- Jeff Braswell (openIDL)
- Faheem Zakaria (Hanover)
Agenda Items:
- Antitrust
- ND POC: Learnings, What Worked, What Didn't, What Was Changed
- Next topics
- AOB
Minutes:
- openIDL ND POC Changes
- 3 areas: Infrastructure, Fabric, Application POV
- Infrastructure
- added code to deploy the Lambda function for report processing and the upload UI
- S3 bucket for the cloud distribution and AWS Certificate Manager
- upgraded Kubernetes to 1.23 (1.22 was being deprecated)
- test environment updated
- added cert-manager for SSL certs for all endpoints
- updated self-managed node groups to EKS managed node groups
- self-managed node groups were using launch configurations, which are being deprecated in December; had to move them to EKS managed node groups (launch templates, not launch configurations) - see the sketch below
- added Jenkins pipeline to deploy the upload UI app config
- Upload UI was part of the ND POC
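For reference, a minimal sketch of the launch-configuration-to-launch-template move described above, written as Pulumi-style TypeScript rather than the project's actual infrastructure code; the resource names, role ARN, subnets, instance type, and scaling numbers are all assumptions.

```typescript
// Hypothetical sketch: an EKS managed node group driven by a launch template,
// replacing a self-managed node group that used a (deprecated) launch configuration.
import * as aws from "@pulumi/aws";

// Placeholder values standing in for resources defined elsewhere in the stack.
const nodeRoleArn = "arn:aws:iam::123456789012:role/openidl-eks-node-role";
const privateSubnetIds = ["subnet-aaaa1111", "subnet-bbbb2222"];

// The launch template takes over the role the launch configuration used to play.
const nodeTemplate = new aws.ec2.LaunchTemplate("openidl-node-template", {
    instanceType: "t3.large",          // assumed size; tune per node type
    updateDefaultVersion: true,
});

// EKS managed node group referencing the launch template by id and version.
const nodeGroup = new aws.eks.NodeGroup("openidl-managed-nodes", {
    clusterName: "openidl-eks",        // assumed cluster name
    nodeRoleArn: nodeRoleArn,
    subnetIds: privateSubnetIds,
    launchTemplate: {
        id: nodeTemplate.id,
        version: nodeTemplate.latestVersion.apply(v => `${v}`),
    },
    scalingConfig: { desiredSize: 3, minSize: 2, maxSize: 5 },
});
```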
- KS
- some things are specific to ND; may take them out of the base code and treat them as a reference
- Surya
- Fabric
- in the HLF deployment, upgraded the images from 2.2.3 to 2.2.9
- picks up the bug fixes that are part of 2.2.9
- fixed the chaincode so it runs after a restart
- previously, whenever there was a peer restart, the chaincode had to be upgraded to make it work
- now when there is a restart of the peer and it comes back up, it will be able to start the chaincode without issues - fixed
- not related to the ND POC, but applies in general
- Application
- ND POC - needed to deploy the incidence manager to the analytics node
- with the changes done, deployed the data manager to the analytics node
- new parameters in the application component configs
- automated generation of app component config files through Ansible templates
- Ansible takes the app config files and loads them into one based on node type (see the sketch below)
- integrated through the Jenkins pipeline
- during node setup, or when there are new changes, can just trigger the pipeline to add/update the app config files in the cluster
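As a rough illustration of assembling one app config per node type (the project does this with Ansible templates pushed by the Jenkins pipeline, not application code), a hypothetical TypeScript sketch; the file layout, node-type names, and shallow merge are assumptions.

```typescript
// Hypothetical sketch: build a single app config per node type by layering a
// common config with node-type-specific overrides (analogous to what the
// Ansible templates assemble before the Jenkins pipeline applies it).
import * as fs from "fs";

type NodeType = "carrier" | "analytics" | "multi-tenant"; // assumed node types

function buildAppConfig(nodeType: NodeType): Record<string, unknown> {
    // Assumed file layout: config/common.json plus config/<nodeType>.json
    const common = JSON.parse(fs.readFileSync("config/common.json", "utf8"));
    const overrides = JSON.parse(fs.readFileSync(`config/${nodeType}.json`, "utf8"));
    // Later values win: node-type-specific settings override the common ones.
    return { ...common, ...overrides };
}

// Example: write the merged config that would be applied to the cluster.
const merged = buildAppConfig("analytics");
fs.writeFileSync("build/app-config.json", JSON.stringify(merged, null, 2));
```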
- Why was the upgrade performed? Specific bugs, or just in case? And what problems occurred?
- Surya
- with respect to bugs, it is necessary to keep the code up to date; we were on Fabric 2.2.3, after which there are 2.2.4 through 2.2.10
- keep the code up to date with respect to the 2.2 versions
- JB
- was the latest version of Fabric related to the smaller max PDC size?
- Surya - the limit is already there due to CouchDB
- 2.5 has support
- Adnan
- re: PDC size, Fabric has a default value; it is static, not configurable
- do not want to go over the max
- a larger dataset in the PDC reduces performance due to larger transactions
- Aashish
- CouchDB 4 is planning to set a limit (8MB)
- future versions would encounter problems
- KenS - concern about the latest version?
- Yanko - target specific fixes, be cautious
- new versions may introduce other issues
- wondering what the specific bugs were
- KS - we don't want to get 2-3 versions behind; that becomes an impossible lift; stay within a certain number of Fabric versions
- YZ - minor fixes and features
- major versions require regression testing, etc.
- AS - did not update the minor version, just the latest patch version
- KS - 2.2.3 to 2.2.9
- AS - next would be 2.4; the feature we wanted (deleting PDC data by triggering functions) is in 2.5
- KS - carriers don't want data in the PDC
- the results of extractions, once in the report, should be in the PDC
- KS - didn't test for some sizing
- some unnecessary logging of what was going on inside, and it required beefing up machines
- ND is an outlier - passing 100k rows, different than stat reporting
- more than running-out-of-memory issues, there were problems with timeouts, restarting pods, etc.
- Aashish
- haven't touched performance tests
- code optimization around loops too
- Adnan
- most were taken care of by restructuring resources
- specifically when running the EP in Mongo, the resources were not enough
- after beefing up the resources, some loops and logs needed to be reprogrammed
- longer timeouts, etc. (see the sketch below)
- 1MM rows of data - give it the proper time, etc.
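On the "longer timeouts" point, a minimal sketch of where client-side timeouts can be raised with the Node fabric-network 2.x SDK; the connection profile path, identity, channel, chaincode, and transaction names are assumptions, and the values are illustrative rather than the ones used in the ND POC.

```typescript
// Hypothetical sketch: raising commit/query timeouts in fabric-network 2.x so
// large data-call transactions are not cut off by the default client limits.
import { Gateway, Wallets, DefaultEventHandlerStrategies } from "fabric-network";
import * as fs from "fs";

async function submitWithLongTimeouts(): Promise<void> {
    const ccp = JSON.parse(fs.readFileSync("connection-profile.json", "utf8")); // assumed path
    const wallet = await Wallets.newFileSystemWallet("./wallet");               // assumed wallet

    const gateway = new Gateway();
    await gateway.connect(ccp, {
        wallet,
        identity: "carrier-admin",                      // assumed identity label
        discovery: { enabled: true, asLocalhost: false },
        eventHandlerOptions: {
            commitTimeout: 600,                         // seconds to wait for the commit event
            endorseTimeout: 300,                        // seconds to wait for endorsements
            strategy: DefaultEventHandlerStrategies.MSPID_SCOPE_ANYFORTX,
        },
        queryHandlerOptions: { timeout: 300 },          // seconds for evaluateTransaction calls
    });

    try {
        const network = await gateway.getNetwork("defaultchannel");             // assumed channel
        const contract = network.getContract("openidl-cc-data-call");           // assumed chaincode name
        await contract.submitTransaction("SaveExtractionResult", "...");        // illustrative call
    } finally {
        gateway.disconnect();
    }
}
```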
- KS
- had to batch stuff in PDC
- AC
- when results went over the config value for PDC size
- chopped them up based on the config value and saved the results in the PDC one by one (see the sketch below)
- then taken for further processing by the next set of processes
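A hypothetical TypeScript chaincode sketch of that chunking pattern: split a large extraction result by a configured chunk size and store the pieces in the private data collection one by one. The contract, collection, key scheme, and 1MB chunk size are assumptions, not the ND POC code.

```typescript
// Hypothetical sketch: store a large extraction result in a PDC as a series of
// chunks so no single private-data write exceeds the configured size limit.
import { Context, Contract } from "fabric-contract-api";

const MAX_CHUNK_BYTES = 1 * 1024 * 1024; // assumed config value (1MB per chunk)

export class ExtractionResultContract extends Contract {
    public async saveExtractionResult(ctx: Context, dataCallId: string, resultJson: string): Promise<void> {
        const collection = "extractionResultsPDC";      // assumed collection name
        const data = Buffer.from(resultJson, "utf8");

        // Chop the payload into fixed-size chunks and save them one by one.
        const chunkCount = Math.ceil(data.length / MAX_CHUNK_BYTES);
        for (let i = 0; i < chunkCount; i++) {
            const chunk = data.subarray(i * MAX_CHUNK_BYTES, (i + 1) * MAX_CHUNK_BYTES);
            await ctx.stub.putPrivateData(collection, `${dataCallId}:chunk:${i}`, chunk);
        }

        // Record how many chunks exist so the reader can reassemble the result later.
        await ctx.stub.putPrivateData(
            collection,
            `${dataCallId}:chunkCount`,
            Buffer.from(String(chunkCount), "utf8"),
        );
    }
}
```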
- YZ
- were the performance problems due to the chaincode and how it processed data?
- AC
- the unique nature of the data itself; had to recalibrate a few things
- the comparison of data needed to be efficient
- YZ
- report processor - the report being created
- what were the problems on the network side - a fix or extra effort to make it work?
- AC
- saw timeouts
- transaction timeouts
- this is where we were trying to increase the resources
- made sure the node with the peer was not overloaded
- SL
- document memory issue
- processing larger data in the PDC caused CouchDB issues with the peers (size issue)
- some app components getting killed due to "out of memory"
- multiple cases where we see the processing taking longer and getting killed due to OOM
- YZ
- any analysis on why it was running out of memory?
- AC - found the issue
- openIDL has one status DB saving the status of each data call
- if a data call fails for some reason, the scheduler comes in and tries to finish the job, coming back every 30 or 45 minutes (see the sketch below)
- in the test environment
- some data calls did not complete - we saw the test environment doing transactions even though it was not running data calls
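To make the scheduler/status-DB behaviour concrete, a hypothetical sketch of a cron-style job that rechecks unfinished data calls on a fixed interval; the connection string, database and collection names, status values, and the 30-minute interval are assumptions about the setup, not its actual code.

```typescript
// Hypothetical sketch: a scheduler that re-reads the data-call status DB and
// retries unfinished jobs on a fixed interval (the behaviour described above).
import * as cron from "node-cron";
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017"); // assumed connection string

async function retryUnfinishedDataCalls(): Promise<void> {
    const statusDb = client.db("openidl").collection("datacall-status"); // assumed names
    // Any data call that never reached COMPLETED gets picked up again.
    const pending = await statusDb.find({ status: { $ne: "COMPLETED" } }).toArray();
    for (const dataCall of pending) {
        console.log(`retrying data call ${dataCall._id} (status: ${dataCall.status})`);
        // ...re-run extraction / report processing for this data call here...
    }
}

async function main(): Promise<void> {
    await client.connect();
    // Every 30 minutes; the notes mention the scheduler coming back every 30 or 45 minutes.
    cron.schedule("*/30 * * * *", () => {
        retryUnfinishedDataCalls().catch(err => console.error("scheduler run failed", err));
    });
}

main().catch(err => console.error(err));
```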
- YZ - the issue was on the application side, not Fabric
- JB
- Resource sizing for HDS in AWS is distinct from resource requirements for Fabric nodes, is it not?
- If the HDS AWS resources for a carrier are different from the Fabric node resources, the Fabric node resources would not need to be so large?
- the different resource sizing requirements were for the app requirements, not the node itself
- AS - for a typical data call, is the size of data less than ND?
- KS - the EP would be hundreds of records, small results at extraction
- PA - we will be somewhere in between; it comes back as JSON, a formatting data layer, a string of JSONs
- under a 1000-element JSON
- AS - that will cut down a lot of processing
- KS - stat reports are similar; will have other situations
- the MS POC might have a similar result set
- PA - much bigger; not just drivers, the whole state
- AS - number of carriers? per state?
- PA - 100+ separate entities recognized by NAIC
- 200 by AAIS
- less than 3k total
- KS - the way we are expecting to see it unfold: loading on behalf of carriers into a multi-tenant node
- PA - load testing to see how much we can fit
- lots of carriers in a single node
- 200+ carriers
- a lot of small mutuals; carrier node 4 has 100 carriers in the same table
- for a data call with aggregated info, won't have all the primary keys
- KS - individual nodes for fewer than 20 carriers for a while
- AS - multi-tenant node size for MS?
- KS - stat reporting will stretch the multi-tenant node
- PA - didn't have any of these carriers working with us
- KS - putting data in was a huge win
- KS - cool thing
- adding the HDS to the analytics node made it possible
- wasn't considered in the original design
- allowed the DOT to load data that could be used by the reporting processor
- the ability to quickly pivot and create different reports was a big win as well
- data quality issues
- bad VIN situation: vehicles not on the list because they were not insurable
- KS - how do we merge?
- YZ - open tickets in GitHub with the problem and solution and how it was approached; once approved, whoever has addressed it can create a branch to get it approved and merged
- KS
- how do we decide?
- YZ - focus on application-side issues; few problems
- processing data, etc.
- network side - going with the operator, so we probably won't use those changes; not relevant to the new openIDL
- ex: "performance problem with processing data during a data call" as the title of a GitHub issue
- AS - can move the things on Trello tracking this over as issues in GitHub
- JB - any reason to make a repo for ND app-specific stuff, as a baseline concept?
- KS - are we refactoring a repo? roadmap, backlog
- YZ - merge the fixes into main, then refactor later
- KS - grab the Trello items, pick what applies, merge and refactor
- PA - Mason and PA are working on ETL; are we making a new repo or staying where we are?
- KS - ETL will stick around - keep working there