【Responsibilities】
-
Build and manage the global SRE team, including recruitment, training of new hires, system operation, maintenance and coordination, and team culture building.
-
Improve cooperation mechanisms across teams, time zones and regions, and provide SRE solutions that fit actual business scenarios and priorities.
-
Take responsibility for SRE team organization and project management, guide routine SRE work to be more effective, and improve overall SRE efficiency.
-
Develop process specifications and plans for compliant access, configuration, disaster recovery and fault handling on the critical paths of overseas SRE services.
-
Continuously improve the core SRE capabilities of the OLAP engine in areas such as efficiency, cost, quality and security.
-
Be familiar with a database system or big data engine, such as a K/V store, ClickHouse, Spark, Doris or StarRocks, with an in-depth understanding of core modules, including execution plan optimization, execution and storage.
-
Develop automation, data visualization and automated monitoring processes to facilitate the optimization of the cloud-native OLAP engine infrastructure.
-
Drive the design and engineering of tools and platform solutions to optimize product engineering and operational efficiency.
-
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalations to resolve issues and minimize downtime.
【Requirements】
-
Bachelor's degree or above in Computer Science or a related technical discipline, and good English communication skills.
-
Familiar with SRE-related processes and with industry trends in SRE technology, with a strong ability to build an SRE system. 6+ years of SRE experience; big data or OLAP engine SRE experience is a plus.
-
Familiar with SRE technologies, including Kubernetes, Terraform, Ansible and Bash scripting.
-
Familiar with the cloud computing technologies of Amazon Web Services, Google Cloud Platform and other providers.
-
Expertise in the operation, deployment, troubleshooting, high availability and quality assurance of large-scale distributed systems, with a strong focus on stability and performance.
-
Possesses a strong sense of responsibility, a proactive team spirit, and a strong ability to comprehensively analyze and solve problems.
【Benefits】
-
Option to work remotely within Malaysia, Spain or Barbados, up to 100% (you choose), with the option to work abroad for up to 25 days per year
-
Competitive base and bonus
-
Flat structure with a positive team spirit
-
Multiple company overseas trips per year
-
Leisure activities such as sports, board games, etc.
【Location】 Suzhou, China / Johor Bahru, Malaysia / Dublin, Ireland / Santiago, Chile