Local business partnerships are cooperative relationships between nearby organizations (shops, service providers, nonprofits, schools, and civic groups) built to create mutual value through referrals, shared audiences, and joint projects. For business owners, the payoff is practical: warmer trust, steadier foot traffic, and reputational momentum that spreads across a town faster than any single campaign.
A fast snapshot
Lead with shared customers, not grand ideas.
Make the first ask tiny, time-boxed, and measurable.
Protect trust with clear roles + quick follow-through.
Review monthly and keep what works; let the rest go.
Why local partnerships break (and the fix)
Problem: Many partnerships begin with vague excitement ("We should collaborate!") and end with vague silence. No one owns the next step. The "win" is fuzzy. The calendar fills up.
Partnerships aren't only about being friendly; they're about communicating clearly, negotiating fairly, and leading through follow-through. Some business owners sharpen those skills through structured education, especially when they need a format that fits around work, family, and unpredictable weeks. If you're exploring that route, take a look at this for an overview of online business degree options and specializations designed for flexibility.
The 7-step partnership builder checklist
Pick one goal. Referrals, foot traffic, brand trust, recruitment: choose one.
List your "customer neighbors." Who serves your people before/after you do?
Qualify alignment quickly. Similar quality standards, compatible values, overlapping audience.
Propose one pilot. One event, one bundle, or one referral loop, time-boxed.
Put roles in writing. One paragraph: who does what, by when, with what budget.
Track one signal. A promo code, a referral card, a shared spreadsheet; keep it simple.
Debrief and decide. Continue, adjust, or end cleanly. Data over drama.
Partnership hygiene that keeps things from fading
A partnership can work and still drift, simply because nobody maintains it. Borrow a few simple rules:
Monthly touchpoint (15 minutes): what happened, what's next, who owns it.
Quarterly refresh: a new offer, a new shared story, or a new activation.
Respect bandwidth: avoid plans that require heroics to execute.
Protect trust: if you can't deliver, communicate early; silence is the real relationship killer.
One resource worth bookmarking before your next outreach
If you want a practical, low-pressure way to sharpen your partnership approach without paying for a consultant up front, SCORE is worth knowing. SCORE is a nonprofit resource partner associated with the U.S. Small Business Administration and offers free mentoring and educational workshops for small business owners. Here's how it helps specifically with local partnerships:
Pressure-test your pitch: Bring your draft outreach message and get feedback on clarity and tone.
Tighten your offer: Mentors can help you turn "Let's collaborate" into a concrete pilot with roles, timeline, and a simple measurement plan.
Choose better partners: A quick conversation can reveal whether you're chasing "cool" or chasing alignment (customer overlap + operational fit).
Build a repeatable rhythm: Mentorship is especially useful when you want a sustainable process (weekly outreach, monthly activation, quarterly review) without overcomplicating it.
FAQ
How many partnerships should I run at once? Start with 2-3 active pilots. If you can't reliably follow up, you're not building a network; you're collecting half-starts.
What if I'm worried a partner will steal customers? Choose partners with complementary services, not substitutes. Then state expectations plainly: shared benefit, separate businesses.
How do I measure success without overcomplicating it? Track one primary metric (referrals, redemptions, attendees) and one trust signal (repeat mentions, customer feedback, reviews mentioning the collaboration).
When should I walk away? If execution is consistently one-sided, communication stays sloppy, or customers complain about the partner's quality, step back quickly and politely.
Conclusion
Local partnerships work best when they're treated like small experiments: a clear goal, a simple pilot, and real follow-through. Start tiny, keep roles obvious, and maintain the relationship with lightweight check-ins so it doesn't depend on "whenever we have time." Over months, the compounding effect is real: customers begin to experience you as part of a trusted local network, not just a standalone business.
As 2026 begins, the fields of Data Science and Machine Learning (ML) continue to evolve at an unprecedented pace. Organizations across industries are investing heavily in intelligent systems to drive decision-making, optimize operations, and create competitive advantages. Whether you're a data professional, business leader, or aspiring technologist, understanding emerging trends is essential for staying relevant and future-ready.
This article highlights the key trends shaping the Data Science and ML landscape in 2026, offers real-world context, and provides actionable directions for preparation.
1. Democratization of AI and ML
In 2026, accessibility to advanced AI and ML tools is accelerating. Platforms like automated ML (AutoML), drag-and-drop model builders, and low-code environments enable non-experts to build and deploy predictive models. This trend reduces barriers to entry for smaller businesses and expands adoption beyond traditional data science teams.
Real-Life Example: A mid-sized retail chain uses AutoML tools to forecast inventory demand across locations without hiring a large data team, enabling better stock planning during peak seasons.
Preparation Tip: Familiarize yourself with leading AutoML environments (e.g., Google AutoML, H2O.ai) and focus on interpreting model outputs and business implications rather than just building models.
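To make that concrete, here is a minimal sketch with H2O's open-source AutoML, one of the tools mentioned above; the file name and the "units_sold" target column are hypothetical stand-ins for a retailer's sales history.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical historical sales export with a numeric "units_sold" target column.
frame = h2o.import_file("store_sales.csv")
train, test = frame.split_frame(ratios=[0.8], seed=42)

# Let AutoML search model families and hyperparameters within a time budget.
aml = H2OAutoML(max_models=10, max_runtime_secs=600, seed=42)
aml.train(y="units_sold", training_frame=train)

# Focus on interpreting the leaderboard and holdout performance, not the internals.
print(aml.leaderboard.head())
print(aml.leader.model_performance(test))
```

The point of the exercise is the last two lines: reading the leaderboard and holdout metrics, then translating them into a stocking decision, matters more than how each candidate model was built.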
2. Growth of Generative AI and Large Language Models
Generative AI, powered by large language models (LLMs), has transitioned from novelty to enterprise adoption. Use cases in 2026 span automated report writing, code generation, data augmentation, simulation modeling, and real-time decision support.
Real-Life Example: A financial services firm uses LLMs to automate compliance reporting, reducing manual effort and improving accuracy.
Preparation Tip: Learn how to fine-tune and evaluate LLMs safely, and understand ethical implications, especially around bias and data privacy.
3. Responsible AI and Explainability
With greater reliance on automated decision systems, there is heightened scrutiny on ethical AI practices. Explainability, the ability to interpret model decisions, is now a compliance and trust requirement. Regulatory frameworks in Europe, the US, and beyond emphasize transparency and accountability.
Real-Life Example: A healthcare provider deploying diagnostic models must provide explainable results to clinicians to justify treatment recommendations.
Preparation Tip: Study explainable AI (XAI) techniques like SHAP, LIME, and counterfactual explanations. Build documentation and model governance practices.
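As a small starting point with SHAP, the sketch below explains a generic tree-ensemble regressor trained on a public dataset; in practice you would substitute your own fitted model and data.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Stand-in for a deployed model; any fitted tree ensemble works the same way.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes per-feature contributions for each prediction.
explainer = shap.TreeExplainer(model)
sample = X.sample(200, random_state=0)  # explain a subset to keep the demo fast
shap_values = explainer.shap_values(sample)

# Global view: which features drive predictions, and in which direction.
shap.summary_plot(shap_values, sample)
```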
4. Edge AI and Real-Time Analytics
Edge computing, which processes data near the source rather than in centralized cloud servers, is becoming critical for latency-sensitive applications. Sensors, IoT devices, and autonomous systems use lightweight ML models for real-time decisioning.
Real-Life Example: In smart cities, traffic sensors process vehicle flow data on the edge to optimize signals without cloud round-trips.
Preparation Tip: Gain skills in edge-optimized ML frameworks and learn how to design models that balance performance, size, and energy efficiency.
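One illustrative route (by no means the only edge framework) is converting a Keras model to TensorFlow Lite; the tiny model below is a placeholder for whatever network you have actually trained.

```python
import tensorflow as tf

# A tiny stand-in model; in practice you would load your trained network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Convert with default optimizations (post-training weight quantization),
# trading a little accuracy for a much smaller, faster on-device model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("sensor_model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Edge model size: {len(tflite_model) / 1024:.1f} KB")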
5. Data Fabric and Unified Data Infrastructure
Data fabric architectures unify distributed data sources, metadata, and governance. This enables streamlined access, reduces silos, and supports consistent analytics across systems. Modern enterprise architectures adopt these fabrics to accelerate insights and reduce integration friction.
Real-Life Example: A global insurer uses a data fabric to harmonize claims data from multiple regions, improving cross-functional analytics and customer service insights.
Preparation Tip: Understand data mesh vs. data fabric paradigms, and focus on metadata management and data governance strategies.
6. Augmented Analytics and Decision Intelligence
Augmented analytics uses AI and ML to enhance data exploration, pattern detection, and insight generation. Decision intelligence goes further by modeling and recommending actions based on predicted outcomes, transforming analytics into a decision support discipline.
Real-Life Example: A logistics company uses augmented analytics dashboards to identify bottlenecks and simulate routing scenarios before implementing changes.
Preparation Tip: Learn tools that integrate natural language querying, automated insights, and prescriptive analytics.
7. Sustainability and Ethical Data Practices
Environmental sustainability and ethical data stewardship are now core priorities. Data centers, model training, and compute-intensive processes are evaluated for carbon footprints. Ethical considerations also extend to data privacy, algorithmic fairness, and community impacts.
Real-Life Example: An energy provider uses ML to optimize grid load and reduce emissions, aligning analytics with sustainability goals.
Preparation Tip: Familiarize yourself with sustainable AI practices and frameworks for ethical data handling.
Conclusion: Preparing for 2026
As Data Science and Machine Learning continue to mature, professionals must balance technical expertise with ethical judgment and domain understanding. Staying current with emerging tools, embracing multidisciplinary collaboration, and focusing on real-world impact will differentiate successful practitioners in 2026 and beyond. Whether you are building next-generation models or guiding organizational strategy, a future-focused mindset is essential.
Introduction: The Uncomfortable Question No One Wants to Ask
At some point in every professional's journey, especially in technology and technical fields, a quiet doubt begins to form. You send applications, polish your portfolio, rewrite your CV again and again, and still, nothing. No callbacks. No meaningful feedback. Just silence. The instinctive reaction is to blame the market, the economy, AI, or even nepotism. And while those factors exist, the uncomfortable truth is often closer to home. The hiring world has changed faster than most professionals are willing to admit, and many capable individuals are still operating with outdated assumptions about what "being qualified" actually means today.
The Skills You Learned Are Not the Skills Companies Pay For Anymore
In technical and technological jobs, knowledge expires faster than ever. A skill that was premium three years ago may now be baseline, or worse, obsolete. Many candidates still rely on certificates, degrees, or frameworks that once guaranteed employment, without realizing that employers now evaluate adaptability more than static expertise.
Modern companies are no longer impressed by long lists of tools. They want evidence of problem-solving under uncertainty. They want engineers who can think in systems, designers who understand business logic, and analysts who can translate data into decisions. Knowing how something works matters less than knowing why it matters and how it creates value.
The harsh reality is that many candidates are technically trained but strategically empty.
You're Competing With People Who Think Like Businesses
One of the most misunderstood shifts in hiring is that companies no longer hire "employees." They hire micro-businesses. Each candidate is evaluated as a unit of ROI. What value do you generate? How fast can you adapt? How much supervision do you need?
Candidates who still think in terms of job descriptions lose to those who think in terms of outcomes. A developer who says, "I build websites" competes poorly against one who says, "I help companies increase conversions and reduce bounce rates through performance-driven design." The second person speaks the language of impact, not tasks.
Hiring managers are overwhelmed. They don't want potential; they want leverage.
Technology Is No Longer the Differentiator - Thinking Is
Ironically, in a world obsessed with technology, technology itself has become cheap. AI can write code, design layouts, and automate workflows. What cannot be automated easily is judgment, context awareness, and decision-making under ambiguity.
Employers are quietly shifting toward professionals who can collaborate with AI rather than compete against it. Those who fear automation often try to defend their relevance by clinging to tools. Those who thrive learn to orchestrate systems, validate outputs, and make strategic calls.
If your value is defined only by execution, you are replaceable. If your value lies in interpretation, synthesis, and direction, you become essential.
Your Online Presence Is Probably Hurting You
In technical fields, your digital footprint is now part of the hiring process whether you like it or not. Recruiters look at GitHub, LinkedIn, portfolios, and even how you explain your work publicly. Silence is interpreted as stagnation.
Many capable professionals make the mistake of waiting until they are "perfect" before sharing insights or projects. Meanwhile, others with half the experience dominate visibility simply because they document their thinking. Employers don't expect perfection; they look for clarity, consistency, and learning velocity.
If your online presence does not tell a coherent story about who you are and how you think, you are invisible.
The Hidden Skill: Communication in a Technical World
Technical excellence without communication is invisible labor. Modern teams are cross-functional, remote, and fast-moving. Being able to explain complex ideas simply is no longer optional; it is a core technical skill.
Hiring managers increasingly reject candidates who "know everything" but cannot articulate trade-offs, justify decisions, or collaborate without friction. Clear communication is now a productivity multiplier, not a soft skill.
Those who master it accelerate. Those who ignore it stagnate.
Conclusion: The Market Is Not Broken - It's Evolved
The uncomfortable truth is that the job market is not unfair; it is unforgiving to stagnation. Technical roles now demand strategic thinking, adaptability, and visible value creation. The people getting hired are not necessarily smarter; they are more aligned with how modern organizations actually function.
If you feel invisible, it may not be because you lack talent, but because your professional narrative no longer matches reality. The moment you stop asking "Why won't they hire me?" and start asking "What problem do I solve today?" everything changes.
The market is listening. You just need to learn how to speak its language.
Most developers believe that meaningful income only comes from large startups, funded products, or years of continuous development. My experience contradicted that belief entirely. One quiet weekend, driven by a very practical problem I personally faced, I built a small Python tool with no business plan, no marketing strategy, and no expectation of profit. A few weeks later, that same tool was covering my rent consistently. Not because it was complex or revolutionary, but because it solved a painful, specific problem better than existing alternatives. This article is not a motivational fantasy. It is a technical and practical breakdown of how a modest Python tool became a sustainable income stream, and why this approach works far more often than people think.
The Problem That Sparked the Idea
The idea did not come from market research or trend analysis. It came from frustration. I was repeatedly performing the same manual task related to data processing and reporting, involving messy CSV files, inconsistent column naming, and repetitive transformations before analysis. Existing tools were either bloated, expensive, or required configuration overhead that exceeded the task itself. What I needed was speed, predictability, and automation, not a full platform. That gap between "too simple" and "too complex" is where many profitable tools are born. When you feel friction in your own workflow, you are often standing directly on a monetizable idea.
Building the Tool in One Weekend
The first rule I followed was ruthless simplicity. I scoped the project to do one thing exceptionally well. The tool ingested raw CSV or Excel files, applied predefined cleaning rules, validated schema consistency, and output ready-to-use datasets along with a concise quality report. Python was the obvious choice due to its ecosystem, readability, and distribution flexibility. I relied on familiar libraries such as pandas for data manipulation and argparse for a clean command-line interface. There was no UI, no cloud deployment, and no database. The entire tool lived as a local executable script, designed to fit naturally into existing workflows.
Below is a simplified excerpt that captures the spirit of the core logic, not the full implementation.
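(The snippet below is an illustrative sketch in that spirit, built on pandas and argparse as described above; the column names and cleaning rules are hypothetical rather than the original tool's.)

```python
import argparse
import pandas as pd

REQUIRED_COLUMNS = ["date", "customer_id", "amount"]  # hypothetical schema

def clean_file(input_path: str, output_path: str) -> None:
    # Load CSV or Excel (Excel support assumes openpyxl is installed).
    if input_path.lower().endswith((".xlsx", ".xls")):
        df = pd.read_excel(input_path)
    else:
        df = pd.read_csv(input_path)

    # Normalize inconsistent column naming.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Validate schema consistency before doing anything else.
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise SystemExit(f"Schema check failed, missing columns: {missing}")

    # Predefined cleaning rules: drop exact duplicates, coerce types.
    df = df.drop_duplicates()
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

    df.to_csv(output_path, index=False)

    # Concise quality report printed to the console.
    print(f"rows kept: {len(df)}")
    print(f"unparseable dates: {int(df['date'].isna().sum())}")
    print(f"unparseable amounts: {int(df['amount'].isna().sum())}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clean and validate a raw CSV/Excel export.")
    parser.add_argument("input")
    parser.add_argument("output")
    args = parser.parse_args()
    clean_file(args.input, args.output)
```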
This kind of code is not impressive on its own, and that is precisely the point. The value was not in clever algorithms, but in removing repeated mental and operational effort from a real process.
Turning the Tool Into a Product
The transition from script to product was mostly about packaging and positioning. I documented the tool clearly, added sensible defaults, error messages that spoke like a human, and a dry-run mode for safety. Then I uploaded it to a small niche marketplace where professionals already paid for productivity tools. I priced it modestly, intentionally low enough to be an impulse purchase, but high enough to signal professional value. There was no freemium tier, only a clear promise: save time immediately or do not buy.
What surprised me was not the first sale, but the lack of support requests afterward. When a tool does one thing well, users understand it instinctively. That clarity reduces friction, refunds, and maintenance costs.
Why People Were Willing to Pay
People do not pay for code. They pay for outcomes. This tool saved users hours every week, reduced errors in downstream analysis, and eliminated a task they actively disliked. For freelancers, analysts, and small teams, time reclaimed translates directly into money earned or stress avoided. Additionally, the tool respected their environment. It did not force cloud uploads, subscriptions, or accounts. It simply worked where they already worked. That respect for the user's workflow created trust, and trust converts far better than features.
Scaling Without Growing Complexity
As adoption increased, I resisted the temptation to expand features aggressively. Instead, I improved reliability, edge-case handling, and documentation. Minor enhancements were driven exclusively by real user feedback, not assumptions. Income grew not through virality, but through consistency. A small but steady stream of new users, combined with near-zero churn, was enough to reach the point where monthly revenue reliably covered rent. Importantly, maintenance time remained low, preserving the original advantage of the project.
Lessons Learned From the Experience
This project reinforced a counterintuitive truth. You do not need to build big to earn well. You need to build precisely. Tools that sit quietly inside professional workflows can outperform flashy applications if they remove friction at exactly the right point. Python, in this context, is not just a programming language. It is a leverage multiplier, enabling individuals to compete with teams by standing on mature libraries and simple distribution models.
Conclusion
The Python tool I built in a single weekend did not succeed because it was innovative or technically complex. It succeeded because it was honest, focused, and useful. It respected the user's time, solved a real problem, and stayed out of the way. If you are a developer looking for sustainable income, do not start by asking what is trendy. Start by asking what annoys you enough that you would gladly pay to never deal with it again. Build that. Polish it. Ship it. Sometimes, that is all it takes to change your financial reality.
I realized something was wrong the moment a private phrasing I only used inside ChatGPT appeared in a Google result preview. It was subtle, almost unbelievable, yet unmistakable. My ChatGPT history was not "leaked" in the dramatic sense, but it had become publicly accessible through a shared link that search engines could index. This is not a hypothetical risk, and it does not require hacking or a data breach. It happens quietly, through default sharing behaviors, cached pages, and a misunderstanding of what "share" actually means in AI tools. What follows is exactly what I did to contain the situation in under ten minutes, explained clearly so you can do the same before it becomes a real problem.
How ChatGPT Conversations Become Public Without You Noticing
ChatGPT does not randomly publish your conversations, but it allows you to generate shareable links. Those links are designed for collaboration, demos, or support, yet once they exist, they behave like any other public URL. If you post one in a public space, send it to someone who reposts it, or even leave it accessible long enough, search engines can crawl it. The danger is not malice; it is inertia. Google indexes what it can reach. If a shared chat link does not explicitly block indexing, it can surface in search results, sometimes with enough context to identify the author, the topic, or sensitive details embedded in the text.
What makes this especially risky is that many users treat AI chats as semi-private notebooks. We brainstorm business ideas, draft contracts, analyze data, and sometimes paste internal content. When those conversations gain a public URL, the boundary between private thinking and public publishing collapses instantly.
The Moment I Confirmed the Problem Was Real
I did not panic; I verified. I copied a unique sentence from the chat and searched it in an incognito browser. The result appeared. Not prominently, but enough to confirm indexing had already begun. This step matters because it tells you whether you are dealing with a theoretical risk or an active exposure. Once confirmed, speed becomes more important than perfection. Search engines move slowly to forget, but they index quickly.
The 10-Minute Fix That Actually Works
The first thing I did was revoke access at the source. Inside ChatGPT, I navigated to my conversation history and identified any chats that had sharing enabled. I disabled sharing immediately. This alone cuts off future access, but it does not erase what search engines already cached.
Next, I deleted the affected conversations entirely. This is uncomfortable if the content matters to you, but deletion ensures the source URL returns nothing. From a search engine's perspective, a dead page is the strongest signal to drop an index.
Then I moved to Google's removal workflow. I submitted a request to remove outdated content by pasting the exact URL of the shared chat. This does not require proof of ownership in this case; it relies on the page no longer existing. Within minutes, the status showed as "Pending," which is enough to stop further spread while Google processes the request.
To prevent recurrence, I audited my account settings. I turned off chat history where appropriate and made a personal rule never to generate share links for conversations containing drafts, credentials, client data, or internal reasoning. Finally, I ran a quick search for my name and common phrases I use, just to ensure no other artifacts were floating around.
All of this took less than ten minutes because the goal was containment, not perfection.
What This Incident Taught Me About AI Privacy
The core lesson is that AI tools behave like publishing platforms the moment a URL exists. The mental model most users have, that chats are ephemeral and private by default, is outdated. If you are a founder, consultant, analyst, or creator, your prompts are intellectual property. Treat them with the same care you would treat a Google Doc or a Notion page. Convenience features are not privacy features, and silence from a tool does not mean safety.
This is especially relevant for professionals who use ChatGPT to refine positioning, pricing, legal language, or strategy. A single indexed conversation can expose thinking that was never meant to leave the room.
Practical Safeguards I Use Going Forward
I now assume every shareable surface can become public. I separate exploratory thinking from sensitive work, avoid pasting raw data unless necessary, and periodically review my chat history the same way I review cloud storage permissions. This mindset shift matters more than any single setting, because tools change faster than policies, and habits are your real defense.
Conclusion
If your ChatGPT history ever appears on Google, it is not the end of the world, but it is a clear signal to act immediately. Disable sharing, delete the source, request removal, and tighten your defaults. Ten focused minutes are enough to stop the spread if you move quickly. The real value of this experience is not the fix itself, but the awareness it creates. AI is powerful, but only if you stay in control of where your thinking lives and who can see it.
If you found this useful, share your experience or questions. The more openly we discuss these edge cases, the safer we all become.
When a customer reaches out with a question ("Where's my order?" or "Can you update my subscription?"), the speed and accuracy of your response can make or break their loyalty. For small businesses, efficient data management isn't just an operational nice-to-have; it's the difference between repeat buyers and one-time visitors.
TL;DR
Effective data management enables your team to locate the right information faster, respond to customers more promptly, and reduce errors. Organize, secure, and unify your customer data to boost satisfaction and loyalty, and to protect your reputation.
The Customer Chaos Problem
Every small business eventually hits this wall:
Customer details live in five different spreadsheets.
Sales records don't match inventory.
Email systems and CRM tools don't talk to each other.
When data gets messy, response times slow down. Customers notice. And once trust erodes, it's hard to rebuild.
How to Streamline Your Customer Data
Centralize everything: use one hub to connect sales, support, and inventory data.
Automate updates: sync tools like HubSpot or Zoho CRM for real-time records.
Establish access levels: protect sensitive data with user roles.
Set review routines: audit your data monthly for duplicates or errors.
Document workflows: keep a simple record of where each dataset lives.
Data Management in Action
Let's visualize how smart data handling improves customer service outcomes:
Step | Example | Result
Collect | Integrate customer purchase history | Support team sees order details instantly
Organize | Tag data by customer stage | Personalized responses at every touchpoint
Secure | Encrypt stored info | Builds customer trust
Analyze | Spot repeat issues | Prevents future complaints
Share | Team dashboards | Collaboration without chaos
Strong Foundations with Data Governance
Behind every efficient service operation lies responsible data governance, the discipline that keeps information accurate, protected, and organized. When businesses embed governance into daily systems and workflows, data becomes a growth engine. Without it, small companies risk security gaps, compliance missteps, and unnecessary inefficiencies that frustrate customers and staff alike.
FAQs
Q1: Isn't this just for large enterprises? No. Small businesses benefit even more because they rely on agility. Good data practices let you compete with bigger players.
Q2: What tools are affordable for small teams? Look at Trello, Airtable, or ClickUp for simple, scalable management.
Q3: How often should I back up my customer data? Weekly at minimum, daily for businesses with frequent transactions.
Q4: What's the first step if my data is a mess? Start by cleaning one dataset: your customer list. Merge duplicates and fill in missing fields.
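If your customer list lives in a CSV export, a short pandas pass can handle that first cleanup; the sketch below assumes hypothetical email, phone, and last_updated columns.

```python
import pandas as pd

# Hypothetical customer export; column names are illustrative.
customers = pd.read_csv("customer_list.csv")

# Normalize emails so near-duplicates collapse onto one key.
customers["email"] = customers["email"].str.strip().str.lower()

# Keep the most recently updated record for each email address.
customers = (customers.sort_values("last_updated")
                      .drop_duplicates(subset="email", keep="last"))

# Fill in missing fields with explicit placeholders instead of blanks.
customers["phone"] = customers["phone"].fillna("unknown")

customers.to_csv("customer_list_clean.csv", index=False)
print(f"{len(customers)} unique customers saved")
```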
Pro Tip: Spotlight โ Microsoft Power BI
If you're tracking customer trends, Microsoft Power BI can turn raw sales or feedback data into clear visuals that reveal hidden service patterns. A few hours of setup can pay off in months of insight.
Simple Checklist: Is Your Data Helping or Hurting?
Can you find any customer's info in under a minute?
Is your data stored securely and backed up regularly?
Do your systems sync automatically across departments?
Are you tracking customer feedback trends?
Have you trained your staff on privacy best practices?
If you can't check all five, your data might be slowing you down.
Glossary
CRM (Customer Relationship Management): Software that stores and tracks customer interactions.
Data Governance: Policies ensuring data is accurate, secure, and properly used.
Centralization: Combining data from different tools into one location.
Encryption: A method of protecting data by converting it into unreadable code.
Workflow: The series of tasks that complete a process or service cycle.
Bonus: Resource Roundup for Small Businesses
Google Workspace - unified tools for communication and file management.
Tableau - data visualization for customer insights.
Conclusion
Efficient data management doesn't just save time; it builds relationships. For small businesses, that means every second counts. Organize your data, protect it, and let your team work faster and smarter. Customers will feel it, and they'll keep coming back.
In the modern data-driven landscape, the true challenge facing data scientists is no longer how to store, process, or model information; technology already achieves that at scale. The real challenge is understanding the human forces behind the data. Data itself, no matter how large or beautifully structured, is silent until someone interprets the incentives, decisions, and constraints that shape it. This is exactly where the economist's mindset becomes indispensable. Economists spend their careers studying why people behave the way they do, how choices are shaped under scarcity, how incentives influence actions, and how systems evolve over time. When a data scientist adopts this mode of thinking, analysis becomes more than prediction; it becomes insight. And insight is what drives strategic, meaningful decisions in the real world.
Understanding Human Behavior Beyond Patterns
Data science often revolves around identifying patterns: detecting churn, forecasting demand, predicting risk. But patterns alone cannot explain the deeper question: Why do people behave this way in the first place? Economists approach behavior through the lens of preferences, constraints, motivations, and expectations. They understand that every individual acts under a unique combination of incentives and limitations. When a data scientist incorporates this style of thinking, the data stops looking like static snapshots and begins to resemble a living story about human behavior. Instead of treating anomalies as numerical errors, the data scientist begins to explore the psychological and economic factors that might produce such deviations. This transforms the analysis into something more sophisticated, more realistic, and far more useful.
A Shift from Correlation to Causation
One of the most critical contributions of economics to data science is the relentless pursuit of causality. While machine learning models can uncover powerful correlations, economists dig deeper to identify what actually drives outcomes. This mindset protects data scientists from misinterpreting relationships that appear significant in the data but hold little meaning in reality. When economic reasoning guides an analysis, the data scientist becomes more critical, more skeptical, and more aware of potential confounders. Instead of taking patterns at face value, they explore the mechanisms that produce those patterns. This often leads to solutions that are more stable, more strategic, and more aligned with how people and systems truly operate.
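A tiny simulation makes the danger concrete. The sketch below uses synthetic data and hypothetical variable names: a strong raw correlation between spend and sales almost disappears once the hidden confounder is included in the regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
confounder = rng.normal(size=n)                 # e.g., seasonality, unobserved in the naive model
spend = 2.0 * confounder + rng.normal(size=n)   # "marketing spend"
sales = 3.0 * confounder + rng.normal(size=n)   # "sales": driven by the confounder, not by spend

# Naive regression: spend looks like a strong driver of sales.
naive = sm.OLS(sales, sm.add_constant(spend)).fit()

# Conditioning on the confounder: the apparent effect of spend collapses.
adjusted = sm.OLS(sales, sm.add_constant(np.column_stack([spend, confounder]))).fit()

print("naive slope on spend:   ", round(naive.params[1], 2))     # roughly 1.2, spuriously large
print("adjusted slope on spend:", round(adjusted.params[1], 2))  # close to zero
```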
Trade-Offs and the Discipline of Decision-Making
Economists think in trade-offs because every meaningful decision, whether made by a company or a customer, involves sacrificing one benefit to gain another. Data scientists who internalize this idea approach their work with greater strategic clarity. They stop chasing "perfect" accuracy and start understanding the cost of every improvement. A model that is slightly more accurate but significantly more expensive or harder to maintain may not be worth it. A prediction that requires invasive data collection may reduce user trust. A product change that improves engagement may create hidden frictions elsewhere. This trade-off mentality introduces a level of maturity that purely technical thinking often overlooks. It aligns the data scientist's work with real-world decision-making, where constraints are ever-present and resources are never infinite.
Seeing Interconnected Systems Rather Than Isolated Numbers
Economics teaches that individuals, markets, and institutions are interconnected systems, not isolated units. Data scientists who adopt this worldview begin analyzing problems within broader contexts. They recognize how a change in one part of a system creates ripple effects in others. This systems-level thinking is invaluable when working on marketplace platforms, recommendation systems, pricing engines, supply chain forecasting, and any domain where multiple agents interact dynamically. Instead of building static models that assume the world remains unchanged, the economist-minded data scientist anticipates how people and systems will adapt. This ability to foresee second-order effects dramatically strengthens the relevance and longevity of analytical solutions.
Building Models That Reflect Real Human Behavior
Machine learning often imposes mathematical convenience on problems that are fundamentally human. Economic reasoning helps restore balance by grounding models in real behavioral principles: people maximize utility, respond to incentives, suffer from biases, act under uncertainty, and adapt to changing environments. By incorporating economic concepts such as utility theory, behavioral economics, information asymmetry, and game theory, data scientists build models that behave more reliably in real markets and real decisions. The result is not only more accurate predictions but also more interpretable and defensible models. They better capture how customers evaluate options, how employees react to policy changes, and how users respond to pricing or recommendations. In short, models become more realistic because they reflect the complexity of human nature rather than the simplicity of mathematical assumptions.
Communicating Insights with Clarity and Strategic Impact
Economists excel at distilling complex realities into clear, actionable insights. Their communication style emphasizes the "why" behind behaviors, the "because" behind decisions, and the "what if" behind each strategic scenario. When data scientists adopt this communication style, their influence multiplies. Instead of presenting outputs and metrics, they articulate stories about behavior, incentives, and strategic outcomes. Leaders respond not to predictions alone, but to interpretations that reveal risks, opportunities, and trade-offs. The data scientist who communicates with economic clarity becomes a strategist, not just a technician: someone whose insights shape policy, guide product development, and influence high-level decisions.
Embracing Uncertainty as a Natural Part of Decision-Making
Economics is built on the reality that uncertainty can never be eliminated, only understood and managed. Markets shift, people change, shocks occur, and expectations evolve. When data scientists adopt an economic approach to uncertainty, they stop fearing it and start analyzing it. They use concepts like expected utility, rational expectations, marginal decision-making, and risk tolerance to frame uncertainty in a structured, understandable way. This leads to more resilient models, more thoughtful forecasts, and a healthier relationship between confidence and doubt. The result is analytical work that does not pretend to be perfect but is intentionally designed to hold up under the unpredictability of real-world environments.
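As a minimal illustration of structuring uncertainty this way, the sketch below computes an expected value and a certainty equivalent for a hypothetical launch decision under a simple CARA utility; the payoffs, probabilities, and risk-aversion parameter are all illustrative.

```python
import numpy as np

# Hypothetical launch decision: three scenarios with dollar payoffs and probabilities.
payoffs = np.array([150_000.0, 20_000.0, -60_000.0])
probs = np.array([0.5, 0.3, 0.2])
risk_aversion = 1e-5  # illustrative CARA risk-aversion parameter

def cara_utility(x):
    # Exponential (CARA) utility: larger risk_aversion penalizes downside outcomes more.
    return 1.0 - np.exp(-risk_aversion * x)

expected_value = float(probs @ payoffs)
expected_utility = float(probs @ cara_utility(payoffs))
# Certainty equivalent: the guaranteed amount worth as much as the risky bet.
certainty_equivalent = -np.log(1.0 - expected_utility) / risk_aversion

print(f"expected value:       ${expected_value:,.0f}")
print(f"certainty equivalent: ${certainty_equivalent:,.0f}")
```

The gap between the two numbers is the point: a risk-averse decision maker treats the risky launch as worth considerably less than its raw expected value.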
Conclusion
To think like an economist is to elevate data science into a discipline that understands the invisible forces driving human decisions. It adds depth, clarity, and realism to the technical power of models and algorithms. When data scientists learn to interpret incentives, anticipate trade-offs, appreciate systemic interactions, and communicate uncertainty with confidence, they move far beyond the limits of traditional analytics. They become advisors, strategists, and decision-shapers. In a world overflowing with data but starved for meaning, the data scientist who embraces economic thinking becomes uniquely equipped to make sense of complexity. They do more than predict the future; they understand the pressures that create it.
YouTube's 2025 AI policy arrived like a sudden earthquake, shaking creators across every niche from education to gaming to faceless channels. Many creators feared demonetization, content removal, or a complete reset for their channels. Yet the truth is more strategic and far more exciting. The updates are strict, but they also open an entirely new era where creativity, transparency, and storytelling matter more than ever. If you understand how the new rules work and adapt early, your channel can grow faster than channels that ignore or resist these changes.
This article walks you through every major YouTube AI rule for 2025 in a narrative, practical way and gives you a step-by-step roadmap to not only survive but grow stronger in this new environment.
YouTube's 2025 AI Policy: What Actually Changed
1. Mandatory Disclosure for AI Content
YouTube now requires creators to clearly label:
AI generated voices
AI generated humans or faces
AI generated environments
Deepfakes
Scripted content fully produced with AI
Any reconstructed or "synthetic" scenes
This is no longer optional. If you avoid disclosure, YouTube may:
Reduce reach
Remove your video
Give channel warnings
Disable monetization
However, disclosure does not harm your reach if you do it correctly. In fact, transparency boosts trust, and that leads to more watch time.
2. Stricter Rules on Human Representation
YouTube now protects real individuals from being impersonated. You cannot:
Use AI to recreate a celebrity voice without labeling
Create fake statements through synthetic actors
Make AI avatars that look like real people without permission
Creators using avatars must now clarify whether the character is:
AI generated
A fictional representation
A digital character voiced by the creator
This rule protects viewers but also pushes creators toward stronger storytelling and clearer branding.
3. New Copyright Expectations
AI generated content must still respect copyrights. For example, you cannot:
Train a model on copyrighted songs and reuse outputs
Recreate a movie soundtrack with AI
Generate landscapes or scenes based on protected films
YouTube's new detector can now spot these patterns even if the video is entirely AI created. The platform will automatically restrict monetization when the risk is high.
How Smart Creators Can Win Under the 2025 Rules
The creators who grow fastest in 2025 will be those who do not fight the new guidelines but instead build content strategies around them. Here is how.
1. Use AI for Brainstorming, Not Final Output
Creators who rely on fully AI generated videos will struggle with identity, viewer loyalty, and monetization consistency. Instead, use AI tools for:
Script ideas
Content outlines
Video structures
Research summaries
Visual concepts
But add your own voice, camera presence, or commentary on top. Even faceless channels can do this by keeping a human layer such as:
Personal narration
Real world examples
Your own storytelling
Your own editing style
This hybrid model will dominate in 2025.
2. Build Your Signature Voice or Format
YouTube is now rewarding originality more than production value. Your competitive advantage is not AI visual quality but your unique:
Tone
Style
Pacing
Humor
Insight
Storytelling pattern
Even faceless creators can have a recognizable personality through writing and voice delivery.
3. Use AI Tools to Speed Up Production Without Triggering the Policy
Here is what is still completely safe:
AI editing assistants
AI thumbnail enhancement
AI noise removal
AI translation
AI captioning
AI B-roll for nonhuman scenes
AI color grading
None of these require disclosure because they modify your original work instead of replacing it.
This is where creators will explode in productivity in 2025.
4. Be Very Clear with Disclosure Without Ruining the Viewer Experience
The biggest fear creators have is that disclosure will make their content feel cheap. Here is a simple formula to avoid that:
Place the AI disclosure at the very end of the description or in a small line at the start of the video.
Examples:
"Some visual elements in this video were created using AI tools."
"Voice assistance provided by AI narration software."
"Portions of this scene contain AI generated environments."
Short, clean, and professional.
5. Lean Into Formats YouTube Loves in 2025
YouTube's algorithm in 2025 is pushing:
Tutorials
Mini documentaries
Short storytelling videos
Explainer style videos
Personal commentary
Reaction and analysis
Gaming with deep narrative
Real world skill-based content
Creators who mix human insight with AI efficiency will dominate these niches.
A Real Life Example: How I Started Generating Deep Features Automatically
At the beginning of 2025, I was experimenting with creating dozens of short educational videos every week. Manually scripting each one was painful and slow. So I built a personal workflow that uses AI tools to generate deep, structured features for each topic automatically. These features included narrative flow, key talking points, supporting metaphors, contextual examples, and alternative phrasings.
Instead of giving me a finished script, the model gave me a rich, multi-layer map. From that map I could quickly build a human-sounding, professional script in my own style. This approach made my videos more detailed and more coherent while still remaining authentic and fully compliant with YouTube's policy. AI became my assistant, not my replacement.
Conclusion
2025 is not the year AI content dies on YouTube. It is the year lazy AI content dies and meaningful, creator-led content wins. If you embrace transparency, originality, and hybrid creation, your channel will grow faster than ever before. The creators who succeed in this new era are not the ones who fight the rules. They are the ones who evolve before everyone else does.
A powerful shift is taking place inside the world of data science. The transformation is not driven only by larger datasets or stronger algorithms but by a fundamental change in the process that shapes every machine learning model: feature engineering. With the arrival of automated feature engineering powered by artificial intelligence, data teams now craft deep, meaningful features at speeds previously unimaginable. Performance increases, workflows accelerate, and the discovery of hidden patterns becomes vastly more accessible.
The Core Importance of Feature Engineering
Feature engineering has always been the heart of machine learning. The quality of the features determines how deeply a model can understand the patterns inside the data. For years, analysts relied on domain knowledge, logical reasoning, and experimentation to build transformations manually. While effective, manual feature engineering is slow and limited by human intuition. As data grows more complex, the need for a scalable, intelligent solution becomes undeniable.
How AI Transforms Feature Engineering
Artificial intelligence automates the creation, transformation, and selection of features using techniques such as deep feature synthesis, automated encodings, interaction discovery, and optimization algorithms capable of exploring massive feature spaces. Instead of days of manual work, AI generates hundreds or thousands of sophisticated features in minutes. This automation provides creativity beyond human possibility and uncovers deeper relationships hidden in the data.
A Glimpse Into How I Started Generating Deep Features Automatically
My journey with automated deep feature generation began when I was working on a dataset filled with layered relationships that manual engineering simply could not capture efficiently. I found myself repeating the same transformations and exploring combinations that consumed endless hours. That experience pushed me to experiment with automated tools, especially Featuretools and early AutoML platforms. Watching an engine build layered, multi-level deep features in minutes, many of which were more powerful than what I had manually produced, changed everything. From that moment, automation became an essential part of every project I handled, turning the machine into a creative partner that explores the full depth of the data.
Where Automated Feature Engineering Fits Into the Workflow
In the typical pipeline, raw data flows through data preparation, then automated feature engineering, and finally model training and evaluation; this gives a clear mental model of where automation sits in the process.
Code Example: Deep Feature Synthesis in Python
Below is a simple but clear example that demonstrates how automated feature engineering works using the Featuretools library.
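(A minimal runnable sketch in that spirit, assuming the Featuretools 1.x API; the customers and orders tables are tiny illustrative stand-ins for real transactional data.)

```python
import featuretools as ft
import pandas as pd

# Hypothetical transactional data: one row per customer, one row per order.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-20"]),
})
orders = pd.DataFrame({
    "order_id": range(1, 7),
    "customer_id": [1, 1, 2, 2, 3, 3],
    "amount": [120.0, 80.0, 35.0, 60.0, 200.0, 15.0],
    "order_date": pd.to_datetime(["2024-02-01", "2024-03-15", "2024-03-01",
                                  "2024-04-10", "2024-04-02", "2024-05-05"]),
})

# Register the tables and the one-to-many relationship between them.
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders,
                      index="order_id", time_index="order_date")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Deep Feature Synthesis: aggregate each customer's order history automatically.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["sum", "mean", "count"],
                                      trans_primitives=["month", "weekday"])
print(feature_matrix.head())
```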
This snippet creates automatic aggregated features such as:
total purchase amount
average order value
number of orders
time-based transformations
All generated in seconds.
The Real Advantages of Automated Feature Engineering
Automated feature engineering accelerates development time, expands analytical creativity, and enhances the quality of machine learning models. It lifts the burden of repetitive transformations, improves interpretability, and empowers smaller teams to achieve expert-level results. The model accuracy improvements can be dramatic because the system explores combinations far beyond human capacity.
Real Life Example From Practice
Consider a retail company preparing a churn prediction model. Manual engineering reveals basic insights such as purchase frequency, product preferences, and loyalty activity. Automated feature engineering uncovers deeper dimensions like seasonal patterns, rolling window behaviors, discount sensitivity, and previously unseen interactions between product groups. These discoveries reshape the model entirely and significantly boost predictive power.
How Automated Feature Engineering Fits Into Modern Workflows
Within modern pipelines, automated feature engineering sits between data preparation and model training. It reduces iteration loops, simplifies experimentation, and stabilizes performance. When integrated with cloud-based AutoML systems, the process becomes almost fully end-to-end, allowing teams to move directly from raw data to validated predictions with minimal friction.
The Future of Automated Feature Engineering
Future systems will understand human input more naturally, interpret business context, and generate features aligned with specific industry logic rather than generic transformations. AI will evolve into an intelligent assistant that learns from project preferences and produces domain-aware feature engineering strategies. This shift will further elevate the speed and quality of predictive analytics.
Conclusion
Automated feature engineering marks a major milestone in the evolution of machine learning. It empowers teams to discover patterns hidden deep within their data, boosts the performance of predictive models, and removes the limits of traditional manual processes. By embracing automation, data professionals free themselves to focus on strategic insights, creative exploration, and impactful decision making.
Modern warehouses aren't just storage centers; they're the heartbeat of efficient, scalable, and resilient supply chains. Yet many organizations still run on outdated systems that rely heavily on manual tracking, reactive maintenance, and fragmented data. The smartest investments today focus on technology, software, and organizational innovation that align teams, automate processes, and increase safety and ROI.
TL;DR
To boost warehouse performance, businesses should invest in:
Smart logistics systems that connect assets and workflows in real time
Data-driven platforms for forecasting and resource allocation
Robotics, automation, and safety-first design
Training and workflow standardization that support operational excellence
These upgrades improve speed, safety, and accuracy, and ultimately deliver higher profitability.
Why Modern Warehouses Need Smarter Investment
The logistics landscape has changed. Warehouses now must handle:
Higher SKU diversity
Shorter delivery windows
Real-time visibility expectations
Old models can't keep up. The result is costly downtime, wasted space, and avoidable errors. Companies investing in integrated, intelligent warehouse systems are outperforming competitors on both productivity and cost control.
Data-driven insights are reshaping how warehouses operate, shifting them from reactive problem-solving to proactive, strategic decision-making. When leaders can see patterns in their operations through clear, connected data, they can anticipate demand shifts, prevent stockouts, and allocate resources more efficiently. With real-time analytics, warehouses evolve from chasing yesterday's issues to planning for tomorrow's performance.
Data World supports this transformation by providing analytics consulting and business-intelligence services that help teams optimize inventory levels, forecast demand with greater precision, and streamline logistics workflows. The result is a smarter, more agile warehouse network that operates with confidence and clarity at every level of the supply chain.
These systems allow leaders to track KPIs continuously and respond to operational changes immediately.
3. Data-Driven Decision Systems
Turning warehouse data into actionable intelligence is a competitive differentiator. By leveraging analytics tools like Tableau or Power BI, managers can forecast demand, detect inefficiencies, and plan smarter.
Data-driven visibility turns warehouse management from a reactive function into a strategic growth driver.
4. Edge-Enabled Logistics Infrastructure
Investing in smart logistics technologies, such as real-time data systems and edge computing, allows companies to track assets and automate decisions closer to the action. Edge systems reduce latency, increase accuracy, and enable predictive maintenance.
The impact of smart logistics edge computing lies in combining local processing with industrial resilience, delivering exceptional performance in tough environments like high-bay warehouses and multi-node distribution networks.
5. Safety & Ergonomics That Pay Back Fast
Safer work is faster work. Start with ergonomic lifts, better workstation heights, and simple traffic rules for forklifts and AMRs. Add wearable safety tech that nudges better movement and flags risky lifts before they become injuries. A practical option is StrongArm's FUSE wearables, which help teams reduce strain and coach safer techniques on the floor. Pair that with quick-hit fixes (anti-fatigue mats, lighter totes, and clear pick labeling) and you'll cut lost-time incidents while keeping throughput steady.
6. Training and Organizational Design
Technology alone isn't enough. Workforce alignment, through clear SOPs, data literacy, and performance incentives, amplifies every other investment. Platforms like Udemy Business help train teams to operate advanced systems effectively.
How to Get Started: A Step-by-Step Guide
Step | Focus Area | Key Action | Expected Result
1 | Assess Operations | Conduct a warehouse audit | Identify performance bottlenecks
2 | Prioritize Upgrades | Rank investments by ROI and risk reduction | Maximize budget impact
3 | Deploy Smart Tech | Implement automation, sensors, and WMS | Improve efficiency and data flow
4 | Integrate Data Systems | Link analytics with real-time operations | Enable predictive insights
5 | Train & Monitor | Build continuous improvement teams | Sustain long-term gains
Investment Readiness Checklist
Before investing, confirm that your organization:
Has reliable Wi-Fi and edge-capable hardware
Uses standardized data across departments
Has leadership buy-in for cross-functional integration
Tracks warehouse KPIs regularly
Maintains an active safety and training program
Product Spotlight: Blue Yonder Warehouse Management
If you need end-to-end control across inventory, labor, and automation, Blue Yonder Warehouse Management is a strong contender. It unifies slotting, tasking, and real-time execution so managers can orchestrate work across people and machines, spot bottlenecks early, and keep orders flowing. Many teams use it to standardize processes across multiple sites, improve pick accuracy, and shorten cycle times without ripping out existing equipment.
FAQ
Q: What is the best starting point for digital transformation in warehousing? Begin with data visibility: implement a WMS and integrate it with analytics platforms before adding automation.
Q: Are robotics solutions viable for small or mid-sized warehouses? Yes. Modular automation systems now scale affordably, and many robotics vendors offer subscription-based pricing.
Q: How long does it take to see ROI from warehouse tech upgrades? Most businesses report measurable gains in 6 to 12 months, especially from automation and analytics integration.
Warehouses are no longer static back-end facilities; they're dynamic intelligence hubs that can make or break customer experience. The most successful operators invest in automation, real-time visibility, and workforce empowerment. By prioritizing these areas, businesses not only increase efficiency and safety but also future-proof their operations for the next era of logistics.
Geospatial data is no longer limited to maps and traditional GIS systems. Today, Python provides a bridge connecting GIS expertise with the power of data science. Professionals who understand spatial data and can manipulate it programmatically are in high demand. The path from GIS to data science requires not just learning new Python libraries, but also understanding spatial thinking, analytics, and automation.
This article presents nine essential books that will strengthen your Python geospatial skills and guide you in becoming a full-fledged GIS data scientist. Each book is carefully selected to cover theory, practical exercises, automation, and advanced spatial analysis, giving you a clear roadmap to excel in GIS with Python.
1. Python Geospatial Analysis Cookbook by Michael Diener
This book is perfect for practitioners looking for a hands-on approach. It offers a variety of practical recipes that cover data formats, shapefiles, raster data, coordinate reference systems, and common spatial operations. Each chapter focuses on solving real-world GIS problems while teaching Python techniques. You will learn how to read, process, and analyze spatial datasets, automate repetitive tasks, and visualize results using popular libraries like Geopandas and Matplotlib. The step-by-step approach allows GIS analysts transitioning from desktop software to gain confidence in coding efficiently while seeing immediate results. This makes it an ideal starting point for anyone wanting to build a strong foundation in Python geospatial analysis.
2. Learning Geospatial Analysis with Python by Joel Lawhead
This book is a comprehensive guide for beginners and intermediate GIS professionals. It starts by explaining the basic principles of Geographic Information Systems, including projections, coordinate systems, and spatial data types. Then, it introduces Python programming for spatial analysis. You will explore automation of GIS tasks using libraries like Geopandas, Rasterio, Shapely, and Fiona. The book provides exercises to manipulate vector and raster datasets, perform spatial joins, and create maps programmatically. The clear connection between GIS theory and Python implementation helps build a solid understanding for anyone aiming to automate GIS workflows and prepare for data science applications in spatial contexts.
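To give a flavor of the workflows the book automates, here is a small hedged sketch of a programmatic spatial join with Geopandas; the file names and the district_name column are hypothetical.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical input layers: shop points and district polygons.
shops = gpd.read_file("shops.geojson")
districts = gpd.read_file("districts.shp").to_crs(shops.crs)  # align coordinate reference systems

# Spatial join: attach the attributes of the containing district to each shop.
joined = gpd.sjoin(shops, districts, how="left", predicate="within")

# Programmatic summary and a quick map, the kind of task the book automates.
print(joined.groupby("district_name").size().sort_values(ascending=False))
joined.plot(markersize=5, figsize=(8, 8))
plt.show()
```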
3. Mastering Geospatial Analysis with Python by Silas Toms and Paul Crickard
This advanced book is for those who already understand the basics of Python GIS. It delves into network analysis, spatial databases, and the development of web-based geospatial applications using Flask and Leaflet. You will learn to integrate Python scripts with PostGIS databases, perform advanced spatial queries, and develop interactive spatial dashboards. The book emphasizes combining Python programming skills with GIS knowledge to tackle complex problems in transportation, urban planning, and environmental modeling. It encourages readers to think like spatial data scientists, moving from simple map creation to data-driven decision making using Python as the main tool.
4. Geoprocessing with Python by Chris Garrard
Focused on automation, this book shows how to streamline GIS workflows both inside traditional desktop GIS environments and in open-source Python ecosystems. It explains ArcPy, OGR, Shapely, and Fiona in depth, teaching readers to automate repetitive tasks like geocoding, spatial joins, and map production. You will gain practical skills for cleaning and transforming large spatial datasets, preparing them for analysis or visualization. It is particularly useful for GIS professionals who want to reduce manual work and integrate Python into everyday GIS operations, saving time and increasing accuracy in projects.
5. Python for Geospatial Data Analysis by Bonny P McClain
This book combines spatial statistics with data science thinking. It moves beyond map creation to predictive modeling and data-driven insights using libraries like Pandas, Scikit Learn, and Geopandas. You will learn to calculate spatial autocorrelation, perform clustering and regression on geospatial datasets, and integrate spatial variables into machine learning models. It is ideal for GIS analysts who want to apply analytical methods to uncover patterns, trends, and relationships in spatial data, making it relevant for urban planning, environmental studies, and business analytics projects where Python provides an edge in processing and analysis.
6. Automating GIS Processes by Henrikki Tenkanen and Vuokko Heikinheimo
Developed by university researchers, this open textbook is freely accessible and teaches automated workflows in GIS using Python. It covers reading, writing, and visualizing spatial data, performing basic and advanced analysis, and writing reusable scripts for reproducible research. You will learn best practices for structuring code, managing projects, and documenting workflows, which is essential for GIS professionals entering the data science world.
7. Geospatial Data Science with Python by Bonny P McClain
This newer release expands on her previous work by introducing advanced techniques like geocoding, clustering, and spatial machine learning. It brings together theory and applied projects that resemble real data science pipelines, making it an excellent progression once you have mastered the basics. You will learn to perform spatial clustering to detect hotspots, apply machine learning models to geospatial data, and integrate Python visualization tools to create interactive and informative maps. The book strengthens both analytical thinking and coding skills, giving GIS analysts practical experience to operate at the intersection of GIS and data science.
8. Practical GIS by Gรกbor Szabรณ
This book guides you through open-source GIS ecosystems like QGIS and PostGIS while integrating them with Python automation. It is designed for professionals who want to bridge desktop GIS experience with backend database-driven systems. You will learn to connect Python scripts with spatial databases, automate data imports and exports, perform spatial queries, and develop workflows that combine GIS tools with programmatic solutions, enhancing productivity and ensuring reproducibility in GIS projects.
9. Python Geospatial Development by Erik Westra
One of the earliest yet still relevant references on building complete geospatial applications. It walks you from handling coordinates and projections to creating interactive maps and integrating them with web frameworks. You will learn how to develop end-to-end geospatial projects, acquire data, process it, visualize results, and deliver interactive mapping solutions. This book is ideal as a final step in your journey, consolidating Python skills and GIS knowledge to produce professional geospatial applications.
Conclusion
Moving from GIS to data science is more than learning new syntax. It is about changing how you think about data. Each of these nine books gives you not just tools, but ways of reasoning spatially, computationally, and statistically. By reading them and applying their lessons, you will transform from a map maker into a spatial data scientist capable of solving complex challenges with Python. The roadmap provided by these books ensures you grow from a GIS analyst to a Python-powered geospatial expert, ready to tackle any real-world spatial problem.
In todayโs data-driven economy, even small businesses are becoming information ecosystems. Customer lists, sales metrics, and supplier data are no longer just operational detailsโtheyโre strategic assets that demand governance. Data governance ensures that data is accurate, secure, accessible, and used responsibly. Without it, businesses risk inefficiencies, compliance issues, and loss of customer trust.
TL;DR
Data governance = policies + processes that ensure your data is trustworthy and usable.
It protects small businesses from data breaches, regulatory fines, and decision errors.
Start simple: define who owns the data, how itโs collected, where itโs stored, and how itโs used.
Adopt digital tools and frameworks that automate compliance and security checks.
Continuous monitoring and employee training make governance sustainable.
Why Data Governance Matters for Small Businesses
Good governance transforms raw data into actionable intelligence. For small businesses, itโs a survival strategyโnot a luxury.
Improved decision-making: Reliable data fuels accurate analytics and forecasts.
Regulatory compliance: Ensures adherence to privacy laws like GDPR and CCPA.
Operational efficiency: Reduces duplication and streamlines workflows.
Customer trust: Protects personal information and reinforces brand credibility.
Business continuity: Supports risk management and disaster recovery efforts.
Consider this option: small businesses in regulated industries can explore cybersecurity degree programs online to deepen internal knowledge of data protection frameworks.
Building Trust Through Secure Information Management
Data governance isn't only about compliance; it's about creating trust frameworks between a business and its stakeholders. By implementing robust data controls, even micro-enterprises can operate with the same rigor as large corporations. Consider aligning governance with standards like ISO 27001 or adopting cloud-native tools from providers such as Microsoft Azure Security Center.
Small businesses that master governance early often outperform competitors when scaling, since they can integrate new data sources without chaos or compliance gaps.
The Four Pillars of Data Governance
Pillar | Description | Practical Example
Accountability | Assign clear data ownership and responsibilities. | The finance manager oversees all transaction data.
Integrity | Maintain accurate and consistent data records. | Use validation rules in CRM tools to prevent errors.
Security | Protect data from unauthorized access. | Implement two-factor authentication and encrypted backups.
Compliance | Align data practices with legal and ethical standards. | Ensure opt-in consent for marketing emails.
How to Implement Data Governance (Step-by-Step)
Assess Current Data Landscape
Identify what data exists, where it resides, and how itโs used.
Use a simple audit checklist.
Create a Governance Policy
Document rules for collection, storage, and sharing.
Define roles and escalation paths.
Select the Right Tools
Choose systems with audit trails and role-based access.
Use this checklist quarterly to evaluate your companyโs data maturity.
FAQ
Q1: What is the biggest data governance mistake small businesses make? A: Treating governance as an IT issue rather than a business-wide responsibility.
Q2: How often should governance policies be reviewed? A: At least annually, or after major system or regulation changes.
Q3: Do I need expensive software for governance? A: Not necessarily. Even simple platforms like Google Workspace Admin Console offer access controls and audit logs.
Q4: Who should lead the governance initiative? A: Ideally, a cross-functional team with representation from management, IT, and operations.
Glossary
Data Governance: Framework for managing dataโs availability, usability, integrity, and security.
Metadata: Data about dataโused to track origin, context, and usage.
Compliance: Adherence to regulations governing data privacy and protection.
Data Steward: Person responsible for maintaining data quality and policy compliance.
Access Control: Mechanism restricting data usage to authorized individuals.
Spotlight: Modern Compliance Automation
Modern small businesses benefit from automation platforms that monitor compliance in real time. Tools such as OneTrust, Vanta, and Drata simplify SOC 2 and GDPR readiness, freeing owners to focus on growth. These systems integrate with CRMs, HR systems, and accounting tools, creating continuous visibility into your data environment.
Data governance is no longer optional. For small businesses, itโs the foundation of credibility, continuity, and competitive advantage. By starting smallโassigning ownership, defining clear policies, and adopting security toolsโyou build the scaffolding for long-term data integrity.
When your data is well-governed, your business decisions become more confident, your customers more loyal, and your operations more resilient.
Unlock the power of data with Data World Consulting Group and explore our expert solutions and educational resources to elevate your business and learning journey today!
Introduction: Why Data Types Are the Hidden Power of Every Analysis
Every great data science project begins with understanding one simple truth โ not all data is created equal. Before diving into algorithms, visualizations, or predictions, you must know what kind of data you are working with. Misunderstanding data types can lead to incorrect models, wrong insights, and hours of confusion. In this article, we will explore the types of data in statistics and how each plays a critical role in the world of data science.
1. The Two Grand Divisions: Qualitative vs Quantitative Data
All data in statistics can be classified into two main types โ qualitative (categorical) and quantitative (numerical).
Qualitative (Categorical) Data
This type represents qualities, categories, or labels rather than numbers. It answers what kind rather than how much. Examples include gender, color, type of car, or country of origin.
In data science, categorical data helps in classification tasks like predicting whether an email is spam or not, or identifying the genre of a song based on lyrics.
There are two subtypes:
Nominal Data: No order or hierarchy between categories. Example: colors (red, blue, green).
Ordinal Data: Has a meaningful order, but the intervals between categories are not equal. Example: satisfaction levels (poor, fair, good, excellent).
Quantitative (Numerical) Data
This type deals with numbers and measurable quantities. It answers how much or how many. Quantitative data powers regression models, trend analysis, and time series forecasting.
Subtypes include:
Discrete Data: Countable values, often whole numbers. Example: number of students in a class.
Continuous Data: Infinite possible values within a range. Example: height, weight, or temperature.
2. A Closer Look: Scales of Measurement
Beyond basic classification, data can also be described based on its measurement scale, which defines how we can analyze and interpret it statistically.
Nominal Scale
Purely categorical with no numerical meaning. Used for grouping or labeling. Example: blood type or eye color. Data science use: Encoding these variables (like one-hot encoding) for machine learning models.
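As a quick sketch of that encoding step (hypothetical column and values, assuming pandas is available):

```python
import pandas as pd

# A tiny frame with one nominal variable: no order between the categories
df = pd.DataFrame({"blood_type": ["A", "B", "O", "AB", "O"]})

# One-hot encoding turns each category into its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["blood_type"], prefix="blood")
print(encoded)
```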
Ordinal Scale
Ordered categories, but without measurable difference between ranks. Example: star ratings on a product (1โ5 stars). Data science use: Great for survey analysis or ranking models, often converted to integers for algorithms.
Interval Scale
Numerical data with equal intervals, but no true zero point. Example: temperature in Celsius or Fahrenheit. Data science use: Common in time series or sensor data where the zero point is arbitrary.
Ratio Scale
The highest level of data measurement, with equal intervals and a true zero point. Example: weight, distance, or income. Data science use: Used in predictive modeling, regression, and deep learning tasks requiring exact numeric relationships.
3. Why Data Types Matter So Much in Data Science
Understanding data types is more than academic theory โ it directly shapes every decision you make as a data scientist:
Data Cleaning: Knowing whether to impute missing values with mean (for continuous) or mode (for categorical).
Feature Engineering: Deciding how to encode or transform variables for algorithms.
Visualization: Choosing appropriate plots โ bar charts for categorical, histograms for continuous.
Model Selection: Some algorithms handle specific data types better (e.g., decision trees handle categorical data naturally).
Without correctly identifying your data types, even the most advanced model will mislead you.
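To make the cleaning decision above concrete, here is a minimal sketch (with invented column names) that imputes numeric columns with the mean and categorical columns with the mode:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],               # continuous -> impute with the mean
    "segment": ["new", None, "loyal", "new"],  # categorical -> impute with the mode
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```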
4. Real-Life Example: Data Types in a Data Science Project
Imagine you are analyzing a dataset about customer purchases for an e-commerce company. Hereโs how different data types appear:
Variable | Data Type | Example | Use Case
Customer ID | Nominal | C1023 | Identifier
Gender | Nominal | Female | Segmentation
Age Group | Ordinal | 18-25, 26-35 | Market analysis
Purchase Amount | Ratio | 120.50 | Revenue modeling
Date of Purchase | Interval | 2025-11-05 | Trend analysis
Items Bought | Discrete | 3 | Purchase frequency
By correctly classifying these data types, you can efficiently prepare data for machine learning models, visualize insights properly, and make reliable business decisions.
Conclusion: The Secret to Smarter Data Science Starts with Data Types
In the age of AI and automation, the human skill of understanding data remains irreplaceable. Knowing whether your variable is nominal or ratio could be the difference between success and misleading outcomes. As a data scientist, always start with data classification before analysis โ itโs the quiet foundation behind every powerful insight and accurate prediction.
In a world where artificial intelligence (AI) is no longer a futuristic concept but an active force in business and technology, the field of data science finds itself at a crossroads. On one hand, there are exciting opportunities: new tools, higher salaries, increasing demand. On the other hand, there are questions: will AI replace data scientists? Are job roles shifting so fast that what you learn now may be outdated tomorrow? If you are building or advising a career in data science (or your work touches on this area), then understanding what is actually happening in the job market is critical. In this article I explore the real-world trends for 2025 in the data science and AI job market: the demand, the shifts in roles and skills, the risks, and how you as a professional (or aspiring one) can position yourself.
1. What the Data Science + AI Job Market Looks Like Today
Demand is still strong but evolving
Numerous reports point to continued growth in data science and AI-related roles. The U.S. job market alone is still projected to add around 21,000 new data scientist openings per year over the next decade.
Roles are shifting: specialization and infrastructure matter more
What a "data scientist" is nowadays is no longer what it was five years ago. Employers increasingly demand:
Strong machineโlearning/AI skills
Data engineering, MLOps, and infrastructure skills, which are becoming more prominent
Domain expertise (industry knowledge, ethics, AI governance) as a differentiator
Salary and compensation remain attractive
Salary data for data science and AI professionals show robust numbers. Many data science job postings in 2025 offer salaries in the $160,000 to $200,000 range in the U.S., and salaries in the AI segment run slightly higher than in standard data science roles.
AI is more complement than substitute (for now)
AI tends to augment high-skill work more than it automates it away. Rather than viewing AI purely as a threat, it is more accurate to see it as reshaping jobs and skill requirements.
2. Key Shifts You Should Be Aware Of
Entryโlevel roles are harder to find
Though demand is robust overall, competition for entry-level and "generalist" data science roles is getting tougher. The share of postings asking for 0-2 years of experience has decreased, while salaries have risen for more experienced candidates.
The โdata scientist unicornโ is fading
Employers are less often looking for one person to do everything (data wrangling, feature engineering, modeling, deployment, business translation). Instead, roles are splitting into data engineer, ML/AI engineer, analytics engineer, and data product manager.
Skills are changing fast
Because AI and data roles evolve rapidly, the required skillโset is shifting:
Classic languages like Python and SQL remain vital; SQL has overtaken R in many job listings
Deep learning, NLP, MLOps are growing in importance
Soft skills, domain knowledge, ethics and governance are becoming differentiators
Skillโbased hiring is growing: employers value demonstrable skills (certifications, portfolios) perhaps more than formal degrees in some cases
The role of AI in affecting jobs is nuanced
Although there is concern about AI leading to widespread job loss, most evidence suggests that for now AI is not causing huge mass layoffs in highโskill data/AI roles. Still the impact may accelerate in coming years.
3. What This Means For Web Designers / Graphic Designers / Professionals (Like You)
Given your background in web design, motion graphics, brand identity, and related fields, your path may not be a classic "data scientist" role, but the intersection of design, data, and AI is very relevant. Here are some implications and opportunities:
Dataโdriven design: More companies integrate analytics into design decisions. Knowing how to interpret data, dashboards, and link visuals to business outcomes can give you an edge.
Motion graphics + AI content: As you use tools like Adobe After Effects or Adobe Animate the rise of generative AI (GenAI) means you may collaborate with data/AI teams to visualise model outputs, dashboards, user workflows.
Upskilling counts: Even if you donโt become a data scientist you benefit from acquiring foundational data literacyโSQL basics, data visualisation tools, understanding ML workflows. These complement your design/brand skills and make you more versatile.
Branding AI capabilities: For your own services (web design, brand identity) you can offer value by saying โI understand how AIโdriven data flows affect UXโ or โI can build dashboards with strong visual narrativeโ. That differentiates you.
Avoid entering a matured โcommodityโ space: Entryโlevel data science is tougher. So if you pivot into data/AI you might target niches where your design/visualisation expertise is rare: e.g., AI ethics visualisations, UX for ML interfaces, dashboard storytelling, dataโdriven branding.
In short: donโt wait for โdata science job market explosionโ to pass you byโposition your existing strengths (design, visuals, motion) plus some data/AI fluency to ride the wave rather than be overtaken by it.
4. What to Do If Youโre Considering or Already in the Field
Hereโs a practical roadmap for moving forward smartly:
Audit your current skills
How comfortable are you with Python/SQL or dataโtools?
Do you understand basics of ML/AI workflows (model building, deployment) at a conceptual level?
How good are you at communicating insights visually and with business context?
Pick a niche or combine strengths
Because generalist โdata scientistโ roles are less common now youโll stand out by combining two strengths: e.g., โmotion graphics + ML interpretabilityโ or โweb UI for data pipelinesโ.
Consider roles such as analytics engineer, data visualisation specialist, designโdriven data product owner.
Upskill strategically
Focus on inโdemand skills: machine learning fundamentals; cloud/data engineering basics; MLOps; SQL; data visualisation tools
Also invest in โsoftโ but crucial skills: domain knowledge, communication, ethics, decisionโmaking
Consider a portfolio of projects rather than only relying on formal degrees (skillโbased hiring is rising)
Stay adaptable and alert to shifts
The job market changes: roles will evolve as AI becomes more embedded
Entryโlevel may stay competitive; experience + unique combo of skills will help
Keep your design/visual skills sharpโthey will remain valuable even when AI changes some technical roles
5. Conclusion & Call to Interaction
In summary: the job market for data science and AI remains strong but changing. It is less about โwill there be jobsโ and more about โwhat kind of jobs, and with what skillsโ. For those able to combine technical fluency with domain, design, communication and flexibility the opportunities are excellent. For those expecting a straightforward path without continuous learning the environment will be competitive.
If I may invite you:
Comment below with your own perspective: have you seen data or AI roles advertised in your region recently? What skills did they ask for?
Consider writing a short list of three new skills you are willing to add this year to stay relevant in this shifting landscape.
In a world that never stops generating tasks, automation is not just a luxury โ itโs a necessity. Python has become the language of choice for people who want to make their computers work for them. It allows anyone, whether a beginner or an experienced developer, to automate daily routines, streamline workflows, and create elegant tools that simplify life. Whatโs more inspiring is that most of these automations can be built in just a weekend, giving you practical results and immediate satisfaction. In this article, weโll explore eight real-world automation projects that combine creativity, simplicity, and powerful results. Each project includes a detailed explanation and working code, ready to run and expand.
1. File Organizer: Cleaning Your Digital Mess
Letโs be honest โ everyoneโs Downloads folder looks like a battlefield. PDFs, images, ZIP archives, and installers all live together in digital chaos. A File Organizer is one of the simplest yet most satisfying automation scripts you can build. It scans a target folder, detects the file extensions, creates categorized subfolders, and moves each file into its proper place. This saves time, reduces clutter, and gives your workspace a touch of order.
Beyond personal use, such automation can be scaled for offices to organize report folders, designers to manage creative assets, or photographers to sort by file type. Itโs the foundation of file automation โ understanding how to navigate directories, classify files, and manipulate them programmatically.
This script can be adapted to group by date, size, or even project names โ the perfect first step toward smarter digital management.
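A minimal sketch of such an organizer; the folder path and the extension-to-category map are assumptions you would adapt to your own machine:

```python
from pathlib import Path
import shutil

# Map file extensions to destination subfolders (extend as needed)
CATEGORIES = {
    ".pdf": "Documents", ".docx": "Documents",
    ".jpg": "Images", ".png": "Images",
    ".zip": "Archives", ".exe": "Installers",
}

def organize(folder: str) -> None:
    base = Path(folder)
    for item in base.iterdir():
        if item.is_file():
            target = base / CATEGORIES.get(item.suffix.lower(), "Other")
            target.mkdir(exist_ok=True)          # create the subfolder if it is missing
            shutil.move(str(item), str(target / item.name))

organize("Downloads")  # hypothetical path: point it at the folder you want to clean
```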
2. Auto Email Sender: No More Repetitive Mail Work
Every professional has at least one recurring email to send: reports, invoices, weekly updates, or newsletters. Manually sending them every week is a waste of time. Thatโs where an Auto Email Sender steps in. Using Pythonโs smtplib and email libraries, you can compose and send messages automatically, even with attachments. You can integrate it with your reporting scripts to send data automatically at the end of each process.
This project teaches you about SMTP protocols, secure authentication, and automating digital communication. It also helps you understand how businesses automate entire email flows using scripts or scheduled tasks. You can later add personalization and dynamic content fetched from spreadsheets or databases.
Set it on a scheduler, and youโve got yourself an email assistant who never forgets or gets tired.
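A minimal sketch using only the standard library; the SMTP host, addresses, and password are placeholders, and most providers expect an app-specific password rather than your normal login:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Weekly report"
msg["From"] = "you@example.com"            # placeholder sender
msg["To"] = "team@example.com"             # placeholder recipient
msg.set_content("Hi team, this week's report is attached.")

# Optional attachment
with open("report.pdf", "rb") as f:
    msg.add_attachment(f.read(), maintype="application",
                       subtype="pdf", filename="report.pdf")

# Send over a TLS connection; host and port depend on your provider
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("you@example.com", "app-password")
    server.send_message(msg)
```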
3. WhatsApp Message Bot: Smart Communication Made Easy
Imagine sending birthday wishes, reminders, or meeting alerts without lifting a finger. With the pywhatkit library, Python can automate WhatsApp messages right from your desktop. You define the message, the recipient, and the exact time โ and the bot does the rest.
This project introduces you to simple automation that interacts with web applications through browser control. Itโs particularly useful for small businesses or freelancers who manage multiple clients and want to send personalized yet automated updates. Itโs also a gentle entry into browser-driven automation and time scheduling.
Once you see your computer send that message without your input, youโll feel the real satisfaction of automation.
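A minimal sketch with pywhatkit; the phone number and send time are placeholders, and WhatsApp Web must already be logged in on the machine running the script:

```python
import pywhatkit

# Arguments: recipient in international format, message text, hour, minute (24h clock)
pywhatkit.sendwhatmsg("+15551234567", "Reminder: project call at 7 pm today.", 18, 30)
```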
4. Web Scraper: Collect Data While You Sleep
Web scraping is the heart of data automation โ a way to collect information automatically from websites without manual copy-paste work. Whether itโs scraping job listings, product prices, or blog titles, Pythonโs BeautifulSoup and requests libraries make the process simple and powerful.
A Web Scraper can become part of many real-world systems โ price tracking bots, research tools, or content aggregators. It introduces you to the HTML structure of websites and teaches you how to extract meaningful patterns. Itโs also an excellent first step toward data analytics, since most analysis begins with data collection.
Once youโve mastered this, you can expand it to scrape multiple pages, store data in CSV files, and even monitor changes over time.
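A minimal sketch with requests and BeautifulSoup; the URL and the tags being targeted are assumptions that depend on the site you scrape (always check its terms of use first):

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/blog"           # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> tag as a stand-in for "post titles"
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
for title in titles:
    print(title)
```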
5. Bulk File Renamer: Perfect Naming Every Time
If youโve ever had to rename hundreds of files โ like photos, documents, or reports โ you know the pain. The Bulk File Renamer eliminates that pain instantly. By looping through files in a folder, you can rename them with a consistent pattern, making them searchable and organized.
This project is particularly helpful for creative professionals, teachers, or office administrators. It introduces iteration and string formatting while giving immediate practical benefits.
After you run it, your files will instantly follow a perfect naming convention โ a simple yet satisfying reward for your Python skills.
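A minimal sketch that renames every JPG in a folder to a numbered pattern; the folder name and prefix are hypothetical:

```python
from pathlib import Path

folder = Path("vacation_photos")   # hypothetical folder
prefix = "trip"

# Sort for a stable order, then rename to trip_001.jpg, trip_002.jpg, ...
for index, photo in enumerate(sorted(folder.glob("*.jpg")), start=1):
    photo.rename(folder / f"{prefix}_{index:03d}{photo.suffix}")
```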
6. Desktop Notification App: Keep Yourself on Track
Modern life is full of distractions, and sometimes the simplest automation can bring balance. A Desktop Notification App is one of those. You can make Python send you notifications โ like reminding you to stretch, hydrate, or check an important site. The plyer library makes it surprisingly easy.
This project is not just about productivity; it teaches you how applications communicate with your operating system and how automation can serve human well-being, not just efficiency.
You can even connect it to other scripts to notify you when a background task finishes or when a website updates.
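A minimal sketch with plyer that nudges you to stretch once an hour; the interval and wording are arbitrary:

```python
import time
from plyer import notification

while True:
    notification.notify(
        title="Time to stretch",
        message="You've been sitting for an hour. Stand up and move a little.",
        timeout=10,          # seconds the notification stays visible
    )
    time.sleep(60 * 60)      # wait an hour before the next reminder
```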
7. Excel Automation: Reports That Build Themselves
If your work involves data or reporting, Excel Automation is a game changer. Instead of manually updating sheets, you can use Pythonโs OpenPyXL library to fill in data, apply formulas, and save formatted Excel reports automatically.
This automation is especially powerful for analysts, accountants, teachers, or managers who regularly produce structured reports. It introduces concepts of data manipulation, file writing, and office integration โ all essential skills for business automation.
Once you understand this foundation, you can automate monthly reports, combine multiple data sources, or even generate charts directly from Python.
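A minimal sketch with openpyxl; the sales figures are made up, and the formula simply totals the column:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Monthly Report"

# Header row followed by some example data
ws.append(["Region", "Sales"])
for region, sales in [("North", 1200), ("South", 950), ("East", 1430)]:
    ws.append([region, sales])

# A normal Excel formula that sums the Sales column
ws["A5"] = "Total"
ws["B5"] = "=SUM(B2:B4)"

wb.save("monthly_report.xlsx")
```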
8. Web Automation Bot: The Gateway to Advanced Automation
Finally, the Web Automation Bot. This is where automation meets intelligence. With Selenium, you can control a real browser โ open websites, log in, click buttons, and extract information โ just like a human would. Itโs used in automated testing, social media bots, and even e-commerce monitoring tools.
This project teaches browser control, DOM manipulation, and event simulation. Itโs a more advanced automation, but once you build it, youโll see how close you are to creating full-scale automation systems.
From here, you can scale up to automate entire workflows โ logging into dashboards, downloading reports, or posting updates online.
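A minimal sketch with Selenium; it assumes a working Chrome and ChromeDriver setup (recent Selenium versions can fetch the driver for you) and uses a public search page purely as an example:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()                    # opens a real browser window
try:
    driver.get("https://duckduckgo.com")
    box = driver.find_element(By.NAME, "q")    # the search input field
    box.send_keys("python automation", Keys.ENTER)
    time.sleep(3)                              # crude wait for results to load
    print(driver.title)
finally:
    driver.quit()
```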
Conclusion
Each of these projects represents a small window into a much larger world โ the world of automation-driven thinking. What makes them valuable isnโt just the code but the mindset they build: the idea that every repetitive task can be transformed into a system that runs on its own. Once you start building these automations, you begin to see possibilities everywhere โ from your desktop to your business processes. So, take this weekend to experiment, learn, and enjoy the moment when your computer starts working for you instead of the other way around.
There was a time when data engineers were the silent backbone of the digital world. They built invisible pipelines that powered analytics dashboards and business decisions while their work lived quietly in the background. Yet as we step into 2025, a powerful shift has begun. The era of artificial intelligence has changed everything. The same engineers who once shaped data flows are now shaping intelligence itself. The walls between data engineering and AI engineering are collapsing, giving birth to a new kind of professional โ one who does not just move data but gives it meaning, logic, and life.
The Evolution of Data Engineering
For years data engineers were defined by the pipeline. Their mission was to extract, transform, and load massive amounts of data with precision. They were masters of efficiency and reliability since business intelligence depended on their craft. But as AI systems began to demand cleaner, smarter, and more contextual data, the traditional boundaries of their work started to blur. Data was no longer a static resource stored in warehouses. It became dynamic and intelligent, ready to be consumed by models that learn and adapt.
This transformation forced data engineers to rethink their purpose. They began to explore new languages, frameworks, and architectures that serve the needs of AI systems rather than just reports. The rise of feature stores, real-time data pipelines, and model-ready datasets became a natural evolution. What was once a backend support role is now a creative and strategic discipline deeply embedded in the core of AI development.
The Convergence of Data and Intelligence
In 2025 the distance between data and intelligence has nearly vanished. Companies realized that no AI model can thrive without a strong data foundation, and no data pipeline is meaningful unless it serves intelligent systems. This convergence turned data engineers into AI engineers almost by necessity. They are now the architects who design the flow of information that feeds neural networks, fine-tunes machine learning algorithms, and maintains the ethical integrity of data usage.
Instead of stopping at ETL processes, data engineers are now involved in designing feedback loops that help models learn from real-world behavior. They collaborate with machine learning experts to ensure that data quality aligns with algorithmic precision. They implement data observability tools that detect drift and bias. In short, they became the silent partners of artificial intelligence, merging data logic with machine cognition.
The Skills Defining the New AI Engineer
The modern AI engineer who once began as a data engineer no longer lives in a world of static scripts. He navigates dynamic ecosystems filled with streaming data, distributed architectures, and intelligent agents. Python and SQL remain essential, but so do TensorFlow, PyTorch, and MLOps tools. Understanding how to automate model deployment, monitor data pipelines, and handle ethical AI constraints has become part of their daily routine.
They have become fluent in the language of AI systems while never forgetting their roots in data infrastructure. Their expertise bridges two worlds โ one of data reliability and another of model intelligence. The result is a new generation of engineers who see data as a living entity that must be nurtured, protected, and taught to think.
The Industry Demand and the Rise of Hybrid Roles
In 2025, technology companies are no longer hiring data engineers and AI engineers as separate positions. Instead, they are creating hybrid roles that demand deep data expertise combined with applied AI knowledge. Startups and enterprises alike seek professionals who can both build a data platform and deploy a model on top of it. This merging of skill sets has reshaped hiring patterns across industries from finance to healthcare to manufacturing.
Businesses now understand that the journey from raw data to intelligent decision-making must be seamless. The engineer who can handle that entire journey becomes priceless. They are not just developers anymore but system thinkers who shape the DNA of digital intelligence.
What This Means for the Future
The rise of AI engineers from the roots of data engineering tells a larger story about how technology evolves. Each generation of innovation absorbs the one before it. Just as web developers became full-stack engineers, data engineers are becoming full-intelligence engineers. The future belongs to those who understand both the flow of information and the architecture of intelligence.
This shift will not slow down. Automation tools will make traditional data work easier, but the demand for human insight will grow. The world will need engineers who can blend structure with creativity, logic with vision, and pipelines with perception. And that is precisely what this new wave of AI engineers represents โ a bridge between the mechanical and the meaningful.
Conclusion
As we look ahead to the years beyond 2025, the title โdata engineerโ may fade, but its spirit will remain stronger than ever. The professionals who once built data pipelines are now shaping the veins of artificial intelligence. Their role is no longer about moving information but about awakening it. They have become the builders of intelligent systems that not only process data but understand it. The silent era of engineering has ended, and a new one has begun โ where data engineers have become AI engineers, and intelligence is no longer a dream but a craft.
In the world of modern technology, satire writes itself. Our devices update while we sleep, our data travels through invisible clouds, and our AI assistants occasionally mistake sarcasm for affection. If an artist ever tried to sketch the digital age, it would look like a mix of confusion, brilliance, and a dash of existential dread โ which is exactly what these six cartoon concepts capture.
Each cartoon is a humorous reflection of our uneasy friendship with data, intelligence, computers, and the all-powerful Cloud. You might laugh, or you might just recognize your daily struggle with a login screen. Either way, welcome to the funniest serious commentary youโll read today.
1. The Data Lake That Became a Swamp
Concept: A business analyst stands beside a murky lake labeled โData Lakeโ, holding a fishing rod tangled with broken dashboards. Behind him, a sign reads: โNo Swimming โ Undefined Values.โ
Insight: Companies were promised crystal-clear insight, but without proper management, their โdata lakesโ turned into โdata swamps.โ This cartoon pokes fun at the irony that storing too much data without structure leads to less clarity โ not more.
2. AI at the Therapy Session
Concept: An AI robot lies on a therapistโs couch saying, โSometimes I feel like humans only like me for my predictions.โ The therapist, another AI, takes notes on a tablet labeled โMachine Learning Journal.โ
Insight: Artificial intelligence has become so โsmartโ that we project human emotions onto it. This scene satirizes our growing emotional dependence on technology โ and how AI often mirrors our own insecurities back at us.
3. The Cloud with a Lightning Mood
Concept: A cheerful worker uploads files to the cloud, only for the next panel to show a thundercloud raining error messages: โConnection Lost,โ โTry Again Later,โ โUnknown Issue.โ
Insight: The Cloud has become a symbol of both convenience and fragility. This cartoon reflects how our entire digital lives depend on invisible servers that sometimes justโฆ donโt feel like cooperating.
4. The Computer That Needed a Break
Concept: An overworked laptop with dark circles under its webcam says, โIโve been updating since 3 a.m. โ can I go into sleep mode now?โ Nearby, a human drinks coffee, exhausted from waiting.
Insight: Computers are our most loyal coworkers โ until they decide to restart during a deadline. The humor here hides a truth about our digital burnout: even machines need downtime, and so do we.
5. Data Privacy: The Peekaboo Game
Concept: A smartphone hides behind its screen, whispering, โDonโt worry, I only listen sometimes.โ Around it, dozens of tiny apps peek through keyholes.
Insight: This cartoon comments on the illusion of privacy in a world where every app quietly watches. Itโs a funny โ but unsettling โ reminder that our devices might know us better than we know ourselves.
6. When the Algorithm Discovered Art
Concept: An AI proudly displays its painting โ a surreal image that looks suspiciously like data charts turned into abstract art. The human critic says, โImpressive. But why is it signed โVersion 2.3โ?โ
Insight: AI creativity blurs the line between logic and imagination. This cartoon captures the moment machines start expressing beauty through patterns โ and we start questioning what it means to be โcreative.โ
Conclusion
Technology has always been serious business โ but beneath the code, spreadsheets, and cloud servers lies a quietly comic story of human ambition. These six cartoons remind us that every algorithm reflects its creator, every dataset hides a human flaw, and every crash, update, or โunknown errorโ is just another way the universe keeps us humble.
The next time your computer freezes mid-task, donโt get angry โ just imagine the cartoon. Youโll laugh, then reboot.
Python is the language that made data handling accessible to both beginners and experts. Yet, many students often overlook its hidden tricksโthose little shortcuts and powerful functions that can save hours of work. Whether youโre dealing with messy datasets, writing code for assignments, or preparing for data-driven jobs, knowing these techniques can make you more efficient and stand out among peers.
Below are nine Python data tricks, complete with explanations and real-life examples, that youโll wish you had discovered back in college.
1. List Comprehensions for Clean Data Manipulation
Instead of writing long loops, Python allows you to process lists elegantly.
Example:
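A minimal illustration with made-up exam scores:

```python
scores = [55, 72, 88, 61, 45, 93]

# Keep passing grades (above 60) and square them, all in one readable line
passing_squared = [s ** 2 for s in scores if s > 60]
print(passing_squared)   # [5184, 7744, 3721, 8649]
```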
Use Case in College: Quickly filtering and transforming exam scores, like pulling out all passing grades above 60 and squaring them for analysis.
2. The Power of enumerate()
When you need both the index and the value from a list, enumerate() saves you from writing manual counters.
Example:
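A minimal illustration with a made-up list of student names:

```python
students = ["Amina", "Omar", "Lina"]

# enumerate() yields (index, value) pairs, so no manual counter is needed
for position, name in enumerate(students, start=1):
    print(f"{position}. {name}")
```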
Output:
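Running the sketch above prints:

```
1. Amina
2. Omar
3. Lina
```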
Why it helps: No more creating separate index = 0 counters in your assignments.
3. Unpacking with the Asterisk * Operator
The * operator allows you to grab multiple values at once.
Example:
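A minimal illustration, assuming grades already sorted from highest to lowest:

```python
grades = [95, 91, 88, 76, 64]

top1, top2, *rest = grades      # the starred name collects everything left over
print(top1, top2)               # 95 91
print(rest)                     # [88, 76, 64]
```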
In Practice: Splitting the top two highest exam grades from the rest.
4. Using zip() to Pair Data
When you have two lists that should be combined, zip() does the magic.
Example:
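A minimal illustration with made-up names and grades:

```python
names = ["Amina", "Omar", "Lina"]
grades = [88, 74, 95]

# zip() pairs items by position: ("Amina", 88), ("Omar", 74), ...
for name, grade in zip(names, grades):
    print(f"{name}: {grade}")
```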
Why it matters: Perfect for combining student names with their grades.
5. Dictionary Comprehensions
Just like list comprehensions, but for key-value pairs.
Example:
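A minimal illustration that builds a name-to-grade lookup table from two made-up lists:

```python
students = ["Amina", "Omar", "Lina"]
grades = [88, 74, 95]

# Dictionary comprehension: one key-value pair per student
grade_lookup = {name: grade for name, grade in zip(students, grades)}
print(grade_lookup["Lina"])   # 95
```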
In Research: Building quick lookup tables for datasets.
6. collections.Counter for Quick Statistics
Counting items in data doesn't need manual loops; use collections.Counter instead.
Example:
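A minimal illustration with made-up survey answers:

```python
from collections import Counter

responses = ["yes", "no", "yes", "maybe", "yes", "no"]

counts = Counter(responses)
print(counts)                  # Counter({'yes': 3, 'no': 2, 'maybe': 1})
print(counts.most_common(1))   # [('yes', 3)]
```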
Why it rocks: Instantly count survey responses or repeated items in experiments.
7. F-Strings for Fast String Formatting
Instead of using + or format(), f-strings keep your code clean.
Example:
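A minimal illustration with a made-up name and average:

```python
name = "Amina"
average = 86.4567

# Variables and number formatting live directly inside the string
print(f"{name} finished the semester with an average of {average:.1f}.")
# Amina finished the semester with an average of 86.5.
```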
College Hack: Quickly generate report summaries.
8. Lambda Functions for On-the-Fly Operations
Anonymous functions can make sorting or filtering seamless.
Example:
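A minimal illustration that sorts made-up (name, grade) pairs by grade:

```python
students = [("Amina", 88), ("Omar", 74), ("Lina", 95)]

# The lambda picks the grade (second element) as the sort key, highest first
ranked = sorted(students, key=lambda s: s[1], reverse=True)
print(ranked)   # [('Lina', 95), ('Amina', 88), ('Omar', 74)]
```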
Application: Sorting students by grades in just one line.
9. Pandas One-Liners for DataFrames
If youโre working with larger datasets, Pandas is a must.
Example:
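A minimal illustration with a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"student": ["Amina", "Omar", "Lina"],
                   "score": [88, 74, 95]})

# Two one-liners: summary statistics, then a quick filter
print(df["score"].describe())
print(df[df["score"] > 80])
```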
Why it matters in college: Easy statistical calculations on survey results or lab data.
Conclusion
These nine tricks are not about memorizing syntax but about thinking like a Pythonic problem solver. Whether youโre cleaning messy data, analyzing exam results, or preparing datasets for machine learning, these shortcuts save time and make your work more professional.
The earlier you adopt these techniques, the more efficient and confident youโll become in handling real-world data problems.
The data-driven era we live in makes data science one of the most attractive and future-proof careers. In 2025, the role of the data scientist has expanded beyond crunching numbersโit has become central to shaping business decisions, driving innovation, and even influencing government policies. Organizations are no longer looking for just analysts; they want professionals who can handle complex data systems, embrace artificial intelligence, and clearly translate results into actionable strategies. If you are wondering how to step into this field today, you need a clear roadmap that balances technical depth, practical projects, and future-oriented skills.
Building the Core Foundation: Mathematics and Statistics
At the heart of data science lies mathematics. Concepts such as linear algebra, probability, and statistics form the backbone of nearly every model or algorithm. A solid understanding of these principles allows you to evaluate results instead of blindly trusting tools. For example, when analyzing medical data, statistical reasoning helps determine whether a correlation is real or just random. Without this foundation, you may end up with models that look impressive but produce misleading insights. In 2025, employers still prioritize this knowledge, as it ensures you are not just a tool user but also a problem solver.
Programming and Tools of the Modern Data Scientist
While math gives you theory, programming gives you power. Python remains the dominant language, with libraries like NumPy, Pandas, and Scikit-learn forming a data scientistโs daily toolkit. R continues to be valued for advanced statistics, while SQL remains essential for querying and managing databases. Beyond coding, cloud-based platforms like AWS SageMaker, Google BigQuery, and Azure ML have become industry standards. For example, a retail company dealing with millions of customer records will expect you to pull, clean, and model data directly in the cloud. Mastering these tools makes you adaptable in diverse working environments.
Learning by Doing: The Power of Projects
In 2025, companies care less about what courses you took and more about what you can actually do. Thatโs why building a portfolio of projects is non-negotiable. Real-world projectsโsuch as predicting housing prices, analyzing stock market sentiment, or developing a COVID-19 data dashboardโshowcase not just technical skills but also your ability to think critically and communicate results. When hiring, managers are impressed by candidates who can walk them through a portfolio project, explaining why they made certain choices and how their work can be applied to real business challenges.
Communication: Turning Data Into Action
The best model in the world is useless if you cannot explain its results. In 2025, data scientists are increasingly judged on their ability to communicate clearly. Visualization tools such as Tableau and Power BI allow you to turn complex analyses into simple, intuitive dashboards. More importantly, you must develop the skill of storytellingโframing your findings in ways that decision-makers can act on. For instance, telling an executive team that a model has 90% accuracy is not enough; you must translate that into what it means for revenue growth, customer retention, or operational efficiency.
Embracing AI and Automation
Artificial intelligence has transformed data science. Tools like AutoML and AI assistants now automate repetitive coding and model selection. While some fear this reduces the demand for data scientists, the reality is the opposite: it makes the role more strategic. Your job in 2025 is not to compete with AI, but to guide it, validate its outputs, and connect its insights to business objectives. Think of yourself less as a โprogrammerโ and more as a โdata strategist.โ This shift means you must stay updated on the latest AI-powered workflows and learn how to use them as allies rather than competitors.
Networking and Lifelong Learning
The final piece of the puzzle is community. Data science evolves too quickly to master in isolation. Joining Kaggle competitions, contributing to GitHub projects, or attending industry conferences will keep you sharp and visible. Networking often leads to opportunities that technical skills alone cannot unlock. For example, someone you collaborate with in an online hackathon might later refer you for a role in a top tech company. Continuous learningโthrough courses, certifications, and researchโis what keeps a data scientist relevant in the long run.
Conclusion
Becoming a data scientist in 2025 is both challenging and rewarding. It requires you to combine strong mathematical knowledge, practical programming expertise, and hands-on project experience with the ability to tell compelling stories from data. It also means embracing AI as a partner and staying connected with the global data science community. If you commit to this journey, youโll be preparing not just for a job, but for a career that places you at the heart of the digital revolution. Start small, stay consistent, and remember: the future belongs to those who can turn information into insight.
The field of data science has become one of the most sought-after career paths in todayโs digital economy. With industries relying on data-driven decisions more than ever, companies are constantly searching for skilled professionals who can turn raw information into meaningful insights. Yet for newcomers, the biggest question remains: where do you start, and how do you navigate the overwhelming list of tools, concepts, and frameworks? The truth is, you donโt need to learn everything. You just need a clear, structured roadmap that leads directly to employability.
In this article, I will walk you through the only data science roadmap you need to get a job, breaking down each stage into practical, narrative-driven steps that ensure you not only learn but also position yourself as a competitive candidate.
Building the Mathematical Foundation
Every strong data scientist begins with mathematics, not because you need to become a mathematician, but because the language of data is built on numbers, probability, and patterns. Concepts like linear algebra, calculus, and statistics serve as the bedrock of understanding how algorithms work and how predictions are made. For example, understanding the gradient in calculus is not about solving equations on paper, but about recognizing how optimization happens in machine learning models like gradient descent. Similarly, grasping probability helps you evaluate risks, detect biases, and interpret uncertainty in predictions. Without this foundation, you may find yourself relying blindly on libraries without ever comprehending whatโs happening behind the scenes. And in interviews, recruiters often test this depth of knowledge. Think of this stage as building the grammar before you start writing in the language of data.
Mastering Programming for Data Science
Once the mathematics is in place, the next step is to learn how to communicate with data effectivelyโand this is where programming comes in. Python has emerged as the undisputed king of data science languages, thanks to its simplicity and vast ecosystem of libraries like NumPy, Pandas, Scikit-learn, and TensorFlow. However, R also remains valuable, particularly in research and academic environments. Learning programming is not just about syntax; it is about developing problem-solving skills. Imagine being handed a messy dataset full of missing values, outliers, and inconsistent formatting. Your task as a data scientist is to clean, transform, and prepare that data so that it can tell a story. Through consistent coding practice, such as participating in Kaggle competitions or working on personal projects, you start developing an intuition for handling real-world data challenges. This hands-on experience becomes your proof of competence in job applications.
Diving into Data Analysis and Visualization
At its heart, data science is about storytelling, and visualization is the way you make dataโs story come alive. Employers want to see if you can take complex, multi-dimensional datasets and simplify them into insights that decision-makers can understand. This is why mastering tools like Matplotlib, Seaborn, or Plotly is crucial. Beyond Python libraries, platforms such as Power BI or Tableau also enhance your ability to create compelling dashboards. For example, imagine presenting a sales forecast to a boardroomโnumbers alone may seem abstract, but a clear line chart showing trends or a heatmap highlighting problem areas instantly resonates with the audience. The ability to visualize effectively often becomes the deciding factor in whether your work is recognized and implemented within an organization.
Understanding Machine Learning Concepts
With foundations in mathematics, programming, and visualization established, the next step is venturing into machine learning. This is where theory meets practice, and you begin to teach machines how to make decisions. Start with supervised learning methods such as linear regression, logistic regression, and decision trees, then gradually move into more advanced algorithms like random forests, gradient boosting, and support vector machines. From there, unsupervised learning methods like clustering or dimensionality reduction broaden your perspective. What matters most is not memorizing formulas but understanding the intuition behind each algorithmโwhy you would use it, what kind of data it works best with, and how to evaluate its performance using metrics like accuracy, precision, or recall. Recruiters often focus on your ability to explain machine learning concepts in plain language, which shows that you donโt just โknowโ the algorithm but truly understand it.
Gaining Practical Experience Through Projects
No matter how many courses you complete or how many books you read, employers ultimately look for proof of application. This is where projects become the centerpiece of your roadmap. Start with small, guided projects like predicting housing prices or analyzing customer churn, then move toward larger, end-to-end case studies. For instance, you could build a sentiment analysis model for social media data or create a recommendation system similar to what Netflix or Amazon uses. Beyond showcasing your technical ability, projects demonstrate initiative and creativity. The key is to document your work on platforms like GitHub and share your learning journey on LinkedIn or personal blogs. In todayโs job market, recruiters often review your portfolio before they even invite you for an interview, and a strong collection of projects can significantly set you apart.
Preparing for the Job Market
The final step in the roadmap is translating all your skills into employability. This means learning how to craft a resume that highlights not just your technical tools but also the impact of your projects. Instead of listing โPython, Pandas, Scikit-learn,โ focus on what you achieved with them, such as โDeveloped a machine learning model that improved prediction accuracy by 15%.โ Equally important is preparing for interviews, which often include both technical tests and behavioral questions. You might be asked to code live, solve case studies, or explain your approach to a data problem. Beyond the technical side, employers want to know if you can communicate with non-technical teams, adapt quickly, and think critically under pressure. Networking also plays a huge roleโattending meetups, joining online communities, and seeking mentorship can open doors to opportunities you wouldnโt find on job boards.
Conclusion
The journey to becoming a data scientist may appear overwhelming at first glance, but with the right roadmap, it becomes a structured and achievable process. Start with building your mathematical foundation, then progress into programming, analysis, machine learning, and projects, before finally polishing your professional profile for the job market. Remember, the goal is not to learn everything at once but to follow a step-by-step path that steadily builds both competence and confidence. Employers are not just looking for people who know the toolsโthey want problem-solvers, storytellers, and innovators who can bring data to life. Follow this roadmap with persistence, and you will not only become job-ready but also set yourself on the path toward a rewarding career in data science.
A data pipeline is a structured workflow that transports raw data from multiple sources (databases, APIs, logs, IoT sensors, etc.) through a sequence of processes such as cleaning, transformation, feature extraction, and storage before feeding it into machine learning models. Unlike ad-hoc scripts, pipelines are automated, repeatable, and scalableโensuring consistent results over time.
Real-life example: Imagine a fraud detection system at a bank. Every transaction stream needs to be captured in real-time, validated, enriched with customer history, and transformed into numerical features that a model can understand. Without a pipeline, data would be chaotic and models would fail.
Core Components of a Data Pipeline Architecture
Designing a robust ML pipeline involves breaking it into logical components, each handling a specific responsibility.
Data Ingestion โ The entry point of data from structured (SQL databases) or unstructured sources (social media feeds, images).
Data Storage โ Raw data is stored in data lakes (e.g., AWS S3, Hadoop) or structured warehouses (e.g., Snowflake, BigQuery).
Data Processing & Transformation โ Cleaning, normalizing, and feature engineering using frameworks like Apache Spark or Pandas.
Feature Store โ A centralized repository to manage and serve features consistently across training and inference.
Model Serving Layer โ Once trained, models consume data from the pipeline for real-time predictions.
Monitoring & Logging โ Ensures pipeline stability, detects anomalies, and triggers alerts when failures occur.
Diagram: High-Level ML Data Pipeline Architecture
Hereโs a simple conceptual diagram of the flow:
[ Data Sources ] ---> [ Ingestion Layer ] ---> [ Storage ] ---> [ Processing & Transformation ] ---> [ Feature Store ] ---> [ ML Model ] ---> [ Predictions ]
This modular architecture ensures flexibility: you can swap out technologies at each stage (e.g., Kafka for ingestion, Spark for processing) without breaking the pipeline.
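As a toy illustration of that modularity, here is a minimal Python sketch in which each stage is a plain function that could later be swapped for a Kafka consumer, a Spark job, or a feature-store client; the data and column names are invented:

```python
import pandas as pd

def ingest() -> pd.DataFrame:
    # Stand-in for a Kafka consumer or a database extract
    return pd.DataFrame({"amount": [120.5, None, 87.0],
                         "country": ["DE", "DE", "FR"]})

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Cleaning plus one simple engineered feature
    clean = raw.dropna(subset=["amount"]).copy()
    clean["is_high_value"] = clean["amount"] > 100
    return clean

def store(features: pd.DataFrame) -> None:
    # Stand-in for writing to a feature store or warehouse
    features.to_csv("features.csv", index=False)

# The pipeline itself is just the composition of its stages
store(transform(ingest()))
```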
Batch vs. Streaming Pipelines
Not all machine learning applications require the same data speed. Choosing between batch and streaming pipelines is a crucial design decision.
Batch Pipelines: Data is processed in chunks at scheduled intervals (daily, weekly). Example: an e-commerce company analyzing customer purchase data every night to update recommendation models.
Streaming Pipelines: Data is processed continuously in real-time. Example: ride-hailing apps (like Uber) that use live GPS signals to predict ETAs.
Hybrid architectures often combine bothโbatch pipelines for historical insights and streaming for instant responses.
Best Practices for Designing ML Data Pipelines
Automation First โ Manual steps increase error probability. Automate ingestion, validation, and monitoring.
Data Quality Gates โ Validate data at every stage (e.g., schema checks, missing value detection).
Scalability โ Use distributed processing frameworks (Spark, Flink) for large datasets.
Real-world example: a predictive-maintenance pipeline for IoT sensors might look like this:
[ IoT Sensors ] --> [ Kafka Stream ] --> [ Data Lake ] --> [ Spark Processing ] --> [ Feature Store ] --> [ ML Model API ] --> [ Maintenance Alerts ]
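To make the Data Quality Gates practice concrete, here is a minimal validation sketch using pandas. The column names and rules are hypothetical; a real pipeline would check against its own schema or use a dedicated validation library.

import pandas as pd

REQUIRED_COLUMNS = {"sensor_id", "timestamp", "temperature"}  # hypothetical schema

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    """A simple quality gate: schema check plus missing-value handling."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed; missing columns: {missing}")
    # Drop rows with missing sensor readings rather than letting them reach the model
    return df.dropna(subset=["temperature"])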
Conclusion
Designing a data pipeline for machine learning is not just about moving dataโit is about engineering trust in the data lifecycle. A well-structured pipeline ensures that models receive clean, timely, and relevant inputs, thereby improving their accuracy and reliability. Whether itโs batch or streaming, the key lies in building modular, automated, and scalable architectures. For organizations investing in AI, strong pipelines are the invisible backbone of their success.
In the rapidly evolving world of data analytics, the difference between an average analyst and one in the top 1% often comes down to the tools they use. While many professionals still rely heavily on spreadsheets and basic dashboards, the elite class of analysts integrates artificial intelligence into their workflow. These tools allow them to move faster, uncover patterns others miss, and tell compelling stories with data. What separates them from the rest is not only their skill set but also their ability to harness AI as an extension of their expertise.
ChatGPT: The Analystโs On-Demand Assistant
ChatGPT has quickly become the quiet partner of many top analysts. Beyond its obvious role as a conversational AI, it functions as a code assistant, a research aide, and even a data storytelling companion. Instead of spending hours debugging SQL queries or rewriting Python scripts, analysts turn to ChatGPT to speed up technical tasks. Even more importantly, it helps explain statistical concepts in clear, client-friendly language, turning complicated findings into digestible insights. A financial analyst, for example, may rely on ChatGPT to reformat client reports instantly, saving hours that would have been spent manually editing.
Power BI with Copilot: Turning Data into Stories
Microsoftโs Power BI has long been a cornerstone of business intelligence, but with the integration of Copilot, it has transformed into something even more powerful. Analysts now rely on Copilot to generate DAX formulas from plain English prompts, summarize entire dashboards, and automatically provide executive-ready insights. Instead of creating static reports, elite analysts craft data stories that speak directly to decision-makers. Copilot doesnโt just make the process fasterโit makes it smarter, empowering analysts to focus on interpretation rather than technical execution.
Tableau with Einstein AI: Predicting the Future of Data
Tableau has always excelled in visualization, but when combined with Einstein AI, it offers predictive capabilities that make analysts stand out. Elite professionals use it not only to present data beautifully but also to forecast trends, detect anomalies, and run natural language queries without writing a single line of code. A marketing analyst, for instance, may ask Tableauโs AI to predict customer churn, receiving accurate forecasts that once required complex modeling. This ability to blend visualization with prediction is what makes Tableau a secret weapon for top analysts.
DataRobot: Automating Machine Learning with Precision
While building machine learning models used to be the domain of data scientists, tools like DataRobot have democratized the process. The worldโs top analysts use it to rapidly build, test, and deploy predictive models without sacrificing accuracy. What makes DataRobot essential is not just automation, but also explainabilityโit helps analysts understand and communicate how the model works. This transparency is crucial when executives ask, โWhy does the model recommend this decision?โ With DataRobot, analysts can provide both speed and clarity.
MonkeyLearn: Unlocking Insights from Unstructured Text
Data is not always structured, and some of the richest insights come from unstructured text such as customer reviews, survey responses, and support tickets. This is where MonkeyLearn proves indispensable. Elite analysts use it to extract keywords, classify topics, and perform sentiment analysis in minutes. Instead of manually coding NLP models, they rely on MonkeyLearnโs AI-driven automation to unlock meaning from text-heavy datasets. A company looking to understand thousands of customer complaints can gain actionable insights almost instantly, something that would otherwise take weeks of manual work.
Alteryx: Streamlining Workflows with AI
For analysts dealing with large and messy datasets, Alteryx is a game-changer. Its AI-powered workflow automation allows analysts to clean, prepare, and analyze data with drag-and-drop ease. But what makes it invaluable to top professionals is its ability to integrate predictive analytics directly into workflows. Elite analysts use Alteryx not just to save time, but to build smart, repeatable processes that scale. This frees them to focus on higher-level thinkingโfinding the โwhyโ behind the numbers instead of wrestling with raw data.
Google Cloud Vertex AI: Scaling AI to Enterprise Levels
When it comes to enterprise-scale analytics, Google Cloudโs Vertex AI is the tool of choice for the top tier of analysts. It allows them to train and deploy machine learning models at scale, integrate pre-trained APIs for natural language processing and computer vision, and connect seamlessly with BigQuery to analyze massive datasets. For a retail analyst managing thousands of SKUs across multiple markets, Vertex AI provides demand forecasting that is both powerful and precise. The ability to scale AI across global datasets is what makes this platform indispensable for the elite.
Conclusion
The difference between a good analyst and a world-class one often comes down to how effectively they integrate AI into their daily work. The top 1% are not just skilled in analysisโthey are skilled in choosing the right tools. ChatGPT helps them work faster, Power BI Copilot and Tableau Einstein allow them to tell richer stories, DataRobot accelerates machine learning, MonkeyLearn unlocks text data, Alteryx streamlines workflows, and Vertex AI delivers enterprise-level scale. Together, these tools give analysts a competitive edge that turns raw data into strategic power. If you want to step into the ranks of the top 1%, these are the tools to master today.
A calendar is more than a way to track datesโitโs a powerful tool for analyzing patterns over time. In Power BI, building a dynamic calendar visual allows you to explore performance across days, weeks, months, and years in an interactive and visually appealing way.
In this guide, weโll walk step by step through creating a professional dynamic calendar visualization in Power BI, supported with examples and DAX code.
1. Why Do You Need a Dynamic Calendar in Power BI?
Most reports rely heavily on the time dimension, but traditional charts often fail to highlight day-by-day patterns. A calendar visual helps you:
Spot distributions: Identify the busiest and slowest days at a glance.
Enable easy comparisons: Compare performance across weeks or months.
Deliver visual impact: Present data in a format users instantly understand.
Example: An e-commerce store uses a dynamic calendar to see which days drive the most orders, helping the marketing team plan promotions strategically.
2. Create a Date Table
Before building the visual, you need a proper Date Table. You can generate one in Power BI using DAX:
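A minimal sketch of such a Date table is shown below. The table name, date range, and column set are illustrative; adjust them so the calendar covers your own data.

DateTable =
ADDCOLUMNS (
    CALENDAR ( DATE ( 2023, 1, 1 ), DATE ( 2025, 12, 31 ) ),
    "Year", YEAR ( [Date] ),
    "MonthNumber", MONTH ( [Date] ),
    "Month", FORMAT ( [Date], "MMM" ),
    "WeekdayNumber", WEEKDAY ( [Date], 2 ),
    "Weekday", FORMAT ( [Date], "ddd" ),
    "Day", DAY ( [Date] )
)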
Tip: Donโt forget to mark it as a Date Table in Power BI.
3. Build the Calendar Layout with Matrix Visual
Now letโs transform this into a calendar view using the Matrix visual:
Add a Matrix visual.
Place Month and Year on the rows.
Place Weekday on the columns.
Use Day or a measure (like total sales) in the values field.
The Matrix will now display your data in a grid resembling a calendar.
4. Make It Interactive
To turn the static calendar into an interactive tool:
Conditional formatting: Color cells based on values (e.g., green = high sales, red = low).
Slicers: Allow users to filter by year, month, or product.
Tooltips: Show detailed insights when hovering over a specific day.
Real-world example: A service company uses tooltips to display daily customer visits and revenue when hovering over a date.
5. Add Dynamic Measures
Measures make your calendar more insightful. For example, to calculate sales:
Total Sales = SUM(Sales[SalesAmount])
Or count daily orders:
Total Orders = COUNTROWS(Sales)
You can then display these measures inside the calendar, making each cell a mini insight point.
6. Enhance the Visual Design
To polish your calendar visualization:
Use Custom Visuals like Calendar by MAQ Software from AppSource.
Apply Themes that align with your company branding.
Add Year-over-Year comparisons for more advanced analytics.
Conclusion
Building a dynamic calendar visual in Power BI is not just about aestheticsโitโs about making time-based insights accessible and actionable. With a Date Table, a Matrix visual, and some interactivity, you can transform raw numbers into a calendar that tells a story.
Next time you design a Power BI report, try including a calendar visualโyouโll be surprised how much clarity it brings to your data.
From Dates to Insights: Creating an Interactive Calendar in Power BI
In the world of data visualization, small details often make the biggest difference. One of the most powerful yet simple visuals in Power BI is the KPI card. It may look minimal, but when designed correctly, it can turn raw numbers into quick, actionable insights. In this article, Iโll walk you through how I created my best Power BI KPI card, the thought process behind it, and why it made such a strong impact on reporting and decision-making.
What is a KPI Card in Power BI?
A KPI card in Power BI is a visual element that highlights one key numberโsuch as revenue, profit margin, or customer retention rate. It provides quick snapshots of performance without overwhelming users with too much detail.
Example: Instead of showing a whole sales report, a KPI card might just show โMonthly Sales: $120,000โ, making it clear and easy to digest.
Why I Focused on Designing a Better KPI Card
When I started using Power BI, my KPI cards were plainโjust numbers in a box. While functional, they didnโt tell a story or give enough context. I realized that a great KPI card should not only show a value but also:
Indicate progress toward a goal
Highlight changes over time
Use colors and icons to guide attention
For example, a sales KPI card showing $120,000 (up 15%) in green is much more insightful than just showing $120,000.
Steps I Took to Build My Best KPI Card
1. Choosing the Right Metric
I picked Net Profit Margin as the main KPI because it reflects both sales and costs, offering a balanced view of performance.
2. Adding Context with Targets
I set a target margin of 20%. Instead of just showing the current margin, the KPI card displayed:
Current Margin: 18%
Target: 20%
Status: Slightly below target
3. Using Conditional Formatting
I applied colors to quickly signal performance:
Green if margin ≥ 20%
Yellow if margin is between 15% and 20%
Red if margin is below 15%
This way, managers could immediately see performance without reading details.
4. Enhancing with Trend Indicators
I included an up/down arrow to show whether the margin improved compared to last month. A simple arrow added huge clarity.
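As a rough DAX sketch (assuming a marked Date table named 'Date' and an existing [Net Profit Margin] measure; the names here are illustrative, not the exact measures from my report):

Margin Change vs Last Month =
[Net Profit Margin]
    - CALCULATE ( [Net Profit Margin], DATEADD ( 'Date'[Date], -1, MONTH ) )

Trend Arrow =
IF ( [Margin Change vs Last Month] >= 0, UNICHAR ( 9650 ), UNICHAR ( 9660 ) )

UNICHAR returns the up or down triangle, which can sit next to the KPI value or feed conditional formatting.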
Why This Card Became My Best
This KPI card stood out because it wasnโt just a numberโit was a decision-making tool. Executives could glance at it and instantly know:
Current performance
How close we were to the goal
Whether we were improving or declining
It turned reporting into actionable insights, and thatโs the ultimate goal of Power BI.
Real-Life Example
Imagine a retail company using this KPI card.
January Margin: 18% (red arrow down)
February Margin: 21% (green arrow up)
Within seconds, leadership knows that February outperformed expectations and that corrective actions taken in January worked.
Conclusion
A well-designed KPI card in Power BI is more than a simple number. Itโs a visual story that provides clarity, direction, and impact. My best KPI card combined clear metrics, contextual targets, color coding, and trend indicatorsโtransforming data into meaningful insights.
If you havenโt experimented with KPI cards yet, start small but design with purpose. A single card can be more powerful than a whole dashboard if done right.
How I Designed My Best KPI Card in Power BI
When people hear โmachine learning,โ they often imagine advanced algorithms, massive datasets, and futuristic applications. But at the heart of all of this lies a very old discipline: mathematics.
It is the language that powers every neural network, regression model, and recommendation system. Many learners feel intimidated because they think they need to master every single branch of math. The truth is, you donโt โ you only need to focus on the specific areas that drive machine learning forward.
This article will guide you step by step through the math you need, why it matters, and how to actually learn it without getting lost.
1. Linear Algebra: The Language of Data
Linear algebra forms the foundation of machine learning. Data in machine learning is often represented as vectors and matrices. For example, a grayscale image can be thought of as a matrix where each element corresponds to the brightness of a pixel. When you feed that image into a machine learning model, it performs matrix operations to detect patterns such as edges, shapes, and textures.
To get comfortable, focus on the basics: vectors, matrices, matrix multiplication, dot products, and eigenvalues. Once you understand these, youโll see why every deep learning library (like TensorFlow or PyTorch) is essentially a giant machine for matrix operations.
Real-life example: When Netflix recommends movies, it uses linear algebra to represent both users and movies in a shared space. By comparing the “distance” between your vector and a movieโs vector, the system decides whether to recommend it.
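As a toy illustration of that idea (the vectors and dimensions below are invented, not Netflix's actual model), similarity between a user vector and a movie vector takes only a few lines:

import numpy as np

# Hypothetical 4-dimensional "taste" vectors: action, comedy, drama, sci-fi
user    = np.array([0.9, 0.1, 0.3, 0.8])
movie_a = np.array([0.8, 0.2, 0.1, 0.9])   # action/sci-fi heavy
movie_b = np.array([0.1, 0.9, 0.7, 0.0])   # comedy/drama heavy

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(user, movie_a))  # high score: likely recommended
print(cosine_similarity(user, movie_b))  # low score: likely skipped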
2. Calculus: The Engine of Learning
While linear algebra structures the data, calculus drives the learning process. Machine learning models improve themselves by minimizing error โ and that is achieved through derivatives and gradients.
For instance, the popular Gradient Descent algorithm is simply an application of calculus. By taking the derivative of the loss function with respect to model parameters, the algorithm knows which direction to move to reduce errors. You donโt need to master every integration trick, but you should feel comfortable with derivatives, partial derivatives, and gradients.
Real-life example: Imagine training a self-driving carโs vision system. The model makes a mistake identifying a stop sign. Gradient Descent kicks in, adjusting the modelโs internal parameters (weights) slightly so that next time, the probability of recognizing the stop sign is higher. That entire process is powered by calculus.
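Here is a minimal sketch of gradient descent itself, fitting a one-parameter line to toy data (the numbers are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 7.9])   # roughly y = 2x

w = 0.0      # model parameter (slope)
lr = 0.01    # learning rate

for step in range(500):
    pred = w * x
    grad = np.mean(2 * (pred - y) * x)   # derivative of mean squared error w.r.t. w
    w -= lr * grad                       # step against the gradient

print(round(w, 2))  # converges close to 2.0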
3. Probability and Statistics: The Logic of Uncertainty
Machine learning is about making predictions under uncertainty, and thatโs exactly where probability and statistics come in. Without them, you canโt evaluate models, understand error rates, or deal with randomness in data.
Key concepts include probability distributions, expectation, variance, conditional probability, and hypothesis testing. These tools help you answer questions like: How confident is the model in its prediction? Is this result meaningful, or just random noise?
Real-life example: In spam detection, a model doesnโt โknowโ for sure if an email is spam. Instead, it assigns a probability, such as 95% spam vs. 5% not spam. That probability comes from statistical modeling and probability theory.
4. Optimization: The Art of Improvement
Every machine learning model has one ultimate goal: optimization. Whether itโs minimizing the error in predictions or maximizing the accuracy of classification, optimization ensures the model keeps getting better.
Basic optimization concepts include cost functions, convexity, constraints, and gradient-based optimization methods. Even complex deep learning boils down to solving optimization problems efficiently.
Real-life example: Support Vector Machines (SVMs), one of the classic ML algorithms, rely entirely on optimization to find the best decision boundary between two classes. Without optimization, the algorithm wouldnโt know which boundary is the โbest.โ
5. Discrete Math and Logic: The Algorithmic Backbone
Though sometimes overlooked, discrete mathematics provides the foundation for algorithms and data structures โ both critical in machine learning. Concepts like sets, combinatorics, and graph theory help us design efficient models and handle structured data.
Real-life example: Decision trees, widely used in machine learning, depend heavily on concepts from discrete math. They split data based on logical conditions and count possible outcomes โ exactly the kind of reasoning that discrete math teaches.
How to Learn Efficiently
Start small, but stay consistent. Pick one math topic and dedicate short daily sessions to it.
Apply while you learn. Donโt study math in isolation. Code small ML models in Python to see concepts like gradients or matrices in action.
Use visual resources. Channels like 3Blue1Brown make abstract concepts like eigenvectors and gradient descent easy to grasp visually.
Practice problems. Work through exercises, not just theory. Solving problems cements your understanding.
Conclusion
You donโt need to be a mathematician to succeed in machine learning, but you do need the right mathematical foundations. Focus on linear algebra for data representation, calculus for learning dynamics, probability and statistics for handling uncertainty, optimization for model improvement, and discrete math for algorithmic thinking. When you learn these topics gradually and connect them to coding practice, math stops being an obstacle and becomes your greatest ally in building powerful machine learning models.
In 2025, Artificial Intelligence is no longer just a buzzwordโitโs a goldmine for career growth. Companies across tech, finance, healthcare, and even creative industries are willing to pay $120K to $200K+ for professionals with the right AI skills. But hereโs the truth: having just AI knowledge isnโt enough. Employers want proof you can apply itโand thatโs where top-tier AI certifications come in.
These credentials not only validate your expertise but also give you a competitive edge in a job market thatโs moving faster than ever. In this article, weโll break down the best AI certifications to land you a high-paying role in 2025, plus real-world salary examples to show their impact.
1. Google Professional Machine Learning Engineer
Why Itโs Worth It: Offered by Google Cloud, this certification focuses on designing, building, and deploying ML models at scale. Itโs highly respected because it tests your real-world problem-solving skills, not just theory.
Average Salary: $150K–$180K+
Key Skills Covered:
ML pipeline design and optimization
Google Cloud AI tools (Vertex AI, BigQuery ML)
Model deployment and monitoring
Example: A certified ML engineer at a fintech startup earned a $40K raise within six months after getting this credential.
2. Microsoft Certified: Azure AI Engineer Associate
Why Itโs Worth It: Microsoftโs Azure platform powers thousands of AI-driven applications worldwide. This certification ensures you can design AI solutions using Azure Cognitive Services, Language Understanding (LUIS), and Computer Vision.
Average Salary: $140K–$165K+
Key Skills Covered:
Building chatbots and NLP models
Deploying AI solutions in the cloud
Integrating AI with enterprise apps
Example: A mid-level developer transitioned into an AI engineer role with a $30K salary jump after earning this cert.
3. IBM AI Engineering Professional Certificate (Coursera)
Why Itโs Worth It: A beginner-to-intermediate track thatโs perfect if you want hands-on exposure to AI and ML using Python, Scikit-learn, and TensorFlow. Recognized globally due to IBMโs brand reputation.
Average Salary: $120K–$150K+
Key Skills Covered:
Machine learning fundamentals
Deep learning with Keras and PyTorch
AI application deployment
Example: A data analyst used this cert to switch to AI project management, boosting income by 45%.
4. AWS Certified Machine Learning โ Specialty
Why Itโs Worth It: Amazon Web Services dominates the cloud market, and this certification proves you can build and deploy ML models using AWS SageMaker, Rekognition, and Comprehend.
Average Salary: $155K–$200K+
Key Skills Covered:
Data engineering for ML
Model training and tuning
AI-driven automation
Example: A senior developer became a cloud AI consultant post-certification and now bills $150/hour.
5. Machine Learning Specialization by Stanford University (Andrew Ng)
Why It's Worth It: Taught by Andrew Ng, this program is a global benchmark for AI education. While not a "vendor" certification, it opens doors to research and product innovation roles.
Average Salary: $140K–$175K+
Key Skills Covered:
Core ML algorithms
Neural networks
Real-world AI deployment strategies
Example: A startup co-founder used this credential to attract investors by showcasing technical credibility.
Pro Tips for Choosing the Right Certification
Match with your career goal: Cloud AI certs (AWS, Azure, Google) are great for deployment-heavy roles, while academic certs (Stanford, IBM) suit research or product innovation paths.
Check employer demand: Use LinkedIn or Indeed to see which certifications appear most in job postings.
Leverage your background: If you already know Python and data analysis, go for intermediate/advanced tracks; beginners should start with foundational certs.
Conclusion
AI is not just the futureโitโs the present. With the right certification, you can break into a high-paying career, shift to a more in-demand role, or even launch your own AI-powered startup. The key is choosing a certification that aligns with your skills and ambitions, then applying it to solve real-world problems.
Your next step? Pick one of the certifications above, commit to the training, and let 2025 be the year your career skyrockets.
AI Certification Comparison Table (2025)
Certification | Provider | Cost (Approx.) | Duration | Key Skills | Avg. Salary After Completion
Google Professional Machine Learning Engineer | Google Cloud | $200 USD (exam fee) | 3–6 months prep | ML pipeline design, Google Cloud AI tools, deployment | $150K–$180K+
Microsoft Certified: Azure AI Engineer Associate | Microsoft | $165 USD (exam fee) | 2–4 months prep | Azure Cognitive Services, NLP, Computer Vision | $140K–$165K+
IBM AI Engineering Professional Certificate | IBM (via Coursera) | $39/month subscription | 4–6 months | Python, Deep Learning, Scikit-learn, PyTorch | $120K–$150K+
AWS Certified Machine Learning – Specialty | Amazon Web Services | $300 USD (exam fee) | 4–7 months prep | AWS SageMaker, AI-driven automation, model tuning | $155K–$200K+
Machine Learning Specialization | Stanford University (Andrew Ng) | $79/month (Coursera) | 3–5 months | Core ML algorithms, neural networks, real-world AI | $140K–$175K+
If you’ve ever stared at rows of messy data in a CSV file and felt overwhelmed, youโre not alone. Like many newcomers to data analysis, I once struggled with cleaning, transforming, and analyzing datasetsโuntil I discovered the true power of Pandas, Pythonโs go-to data manipulation library. In this article, Iโll walk you through the data workflow I wish I had known when I first started. Whether you’re a beginner or someone whoโs used Pandas but still feels stuck, this guide will make your data tasks smoother and more intuitive.
1. Start with the Right Mindset: Think in DataFrames
When I first learned Pandas, I treated it like a spreadsheet with some coding on top. Big mistake. I would manipulate lists or dictionaries and use Pandas only occasionally. It wasnโt until I fully embraced the DataFrame as my primary data structure that things started making sense.
The moment everything clicked was when I started thinking in DataFramesโas in, blocks of data that you manipulate with chainable methods. Imagine each operation as a transformation on a flowing river of data, rather than discrete manual edits. This mental shift makes complex operations easier to reason through and structure logically.
Pro Tip: Always load your data into a DataFrame, not a list, dict, or array, unless you absolutely have to.
2. Cleaning is Not Optional (But It’s Easier Than You Think)
Data rarely comes clean. It usually arrives with missing values, duplicates, inconsistent types, or poorly named columns. If you skip this step, you’ll run into problems down the line when performing analysis.
The workflow I now follow (and recommend) is:
Check data types to understand what you’re dealing with
Handle missing values to prevent errors
Remove duplicates to avoid skewed results
Normalize column names for readability and easier access
Pandas makes this easy and consistent, especially once you get familiar with the basic syntax.
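For instance, a typical cleaning pass might look like the following. The file and column names are hypothetical; swap in your own.

import pandas as pd

df = pd.read_csv("sales.csv")                      # hypothetical input file

print(df.dtypes)                                   # 1. check data types
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

df = df.dropna(subset=["order_id"])                # 2. handle missing values
df = df.drop_duplicates()                          # 3. remove duplicates

df.columns = (                                     # 4. normalize column names
    df.columns.str.strip().str.lower().str.replace(" ", "_")
)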
These simple commands can clean up even the messiest CSV files.
3. Use Chaining for Readable, Efficient Code
Instead of assigning intermediate results to new variables and cluttering your notebook or script, Pandas allows for method chaining. This style improves both readability and maintainability of your code.
When you chain methods, each step is like a filter or transformer in a pipeline. You can clearly see whatโs happening to the data at each point. It reduces the cognitive load and removes the need for multiple temporary variables.
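A sketch of the chained style, using the same hypothetical sales data as above:

import pandas as pd

summary = (
    pd.read_csv("sales.csv")
      .dropna(subset=["order_id"])
      .drop_duplicates()
      .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
      .query("amount > 0")
      .sort_values("order_date")
)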
By chaining, your logic stays close together and easy to trace.
4. Master the Power Trio: groupby(), agg(), and pivot_table()
Once your data is clean, analysis becomes a breeze if you master these three powerful tools: groupby(), agg(), and pivot_table(). They are the backbone of summary statistics, trend spotting, and dimensional analysis.
GroupBy lets you split your data into groups and apply computations on each group.
Agg lets you define multiple aggregation functions like sum, mean, count, etc.
Pivot tables reshape your data for cross-comparisons across categories.
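Here is a minimal sketch of all three in action, again with hypothetical column names (region, product, amount):

import pandas as pd

df = pd.read_csv("sales.csv")

by_region = df.groupby("region")["amount"].sum()

stats = df.groupby("product").agg(
    total_sales=("amount", "sum"),
    avg_order=("amount", "mean"),
    orders=("amount", "count"),
)

pivot = df.pivot_table(
    values="amount", index="region", columns="product",
    aggfunc="sum", fill_value=0,
)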
These are key steps to go from raw data to valuable insight.
Youโll use these in nearly every project, so itโs worth getting comfortable with them early.
5. Visualize Early, Not Late
Pandas integrates smoothly with Matplotlib and Seaborn, two of the most popular Python plotting libraries. Rather than waiting until the end of your analysis, it’s often smarter to visualize as you go.
Early plotting helps catch outliers, understand distributions, and spot anomalies or trends. You donโt need fancy dashboardsโeven a simple histogram or line chart can provide key insights that numbers alone canโt.
Making visualization part of your standard workflow will greatly improve your understanding of the data.
6. Export Your Final Output Like a Pro
After cleaning, analyzing, and visualizing your data, you need to share or store the results. Pandas makes it effortless to export your DataFrame in various formats.
Exporting your data isnโt just about saving your workโitโs about creating reusable, shareable assets for collaborators or clients. Whether it’s a clean CSV or a styled Excel file, always include this final step.
Donโt let your insights live only in your notebookโget them out there.
7. Automate Repetitive Tasks
If you notice you’re repeating the same steps across projects or datasets, itโs time to automate. This can be as simple as creating a reusable function or as advanced as building an entire pipeline script.
Functions help encapsulate logic and make your code modular. It also makes onboarding easier when sharing your work with teammates or revisiting it months later.
Start small, and automate more as you go.
Conclusion: Pandas Is a Superpower, Once You Master the Flow
At first, Pandas felt clunky to meโtoo many functions, too many options. But once I embraced the data workflow mindsetโclean, chain, group, visualize, exportโit all made sense.
If youโre new to Pandas, donโt try to memorize every method. Instead, focus on the workflow. Build your foundation around practical tasks, and Pandas will become your favorite tool in no time.
As a Python developer, I used to pride myself on writing everything from scratch. Whether it was a quick script to clean a dataset or a complex automation workflow, I found joy in crafting each line of code myself. But over time, I realized that reinvention isnโt always smart โ especially when the Python ecosystem offers libraries so powerful and polished, they simply outshine any homegrown solution. Here are the eight libraries that made me retire my own scripts.
1. Pandasโ My Go-To Data Wrangler
I used to write long, clunky loops to clean and manipulate CSV files. Then I discovered Pandas. With one-liners like df.dropna() or df.groupby(), I was doing in seconds what used to take hours. Whether I’m merging datasets or reshaping tables, Pandas has become my Swiss Army knife for data.
Before Pandas: 50 lines of nested loops
After Pandas: 3 lines of elegance
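Something in the spirit of those three lines (the file and column names are invented for illustration):

import pandas as pd

df = pd.read_csv("survey.csv").dropna()
summary = df.groupby("country")["score"].mean()
summary.to_csv("summary.csv")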
2. BeautifulSoup โ No More Manual HTML Parsing
Scraping the web used to be a nightmare of regex and fragile string manipulation. BeautifulSoup changed that. With its intuitive syntax, parsing HTML and XML now feels like reading a book. I stopped worrying about malformed tags and started focusing on insights.
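A sketch of what that looks like in practice (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/blog").text
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]  # the line doing the real work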
That one line replaced dozens of lines of messy parsing logic.
3. Requests โ The End of urllib Torture
Ever tried to use urllib.request? I did โ once. Then I met Requests. It made HTTP calls human-friendly. With simple methods like .get() and .post(), Requests reads like plain English. I no longer need to wrestle with headers, sessions, or cookies on my own.
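A minimal example, hitting a placeholder endpoint:

import requests

response = requests.get("https://api.example.com/users", params={"page": 1}, timeout=10)
response.raise_for_status()        # raise a clear error on 4xx/5xx responses
users = response.json()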
It just works. Every time.
4. Typer โ Command-Line Interfaces Without the Pain
For CLI tools, I used to rely on argparse. It worked, but the syntax was verbose. Typer changed my world. Built on top of Click, it lets me build rich CLI apps using Python type hints. It’s intuitive, readable, and scalable โ even for complex tools.
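A small sketch of a Typer app (the command and options are made up):

import typer

app = typer.Typer()

@app.command()
def greet(name: str, shout: bool = False):
    """Say hello from the command line."""
    message = f"Hello, {name}!"
    typer.echo(message.upper() if shout else message)

if __name__ == "__main__":
    app()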
With Typer, I shipped tools 3x faster.
5. OpenPyXL โ Automating Excel the Right Way
I once wrote a monstrous VBA script to generate Excel reports. That ended the day I found OpenPyXL. It lets me create, read, and edit .xlsx files natively in Python. I can style cells, create charts, and update formulas without opening Excel.
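A minimal report-generation sketch (the sheet layout and numbers are invented):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Report"
ws.append(["Month", "Revenue"])            # header row
for row in [("Jan", 12000), ("Feb", 15500)]:
    ws.append(row)
ws["B4"] = "=SUM(B2:B3)"                   # a live Excel formula, no macros needed
wb.save("report.xlsx")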
Excel automation is now just another Python script โ no macros, no drama.
6. Rich โ Console Output That Makes Me Look Good
Debugging output and CLI logs were always boring, until I started using Rich. This library transformed my terminal output into a colorful, styled experience with progress bars, tables, markdown, and even live updates.
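For example, a small status table rendered with Rich (the job names are placeholders):

from rich.console import Console
from rich.table import Table

console = Console()
table = Table(title="Nightly Job Summary")
table.add_column("Task")
table.add_column("Status", style="green")
table.add_row("Backup", "OK")
table.add_row("Report", "OK")
console.print(table)
console.print("[bold green]All jobs finished[/bold green]")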
Rich made my tools feel like apps, not scripts.
7. Schedule โ Human-Readable Task Scheduling
Instead of writing cron jobs or manually handling datetime logic, I now use schedule. It lets me define jobs in a language that almost reads like English.
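A sketch of that readability (the job and time are arbitrary):

import time
import schedule

def send_report():
    print("Sending the daily report...")

schedule.every().day.at("08:00").do(send_report)

while True:
    schedule.run_pending()
    time.sleep(30)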
Itโs like having a built-in personal assistant for Python.
8. PyAutoGUI โ The End of Repetitive Clicks
I once wrote scripts to automate workflows in specific apps, relying on API access (if available). But many apps donโt have APIs. Thatโs where PyAutoGUI comes in. It controls the mouse, keyboard, and screen like a robot assistant.
Iโve used it to batch-edit images, generate reports, and even auto-fill web forms โ no backend access required.
Final Thoughts: Stop Reinventing, Start Reusing
Thereโs pride in writing original code. But thereโs power in knowing when not to. These libraries saved me hours of frustration, reduced bugs, and supercharged my productivity. If youโre still writing your own scripts for tasks that are already solved โ maybe itโs time to stop.
Let Pythonโs ecosystem do the heavy lifting. Youโve got better things to build.
In the age of information, data science has quietly transformed from a buzzword to a secret weapon behind every great customer experience. Companies today donโt just rely on good training and courteous staff โ they also lean heavily on the silent force of algorithms and predictive models that keep their customer support running like a well-oiled machine.
So, whatโs the hidden magic that makes data science so powerful in this space? Letโs break it down.
Turning Conversations Into Insights
Every chat message, support ticket, or phone call holds a wealth of information. Traditionally, companies would handle these one by one, reactively solving issues. But modern customer support teams harness data science to process thousands โ even millions โ of interactions and distill them into meaningful trends.
By applying natural language processing (NLP), support teams can analyze what customers are talking about in real-time: Are there recurring complaints? Where are customers getting stuck? What product features are confusing?
This insight doesnโt just help solve individual cases faster โ it feeds back into product improvements, FAQ updates, and proactive outreach that stops problems before they spread.
Predicting Problems Before They Happen
One of the secret superpowers of data science is prediction. By analyzing historical patterns, machine learning models can flag customers who are likely to churn, escalate, or leave a bad review.
Imagine knowing which users will probably run into payment errors or shipping delays โ and reaching out with helpful guidance before they even file a ticket. Thatโs the next level of support.
Big companies like Amazon, Netflix, and telecom giants have invested millions in this approach โ but the same technology is becoming accessible for small businesses through SaaS platforms and affordable AI tools.
Automating the Repetitive, Empowering the Human
Not all support interactions need a human agent. Bots powered by data science handle routine questions 24/7: order tracking, password resets, account updates. These AI assistants learn from massive datasets to answer with near-human fluency โ but the real magic is that they free up human agents for high-value conversations that require empathy and nuanced judgment.
This hybrid approach means customers get faster replies for simple requests and more personalized help for complex ones โ a win-win for satisfaction and operational costs.
Personalization at Scale
Data science also powers personalization. With the right models, a support team can instantly pull up a customerโs past purchases, preferences, and issues โ and tailor the conversation accordingly.
Instead of asking a customer to repeat their story for the fifth time, the agent (or the AI) knows exactly what they bought, when they called last, and what solutions worked before. This level of context not only saves time but builds trust.
Real-Time Performance Tuning
Support managers used to rely on static reports โ now, live dashboards powered by data analytics track agent performance, ticket volumes, resolution times, and customer sentiment in real-time.
This visibility lets teams spot bottlenecks as they happen, shift resources quickly, and reward top performers. Data-driven coaching has become the norm, not the exception.
Final Thoughts: The Silent Advantage
When done right, customers never even notice the data science humming in the background โ they just feel heard, understood, and helped.
For businesses, the ROI is clear: fewer support costs, happier customers, and a constant stream of insights to improve products and services. The secret power of data science in customer support isnโt about replacing people โ itโs about making them smarter, faster, and better equipped to deliver experiences that keep customers coming back.
In todayโs world, Artificial Intelligence feels like an unavoidable buzzword โ and with good reason. Itโs transforming industries, reshaping how we work, and opening up opportunities that didnโt exist a decade ago. Naturally, thousands of eager learners flock to online AI courses hoping to become AI experts overnight. But hereโs the uncomfortable truth: jumping from one random course to another often leaves you with shallow, disconnected knowledge and no real ability to solve real-world problems.
Too many people buy yet another course, hoping this one will finally โclick.โ They skim through a few video lessons, copy some code snippets, maybe run a basic neural network โ but when it comes time to build something meaningful or troubleshoot an issue, they feel completely lost. Thatโs because real understanding doesnโt come from binge-watching lectures. It comes from deliberate, structured learning โ and for that, you still canโt beat good books.
Why Random Courses Are Failing You
Itโs not that online courses are bad. Many are well-produced and taught by experts. But when you hop from one to the next without a plan, youโre patching together fragments of knowledge with no strong foundation underneath. You might learn to run someone elseโs code โ but do you really understand why it works? Could you adapt it to a new problem? Could you explain it to someone else?
This shallow learning leaves you vulnerable. The field of AI evolves quickly, and tools and libraries change all the time. If you donโt understand the core principles, youโll constantly feel like youโre playing catch-up โ and sooner or later, youโll burn out or give up altogether.
Books force you to slow down. They take you deeper than any 3-hour video course ever will. When you work through a book โ with a pen, paper, and plenty of time to think โ you build a mental framework that helps you connect ideas, question assumptions, and truly own what you learn.
The Books That Will Make You Truly Understand AI
So, if youโre ready to ditch the random course cycle, here are a few books that can build your AI knowledge from the ground up and make you a better practitioner for years to come.
1. โPattern Recognition and Machine Learningโ by Christopher M. Bishop
This book is a heavyweight classic for a reason. Itโs not an easy read โ but it lays out the mathematical and statistical foundations that power modern machine learning. Expect to revisit your linear algebra and probability knowledge. Work through the derivations. Try to implement the algorithms from scratch. By the time youโre done, youโll see behind the curtain of so many โblack boxโ models you find online.
2. โDeep Learningโ by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Think of this book as your deep dive into the world of neural networks and modern AI systems. It explains the mechanics behind deep learning architectures, why they work, where they fail, and how to build better models. If you want to understand how the tools like TensorFlow or PyTorch are built โ not just how to call their functions โ this is your map.
3. โArtificial Intelligence: A Modern Approachโ by Stuart Russell and Peter Norvig
This is the standard textbook in university-level AI courses. It doesnโt just cover machine learning โ it explores the entire landscape of AI, including logic, planning, knowledge representation, robotics, and even the philosophical questions we face when building intelligent machines. Itโs a book that broadens your view and shows you that AI is more than just training models.
4. โThe Hundred-Page Machine Learning Bookโ by Andriy Burkov
If Bishopโs and Goodfellowโs tomes feel intimidating, this book is a perfect starting point. It condenses core ML concepts into a readable, concise format. You wonโt master every detail from it alone, but itโs excellent for building a mental map before you go deeper โ or for refreshing key ideas when you need a quick reference.
5. โYou Look Like a Thing and I Love Youโ by Janelle Shane
Learning AI isnโt only about equations and algorithms โ itโs also about understanding its quirks and limitations. This book is a witty, accessible look at how AI works (and fails) in the real world, through hilarious experiments and relatable explanations. It reminds you not to take every AI claim at face value, and gives you a healthy sense of skepticism โ an essential trait for any serious AI learner.
How to Make the Most of These Books
Donโt treat these books like bedtime reading. Slow down. Take notes. Highlight passages. Rework the math by hand. Build small projects to test the theories you read about. The goal isnโt just to finish the book โ itโs to absorb it so well that you can explain what you learned to someone else.
When you do need a course โ and sometimes you will โ youโll approach it with intention. Youโll know exactly what you want to learn: a specific framework, tool, or implementation detail. That way, the course becomes a practical supplement, not your only source of truth.
Build a Knowledge Foundation That Lasts
The tech world is full of shiny tools and short-lived trends, but the principles that power AI โ probability, statistics, optimization, and logic โ donโt go out of style. If you build your learning on a solid foundation, youโll always be able to pick up new skills, adapt to changing tools, and stay ahead of the hype.
So next time youโre tempted to buy yet another AI crash course, pause. Pick up a good book instead. Make some coffee, find a quiet place, and give yourself permission to dig deep. Your future self โ the one solving real-world AI problems with confidence โ will thank you.
The data science job market is booming, but so is the competition. Companies want data scientists who are not just technically strong, but also able to communicate insights and solve real problems. To stand out, you need to understand what employers value most. Technical skills, soft skills, and industry-specific knowledge all play an important role.
Build a Strong Portfolio
One of the best ways to get noticed is to have a portfolio that proves what you can do. Donโt rely only on your resume. Create a portfolio website where you showcase your projects. Include case studies, GitHub repositories, and even visual dashboards if possible. Make sure each project tells a clear story โ what was the problem, what data did you use, how did you solve it, and what impact did it have?
Master the Essential Tools
Recruiters expect you to know popular tools and programming languages like Python, R, SQL, and frameworks like TensorFlow or PyTorch for machine learning. But beyond just listing them, show that youโve applied them. For example, share a project where you used Python for web scraping or R for statistical analysis. This practical application makes your skills credible.
Develop Soft Skills
Technical skills alone wonโt guarantee you a job. Companies love data scientists who can explain complex findings in simple terms, work well in teams, and communicate with non-technical stakeholders. Practice storytelling with data โ try presenting your projects in videos or blog posts. It shows you know how to translate data into decisions.
Gain Real-World Experience
If youโre just starting out, internships, volunteering, or freelancing can make a huge difference. Contribute to open-source data science projects or participate in hackathons. These experiences help you learn teamwork, solve real-world problems, and make connections in the field.
Network Like a Pro
Donโt underestimate the power of networking. Attend data science meetups, webinars, and conferences. Engage in online communities like LinkedIn groups or Kaggle forums. Many opportunities come through word of mouth, so let people know youโre looking and ready.
Tailor Every Application
Customize your resume and cover letter for each job. Highlight the skills and projects that match the job description. Use keywords that recruiters use. This small effort can help your application pass automated screening tools and reach a real human.
Keep Learning
The field of data science evolves fast. Stay updated by taking new courses, earning certifications, or learning emerging tools and trends. Showing that youโre committed to growth makes you a stronger candidate.
Final Thoughts
Standing out in the data science job market is about more than just technical skills. Build a portfolio that proves your abilities, develop your communication skills, gain experience, and make real connections. If you do this consistently, youโll position yourself ahead of the competition.
Freelance software development remains one of the most practical and flexible ways to earn money with coding skills. As a freelancer, you have the freedom to choose your clients, negotiate your rates, and decide which projects align with your interests and skill level. Many freelancers start small, working on simple website projects, bug fixes, or feature enhancements, and gradually move on to larger, higher-paying contracts as their reputation grows. Platforms like Upwork, Fiverr, and Toptal make it easier than ever to connect with clients worldwide who need everything from full-stack web development to custom mobile apps and automation scripts. While freelancing demands good communication skills, time management, and the ability to deliver clean, maintainable code on time, it also builds your portfolio, widens your network, and provides a constant flow of diverse challenges that help you grow technically and professionally. Whether you do it as a side hustle or a full-time business, freelancing gives you the freedom to earn on your own terms.
Example 1: Build a custom WordPress website for a local business.
Example 2: Develop a mobile app for a startup looking to launch an MVP (Minimum Viable Product).
Example 3: Offer bug fixing or code optimization services on platforms like Upwork or Fiverr.
2. Create and Sell Digital Products
Building and selling digital products is one of the most scalable ways to make money with coding. Unlike freelancing, where you trade time for money, digital products can generate passive income for years after you create them. Coders often build plugins, SaaS tools, website themes, or automation scripts that solve a common problem for a niche audience. Once your product is built, you can sell it on marketplaces or directly through your own website, and focus on marketing and support instead of constantly coding new projects from scratch. Many developers find success by listening to user feedback and continually improving their product, which keeps customers happy and attracts new buyers. This approach demands upfront effort, but the long-term reward is the possibility of recurring revenue streams without needing to negotiate with new clients for every dollar you earn.
Example 1: Design a Shopify theme and sell it on the Shopify Theme Store.
Example 2: Develop a time-tracking app and offer it as a subscription service.
Example 3: Build and sell custom scripts or automation tools for repetitive tasks.
3. Teach Coding Online
Sharing your coding knowledge through teaching is another powerful way to earn an income. If you enjoy explaining technical ideas in simple, clear ways, you can turn that skill into online courses, video tutorials, ebooks, or even live one-on-one lessons. Many beginners are willing to pay for well-organized, structured learning experiences rather than piecing together free resources. Platforms like Udemy, Skillshare, and Teachable allow you to create courses once and earn passive income every time someone enrolls. You could also offer personalized tutoring sessions, webinars, or coding bootcamps for those who prefer interactive learning. Teaching not only generates income but also strengthens your own understanding of programming, keeps you up-to-date with new technologies, and builds a personal brand as an expert in your field.
Example 1: Launch a Udemy course about building REST APIs with Node.js.
Example 2: Write an ebook that explains JavaScript for absolute beginners.
Example 3: Offer private tutoring sessions through a platform like Superprof or Wyzant.
4. Work Remotely for Companies
Remote work has transformed the job market for coders, offering stable income, benefits, and a steady flow of projects โ all without being tied to a physical office. Companies around the world increasingly hire developers who can work from anywhere, which gives you the freedom to choose employers that align with your values and interests. Working remotely means you can collaborate with global teams, contribute to large-scale projects, and build long-term relationships that grow your skills and professional network. Many remote developers find opportunities in areas like web development, cloud services, mobile apps, and backend infrastructure. This route is ideal if you prefer the stability of a salary and a team environment over managing your own clients.
Example 1: Get hired as a front-end developer for a SaaS company.
Example 2: Work as a backend engineer maintaining cloud services.
Example 3: Join a remote team as a full-stack developer building web platforms.
5. Build and Monetize a Blog or YouTube Channel
If you enjoy creating content and helping others learn, you can build an audience by sharing your coding insights for free โ then earn money through monetization. Many successful developers run blogs or YouTube channels where they post tutorials, deep dives, or personal experiences about the tech industry. Once you grow a loyal audience, you can monetize through ads, sponsorships, or affiliate marketing, recommending tools and services that you genuinely use. Though it takes time and consistency to build trust and attract viewers or readers, the long-term benefit is that your content can generate passive income while also boosting your reputation and opening up new career or business opportunities.
Example 1: Write detailed tutorials on your blog and earn through ad revenue and affiliate links.
Example 2: Create coding tutorial videos on YouTube and join the YouTube Partner Program.
Example 3: Partner with tech companies to sponsor your content and promote their tools.
6. Contribute to Open Source and Get Sponsorships
Open-source development is more than just a way to give back to the community โ it can also become a revenue stream if you build something useful enough to attract sponsorships or donations. Many developers maintain open-source libraries, frameworks, or tools that others rely on for their own work. As your project grows in popularity, companies and individuals may sponsor you to keep the project maintained and secure. Platforms like GitHub Sponsors, Patreon, or Buy Me a Coffee make it simple for supporters to contribute financially. Some developers also offer premium add-ons, consulting, or custom integrations around their open-source projects, creating even more ways to generate income while keeping the core product free for the community.
Example 1: Develop a popular JavaScript library and get sponsorship from companies using it.
Example 2: Maintain a free tool for developers and receive donations via Buy Me a Coffee.
Example 3: Offer premium support or custom add-ons for your open-source software.
7. Develop Mobile Apps and Games
Creating your own mobile apps or games gives you the chance to earn money while exercising full creative control over what you build. Many successful indie developers design simple, addictive apps that solve a specific need or entertain users. Once published on app stores like Google Play or the Apple App Store, your app can earn money through paid downloads, ads, or in-app purchases. While the competition is fierce, the low cost of publishing and the massive user base of smartphones worldwide make it an attractive option for coders who want to build something of their own. Continuous updates, user feedback, and good marketing are essential to stand out and keep your app relevant.
Example 1: Create a simple productivity app and charge a small one-time fee.
Example 2: Build a casual game and earn revenue from in-game ads.
Example 3: Offer premium features via subscriptions within your app.
8. Automate Business Solutions for Clients
Automation is a goldmine for coders who understand how to connect tools, build scripts, or develop bots that save time and money. Many small and medium-sized businesses are eager to pay developers who can automate repetitive tasks like data entry, reporting, customer service, or marketing processes. With the growing use of APIs and cloud services, the demand for tailored automation solutions keeps increasing. Coders who specialize in automation often find themselves in high demand because they directly help clients increase efficiency and profits, which makes their services valuable and justifies premium rates.
Example 1: Develop a custom Python script to automate report generation for an e-commerce company.
Example 2: Create a chatbot that handles customer support on a clientโs website.
Example 3: Integrate multiple web services to automate tasks like lead generation and email marketing.
Final Thoughts
The beauty of coding lies in its endless flexibility. Whether you want to build your own product, teach others, create content, freelance, or solve problems for clients, your skills can be turned into income streams that match your interests and lifestyle. The key is to experiment, stay curious, and keep adding value โ because when you do, the opportunities to earn will keep growing alongside your skills.
In recent years, the explosion of large language models (LLMs) like ChatGPT and Codex has dramatically changed how developers write and interact with code. These models, trained on vast datasets of code and natural language, can now generate entire programs or solve complex problems from simple prompts. But as their use becomes more widespread, a new question arisesโhow can one tell if a piece of Python code was written by a human developer or by an LLM? While these models are capable and often indistinguishable from seasoned coders at first glance, there are still telltale signs in the structure, style, and logic of the code that can betray its machine origin.
1. Overuse of Comments and Literal Explanations
One of the clearest signs that code may have been written by an LLM is the excessive use of comments. LLMs tend to document every single step of the code, often restating the obvious. You might see comments like # create a variable right before x = 5, or # return the result before a return statement. While documentation is a good practice, this level of verbosity is uncommon among experienced human developers, who typically write comments only where context or reasoning isnโt immediately clear from the code. LLMs, however, are optimized to โexplainโ and โteachโ in natural language, often mirroring tutorial-like patterns.
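A contrived illustration of the pattern (written for this article, not taken from any real model output):

# define a function that doubles a number
def double(value):
    # multiply the value by two
    result = value * 2
    # return the result
    return result

# create a variable
x = 5
# call the function and print the result
print(double(x))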
2. Redundant or Overly Generic Variable Names
LLMs often default to safe, generic naming conventions like data, result, temp, or value, even when more meaningful names would make the code clearer. For instance, in a function analyzing user behavior, a human might use click_rate or session_length, whereas an LLM might stick with data and metric. This genericity stems from the modelโs tendency to avoid assumptions, which leads it to play things conservatively unless explicitly instructed otherwise. While not definitive on its own, consistent blandness in namingโespecially when better domain-specific choices are obviousโcan be a strong clue.
3. Consistently Clean Formatting and Structure
LLMs are extremely consistent when it comes to code formatting. Indentation is uniform, line lengths are well-managed, and spacing tends to follow PEP8 recommendations almost religiously. While this sounds like a positive trait, it can actually be a subtle giveaway. Human-written code, especially in informal or prototyping contexts, often has minor inconsistenciesโa missed blank line here, an overly long function elsewhere, or slightly inconsistent docstring formatting. LLMs donโt โget tiredโ or โsloppyโ; their outputs are unusually tidy unless prompted otherwise.
4. Over-Engineering Simple Tasks
Sometimes, LLMs will take a simple problem and solve it in an unnecessarily complex way. For example, a human might write if item in list: but an LLM might create a loop and check for membership manuallyโespecially in more open-ended prompts. This stems from their broad training base, where theyโve โseenโ many ways to solve similar problems and might overfit to more generic patterns. This complexity isnโt always wrong, but itโs often not how a developer whoโs experienced in Python would approach the problem.
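For example, the two styles side by side on toy data:

items = ["apple", "banana", "cherry"]
target = "banana"

# The idiomatic check an experienced Python developer usually writes:
if target in items:
    print("found")

# The more verbose pattern described above, doing the same job:
found = False
for element in items:
    if element == target:
        found = True
        break
if found:
    print("found")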
5. Inclusion of Edge Case Handling Without Necessity
LLMs often include edge case handling even when it might not be strictly necessary. For instance, in code that processes input from a clearly defined dataset, an LLM might still add checks like if input is None: or if len(array) == 0:. This behavior reflects the LLMโs bias toward generality and safetyโit doesn’t know the constraints of the data unless told explicitly, so it preemptively includes protective logic. A human who understands the context may skip such checks for brevity or efficiency.
6. Code That Looks โToo Tutorial-Likeโ
LLM-generated code often mimics the tone and structure of programming tutorials or documentation examples.
You may see a main function with an if __name__ == "__main__": block in a script that doesnโt need it. Or functions may be more modular than necessary for the size of the task. These are patterns picked up from countless educational resources the LLM has trained on. Humans often write messier, more pragmatic code in real-world settingsโespecially when prototyping or exploring.
7. A Lack of Personal or Contextual Style
Every developer picks up a subtle fingerprint over time: a preference for certain idioms, naming schemes, or even whimsical variable names. LLMs, on the other hand, generate code that feels neutral and impersonal. You won't see inside jokes in function names or highly specialized abbreviations unless prompted. The code is highly readable, but it lacks personality. This trait varies with the prompt and model temperature, yet it is often noticeable in a large enough codebase.
8. Uniformly Optimistic Coding Style
Finally, LLM-generated Python code often assumes a "happy path" execution style, adding only generic, surface-level error handling. It tends to avoid more nuanced debugging strategies such as logging to files, raising specific exceptions, or using breakpoint tools. The result is code that feels clean but sometimes lacks the depth of error tracing and robustness that seasoned developers build into systems through experience and iteration.
Conclusion: Recognizing the Machine Signature
As LLMs continue to evolve and improve, the line between human- and machine-written code will become increasingly blurred. However, by paying attention to stylistic choices, verbosity, naming conventions, and structural tendencies, you can still often spot the subtle clues of an LLMโs hand in a Python script. These differences arenโt inherently badโin fact, LLMs can write very high-quality, maintainable codeโbut recognizing their style is useful for educators, code reviewers, and developers working in collaborative environments where transparency about tooling is important. In the future, detecting LLM-generated code may become even more critical as we navigate the ethics and implications of AI-assisted development.
Introduction: Turning Knowledge Into a Scalable Resource
In the fast-paced world of data analytics, tools, templates, and shortcuts can make the difference between working efficiently and drowning in spreadsheets. Like many data analysts, I found myself repeatedly building similar dashboards, queries, and reports for different clients or projects. It occurred to meโwhat if I could transform my repeatable processes and best practices into a single, powerful resource pack that others could benefit from?
Thus began the journey of creating my Data Analytics Resource Packโa comprehensive, plug-and-play collection of tools, templates, and guides designed for analysts, students, and businesses alike. But creating it was more than just compiling files. It required strategic thinking, user research, and iteration. And the payoff? It sells consistently and is now a trusted toolset in the community.
Identifying the Need: What Analysts Were Missing
Before building anything, I asked myself a key question: โWhat are the biggest pain points for new and intermediate data analysts?โ To answer that, I reviewed forum discussions, surveyed LinkedIn connections, and read countless Reddit threads in r/dataanalysis and r/datascience.
Common struggles I identified:
Lack of reusable, customizable Excel/Google Sheets dashboards
Confusion over structuring SQL queries efficiently
Inconsistency in visual reporting in tools like Power BI or Tableau
Poor understanding of KPI frameworks in business contexts
Too much time spent writing documentation and metadata tables manually
These insights shaped the skeleton of my resource pack. The goal was to eliminate redundancy and standardize efficiency.
Building the Pack: From Raw Ideas to Organized Assets
Once I defined the needs, I began creating assets under four key categories:
1. SQL & Query Optimization Templates
I included frequently used query patterns (JOINs, window functions, date aggregations) with business case examples, like tracking customer churn or inventory turnover. Interactive Example: I embedded a Google Colab notebook that lets users run and tweak SQL code using SQLite in-browser.
2. Excel & Google Sheets Dashboards
These templates covered marketing funnels, financial KPIs, and A/B test tracking. Each came with dropdown filters, conditional formatting, and slicers. Interactive Example: A pre-linked Google Sheet with editable fields that users could copy and test instantly.
3. Power BI / Tableau Starter Kits
I included pre-configured dashboards with dummy datasets for practice. These visualizations covered product analytics, customer segmentation, and real-time sales tracking. Interactive Example: A shared Tableau Public workbook embedded via iframe with interactive filters.
4. Documentation & Reporting Templates
Analysts often overlook documentation. I created Notion-based templates for project charters, data dictionaries, and stakeholder report briefs.
By keeping the tools modular, users could pick and choose what they neededโwithout being overwhelmed.
Packaging and Presentation: Why the Format Matters
The success of the resource pack wasnโt just about contentโit was also about how I packaged it.
File Organization: Clearly named folders with version histories, separated by tool/platform
Onboarding Guide: A 10-minute โGetting Startedโ PDF and a Loom walkthrough video
Version Control: All files hosted on Google Drive with update notifications via email list
Bonus Content: A private Notion workspace with exclusive resources, released monthly
These extras created a premium experience that made users feel supported and guided, even after purchase.
Marketing the Right Way: Why It Gained Traction
I didnโt launch with a big ad budget. Instead, I leveraged authentic sharing and educational marketing:
LinkedIn Case Studies: I wrote posts showing before-and-after examples of using the templates
Free Mini-Packs: I gave away a subset of tools in exchange for email signups
Webinars: I hosted live walkthroughs explaining how to use the pack with real datasets
Testimonials: Early users left reviews, which I featured on my site with permission
This community-first approach created a word-of-mouth loop. People began tagging me in posts, sharing my tools in Slack groups, and recommending it in bootcamp cohorts.
Why It Sells: The Value Is Clear
The resource pack continues to sell because it saves time, solves real problems, and evolves:
Time-saving: Users get instant access to what would otherwise take months to build.
Applicability: Works across industriesโfinance, marketing, logistics, and e-commerce.
Continual Updates: Subscribers know theyโll get new material every quarter.
In short, the value isnโt just the toolsโitโs the time, clarity, and confidence those tools bring.
Conclusion: Think Like a Problem Solver, Not Just an Analyst
Creating the Data Analytics Resource Pack taught me a crucial lesson: the best products emerge when you listen, simplify, and deliver with care. As data analysts, we already solve problems every day. Packaging that skill into a resource others can use is just the next step in leveraging your value.
If you’re a data analyst thinking about building a product, start by listening. Look at the questions people ask again and again. Thatโs where the opportunity lives.
The integration of Artificial Intelligence (AI) into data analytics has transformed how professionals like myself work, think, and deliver results. As a data analyst, AI is not just a buzzwordโitโs an everyday assistant, decision-making partner, and a powerful tool that amplifies productivity. From data cleaning to insights generation, AI supports me at every stage of the analytical process. In this article, Iโll walk you through how AI is woven into my daily workflow and why I consider it indispensable.
Streamlining Data Cleaning with AI
One of the most time-consuming aspects of data analysis is cleaning and preparing datasets. AI tools help me automate this process significantly. For instance, I use AI-enhanced spreadsheet tools and Python libraries like Pandas AI to detect outliers, impute missing values, and suggest corrections in data formatting. Previously, these steps would require manual inspection or complex if-else logic. Now, with AI’s pattern recognition, data inconsistencies are flagged automatically, and in many cases, AI even proposes the best course of action. This allows me to focus more on analytical thinking rather than tedious preprocessing.
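As a rough sketch of what those chores look like in plain pandas (the file and column handling below is illustrative, not my actual project setup):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # placeholder input file

# Flag numeric outliers with a simple z-score rule (|z| > 3)
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
outlier_rows = (z_scores.abs() > 3).any(axis=1)

# Impute missing values: median for numeric columns, mode for everything else
for col in df.columns:
    if df[col].dtype.kind in "if":
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(f"{outlier_rows.sum()} rows flagged for manual review")
```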
Enhancing Data Exploration and Pattern Detection
Once the data is clean, the next step is explorationโunderstanding the story hidden within. Here, AI shines by accelerating the discovery of correlations and anomalies. I often rely on AI-powered visualization platforms such as Power BI with Copilot or Tableauโs Ask Data feature. These tools allow me to pose natural language questions like โWhich product category had the steepest revenue decline last quarter?โ and get instant, meaningful charts in return. AI doesnโt just surface insights; it guides me to patterns I might have missed, making exploratory analysis more intuitive and less biased.
Automating Routine Reports
Every analyst knows the repetitive nature of reportingโweekly sales updates, monthly performance summaries, etc. Instead of manually generating these reports, Iโve automated them using AI-driven scheduling tools that also interpret the data. Using ChatGPT via API integration, I can automatically generate narrative explanations of KPIs and append them to dashboards. The output reads like a human-written summary, which adds context for stakeholders. This saves hours of work every week and ensures consistency and clarity in reporting.
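A minimal sketch of that kind of automation with the OpenAI Python client; the model name and KPI fields are placeholders rather than the exact pipeline described here:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

kpis = {"revenue": 182_000, "revenue_target": 175_000, "churn_rate": 0.042}

prompt = (
    "Write a two-sentence executive summary of this week's KPIs "
    f"for a non-technical audience: {kpis}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-capable model works
    messages=[{"role": "user", "content": prompt}],
)

summary = response.choices[0].message.content
print(summary)  # this text gets appended to the dashboard commentary
```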
Smarter Forecasting and Predictive Modeling
AI takes my forecasting work to a new level. Traditional statistical models like ARIMA or exponential smoothing are still valuable, but AI-based forecasting tools (such as Facebook Prophet or AutoML platforms) can handle more variables, detect seasonality better, and adapt to sudden changes in the data. For instance, when predicting customer churn or future demand, I use AutoML pipelines that tune hyperparameters and evaluate multiple model types in a single run. This significantly increases accuracy while reducing modeling time.
Natural Language Processing (NLP) for Unstructured Data
A major part of modern analytics includes dealing with unstructured dataโsurvey responses, customer reviews, chat logs, etc. AI enables me to process these text-based sources through Natural Language Processing (NLP). I use tools like spaCy, OpenAIโs embeddings, and Google Cloud NLP to classify sentiment, extract keywords, and group responses by topic. This gives structure to otherwise messy data and allows me to incorporate qualitative insights into quantitative dashboardsโa powerful combination that delivers richer decision-making insights to my team.
Real-Time Data Alerts and Anomaly Detection
Rather than waiting to review data after the fact, AI empowers me to set up real-time monitoring systems. I use AI anomaly detection tools in platforms like Azure Monitor and Datadog to continuously track business metrics. If anything unusual happensโsay, a 40% drop in website conversions or an unexpected spike in cost-per-clickโI get instant alerts. These intelligent monitoring systems not only notify me, but also attempt to explain the root cause using contextual data. It turns reactive work into proactive insight.
Personal Productivity and Workflow Optimization
AI doesnโt just help with dataโit helps with my day-to-day workflow too. I use AI writing assistants like Grammarly and ChatGPT to draft emails, explain data findings to non-technical stakeholders, and even generate technical documentation. I also rely on AI calendar assistants and meeting summarizers like Otter.ai to capture meeting notes, extract action items, and keep projects organized. By offloading mundane tasks to AI, I free up time to do what really matters: thinking critically about data and translating it into impact.
Collaborating with AI as a Thought Partner
Finally, the most surprising and transformative use of AI in my day is as a thought partner. When I hit a roadblockโsay, unsure which statistical test to use or whether my data sampling approach is validโI often turn to AI tools like ChatGPT for suggestions. Itโs like brainstorming with a fast, knowledgeable colleague who can offer perspectives, generate hypotheses, or even debug my SQL queries. This collaboration doesnโt replace human judgment, but it enhances it by giving me confidence in exploring ideas more quickly.
Conclusion
The role of a data analyst is evolving fast, and AI is at the heart of that evolution. It doesnโt just make tasks fasterโit makes them smarter. From improving the quality of data to sharpening insights and increasing productivity, AI is the ultimate co-pilot in my analytical journey. Itโs not a luxury anymore; itโs a necessity. And as AI continues to improve, Iโm excited about how much more it can enhance not only my workflow but the entire field of data analytics.
The emergence of ChatGPT has revolutionized how people seek information, learn, write, and even think. With its human-like conversation abilities, it offers instant answers, well-structured essays, and creative content on demand. For students, professionals, and content creators alike, ChatGPT has become an indispensable assistant. But hidden beneath its convenience is a growing concern: what happens to the human mind when we begin outsourcing thinking, creativity, and decision-making to an artificial entity? The very tool designed to aid us might also be dulling the edge of our mental sharpness.
Mental Laziness: Erosion of Critical Thinking
One of the most alarming mental side effects of overreliance on ChatGPT is the erosion of critical thinking. In the past, finding answers required reading multiple sources, synthesizing ideas, and forming independent conclusions. Now, with one prompt and one click, users receive refined answers without effort. This shortcut bypasses the mental workout that deep thinking demands. Gradually, people may become less inclined to question, challenge, or analyzeโrelying instead on the surface-level comfort of a neatly packaged AI response. This fosters mental passivity, where users consume information without truly engaging with it.
The Illusion of Understanding: False Mastery
ChatGPT can explain complex ideas with striking clarity. While this can be a tremendous asset for learning, it also breeds a dangerous illusion: the belief that one understands something simply because it has been explained well. This cognitive shortcut can lead users to feel overconfident in their knowledge, skipping the deeper stages of inquiry and practice that true mastery requires. Over time, this can create a generation of “Google-smart” individualsโwho sound informed but lack the depth and resilience of real expertise.
Dependency and Decision Paralysis
Another underreported side effect is the growing dependency on AI to make even the simplest decisions. Should I send this email? How should I respond to this message? What should I say in this caption? When people begin turning to ChatGPT for these small, daily choices, their own decision-making muscles begin to atrophy. This breeds a kind of digital co-dependency that undermines confidence. In extreme cases, it may result in decision paralysisโwhere a person struggles to act without first consulting the AI. When intuition and self-trust weaken, even basic autonomy is compromised.
Suppression of Creativity: Outsourcing Original Thought
Creativity thrives on ambiguity, struggle, and the messy process of trial and error. But ChatGPT offers clean, polished ideas within seconds. While this can jumpstart a creative process, it can also short-circuit it. Writers may stop brainstorming. Designers may skip sketching. Students might avoid outlining their own thoughts before generating a perfect essay. Over time, this convenience can suppress original thought. When AI-generated content becomes the default starting point, the human mind becomes reactive rather than imaginativeโlimiting innovation and originality.
Emotional Disconnection and Intellectual Isolation
An unexpected psychological effect of heavy ChatGPT use is emotional detachment. When users spend more time engaging with an AI than with real people, subtle shifts in communication patterns, empathy, and emotional awareness can occur. Human conversation is messy, nuanced, and emotionally richโqualities that AI cannot replicate. Prolonged substitution of real conversations with AI interactions may lead to a sense of emotional numbness and social withdrawal. Additionally, users may begin to internalize AI’s linguistic style, further distancing themselves from authentic self-expression.
Cognitive Offloading: The Atrophy of Memory and Learning
As with GPS reducing our ability to navigate, ChatGPT may erode our ability to retain information. When everything is a prompt away, the brain starts to offload memory and problem-solving to the machine. Why memorize facts, dates, or concepts when you can retrieve them instantly? While this might seem efficient, it comes at a cost. Cognitive offloading reduces the brainโs working memory and ability to connect ideas across time. The mental muscles required for long-term learning, recall, and synthesis begin to fade.
Conclusion: Mind the Machine
ChatGPT is a remarkable toolโcapable of expanding access to knowledge, simplifying complexity, and even boosting productivity. But when used without boundaries, it quietly reshapes how we think, learn, and relate to the world. The danger lies not in the tool itself, but in the habits it cultivates. Relying too heavily on ChatGPT can dull critical thinking, weaken creativity, and erode our mental independence. To harness its power responsibly, we must strike a balance: use it as a companion, not a crutch. Let it inform, but not replace, the vital processes of human thought.
Introduction: A Glimpse Into the Data-Driven Decade
By mid-2025, itโs hard to ignore just how central data analytics has become in shaping the modern world. Over the past decade, data has transitioned from a niche back-office function to a pillar of strategic decision-making across nearly every industry. Governments, corporations, non-profits, and startups alike have invested heavily in data infrastructure, talent, and tools to harness the predictive and diagnostic power of information. In this data-driven era, organizations that failed to embrace analytics risked irrelevance. Yet now, the conversation is beginning to shift. With the rise of automation, increasing regulatory constraints, and a maturing marketplace, many professionals and business leaders are asking a sobering question: Is the window of opportunity in data analytics starting to close? This article explores that question through the lens of innovation, labor dynamics, regulatory change, and strategic transformation.
Automation and Generative AI: Shifting the Value Proposition
One of the most significant developments reshaping data analytics in 2025 is the rise of generative AI and automated analytical tools. The introduction of large language models (LLMs), AutoML systems, and user-friendly interfaces has made it dramatically easier for non-technical users to perform complex data tasks. Business users can now query databases using natural language, generate predictive models without writing a single line of code, and visualize insights in seconds with AI-assisted dashboards. On the surface, this democratization seems like a triumphโorganizations can make data-informed decisions faster and more affordably. But this progress also raises fundamental questions about the role of the traditional data analyst. As machines increasingly handle the technical execution, the core value of the human analyst is being reevaluated. Analysts are now expected to do more than produce modelsโthey must contextualize findings, apply domain-specific judgment, and align recommendations with organizational strategy. The opportunity isnโt goneโbut itโs moving up the value chain, demanding greater business fluency and creative problem-solving from data professionals.
Talent Saturation and the Evolving Skill Landscape
Between 2015 and 2023, the exploding demand for data professionals sparked a global wave of upskilling. Universities launched new degrees, online platforms offered certification bootcamps, and employers invested in internal training. By 2025, this momentum has resulted in an abundant talent poolโespecially at the entry level. Roles that once required rare skills are now more accessible, and basic competencies in Python, SQL, and data visualization are often considered standard. As a result, competition has intensified, and salaries for junior roles have plateaued or declined in some regions. The most sought-after professionals today are not just data-literateโthey are domain experts who can speak the language of the industry they serve. For example, a data scientist with deep knowledge of supply chain operations is more valuable to a logistics company than a generalist analyst with broader but shallower capabilities. The market no longer rewards technical skills alone; instead, it favors hybrid professionals who bring cross-disciplinary insight and the ability to turn raw data into strategic intelligence.
Regulatory Constraints and the Ethics of Data Use
As the power of data has grown, so too have the concerns around how it is collected, stored, and applied. In 2025, data privacy is no longer a peripheral issueโitโs at the heart of digital governance. Stringent regulatory frameworks such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and new legislation emerging across Asia and Latin America have fundamentally altered the landscape. Organizations must now navigate a complex web of compliance, consent, data sovereignty, and transparency. Additionally, high-profile data breaches and ethical missteps have made the public more skeptical about how their information is used. As a result, companies are increasingly investing in privacy-preserving technologies like differential privacy, federated learning, and synthetic data. This environment places new responsibilities on data professionals, who must balance analytical ambition with legal and ethical prudence. The opportunity to innovate remainsโbut it must now be done within a framework of accountability, trust, and regulatory foresight.
Organizational Maturity: From Excitement to Execution
In the early years of the data revolution, many organizations embraced analytics with a sense of experimental enthusiasm. Data teams were given free rein to explore, build models, and produce dashboardsโoften with little scrutiny over business outcomes. In 2025, that phase has largely passed. Executives are demanding clear ROI on data investments. Boards want to see how analytics drives revenue, reduces costs, or creates competitive advantage. This pressure has led to a more mature approach to data operations. Rather than treating data science as a standalone function, organizations are embedding analytics within core business unitsโensuring that insights are not only generated but also implemented. Analysts and data scientists are now working side-by-side with marketing, finance, operations, and product teams to shape initiatives and measure success. This evolution requires professionals to be as comfortable in a business meeting as they are with a Jupyter notebook. The data analytics field is not contractingโitโs consolidating into a more structured, accountable, and business-oriented discipline.
Conclusion: A Tighter Window, But a Deeper Opportunity
So, is the window of opportunity closing for data analytics in 2025? The answer depends on how you define opportunity. For those who seek easy entry and quick rewards, the landscape is indeed more challenging. The influx of talent, automation of routine tasks, and rising expectations mean that superficial skills are no longer enough. But for those willing to adapt, specialize, and deepen their impact, the opportunities are arguably greater than ever. The field is evolving from an experimental frontier to a critical enterprise function. It demands a new kind of professionalโone who can navigate technology, ethics, business, and human behavior. In that sense, the window hasnโt closedโitโs simply moved higher. Those who reach for it with a broader set of skills and a deeper understanding of context will find it still wide open.
SQL (Structured Query Language) continues to be an essential skill for data analysts, data scientists, backend developers, and database administrators. Interviewers often assess a candidateโs ability to query, manipulate, and understand data stored in relational databases. Below are ten fundamental SQL interview questions every job seeker should be prepared to solve. Each section includes a discussion of the concept behind the question and how to approach solving it.
1. Finding the Second Highest Salary
A classic question that tests both your understanding of subqueries and ordering data is: โHow do you find the second highest salary from a table named Employees with a column Salary?โ This question challenges the candidate to think beyond the basic MAX() function. The most common approach involves using a subquery to exclude the highest salary. For instance, you might write:
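One common version, using the Employees table and Salary column named in the question:

```sql
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employees
WHERE Salary < (SELECT MAX(Salary) FROM Employees);
```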
This SQL statement works by first retrieving the highest salary using the inner query and then selecting the next maximum value that is less than this result. Alternatively, one can use the DENSE_RANK() or ROW_NUMBER() window function to assign a rank to each salary and filter for the second position, which is often the preferred method in real-world scenarios due to better flexibility and performance on large datasets.
2. Retrieving Duplicate Records
Interviewers often want to assess your ability to detect and handle duplicates in a dataset. A common formulation is: โFind all duplicate email addresses in a Users table.โ Solving this requires knowledge of grouping and filtering. The typical solution groups by the email field and uses the HAVING clause to count occurrences greater than one:
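A typical version, assuming the column is simply called email:

```sql
SELECT email, COUNT(*) AS occurrences
FROM Users
GROUP BY email
HAVING COUNT(*) > 1;
```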
This query groups all the rows by email and then filters out groups that appear only once, revealing only those with duplicates. Understanding how to use GROUP BY in conjunction with HAVING is crucial for this type of question, and being able to extend this to return the full duplicate rows can show deeper SQL proficiency.
3. Joining Tables to Combine Information
An essential part of SQL interviews involves joining multiple tables. One typical question might be: โList all employees and their department names from Employees and Departments tables.โ This tests your understanding of foreign keys and join operations. Assuming Employees has a DepartmentID field that relates to Departments.ID, the query would be:
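A sketch of that join; the Name columns are assumed for illustration:

```sql
SELECT e.Name AS EmployeeName, d.Name AS DepartmentName
FROM Employees e
INNER JOIN Departments d ON e.DepartmentID = d.ID;
```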
This inner join ensures that only employees with a valid department ID in the Departments table are returned. Being comfortable with inner joins, left joins, and understanding when to use each is vital, as real-world databases are often normalized across many tables.
4. Aggregating Data with GROUP BY
A frequently asked question focuses on aggregation, such as: โFind the number of employees in each department.โ This requires using GROUP BY along with aggregate functions like COUNT(). The solution would look like this:
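For example, grouping on the DepartmentID column:

```sql
SELECT DepartmentID, COUNT(*) AS EmployeeCount
FROM Employees
GROUP BY DepartmentID;
```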
This query groups the employees by their department and counts how many belong to each. Candidates should also be prepared to join this with the Departments table if the interviewer asks for department names instead of IDs. Mastery of aggregate functions is a critical skill for reporting and dashboard development.
5. Filtering with WHERE and HAVING
Sometimes interviewers combine conditions in the WHERE and HAVING clauses to see if you can distinguish their roles. For example: โList departments having more than 10 employees and located in โNew York.โโ Here, WHERE is used for row-level filtering, and HAVING for group-level. The query would be:
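One possible shape of that query, assuming the department name and location live on the Departments table:

```sql
SELECT d.Name, COUNT(*) AS EmployeeCount
FROM Employees e
JOIN Departments d ON e.DepartmentID = d.ID
WHERE d.Location = 'New York'
GROUP BY d.Name
HAVING COUNT(*) > 10;
```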
This structure filters rows before aggregation and then filters groups after aggregation. Misplacing conditions (like using HAVING where WHERE should be) is a common pitfall interviewers watch for.
6. Using CASE Statements for Conditional Logic
Another insightful question is: โWrite a query that classifies employees as โSeniorโ if their salary is above 100,000, and โJuniorโ otherwise.โ This tests the use of CASE for deriving new columns based on logic. The solution might look like this:
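For instance, assuming Name and Salary columns on Employees:

```sql
SELECT Name,
       Salary,
       CASE
           WHEN Salary > 100000 THEN 'Senior'
           ELSE 'Junior'
       END AS SeniorityLevel
FROM Employees;
```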
The CASE expression allows for readable conditional logic within SELECT statements. It's commonly used in dashboards, reports, and when transforming raw data for business use.
7. Ranking Data with Window Functions
Advanced interviews often include questions about window functions. A common one is: โRank employees by salary within each department.โ This requires partitioning and ordering data within groups. The SQL might look like:
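A representative version, ranking within each DepartmentID:

```sql
SELECT Name,
       DepartmentID,
       Salary,
       RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary DESC) AS SalaryRank
FROM Employees;
```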
Window functions like RANK(), DENSE_RANK(), and ROW_NUMBER() are powerful tools for ranking and running totals. Demonstrating knowledge of PARTITION BY and ORDER BY clauses within OVER() shows a deeper understanding of SQL.
8. Finding Records Without Matches
A common real-world scenario is identifying rows that donโt have a corresponding entry in another table. A typical question might be: โFind all customers who have not placed any orders.โ This requires a LEFT JOIN with a NULL check:
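A sketch using Customers and Orders tables linked by a CustomerID column:

```sql
SELECT c.CustomerID, c.Name
FROM Customers c
LEFT JOIN Orders o ON o.CustomerID = c.CustomerID
WHERE o.OrderID IS NULL;
```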
This query joins the two tables and filters to find customers with no related order. It tests your understanding of outer joins and NULL handling, a frequent need in reporting and data quality checks.
9. Working with Dates and Time Ranges
Handling date-based queries is another key interview area. One question could be: โFind all orders placed in the last 30 days.โ This requires using date functions like CURRENT_DATE (or GETDATE() in some dialects):
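For example, in PostgreSQL-style syntax (the date arithmetic differs by dialect; MySQL would use DATE_SUB(CURDATE(), INTERVAL 30 DAY)):

```sql
SELECT *
FROM Orders
WHERE OrderDate >= CURRENT_DATE - INTERVAL '30 days';
```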
Interviewers might follow up by asking for orders grouped by week or month, testing your knowledge of date formatting, truncation, and aggregation. Comfort with time functions is essential for real-world reporting.
10. Deleting or Updating Based on a Subquery
Finally, you might be asked to perform a DELETE or UPDATE using a condition derived from a subquery. For example: โDelete all products that were never ordered.โ This combines filtering with referential logic:
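One straightforward version, assuming Orders records the ProductID of every purchased product:

```sql
DELETE FROM Products
WHERE ProductID NOT IN (SELECT ProductID FROM Orders);
-- Note: NOT IN matches nothing if Orders.ProductID contains NULLs,
-- which is one reason the NOT EXISTS form below is usually preferred.
```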
Alternatively, a more performant version might use NOT EXISTS:
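For example, with PostgreSQL-style aliasing (some dialects spell the alias differently):

```sql
DELETE FROM Products p
WHERE NOT EXISTS (
    SELECT 1
    FROM Orders o
    WHERE o.ProductID = p.ProductID
);
```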
This type of question ensures you understand how to manipulate data safely using subqueries and conditions.
Conclusion
Mastering these ten SQL questions is more than just interview prep: it builds a foundation for solving real-world data challenges. Whether filtering data with precision, writing complex joins, or leveraging window functions for advanced analytics, these exercises develop fluency in SQL's powerful capabilities. To further improve, practice variations of these questions, explore optimization techniques, and always be prepared to explain the logic behind your approach during interviews.
Whether you're a job-seeking data scientist or a software engineer expanding into AI, one challenge keeps coming up: "Can you explain how this machine learning model works?"
Interviews are not exams; they're storytelling sessions. Your technical accuracy matters, but your communication skills set you apart.
Let's break down how to explain the core ML models so any interviewer, technical or not, walks away confident in your understanding.
1. Linear Regression
Goal: Predict a continuous value
How to Explain: "Linear regression is like drawing the best-fit straight line through a cloud of points. It finds the line that minimizes the distance between the actual values and the predicted ones using a technique called least squares."
Pro Tip: Add a real-world example:
"For example, predicting house prices based on square footage."
Interview bonus: Explain assumptions like linearity, homoscedasticity, and multicollinearity if prompted.
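If the conversation turns hands-on, a tiny, hypothetical scikit-learn sketch of the house-price example can back up the explanation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: square footage vs. sale price
sqft = np.array([[800], [1200], [1500], [2000], [2600]])
price = np.array([158_000, 215_000, 248_000, 305_000, 392_000])

model = LinearRegression().fit(sqft, price)   # ordinary least squares
print(model.coef_[0], model.intercept_)       # slope (price per sq ft) and intercept
print(model.predict([[1800]]))                # predicted price for a 1,800 sq ft home
```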
2. Logistic Regression
Goal: Predict probability (classification)
How to Explain: "It's like linear regression, but instead of predicting a number, we predict the probability that something is true, like whether an email is spam. It uses a sigmoid function to squash the output between 0 and 1."
Common trap: Many confuse it with regression.
Clarify early: "Despite the name, it's used for classification."
3. Decision Trees
Goal: Easy-to-interpret classification/regression
How to Explain: "Imagine making decisions by asking a sequence of yes/no questions; that's a decision tree. It splits data based on feature values to make decisions. Each internal node is a question; each leaf is an outcome."
Highlight interpretability:
"They're great when you need to explain why a decision was made."
4. Random Forest
Goal: Improve accuracy, reduce overfitting
How to Explain: "It's like asking a group of decision trees and taking a majority vote (for classification) or averaging their results (for regression). Each tree is trained on a different subset of data and features."
Metaphor: "Think of it as crowd wisdom: combining many simple models to make a more robust one."
5. Support Vector Machine (SVM)
Goal: Maximum margin classification
How to Explain: "SVM tries to draw the widest possible gap (margin) between two classes. It finds the best boundary so that the closest points of each class are as far apart as possible."
Interview tip: "It can also work in higher dimensions using kernels, which helps when the data isn't linearly separable."
6. K-Nearest Neighbors (KNN)
Goal: Lazy classification based on proximity
How to Explain: "KNN looks at the k closest data points to a new point and makes a decision based on the majority label. It's like saying: 'Let's ask the neighbors what class this belongs to.'"
Note: "No training phase; it stores the training data and computes distances at prediction time."
7. Naive Bayes
Goal: Probabilistic classification
How to Explain: "It uses Bayes' Theorem to predict a class, assuming all features are independent. That's the naive part. Despite the simplification, it works well in text classification like spam filtering."
Use case: "Gmail uses something similar to detect spam based on word frequencies."
8. Gradient Boosting (e.g., XGBoost, LightGBM)
Goal: Strong prediction from weak learners
How to Explain: "Gradient boosting builds models sequentially; each new model tries to fix the errors of the previous one. It's like learning from mistakes in stages."
Why it stands out: "They're often used in Kaggle competitions due to high accuracy and performance tuning."
9. K-Means Clustering
Goal: Group similar data points (unsupervised)
How to Explain: "K-Means divides data into clusters by minimizing the distance between points and the center of each cluster. The number of clusters k is set beforehand."
Simplify: "It's like putting customers into different buckets based on their purchase patterns."
Final Tip: Tailor to the Interview
When explaining any model, remember this simple formula:
What it does
How it works (intuitively)
When to use it
Real-world example
Let's Discuss:
What's your go-to analogy or trick when explaining ML models in interviews? Which model do you find hardest to explain clearly?
Drop your thoughts below, and let's build a library of intuitive explanations together.
The pandas library in Python provides powerful tools for data manipulation and analysis. Two of the most frequently used functions are pd.read_csv() for reading CSV files and pd.to_csv() for writing DataFrames to CSV files. While these functions are widely adopted due to their simplicity and efficiency, there are scenarios where alternatives might be preferable or even necessary. This article explores why one might avoid pd.read_csv() and pd.to_csv() and what alternative methods exist, categorized by different use cases.
Why Consider Alternatives?
Some common reasons include:
Performance issues with very large datasets.
Data stored in other formats (Excel, JSON, SQL, etc.).
Integration with cloud storage or databases.
Security or compliance constraints (e.g., encryption, access control).
Real-time or in-memory data that doesnโt involve files.
1. Alternatives to: pd.read_csv()
A. Reading from Other File Formats
a. Excel Files
b. JSON Files
c. Parquet Files (Optimized for large datasets)
d. HDF5 Format (Hierarchical Data Format)
e. SQL Databases
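Taken together, the readers for the formats listed above look roughly like this (file names, sheet names, keys, and the connection string are placeholders, and each format needs its optional dependency such as openpyxl, pyarrow, or PyTables):

```python
import pandas as pd
from sqlalchemy import create_engine

df_xlsx = pd.read_excel("data.xlsx", sheet_name="Sheet1")   # Excel
df_json = pd.read_json("data.json")                         # JSON
df_parq = pd.read_parquet("data.parquet")                   # Parquet
df_hdf  = pd.read_hdf("data.h5", key="table")               # HDF5
engine  = create_engine("sqlite:///analytics.db")           # any SQLAlchemy-supported database
df_sql  = pd.read_sql("SELECT * FROM sales", engine)        # SQL query or table name
```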
B. Reading from In-Memory Objects
a. Reading from a String (using io.StringIO)
b. Reading from a Byte Stream (e.g., in web APIs)
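Both in-memory cases boil down to wrapping the data in a file-like object; the URL below is a placeholder:

```python
import io

import pandas as pd
import requests

# From a string already in memory
csv_text = "id,amount\n1,10\n2,25\n"
df_from_str = pd.read_csv(io.StringIO(csv_text))

# From a byte stream, e.g. the body of an HTTP response in a web API
resp = requests.get("https://example.com/export.csv")
df_from_bytes = pd.read_csv(io.BytesIO(resp.content))
```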
C. Reading from Cloud Storage
a. Google Cloud Storage (using gcsfs)
b. Amazon S3 (using s3fs)
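With gcsfs or s3fs installed, pandas can read the cloud paths directly (bucket names and credentials below are placeholders):

```python
import pandas as pd

# Google Cloud Storage via gcsfs
df_gcs = pd.read_csv("gs://my-bucket/exports/data.csv")

# Amazon S3 via s3fs; credentials may also come from the environment or an IAM role
df_s3 = pd.read_csv(
    "s3://my-bucket/exports/data.csv",
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)
```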
2. Alternatives to: pd.to_csv()
A. Writing to Other File Formats
a. Excel
b. JSON
c. Parquet
d. HDF5
e. SQL Databases
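The writer methods mirror the readers above (output paths and table names are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"id": [1, 2], "amount": [10, 25]})

df.to_excel("out.xlsx", index=False, sheet_name="Report")     # Excel
df.to_json("out.json", orient="records")                      # JSON
df.to_parquet("out.parquet")                                  # Parquet
df.to_hdf("out.h5", key="table", mode="w")                    # HDF5
engine = create_engine("sqlite:///analytics.db")
df.to_sql("sales", engine, if_exists="replace", index=False)  # SQL table
```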
B. Writing to In-Memory or Networked Destinations
a. Export to a String
b. Export to Bytes (for APIs or web)
c. Save to Cloud Storage (e.g., AWS S3)
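A sketch of the in-memory and cloud cases (the S3 bucket and key are placeholders):

```python
import boto3
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "amount": [10, 25]})

csv_text = df.to_csv(index=False)   # returns a string when no path is given
payload = csv_text.encode("utf-8")  # bytes, ready for an HTTP response body

# Upload directly to S3 without touching the local filesystem
boto3.client("s3").put_object(Bucket="my-bucket", Key="exports/data.csv", Body=payload)
```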
3. Alternatives Outside Pandas
If avoiding pandas entirely:
A. Use Python's Built-in csv Module
B. Use numpy for Numeric Data
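Minimal sketches of both options (file and column names are placeholders):

```python
import csv

import numpy as np

# Built-in csv module: stream rows without building a DataFrame
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["id"], row["amount"])

# numpy for purely numeric matrices
matrix = np.loadtxt("numbers.csv", delimiter=",", skiprows=1)
np.savetxt("numbers_out.csv", matrix, delimiter=",", fmt="%.4f")
```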
Summary Table of Alternatives
Conclusion
While pd.read_csv() and pd.to_csv() are extremely versatile, a wide range of alternatives exist to suit various needs: from handling different data formats and sources, to performance optimization and cloud integrations. By understanding the context and requirements of your data workflow, you can select the most appropriate method for reading and writing data efficiently.
Introduction: Why the Top 1% Matters Now More Than Ever
In a world flooded with dashboards, KPIs, and big data buzzwords, the role of a data analyst has become both highly coveted and oversaturated. Everyone wants to be a data analyst โ but only a select few break into the top 1%. These are the professionals who donโt just crunch numbers; they influence billion-dollar decisions, predict business outcomes before they happen, and lead teams toward data-driven innovation. The year 2025 is poised to be a turning point โ the emergence of AI, automation, and new business expectations is rapidly shifting what it means to be โgreatโ in this field. If youโre a data analyst or aspire to be one, the question is no longer โhow do I get a job?โ but rather, โhow do I become irreplaceable?โ Thatโs what this article is all about โ not surviving, but standing out.
Master the Human Side of Data Before the Technical
Most aspiring analysts obsess over tools: Python, SQL, Power BI, Tableau โ and sure, these are essential. But hereโs an overlooked truth: the top 1% analysts understand why people need data, not just how to analyze it. They listen to stakeholders with empathy, translate fuzzy business needs into clear metrics, and speak the language of decision-makers โ not just of databases. You can have the cleanest dashboards in the world, but if you canโt connect them to a business narrative or decision, your insights go unheard. In 2025, soft skills are no longer optional. Learn how to ask better questions, read between the lines of a stakeholderโs request, and communicate findings like a storyteller. Technical brilliance may get you hired, but communication excellence will make you unforgettable.
Learn Fewer Tools, but Use Them Deeper
Thereโs a growing myth in the analytics community: to be successful, you must learn every tool. One week itโs Power BI, the next itโs Looker Studio, then Snowflake, R, and even Rust. But the top 1% know that true mastery comes from depth, not breadth. They pick a few core tools โ like SQL, Python, and Power BI โ and explore them beyond surface tutorials. They learn how to write efficient queries, automate repetitive tasks, and build end-to-end reporting pipelines. They dive into advanced DAX in Power BI or build predictive models using Pythonโs scikit-learn. In 2025, companies want analysts who donโt just follow a tutorial โ they want those who can build internal frameworks, optimize performance, and create scalable solutions. Focus your time on becoming irreplaceable in your core tools, and the rest will follow.
Think Like a Product Manager, Not Just an Analyst
This might be the biggest mindset shift you need to make: stop seeing yourself as a report generator, and start thinking like a product manager. Top 1% analysts treat every dashboard like a product โ they consider the user experience, track engagement, and iterate based on feedback. They donโt just deliver a report and disappear; they build tools that evolve with the business. In 2025, data analysts who can design self-serve experiences, reduce decision latency, and champion data adoption will be in a league of their own. Ask yourself: how can I turn my dashboard into a product that people want to use every day? How can I measure its impact? This product mindset makes you more valuable than any line of code you write.
Build a Personal Brand That Speaks Before You Do
Here’s a secret the top 1% know: your influence doesnโt begin in meetings or interviews โ it starts online. Building a personal brand as a data analyst in 2025 is not about bragging, itโs about sharing. Whether itโs on LinkedIn, Medium, or YouTube, the most respected analysts share real insights, mini case studies, tutorials, or even failures they’ve learned from. When you show your process publicly, people trust your skill before they meet you. You attract opportunities, build credibility, and join a global community. The top analysts of today didnโt wait for a company to validate them โ they published their learning journey, shared dashboards, and collaborated openly. If you want to rise to the top, donโt just level up in silence. Document your wins, your experiments, and your perspectives. The spotlight wonโt find you unless youโre visible.
Stay Ahead by Understanding the Future of Data Work
2025 is not just about better dashboards. Itโs about knowing whatโs coming โ and preparing for it. The top analysts are already exploring how AI copilots will change data analysis, how real-time data streaming will impact decision-making, and how data governance and ethics will play a central role in business trust. They understand that automation will replace repetitive tasks โ but not the analysts who think critically, explain patterns, and lead with context. To stay ahead, you must continuously ask: whatโs next? Subscribe to trends, explore new tools with curiosity, and always keep one eye on the horizon. Being among the top 1% means thinking beyond todayโs problem and anticipating tomorrowโs possibilities.
Conclusion: Letโs Talk โ Whatโs Holding You Back?
The journey to the top 1% is not linear, and it certainly isnโt easy. Itโs a combination of technical depth, business empathy, communication, and forward-thinking. But hereโs the good news โ the path is open to anyone who chooses to walk it with discipline and curiosity. Now, I want to hear from you: What do you think separates average data analysts from the great ones? Whatโs the one area youโre focusing on in 2025 to rise above the noise? Letโs open the floor โ comment below, share your thoughts, and letโs grow together.
Artificial Intelligence has often been painted as a grand, futuristic technology meant only for tech giants and programmers. But in todayโs world, AI is quietly slipping into the hands of everyday people, transforming from an intimidating mystery into a powerful ally. The remarkable thing is, you donโt have to abandon your career, take massive risks, or spend years retraining to make the most of it. AI is not about replacing your jobโitโs about supplementing your life. Itโs about creating opportunities, building passive income streams, and sharpening your skills in ways that fit seamlessly into your existing schedule. Whether you work full-time as a teacher, marketer, nurse, or engineer, thereโs a place for AI in your daily routine that could very well change your financial landscape.
Small Steps, Big Gains: Finding Your Unique AI Path
The beauty of todayโs AI revolution lies in its versatility. You donโt need to become a software developer to participate. Many people start small, exploring AI tools that match their personal interests or professional skills. Writers are using AI to speed up content creation and sell e-books. Graphic designers are leveraging AI-generated art platforms to create and sell digital prints or design templates online. Even social media managers and side hustlers are tapping into AI-driven marketing tools to manage campaigns, freeing up more time while increasing their income. The key is finding what feels natural to youโsomething that doesnโt feel like a second full-time job but instead feels like an exciting extension of your talents. AI isn’t here to change what you love; itโs here to supercharge how you express and monetize it.
Learning and Earning Simultaneously: The Low-Risk Advantage
One of the biggest fears people have when it comes to starting something new is the fear of losing what they already have. Traditional side businesses often require large upfront investments of time and money, not to mention a leap of faith into uncertainty. AI side hustles are different. Many powerful AI tools are either free or have very low-cost options, allowing you to experiment without risking your financial security. You can learn as you go, often using your evenings or weekends to test new ideas, build a product, or offer a service enhanced by AI. Platforms like ChatGPT, Canva AI, Midjourney, Jasper, and countless others make it easy for beginners to get started without a steep learning curve. Every small success builds not just income, but confidence, and before you know it, your AI side venture can grow into something substantialโall while you continue succeeding in your main career.
Future-Proofing Your Skills: Why Starting Now Matters
Thereโs another layer to this story that is even more critical: the skills you develop by experimenting with AI today will become the professional superpowers of tomorrow. Businesses are increasingly seeking employees who are AI-literate, and those who can demonstrate practical experience with these tools will stand out in any field. By engaging with AI now, youโre not just making extra moneyโyouโre investing in your future employability and career growth. Imagine being the person in your company who can automate tedious reports, create smart marketing strategies, or produce creative materials faster and better. These skills make you indispensable, and they open doors to promotions, leadership opportunities, and even more entrepreneurial ventures down the line.
Conclusion: A New Era of Possibility
The idea that you have to choose between the security of your job and the thrill of entrepreneurship is outdated. Thanks to AI, you can do both. You can make money, expand your skills, and even discover passions you didnโt know you hadโall without giving up the stability you’ve worked so hard to build. The AI era is not just for the tech-savvy; itโs for anyone willing to explore, experiment, and embrace change. The sooner you start weaving AI into your life, the sooner youโll realize that the future isnโt just comingโitโs already here, and itโs full of possibility.
In a world flooded with data, how we interpret and communicate that data has never been more crucial. Data visualization has emerged as a vital bridge between raw information and actionable insights. But thereโs an ongoing conversation among practitioners and enthusiasts: is data visualization more of an art or a science?
The answer isnโt straightforwardโbecause data visualization is beautifully both.
What is Data Visualization?
At its core, data visualization is the graphical representation of information and data. Using elements like charts, graphs, maps, and infographics, it allows us to understand trends, patterns, and outliers in complex datasets.
Well-designed visualizations make data accessible. They allow businesses to make strategic decisions, researchers to share findings, and the general public to grasp information quickly and intuitively.
The Scientific Side: Data Visualization as Science
Those who see data visualization as a science focus on precision, structure, and integrity. In this camp, visualization is about:
Accuracy: Representing data truthfully without distortion.
Cognitive Load Reduction: Using design to aid, not hinder, comprehension.
Standardization: Leveraging best practices, such as Edward Tufteโs principles or the use of proven chart types like bar graphs and scatter plots.
In this approach, visualization is about function. The scientist values clean lines, logical hierarchies, and clarity. A line chart that helps a policymaker spot a declining trend in public health data is a successful outcomeโno need for bells and whistles.
The Artistic Side: Data Visualization as Art
Then there are those who view data visualization as an art formโan opportunity to communicate information in an evocative and emotional way. For these creators, the visualization isnโt just about clarity but about:
Creativity: Breaking free from rigid templates to design unique visual experiences.
Emotion: Making the audience feel something about the data, not just understand it.
Storytelling: Weaving narratives that guide viewers through the data.
Aesthetics: Using color theory, composition, typography, and style to create beauty.
Artists might design visualizations that resemble abstract paintings or interactive experiences that invite exploration. These visuals often push the boundaries of what charts can do, combining artistic intuition with data integrity.
Where Art and Science Meet
The most effective data visualizations often live at the intersection of art and science. They:
Balance beauty with function
Tell a story without distorting truth
Evoke curiosity while remaining grounded in facts
For instance, Florence Nightingale's 19th-century rose diagram wasn't just a statistical tool; it was a persuasive visual statement that changed public health policy. Similarly, modern visual storytellers like Giorgia Lupi combine data, illustration, and emotion to create deeply human experiences.
Why Data Visualization Matters Today
In the age of big data, the ability to extract meaning from complexity is power. Data visualization allows us to:
Detect patterns hidden in thousands of rows
Make decisions faster with clear dashboards
Communicate results across teams and stakeholders
Educate and inform the public in impactful ways
Whether you’re a business analyst, journalist, policymaker, or designer, understanding how to visualize data is an essential skill.
Tools of the Trade
Today, numerous tools cater to both the artistic and scientific mind:
Scientific/Structured Tools: Tableau, Power BI, Excel, R, Python (Matplotlib, Seaborn)
Artistic/Customizable Tools: D3.js, Processing, Adobe Illustrator (for static visuals), and even Figma
These tools offer different levels of flexibility, interactivity, and creative control.
Conclusion: The Harmony of Art and Science
To see data visualization solely as a science is to risk losing its emotional impact. To view it only as an art form is to risk clarity and truth. But when you treat it as bothโa discipline that respects data while embracing creativityโyou unlock its full potential.
Data visualization is an art grounded in science. And in the hands of a skilled practitioner, it becomes a powerful languageโa way of speaking the truth with beauty.
Do you agree that art and science complement each other in data visualization, or do you lean toward one side over the other? Share your opinion with us in the comments.
In todayโs data-driven world, businesses thrive on the ability to make informed decisions backed by solid analytics. Power BI, Microsoftโs interactive data visualization and business intelligence tool, has revolutionized the way professionals present and analyze information. But the craft of Power BI reporting goes far beyond simply dragging charts onto a canvasโit is a strategic skill that blends user-centered design, data architecture, and storytelling to create meaningful insights.
Whether you're a beginner or a seasoned analyst, mastering the reporting lifecycle in Power BI enables you to turn raw data into actionable narratives. This guide explores the key stages, tools, and mindsets needed to deliver compelling Power BI reports from start to finish.
1. Start with the Userโs Needs and Context
Effective Power BI reporting starts not with dataโbut with people. Understanding who the users are, what decisions they need to make, and how they interpret data lays the foundation for every design choice to follow.
This means engaging stakeholders early, asking the right questions:
What are their roles and responsibilities?
What key metrics or KPIs matter most to them?
How often will they use the report, and on what devices?
By empathizing with your audience, you begin shaping a solution that fits seamlessly into their workflow. Use user personas and scenario mapping to visualize needs and define success. This user-centered mindset prevents the all-too-common pitfall of creating reports that look greatโbut go unused.
2. Evaluate If Power BI Is the Right Fit
Before jumping into development, evaluate if Power BI is the best-fit platform for your objectives. It excels in specific use cases: interactive dashboards, real-time monitoring, and integrated analysis of multiple data sources. But for static print-style reporting or large-scale financial statements, other tools might be better suited.
Assess the technical environment as well:
Do you have access to reliable data sources (SQL, Excel, SharePoint, etc.)?
Is your organization equipped with Power BI Pro or Premium licenses?
Can Power BI connect securely to cloud or on-premises systems?
This is the stage for feasibility checks, data source exploration, and basic proof-of-concept mockups. By confirming Power BIโs viability early, you save time and align stakeholder expectations realistically.
3. Design an Effective Data and Layout Framework
Information architecture (IA) defines how data is structured and how users navigate it. In Power BI, this means designing datasets, data models, and page layouts that support clarity and coherence.
Start by identifying the reportโs data domainsโsales, inventory, customer feedback, etc.โand how they relate. Normalize tables, set up relationships, and remove redundancy. Use star schema modeling for optimal performance and usability.
Then outline your reportโs navigation structure. Will it be a single page with filters or a multi-page report with tabs? Use intuitive naming conventions and group visuals logically to guide users through a data-driven story.
The goal: eliminate confusion, reduce cognitive load, and make every click feel natural.
4. Create a Preliminary Layout Plan
Think of this step as wireframing for data. Using simple toolsโpen and paper, PowerPoint, or low-fidelity mockup softwareโroughly sketch the layout of your report.
Decide:
Where filters will be placed
How many visuals per page
What types of visuals (bar charts, cards, tables)
Placement of KPIs, slicers, tooltips, etc.
This phase is fast, disposable, and iterative. Share your sketches with stakeholders to validate your assumptions. Early feedback at this stage prevents costly redesigns later.
Low-fidelity mockups emphasize structure, not aesthetics. Focus on hierarchy, flow, and storytellingโnot colors or font sizes just yet.
5. Develop a Detailed Interactive Prototype
With structure in place, now refine the visual experience. This high-fidelity phase brings your sketch to life using real or sample data inside Power BI Desktop.
Fine-tune:
Chart types and formatting
Colors and themes (use corporate branding)
Spacing, alignment, and consistency
Interactive elements like bookmarks, buttons, and drill-throughs
Accessibility also becomes keyโuse sufficient contrast, label charts clearly, and enable keyboard navigation where needed. Apply DAX measures for calculated KPIs and test slicer interactions.
This prototype functions like a real report. Share it widely for usability testing and stakeholder review. Encourage feedback to catch blind spots and fine-tune content relevance.
6. Build and Deploy the Final Report
Once the prototype is approved, itโs time to finalize your build. This phase includes:
Connecting to live data sources
Automating data refresh schedules
Testing performance (load time, filter responsiveness)
Setting up row-level security (RLS) if needed
Publishing the report to Power BI Service
Youโll also configure dashboards, alerts, and app workspaces to ensure proper sharing and collaboration. Be sure to document the logic behind your DAX calculations, report structure, and user instructions.
A polished Power BI report should be fast, responsive, and self-explanatoryโreducing the need for handholding.
7. Maintain, Monitor, and Improve Continuously
Your job doesnโt end with delivery. Great Power BI reports evolve with user needs and business changes. Implement a stewardship model to ensure ongoing value.
This includes:
Monitoring usage metrics to track engagement
Gathering periodic feedback for improvements
Updating visuals or logic as KPIs evolve
Performing regular data quality checks
Managing access and security over time
Also, create version control mechanisms for tracking report changes. Educate users on new features (e.g., Q&A, new filters) through internal documentation or mini training sessions.
Report stewardship transforms Power BI from a one-time project into a sustainable business asset.
Conclusion
The skill of Power BI reporting is a blend of analysis, design, architecture, and empathy. It’s not just about chartsโit’s about communicating meaning.
By following a thoughtful, user-centered processโfrom understanding needs and validating structure to refining visuals and managing reports over timeโyou create data experiences that drive action and insight.
Power BI isnโt just a tool. In skilled hands, it becomes a canvas for organizational intelligence.
Becoming a professional data analyst isnโt just about mastering software or memorizing formulas. Itโs about thinking critically, asking the right questions, and understanding the story behind the data. If you can confidently answer the following questions โ not just theoretically, but in practical scenarios โ youโre well on your way to becoming a data analysis pro.
1. What problem am I trying to solve?
Before you even open Excel, SQL, or Python, ask yourself: What business question am I answering?
Whether itโs identifying customer churn, optimizing sales, or forecasting trends โ a true analyst knows the “why” behind the analysis.
2. Where is my data coming from, and can I trust it?
Great analysts know: bad data = bad decisions.
Can you:
Identify your data sources?
Validate their accuracy?
Handle missing or inconsistent values?
Tools like SQL, Excel, and Pythonโs pandas help, but itโs your analytical mindset that makes the difference.
3. Which data is relevant to the problem?
With mountains of data available, the pros know how to filter the noise.
Ask:
What variables are most important?
Which metrics directly affect the outcome?
Can I eliminate any irrelevant data?
This step is all about focus and efficiency.
4. How should I clean and prepare my data?
Data rarely comes neat and tidy. Cleaning is the unglamorous but essential part of the process.
Do you know how to:
Handle nulls?
Standardize formats?
Remove duplicates?
Normalize or transform values?
Mastering data wrangling in Python, R, or Power Query is a key skill of a pro analyst.
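As a concrete illustration of those wrangling steps, here is a minimal pandas sketch; the sample columns and the median fill strategy are only examples, not a prescription:

```python
import pandas as pd
import numpy as np

# Small messy sample: nulls, inconsistent casing/whitespace, and a duplicate row.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "signup_date": ["2024-01-05", "2024-01-06", "2024-01-06", None],
    "plan": ["Basic", "basic ", "basic ", "PRO"],
    "monthly_spend": [20.0, np.nan, np.nan, 99.0],
})

df = df.drop_duplicates()                                # remove exact duplicate rows
df["plan"] = df["plan"].str.strip().str.lower()          # standardize text formats
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")          # parse dates, keep NaT for missing
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())  # fill nulls with the median
print(df)
```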
5. What are the right tools and techniques to use?
A good analyst chooses tools based on the problem โ not just preference.
Can you:
Choose between Excel, SQL, Python, or Tableau depending on the task?
Use statistical models or machine learning when needed?
Automate repetitive tasks using scripts or workflows?
Efficiency + precision = professional.
6. What story does the data tell?
Data without a story is just numbers.
Great analysts turn raw data into insights by:
Identifying patterns and trends
Building logical narratives
Using visualizations to make findings clear and compelling
Ask yourself: If I showed this to a non-technical audience, would they get it?
7. How do I communicate my insights clearly?
Data analysis doesnโt end at insights โ it ends at impact.
Can you:
Build a compelling dashboard or report?
Present insights to stakeholders?
Recommend actions backed by data?
Soft skills + storytelling = top-tier analyst.
8. How do I measure the success of my analysis?
The pros reflect on their work. After your analysis:
Did it lead to better decisions?
Were your predictions accurate?
Did your recommendations drive results?
Ask yourself: What could I improve next time?
9. How can I keep learning and improving?
A professional analyst is always evolving.
Do you:
Follow data blogs and communities?
Practice with real-world datasets (like Kaggle or public APIs)?
Stay updated with new tools and techniques?
Curiosity is your greatest asset.
Final Thought
If you can confidently answer these questions โ and put them into action โ youโre not just crunching numbers. Youโre solving problems, telling stories, and driving value. And thatโs what makes a professional data analyst.
Breaking into a data science role at a leading company like Walmart requires not only a strong grasp of technical skills but also a deep understanding of probability and statistics. Probability plays a crucial role in decision-making, forecasting, and modeling โ all core to the work data scientists do at Walmart, especially in areas such as supply chain optimization, customer behavior analysis, and pricing strategies.
In this article, weโll walk you through 3 commonly asked probability questions in Walmart data scientist interviews, complete with detailed explanations and solutions to help you prepare with confidence.
Question 1: The Biased Coin Toss
Problem:
You have a biased coin that lands heads with a probability of 0.6 and tails with a probability of 0.4. You toss the coin three times. What is the probability that you get exactly two heads?
Solution:
This is a classic binomial probability problem.
Given:
Number of trials (n) = 3
Probability of success (head) p = 0.6
Probability of failure (tail) q = 0.4
We want exactly k = 2 heads.
Binomial Formula: P(X = k) = C(n, k) p^k q^(n-k)
P(X = 2) = C(3, 2) (0.6)^2 (0.4)^1 = 3 × 0.36 × 0.4 = 0.432
Final Answer: 0.432
Question 2: Conditional Probability โ Item Recommendation
Problem:
70% of customers who visit Walmart’s website buy at least one item. Among those who buy, 60% also leave a review. Among those who donโt buy, only 10% leave a review.
What is the probability that a customer who left a review actually bought an item?
Solution:
We are given conditional probabilities and need to find the inverse conditional probability โ i.e., using Bayesโ Theorem.
Let:
B = customer bought an item
R = customer left a review
We want P(B | R), the probability that a customer who left a review actually bought an item.
Given:
P(B) = 0.7
P(R | B) = 0.6
P(R | B') = 0.1
P(B') = 0.3
By Bayes' Theorem:
P(B | R) = P(R | B) P(B) / [P(R | B) P(B) + P(R | B') P(B')] = 0.42 / (0.42 + 0.03) ≈ 0.9333
Final Answer: ~93.33%
Question 3: Expected Value โ Inventory Demand
Problem:
A store manager at Walmart estimates that the daily demand for a product follows this probability distribution:
Units Demanded | Probability
0 | 0.1
1 | 0.2
2 | 0.4
3 | 0.2
4 | 0.1
What is the expected number of units demanded per day?
Solution:
The expected value (mean) of a discrete random variable is E[X] = Σ x · P(x).
E[X] = 0(0.1) + 1(0.2) + 2(0.4) + 3(0.2) + 4(0.1) = 0 + 0.2 + 0.8 + 0.6 + 0.4 = 2
Final Answer: 2 units per day
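If you want to sanity-check all three answers numerically, a few lines of Python using only the standard library will do it:

```python
from math import comb

# Question 1: binomial probability of exactly 2 heads in 3 tosses with p = 0.6
p_two_heads = comb(3, 2) * 0.6**2 * 0.4**1
print(round(p_two_heads, 3))            # 0.432

# Question 2: Bayes' theorem, P(bought | review)
p_b, p_r_given_b, p_r_given_not_b = 0.7, 0.6, 0.1
p_bought_given_review = (p_r_given_b * p_b) / (p_r_given_b * p_b + p_r_given_not_b * (1 - p_b))
print(round(p_bought_given_review, 4))  # 0.9333

# Question 3: expected daily demand for the distribution in the table above
expected_units = sum(x * p for x, p in zip([0, 1, 2, 3, 4], [0.1, 0.2, 0.4, 0.2, 0.1]))
print(expected_units)                   # 2.0
```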
Why These Questions Matter
Binomial problems assess understanding of discrete distributions, which is key for modeling user behaviors or purchase frequencies.
Bayesโ Theorem is foundational for recommendation systems, fraud detection, and inference under uncertainty.
Expected value is critical in inventory planning, forecasting, and cost modeling โ all important to Walmartโs operations.
Pro Tips for Walmart Data Science Interviews
Master the fundamentals: Focus on distributions, expectation, variance, conditional probability, and independence.
Practice real-life scenarios: Walmart loves practical applications. Relate your answers to business problems.
Explain your reasoning: Theyโre looking for clear thinkers. Walk through your assumptions and logic.
Final Thoughts
Cracking a data science interview at Walmart means demonstrating a deep, intuitive understanding of probability. These three questions give you a solid foundation to prepare and shine. Want to take your prep further? Practice variations, dive into Walmartโs business model, and explore case studies related to retail data.
In 2025, generating passive income is more accessible than everโif you master the right skills. Whether you’re working a full-time job or looking to build financial freedom, these 8 in-demand skills can set you up for steady, automated income streams. Here’s what to learn, tools to use, and how each skill turns into passive income.
1. Content Creation (Blogging & YouTube)
Passive Income Example:
Create a niche YouTube channel or blog that gets consistent views. Earn through AdSense, affiliate links, and digital product sales (e.g., eBooks or courses).
2. Print-on-Demand & Merch Design
Skill Type: Design & eCommerce
What to Learn:
Graphic design basics
Niche research
Setting up an online store (Etsy, Shopify)
Marketing with SEO and Pinterest
Tools:
Canva, Adobe Illustrator
Printful, Teespring, Redbubble
Shopify, Etsy, Everbee
Passive Income Example:
Design t-shirts, mugs, or stickers. Upload to POD platforms. Every sale generates revenue with no need to handle shipping or inventory.
3. Affiliate Marketing
Skill Type: Digital Marketing
What to Learn:
How affiliate programs work
Copywriting and persuasive content
SEO and social media marketing
Email list building
Tools:
Amazon Associates, ShareASale, Impact
ConvertKit, MailerLite (for email marketing)
Ahrefs, Ubersuggest (for keyword research)
Passive Income Example:
Create a niche website reviewing tech gadgets. Include affiliate links. Earn commissions every time someone buys through your link.
4. Investing in Dividend Stocks / ETFs
Skill Type: Finance & Investing
What to Learn:
Stock market basics
Understanding ETFs and dividend yields
Portfolio diversification
Risk management
Tools:
Robinhood, Fidelity, M1 Finance
Seeking Alpha, Yahoo Finance
Personal Capital (for tracking)
Passive Income Example:
Build a diversified dividend portfolio. Earn quarterly or monthly dividends that grow over time without active involvement.
5. Writing & Selling eBooks
Skill Type: Writing & Publishing
What to Learn:
Writing structure and formatting
Self-publishing on Kindle Direct Publishing (KDP)
Marketing your eBook on Amazon and social media
Tools:
Scrivener, Google Docs
Amazon KDP, Gumroad
Canva (for covers), Bookbolt
Passive Income Example:
Write a how-to guide or a novel. Publish it on KDP. Earn royalties every time someone downloads or buys your book.
6. Online Course Creation
Skill Type: Teaching & Product Development
What to Learn:
Curriculum planning
Video and screen recording
Engaging teaching methods
Marketing funnels
Tools:
Teachable, Thinkific, Gumroad
Loom, OBS Studio (for recording)
ChatGPT (to help generate course outlines)
Passive Income Example:
Create a course on productivity or design. Sell on your own site or platforms like Udemy. Students pay once, and you keep earning.
7. App or Web Development (SaaS Projects)
Skill Type: Technical & Programming
What to Learn:
Full-stack development (HTML, CSS, JavaScript, Python, React)
UX/UI design
Database management
How SaaS (Software as a Service) works
Tools:
VS Code, GitHub, Firebase
Stripe (for payments), Notion (for planning)
Framer, Figma (for design)
Passive Income Example:
Build a simple productivity app or business tool. Charge a monthly fee. Users sign up and pay recurring subscriptions.
8. Stock Photography / Digital Assets Selling
Skill Type: Photography & Digital Design
What to Learn:
Photography or digital design fundamentals
How to create high-demand digital products
Licensing and copyright
Tools:
Lightroom, Photoshop, Canva
Shutterstock, Adobe Stock, Creative Market
Etsy (for selling templates, icons, etc.)
Passive Income Example:
Upload photos, templates, or icons to stock platforms. Every download or license purchase earns you money.
Final Thoughts
Learning these skills doesnโt mean overnight richesโbut investing time in one or two can build steady passive income over time. The key is consistency, quality, and automation. Focus on creating assets that work for you, even while you sleep.
In an age where speed, access, and accuracy drive competitive advantage, relying on dusty file cabinets and analog records is like racing in a horse-drawn carriage on the autobahn. Digitizing business archives isnโt just about going paperlessโitโs about unlocking your dataโs potential, minimizing operational friction, and empowering your teams to move with confidence. Youโve probably felt the pain of hunting down a misplaced contract or trying to cross-reference data that lives in ten different formats.
Choose the Right Scanning Workflow
Thereโs no one-size-fits-all when it comes to scanning your physical archives. High-speed document scanners are great for standard papers, but fragile items or oversized documents may require flatbed or specialty scanners. Decide early whether youโll handle scanning in-house or outsource to a third-party digitization serviceโeach has its own cost, timeline, and quality control implications. Creating a clear scanning workflow ensures that the process runs smoothly, with attention to metadata tagging, file naming conventions, and storage destinations from the start.
Safeguard Sensitive Data in the Digital Shift
As you digitize your archives, protecting sensitive information becomes just as important as preserving it. From employee records to client contracts, some documents carry high stakes if leaked or mishandled, making data protection a non-negotiable part of your strategy. Encryption, secure user authentication, and audit trails should be built into your digital infrastructure from the start to prevent breaches and misuse.
Select Smart Storage Solutions
Once your files are digitized, the next move is choosing how and where to store them so they’re both secure and easily retrievable. Cloud platforms offer scalability, remote access, and data redundancy, making them a strong choice for most organizations, especially those with distributed teams. But not all files belong in the cloudโsensitive data may require local or hybrid solutions that comply with industry regulations.
Implement Metadata and Indexing Standards
Digitization without metadata is like a library with no catalog systemโit may all be there, but good luck finding what you need. When you add structured metadata during the digitization process, you create pathways for quick search, categorization, and data linkage. This is especially useful when working across departments or time zones, where different teams might need to access the same file for different purposes.
Plan for Long-Term Data Migration
Technology moves fast, and digital archives that live in yesterday's formats are tomorrow's headaches. Make sure your digitization strategy includes a plan for regular data migrations so you're not left scrambling when software becomes obsolete. Whether you're storing files in proprietary systems or open formats, it's smart to future-proof your files by choosing widely supported, non-proprietary formats like PDF/A, CSV, or XML. Stay ahead by scheduling periodic reviews of your storage solutions and making updates before they become urgent.
Train Your Team on the New System
No matter how elegant your digitized archive is, itโs useless if your team doesnโt know how to use it. Conduct hands-on training to familiarize everyone with the new systems, file structures, search tools, and permissions protocols. Encourage a feedback loop so users can flag hiccups or suggest improvements that make everyday usage smoother. Turning archived files into active tools requires buy-in and competence across your workforceโnot just from IT or leadership.
Integrate Archives With Existing Platforms
One major benefit of digitizing your business archives is the chance to connect them to tools you already use. Whether itโs your CRM, ERP, or project management software, linking archives to these systems can create seamless workflows and reduce redundant data entry. Integration allows your teams to pull up relevant documents in real-timeโright when theyโre working on a taskโinstead of toggling between platforms or wasting time searching. This helps turn archival data into a living resource that supports daily decision-making.
Transforming your business archives from physical clutter into digital gold takes effort, but the payoff is real. Once-scattered records become strategic assets when they're accessible, secure, and woven into your daily workflows. You don't just save time: you gain clarity, accountability, and a better handle on the full history of your organization's decisions and actions. Digitization gives you the chance to treat your data like the powerful resource it is, not just a pile of paper taking up space in a storage room.
Unlock the power of data with Data World, your go-to source for innovative business solutions and educational services in data science!
The role of a Lead Data Engineer has gained significant prominence in todayโs data-driven world, as businesses increasingly rely on data analytics and machine learning to drive decision-making. This career path is ideal for professionals with strong technical expertise in data architecture, engineering, and management, coupled with leadership skills to guide teams and projects effectively. If you are considering a career as a Lead Data Engineer, understanding the responsibilities, required skills, educational background, and potential career trajectory is essential for success in this field.
Understanding the Role of a Lead Data Engineer
A Lead Data Engineer is responsible for designing, developing, and maintaining data architectures that enable seamless data processing and analytics. This role involves overseeing data pipelines, managing data storage solutions, and ensuring data quality, security, and compliance. Unlike junior or mid-level data engineers, a lead data engineer takes on a more strategic role by leading teams, coordinating cross-functional collaboration, and aligning data infrastructure with business goals. They work closely with data scientists, analysts, and software engineers to build scalable and efficient data solutions that drive insights and innovation.
Essential Skills and Technologies
To thrive as a Lead Data Engineer, professionals must master a combination of technical and soft skills. Technical expertise in programming languages such as Python, Java, and Scala is crucial for developing and maintaining data pipelines. Proficiency in SQL and NoSQL databases, such as PostgreSQL, MongoDB, and Cassandra, is essential for effective data storage and retrieval. Additionally, familiarity with big data technologies like Apache Spark, Hadoop, and Kafka is necessary for handling large-scale data processing.
Cloud computing skills are increasingly important as organizations migrate to cloud-based solutions. A Lead Data Engineer should be well-versed in cloud platforms such as AWS, Azure, and Google Cloud, leveraging services like Amazon Redshift, Google BigQuery, and Azure Synapse Analytics for data warehousing and processing. Experience with data modeling, ETL (Extract, Transform, Load) processes, and data pipeline orchestration using tools like Apache Airflow or Prefect further enhances a professionalโs ability to manage data workflows efficiently.
Beyond technical skills, leadership and communication abilities are vital for this role. A Lead Data Engineer must collaborate with stakeholders across different departments, translating business requirements into technical solutions. Strong problem-solving skills and an analytical mindset enable them to anticipate challenges, optimize data workflows, and implement best practices in data governance and security.
Educational Background and Certifications
A career as a Lead Data Engineer typically begins with a strong educational foundation in computer science, information technology, data science, or a related field. A bachelorโs degree is often the minimum requirement, though many professionals advance their careers by obtaining a masterโs degree in data engineering, data science, or software engineering.
In addition to formal education, industry-recognized certifications can help professionals validate their expertise and stay competitive in the job market. Certifications such as Google Cloud Professional Data Engineer, AWS Certified Data Analytics โ Specialty, Microsoft Certified: Azure Data Engineer Associate, and the Cloudera Certified Data Engineer credential demonstrate proficiency in cloud computing and data engineering best practices.
Career Path and Growth Opportunities
The journey to becoming a Lead Data Engineer often starts with entry-level positions such as Data Engineer, Database Administrator, or Software Engineer. As professionals gain experience in designing data pipelines, working with big data frameworks, and managing data infrastructures, they progress to senior data engineering roles before advancing into leadership positions.
Once established as a Lead Data Engineer, career growth opportunities extend into higher managerial roles such as Data Engineering Manager, Director of Data Engineering, or even Chief Data Officer (CDO). These roles involve greater responsibilities in shaping an organizationโs data strategy, implementing enterprise-wide data initiatives, and driving innovation through data-driven decision-making.
Conclusion
A career as a Lead Data Engineer offers a rewarding and dynamic path for professionals passionate about data management, architecture, and leadership. By developing technical expertise, acquiring industry certifications, and honing leadership skills, aspiring data engineers can successfully navigate this career trajectory and make a significant impact in the ever-evolving field of data engineering. Whether working for tech giants, financial institutions, healthcare providers, or startups, Lead Data Engineers play a pivotal role in enabling organizations to harness the power of data for strategic advantage.
In todayโs fast-paced digital landscape, businesses generate vast amounts of data daily. However, raw data alone holds little value unless it is effectively analyzed and transformed into actionable insights. Organizations that master this process gain a competitive edge by making informed decisions that drive growth and efficiency. Hereโs how to translate data into actionable business insights.
1. Define Clear Objectives
Before analyzing data, businesses must establish clear objectives. Without a defined goal, data analysis can be unfocused and ineffective. Consider the following steps:
Identify the key challenges or opportunities your business faces.
Determine the specific metrics that align with your goals.
Ensure all stakeholders understand the objectives to maintain consistency.
2. Collect Relevant Data
Data collection should be strategic and focused on quality rather than quantity. Organizations must:
Utilize structured and unstructured data sources such as sales records, customer feedback, and market trends.
Implement tools like CRM systems, Google Analytics, or business intelligence platforms to gather accurate data.
Ensure data is cleaned and validated to remove inconsistencies and errors.
3. Analyze the Data Effectively
Data analysis is crucial in identifying patterns and correlations that inform business decisions. Effective methods include:
Using statistical analysis to uncover trends and anomalies.
Applying machine learning and artificial intelligence for predictive analytics.
Employing visualization tools such as dashboards and graphs to make complex data easier to interpret.
4. Identify Key Insights
Extracting actionable insights requires identifying the most significant data trends. Consider:
Correlating data findings with business objectives.
Recognizing customer behavior patterns and preferences.
Pinpointing inefficiencies and opportunities for optimization.
5. Transform Insights into Action
Data-driven insights must be translated into tangible business strategies. This involves:
Implementing changes based on findings, such as adjusting marketing strategies or optimizing supply chain operations.
Encouraging a data-driven culture where decisions are backed by analytical evidence.
Continuously monitoring and refining actions based on real-time feedback.
6. Measure Impact and Refine Strategies
The effectiveness of data-driven actions must be regularly assessed. Businesses should:
Set key performance indicators (KPIs) to track progress.
Use A/B testing to evaluate the impact of implemented strategies (see the sketch after this list).
Iterate and adjust strategies based on performance results to ensure continuous improvement.
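To make the A/B-testing step concrete, here is a minimal sketch using a chi-square test from SciPy; the visitor counts and the 5% significance threshold are purely illustrative:

```python
from scipy.stats import chi2_contingency

# Hypothetical A/B test: conversions vs. non-conversions for two campaign variants.
#            converted  not_converted
variant_a = [120,        1880]   # 2,000 visitors, 6.0% conversion
variant_b = [155,        1845]   # 2,000 visitors, 7.75% conversion

chi2, p_value, dof, expected = chi2_contingency([variant_a, variant_b])
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level.")
else:
    print("No significant difference detected; keep testing or collect more data.")
```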
Conclusion
Translating data into actionable business insights is a structured process that requires clear objectives, quality data collection, robust analysis, and strategic implementation. By leveraging technology and fostering a data-driven culture, businesses can enhance decision-making, optimize operations, and stay ahead in competitive markets. In a world where data is abundant, the real advantage lies in how effectively it is used to drive meaningful business outcomes.
Data Science has evolved into one of the most sought-after careers in the tech industry, driven by advancements in artificial intelligence, machine learning, and big data analytics. As we step into 2025, the demand for skilled data scientists continues to grow across various industries, from healthcare to finance and e-commerce. This roadmap is designed to provide a structured approach to mastering data science, covering fundamental concepts, essential tools, and real-world applications.
1. Understanding the Basics of Data Science
Before diving into complex algorithms and big data processing, it is crucial to understand the foundation of data science.
Definition and Scope: Data Science is the interdisciplinary field that combines statistics, programming, and domain expertise to extract insights from data. For example, in healthcare, predictive models analyze patient data to forecast disease outbreaks and personalize treatment plans.
Mathematics & Statistics: Concepts such as probability, linear algebra, and statistical inference are the backbone of data science. A strong grasp of these topics enables data scientists to develop models that provide actionable insights, such as predicting customer churn in a subscription service.
2. Programming Languages for Data Science
Programming is a fundamental skill in data science, with Python and R being the most popular choices.
Python: Widely used due to its versatility and extensive libraries such as NumPy, Pandas, and Scikit-learn. For instance, Netflix uses Python to analyze user viewing patterns and recommend content.
R: Preferred in academia and research for statistical analysis and visualization, with applications in pharmaceutical companies for clinical trials and drug efficacy studies.
3. Data Collection and Cleaning
Data is often messy and unstructured, making data cleaning a vital step in the data science workflow.
Data Collection: Sourcing data from APIs, web scraping, or databases like SQL. For example, e-commerce platforms collect user purchase history to understand buying trends.
Data Cleaning: Handling missing values, removing duplicates, and standardizing formats using libraries like Pandas. Poor data quality in financial analytics can lead to inaccurate risk assessments, affecting investment decisions.
4. Exploratory Data Analysis (EDA)
EDA is the process of analyzing data sets to summarize their main characteristics and discover patterns.
Data Visualization: Using Matplotlib and Seaborn to create charts and graphs. For instance, sales teams use bar charts to identify seasonal trends in product demand.
Statistical Analysis: Identifying correlations and distributions. In sports analytics, teams analyze player performance data to refine strategies and optimize team selection.
5. Machine Learning Fundamentals
Machine learning allows computers to learn patterns from data and make predictions without being explicitly programmed.
Supervised Learning: Training models using labeled data. A bank may use classification models to detect fraudulent transactions (a minimal example follows this list).
Unsupervised Learning: Clustering and association techniques to find hidden patterns, such as customer segmentation in marketing campaigns.
Deep Learning: Neural networks that power AI applications like image recognition in self-driving cars.
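As a minimal illustration of the supervised-learning point above, the sketch below trains a simple classifier on synthetic data standing in for labeled transactions; it is a toy example, not a production fraud model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled transaction data (class 1 is the rare "fraud" class).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # simple supervised classifier
model.fit(X_train, y_train)                 # learn from labeled examples
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```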
6. Big Data Technologies
With the exponential growth of data, big data technologies are essential for efficient processing and analysis.
Hadoop & Spark: Distributed computing frameworks for handling massive datasets. Social media companies process user interactions using Spark to recommend personalized content.
NoSQL Databases: MongoDB and Cassandra for handling unstructured data in real-time applications, such as ride-sharing apps tracking driver and passenger locations.
7. Model Deployment and MLOps
Deploying models into production ensures they provide value in real-world applications.
Flask & FastAPI: Creating APIs for machine learning models. A healthcare provider may deploy a patient risk assessment model via an API to integrate it into hospital management systems.
MLOps: Automating ML pipelines using CI/CD tools. For instance, companies like Spotify continuously update their recommendation engines based on user listening habits.
8. Ethics and Bias in Data Science
Data science has ethical implications, and addressing biases is critical to ensuring fairness and accuracy.
Bias in AI Models: AI models trained on biased data can produce discriminatory results. For example, biased hiring algorithms may favor certain demographics over others.
Data Privacy: Adhering to regulations like GDPR and CCPA to protect user data, as seen in tech companies implementing stricter data-sharing policies.
Conclusion
The journey to becoming a proficient data scientist in 2025 requires a strong foundation in mathematics, programming, machine learning, and big data technologies. By following this roadmap, aspiring data scientists can build the necessary skills to solve real-world problems across various industries. With continuous learning and hands-on practice, mastering data science is an achievable goal.
Creating a professional dashboard requires careful planning, an understanding of user needs, and the application of design principles to ensure clarity and usability.
Step 1: Defining the purpose and audience:
Before diving into design or development, it’s essential to identify what the dashboard aims to achieve. Whether it’s for business analytics, financial tracking, or project management, understanding the end-user’s needs and the key metrics they will rely on ensures that the dashboard delivers relevant and actionable insights. This step often involves gathering requirements from stakeholders, analyzing existing workflows, and determining which data points are most critical to decision-making. Without this foundational understanding, the dashboard risks being cluttered, ineffective, or overwhelming.
Step 2: Data collection and integration:
A dashboard is only as useful as the quality and accuracy of the data it presents. At this stage, data sources must be identified and connected. These sources can include databases, APIs, spreadsheets, or third-party services. Ensuring data consistency and reliability is crucial, as any errors or inconsistencies can mislead users and negatively impact decision-making. Data transformation and cleaning processes may be necessary to standardize formats and remove inconsistencies. Moreover, real-time or scheduled data updates must be considered based on the dashboardโs intended use. For dashboards requiring live data, establishing secure and efficient connections with data sources is essential to ensure smooth operation and performance.
Step 3: Designing the dashboard layout and user interface:
Effective dashboard design prioritizes clarity, ease of use, and visual hierarchy. This means organizing data in a way that allows users to quickly understand and interpret information without unnecessary distractions. The use of charts, graphs, tables, and key performance indicators (KPIs) should be carefully planned to enhance readability. Selecting the right type of visualization for different data sets is critical; for instance, line charts work well for trends over time, while pie charts are more suited for proportional comparisons. Additionally, applying a consistent color scheme, typography, and spacing improves the overall aesthetic and usability of the dashboard. Interactive elements such as filters, drill-down capabilities, and tooltips can be incorporated to provide users with more control over how they view and analyze data.
Step 4: Development and implementation:
Depending on the complexity of the dashboard, development can involve using business intelligence (BI) tools such as Tableau, Power BI, or Google Data Studio, or custom development using programming languages like JavaScript with libraries such as D3.js or React.js. The choice of technology depends on factors such as scalability, customization needs, and integration capabilities. During the development phase, it’s crucial to ensure that the dashboard is responsive, meaning it functions well on different screen sizes, including desktops, tablets, and mobile devices. User authentication and role-based access control may also be necessary to restrict sensitive data to authorized users only.
Step 5: Testing, feedback, and iteration:
This step is vital for ensuring the dashboard meets user expectations and performs efficiently. Testing should include both technical and usability aspects. Performance testing ensures that the dashboard loads data quickly and functions smoothly, even with large datasets. Usability testing involves gathering feedback from actual users to identify any issues with navigation, readability, or overall experience. Based on feedback, necessary refinements should be made to improve functionality and user satisfaction. Continuous monitoring and updates should be planned to keep the dashboard relevant as business needs and data sources evolve.
By following these stepsโdefining purpose and audience, collecting and integrating data, designing an intuitive interface, implementing with the right tools, and continuously improving through testing and feedbackโa professional dashboard can provide valuable insights and enhance decision-making processes across various industries.
The introduction of ChatGPT has transformed the way many professionals approach their work, and data science is no exception. As a data scientist, my daily tasks, workflows, and problem-solving strategies have significantly evolved since integrating ChatGPT into my routine. Hereโs how.
1. Streamlined Data Cleaning and Preprocessing:
Data cleaning, once a time-consuming process, has become much more efficient. With ChatGPTโs ability to generate code snippets in Python, R, or SQL, I can quickly tackle issues like handling missing values, encoding categorical variables, or normalizing data. Instead of searching through endless documentation, I now receive instant suggestions tailored to my specific dataset challenges.
2. Faster Prototyping and Experimentation:
When testing new machine learning models, speed matters. ChatGPT helps by providing boilerplate code for various algorithms, suggesting hyperparameter tuning techniques, and explaining the pros and cons of each model. This acceleration allows me to spend more time interpreting results rather than building experiments from scratch.
3. Enhanced Collaboration and Communication:
Explaining complex data science concepts to non-technical stakeholders has always been challenging. ChatGPT assists in translating technical jargon into simple language. Whether preparing reports, presentations, or dashboards, I now craft narratives that resonate with diverse audiences, making data-driven decisions easier to communicate.
4. Improved Documentation and Code Quality:
Good documentation is essential but often overlooked. ChatGPT helps generate comprehensive docstrings, comments, and README files. This ensures that my codebase remains understandable and maintainable, especially when collaborating with larger teams.
5. Rapid Troubleshooting and Debugging:
Debugging code used to be a time sink. Now, I describe errors to ChatGPT and receive potential solutions instantly. It also offers best practices for optimizing performance, ensuring my models run efficiently without extensive trial and error.
6. Continuous Learning and Skill Development:
Data science is an ever-evolving field. ChatGPT acts as a personalized tutor, explaining new algorithms, statistical concepts, or advanced machine learning techniques on demand. This constant learning support helps me stay ahead of industry trends without sifting through countless resources.
7. Ethical Considerations and Bias Detection:
AI ethics is more important than ever. ChatGPT highlights potential biases in datasets and suggests mitigation strategies. This has made me more conscious of fairness, accountability, and transparency in my projects.
Conclusion
ChatGPT has become an indispensable part of my data science toolkit. From boosting productivity and code quality to enhancing communication and ethical awareness, its impact is undeniable. While it doesnโt replace human expertise, it amplifies our capabilities, enabling data scientists like me to focus on what truly matters: deriving meaningful insights and driving informed decisions.
The Google UX Design Certificate is a comprehensive, fully online program designed to equip learners with the essential skills required for entry-level positions in user experience (UX) design. As of 2025, this certificate remains a valuable resource for individuals aiming to enter the UX field, regardless of their prior experience.
Program Overview
Hosted on the Coursera platform, the Google UX Design Professional Certificate encompasses seven courses that cover a wide array of UX design topics. The curriculum is structured to provide both theoretical knowledge and practical application, ensuring that learners can develop job-ready skills. Key areas of focus include:
User-Centered Design: Understanding and applying design principles that prioritize the needs and experiences of users.
UX Research: Learning methodologies for planning and conducting research studies, including user interviews and usability testing.
Wireframing and Prototyping: Gaining proficiency in creating wireframes and interactive prototypes using industry-standard tools like Figma and Adobe XD.
Usability Testing: Developing skills to test designs with users, gather feedback, and iterate on solutions to enhance usability.
Responsive Web Design: Designing applications and websites that function seamlessly across various devices and screen sizes.
Upon completion, learners will have developed a professional portfolio featuring three end-to-end projects: a mobile app, a responsive website, and a cross-platform experience. This portfolio serves as a tangible demonstration of the skills acquired throughout the program.
Time Commitment and Cost
The program is designed to be flexible, allowing learners to progress at their own pace. On average, it is structured to be completed in approximately six months, with an estimated commitment of 10 hours per week. The cost is based on a monthly subscription model, priced at $49 per month. Therefore, the total investment for the program typically ranges between $234 and $300, depending on the time taken to complete the coursework.
Career Support and Opportunities
Graduates of the Google UX Design Certificate program gain access to a variety of career resources. These include resume-building assistance, interview preparation guidance, and exclusive access to a job board through the Google Career Certificates Employer Consortium. This consortium comprises numerous employers interested in hiring individuals with demonstrated UX design competencies.
The demand for UX designers continues to grow, with over 63,000 open jobs in the field and a median entry-level salary of $115,000 as of 2025. This underscores the potential return on investment for individuals who successfully complete the program and pursue a career in UX design.
Student Experiences and Considerations
Feedback from program participants highlights several strengths of the Google UX Design Certificate:
Comprehensive Curriculum: Learners appreciate the thorough coverage of fundamental UX concepts and practical applications.
Flexibility: The self-paced nature of the program allows individuals to balance their studies with other commitments.
Portfolio Development: The inclusion of real-world projects enables learners to build a professional portfolio, which is crucial for job applications.
However, some learners have noted areas for improvement:
Peer Feedback: While peer reviews are part of the learning process, some students feel the need for more structured mentorship and professional critique to enhance their learning experience.
Career Support: Although resources are provided, a more robust, structured career support system could further assist graduates in transitioning to the workforce.
Conclusion
In 2025, the Google UX Design Certificate stands as a valuable and accessible pathway for individuals aspiring to enter the UX design profession. Its comprehensive curriculum, practical project work, and flexible online format make it a strong contender for those seeking to develop job-ready skills in a cost-effective manner. Prospective learners should consider their personal learning preferences and career objectives to determine if this program aligns with their professional aspirations.
Artificial Intelligence (AI) has revolutionized the field of programming, and Python has emerged as the leading language for AI development due to its simplicity and extensive libraries. In this article, we will explore five AI projects of increasing sophistication, providing a detailed narrative explanation for each, followed by step-by-step implementation details, libraries, and code snippets.
1. Sentiment Analysis (Beginner Level)
Sentiment analysis is a Natural Language Processing (NLP) technique used to determine the sentiment expressed in text data. It categorizes a given text into positive, negative, or neutral sentiments. This project is useful for analyzing customer reviews, social media feedback, and other text-based inputs.
Implementation Steps:
Preprocess text by tokenizing and normalizing input.
Use NLP techniques to analyze text sentiment.
Classify sentiment based on polarity scores.
Optimize accuracy using a trained dataset.
Libraries Required:
nltk (Natural Language Toolkit)
textblob
Code Implementation:
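A minimal sketch using TextBlob's polarity score is shown below; the 0.1 cut-offs for labeling sentiment are arbitrary choices for illustration:

```python
from textblob import TextBlob

reviews = [
    "The delivery was fast and the product works great!",
    "Terrible experience, the item arrived broken.",
    "It's okay, nothing special.",
]

for text in reviews:
    polarity = TextBlob(text).sentiment.polarity   # ranges from -1 (negative) to +1 (positive)
    if polarity > 0.1:
        label = "positive"
    elif polarity < -0.1:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:>8}  ({polarity:+.2f})  {text}")
```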
2. Image Recognition (Intermediate Level)
Image recognition is a core AI application used in facial recognition, self-driving cars, and medical imaging. The project utilizes Convolutional Neural Networks (CNNs) to classify images based on trained datasets.
Implementation Steps:
Load an image dataset.
Normalize images for better training results.
Build a CNN model to process and classify images.
Train and evaluate the model.
Libraries Required:
tensorflow
keras
numpy
matplotlib
Code Implementation:
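The sketch below is one compact way to set this up with Keras, using the built-in MNIST digits as a stand-in for whatever image dataset you work with; it trains for a single epoch just to keep the demo quick:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize the MNIST digit images (28x28 grayscale).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# A small CNN: two convolution/pooling stages followed by a dense classifier.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)  # one epoch, demo only
print("Test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```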
3. Chatbot (Intermediate Level)
A chatbot simulates human conversation using NLP. This project involves processing user queries and responding intelligently using pre-defined intents and a neural network-based text classifier.
Implementation Steps:
Define a dataset of user intents and responses.
Tokenize and preprocess text data.
Train a simple neural network to recognize user inputs.
Implement the chatbot to generate responses.
Libraries Required:
nltk
tensorflow
keras
json
pickle
Code Implementation:
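Here is a stripped-down sketch of the idea: the intents are defined inline rather than loaded from a JSON file, and simple whitespace splitting stands in for NLTK tokenization, so treat it as a starting point rather than a full chatbot:

```python
import random
import numpy as np
import tensorflow as tf

# Tiny illustrative intent set; a real project would load this from a JSON file.
intents = {
    "greeting": {"patterns": ["hello", "hi there", "good morning"],
                 "responses": ["Hello! How can I help you?", "Hi!"]},
    "hours":    {"patterns": ["when are you open", "opening hours", "what time do you close"],
                 "responses": ["We are open 9am to 6pm, Monday to Saturday."]},
    "goodbye":  {"patterns": ["bye", "see you later", "goodbye"],
                 "responses": ["Goodbye!", "See you soon."]},
}

# Build a vocabulary and a bag-of-words training set from the patterns.
labels = list(intents)
vocab = sorted({w for intent in intents.values() for p in intent["patterns"] for w in p.lower().split()})

def bag_of_words(sentence):
    words = sentence.lower().split()
    return np.array([1.0 if w in words else 0.0 for w in vocab])

X = np.array([bag_of_words(p) for intent in intents.values() for p in intent["patterns"]])
y = np.array([i for i, intent in enumerate(intents.values()) for _ in intent["patterns"]])

# Small dense classifier that maps a bag-of-words vector to an intent.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(len(vocab),)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(len(labels), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=200, verbose=0)

def reply(message):
    intent = labels[int(np.argmax(model.predict(bag_of_words(message)[None, :], verbose=0)))]
    return random.choice(intents[intent]["responses"])

print(reply("hi there"))
print(reply("what time do you close"))
```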
4. Object Detection (Advanced Level)
Object detection is a crucial AI application used in security, surveillance, and autonomous vehicles. The YOLO (You Only Look Once) model is a popular choice for real-time object detection.
5. Stock Price Prediction (Advanced Level)
Stock price prediction leverages deep learning, particularly Long Short-Term Memory (LSTM) networks, to forecast future stock prices based on historical data.
Implementation Steps:
Collect and preprocess historical stock data.
Normalize the data for training.
Train an LSTM model on sequential data.
Make predictions and visualize results.
Libraries Required:
pandas
numpy
tensorflow
matplotlib
Code Implementation:
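The sketch below illustrates the workflow with a synthetic random-walk price series standing in for real historical data (which you would normally pull from a market data API); the window size and model settings are illustrative only:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Synthetic daily closing prices stand in for downloaded historical data.
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 500))

# Scale to [0, 1] and build sliding windows: 30 past days -> next day's price.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices.reshape(-1, 1))
window = 30
X = np.array([scaled[i:i + window] for i in range(len(scaled) - window)])
y = scaled[window:]

split = int(len(X) * 0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)

# Predict, undo the scaling, and plot predictions against actual prices.
predicted = scaler.inverse_transform(model.predict(X_test, verbose=0))
actual = scaler.inverse_transform(y_test)
plt.plot(actual, label="actual")
plt.plot(predicted, label="predicted")
plt.legend()
plt.show()
```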
Conclusion
These five AI projects provide a solid foundation for AI development using Python. Beginners can start with sentiment analysis, while advanced users can explore object detection and stock price prediction. By implementing these projects step-by-step, you can gain hands-on experience with AI and deepen your understanding of machine learning techniques.
As technology evolves at an unprecedented pace, the job market adapts accordingly, creating new opportunities and redefining existing roles. The year 2025 is set to witness a surge in demand for tech professionals, with companies seeking candidates who possess cutting-edge skills to drive innovation and efficiency.
This essay explores the most in-demand tech jobs and skills for 2025, backed by real-world examples and key statistics.
Top In-Demand Tech Roles
1. Artificial Intelligence (AI) and Machine Learning (ML) Engineers
With AI transforming industries from healthcare to finance, AI and ML engineers are among the most sought-after professionals. According to a report by LinkedIn, AI-related job postings have increased by 74% year-over-year. Companies like Tesla and Google are heavily investing in AI, creating roles that require expertise in deep learning, natural language processing, and neural networks.
2. Cybersecurity Specialists
The rise in cyber threats has led to a growing need for cybersecurity experts. The global cybersecurity market is expected to reach $366 billion by 2028, as reported by Grand View Research. Organizations such as IBM and Microsoft are expanding their cybersecurity teams to combat increasingly sophisticated cyberattacks.
3. Cloud Computing Engineers
As businesses migrate to cloud platforms, cloud computing professionals are in high demand. Roles focusing on AWS, Azure, and Google Cloud certifications are particularly valuable. A recent Gartner study predicts that public cloud spending will surpass $600 billion in 2025, driving the need for cloud architects and engineers.
4. Data Scientists and Analysts
The ability to analyze and interpret large datasets remains a cornerstone of decision-making. The U.S. Bureau of Labor Statistics projects that data science roles will grow by 36% between 2021 and 2031, far outpacing the average for all occupations. Companies like Netflix leverage data science to enhance user experience and content recommendations.
5. Full-Stack Developers
The demand for web applications continues to rise, making full-stack developers indispensable. Startups and tech giants alike require engineers proficient in both front-end and back-end development, particularly those skilled in JavaScript, Python, and frameworks like React and Node.js.
Essential Tech Skills for 2025
1. AI and Machine Learning
Understanding AI algorithms, TensorFlow, and Python programming is crucial for engineers working on automation and predictive analytics.
2. Cybersecurity and Ethical Hacking
Proficiency in risk assessment, penetration testing, and cryptography will help professionals protect digital assets from evolving threats.
3. Cloud Computing
Skills in AWS, Microsoft Azure, and Google Cloud Platform (GCP) are essential for managing cloud-based infrastructure and services.
4. Data Analytics and SQL
The ability to extract insights from big data using tools like SQL, Power BI, and Tableau is a major asset across industries.
5. Software Development and DevOps
DevOps methodologies, containerization (Docker, Kubernetes), and agile development practices streamline software deployment and scalability.
Conclusion
The 2025 tech job market will be defined by advancements in AI, cybersecurity, cloud computing, and data science. Professionals who upskill in these areas will find themselves at the forefront of the industry, securing lucrative and fulfilling careers. With the right expertise and adaptability, tech talent will continue to shape the future of innovation and digital transformation.
Data science has become one of the most lucrative and in-demand career paths in the digital age. With the exponential growth of data and the increasing need for data-driven decision-making, professionals in this field have numerous income opportunities beyond traditional employment.
This essay explores some reliable sources of income that data scientists can leverage to maximize their earning potential.
1. Full-Time Employment :
One of the most stable sources of income for data scientists is full-time employment with companies across various industries, including finance, healthcare, technology, and retail. Many organizations seek skilled data scientists to analyze data, develop predictive models, and optimize business strategies. These roles often come with competitive salaries, benefits, and job security, making them an attractive option for professionals looking for a steady income.
2. Freelancing and Consulting :
Freelancing offers data scientists the flexibility to work with multiple clients on a project basis. Platforms such as Upwork, Fiverr, and Toptal provide opportunities to find clients who need data analysis, machine learning model development, and data visualization services. Additionally, experienced data scientists can establish themselves as consultants, advising businesses on data strategies and analytics solutions for a substantial fee.
3. Online Courses and Tutorials :
With the growing interest in data science, many professionals and students are looking to acquire skills in this field. Data scientists can create and sell online courses through platforms like Udemy, Coursera, or Teachable. They can also produce tutorials and instructional videos on YouTube, monetizing their content through ad revenue, sponsorships, and memberships.
4. Writing and Blogging :
Technical writing and blogging can be a lucrative source of income for data scientists. Many websites and tech publications pay for high-quality articles on data science, artificial intelligence, and machine learning. Platforms such as Medium, Towards Data Science, and Substack allow data scientists to monetize their writing through subscriptions and sponsorships.
5. Building and Selling Data Products :
Data scientists with programming expertise can develop and sell data-driven products, such as machine learning models, automation scripts, or analytics dashboards. These products can be sold on platforms like Gumroad, AWS Marketplace, or as software-as-a-service (SaaS) solutions. Developing proprietary algorithms and licensing them to businesses can also generate passive income.
6. Participating in Competitions :
Online platforms like Kaggle and DrivenData host data science competitions where participants solve real-world problems for cash prizes and recognition. Winning or ranking high in these competitions can not only provide financial rewards but also enhance a data scientistโs reputation, leading to better career opportunities and collaborations.
7. Speaking Engagements and Workshops :
Experienced data scientists can earn income by speaking at conferences, workshops, and corporate training events. Organizations often seek industry experts to provide insights on data science trends and applications. Conducting in-person or virtual workshops on data analytics and machine learning can also be a profitable venture.
Conclusion :
Data scientists have a wide array of income opportunities beyond traditional employment. By exploring freelancing, online education, writing, product development, competitions, and public speaking, professionals in this field can diversify their revenue streams and maximize their earning potential. The key to success lies in continuously improving skills, staying updated with industry trends, and strategically leveraging available platforms to monetize expertise.
In Meta’s data science and data engineering interviews, candidates often encounter complex SQL questions that assess their ability to handle real-world data scenarios. One such challenging question is:
Question: Average Post Hiatus
Given a table of Facebook posts, for each user who posted at least twice in 2024, write a SQL query to find the number of days between each userโs first post of the year and last post of the year in 2024. Output the user and the number of days between each user’s first and last post.
Table Schema:
posts
user_id (INTEGER): ID of the user who made the post
post_id (INTEGER): Unique ID of the post
post_date (DATE): Date when the post was made
Approach:
Filter Posts from 2024:
Select posts where the post_date falls within the year 2024.
Identify First and Last Post Dates:
For each user, determine the minimum (first_post_date) and maximum (last_post_date) post dates in 2024.
Calculate the Difference in Days:
Compute the difference in days between last_post_date and first_post_date for each user.
Filter Users with At Least Two Posts:
Ensure that only users who have posted more than once are considered.
SQL Solution:
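One way to write the query described in the explanation below is shown here; it uses MySQL-style DATEDIFF(end_date, start_date), so adjust the date-difference syntax for your dialect:

```sql
WITH user_posts_2024 AS (
    SELECT
        user_id,
        MIN(post_date) AS first_post_date,
        MAX(post_date) AS last_post_date,
        COUNT(post_id) AS post_count
    FROM posts
    WHERE post_date >= '2024-01-01'
      AND post_date <  '2025-01-01'
    GROUP BY user_id
)
SELECT
    user_id,
    DATEDIFF(last_post_date, first_post_date) AS days_between
FROM user_posts_2024
WHERE post_count > 1;
```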
Explanation:
Common Table Expression (CTE): user_posts_2024 filters posts from 2024 and groups them by user_id. It calculates the first and last post dates and counts the total posts per user.
Main Query: Selects users with more than one post and computes the difference in days between their first and last posts using the DATEDIFF function.
Key Considerations:
Date Functions: The DATEDIFF function calculates the difference between two dates. Note that the syntax may vary depending on the SQL dialect. For instance, in some systems, the order of parameters in DATEDIFF might be reversed.
Filtering by Date: Ensure the date filter accurately captures the entire year of 2024.
Handling Users with Single Posts: By counting posts per user and filtering out those with only one post (post_count > 1), we ensure that only users with multiple posts are considered.
Personal Experience:
In my experience preparing for SQL interviews at major tech companies, including Meta, it’s crucial to practice a variety of SQL problems that test different aspects of data manipulation and analysis. Resources like DataLemur offer curated questions that mirror the complexity and style of actual interview scenarios.
Additionally, engaging in mock interviews and solving problems from platforms like StrataScratch can provide practical experience and enhance problem-solving skills.
By systematically practicing such problems and understanding the underlying concepts, candidates can develop the proficiency needed to excel in SQL interviews at Meta and similar companies.
By 2025, data science and artificial intelligence (AI) continue to evolve, influencing various sectors and reshaping our daily lives. Here are ten key predictions for the landscape of data science and AI in 2025, supported by current statistics and trends:
1. Surge in AI-Driven Personalization
AI algorithms are enabling brands to offer unprecedented levels of personalization. In 2024, 70% of consumers noted a clear distinction between companies effectively leveraging AI in customer service and those that are not. This trend is expected to intensify, with AI delivering tailored experiences across shopping, entertainment, and healthcare.
2. Greater Demand for Explainable AI
As AI systems become integral to decision-making, the demand for transparency has surged. In 2024, 94% of data and AI leaders reported an increased focus on data due to AI interest, underscoring the need for explainable AI to build trust and ensure ethical use.
3. Wider Adoption of Privacy-Preserving Technologies
With rising data breaches and privacy concerns, there's a shift towards privacy-preserving technologies. By 2025, it's anticipated that 40% of large organizations will implement privacy-enhancing computation techniques in analytics, balancing innovation with security.
4. AI Automation of Complex Workflows
AI is moving beyond routine tasks to automate complex processes in industries like law, finance, and healthcare. For instance, automating middle-office tasks with AI can save North American banks $70 billion by 2025.
5. Stronger AI Ethics and Regulation
Governments and organizations are establishing robust AI ethics guidelines and regulatory frameworks. In 2024, 49% of technology leaders reported that AI was fully integrated into their companies' core business strategy, highlighting the need for ethical oversight.
6. Convergence of Quantum Computing and AI
The fusion of quantum computing and AI is expected to revolutionize areas like drug discovery and cryptography. By 2025, major tech companies are projected to invest significantly in quantum AI research, aiming to achieve breakthroughs in data processing speeds and capabilities.
7. Growth of Edge AI
AI processing is increasingly occurring on devices rather than centralized servers. This shift enhances real-time data processing, reduces latency, and improves data security. The global edge AI software market is projected to reach $3.15 billion by 2025, reflecting this trend.
8. Multimodal AI Becomes Standard
AI systems capable of understanding and integrating data from multiple sources are becoming standard. In 2024, 83% of Chief Data Officers and data leaders prioritized generative AI, indicating a move towards more advanced, multimodal applications.
9. AI-Driven Sustainability and Climate Action
AI is playing a pivotal role in addressing climate change by optimizing energy consumption and promoting sustainable practices. By 2025, AI-driven solutions are expected to reduce global greenhouse gas emissions by 4%, equivalent to 2.4 gigatons of CO2.
10. Democratization of AI Tools
User-friendly AI tools are empowering individuals without technical backgrounds. In 2024, 67% of top-performing companies benefited from generative AI-based product and service innovation, reflecting a broader trend towards accessible AI solutions.
In conclusion, 2025 is shaping up to be a transformative year for data science and AI, with advancements poised to enhance personalization, transparency, and efficiency across various sectors. Staying informed and adaptable will be crucial for individuals and organizations aiming to thrive in this dynamic landscape.
In the fast-paced world of technology, data science has emerged as one of the most transformative fields, influencing industries across the globe. Mastering data science requires years of learning and experience, yet we will attempt to distill years of expertise into just a few minutes.
This essay highlights the fundamental pillars of data science, its essential tools, and key applications, providing a concise yet comprehensive understanding of this dynamic domain.
1. Foundations of Data Science
Data science is an interdisciplinary field that combines statistics, programming, and domain expertise to extract meaningful insights from data. The journey begins with understanding mathematics, particularly statistics and linear algebra, which form the backbone of data analysis. Probability, hypothesis testing, regression models, and clustering techniques are crucial in interpreting data trends.
Programming is another cornerstone of data science, with Python and R being the most widely used languages. Libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn in Python facilitate efficient data manipulation, visualization, and machine learning model implementation.
2. Data Collection and Preprocessing
Raw data is rarely perfect. Data scientists spend a significant portion of their time cleaning, preprocessing, and transforming data. Techniques such as handling missing values, removing duplicates, encoding categorical variables, and normalizing data ensure accuracy and reliability. SQL plays a vital role in querying databases, while tools like Apache Spark handle big data efficiently.
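As a small illustration, here is a SQL sketch of two routine cleaning steps, deduplication and filling missing values (the raw_orders table and its columns are hypothetical):

```sql
WITH deduplicated AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id          -- one row per order
               ORDER BY updated_at DESC       -- keep the most recent version
           ) AS rn
    FROM raw_orders
)
SELECT order_id,
       customer_id,
       COALESCE(amount, 0) AS amount          -- replace missing amounts with 0
FROM deduplicated
WHERE rn = 1;
```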
3. Exploratory Data Analysis (EDA)
Before diving into modeling, understanding the dataset is crucial. Exploratory Data Analysis (EDA) involves summarizing main characteristics through statistical summaries, visualizations, and pattern detection. Libraries such as Seaborn and Plotly assist in generating insightful graphs that reveal correlations and anomalies within the data.
4. Machine Learning and Model Building
Machine learning is the heart of data science. It can be broadly classified into:
Supervised Learning: Algorithms like linear regression, decision trees, random forests, and neural networks make predictions based on labeled data.
Unsupervised Learning: Techniques such as k-means clustering and principal component analysis (PCA) help uncover hidden patterns in unlabeled data.
Reinforcement Learning: Used in robotics and gaming, this technique allows models to learn optimal strategies through rewards and penalties.
Deep learning, powered by neural networks and frameworks like TensorFlow and PyTorch, has revolutionized fields such as image recognition and natural language processing (NLP).
5. Model Evaluation and Optimization
Building a model is not enough; assessing its performance is crucial. Metrics such as accuracy, precision, recall, and F1-score help evaluate classification models, while RMSE and R-squared measure regression models. Techniques like cross-validation, hyperparameter tuning, and ensemble methods improve model robustness and accuracy.
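For reference, the standard definitions of two of these metrics are:

$$F_1 = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where precision is $TP/(TP+FP)$, recall is $TP/(TP+FN)$, and $y_i$, $\hat{y}_i$ are the observed and predicted values.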
6. Deployment and Real-World Applications
Once a model is optimized, deploying it for real-world use is the next step. Cloud platforms such as AWS, Google Cloud, and Azure provide scalable solutions. Deployment tools like Flask and FastAPI allow integration with applications. Monitoring and updating models ensure continued performance over time.
7. Future Trends in Data Science
Data science continues to evolve with advancements in AI, automation, and ethical considerations. Explainable AI (XAI), AutoML, and federated learning are reshaping the field. Understanding the ethical implications of AI, including bias mitigation and data privacy, is becoming increasingly important.
Conclusion
Data science encompasses vast knowledge, yet at its core, it is about transforming raw data into actionable insights. Mastering the fundamentals, staying updated with emerging technologies, and continuously experimenting are key to success in this field. Whether you are a beginner or an experienced practitioner, the journey of data science is one of constant learning and innovation.
In the fast-evolving world of data analytics, staying ahead requires a combination of technical expertise, adaptability, and strategic foresight. As businesses increasingly rely on data-driven decision-making, the role of a data analyst has become pivotal. Here are some key strategies to ensure continued growth and success in this dynamic field.
1. Master Core Technical Skills
The foundation of a successful data analyst lies in their technical proficiency. Core skills such as data manipulation, visualization, and statistical analysis are non-negotiable. Proficiency in tools like Python, R, SQL, and Excel is essential. Furthermore, familiarity with data visualization platforms such as Tableau, Power BI, or Looker can make your insights more impactful and accessible to stakeholders.
To stay ahead, dedicate time to learning emerging technologies and tools. For example, cloud platforms like AWS, Azure, and Google Cloud are becoming increasingly relevant for handling large-scale data. Additionally, understanding machine learning fundamentals and algorithms can provide a competitive edge.
2. Adopt a Growth Mindset
The data analytics landscape is constantly changing, with new tools, frameworks, and methodologies emerging regularly. A growth mindset, characterized by curiosity and a willingness to learn, is crucial for staying relevant. Attend workshops, webinars, and industry conferences to keep abreast of the latest trends and best practices.
Online learning platforms such as Coursera, Udemy, and LinkedIn Learning offer specialized courses on topics like advanced data analytics, AI, and big data. Subscribing to industry blogs, podcasts, and newsletters can also help you stay informed about new developments and opportunities.
3. Focus on Business Acumen
Technical expertise is only one part of the equation. Data analysts must also understand the business context of their work. Familiarize yourself with your company's industry, goals, and challenges. This knowledge enables you to frame your analysis in a way that directly addresses organizational needs and drives value.
Collaborate with stakeholders to understand their pain points and decision-making processes. By aligning your insights with business objectives, you can position yourself as a strategic partner rather than just a technical resource.
4. Hone Communication Skills
The ability to communicate complex data insights clearly and effectively is a hallmark of a great data analyst. Strong communication skills, both written and verbal, are essential for presenting findings to non-technical audiences.
Practice creating concise reports, compelling dashboards, and impactful presentations. Storytelling with data is a valuable skill that helps convey the significance of your analysis. Use visualizations to make data more digestible and actionable for decision-makers.
5. Build a Strong Professional Network
Networking with other professionals in the field can provide valuable insights, mentorship, and career opportunities. Join online forums, social media groups, and professional organizations such as the International Institute for Analytics (IIA) or the Data Science Association.
Participating in hackathons, meetups, and local events can also expand your network. Engaging with others in the analytics community allows you to exchange ideas, stay inspired, and learn from peers' experiences.
6. Embrace Automation and Efficiency
In a field where time is of the essence, automating repetitive tasks can significantly boost productivity. Learn to use scripting and automation tools like Python libraries (e.g., Pandas and NumPy) or workflow management platforms such as Apache Airflow.
Additionally, staying informed about advancements in AI and machine learning can help you leverage automation for more sophisticated tasks, such as predictive modeling and anomaly detection.
7. Prioritize Ethical Data Practices
As the volume and importance of data grow, so does the responsibility to handle it ethically. Familiarize yourself with data privacy regulations like GDPR and CCPA, and ensure compliance in your work. Ethical data practices build trust with stakeholders and safeguard your organization from legal risks.
Consider taking courses or earning certifications in data ethics and governance to demonstrate your commitment to responsible analytics.
8. Track and Measure Your Progress
Finally, continually evaluate your own growth and performance. Set clear goals, whether it's mastering a new tool, completing a certification, or improving your presentation skills. Regularly review your achievements and identify areas for improvement.
Solicit feedback from colleagues and supervisors to gain insights into how you can enhance your contributions. By tracking your progress, you can stay motivated and focused on long-term career growth.
Conclusion
The role of a data analyst is both challenging and rewarding. By mastering technical skills, cultivating a growth mindset, and aligning your work with business objectives, you can stay ahead in this competitive field. Communication, networking, and ethical practices further enhance your value as a data professional. Ultimately, a commitment to continuous learning and self-improvement will ensure your success as a data analyst in the ever-changing world of data analytics.
In recent years, the integration of Python into Microsoft Excel has revolutionized the field of data analysis. This development bridges the gap between two of the most widely used tools in data analytics, bringing together the accessibility of Excel with the advanced capabilities of Python. This combination is poised to reshape how data analysts work by enhancing efficiency, enabling advanced analytics, and fostering greater collaboration.
Enhanced Efficiency
One of the most immediate benefits of integrating Python into Excel is the significant boost in efficiency. Excel has long been the go-to tool for basic data manipulation and visualization, while Python excels in handling large datasets, automation, and advanced computations. Previously, analysts had to switch between these tools, exporting and importing data between Excel and Python environments. With Python now embedded in Excel, this workflow becomes seamless, saving time and reducing errors. For instance, tasks like cleaning data, automating repetitive processes, or performing complex calculations can now be executed directly within Excel, eliminating redundant steps.
Advanced Analytics Made Accessible
Python's integration into Excel democratizes access to advanced analytics. Python's robust libraries, such as Pandas, NumPy, and Matplotlib, empower users to perform sophisticated data manipulation, statistical analysis, and data visualization. Analysts who are already comfortable with Excel can now leverage these powerful tools without needing extensive programming expertise. For example, tasks such as predictive modeling, trend analysis, and machine learning, once the domain of specialized data scientists, can now be performed within Excel by leveraging Python scripts. This makes advanced analytics more accessible to a broader audience, fostering innovation and enabling businesses to extract deeper insights from their data.
Greater Collaboration
Another transformative aspect of this integration is its potential to enhance collaboration. Data analysts often work alongside professionals who may not have programming expertise but are proficient in Excel. By embedding Python directly into Excel, analysts can create solutions that are easily shared and understood by non-technical team members. Python's ability to generate visually appealing and interactive dashboards, combined with Excel's familiar interface, ensures that insights are communicated effectively across diverse teams. Additionally, this integration reduces the reliance on external tools, creating a unified platform for analysis and reporting.
Overcoming Challenges
While the integration of Python into Excel offers numerous advantages, it also presents challenges. Users must invest time in learning Python to fully harness its capabilities. Organizations may also need to provide training and resources to bridge the skill gap. Furthermore, managing computational performance within Excel when dealing with large datasets or resource-intensive Python scripts will require careful optimization.
Conclusion
The integration of Python into Excel marks a pivotal moment in the evolution of data analytics. By combining the strengths of both tools, data analysts can work more efficiently, perform advanced analyses, and collaborate more effectively. While there are challenges to address, the potential benefits far outweigh the drawbacks. As this integration continues to evolve, it will undoubtedly reshape the way data analysts work, driving innovation and unlocking new possibilities in the field of analytics.
In a world driven by technology and connectivity, the traditional notion of job hunting has evolved dramatically. The idea of scouring job boards, sending out countless résumés, and waiting anxiously for responses is becoming a thing of the past. Instead, the concept of allowing the job to find you is gaining traction. This approach is not only more efficient but also positions individuals to attract opportunities that align with their true talents and passions.
Building Your Personal Brand
One of the most effective ways to let jobs come to you is by cultivating a strong personal brand. This involves showcasing your skills, expertise, and achievements in a way that makes you stand out. Platforms like LinkedIn, personal websites, or even professional social media accounts act as digital resumes and portfolios. By regularly sharing insights, projects, and successes, you position yourself as a thought leader in your field, making it easier for recruiters and employers to notice your unique value.
Networking and Connections
Networking remains a cornerstone of career advancement, but in this context, it is about building authentic relationships rather than simply asking for opportunities. Attending industry events, participating in webinars, and engaging in online communities can help you connect with professionals who may later recommend you for roles. Often, the best job opportunities come not from applications but through referrals from trusted connections.
Mastering the Passive Job Search
Even when you are not actively seeking a job, maintaining an updated and visible professional presence is essential. Recruiters and hiring managers often use tools like LinkedIn Recruiter or industry-specific databases to find candidates. By optimizing your profiles with relevant keywords and highlighting your achievements, you increase the chances of being approached for roles that match your skills.
Upskilling and Staying Relevant
Another key aspect of attracting opportunities is staying ahead in your field. Continuous learning, whether through online courses, certifications, or practical projects, demonstrates a commitment to growth. Employers are naturally drawn to individuals who stay updated with the latest trends and technologies in their industry.
The Shift in Employer Mindset
Employers themselves are changing how they find talent. Instead of relying solely on job postings, many companies now actively search for candidates who align with their values and long-term goals. They look for individuals who demonstrate a strong sense of purpose, creativity, and adaptability. By focusing on building a compelling narrative around your career, you position yourself as someone employers want to pursue.
Conclusion
Letting the job find you is not about being passive; it is about being strategic. By investing in your personal brand, networking authentically, staying visible, and continuously developing your skills, you create a professional persona that attracts opportunities. In this modern age, the most fulfilling jobs are not those we chase but those that are drawn to us because of the value we consistently offer.
In the era of big data and machine learning, data science has emerged as a critical field, enabling businesses and researchers to make informed decisions. However, the backbone of data science lies in mathematics, which is essential for understanding the algorithms, models, and techniques used. For those new to data science, mastering the required mathematical concepts can seem daunting. Here’s a step-by-step guide to learning the math needed for data science.
1. Identify the Core Areas of Mathematics:
The primary areas of math relevant to data science include:
Linear Algebra: This is foundational for understanding concepts like matrices, vectors, and their operations, which are widely used in machine learning algorithms and neural networks.
Calculus: Knowledge of derivatives and integrals is vital for optimization problems, which are at the heart of model training.
Probability and Statistics: These are essential for analyzing data, understanding distributions, and building predictive models.
Discrete Mathematics: Concepts like set theory and graph theory can help in database management and network analysis.
2. Start with Practical Applications:
Rather than diving deep into abstract theory, begin by understanding how math applies to real-world data problems. For instance, learn about matrix operations in linear algebra through examples like image manipulation or recommendation systems. Online tutorials and courses often tie mathematical concepts to coding exercises, making the learning process more engaging.
3. Use Online Resources and Courses:
Platforms like Khan Academy, Coursera, and edX offer beginner-friendly courses in mathematics for data science. Start with foundational topics like basic statistics or linear algebra before progressing to advanced concepts. Many of these courses also incorporate Python or R for practical exercises, allowing you to apply math to data problems immediately.
4. Practice with Tools and Libraries:
Programming libraries such as NumPy, SciPy, and pandas in Python provide built-in functions to perform mathematical operations. Practicing with these tools not only solidifies mathematical understanding but also prepares you to tackle real-world data science projects.
5. Focus on Problem-Solving:
Solving data science problems on platforms like Kaggle or HackerRank helps reinforce mathematical concepts. For example, while working on a regression problem, you can delve into the calculus behind gradient descent or use statistical tests to validate results.
6. Join Study Groups and Communities:
Collaborating with peers can accelerate learning. Online forums like Stack Overflow or Reddit's r/datascience are excellent places to ask questions, share resources, and learn from others' experiences.
7. Maintain a Growth Mindset:
Finally, remember that learning math for data science is a gradual process. Focus on building a solid foundation and tackle advanced topics as your confidence grows. Stay curious, and don't hesitate to revisit concepts that feel challenging.
Conclusion:
While data science requires proficiency in math, the key to mastering it lies in consistent practice, using real-world applications, and leveraging modern learning tools. By approaching mathematical concepts step by step and integrating them into practical data science projects, you'll not only enhance your technical skills but also gain the confidence needed to excel in this dynamic field.
Becoming a data scientist is a journey that often appears daunting due to the wealth of information, tools, and training programs available. It is easy to fall into the trap of spending excessive money on courses, certifications, and resources. However, with the right approach, you can save both time and money while accelerating your learning process. Here are five crucial lessons every beginner data scientist should know:
1. Master the Fundamentals Before Diving into Advanced Topics:
One of the biggest mistakes beginners make is trying to learn everything at once or jumping directly into advanced topics like deep learning or big data tools. Start with the basics:
Mathematics and Statistics: Build a strong foundation in linear algebra, calculus, probability, and statistics. These are the cornerstones of data science.
Programming: Focus on learning Python or R, as these are the most widely used languages in data science. Master libraries like NumPy, pandas, and matplotlib for data manipulation and visualization.
Data Analysis: Learn to clean, analyze, and draw insights from datasets. Practicing these skills on free datasets from platforms like Kaggle or UCI Machine Learning Repository is cost-effective and impactful.
Skipping the fundamentals can lead to frustration and wasted money on courses that assume prior knowledge.
2. Leverage Free and Open-Source Resources:
The internet is a treasure trove of free resources for aspiring data scientists. Before investing in expensive bootcamps or certifications, explore these options:
Online Courses: Platforms like Coursera, edX, and YouTube offer free or affordable courses from top universities and industry experts.
Open-Source Tools: Familiarize yourself with tools like Jupyter Notebook, scikit-learn, TensorFlow, and PyTorch, all of which are freely available.
Books and Blogs: Read beginner-friendly books like “Python for Data Analysis” by Wes McKinney and follow blogs by industry leaders to stay updated.
Many successful data scientists have built their careers using these free and open resources, proving that spending a fortune is not a requirement.
3. Practice Real-World Projects Over Theoretical Learning:
Theoretical knowledge is important, but the real value lies in applying what you learn to real-world problems. Working on projects helps you understand the nuances of data science and builds your portfolio, which is crucial for landing a job. Here are some tips:
Start Small: Begin with simple projects like analyzing a public dataset or building a basic machine learning model.
Participate in Competitions: Platforms like Kaggle and DrivenData host competitions that allow you to solve real-world problems and collaborate with other data enthusiasts.
Contribute to Open-Source Projects: This not only enhances your skills but also helps you build connections in the data science community.
Practical experience is far more valuable to employers than a long list of certifications.
4. Focus on Building a Portfolio, Not Collecting Certificates:
While certifications can be a good starting point, they are not the ultimate measure of your capabilities. Employers prioritize your ability to solve problems and showcase results. Here's how to build a compelling portfolio:
Document Your Projects: Clearly explain the problem, your approach, and the results in each project.
Host Your Work: Use platforms like GitHub to display your code and results. A well-maintained GitHub profile can serve as your professional portfolio.
Highlight Diverse Skills: Showcase projects that demonstrate your proficiency in different areas, such as data visualization, machine learning, and natural language processing.
A strong portfolio can open doors to opportunities that even the most expensive certification might not.
5. Network and Seek Mentorship:
Networking and mentorship can significantly accelerate your learning curve and save you from costly mistakes. Connect with professionals in the field through:
LinkedIn: Engage with data scientists by commenting on their posts or asking for advice.
Meetups and Conferences: Attend local or virtual events to learn from experts and grow your network.
Mentorship Platforms: Platforms like Data Science Society or Kaggle's community forums can connect you with mentors willing to guide you.
Learning from someone who has already navigated the path you are on can provide insights that no course or book can offer.
Conclusion
Becoming a professional data scientist doesn't require spending thousands of dollars. By focusing on the fundamentals, leveraging free resources, gaining practical experience, building a strong portfolio, and networking with industry professionals, you can achieve your goals without breaking the bank. The key is consistency, curiosity, and a willingness to learn through hands-on practice. Remember, the journey to becoming a data scientist is a marathon, not a sprint. Invest your time wisely, and success will follow.
Designing Spotify, a global music streaming platform, is a popular system design interview question. It challenges candidates to demonstrate their ability to build a scalable, distributed, and user-focused system. This article explores how to design such a platform, considering its functionality, architecture, and challenges.
Understanding the Requirements
Before diving into the design, it's essential to understand the system's requirements, which fall into functional and non-functional categories. Functional requirements cover the core features discussed below: music playback, search, playlist management, personalized recommendations, and subscription payments. Key non-functional requirements include:
Low Latency: Provide seamless music playback with minimal buffering.
High Availability: Ensure the system is always accessible.
Data Consistency: Maintain accurate song metadata and playlists.
System Design Overview
Spotify's system can be divided into multiple components, each handling a specific aspect of the service:
1. Client Applications
Spotify must offer a rich user experience across platforms like web, mobile, and desktop. The clients communicate with backend services through APIs for functionalities like playback, search, and recommendations.
2. API Gateway
An API Gateway acts as an entry point for all client requests. It routes requests to appropriate backend services, handles rate limiting, and ensures secure communication using HTTPS.
3. Metadata Service
The metadata service stores details about songs, albums, artists, and playlists. A relational database like PostgreSQL or a distributed key-value store like DynamoDB can be used.
Example metadata schema:
Song: ID, title, artist, album, genre, duration.
Playlist: ID, userID, songIDs, creation date.
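One possible relational sketch of that schema (PostgreSQL-style DDL; table names, types, and the junction table are illustrative, and a key-value store such as DynamoDB would model this differently):

```sql
CREATE TABLE songs (
    song_id      BIGINT PRIMARY KEY,
    title        VARCHAR(255) NOT NULL,
    artist_id    BIGINT NOT NULL,
    album_id     BIGINT,
    genre        VARCHAR(64),
    duration_sec INT                          -- track length in seconds
);

CREATE TABLE playlists (
    playlist_id  BIGINT PRIMARY KEY,
    user_id      BIGINT NOT NULL,
    created_at   TIMESTAMP NOT NULL
);

-- Junction table so a playlist can reference many songs in order
CREATE TABLE playlist_songs (
    playlist_id  BIGINT REFERENCES playlists (playlist_id),
    song_id      BIGINT REFERENCES songs (song_id),
    position     INT,
    PRIMARY KEY (playlist_id, song_id)
);
```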
4. Search Service
Spotify's search feature allows users to find songs, artists, or playlists quickly. To achieve this:
Use a search engine like ElasticSearch or Apache Solr for indexing metadata.
Implement autocomplete suggestions for a better user experience.
5. Music Storage and Streaming
Spotify stores audio files in a distributed file system, often backed by cloud storage services like Amazon S3. For efficient delivery:
Use Content Delivery Networks (CDNs) to cache audio files close to users, reducing latency.
Implement adaptive bitrate streaming protocols like HLS (HTTP Live Streaming) to provide smooth playback across varying network conditions.
6. Recommendation Engine
Personalized recommendations are a core feature of Spotify. Machine learning models can analyze user behavior, listening history, and playlists to suggest relevant songs. Key techniques include:
Collaborative Filtering: Recommendations based on similar users' preferences.
Content-Based Filtering: Recommendations based on song attributes (e.g., genre, mood).
7. User Data Service
This service manages user profiles, playlists, and preferences. A NoSQL database like MongoDB or Cassandra can efficiently store and retrieve this information.
8. Payment Service
Spotify's premium model requires a payment system to handle subscriptions. Integration with third-party payment gateways like Stripe or PayPal is essential for managing transactions securely.
High-Level Architecture
Below is an outline of the architecture for Spotify:
Load Balancer: Distributes traffic across multiple servers to handle user requests efficiently.
Microservices: Each core feature (e.g., search, recommendations, streaming) is handled by independent microservices.
Databases:
SQL Databases: For structured metadata.
NoSQL Databases: For user preferences and activity logs.
Distributed Storage: For storing large audio files.
CDNs: Cache and serve audio files globally.
Event Queue: Use message queues like Kafka to process events (e.g., user activity logging, playlist updates).
Scaling the System
To ensure scalability and performance:
Horizontal Scaling: Add more servers to handle increasing user traffic.
Caching: Use in-memory caches like Redis for frequently accessed data (e.g., popular playlists, recent searches).
Partitioning: Shard databases based on criteria like user IDs or geographic regions.
Key Challenges
High Traffic: Handling millions of concurrent users while maintaining low latency.
Consistency vs. Availability: Striking a balance between fast access and accurate metadata.
Global Coverage: Delivering content efficiently to users worldwide.
Copyright Management: Ensuring compliance with music licensing laws.
Machine Learning: Continuously improving recommendation algorithms to enhance user satisfaction.
Conclusion
Designing Spotify involves creating a distributed system capable of handling high traffic while ensuring low latency and high availability. By leveraging modern technologies like microservices, CDNs, and machine learning, developers can build a scalable and robust platform. This system design question tests a candidate's ability to break down complex problems, prioritize features, and propose practical solutions.
In today's data-driven world, the role of a data analyst has emerged as one of the most sought-after professions. A "real" data analyst is not merely someone who understands numbers but a professional capable of extracting meaningful insights from data and translating them into actionable strategies. Becoming a proficient data analyst requires a combination of technical expertise, business acumen, and a continuous learning mindset. This essay explores the essential steps to becoming a successful data analyst.
1. Acquiring Foundational Knowledge
The journey to becoming a data analyst begins with understanding the basics. Foundational knowledge in mathematics and statistics is crucial since these form the backbone of data analysis. Concepts such as probability, descriptive statistics, and hypothesis testing are indispensable tools for interpreting data. Moreover, familiarity with Excel is often a stepping stone, as it allows beginners to perform data cleaning and basic analysis tasks.
A firm grasp of SQL (Structured Query Language) is also essential. SQL enables analysts to extract and manipulate data from relational databases, which is a fundamental aspect of the job. These skills form the core of data analysis and serve as the foundation for more advanced techniques.
2. Mastering Technical Skills
A โrealโ data analyst is equipped with advanced technical skills that go beyond basic tools. Learning programming languages such as Python and R is highly recommended. These languages allow analysts to perform complex data manipulation, automate repetitive tasks, and create visualizations. Libraries like Pandas, NumPy, and Matplotlib in Python, or ggplot2 in R, are invaluable for data analysis.
In addition to programming, proficiency in data visualization tools like Tableau and Power BI is essential. These tools enable analysts to present data in an intuitive and visually appealing way, making it easier for stakeholders to grasp insights. As data grows in size and complexity, familiarity with big data technologies like Hadoop or Spark can also provide a competitive edge.
3. Understanding the Business Context
Technical skills alone do not make a great data analyst. The ability to understand the business context is equally important. A real data analyst knows how to ask the right questions and align their analysis with business objectives. This involves identifying key performance indicators (KPIs), understanding the target audience, and framing insights in a way that drives decision-making.
Business acumen also includes effective communication. Analysts must bridge the gap between raw data and business strategies by presenting findings in a clear and concise manner. Storytelling with data is a powerful skill that ensures stakeholders can act on the insights provided.
4. Gaining Practical Experience
Real-world experience is crucial for becoming a proficient data analyst. Internships and entry-level positions provide exposure to practical challenges, from handling messy datasets to meeting tight deadlines. Working on personal projects is another excellent way to build experience. By analyzing publicly available datasets, aspiring analysts can create a portfolio that showcases their skills and problem-solving abilities.
Online platforms like Kaggle offer opportunities to work on real-world problems and participate in competitions, allowing analysts to benchmark their skills against a global community. These experiences not only enhance technical proficiency but also foster a deeper understanding of how to approach complex problems.
5. Adopting a Growth Mindset
The field of data analytics is dynamic, with new tools, techniques, and technologies emerging regularly. To stay relevant, a data analyst must adopt a growth mindset and commit to continuous learning. Online courses, certifications, and webinars are excellent resources for staying updated. Certifications from organizations like Google, IBM, or Microsoft can validate an analyst's skills and make them more attractive to employers.
Networking within the data analytics community can also provide valuable insights into industry trends and best practices. Attending conferences, joining professional groups, and engaging in online forums can help analysts stay connected and informed.
6. Building Soft Skills
While technical and analytical skills are critical, soft skills often differentiate a good data analyst from a great one. Problem-solving is at the heart of data analysis, requiring creativity and critical thinking. Time management is equally important, as analysts often juggle multiple projects with competing deadlines.
Teamwork and collaboration are vital, as analysts frequently work with cross-functional teams, including marketing, finance, and operations. The ability to communicate effectively, both verbally and visually, ensures that insights are understood and acted upon.
Conclusion
Becoming a "real" data analyst is a multifaceted journey that combines technical expertise, business understanding, and practical experience. It requires a solid foundation in statistics and programming, mastery of visualization tools, and the ability to communicate insights effectively. By continuously learning and adapting to new challenges, aspiring analysts can establish themselves as valuable contributors in the ever-evolving world of data analytics. With dedication and persistence, anyone can transform raw data into powerful insights that drive meaningful change.
In recent years, YouTube has become one of the most popular platforms for content creators to share their work and monetize their efforts. But one question often lingers in the minds of aspiring creators: how much does YouTube actually pay for 1 million views? Having achieved this milestone myself, I'd like to share my experience and shed some light on how YouTube's monetization system works.
The Basics of YouTube Monetization
To understand how much YouTube pays for a million views, it's important to first grasp how monetization works. YouTube pays creators through its Partner Program, which allows ads to run on their videos. Earnings are based on several factors, including ad impressions, viewer demographics, content type, and the advertiser's budget. These earnings are typically measured in CPM (Cost Per Mille), which is the amount advertisers pay per 1,000 ad views, and RPM (Revenue Per Mille), which is what the creator actually earns per 1,000 views after YouTube's 45% cut.
Factors That Influence Earnings
When I hit 1 million views on one of my videos, I quickly learned that the amount I earned was not a flat rate. Here are some key factors that influenced my earnings:
Audience Demographics: The majority of my audience was based in the United States and Europe, regions where advertisers tend to pay higher rates. If my viewers were primarily from countries with lower CPM rates, my earnings would have been significantly less.
Content Type: My video was in the “educational” niche, which generally attracts higher-paying advertisers compared to entertainment or general lifestyle content. Topics like finance, technology, and business tend to have higher CPMs due to increased competition among advertisers.
Engagement and Watch Time: Viewer engagement, including how long they watched the video and whether they interacted with ads, played a significant role. Longer videos with mid-roll ads tend to generate more revenue.
Ad Blockers: Not all views result in ad revenue. A significant portion of my audience used ad blockers, which reduced the overall monetizable views.
My Earnings for 1 Million Views
After all these factors were accounted for, my video with 1 million views earned approximately $4,000. This translates to an average RPM of $4. While some creators report earning as little as $1,000 or as much as $10,000 for the same number of views, my earnings fell somewhere in the middle.
It's worth noting that these numbers can vary dramatically even for the same creator across different videos. For example, a video about personal finance or real estate might have a CPM of $20-$30, while a video about comedy sketches might only have a CPM of $1-$5.
Lessons Learned and Insights
Consistency is Key: Hitting 1 million views is an incredible milestone, but it's not enough to sustain a full-time income on YouTube unless you're consistently reaching those numbers across multiple videos.
Diversify Revenue Streams: Relying solely on ad revenue can be risky. Sponsorships, merchandise, and affiliate marketing are excellent ways to supplement your income.
Know Your Niche: Choosing a niche with high CPM potential can make a significant difference in your earnings.
Engage Your Audience: Building a loyal audience who watches your content consistently can lead to better ad performance and higher revenue.
Conclusion
So, how much does YouTube pay for 1 million views? The answer isn't straightforward and depends on numerous factors. For me, the milestone brought in $4,000, but others might earn more or less depending on their niche, audience, and content strategy. If you're an aspiring creator, focus on creating valuable content, understanding your audience, and exploring multiple revenue streams. The journey to 1 million views is both challenging and rewarding, and it's just the beginning of what's possible on YouTube.
In the competitive job market of 2025, a well-crafted resume can make all the difference for aspiring data scientists. With advancements in technology and increasing demands for specialized skills, hiring managers now look for resumes that are not only tailored but also demonstrate a strong understanding of the data science field.
This guide will walk you through the essential components of the perfect data science resume, helping you stand out in the crowded talent pool.
1. Understand the Role
Before crafting your resume, thoroughly research the specific data science role you are applying for. Data science encompasses various niches, such as machine learning, data analysis, business intelligence, and artificial intelligence. Each position may prioritize different skills, tools, and experiences. Tailoring your resume to the job description ensures relevance and increases your chances of landing an interview.
2. Choose the Right Format
The structure of your resume should be clean and professional. Opt for reverse chronological order, which highlights your most recent experience and achievements first. Use clear section headings, consistent formatting, and bullet points to improve readability. A one-page resume is ideal, but if you have extensive experience, a two-page resume can be acceptable.
3. Start with a Strong Summary
Begin your resume with a compelling summary that highlights your qualifications and career goals. This section should be concise (2-3 sentences) and tailored to the role. For example:
"Detail-oriented Data Scientist with 5+ years of experience in predictive modeling, data visualization, and machine learning. Proficient in Python, SQL, and Tableau, with a proven track record of driving data-driven decision-making in the e-commerce sector. Seeking to leverage analytical expertise to enhance business outcomes at XYZ Corporation."
4. Showcase Relevant Skills
The skills section should include technical and soft skills relevant to data science. Group similar skills to improve organization. Example categories include programming languages (Python, R, SQL), machine learning and statistics, data visualization tools (Tableau, Power BI), and soft skills such as communication and collaboration.
5. Highlight Your Work Experience
In the experience section, focus on achievements rather than responsibilities. Use the STAR (Situation, Task, Action, Result) method to provide context and demonstrate the impact of your contributions. Quantify your achievements wherever possible. For example:
Developed a machine learning model that increased customer retention by 15%, resulting in a $1M revenue boost.
Automated data cleaning processes, reducing analysis time by 30%.
Conducted A/B testing for a marketing campaign, increasing conversion rates by 10%.
6. Include Education and Certifications
List your educational background, starting with your highest degree. Include relevant coursework, honors, or projects if you are a recent graduate. Certifications in data science, machine learning, or specific tools add credibility. Examples include:
Master of Science in Data Science, University of XYZ
7. Showcase Projects
Highlighting personal or academic projects is essential, especially for candidates with limited work experience. Describe each project briefly, emphasizing your role, the tools you used, and the results. For instance:
Built a predictive analytics model using Python to forecast sales, achieving 95% accuracy.
Designed an interactive dashboard in Tableau to monitor key performance indicators for a non-profit organization.
Analyzed social media trends using sentiment analysis and NLP, generating actionable insights for brand strategy.
8. Tailor for Applicant Tracking Systems (ATS)
Most companies use ATS software to filter resumes before they reach hiring managers. Ensure your resume contains relevant keywords from the job description. Avoid complex formatting, as it can confuse the ATS.
9. Add a Professional Touch
Include links to your professional profiles, such as LinkedIn, GitHub, or a personal portfolio website. This demonstrates transparency and allows recruiters to explore your work further. Ensure these profiles are up-to-date and showcase your skills effectively.
10. Proofread and Edit
Errors in your resume can leave a negative impression. Proofread multiple times or seek feedback from peers. Consider using tools like Grammarly to catch typos and grammatical issues.
Final Thoughts
Creating the perfect data science resume in 2025 requires a blend of technical expertise, strategic presentation, and attention to detail. By aligning your resume with the job requirements, showcasing measurable achievements, and ensuring clarity, you can position yourself as a top candidate. Remember, your resume is your first impression, so make it count.
Structured Query Language (SQL) is an indispensable tool for data scientists. It provides the means to manage, manipulate, and analyze data stored in relational databases. Mastering SQL not only enhances efficiency in handling large datasets but also equips you to extract actionable insights. Here, we'll discuss some of the best SQL statements to streamline common data science tasks, from data extraction to aggregation and transformation.
1. SELECT: Data Extraction Made Simple
The SELECT statement is foundational for querying data from a database. With its versatility, you can retrieve specific columns, apply filters, and sort results.
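A minimal example, using a hypothetical sales table (all table and column names here and in the sketches below are illustrative):

```sql
SELECT order_id, customer_id, amount
FROM sales
WHERE order_date >= '2024-01-01'
  AND order_date <  '2025-01-01'    -- sales for a specific year
ORDER BY amount DESC;               -- largest orders first
```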
This statement allows you to filter data using the WHERE clause and arrange it with ORDER BY. For example, selecting sales data for a specific year can be achieved with this straightforward syntax.
2. GROUP BY and Aggregations: Summarizing Data
Data aggregation is central to many data science tasks. The GROUP BY clause, combined with aggregate functions like SUM, AVG, COUNT, MIN, and MAX, is essential for summarizing data.
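For instance, a sketch that summarizes the same hypothetical sales table by region:

```sql
SELECT region,
       AVG(amount)                 AS avg_sale,
       COUNT(DISTINCT customer_id) AS customers
FROM sales
GROUP BY region;
```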
This query can help compute metrics like average sales per region or the number of customers per category.
3. JOIN: Combining Data from Multiple Tables
Data often resides in multiple tables, necessitating joins. SQL provides various join types (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN) to merge datasets.
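A sketch that attaches purchase history to customer details (customers and purchases are hypothetical tables):

```sql
SELECT c.customer_id,
       c.name,
       p.purchase_date,
       p.amount
FROM customers AS c
LEFT JOIN purchases AS p
       ON p.customer_id = c.customer_id;   -- LEFT JOIN keeps customers with no purchases
```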
Using joins, you can connect tables to enrich your data, such as merging customer details with purchase histories.
4. CASE: Conditional Logic in Queries
The CASE statement introduces conditional logic, enabling the creation of new derived columns based on existing data.
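For example, labelling orders by size in the same hypothetical sales table:

```sql
SELECT order_id,
       amount,
       CASE
           WHEN amount >= 1000 THEN 'large'
           WHEN amount >= 100  THEN 'medium'
           ELSE 'small'
       END AS order_size
FROM sales;
```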
This is particularly useful for creating classifications or labels directly in the query.
5. CTEs and Subqueries: Structuring Complex Queries
Common Table Expressions (CTEs) and subqueries simplify complex SQL tasks by breaking them into manageable parts.
Using a CTE:
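A sketch (DATE_TRUNC is shown in its PostgreSQL form; table and column names are illustrative):

```sql
WITH monthly_sales AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS sales_month,
           SUM(amount)                     AS monthly_total
    FROM sales
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT customer_id,
       AVG(monthly_total) AS avg_monthly_spend
FROM monthly_sales
GROUP BY customer_id;
```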
CTEs improve readability and allow the reuse of intermediate results in the main query.
6. WINDOW Functions: Advanced Analytics
Window functions are powerful for performing calculations across rows related to the current row, such as rankings or running totals.
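For example, ranking products within each category by sales (product_sales is a hypothetical table):

```sql
SELECT category,
       product_id,
       total_sales,
       RANK() OVER (
           PARTITION BY category
           ORDER BY total_sales DESC
       ) AS sales_rank                      -- 1 = best seller in its category
FROM product_sales;
```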
These are ideal for scenarios like identifying the top-performing products in each category.
7. INSERT, UPDATE, DELETE: Data Manipulation
For modifying data, INSERT, UPDATE, and DELETE statements are invaluable; a combined example follows the three cases below.
Insert new data:
Update existing records:
Delete unwanted rows:
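Minimal examples of all three, assuming a hypothetical customers table:

```sql
-- Insert new data
INSERT INTO customers (customer_id, name, signup_date)
VALUES (1001, 'Ada Lovelace', '2024-06-01');

-- Update existing records
UPDATE customers
SET    name = 'Ada King'
WHERE  customer_id = 1001;

-- Delete unwanted rows
DELETE FROM customers
WHERE  signup_date < '2015-01-01';
```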
These commands maintain database integrity and keep the dataset relevant for analysis.
8. UNION and UNION ALL: Combining Results
When working with multiple queries, UNION combines results into a single output, ensuring uniqueness, while UNION ALL includes duplicates.
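A small sketch combining two hypothetical customer sources:

```sql
SELECT customer_id, email FROM online_customers
UNION                                   -- removes duplicate rows
SELECT customer_id, email FROM store_customers;
-- UNION ALL would keep duplicates (and is usually faster).
```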
This is helpful for consolidating data from different sources.
9. PIVOT and UNPIVOT: Reshaping Data
SQL allows for reshaping data with PIVOT and UNPIVOT, converting rows into columns or vice versa for easier analysis.
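Note that PIVOT and UNPIVOT are dialect-specific keywords (for example, SQL Server and Oracle support them); a rough T-SQL sketch with hypothetical names:

```sql
SELECT region,
       [2023] AS sales_2023,
       [2024] AS sales_2024
FROM (SELECT region, sales_year, amount FROM sales) AS src
PIVOT (SUM(amount) FOR sales_year IN ([2023], [2024])) AS pvt;
-- On databases without PIVOT, conditional aggregation with CASE gives the same result.
```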
This approach is useful for creating summary tables for reporting.
10. EXPLAIN and Performance Optimization
Lastly, the EXPLAIN statement helps optimize query performance by revealing execution plans.
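For example (exact syntax and output vary by engine; PostgreSQL also offers EXPLAIN ANALYZE):

```sql
EXPLAIN
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
```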
This ensures your queries are efficient and scalable for large datasets.
Conclusion
SQL's robustness and versatility make it a cornerstone of data science workflows. By mastering these key statements, data scientists can efficiently manage data extraction, transformation, and analysis tasks. Whether handling large-scale databases or generating quick insights, SQL remains an invaluable ally in the data-driven world.
Python has been a dominant force in the field of data science for over a decade. Known for its simplicity, readability, and a vast ecosystem of libraries, Python has established itself as the go-to language for data scientists worldwide. However, the landscape of data science is constantly evolving, with new tools and technologies emerging. This raises an important question: Is Python still the reigning king of data science?
Python's Dominance in Data Science:
Python's popularity in data science is largely attributed to its rich ecosystem of libraries and frameworks. Libraries like NumPy, Pandas, and Matplotlib provide powerful tools for data manipulation, analysis, and visualization. Additionally, Python's machine learning libraries, such as scikit-learn, TensorFlow, and PyTorch, have revolutionized how data scientists build and deploy predictive models.
Another key factor in Python’s dominance is its versatility. Python is not only used for data science but also for web development, automation, and scripting. This versatility has made it an attractive choice for individuals and organizations looking to consolidate their tech stack. Its user-friendly syntax also lowers the barrier to entry for beginners, making it a favorite for those new to programming.
Challenges to Python's Reign:
While Python remains a powerful tool, it faces increasing competition. R, a language developed specifically for statistical computing, is still preferred in academia and industries that require advanced statistical analysis. R offers packages like ggplot2 and dplyr that rival Python's capabilities.
Additionally, the rise of languages like Julia and tools like SQL and Tableau has introduced alternatives that are often faster or more specialized. Julia, for instance, is gaining traction for its speed and efficiency in numerical computations, which can be a limitation for Python in certain scenarios.
Moreover, the field of data science is seeing a shift towards low-code and no-code platforms like Alteryx and DataRobot, which aim to make data science more accessible to non-programmers. These platforms can handle many tasks traditionally performed using Python, potentially reducing its ubiquity.
Emerging Trends in Data Science:
The future of Python in data science also depends on its ability to adapt to emerging trends. For instance, the integration of artificial intelligence and deep learning has created demand for even more specialized tools and frameworks. While Python's TensorFlow and PyTorch dominate this space, competition from platforms like Google's JAX and Facebook's ONNX is growing.
Python also faces challenges in big data environments, where tools like Apache Spark and languages like Scala or Rust are often more efficient. However, Python's adaptability is evident in the development of libraries like PySpark, which bridges the gap between Python and Spark.
Conclusion
While Python faces growing competition, it remains the king of data science due to its extensive library support, versatility, and a large, active community. However, its continued dominance is not guaranteed. As the field evolves, Python must keep pace with new challenges and trends to maintain its position. For now, Python's reign remains strong, but the future of data science may see a more diverse set of tools sharing the throne.
Data science remains one of the most dynamic and in-demand career paths in 2024, offering opportunities to work at the intersection of technology, business, and innovation. However, switching to a career in data science requires more than just enthusiasm; it demands strategic planning, skill acquisition, and a clear understanding of the field’s expectations.
Here's a detailed guide on what you need to know before making this significant career change, illustrated with examples to provide clarity.
1. Understand What Data Science Entails
Data science involves extracting insights and actionable knowledge from structured and unstructured data using tools, algorithms, and statistical methods. It encompasses roles such as data analysts, machine learning engineers, and data engineers. Before diving in, ensure you have a clear understanding of the specific domain or role that aligns with your interests.
Example: If you're transitioning from a marketing background, you might find data analytics or business intelligence more aligned with your expertise, focusing on customer segmentation or campaign performance.
2. Acquire the Necessary Skills
Success in data science hinges on technical and analytical skills. Core competencies include:
Programming Languages: Proficiency in Python, R, or SQL.
Mathematics and Statistics: Understanding probability, linear algebra, and hypothesis testing.
Machine Learning: Familiarity with algorithms like linear regression, decision trees, and neural networks.
Data Visualization: Expertise in tools like Tableau, Power BI, or Matplotlib.
Big Data Tools: Knowledge of Hadoop, Spark, or similar technologies.
Illustrative Example: Consider someone switching from HR to data science. They might focus on Python for data manipulation, Tableau for employee performance dashboards, and predictive modeling for attrition rates.
3. Build Practical Experience
Hands-on experience is essential to bridge the gap between theoretical knowledge and real-world application. Begin with projects or internships that simulate industry challenges.
Example: Suppose you're coming from a finance background. You can build a portfolio project analyzing stock market trends using Python and machine learning models. This project could demonstrate your ability to predict stock prices and identify market anomalies.
4. Leverage Your Domain Knowledge
One of the advantages of transitioning to data science is leveraging expertise from your previous field. Data science applications span industries like healthcare, retail, banking, and entertainment. Your prior experience can set you apart.
Example: An architect transitioning to data science might specialize in urban planning by analyzing spatial data to optimize city layouts or building designs.
5. Learn to Communicate Insights
Data science is not just about crunching numbers; it's about translating data into actionable insights. Developing storytelling skills through data visualization and presentations is crucial for making your findings accessible to non-technical stakeholders.
Illustrative Scenario: A former journalist moving into data science could excel in creating compelling narratives around consumer behavior trends using data visualizations, making them valuable in media analytics or advertising.
6. Understand the Job Market
Before transitioning, research the job market and identify roles that match your skillset. In 2024, companies are increasingly seeking specialists rather than generalists. Specializations in areas like natural language processing, deep learning, or cloud-based data engineering are highly sought after.
Example: If your current role involves IT systems, transitioning into a cloud data engineering position might be a logical step, given your familiarity with cloud platforms like AWS or Azure.
7. Be Ready for a Learning Curve
Switching to data science is not without challenges. The learning curve can be steep, particularly if your background is not technical. Patience and continuous learning are essential.
Example: Someone from a customer service background might find it challenging to grasp machine learning initially but could ease into data science by focusing on customer behavior analytics.
8. Invest in Networking and Mentorship
Networking is crucial to understanding the nuances of the industry and securing opportunities. Joining data science communities, attending workshops, or seeking mentors can provide guidance and open doors.
Illustrative Example: A lawyer interested in legal tech data science might connect with professionals who work on legal analytics platforms, gaining insights into how machine learning is applied to case law prediction.
Conclusion
Switching to a data science career in 2024 offers immense opportunities but requires thorough preparation. By understanding the field, acquiring relevant skills, building practical experience, and leveraging domain expertise, you can position yourself for success. Remember, every step of the transition is an investment in a future-proof career that combines analytical rigor with problem-solving creativity.
In an era dominated by digital tools, productivity apps are essential for individuals and businesses alike. These apps streamline tasks, enhance focus, and help achieve goals efficiently. In 2024, several productivity apps stand out, offering unique features that cater to diverse needs. Here's a look at the best apps for productivity this year.
1. Notion: The Ultimate All-in-One Workspace
Notion continues to lead the productivity app market by offering an unparalleled all-in-one workspace. Combining note-taking, task management, and collaboration tools, it's ideal for personal use and team projects. In 2024, Notion has introduced AI-powered enhancements, such as automated task prioritization and content generation, making it indispensable for professionals juggling multiple responsibilities.
2. Microsoft To Do: Simplifying Task Management
Microsoft To Do excels as a straightforward, user-friendly app for organizing tasks. With seamless integration into Microsoft 365, users can sync their tasks across devices and applications, including Outlook and Teams. Its clean interface and new focus timer feature make it perfect for individuals looking to stay organized without feeling overwhelmed.
3. Trello: Visual Task Management
Trello remains a favorite for teams and individuals who prefer a visual approach to task management. The app's card-based system allows for easy organization of projects, and its updated automation features in 2024 enable repetitive tasks to be completed effortlessly. With integrations for tools like Slack and Google Drive, Trello continues to be a go-to choice for collaboration.
4. Evernote: The Classic Note-Taking App
Evernote has reinvented itself in 2024 with a suite of new features, including AI-powered search and handwriting recognition. It's the ideal app for users who want a comprehensive platform to capture ideas, organize notes, and manage documents. The updated interface ensures better usability, keeping Evernote relevant in a competitive market.
5. Focus@Will: Enhancing Concentration with Music
For those who struggle with distractions, Focus@Will is a lifesaver. This app uses scientifically designed music to improve focus and productivity. In 2024, it offers customizable playlists based on personality types and specific tasks, ensuring a distraction-free environment.
6. Slack: Communication for Teams
Slack is more than just a messaging app; it's a hub for team collaboration. In 2024, Slack has introduced new AI-driven features, including real-time meeting summaries and task generation from chat discussions. With its ability to integrate with tools like Google Workspace, Trello, and Salesforce, Slack enhances productivity for businesses of all sizes.
7. Todoist: Comprehensive Task Management
Todoist is a powerful app for tracking tasks and setting goals. Its intuitive design and gamification features, like productivity streaks, motivate users to stay consistent. In 2024, Todoist's new priority ranking system helps users tackle urgent tasks effectively, making it a favorite among busy professionals.
8. RescueTime: Tracking and Improving Productivity
RescueTime is the go-to app for those seeking to understand and optimize how they spend their time. With real-time productivity tracking and detailed analytics, it identifies unproductive habits. The 2024 version includes an improved focus mode that temporarily blocks distracting apps and websites, ensuring uninterrupted work.
9. Zapier: Automating Workflows
Zapier simplifies productivity by automating repetitive tasks across different apps. In 2024, Zapier has expanded its compatibility to over 5,000 apps and introduced AI-based workflow suggestions. This allows users to save time and focus on high-value tasks without worrying about mundane processes.
10. Google Workspace: Collaborative Suite
Google Workspace remains an essential productivity tool, offering apps like Google Docs, Sheets, and Drive. In 2024, enhanced AI-powered features, such as smart suggestions and automated data analysis in Sheets, make this suite indispensable for businesses and students. Its cloud-based nature ensures easy collaboration and accessibility.
Conclusion
The best productivity apps in 2024 combine powerful features with ease of use, ensuring individuals and teams can work smarter, not harder. Whether it's managing tasks, enhancing focus, or automating workflows, these tools cater to various productivity needs. Adopting the right apps can make a significant difference, empowering users to achieve their goals more efficiently.
In the realm of software development, Python stands out as a versatile and widely adopted programming language. Its simplicity and readability make it a favorite among both beginners and experienced developers. However, when it comes to senior-level Python interview questions, the expectations are much higher. Interviewers often craft challenging problems that test not only a candidate's coding skills but also their deep understanding of the language's internals, design principles, and problem-solving strategies. One such question has reportedly stumped many seasoned developers, showcasing the complexity of advanced Python concepts.
The Question: A Tricky Problem in Python
Imagine you are presented with the following problem during an interview:
Write a function to identify duplicate integers in a list, returning them in the order they first appear. The function should be efficient in terms of time and space complexity.
For example:
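One illustrative input/output pair (the values and the function name find_duplicates are hypothetical, chosen only to show the expected behavior):
find_duplicates([3, 1, 3, 4, 1, 3])   # expected result: [3, 1] - each duplicate reported once, in order of first appearance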
On the surface, the problem seems straightforward. However, achieving the optimal balance between correctness and efficiency is where most candidates struggle.
Common Pitfalls
Inefficient Solutions: Many developers jump into a solution that iterates through the list multiple times, using nested loops or calls like list.count(). While these approaches yield correct results, they are computationally expensive, leading to a time complexity of O(n²).
Overlooking Order: Another frequent mistake is relying on a set alone to track duplicates. Sets do not preserve insertion order, so while the duplicates are identified, the requirement to return them in order of first appearance is violated.
Mismanaging Space Complexity: Candidates often use additional data structures unnecessarily, leading to bloated space complexity. Efficient senior-level solutions must strike a balance between time and space usage.
The Optimal Solution
The optimal approach combines a set and a list to efficiently track duplicates while preserving their order:
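A minimal sketch consistent with the explanation below (the extra reported set is an addition of this sketch, used so the "already recorded" check stays constant-time; it is not necessarily the exact code the interviewer expects):
def find_duplicates(nums):
    seen = set()        # every value encountered so far
    reported = set()    # values already recorded as duplicates (assumed helper, keeps the check O(1))
    duplicates = []     # duplicates in the order they first appear
    for num in nums:
        if num in seen and num not in reported:
            duplicates.append(num)
            reported.add(num)
        seen.add(num)
    return duplicates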
Explanation:
A set called seen keeps track of all numbers encountered so far.
A list called duplicates ensures the order of first appearance.
The loop checks each number: when it has been seen before and has not yet been recorded as a duplicate, it is appended to duplicates, so each duplicate appears exactly once, in order.
Time Complexity: O(n) - the loop processes each element exactly once. Space Complexity: O(n) - the set and list grow with the input size.
Why Many Fail
Overthinking the Problem: Senior developers often anticipate hidden traps and over-engineer their solutions, leading to convoluted, error-prone code.
Lack of Familiarity with Python Internals: Understanding how set operations and list indexing work is crucial. Developers who lack this knowledge struggle to design optimal solutions.
Pressure in Interviews: The stress of performing under time constraints often leads to hasty decisions and overlooked requirements.
Lessons Learned
This deceptively simple question highlights key principles for success in senior-level Python roles:
Master the Basics: A deep understanding of fundamental data structures like sets, lists, and dictionaries is crucial.
Practice Problem-Solving: Regularly tackling algorithmic problems sharpens the ability to write efficient, clean code.
Focus on Clarity: In interviews, clear and concise solutions are as important as correctness.
Failing this question does not reflect a developer's inadequacy but rather underscores areas for growth. Embracing challenges like this helps developers refine their skills and advance their careers.
Senior-level Python questions like this one reveal the beauty and complexity of the language. Mastering them not only showcases expertise but also builds confidence in tackling real-world problems.
In an age where artificial intelligence (AI) is rapidly evolving and becoming more integrated into our daily lives, tools like ChatGPT have emerged as powerful resources for information, creativity, and problem-solving. However, many users are not leveraging this tool to its fullest potential, often due to a common mistake: a lack of specificity in their queries.
The Importance of Specificity
The #1 mistake that 99% of users make when using ChatGPT is asking vague or overly general questions. This lack of clarity can lead to responses that are equally ambiguous, failing to address the user's actual needs. AI models thrive on context, and the more information you provide, the better the results you will receive. For instance, asking "Tell me about history" will yield a broad, unfocused response, while asking "Can you summarize the causes of World War II?" directs the model to provide a concise and relevant answer.
How Specific Queries Enhance Responses
When users articulate their questions with specific details, they set the stage for more tailored and useful answers. This specificity acts as a guide, enabling the AI to understand the user's intent and the context of their inquiry. For example, if a user wants to write an essay on climate change, specifying the aspects they want to focus on (such as its effects on agriculture or policy responses) can lead to a far more engaging and informative interaction.
Moreover, specific queries can help users achieve their goals more effectively. Whether seeking assistance with creative writing, troubleshooting technical issues, or gathering research for a project, providing detailed background information or context can significantly enhance the quality of the output.
Examples of Effective Questions
To illustrate the impact of specificity, let's compare a few examples:
Vague Query: “What can you tell me about technology?”
Specific Query: “What are the key technological advancements in renewable energy over the past decade?”
Vague Query: “Give me tips on writing.”
Specific Query: “What are some effective strategies for writing a compelling personal statement for college applications?”
Vague Query: “Explain artificial intelligence.”
Specific Query: “Can you explain the difference between supervised and unsupervised learning in artificial intelligence?”
In each specific query, the user provides clear context, allowing ChatGPT to generate focused and informative responses that directly address the inquiry.
Conclusion
As we continue to explore the capabilities of AI tools like ChatGPT, it is crucial for users to recognize and avoid the common pitfall of vagueness. By asking specific, detailed questions, users can unlock the full potential of this technology, ensuring that they receive the most relevant and useful information possible. The next time you interact with ChatGPT, remember that specificity is key: by refining your questions, you'll enhance your experience and make the most of this powerful tool.
In today's data-driven world, the demand for data scientists has surged. Companies across industries seek professionals who can analyze vast amounts of data to extract meaningful insights, drive decision-making, and foster innovation. With the advent of advanced tools like ChatGPT, aspiring data scientists can harness artificial intelligence to accelerate their learning journey. This comprehensive guide explores how to become a data scientist using ChatGPT, outlining essential skills, resources, and practical steps to achieve success in this field.
1. Understanding the Role of a Data Scientist
Before embarking on the path to becoming a data scientist, it's crucial to understand the role's core responsibilities. Data scientists combine statistical analysis, programming, and domain expertise to interpret complex data sets. Their work involves data collection, cleaning, visualization, and applying machine learning algorithms to develop predictive models. Strong communication skills are also essential, as data scientists must convey their findings to non-technical stakeholders.
2. Essential Skills for Data Scientists
To thrive as a data scientist, one must develop a blend of technical and soft skills:
Programming Languages: Proficiency in programming languages such as Python and R is fundamental for data manipulation and analysis. ChatGPT can assist by providing coding examples, explaining syntax, and troubleshooting common programming issues.
Statistical Analysis: Understanding statistical concepts and methodologies is crucial for interpreting data accurately. Using ChatGPT, learners can explore statistical theories, ask for clarifications, and practice problem-solving.
Data Visualization: Data scientists must be adept at visualizing data to communicate insights effectively. Tools like Matplotlib, Seaborn, or Tableau are essential. ChatGPT can recommend visualization techniques and help users understand how to implement them.
Machine Learning: Familiarity with machine learning algorithms, their applications, and limitations is vital. ChatGPT can explain various algorithms, guide users through the implementation process, and suggest resources for deeper learning.
Domain Knowledge: Having domain-specific knowledge allows data scientists to contextualize their findings. ChatGPT can assist users in researching specific industries, trends, and challenges.
3. Learning Resources
To become a proficient data scientist, leveraging online resources is essential. Here's how ChatGPT can enhance the learning experience:
Online Courses: Platforms like Coursera, edX, and Udacity offer specialized courses in data science. ChatGPT can help users choose courses based on their current skill levels and learning goals.
Books and Articles: Reading foundational texts such as "An Introduction to Statistical Learning" or "Python for Data Analysis" provides in-depth knowledge. ChatGPT can summarize concepts or discuss key points from these resources.
Interactive Learning: Websites like Kaggle offer hands-on data science projects. Users can ask ChatGPT for project ideas, guidance on data sets, and tips for competition participation.
Communities and Forums: Engaging with online communities, such as Stack Overflow or Reddit's data science threads, is invaluable for networking and problem-solving. ChatGPT can help users navigate these platforms and formulate questions for discussions.
4. Practical Steps to Build Experience
Gaining practical experience is crucial in the journey to becoming a data scientist. Here's how to leverage ChatGPT for this purpose:
Personal Projects: Starting personal projects allows users to apply their skills and create a portfolio. ChatGPT can suggest project ideas based on interests and help users outline project plans.
Collaborative Work: Collaborating with peers on data science projects fosters teamwork and broadens perspectives. ChatGPT can assist in forming project groups and facilitating communication.
Internships and Job Opportunities: Seeking internships or entry-level positions provides real-world experience. ChatGPT can guide users on how to craft impactful resumes, prepare for interviews, and network effectively.
5. Continuous Learning and Adaptation
Data science is an ever-evolving field. Continuous learning is vital to stay current with the latest trends and technologies. ChatGPT can support users in various ways:
Stay Updated: Following industry news and advancements is essential. ChatGPT can summarize articles, suggest relevant blogs, and recommend thought leaders to follow.
Advanced Topics: Exploring advanced topics like deep learning, natural language processing, and big data analytics can set users apart. ChatGPT can recommend advanced courses and resources to dive deeper into these subjects.
Feedback and Improvement: Seeking feedback on projects and analyses is crucial for growth. ChatGPT can provide constructive feedback on data visualizations and models based on user inputs.
Conclusion
Becoming a data scientist is a rewarding journey filled with opportunities for growth and innovation. By harnessing the power of ChatGPT, aspiring data scientists can streamline their learning process, gain practical experience, and develop the skills necessary to excel in this dynamic field. With dedication, continuous learning, and the right resources, anyone can embark on a successful career in data science and contribute to the ever-expanding world of data-driven decision-making.
Spotify is among the world's top streaming platforms, with data science playing a critical role in personalizing user experiences, optimizing recommendations, and driving business decisions. Spotify's data scientists must analyze large datasets, recognize patterns, and draw meaningful insights. Here's a five-step guide to the essential skills and processes involved in the role of a Spotify data scientist, including data gathering, data cleaning, exploratory analysis, model building, and visualization.
Step 1: Data Gathering โ Collecting and Understanding the Data
The first and most crucial step in any data science process is gathering relevant data. At Spotify, data scientists work with various data types such as user listening history, song metadata, and platform interactions. The data is collected from multiple sources including user interaction logs, music track metadata, and external APIs. Spotify data scientists use platforms like Hadoop and Spark to handle and store data efficiently due to its large volume and need for scalability.
Key Techniques and Tools
Hadoop and Spark: To handle massive data streams.
SQL: For querying databases and performing data extraction.
Python: For managing datasets and preliminary analysis.
Step 2: Data Cleaning โ Preparing the Data for Analysis
Raw data is rarely ready for analysis right off the bat. Data cleaning is a crucial phase that involves filtering out incomplete, incorrect, or irrelevant data to ensure accuracy. For example, Spotify data scientists may remove duplicate songs, clean incomplete user profiles, or format timestamps.
Key Techniques and Tools
Python libraries (e.g., Pandas): For cleaning, filtering, and organizing data.
Regular Expressions (Regex): For text data cleaning.
Handling Missing Values: By techniques like interpolation or mean imputation.
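To make these techniques concrete, here is a small, purely illustrative pandas sketch; the file and column names are hypothetical and do not describe Spotify's actual pipeline:
import pandas as pd

plays = pd.read_csv("listening_history.csv")                                # hypothetical export
plays = plays.drop_duplicates(subset=["user_id", "track_id", "played_at"])  # remove duplicate play records
plays["played_at"] = pd.to_datetime(plays["played_at"], errors="coerce")    # normalize timestamps
plays["ms_played"] = plays["ms_played"].fillna(plays["ms_played"].mean())   # mean imputation for gaps
plays = plays.dropna(subset=["user_id", "track_id"])                        # drop rows missing required keys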
Step 3: Exploratory Data Analysis (EDA) โ Identifying Patterns and Trends
EDA is essential for understanding the data's structure and identifying any underlying trends. Spotify data scientists might analyze user behavior by examining listening habits, peak streaming times, or song genre preferences. This phase helps generate hypotheses and prepare the dataset for model building.
Key Techniques and Tools
Matplotlib and Seaborn: For creating visualizations like histograms and scatter plots.
Feature Engineering: Generating new variables that capture significant patterns in data.
Statistical Analysis: Using basic statistics to detect outliers and establish relationships.
Step 4: Model Building โ Creating Algorithms to Make Predictions
The core of Spotify's personalized recommendations lies in machine learning models that predict user preferences. Spotify data scientists utilize collaborative filtering, natural language processing (NLP), and neural networks to build recommendation systems. A/B testing is also often employed to evaluate different model configurations.
Key Techniques and Tools
Scikit-Learn and TensorFlow: For building machine learning models.
Collaborative Filtering: To find patterns in user preferences based on listening history.
NLP: For processing song lyrics and generating playlists that fit user tastes.
Step 5: Visualization and Reporting โ Communicating Insights
After building and fine-tuning models, data scientists at Spotify present their findings to various stakeholders. Visualization tools are crucial in making the results understandable and actionable. Spotify data scientists use dashboards and visual reports to display trends, model accuracy, and recommendations.
Key Techniques and Tools
Tableau and PowerBI: For interactive dashboards and reports.
Presentation Skills: To communicate findings effectively to non-technical audiences.
Visualization Techniques: Like heatmaps, line charts, and bar charts.
Conclusion
A Spotify data scientist's role is both challenging and rewarding, with each of the five steps being integral to the entire data science workflow. Mastering each step helps data scientists provide Spotify users with personalized recommendations and the best possible experience. By developing skills in data gathering, cleaning, EDA, model building, and visualization, aspiring data scientists can make an impactful contribution to music streaming innovation at Spotify.
In today's fast-paced world, AI tools like ChatGPT have become essential for streamlining daily tasks, solving problems, and enhancing creativity. One of the most valuable features of ChatGPT is its ability to iterate: you can refine and adjust prompts to get the most useful response. This essay explores ten iterative ChatGPT prompts that I use every day, highlighting their flexibility and practicality in various contexts, from work productivity to personal growth.
1. Task Prioritization
A daily iterative prompt I use is: “Help me organize my to-do list for today.”
Initially, ChatGPT provides a simple task list. However, by iterating the prompt, for instance asking it to prioritize based on deadlines, effort, or urgency, I can refine the list and have the most pressing tasks at the top. This iterative process ensures I'm focusing on what matters most throughout the day.
2. Content Brainstorming
When brainstorming for new ideas, I might begin with: “Give me 10 ideas for my next blog post on web design.”
After reviewing the suggestions, I iterate by adding constraints like: “Focus on trending web design techniques for 2024.” This refinement narrows the focus to relevant, timely topics, improving the quality of the suggestions as they evolve with each iteration.
3. Coding Assistance
One of the prompts I regularly use is: “How can I fix this Python error?”
When ChatGPT provides a general solution, I iterate by refining my request: "What if I'm using a different library, like pandas?" This iterative approach helps me get to a more precise solution tailored to my coding environment, saving me time on troubleshooting.
4. Writing Enhancement
For writing improvement, I start with: “Help me improve this paragraph.”
ChatGPT's initial suggestions might be broad, so I iterate by asking: "Can you make it sound more formal or academic?" The step-by-step refinements ensure the text meets the tone, clarity, and style I need, especially for professional or creative writing.
5. Learning New Concepts
To learn new topics, I often begin with a general prompt: “Explain the basics of machine learning.”
Afterward, I refine it by asking: "Can you explain it in simpler terms, like I'm a beginner?" This iterative prompting adjusts the complexity of the explanation based on my understanding, making it easier to grasp difficult concepts.
6. Language Translation and Localization
When dealing with international clients, I might prompt: “Translate this sentence into French.”
If I need to localize it for a specific region, I'll iterate: "Can you make it sound natural for a French audience from Paris?" This helps ensure the translation feels authentic and contextually appropriate.
7. Personal Growth and Reflection
A common daily prompt is: “What are three things I can do to improve my productivity?”
After seeing general suggestions, I iterate by adding context: “What can I do to improve productivity while working from home?” The personalization makes the advice more actionable and relevant to my current situation.
8. Social Media Strategy
For digital marketing, I often use: “Suggest content ideas for my Instagram business page.”
As I iterate by specifying target audience or industry: “Focus on content for a web design company targeting startups,” the responses become more tailored, helping me craft an effective content strategy.
Conclusion
Using iterative prompts with ChatGPT allows me to tap into its vast capabilities more effectively, making everyday tasks smoother and more efficient. From personal productivity to complex decision-making, these prompts become more refined with each iteration, ensuring the AI's responses are not only relevant but also actionable. The key to maximizing ChatGPT's potential lies in constant refinement: an iterative dialogue that leads to better outcomes over time.
Data science has emerged as one of the most sought-after fields in recent years, and Python has become its most popular programming language. Python's versatility, simplicity, and vast library ecosystem have made it the go-to language for data analysis, machine learning, and automation. However, mastering Python is not just about knowing syntax or using basic libraries. To truly excel, data scientists must be adept in certain key Python functions. These functions enable efficient data handling, manipulation, and analysis, helping professionals extract meaningful insights from vast datasets. Without mastering these core functions, data scientists risk falling behind in a fast-paced, data-driven world.
1. The map(), filter(), and reduce() Trio
A strong understanding of Python's functional programming functions (map(), filter(), and reduce()) is essential for any data scientist. These functions allow efficient manipulation of data in a clear and concise manner.
map() applies a function to every element in a sequence, making it extremely useful when transforming datasets. Instead of using loops, map() streamlines the code, improving readability and performance.
filter() selects elements from a dataset based on a specified condition, making it a powerful tool for cleaning data by removing unwanted entries without needing verbose loop structures.
reduce() applies a rolling computation to sequential pairs in a dataset, which is vital in scenarios like calculating cumulative statistics or combining results from multiple sources.
While some may think of these functions as "advanced," mastering them is a mark of efficiency and proficiency in data manipulation, an everyday task for a data scientist.
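As a quick, self-contained illustration of the trio (the numbers are made up):
from functools import reduce

readings = [12.0, None, 7.5, 3.0, None, 9.5]
cleaned = list(filter(lambda x: x is not None, readings))   # filter(): drop missing entries
scaled = list(map(lambda x: x * 10, cleaned))               # map(): transform every element
total = reduce(lambda acc, x: acc + x, scaled, 0)           # reduce(): rolling cumulative sum
print(cleaned, scaled, total)                               # [12.0, 7.5, 3.0, 9.5] [120.0, 75.0, 30.0, 95.0] 320.0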
2. pandas Core Functions: apply(), groupby(), and merge()
Data manipulation is one of the most critical aspects of a data scientist's role, and Python's pandas library is at the heart of this task. Among the various functions in pandas, three stand out as indispensable: apply(), groupby(), and merge().
apply() allows for custom function applications across DataFrame rows or columns, granting tremendous flexibility. It is an essential tool when data scientists need to implement more complex transformations that go beyond simple arithmetic operations.
groupby() enables data aggregation and summarization by grouping datasets based on certain criteria. This function is invaluable for statistical analysis, giving data scientists the power to uncover trends and patterns in datasets, such as sales grouped by region or average purchase value segmented by customer demographics.
merge() is vital for combining datasets, which is common when working with multiple data sources. It allows for seamless data integration, enabling large datasets to be merged, concatenated, or joined based on matching keys. Mastery of this function is crucial for building complex datasets necessary for thorough analysis.
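A compact sketch of the three functions working together (the tiny DataFrames and names are invented for illustration):
import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South"], "units": [10, 4, 7], "price": [2.5, 2.5, 3.0]})
regions = pd.DataFrame({"region": ["North", "South"], "manager": ["Ana", "Bo"]})
sales["revenue"] = sales.apply(lambda row: row["units"] * row["price"], axis=1)   # apply(): custom row-wise logic
by_region = sales.groupby("region")["revenue"].sum().reset_index()                # groupby(): aggregate per group
report = by_region.merge(regions, on="region", how="left")                        # merge(): join on a shared key
print(report)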
3. numpy Functions: reshape(), arange(), and linspace()
The numpy library, central to scientific computing in Python, provides data scientists with powerful tools for numerical operations. Three functions (reshape(), arange(), and linspace()) are particularly crucial when dealing with arrays and matrices.
reshape() allows data scientists to change the shape of arrays without altering their data, a common requirement when working with multidimensional data structures. This function is essential for preparing data for machine learning models, where input formats must often conform to specific dimensions.
arange() generates arrays of evenly spaced values, providing a flexible way to create sequences of numbers without loops. It simplifies the process of generating datasets for testing algorithms, such as creating a series of timestamps or equally spaced intervals.
linspace() also generates evenly spaced numbers but allows for greater control over the number of intervals within a specified range. This function is frequently used in mathematical simulations and modeling, enabling data scientists to fine-tune their analyses or visualize results with precision.
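A short illustration of all three:
import numpy as np

grid = np.arange(12)              # 0, 1, ..., 11 at unit spacing
matrix = grid.reshape(3, 4)       # the same data viewed as a 3x4 array
ticks = np.linspace(0.0, 1.0, 5)  # exactly 5 evenly spaced values: 0.0, 0.25, 0.5, 0.75, 1.0
print(matrix.shape, ticks)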
4. matplotlib Functions: plot(), scatter(), and hist()
Data visualization is an integral part of a data scientist's job, and matplotlib is one of the most commonly used libraries for this task. Three core functions that data scientists must master are plot(), scatter(), and hist().
plot() is the foundation for creating line graphs, which are often used to show trends or compare data over time. It's a must-have tool for any data scientist looking to communicate insights effectively.
scatter() is essential for plotting relationships between two variables. Understanding how to use this function is vital for visualizing correlations, which can be the first step in building predictive models.
hist() generates histograms, which are key to understanding the distribution of a dataset. This function is particularly important in exploratory data analysis (EDA), where understanding the underlying structure of data can inform subsequent modeling approaches.
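A minimal sketch that produces all three chart types from the same synthetic data:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.arange(50)
y = 0.5 * x + rng.normal(size=50)            # noisy upward trend (made-up data)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, y)                           # plot(): trend over time
axes[1].scatter(x, y)                        # scatter(): relationship between two variables
axes[2].hist(y, bins=10)                     # hist(): distribution of the values
plt.show()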
5. itertools Functions: product(), combinations(), and permutations()
The itertools library in Python is a lesser-known but highly powerful toolset for data scientists, especially in scenarios that require combinatorial calculations.
product() computes the Cartesian product of input iterables, making it useful for generating combinations of features, configurations, or hyperparameters in machine learning workflows.
combinations() and permutations() are fundamental for solving problems where the arrangement or selection of elements is important, such as in optimization tasks or feature selection during model development.
Mastering these functions significantly reduces the complexity of code needed to explore multiple possible configurations or selections of data, providing data scientists with deeper flexibility in problem-solving.
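For instance, a small hyperparameter and feature-selection sketch (all values are invented):
from itertools import product, combinations, permutations

learning_rates = [0.01, 0.1]
depths = [3, 5]
grid = list(product(learning_rates, depths))    # product(): every (learning rate, depth) pairing, 4 configurations
features = ["age", "income", "tenure"]
pairs = list(combinations(features, 2))         # combinations(): unordered feature pairs, 3 of them
orderings = list(permutations(features, 2))     # permutations(): ordered pairs, 6 of them
print(grid, pairs, orderings, sep="\n")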
Conclusion
The field of data science requires not only an understanding of statistical principles and machine learning techniques but also mastery over the programming tools that make this analysis possible. Python's built-in functions and libraries are essential for any data scientist's toolbox, and learning to use them effectively is non-negotiable for success. From the efficiency of map() and filter() to the powerful data manipulation capabilities of pandas, these functions allow data scientists to perform their job faster and more effectively. By mastering these functions, data scientists can ensure they remain competitive and excel in their careers, ready to tackle increasingly complex data challenges.
In the world of large-scale data infrastructure, Netflix has consistently pioneered innovations to meet its vast global audience's demands. One of its most recent undertakings involves the introduction of a key-value data abstraction layer, a significant milestone in how the company handles the staggering amount of data its platform processes daily. This layer is not merely an optimization: it represents a fundamental rethinking of how Netflix organizes, accesses, and scales its data.
At its core, Netflix's key-value data abstraction layer is designed to address the complexities of storing and retrieving data across a distributed environment. The idea behind this abstraction is simple but powerful: it allows various applications and services within Netflix to interact with data in a uniform way, without worrying about the underlying infrastructure. Developers don't need to concern themselves with which specific database or storage system their data is being written to or read from. Instead, they interact with a high-level API that abstracts these details away, allowing for greater flexibility and scalability.
To understand why Netflix needed to build this abstraction layer, it's essential to grasp the challenges they face in managing data at scale. Netflix operates in over 190 countries and streams billions of hours of content to millions of users every day. This means that their databases must handle an extraordinary volume of requests and data updates in real time. Moreover, the company uses multiple storage technologies, everything from relational databases to NoSQL systems to object storage solutions, each suited to specific tasks. Coordinating data across these disparate systems, ensuring consistency, and scaling seamlessly as the number of users grows are formidable challenges.
Traditionally, different teams at Netflix would pick the database technology that best fit their use case. While this approach works well for ensuring performance for specific tasks, it leads to a fragmented system where each service or application must be tightly coupled with its data store. This fragmentation complicates the work of developers, who must become experts in the intricacies of multiple database systems, and of operations teams, who must maintain and optimize a diverse and sprawling infrastructure.
The key-value data abstraction layer was conceived as a solution to this fragmentation. By abstracting away the specifics of the underlying data stores, Netflix can centralize control over how data is stored and retrieved while still offering the flexibility that individual services require. Developers can request or store data by using simple key-value pairs, and the abstraction layer ensures that these requests are directed to the appropriate storage system. Whether the data resides in a high-speed in-memory cache, a traditional relational database, or a distributed NoSQL system, the abstraction layer seamlessly bridges the gap.
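The article does not include Netflix's implementation; purely to make the idea of a uniform key-value interface concrete, a toy sketch might look like the following, where every name is hypothetical and none of this is Netflix's actual API:
from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    # Hypothetical uniform interface: callers never know which backend sits behind it.
    @abstractmethod
    def get(self, namespace: str, key: str) -> Optional[bytes]: ...
    @abstractmethod
    def put(self, namespace: str, key: str, value: bytes) -> None: ...

class InMemoryStore(KeyValueStore):
    # One possible backend; a cache- or NoSQL-backed class could implement the same interface.
    def __init__(self):
        self._data = {}
    def get(self, namespace, key):
        return self._data.get((namespace, key))
    def put(self, namespace, key, value):
        self._data[(namespace, key)] = value

store: KeyValueStore = InMemoryStore()              # services depend on the abstraction, not a database
store.put("viewing-history", "user:42", b"serialized-profile-bytes")
print(store.get("viewing-history", "user:42"))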
The abstraction layer also plays a critical role in enhancing the resilience of Netflix's systems. By decoupling services from specific data stores, Netflix can shift data around in the background without affecting the user experience. For example, if a particular database is experiencing high traffic or failures, the abstraction layer can reroute requests to another storage system or a backup replica. This flexibility is vital in a service that demands near-perfect uptime: users expect to stream their favorite shows or movies without delay, regardless of what's happening behind the scenes.
In addition to improving reliability and scalability, Netflix's key-value data abstraction layer also optimizes data locality. With a global user base, Netflix needs to ensure that users can access data as quickly as possible, no matter where they are in the world. The abstraction layer supports dynamic routing of data requests, ensuring that data is served from geographically appropriate storage locations. This minimizes latency and improves the overall quality of the streaming experience.
A crucial part of the development process for this system involved extensive collaboration across teams. Engineers needed to ensure that the abstraction layer could work across Netflix's vast array of services without introducing performance bottlenecks. Achieving this required close coordination between Netflix's data infrastructure teams, who maintain the backend systems, and the developers working on consumer-facing features. Moreover, Netflix's culture of innovation meant that the system had to be designed with flexibility in mind: it needed to accommodate future changes in technology and infrastructure without requiring a complete overhaul.
As Netflix continues to grow and innovate, the key-value data abstraction layer stands as a testament to the company's forward-thinking approach to data management. It allows Netflix to keep pace with increasing demand while maintaining a seamless, high-performance experience for users. It simplifies the work of developers, who can now build applications without worrying about the complexities of database management. And it enhances the overall reliability of Netflix's service by providing the flexibility to adapt to any challenges that arise in the future.
Conclusion
This key-value data abstraction layer may not be visible to the average Netflix user, but it is a critical piece of the platform's ability to scale and innovate. By decoupling services from specific databases and abstracting the complexity of data storage, Netflix has built a robust, flexible system that will serve it well as it continues to push the boundaries of online streaming technology.
Data dashboards are indispensable tools in today's data-driven world. They allow users to visualize, interact with, and make sense of large volumes of information quickly. However, creating a great dashboard is more than just compiling graphs and charts. A well-crafted dashboard tells a compelling story through clear, concise, and insightful data representations.
In this article, we will explore how to elevate your dashboard design from good to unmissable, with practical tips and essential principles.
1. Understanding the Purpose
Before designing a dashboard, it's crucial to ask yourself two important questions:
Who is the audience?
What is the primary purpose of the dashboard?
A dashboard meant for executives, for instance, should focus on high-level KPIs (Key Performance Indicators) that provide a quick overview of business performance, while a dashboard for data analysts might need more granular and interactive data.
2. Data Prioritization and Structure
To avoid overwhelming users with too much information, the data should be organized into a hierarchy of importance. Start with the most crucial insights at the top of the dashboard and include more detailed data further down or as interactive elements. This structure not only keeps the dashboard clean but also ensures users can quickly find what they need.
Best practices:
Top-left positioning: Place critical data in the top-left area, as it's typically the first place the eye goes.
Progressive disclosure: Show high-level data first, and allow users to drill down into the details if necessary.
3. Choose the Right Visualizations
Choosing the right type of chart or graph is essential to conveying your data accurately and efficiently. Each type of visualization has its strengths and weaknesses, and selecting the wrong one can lead to confusion or misinterpretation of the data.
Visualization Options:
Line charts: Ideal for showing trends over time.
Bar charts: Great for comparing quantities.
Pie charts: Best used for showing proportions (but avoid overuse).
Heat maps: Excellent for showing intensity and variations in large datasets.
Gauges and KPIs: Suitable for tracking performance against targets.
4. Keep It Simple and Minimalist
Simplicity is the key to great design. A cluttered dashboard can overwhelm the user and obscure key insights. Stick to minimalist principles and ensure every element on the dashboard serves a purpose. Use whitespace effectively to create balance and focus attention on the most important data.
Design tips:
Limit the number of colors: Stick to a consistent color palette, using colors only to highlight important data or categories.
Avoid excessive text: Use concise labels and tooltips for added clarity without overwhelming the visual space.
Interactive elements: Allow users to interact with the dashboard to reveal more details rather than showing everything at once.
5. Interactivity Enhances User Engagement
Interactivity allows users to explore data dynamically rather than passively consuming static visuals. Adding filters, drill-downs, and hover-over effects helps users engage with the data at a deeper level, enabling them to find the insights most relevant to their specific needs.
Interactive elements to consider:
Drill-downs: Clicking on a metric or graph should reveal more detailed data.
Filters: Allow users to filter data by date, category, or other variables.
Hover-over tooltips: Provide additional information without cluttering the dashboard.
6. Maintain Consistency and Brand Identity
A dashboard that aligns with the company's branding and design language not only looks professional but also enhances the user experience. Use consistent fonts, colors, and style elements across all charts, graphs, and labels. This reduces cognitive load, making it easier for users to navigate and understand the data.
Branding tips:
Use company colors for graphs and visual elements.
Custom fonts: Use fonts that are in line with the brand guidelines.
Logos and Icons: Incorporate company logos or icons subtly in the header or footer of the dashboard.
7. Test and Iterate
Even the best-designed dashboards may require tweaking once they are in the hands of users. Collect feedback, observe how users interact with your dashboard, and iterate based on their experiences. Usability testing is essential to identify any pain points or areas where the design can be improved for clarity and efficiency.
Testing methods:
User feedback: Conduct interviews or surveys with your users.
Usage analytics: Track how users interact with the dashboard, identifying popular sections and drop-off points.
A/B testing: Compare different versions of a dashboard to see which performs better in terms of user engagement.
Conclusion
Mastering dashboard design requires a blend of understanding user needs, prioritizing key data, choosing appropriate visualizations, and adhering to design principles like simplicity, consistency, and interactivity. By following these best practices, you can elevate your dashboards from functional to unmissable, delivering not only data but actionable insights that drive decision-making.
As data science continues to be a critical driver of innovation and decision-making in organizations, the need for structured, scalable, and effective management of data science talent is more important than ever. One tool that organizations can use to ensure that data science teams are aligned with business goals and equipped with the right skills is a competency framework.
A competency framework outlines the knowledge, skills, behaviors, and proficiencies required for individuals to succeed in their roles within an organization. In the context of a data science team, it serves as a roadmap for talent development, performance evaluation, and hiring practices. Here's a step-by-step guide to building an effective competency framework for your data science teams.
1. Understand the Business Needs
Before diving into the technical competencies, it's essential to start with a clear understanding of the business objectives that your data science team supports. Consider the following questions:
What are the strategic priorities of your organization?
How does the data science team contribute to these priorities?
What future projects or initiatives will the team be expected to tackle?
Understanding these elements will help you align the competencies with organizational goals and ensure that your data science team is capable of driving meaningful outcomes.
2. Define Core Competencies
Data science is a multidisciplinary field, so your competency framework must capture various skill sets. The competencies can be divided into technical skills, business acumen, and soft skills.
Technical Skills
These are the foundational skills that every data scientist must have.
Common technical competencies include:
Programming Languages: Proficiency in languages like Python, R, and SQL is essential.
Statistical Analysis: Understanding of probability, distributions, and hypothesis testing.
Machine Learning: Knowledge of algorithms such as regression, clustering, classification, and deep learning.
Data Wrangling: Skills in cleaning, transforming, and organizing data for analysis.
Data Visualization: Ability to create impactful visualizations using tools like Tableau, Power BI, or Matplotlib.
Business Acumen
The ability to understand how data insights align with business goals is crucial.
Key competencies include:
Domain Knowledge: Understanding the industry and specific business processes the organization operates within.
Problem-Solving: Framing data problems in a way that is relevant to business objectives.
Communication: Translating technical insights into clear and actionable business recommendations.
Soft Skills
While technical and business skills are key, soft skills ensure team collaboration and leadership. Key areas include:
Collaboration: Working effectively with cross-functional teams.
Leadership: Leading projects, mentoring junior data scientists, and setting the technical direction.
Adaptability: Ability to work in a fast-paced, constantly evolving data landscape.
3. Establish Proficiency Levels
Once the core competencies are defined, the next step is to establish proficiency levels for each competency. Proficiency levels help assess team membersโ growth and provide a framework for career progression. Typical levels may include:
Beginner: Has a basic understanding of the skill but requires supervision and mentorship.
Intermediate: Can apply the skill independently in a variety of contexts.
Advanced: Demonstrates expertise in the skill and can mentor others.
Expert: Recognized authority in the field; can drive innovation and create best practices.
These levels should be clearly defined so that each team member knows what is expected at each stage of their career.
4. Conduct Skills Assessment
After defining competencies and proficiency levels, it's important to assess your team's current capabilities. This can be done through self-assessments, manager evaluations, or more formal performance assessments.
The key is to identify skill gaps both at the individual and team level. This will provide valuable insights into the areas where further development is required, helping to tailor professional development plans and optimize hiring strategies.
5. Create Development Plans
A competency framework should serve as more than just a tool for performance evaluation; it should also be a basis for career development. Based on the skills assessment, create individualized development plans that:
Identify key areas for improvement.
Offer relevant training or learning opportunities (e.g., online courses, certifications, mentorship).
Establish clear career paths that align individual ambitions with team goals.
In addition to focusing on the technical side, development plans should also encourage the cultivation of leadership, communication, and other critical soft skills.
6. Integrate the Framework into Hiring and Performance Management
Once the competency framework is developed, it can be integrated into hiring and performance management processes. Use the defined competencies and proficiency levels to:
Guide Hiring: Develop interview questions and assessments that are aligned with your competency framework. This ensures that new hires possess the necessary skills to be successful in their roles.
Set Performance Metrics: Define clear performance metrics that are based on the competencies and proficiency levels. This will help ensure that performance reviews are objective and aligned with both individual and team goals.
Career Advancement: Use the framework to outline clear career paths and promotions based on proficiency levels in key competencies.
7. Review and Iterate the Framework
Finally, a competency framework is not a static tool. The field of data science evolves rapidly, and so too should your framework. Regularly review and update the competencies, incorporating new technologies, methodologies, and business needs.
Annual Reviews: Conduct an annual review of the framework to ensure it still aligns with organizational goals.
Stakeholder Feedback: Gather feedback from team members, managers, and business leaders to continually refine the framework.
Stay Current: Keep pace with industry trends, such as advancements in AI, machine learning, and data engineering, to ensure your team remains competitive.
Conclusion
Building a competency framework for data science teams provides clarity around expectations, drives professional development, and ensures alignment with business goals. By identifying the right mix of technical skills, business knowledge, and soft skills, and continuously updating the framework, you can cultivate a high-performing data science team that is equipped to meet the challenges of today's data-driven world.
In the fields of data science and machine learning, understanding and working with data is crucial. Data structures are the foundation of how we store, organize, and manipulate data. Whether you're working on a simple machine learning model or a large-scale data pipeline, choosing the right data structure can impact the performance, efficiency, and scalability of your solution. Below are the key data structures that every data scientist and machine learning engineer should know.
1. Arrays
Arrays are one of the most basic and commonly used data structures. They store elements of the same data type in contiguous memory locations. In machine learning, arrays are often used to store data points, feature vectors, or image pixel values. NumPy arrays (ndarrays) are particularly important for scientific computing in Python due to their efficiency and ease of use.
Key features:
Fixed size
Direct access via index
Efficient memory usage
Support for mathematical operations with libraries like NumPy
Use cases in ML/DS:
Storing input data for machine learning models
Efficient numerical computations
Operations on multi-dimensional data like images and matrices
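A brief illustration with a made-up feature matrix:
import numpy as np

features = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])                   # 3 samples x 2 features (made-up values)
print(features[0], features.shape)                                          # direct index access: [5.1 3.5] (3, 2)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)    # vectorized math, no Python loops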
2. Lists
Python's built-in list data structure is dynamic and can store elements of different types. Lists are versatile and support various operations like insertion, deletion, and concatenation.
Key features:
Dynamic size (can grow or shrink)
Can store elements of different types
Efficient for sequential access
Use cases in ML/DS:
Storing sequences of variable-length data (e.g., sentences in NLP)
Maintaining collections of data points during exploratory data analysis
Buffering batches of data for training
3. Stacks and Queues
Stacks and queues are linear data structures that organize elements based on specific order principles. Stacks follow the LIFO (Last In, First Out) principle, while queues follow FIFO (First In, First Out).
Stacks are used in algorithms like depth-first search (DFS) and backtracking. Queues are important for tasks requiring first-come-first-serve processing, like breadth-first search (BFS) or implementing pipelines for data streaming.
Key features:
Stack: LIFO, useful for recursion and undo functionality
Queue: FIFO, useful for sequential task execution
Use cases in ML/DS:
DFS/BFS in graph traversal algorithms
Managing tasks in processing pipelines (e.g., loading data in batches)
Backtracking algorithms used in optimization problems
4. Hash Tables (Dictionaries)
Hash tables store key-value pairs and offer constant-time average complexity for lookups, insertions, and deletions. In Python, dictionaries are the most common implementation of hash tables.
Key features:
Fast access via keys
No fixed size, grows dynamically
Allows for quick lookups, making it ideal for caching
Use cases in ML/DS:
Storing feature-to-index mappings in NLP tasks (word embeddings, one-hot encoding)
Caching intermediate results in dynamic programming solutions
Counting occurrences of data points (e.g., word frequencies in text analysis)
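A tiny word-frequency example shows the pattern:
word_counts = {}
for word in "to be or not to be".split():
    word_counts[word] = word_counts.get(word, 0) + 1          # average O(1) lookup and update per word
print(word_counts)                                            # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
vocab = {word: idx for idx, word in enumerate(word_counts)}   # feature-to-index mapping for encoding steps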
5. Sets
A set is an unordered collection of unique elements, which allows for fast membership checking, insertions, and deletions. Sets are useful when you need to enforce uniqueness or compare different groups of data.
Key features:
Only stores unique elements
Fast membership checking
Unordered, with no duplicate entries
Use cases in ML/DS:
Removing duplicates from datasets
Identifying unique values in a column
Performing set operations like unions and intersections (useful in recommender systems)
6. Graphs
Graphs represent relationships between entities (nodes/vertices) and are especially useful in scenarios where data points are interconnected, like social networks, web pages, or transportation systems. Graphs can be directed or undirected and weighted or unweighted, depending on the relationships they model.
Key features:
Consists of nodes (vertices) and edges (connections)
Can represent complex relationships
Efficient traversal using algorithms like DFS and BFS
Use cases in ML/DS:
Modeling relationships in social network analysis
Representing decision-making processes in algorithms
Graph neural networks (GNNs) for deep learning on graph-structured data
Route optimization and recommendation systems
7. Heaps (Priority Queues)
Heaps are specialized tree-based data structures that efficiently support priority-based element retrieval. A heap maintains the smallest (min-heap) or largest (max-heap) element at the top of the tree, making it easy to extract the highest or lowest priority item.
Key features:
Allows quick retrieval of the maximum or minimum element
Efficient insertions and deletions while maintaining order
Use cases in ML/DS:
Implementing priority-based algorithms (e.g., Dijkstra's algorithm for shortest paths)
Managing queues in real-time systems and simulations
Extracting the top-k elements from a dataset
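For example, with Python's built-in heapq module (the scores and task names are made up):
import heapq

scores = [0.31, 0.92, 0.17, 0.88, 0.64, 0.95]
print(heapq.nlargest(3, scores))                 # top-3 elements: [0.95, 0.92, 0.88]
tasks = [(2, "retrain model"), (1, "ingest data"), (3, "publish report")]
heapq.heapify(tasks)                             # min-heap keyed on the priority number
print(heapq.heappop(tasks))                      # (1, 'ingest data') comes out first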
8. Trees
Trees are hierarchical data structures made up of nodes connected by edges. Binary trees, binary search trees (BSTs), and decision trees are some of the commonly used variations in machine learning.
Key features:
Nodes with parent-child relationships
Supports efficient searching, insertion, and deletion
Binary search trees allow for ordered data access
Use cases in ML/DS:
Decision trees and random forests for classification and regression
Storing hierarchical data (e.g., folder structures, taxonomies)
Optimizing search tasks using BSTs
9. Matrices
Matrices are a specific type of 2D array that is crucial for handling mathematical operations in machine learning and data science. Matrix operations, such as multiplication, addition, and inversion, are central to many algorithms, including linear regression, neural networks, and PCA.
Key features:
Efficient for representing and manipulating multi-dimensional data
Supports algebraic operations like matrix multiplication and inversion
Use cases in ML/DS:
Storing and manipulating input data for machine learning models
Representing and transforming data in linear algebra-based algorithms
Performing operations like dot products and vector transformations
10. Tensors
Tensors are multi-dimensional arrays, and they are generalizations of matrices to higher dimensions. In deep learning, tensors are essential as they represent inputs, weights, and intermediate calculations in neural networks.
Key features:
Generalization of matrices to n-dimensions
Highly efficient in storing and manipulating multi-dimensional data
Supported by libraries like TensorFlow and PyTorch
Use cases in ML/DS:
Representing data in deep learning models
Storing and updating neural network weights
Performing backpropagation in gradient-based optimization methods
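A short sketch using PyTorch, assuming it is installed; the shapes and random data are arbitrary placeholders:
import torch

batch = torch.randn(8, 3, 32, 32)                      # 8 RGB images of 32x32 pixels (random placeholder data)
weights = torch.randn(10, 3 * 32 * 32, requires_grad=True)
flat = batch.reshape(8, -1)                            # flatten each image into a 3072-dimensional vector
logits = flat @ weights.T                              # (8, 10) scores, one row per image
loss = logits.sum()
loss.backward()                                        # backpropagation fills weights.grad
print(weights.grad.shape)                              # torch.Size([10, 3072])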
Conclusion
Understanding these data structures and their use cases can greatly enhance a data scientist's or machine learning engineer's ability to develop efficient, scalable solutions. Selecting the appropriate data structure for a given task ensures that algorithms perform optimally, both in terms of time complexity and memory usage. For anyone serious about working in data science and machine learning, building a strong foundation in these data structures is essential.
In today's data-driven world, businesses are constantly generating vast amounts of data. However, much of this data is disorderly: unstructured, noisy, and difficult to analyze. Traditional data analysis techniques often struggle with such messy data. Enter Generative AI, an innovative approach capable of transforming disorderly data into actionable insights. This article delves into how generative AI is revolutionizing the field of data analytics, making sense of complex datasets that were previously challenging to work with.
1. Understanding Disorderly Data
Disorderly data, also known as unstructured data, includes information that doesn’t fit neatly into databases. Examples include text documents, images, social media posts, and even audio or video files. Unlike structured data (such as spreadsheets), disorderly data lacks a predefined format, making it harder to process using traditional algorithms.
2. Challenges in Extracting Insights from Disorderly Data
Disorderly data poses several challenges:
Volume and Variety: The sheer volume and variety of disorderly data make it overwhelming for traditional analysis tools.
Ambiguity and Redundancy: Disorderly data often includes irrelevant or redundant information that complicates analysis.
Contextual Understanding: Extracting meaningful insights from disorderly data requires understanding context, a task that can be challenging for conventional algorithms.
This is where Generative AI comes into play, offering an efficient way to process and make sense of such data.
3. How Generative AI Handles Disorderly Data
Generative AI, powered by advanced algorithms like transformers and neural networks, excels in processing and understanding unstructured data. Here’s how it works:
Pattern Recognition: Generative AI models identify patterns in noisy data that might not be immediately apparent to human analysts.
Data Synthesis: It can generate new data based on learned patterns, filling in gaps, and offering deeper insights into hidden relationships.
Contextual Understanding: With natural language processing (NLP) and other capabilities, Generative AI can understand context in a more human-like manner.
Example Use Case: A retail company wants to analyze customer reviews (text data) to improve its product. Traditional analytics may struggle with the unstructured nature of reviews, but Generative AI can extract common sentiments, identify trends, and even predict future customer preferences.
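As a rough sketch of that review-analysis use case (assuming the Hugging Face transformers library is installed and a default sentiment model can be downloaded; the reviews themselves are invented):
from transformers import pipeline

# Downloads a default pretrained sentiment model on first use
sentiment = pipeline("sentiment-analysis")

reviews = [
    "The fabric feels cheap and the zipper broke after a week.",
    "Absolutely love it, fits perfectly and shipping was fast!",
]

for review, result in zip(reviews, sentiment(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)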
4. Key Techniques in Generative AI for Disorderly Data
Natural Language Processing (NLP): Used for extracting meaning from text-based disorderly data, NLP enables AI to process human language and extract key themes.
Image and Video Analysis: Generative models can analyze disorderly visual data, such as images and videos, to find hidden patterns and insights.
Reinforcement Learning: This technique allows generative AI to learn and adapt, refining its analysis of disorderly data over time.
5. Benefits of Using Generative AI for Disorderly Data
Faster Insights: Generative AI can process vast amounts of data quickly, turning disorderly datasets into usable insights within minutes or hours.
Scalability: Whether the dataset is small or massive, generative AI scales effortlessly, handling complex data scenarios that would overwhelm traditional systems.
Reduced Human Effort: By automating data analysis, businesses can reduce the need for extensive human intervention, freeing up resources for other critical tasks.
6. Future Implications of Generative AI in Data Analytics
As generative AI continues to evolve, its application in data analytics will become even more transformative. We can expect advances in the following areas:
Improved Data Augmentation: AI models will be able to generate synthetic data that complements existing disorderly datasets, enriching analysis.
Real-Time Insights: Generative AI will enable real-time insights from streaming data, such as live social media feeds or sensor data.
Greater Predictive Capabilities: By learning from disorderly data, generative AI will enhance its ability to predict trends and behaviors across industries.
Conclusion
Disorderly data, once seen as a challenge, is now a rich resource for actionable insights thanks to Generative AI. By leveraging advanced techniques such as NLP, pattern recognition, and data synthesis, businesses can now harness the power of unstructured data to gain a competitive edge. The future of data analytics lies in generative models that continue to evolve and adapt to the complexities of real-world data.
Generative AI not only makes sense of disorderly data but also unlocks its full potential, offering unprecedented opportunities for innovation and growth.
Creating content is only part of the challenge when it comes to writing articles. If no one reads your work, all the effort feels wasted. Many writers make simple but crucial mistakes that prevent their articles from reaching the audience they deserve. Here are the eight most common mistakes that might be keeping your articles from getting noticed:
1. Weak Headlines
The headline is the first thing readers see, and if it’s weak, they won’t bother clicking. Your headline needs to be compelling, clear, and intriguing. Avoid vague or generic titles like “Some Thoughts on Productivity” and opt for something more engaging like “10 Powerful Hacks to Boost Your Productivity in One Day.”
2. Ignoring SEO (Search Engine Optimization)
Even if your content is excellent, failing to optimize it for search engines means that it won’t appear when people search for related topics. Without proper keywords, meta descriptions, and appropriate use of headers (H1, H2, etc.), search engines might overlook your article, keeping it from potential readers. Doing keyword research and strategically placing those keywords throughout your article is essential.
3. Poor Structure and Formatting
Readers on the web skim articles before diving in. If your content is a large, unbroken block of text, it will intimidate and overwhelm them. Break your content into digestible sections with subheadings, bullet points, and short paragraphs. Adding visuals or relevant images can also make your article more inviting.
4. Not Writing for Your Audience
Understanding your target audience is crucial. If you don’t write in a way that addresses their specific interests, needs, or problems, they won’t feel connected to your article. Tailor your language, tone, and examples to suit the preferences of your readers. What might work for a tech-savvy audience may not appeal to a more casual reader base.
5. No Value or Originality
If your article doesn’t offer new insights or actionable advice, it’s likely to get lost among the countless similar pieces online. Readers are always looking for value, whether it’s practical tips, a fresh perspective, or in-depth knowledge. Avoid regurgitating common information, and strive to provide something unique or better than what’s already out there.
6. Failing to Promote Your Content
Publishing an article is just the first step. Many writers assume people will automatically find their work, but that’s rarely the case. Without proper promotion on social media, newsletters, and other platforms, your article will likely stay unnoticed. Make a habit of sharing your content multiple times and across different channels to increase visibility.
7. Overlooking Readability and Engagement
Complex or technical language can turn off readers, especially if the topic doesn’t demand it. Likewise, long, meandering sentences can make your article a chore to read. Keep your writing clear, concise, and conversational. Use engaging language that invites the reader to keep going. Asking questions or using storytelling techniques can also help.
8. Not Updating Old Content
Once an article is published, it’s easy to forget about it. But evergreen content (articles that remain relevant) can drive traffic long after they’re first published. If you neglect updating your content with the latest information, stats, or trends, readers might overlook it in favor of fresher resources. Regularly reviewing and updating your articles can help them stay visible and valuable.
Conclusion
Avoiding these common mistakes can significantly boost the visibility of your articles. Focus on strong headlines, SEO optimization, audience targeting, and promotion. Don’t forget to structure your articles for readability and keep offering value with unique insights. With a bit of strategy, your content can stand out in the crowded digital space!
Imagine getting paid to do something you love: reading books! While it may sound like a dream, there are legitimate websites and platforms offering substantial rewards for reading and reviewing books, sometimes paying over $500 per read. Here’s a closer look at how you can monetize your passion for books and turn it into a profitable side hustle or even a full-time gig.
1. Kirkus Media
Kirkus Reviews is well-known for its book reviews, especially for indie and self-published authors. They frequently seek talented readers to review unpublished manuscripts, and experienced reviewers can earn around $50-$500 per review depending on the length and complexity. Their demand for unbiased, critical reviews means they expect high-quality feedback.
How to Apply: Submit a resume, writing samples, and a cover letter to Kirkus Media.
Pay: $50-$500 depending on the book and review length.
2. The U.S. Review of Books
The U.S. Review of Books pays freelancers to write detailed book reviews. They accept applications from experienced writers and literary enthusiasts alike. Reviews are typically 250-300 words, and while not every book will yield $500, multiple reviews per month can add up to a significant side income.
How to Apply: Submit a sample review and resume.
Pay: Varies based on assignment; high-demand books can net you substantial pay.
3. Reedsy Discovery
Reedsy Discovery is a platform where reviewers can read and review upcoming books before they’re released. While the pay structure depends on tips from readers, popular reviewers on the platform can receive over $500 monthly, especially if they build a strong following and review frequently. Reviewers are given free access to advance copies of books.
How to Apply: Create a profile on Reedsy and submit sample reviews.
Pay: Based on tips and reputation, can exceed $500 per month.
4. Online Book Club
Online Book Club offers book lovers the chance to earn while reading and reviewing books. While the first few reviews may be unpaid, experienced members who provide high-quality feedback can earn significantly, with the potential for $60-$100 per review. Over time, consistent work can allow you to make more than $500.
How to Apply: Sign up on their platform, and begin reviewing books.
Pay: Up to $100 per review, depending on your experience and engagement.
5. BookBrowse
BookBrowse looks for in-depth reviews of fiction and non-fiction books. They are selective with their reviewers, focusing on quality. Though their rates may start lower, experienced reviewers can earn over $500 if they establish a solid reputation and regularly contribute high-quality reviews.
How to Apply: Join their team by submitting a resume and a sample of your writing.
Pay: Varies with potential for significant earnings over time.
6. NetGalley
NetGalley connects reviewers with publishers, giving them access to books before their release. Although NetGalley itself doesnโt pay for reviews, many freelance reviewers utilize the books they receive to review on platforms like Medium, personal blogs, or even self-publish their reviews. Combining these strategies can lead to substantial earnings, well over $500 if you publish consistently.
How to Apply: Sign up as a reviewer.
Pay: Indirect, depends on where you publish reviews.
7. WordsRated
WordsRated offers a unique way to get paid for reading. They are a research data organization that pays people to read books and track various details, such as character development and theme progression. While it’s more data collection than book reviewing, it’s a fascinating option for people who love reading and analyzing books.
How to Apply: Submit an application on their website.
Pay: Can range from $200 to over $500 depending on the project.
8. Booklist Online
Booklist, the review publication of the American Library Association, is constantly on the lookout for freelance book reviewers. Writers who produce detailed, thoughtful, and concise reviews can earn a decent amount for their efforts, with seasoned reviewers capable of making over $500 a month through consistent work.
How to Apply: Contact the editor and submit a sample of your work.
Pay: Varies with potential for steady earnings over time.
9. Women’s Review of Books
A publication focusing on books by and about women, this outlet pays freelance reviewers to read and critique books. Writers with experience in literary criticism, academia, or the publishing industry are especially in demand.
How to Apply: Submit your application along with samples of previous reviews.
Pay: Can reach up to $500 for high-demand assignments.
Tips to Maximize Your Earnings:
Consistency is Key: The more books you review, the more you can earn. Focus on building a portfolio of quality reviews.
Diversify Platforms: Write for multiple websites and platforms to increase your income streams.
Promote Your Reviews: Platforms like Reedsy and Online Book Club allow reviewers to earn tips. Engage with your audience to maximize your earnings.
Conclusion
If you’re passionate about reading and want to turn that passion into a profitable endeavor, these platforms offer exciting opportunities to get paid for reading books. While it might take some time to build up to earning $500 per book, with dedication and the right strategy, you can definitely turn reading into a lucrative side hustle.
In today’s data-driven world, networking is essential for data scientists looking to grow their careers. Whether you’re just starting out or already an experienced professional, building a strong network can open doors to new opportunities, collaborations, and insights. Here are some strategies to effectively network as a data scientist.
1. Join Data Science Communities and Forums
Becoming an active member of data science communities is one of the best ways to meet like-minded professionals. Online platforms such as Kaggle, Reddit’s data science community, or Stack Overflow allow you to share your work, ask for advice, and participate in discussions. These forums can also serve as a platform to showcase your expertise.
Suggestions:
Participate in Kaggle competitions.
Answer questions on Stack Overflow.
Engage in Reddit threads focused on data science topics.
2. Attend Data Science Meetups and Conferences
Attending meetups, webinars, and conferences can put you face-to-face with industry experts, recruiters, and other professionals. These events provide opportunities to exchange ideas, learn about new trends, and gain insights into how others are tackling challenges in the field. Major conferences like Strata Data Conference, KDD, or PyData are great places to start.
Tips:
Prepare a short introduction about yourself, highlighting your skills and interests.
Have a few questions ready for speakers and attendees to facilitate meaningful conversations.
Follow up with people you meet through LinkedIn or email.
3. Leverage LinkedIn
LinkedIn remains one of the most powerful platforms for professional networking. As a data scientist, keeping your profile updated with your latest projects, publications, and skills can attract recruiters, potential collaborators, or mentors. Joining data science groups and actively participating in discussions also helps build visibility.
Actionable Steps:
Post regularly about your projects, industry trends, or data science news.
Connect with other professionals, and personalize your connection requests with a short note.
Engage with content shared by others in the industry by liking, commenting, or sharing.
4. Contribute to Open-Source Projects
One of the most effective ways to build a network is through contributions to open-source projects. Contributing to libraries like TensorFlow, PyTorch, or pandas showcases your expertise while providing the chance to collaborate with experienced developers and data scientists.
How to Start:
Explore repositories on GitHub that interest you.
Start by fixing bugs, writing documentation, or adding new features.
Engage with the community of contributors and ask questions.
5. Collaborate on Projects
Collaborating with others on data science projects not only helps you build your portfolio but also expands your professional network. You can team up with other data scientists from online communities, boot camps, or meetups to work on real-world problems or open-source projects.
Where to Find Collaborators:
Join hackathons or data science competitions (e.g., Kaggle).
Reach out to peers in online forums, such as LinkedIn or GitHub, for project collaboration.
Participate in collaborative events like Datathons or sprints.
6. Engage with Thought Leaders
Following and engaging with thought leaders in the data science community is a great way to stay informed about the latest trends and advancements. Many influential data scientists share valuable content through blogs, podcasts, YouTube channels, and social media platforms. Commenting on their content or asking insightful questions can initiate meaningful exchanges.
Key Thought Leaders to Follow:
Andrew Ng (Coursera, AI pioneer)
Hilary Mason (Cloudera Fast Forward Labs)
Hadley Wickham (RStudio, tidyverse)
Ben Lorica (O’Reilly Media)
Engage with them on platforms like Twitter or by attending their webinars and talks.
7. Offer to Help or Mentor Others
Networking is a two-way street, and helping others is a great way to build long-lasting relationships. As you gain more experience, consider offering mentorship to newcomers or providing assistance in areas where others might struggle. Not only does this strengthen your network, but it also builds goodwill within the community.
Ways to Contribute:
Offer to review someone’s code or provide feedback on their portfolio.
Share resources that helped you learn or overcome challenges.
Provide mentorship through programs or boot camps.
Conclusion
Networking as a data scientist involves more than just attending events and collecting contacts. It’s about building meaningful, mutually beneficial relationships that can help you stay informed, find collaborators, and advance your career. By engaging with communities, contributing to open-source projects, and consistently interacting with professionals in the field, you can develop a strong network that will support your growth in the rapidly evolving world of data science.
In recent years, the path to a career in data science has become more flexible. Large tech companies, including Meta (formerly Facebook), increasingly recognize that skills, experience, and demonstrated expertise are just as important as formal education, if not more so. Here’s a guide on how you can land a data scientist position at Meta, even if you don’t have a traditional degree.
1. Develop Strong Foundations in Mathematics and Statistics
At the core of data science is mathematics, especially statistics and probability. These are essential for understanding data distributions, performing hypothesis testing, and building predictive models.
Self-study: Use free or affordable online resources, like Khan Academy or Coursera, to learn key mathematical concepts.
Practice problem-solving: Engage with platforms like Brilliant.org, which can help deepen your understanding of mathematical principles through interactive exercises.
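To make the hypothesis-testing point concrete, here is a two-sample t-test in a few lines of SciPy (assuming SciPy and NumPy are installed; both samples are simulated for illustration):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # e.g. time on page, variant A
group_b = rng.normal(loc=10.6, scale=2.0, size=200)  # variant B

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")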
2. Master Programming Skills
A data scientist’s primary tool is code, and Python and SQL are two languages you must master. Python is essential for data manipulation, analysis, and machine learning, while SQL is used for querying databases.
Python: Focus on libraries such as pandas (for data manipulation), NumPy (for numerical computing), and matplotlib or seaborn (for visualization). Scikit-learn is key for machine learning tasks.
SQL: Learn how to write complex queries and optimize them for performance.
R (Optional): While Meta primarily uses Python, R is another popular language in the data science community for statistical analysis.
Many resources are available, such as:
Codecademy and DataCamp offer interactive courses for both Python and SQL.
LeetCode and HackerRank provide coding challenges that will help you strengthen your problem-solving skills.
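A tiny sketch of the kind of task both languages are used for: a grouped aggregation done once with pandas and once with SQL against SQLite’s in-memory engine (pandas must be installed; sqlite3 ships with Python; the order data is invented):
import sqlite3
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "amount":  [20.0, 35.0, 15.0, 40.0, 10.0, 25.0],
})

# pandas: total spend per user
print(orders.groupby("user_id")["amount"].sum())

# SQL: the same aggregation against an in-memory SQLite database
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
print(pd.read_sql("SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id", conn))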
3. Gain Proficiency in Data Science Tools
In addition to programming, you’ll need hands-on experience with tools that data scientists use daily. These include:
Jupyter Notebooks: Essential for writing, testing, and sharing code in a readable format.
Tableau or Power BI: Visualization tools that allow you to turn raw data into easily digestible insights.
GitHub: For version control and collaborative coding. Create projects and contribute to open-source initiatives to showcase your work.
AWS, GCP, or Azure: Familiarity with cloud services is crucial, as many companies run large-scale data operations on cloud platforms.
4. Build a Strong Portfolio
Your portfolio will be your most powerful tool when applying for a data science job without a degree. Use it to showcase projects that demonstrate your skills, problem-solving abilities, and creativity. Key projects to consider include:
Predictive models: Create machine learning models that solve real-world problems. Examples include predictive analytics for financial markets, customer behavior, or recommendation systems.
Data visualizations: Use tools like Tableau or Plotly to turn complex datasets into easy-to-understand visual representations.
Kaggle Competitions: Participating in Kaggle data science competitions allows you to solve real-world data problems and gain recognition. Winning or ranking highly in these competitions can help you stand out.
Open-source contributions: Contribute to or build open-source projects related to data science.
5. Network and Build Connections
While skills and experience matter, networking plays an essential role in getting hired. Here’s how you can build connections:
Attend industry conferences and meetups: Events like PyData, Strata Data Conference, or Meetups focused on data science are great for networking.
LinkedIn: Follow Meta’s employees and recruiters on LinkedIn. Engage with their posts, share your projects, and reach out for informational interviews.
GitHub and Kaggle communities: Collaborating on open-source projects or Kaggle competitions can help you make connections in the industry.
Mentorship: Look for mentors in the data science field who can provide guidance, feedback on your portfolio, and career advice.
6. Learn Meta’s Specific Requirements
Meta’s data scientist role is unique because it emphasizes both technical and analytical skills. Meta typically looks for candidates who are:
Product-focused: You should understand how data science can impact products and user experience.
Curious and independent thinkers: Meta values individuals who can identify problems, propose solutions, and work independently.
Great communicators: You need to translate complex data insights into actionable business strategies that non-technical stakeholders can understand.
7. Prepare for Meta’s Interview Process
Once you land an interview, you’ll need to pass Meta’s rigorous technical and behavioral assessments. Here are the steps:
Technical interviews: Expect questions focused on SQL, Python, and statistical problem-solving. You may also face case studies that test your ability to analyze and interpret data.
Behavioral interviews: These focus on Meta’s core values and your ability to work in teams. Expect questions about challenges you’ve faced, how you approach problem-solving, and how you’ve used data to make product decisions in the past.
To prepare:
Use LeetCode for SQL and Python challenges.
Review statistics and probability concepts thoroughly.
Practice case study interviews through platforms like Interview Query.
8. Showcase Soft Skills
Finally, success at Meta isn’t just about technical know-how. They value soft skills like:
Problem-solving: Show that you can approach complex problems with a structured mindset and logical thinking.
Collaboration: Data scientists often work cross-functionally. Highlight your experience working with teams from different disciplines, such as engineers or product managers.
Communication: Be prepared to explain technical details to non-technical stakeholders. This is crucial in demonstrating your business acumen and value.
Final Thoughts
While a degree can open doors, it is by no means the only path to becoming a data scientist at Meta. By focusing on building practical skills, developing a strong portfolio, and networking effectively, you can stand out to hiring managers, even without formal academic credentials. Meta and other tech giants are increasingly focused on hiring the best talent, regardless of educational background, making this an exciting time to enter the field of data science.
The rise of artificial intelligence (AI) has opened up numerous opportunities for generating income. With just one AI tool, you can tap into various income streams depending on your skill set and goals. Here are several ways to generate income using an AI tool:
1. Content Creation and Writing
AI-powered writing assistants (like GPT-4 or Jasper AI) can help you create content quickly and efficiently. You can offer content writing services such as:
Blog writing: AI can assist in writing SEO-friendly blog posts that attract traffic and drive engagement.
Copywriting: Use AI to generate marketing copy, product descriptions, or landing page content for businesses.
Social media management: Create engaging posts, captions, and ads for clients using AI to save time and boost productivity.
Income Potential: Freelance writing or content creation can bring in anywhere from $500 to $5000 per month, depending on the client base and project size.
2. AI-Powered Design and Video Editing
AI tools like Canva AI and Runway ML allow users to create graphic designs, edit videos, or generate animations with minimal expertise. You can offer:
Logo and brand design: Leverage AI tools to create custom logos, banners, and visual assets for businesses.
Video creation and editing: AI-based video editors allow you to produce marketing videos, YouTube content, or social media clips with minimal effort.
Income Potential: Designers and video editors can earn anywhere from $1,000 to $10,000 per month depending on project scope and complexity.
3. AI-Driven SEO Services
SEO (Search Engine Optimization) tools like Surfer SEO or SEMrush offer AI-powered insights to improve website ranking. By providing AI-enhanced SEO services, you can:
Offer keyword research: Use AI tools to uncover high-volume, low-competition keywords to drive organic traffic for clients.
Optimize web pages: AI tools can suggest improvements to content, headings, and meta descriptions for better search performance.
Generate backlinks: Use AI to analyze competitors and identify backlink opportunities.
Income Potential: SEO specialists often charge between $500 to $5000 per client each month.
4. AI-Based Chatbots and Customer Support
AI tools like ChatGPT, ManyChat, and Tars allow you to create intelligent chatbots for businesses to automate their customer service and sales processes. You can:
Build chatbots for websites: Create bots that handle customer inquiries, bookings, or lead generation.
Automate social media responses: Set up bots to manage customer interactions on platforms like Facebook or Instagram.
Income Potential: Developing and maintaining chatbots can generate between $500 to $2,500 per bot per month, depending on complexity and functionality.
5. Online Tutoring and Course Creation
AI tools like ChatGPT can assist in creating comprehensive online courses and tutoring services. Whether you want to create educational materials or offer tutoring in specific subjects, AI can help you:
Develop course outlines and materials: Use AI to generate lesson plans, quizzes, and worksheets.
Offer personalized tutoring: Build personalized study plans for students based on their unique needs.
Income Potential: Online tutors and course creators can earn from $100 to $5,000 per month, depending on the number of students or course sales.
6. E-Commerce and Product Recommendations
AI tools like Shopify’s AI assistants or Amazon’s product recommendation algorithms can help streamline e-commerce businesses. You can:
Optimize product listings: Use AI to generate optimized descriptions and titles for better visibility.
Personalize customer experience: AI can recommend products based on customer behavior, increasing conversion rates.
Income Potential: Depending on the scale, e-commerce businesses utilizing AI tools can generate thousands to tens of thousands of dollars in monthly revenue.
Conclusion
With just one AI tool, you can access multiple streams of income. Whether it’s content creation, design, SEO, chatbot development, tutoring, or e-commerce, AI can amplify your productivity and revenue potential. The key is choosing the right AI tool for your skills and market needs. By mastering one tool, you can unlock opportunities that span across industries and client types.
In today’s data-driven world, organizations are increasingly recognizing the value of data as a strategic asset. However, the way data is delivered and consumed can greatly impact its value. The concept of delivering data as a product, rather than as an application, is gaining traction as it focuses on making data accessible, reusable, and meaningful to a broad range of users. This approach empowers stakeholders to derive insights and make decisions without being constrained by the limitations of traditional applications. Let’s explore the key principles and benefits of treating data as a product.
1. Understanding Data as a Product
When we talk about data as a product, we refer to treating data sets as standalone offerings that users can interact with independently of any specific application. This means the data is curated, well-documented, and easily accessible, much like a well-packaged consumer product. For example, a company might provide a dataset on customer purchasing behavior, along with tools for accessing, filtering, and analyzing that data. The dataset is the product, and it’s delivered in a way that allows users to derive value from it without needing to use a specific application.
Example: Imagine an e-commerce company that collects data on customer interactions. Instead of embedding this data into a specific sales application, the company offers it as a product via an API. Developers, marketers, and analysts can access this data, integrate it into their tools, and use it to gain insights. The data product could include documentation, sample queries, and best practices for use, making it valuable across different teams.
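A minimal sketch of what "data as a product via an API" might look like, using Flask (the endpoint, fields, and dataset here are hypothetical; a production data product would add authentication, versioning, and documentation):
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a curated, documented dataset of customer interactions
INTERACTIONS = [
    {"customer_id": 1, "event": "view",     "product": "sneakers"},
    {"customer_id": 1, "event": "purchase", "product": "sneakers"},
    {"customer_id": 2, "event": "view",     "product": "backpack"},
]

@app.route("/v1/interactions")
def interactions():
    # Consumers can filter the data however they like, independent of any one application
    event = request.args.get("event")
    rows = [r for r in INTERACTIONS if event is None or r["event"] == event]
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=5000)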
2. Why Not Deliver Data as an Application?
Applications are typically designed for specific tasks or workflows. While they can provide data, they often do so in a way that’s tightly coupled with the application’s functionality. This can limit how data is used. For instance, if customer data is only accessible through a customer relationship management (CRM) application, its use is confined to CRM-related tasks. Users can’t easily leverage the data for other purposes, such as market analysis or product development.
Delivering data as an application can also lead to silos, where different departments or teams only have access to the data through their specific applications, leading to fragmentation and inefficiencies.
Example: A healthcare provider may have patient data locked within an electronic health record (EHR) system. While the EHR is excellent for managing patient care, it might be challenging to extract data for research, population health management, or predictive analytics. If the data were delivered as a product, researchers could access it directly, apply their analytics tools, and derive new insights, unbound by the EHR’s interface or functionality.
3. Principles of Delivering Data as a Product
To successfully deliver data as a product, organizations should adhere to several key principles:
Data Accessibility: Ensure that data is easily accessible to all potential users, not just those using a specific application. This can be achieved through APIs, data warehouses, or cloud platforms that provide direct access to the data.
Documentation and Usability: Like any good product, data should come with comprehensive documentation. This includes details about the data’s structure, how it’s collected, what it represents, and how it can be used. Usability features like sample queries, data dictionaries, and visual interfaces can make the product more user-friendly.
Interoperability: Data products should be designed to work across different systems and applications. This often means adhering to standards and ensuring that data can be easily integrated with other tools and platforms.
Scalability and Security: As with any product, data must be scalable to handle varying loads and secure to protect sensitive information. This involves implementing robust access controls and ensuring data integrity.
Example: A financial services company might deliver market data as a product through a cloud-based data platform. This platform could allow users to access real-time and historical market data via APIs, with documentation on how to integrate the data into their analytics tools or trading systems. The data product could be designed to scale based on the number of users and the volume of queries while ensuring that sensitive financial information is protected.
4. Benefits of Data as a Product
Delivering data as a product offers numerous benefits:
Increased Data Utilization: By making data accessible and usable, organizations can increase the value derived from their data assets. Different teams can use the same data for various purposes, leading to more innovative uses.
Enhanced Collaboration: When data is treated as a product, it breaks down silos, allowing for greater collaboration across departments. Teams can access and use the same data, leading to more aligned and informed decision-making.
Flexibility and Innovation: Data products empower users to leverage data in ways that suit their specific needs. This flexibility can drive innovation, as users are not constrained by the limitations of a specific application.
Example: A retail chain could deliver its sales and inventory data as a product to its suppliers. By giving suppliers access to real-time sales data, they can better manage stock levels and anticipate demand, leading to a more efficient supply chain and reduced costs.
5. Challenges and Considerations
While the benefits are significant, there are challenges to delivering data as a product. These include ensuring data quality, managing data governance, and addressing privacy concerns. Organizations must also invest in the right infrastructure and tools to support data productization.
Example: A global corporation might face challenges in ensuring that data products are consistent across different regions with varying privacy laws and data standards. They would need to implement strict governance policies and invest in a scalable data infrastructure to manage this complexity.
Conclusion
Delivering data as a product rather than as an application represents a shift in how organizations think about and manage their data assets. By focusing on accessibility, usability, and flexibility, companies can unlock the full potential of their data, driving innovation, collaboration, and value creation across the organization. While challenges exist, the benefits of this approach make it a compelling strategy for organizations looking to stay competitive in a data-driven world.
YouTube has become a powerful platform for content creators to turn their passions into profitable ventures. When I first started, I never imagined that I could earn $150 per day just by sharing videos. But with time, strategy, and persistence, I made it happen. Here’s how I did it:
1. Finding My Niche
The first step was to identify a niche that I was passionate about and that had an audience. Instead of going broad, I focused on a specific topic. This helped me build a dedicated audience that was genuinely interested in my content.
2. Creating High-Quality Content
Quality is key on YouTube. I invested time in learning video editing, improving my on-camera presence, and creating scripts that kept viewers engaged. High-quality content attracts more viewers, increases watch time, and encourages subscribers, all of which are critical for monetization.
3. Consistent Upload Schedule
Consistency is one of the most important factors in growing a YouTube channel. I set a schedule and stuck to it, whether it was uploading videos once a week or twice a month. This helped in building anticipation among my audience, who knew when to expect new content.
4. Optimizing for Search (SEO)
To ensure my videos reached as many people as possible, I learned about YouTube’s search engine optimization (SEO). This involved using the right keywords in titles, descriptions, and tags. I also created custom thumbnails that stood out, which helped improve my click-through rate.
5. Engaging with My Audience
Building a community was essential. I made it a point to reply to comments, ask for feedback, and even create content based on my audience’s suggestions. This not only increased my viewer engagement but also encouraged loyalty and repeat viewership.
6. Monetization and Diversification
Once I hit the required threshold (1,000 subscribers and 4,000 watch hours), I applied for the YouTube Partner Program. This enabled me to earn money from ads. However, I didn’t stop there. I also explored affiliate marketing, brand deals, and even selling my own merchandise, which added multiple income streams.
7. Analyzing and Adapting
YouTube provides detailed analytics, which I used to understand what worked and what didn’t. I paid attention to metrics like watch time, audience retention, and traffic sources. This data guided my content strategy, helping me focus on what my audience loved the most.
8. Staying Patient and Persistent
Success on YouTube doesn’t happen overnight. It took months of hard work, learning, and adapting before I started seeing significant income. The key was to stay patient, keep creating content, and never give up, even when the views were low.
Conclusion
Earning $150 per day on YouTube is achievable, but it requires a combination of passion, strategy, and persistence. By focusing on quality content, optimizing for search, engaging with your audience, and exploring multiple revenue streams, you can turn your YouTube channel into a profitable venture. If I could do it, so can you!
Creating eye-catching thumbnails is crucial for the success of your YouTube channel. Thumbnails serve as the first impression for potential viewers and can significantly influence whether someone clicks on your video. Testing these thumbnails to ensure they effectively grab attention is just as important. While there are external tools available to assist with A/B testing and analytics, you can perform basic thumbnail testing directly within YouTube Studio without any third-party tools. Here’s how you can do it:
1. Use the ‘Custom Thumbnail’ Feature
YouTube allows you to upload custom thumbnails for your videos. To test different thumbnail options, follow these steps:
Go to YouTube Studio and click on “Content” in the sidebar.
Select the video you want to test.
Click on the current thumbnail to open the thumbnail selection menu.
Upload your new custom thumbnail and save the changes.
While this method doesn’t provide a direct A/B comparison, you can monitor the performance of each thumbnail over time.
2. Analyze Click-Through Rates (CTR)
The key metric to gauge thumbnail effectiveness is the Click-Through Rate (CTR). You can find this data within YouTube Studio:
Navigate to “Analytics” and select “Overview.”
Under “Reach,” you’ll see the CTR for your video.
Monitor the CTR after changing your thumbnail. A higher CTR indicates that your new thumbnail is more engaging.
Keep in mind that other factors, such as video title and metadata, also affect CTR. However, a noticeable change after updating the thumbnail can be a good indicator of its effectiveness.
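CTR itself is simple arithmetic, clicks divided by impressions, so a quick before/after comparison (with made-up numbers) looks like this:
# Impressions and clicks for the same video before and after the thumbnail change
before = {"impressions": 12000, "clicks": 480}   # CTR = 4.0%
after  = {"impressions": 11500, "clicks": 638}   # CTR ~ 5.5%

for label, d in (("before", before), ("after", after)):
    ctr = d["clicks"] / d["impressions"] * 100
    print(f"{label}: CTR = {ctr:.1f}%")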
3. Observe Viewer Engagement
Another way to test your thumbnails is by analyzing viewer behavior:
Go to the “Engagement” tab in YouTube Studio Analytics.
Look at metrics like average view duration and audience retention.
If your updated thumbnail attracts more clicks, but viewers quickly leave, it might be drawing in the wrong audience. Ensure that your thumbnail accurately represents the content of your video to maintain engagement.
4. Use Traffic Source Data
Understanding where your viewers are coming from can also provide insights into your thumbnail’s effectiveness:
In the “Reach” tab, check the “Traffic source types” section.
If you see an increase in traffic from “YouTube Search” or “Browse features” after changing the thumbnail, it’s likely more appealing to viewers browsing or searching on YouTube.
5. Compare to Similar Videos
You can compare the performance of your thumbnail with those of similar videos on your channel:
In YouTube Studio, go to “Analytics” and then “Overview.”
Scroll down to see the “Top videos” section.
Compare the CTR and engagement metrics of your video against others in the same niche.
This comparison can help you understand if the new thumbnail aligns with the overall performance trends of your content.
Conclusion
While YouTube doesn’t offer built-in A/B testing tools for thumbnails, you can still effectively test and refine your thumbnails directly within YouTube Studio. By regularly updating your thumbnails and closely monitoring metrics like CTR, engagement, and traffic sources, you can optimize your video’s first impression and boost its performance without needing any external tools. This process might require some manual effort, but the insights gained can be invaluable in growing your channel.
Data engineering is a rapidly growing field, and as demand for skilled professionals rises, finding the right job opportunities is crucial. Whether you’re an experienced data engineer or just starting your career, knowing where to search for job openings can make all the difference. Here’s a roundup of the top 10 career websites for data engineers.
1. LinkedIn
LinkedIn is more than just a social network for professionals; it’s a powerful job search tool, particularly for those in tech. The platform offers tailored job recommendations based on your profile, connections, and industry trends. LinkedIn also allows you to connect directly with recruiters and other professionals in your field, making it easier to tap into hidden job markets.
Key Features:
Extensive job listings tailored to your experience.
Networking opportunities with professionals and recruiters.
Company reviews and salary insights.
2. Indeed
Indeed is one of the largest job search engines globally, aggregating job postings from various sources, including company career pages and job boards. Its simple interface allows you to filter jobs by location, salary, experience level, and more, making it easy to find relevant positions in data engineering.
Key Features:
Wide range of job listings from multiple sources.
Advanced search filters for more precise job searches.
Company reviews and ratings to help you evaluate potential employers.
3. Glassdoor
Glassdoor is known for its comprehensive company reviews and salary information, making it a valuable resource for job seekers who want to research potential employers before applying. In addition to job listings, Glassdoor provides insights into company culture, interview processes, and employee satisfaction, helping you make informed decisions.
Key Features:
Job listings with company ratings and reviews.
Detailed salary reports specific to data engineering roles.
Insights into company culture and interview processes.
4. Hired
Hired is a unique platform that allows tech professionals, including data engineers, to be approached by companies rather than applying for jobs themselves. By creating a profile showcasing your skills and experience, you can receive interview requests from employers who are actively seeking candidates with your qualifications.
Key Features:
Employers reach out to you directly with interview requests.
Transparent salary offers before the interview stage.
Curated job matches based on your profile and preferences.
5. AngelList
AngelList is a go-to platform for those interested in working with startups, many of which are in the tech industry. The site allows you to apply directly to startup job postings and provides information on company funding, team size, and culture, making it easier to assess whether a startup is the right fit for you.
Key Features:
Focus on startup job opportunities, including remote roles.
Direct application process with hiring managers.
Insights into company size, funding, and culture.
6. Stack Overflow Jobs
Stack Overflow, known for its Q&A community for developers, also offers a job board specifically tailored to tech professionals. The platform allows you to showcase your developer skills and participate in the community, which can lead to job offers from companies that value your expertise.
Key Features:
Job listings focused on tech roles, including data engineering.
Opportunities to demonstrate your skills through community participation.
Filtered search based on technology stack, location, and more.
7. GitHub Jobs
GitHub Jobs is a job board for developers and tech professionals hosted by GitHub, the popular platform for code hosting and version control. While the job board is smaller than some others, it offers high-quality listings from companies looking for skilled engineers, especially those familiar with GitHub’s ecosystem.
Key Features:
Job listings focused on tech and developer roles.
Direct connection with companies that value open-source contributions.
Opportunity to showcase your GitHub profile and projects.
8. Dice
Dice is a specialized job board for tech professionals, offering a wide range of listings in data engineering, software development, and IT. The platform also provides career advice, salary insights, and tech news to keep you informed about industry trends and job market conditions.
Key Features:
Tech-focused job listings, including many data engineering roles.
Industry news and career resources to stay updated on trends.
Salary information and job market insights.
9. SimplyHired
SimplyHired aggregates job listings from various sources, similar to Indeed, but with a simpler interface and a focus on filtering jobs to match your exact criteria. It’s a great resource for finding data engineering jobs across different locations and experience levels.
Key Features:
Aggregated job listings from multiple sources.
Easy-to-use interface with advanced search filters.
Salary estimates and job trend data.
10. Kaggle Jobs
Kaggle, known for its data science competitions, also offers a job board where companies post openings for data engineering and data science roles. If you’ve participated in Kaggle competitions, your profile can serve as a portfolio, showcasing your skills to potential employers.
Key Features:
Job listings focused on data roles, including engineering.
Ability to showcase your competition results as part of your profile.
Opportunities to connect with companies that value data-driven skills.
Conclusion
Finding the right job as a data engineer requires not just technical skills, but also knowing where to look. These top career websites offer a range of opportunities, from startups to established tech giants, and provide the tools you need to connect with the right employers. Whether you’re looking to make your next career move or just exploring the market, these platforms can help you land the job that aligns with your skills and aspirations.
The data industry is booming, and with it comes a multitude of career paths and opportunities. Whether you’re a data scientist, analyst, or engineer, deciding where to go next in your data career can be both exciting and daunting. Here’s a guide to help you navigate the various options and figure out the best direction for your professional growth.
1. Deepening Your Technical Expertise
One natural step in your data career is to deepen your technical expertise. If you’re already proficient in tools like Python, R, SQL, or data visualization platforms like Tableau and Power BI, consider honing more advanced skills. Specializations such as machine learning, deep learning, or big data technologies (e.g., Hadoop, Spark) are highly sought after and can set you apart from the competition.
Action Steps:
Enroll in advanced courses or certifications in your area of interest.
Participate in hackathons or data science competitions like Kaggle.
Work on personal or open-source projects to apply new skills in a practical context.
2. Transitioning to Data Engineering
If you enjoy the technical side of data but want to focus on the infrastructure and architecture, transitioning to data engineering could be a rewarding move. Data engineers are responsible for building and maintaining the systems that store, process, and analyze data, ensuring that data pipelines are robust and scalable.
Action Steps:
Gain proficiency in programming languages like Python, Java, or Scala.
Learn about database systems, ETL (Extract, Transform, Load) processes, and cloud platforms such as AWS, Azure, or Google Cloud.
Consider certifications like AWS Certified Data Analytics or Google Cloud Professional Data Engineer.
3. Moving Into Data Science Leadership
For those with a few years of experience under their belt, moving into leadership roles can be a significant next step. Data science managers, directors, or even Chief Data Officers (CDOs) are increasingly in demand as companies recognize the importance of data-driven decision-making.
Action Steps:
Develop strong communication and project management skills to effectively lead teams.
Understand the business side of data to align your team’s efforts with company goals.
Seek mentorship from current leaders in the field and network within the industry.
4. Specializing in a Niche Field
The data industry offers numerous niche areas where you can specialize, such as healthcare analytics, financial data analysis, or sports analytics. Focusing on a niche can make you an expert in that domain, opening up opportunities in specific industries.
Action Steps:
Identify a niche that aligns with your interests and the industry demand.
Take specialized courses or certifications tailored to that field.
Network with professionals in that niche to learn about emerging trends and opportunities.
5. Exploring AI and Machine Learning
Artificial Intelligence (AI) and Machine Learning (ML) are among the fastest-growing areas in data science. If you’re intrigued by creating algorithms that can learn and make decisions, this could be the next frontier in your career.
Action Steps:
Learn foundational AI/ML concepts through online courses or advanced degrees.
Work on projects that involve natural language processing, computer vision, or predictive analytics.
Stay updated with the latest research and advancements in AI and ML by attending conferences or reading academic journals.
6. Becoming a Data Consultant
If you prefer a more dynamic and varied role, becoming a data consultant might be the path for you. Consultants work with different clients to solve specific data challenges, often bringing in fresh perspectives and innovative solutions.
Action Steps:
Build a strong portfolio showcasing your ability to deliver results.
Develop excellent communication and problem-solving skills to adapt to different industries and client needs.
Consider working for a consulting firm to gain experience before branching out on your own.
7. Entering Academia or Research
For those passionate about advancing the field of data science, entering academia or research can be a fulfilling option. This path allows you to contribute to the body of knowledge in data science while mentoring the next generation of professionals.
Action Steps:
Pursue advanced degrees (e.g., Ph.D.) in data science or related fields.
Focus on publishing research papers and presenting at conferences.
Engage in collaborations with academic institutions or research labs.
Conclusion
The direction you take in your data career depends on your interests, strengths, and the kind of impact you want to make. Whether you choose to deepen your technical skills, move into leadership, specialize in a niche, or explore new areas like AI and consulting, the key is continuous learning and adaptation. The data industry is constantly evolving, and by staying curious and proactive, you can carve out a rewarding and successful career path.
Data science continues to evolve rapidly, driven by advancements in technology, increasing volumes of data, and the growing demand for data-driven decision-making across various sectors. The year 2024 brings several notable developments in the field of data science, influencing how data is collected, processed, analyzed, and utilized. This article explores the key advancements and trends shaping data science in 2024.
1. Enhanced Machine Learning and AI
A. AutoML and Democratization of AI
AutoML Advancements: Automated Machine Learning (AutoML) tools have become more sophisticated, enabling non-experts to build complex machine learning models. These tools handle data preprocessing, feature selection, model selection, and hyperparameter tuning with minimal human intervention.
Democratization of AI: With the rise of user-friendly AI platforms, more organizations can leverage AI without needing extensive technical expertise. This democratization is making AI accessible to small and medium-sized enterprises (SMEs) and even individual users.
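AutoML frameworks differ, but the underlying idea, automated search over models and hyperparameters, can be sketched with scikit-learn’s GridSearchCV (a simplified stand-in for illustration, not a full AutoML pipeline; scikit-learn is assumed to be installed):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Automated search over a small hyperparameter grid with cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))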
B. Explainable AI (XAI)
Transparency in AI Models: Explainable AI has gained traction, addressing the black-box nature of many AI models. XAI techniques provide insights into how models make decisions, enhancing trust and enabling regulatory compliance.
Application in Critical Sectors: In healthcare, finance, and legal sectors, where transparency and accountability are paramount, XAI is crucial for adopting AI technologies.
2. Advanced Data Integration and Management
A. Data Fabric and Data Mesh
Data Fabric: This architecture integrates data across various environments, including on-premises, cloud, and hybrid systems. It enables seamless data access, management, and governance, breaking down data silos.
Data Mesh: A decentralized data architecture that promotes data ownership within business domains. It enhances scalability and agility by treating data as a product and emphasizing self-service data infrastructure.
B. Real-Time Data Processing
Stream Processing: Technologies like Apache Kafka, Apache Flink, and Amazon Kinesis have improved, facilitating real-time data ingestion, processing, and analysis. Real-time analytics are increasingly crucial for applications in finance, e-commerce, and IoT.
Edge Computing: With the proliferation of IoT devices, edge computing has become more prevalent. It allows data processing closer to the data source, reducing latency and bandwidth usage.
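As a rough sketch of stream consumption (assuming the kafka-python client is installed, a broker is reachable at localhost:9092, and an "events" topic exists; all three are assumptions for illustration):
import json
from kafka import KafkaConsumer

# Consume events as they arrive instead of waiting for a batch job
consumer = KafkaConsumer(
    "events",                            # hypothetical topic name
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value
    # Real-time processing step, e.g. update a running metric or trigger an alert
    print(event.get("event_type"), event.get("timestamp"))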
3. Enhanced Data Privacy and Security
A. Privacy-Enhancing Technologies (PETs)
Federated Learning: This technique enables model training across multiple decentralized devices or servers while keeping data localized. It enhances privacy by avoiding central data aggregation.
Differential Privacy: Differential privacy techniques are being integrated into data analysis workflows to ensure that individual data points cannot be re-identified from aggregate data sets.
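To illustrate one of these techniques, the sketch below adds Laplace noise to a count query, the classic mechanism behind differential privacy (the epsilon value and the data are made up; real deployments tune the privacy budget carefully):
import numpy as np

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=1000)      # synthetic individual-level data

true_count = int(np.sum(ages > 65))         # query: how many people are over 65?

epsilon = 0.5                               # privacy budget (smaller = more private)
sensitivity = 1                             # one person changes the count by at most 1
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)

private_count = true_count + noise
print(true_count, round(private_count, 1))  # the released value masks any single individual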
B. Data Governance and Compliance
Regulatory Frameworks: Stricter data privacy regulations worldwide, such as GDPR, CCPA, and new regional laws, require organizations to implement robust data governance frameworks.
AI Ethics and Responsible AI: Organizations are increasingly focusing on ethical AI practices, ensuring that AI systems are fair, transparent, and accountable.
4. Improved Data Visualization and Interpretation
A. Augmented Analytics
AI-Driven Insights: Augmented analytics uses AI and machine learning to enhance data analytics processes. It automates data preparation, insight generation, and explanation, enabling users to uncover hidden patterns and trends quickly.
Natural Language Processing (NLP): NLP capabilities in analytics platforms allow users to query data and generate reports using natural language, making data analysis more accessible.
B. Immersive Data Visualization
Virtual Reality (VR) and Augmented Reality (AR): VR and AR are being used to create immersive data visualizations, providing new ways to interact with and understand complex data sets.
Interactive Dashboards: Enhanced interactive features in dashboards allow users to explore data dynamically, improving the analytical experience and decision-making process.
5. Domain-Specific Data Science Applications
A. Healthcare
Precision Medicine: Advances in data science are driving precision medicine, where treatments are tailored to individual patients based on their genetic, environmental, and lifestyle data.
Predictive Analytics: Predictive models are used for early disease detection, patient risk stratification, and optimizing treatment plans.
B. Finance
Fraud Detection: Machine learning models for fraud detection are becoming more sophisticated, utilizing vast amounts of transactional data to identify and prevent fraudulent activities.
Algorithmic Trading: Data science continues to revolutionize algorithmic trading, with models that analyze market trends and execute trades at high speeds.
C. Environmental Science
Climate Modeling: Advanced data science techniques are improving climate models, helping predict weather patterns and understand the impacts of climate change.
Sustainability Initiatives: Data analytics is playing a crucial role in sustainability initiatives, from optimizing resource usage to monitoring environmental health.
Conclusion
The developments in data science in 2024 are transforming how data is leveraged across various industries. Enhanced machine learning and AI capabilities, advanced data integration and management techniques, improved data privacy and security measures, and innovative data visualization tools are driving this transformation. As these trends continue to evolve, the role of data science in solving complex problems and driving business innovation will only become more significant. Organizations and professionals in the field must stay abreast of these advancements to harness the full potential of data science in the coming years.
Power BI is a powerful business analytics tool by Microsoft that enables users to visualize and share insights from their data. One of the core components of effectively using Power BI is data modeling. Data modeling in Power BI involves organizing and structuring data to create a coherent, efficient, and insightful data model that facilitates accurate reporting and analysis. This guide will explore the fundamentals of data modeling in Power BI, including key concepts, best practices, and illustrative examples.
Introduction to Data Modeling
Data modeling is the process of creating a data model for the data to be stored in a database. This model is a conceptual representation of data objects, the associations between different data objects, and the rules. In Power BI, data modeling helps in organizing and relating data from different sources in a way that makes it easy to create reports and dashboards.
Key Concepts in Power BI Data Modeling
1. Tables and Relationships
Tables: Tables are the fundamental building blocks in Power BI. Each table represents a collection of related data.
Relationships: Relationships define how tables are connected to each other. They can be one-to-one, one-to-many, or many-to-many.
2. Primary Keys and Foreign Keys
Primary Key: A unique identifier for each record in a table.
Foreign Key: A field in one table that references the primary key of another table, linking related records across tables.
3. Star Schema and Snowflake Schema
Star Schema: A central fact table surrounded by dimension tables. It is straightforward and easy to understand.
Snowflake Schema: A more complex schema where dimension tables are normalized into multiple related tables.
4. DAX (Data Analysis Expressions)
A formula language used to create calculated columns, measures, and custom tables in Power BI.
Steps to Create a Data Model in Power BI
1. Import Data
Use Power BI's data connectors to import data from various sources such as Excel, SQL Server, Azure, and online services.
2. Clean and Transform Data
Use Power Query Editor to clean and transform data. This includes removing duplicates, filtering rows, renaming columns, and more.
3. Create Relationships
Define relationships between tables using primary and foreign keys to connect related data.
4. Create Calculated Columns and Measures
Use DAX to create calculated columns and measures for advanced data calculations and aggregations.
5. Define Hierarchies
Create hierarchies in dimension tables to facilitate drill-down analysis in reports.
6. Optimize the Data Model
Optimize data model performance by minimizing the number of columns, reducing data granularity, and using summarized data where possible.
Best Practices for Data Modeling in Power BI
1. Use a Star Schema
Prefer a star schema over a snowflake schema for simplicity and performance.
2. Keep the Data Model Simple
Avoid unnecessary complexity. Use clear and descriptive names for tables and columns.
3. Normalize Data
Normalize data where it reduces redundancy and improves integrity, but avoid over-normalizing dimension tables into a snowflake shape; keeping dimensions flat preserves the star-schema simplicity recommended above.
4. Create Measures Instead of Calculated Columns
Use measures for aggregations as they are calculated on the fly and do not increase the data model size.
5. Optimize Relationships
Use single-directional relationships when possible to improve performance.
Conclusion
Data modeling is a critical aspect of creating efficient and insightful Power BI reports and dashboards. By understanding and applying the key concepts and best practices discussed in this guide, you can create robust data models that support accurate analysis and reporting. Remember to keep your data model simple, use a star schema, and optimize for performance. Happy data modeling!
This article provides a comprehensive overview of data modeling in Power BI, with practical steps, best practices, and illustrative examples to guide you through the process.
The data job market has grown exponentially over the past decade, driven by the increasing reliance on data-driven decision-making across industries. As we move into 2024, several key trends and insights are shaping the landscape of data-related careers. This article explores the current state, emerging trends, in-demand roles, required skills, and future outlook of the data job market.
1. Current State of the Data Job Market
The demand for data professionals continues to rise, with businesses across sectors investing heavily in data capabilities to gain a competitive edge.
A. High Demand for Data Talent
Industries: Finance, healthcare, retail, technology, and manufacturing are leading sectors.
Roles: Data scientists, data analysts, data engineers, and machine learning engineers are in high demand.
B. Competitive Salaries and Benefits
Compensation: Competitive salaries, sign-on bonuses, and comprehensive benefits packages are common.
Remote Work: Increased flexibility with remote and hybrid work options.
2. Emerging Trends in the Data Job Market
A. Increased Adoption of AI and Machine Learning
AI Integration: Companies are integrating AI and ML into their operations for predictive analytics, automation, and customer personalization.
Specialized Roles: Growing demand for AI specialists, machine learning engineers, and deep learning experts.
B. Emphasis on Data Privacy and Security
Regulations: Stricter data privacy regulations (e.g., GDPR, CCPA) are driving the need for data governance and compliance roles.
Security: Increased focus on data security, leading to demand for data security analysts and cybersecurity experts.
C. Rise of DataOps and MLOps
DataOps: Streamlining data management and analytics workflows to enhance efficiency and collaboration.
MLOps: Managing machine learning lifecycle, from development to deployment and monitoring.
3. In-Demand Data Roles in 2024
A. Data Scientist
Responsibilities: Analyzing complex data sets, developing predictive models, and providing actionable insights.
Skills: Proficiency in Python, R, SQL, machine learning, and statistical analysis.
B. Data Engineer
Responsibilities: Designing, building, and maintaining data pipelines and architectures.
Skills: Expertise in ETL processes, data warehousing, and cloud platforms (e.g., AWS, Azure, GCP).
C. Machine Learning Engineer
Responsibilities: Developing, deploying, and optimizing machine learning models.
Skills: Strong programming skills (Python, Java), deep learning frameworks (TensorFlow, PyTorch), and model deployment.
D. Data Analyst
Responsibilities: Interpreting data, generating reports, and supporting business decision-making.
Skills: Proficiency in data visualization tools (Tableau, Power BI), Excel, and SQL.
E. Data Privacy Officer
Responsibilities: Ensuring data privacy compliance and managing data protection strategies.
Skills: Knowledge of data privacy laws, risk assessment, and data governance.
4. Essential Skills for Data Professionals in 2024
A. Technical Skills
Programming: Python, R, SQL, and other relevant languages.
Data Visualization: Tools like Tableau, Power BI, and D3.js.
Big Data Technologies: Hadoop, Spark, Kafka, and related technologies.
Cloud Computing: AWS, Azure, GCP, and cloud data services.
B. Soft Skills
Communication: Ability to translate complex data insights into actionable business recommendations.
Problem-Solving: Strong analytical and critical thinking skills.
Collaboration: Working effectively in cross-functional teams.
5. Future Outlook and Opportunities
A. Continuous Learning and Adaptation
Lifelong Learning: Staying updated with the latest tools, technologies, and methodologies.
Certifications: Earning relevant certifications (e.g., AWS Certified Big Data, Google Data Engineer) to enhance credibility.
B. Emerging Fields
Data Ethics: Growing importance of ethical considerations in data collection and analysis.
Quantum Computing: Potential impact on data processing and analytics, leading to new roles and opportunities.
C. Global Opportunities
Remote Work: Expanding opportunities for remote data jobs, allowing access to global talent pools.
Diverse Markets: Increasing demand for data professionals in emerging markets and developing economies.
Conclusion
The data job market in 2024 is characterized by rapid growth, high demand for skilled professionals, and exciting new opportunities driven by technological advancements. By staying abreast of emerging trends, acquiring essential skills, and continuously learning, data professionals can thrive in this dynamic and rewarding field. The future of the data job market holds immense potential for those ready to embrace its challenges and opportunities.
Passive income is money earned with minimal active involvement, allowing you to build wealth and financial security over time. While many passive income strategies require initial investment, there are several ways to generate passive income without any upfront cost. This article explores various methods to create passive income from scratch.
1. Leverage Your Skills and Talents
One of the most effective ways to create passive income is by leveraging skills you already possess.
A. Freelancing to Build Capital
Platforms: Sign up on freelancing websites like Upwork, Fiverr, or Freelancer.
Services: Offer skills like writing, graphic design, programming, or digital marketing.
Building a Portfolio: Use initial earnings to build a portfolio that can be leveraged later.
B. Creating Digital Products
E-books: Write and publish e-books on topics you're knowledgeable about. Use platforms like Amazon Kindle Direct Publishing.
Online Courses: Create online courses and sell them on platforms like Udemy, Teachable, or Skillshare.
Templates and Printables: Design digital templates or printables to sell on Etsy or Gumroad.
2. Content Creation
Content creation is a powerful way to generate passive income with no money upfront.
A. Blogging
Start a Blog: Use free blogging platforms like WordPress.com or Blogger.
Content: Write about topics you're passionate about or have expertise in.
Monetization: Earn through ads (Google AdSense), affiliate marketing, and sponsored posts.
B. YouTube
Create a YouTube Channel: Use your smartphone to start a channel.
Content: Focus on engaging and valuable content that attracts viewers.
Monetization: Earn through YouTube Partner Program (ad revenue), sponsorships, and merchandise.
3. Social Media and Influencer Marketing
Building a social media presence can lead to multiple passive income streams.
A. Growing Your Following
Platforms: Choose platforms like Instagram, TikTok, or Twitter.
Engagement: Consistently post valuable and engaging content.
Authenticity: Build a genuine connection with your audience.
B. Monetization
Sponsored Posts: Collaborate with brands for sponsored content.
Affiliate Marketing: Promote products and earn a commission on sales through affiliate links.
Brand Partnerships: Establish long-term partnerships with brands for ongoing income.
4. Creating a Personal Brand
A strong personal brand can open doors to various passive income opportunities.
A. Establishing Your Brand
Identity: Define your niche and unique selling proposition (USP).
Consistency: Maintain consistency in your content, messaging, and visuals.
Engagement: Actively engage with your audience to build trust and loyalty.
B. Expanding Income Streams
Digital Products: Sell e-books, courses, and printables under your personal brand.
Membership Sites: Create membership sites where users pay for exclusive content.
Merchandise: Design and sell branded merchandise.
5. Utilizing Free Resources and Learning
Continuous learning and networking can significantly enhance your passive income journey.
A. Free Educational Resources
YouTube Tutorials: Learn new skills and strategies through free tutorials.
Online Courses: Enroll in free courses on platforms like Coursera, edX, or Khan Academy.
Podcasts and Blogs: Stay updated with industry trends and insights.
B. Networking and Communities
Online Communities: Join forums, Facebook groups, or LinkedIn groups related to your niche.
Networking: Connect with like-minded individuals, share experiences, and explore collaboration opportunities.
6. Transitioning to Passive Income
As you build active income streams, gradually transition them into passive income.
A. Automation
Tools: Use automation tools for social media posting, email marketing, and customer management.
Delegation: Outsource tasks to freelancers or virtual assistants.
B. Investment of Earnings
Reinvestment: Reinvest your earnings into scalable passive income streams.
Diversification: Diversify your income sources to reduce risk and ensure stability.
Conclusion
Creating passive income with no money requires creativity, resourcefulness, and consistent effort. By leveraging your skills, creating valuable content, building a personal brand, and continuously learning, you can generate passive income streams that contribute to your financial independence. Start small, stay persistent, and gradually expand your passive income portfolio.
Analyzing sales data from a coffee shop provides valuable insights that can inform decision-making processes, enhance customer experiences, and improve profitability.
This article outlines a comprehensive data analysis project for a coffee shop, detailing the steps taken to gather, process, and analyze sales data.
Objectives
The primary objectives of this data analysis project include:
Understanding sales trends over time.
Identifying the most popular products.
Analyzing sales by time of day and day of the week.
Evaluating the impact of promotions.
Understanding customer preferences and behavior.
Data Collection
Data Sources
Data for this project can be collected from various sources, including:
Point-of-Sale (POS) Systems: Transaction data, including product, quantity, price, time, and date.
Customer Surveys: Feedback on products, service quality, and preferences.
Loyalty Programs: Data on repeat customers and their purchasing habits.
Sample Data
For simplicity, consider the following sample data structure from the POS system:
| Transaction_ID | Date | Time | Product | Quantity | Price | Promotion |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2024-07-01 | 08:05 | Latte | 2 | 5.00 | None |
| 2 | 2024-07-01 | 08:15 | Espresso | 1 | 3.00 | None |
| 3 | 2024-07-01 | 08:45 | Cappuccino | 1 | 4.50 | 10% Off |
| … | … | … | … | … | … | … |
Data Processing
Data Cleaning
Data cleaning involves removing duplicates, handling missing values, and correcting errors. For instance:
Missing Values: Filling or removing missing entries.
Duplicates: Removing duplicate transactions.
Incorrect Entries: Correcting any discrepancies in product names or prices.
Data Transformation
Transform the data into a format suitable for analysis. This may include:
Datetime Conversion: Convert date and time strings to datetime objects.
Feature Engineering: Create new features like Day of Week, Hour of Day, and Total Sales.
Example in Python using Pandas:
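A minimal sketch, assuming the column names from the sample table above and a hypothetical export file named coffee_shop_sales.csv:

```python
import pandas as pd

# Load the POS export (file name is illustrative).
df = pd.read_csv("coffee_shop_sales.csv")

# Cleaning: drop duplicate transactions and rows missing key fields.
df = df.drop_duplicates(subset="Transaction_ID")
df = df.dropna(subset=["Date", "Time", "Product", "Quantity", "Price"])

# Datetime conversion: combine Date and Time into a single timestamp.
df["Timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# Feature engineering: Day of Week, Hour of Day, and Total Sales per transaction.
df["Day_of_Week"] = df["Timestamp"].dt.day_name()
df["Hour_of_Day"] = df["Timestamp"].dt.hour
df["Total_Sales"] = df["Quantity"] * df["Price"]
```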
Data Analysis
Sales Trends Over Time
Analyzing sales over different time periods helps identify trends and seasonal patterns.
Daily Sales: Sum of sales for each day.
Weekly and Monthly Trends: Aggregating daily sales into weekly or monthly totals to observe longer-term trends.
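Building on the df prepared earlier, one way to compute these aggregates with pandas might look like this:

```python
# Transaction-level revenue rolled up to daily, weekly, and monthly totals.
daily_sales = df.set_index("Timestamp")["Total_Sales"].resample("D").sum()
weekly_sales = daily_sales.resample("W").sum()
monthly_sales = daily_sales.resample("M").sum()   # use "ME" on pandas 2.2+
```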
Popular Products
Identifying the best-selling products can guide inventory and marketing strategies.
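For example, a simple ranking by units sold and by revenue, again assuming the df prepared above:

```python
# Rank products by units sold and by revenue.
units_by_product = df.groupby("Product")["Quantity"].sum().sort_values(ascending=False)
revenue_by_product = df.groupby("Product")["Total_Sales"].sum().sort_values(ascending=False)
print(units_by_product.head(10))
```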
Sales by Time of Day and Day of Week
Understanding peak hours and busy days helps in staff scheduling and promotional planning.
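A quick way to surface peak periods is a day-by-hour pivot of revenue, sketched here with the engineered columns from earlier:

```python
# Total revenue by day of week and hour of day, to spot peak periods.
peak_periods = df.pivot_table(index="Day_of_Week", columns="Hour_of_Day",
                              values="Total_Sales", aggfunc="sum", fill_value=0)
```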
Impact of Promotions
Evaluate the effectiveness of promotions by comparing sales during promotional periods with regular periods.
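One simple, illustrative comparison groups transactions by whether a promotion was applied:

```python
# Compare transactions with and without a promotion applied.
df["Has_Promotion"] = df["Promotion"].fillna("None").ne("None")
promo_summary = df.groupby("Has_Promotion")["Total_Sales"].agg(["count", "mean", "sum"])
print(promo_summary)
```

A more rigorous evaluation would also control for time of day and seasonality, since promotions often run during already-busy periods.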
Customer Preferences and Behavior
Analyzing data from loyalty programs and surveys can provide insights into customer preferences.
Visualization
Visualizing the analysis results makes it easier to communicate insights.
Line Charts: For sales trends over time.
Bar Charts: For product popularity and sales by day/hour.
Pie Charts: For market share of different products.
Heatmaps: For sales distribution across different times and days.
Example Visualizations
Typical outputs for this project include a line chart of the daily sales trend and a bar chart of product popularity.
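The sketch below, assuming the daily_sales and units_by_product aggregates computed earlier, shows how these two charts might be produced with Matplotlib:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Line chart: daily sales trend.
daily_sales.plot(ax=ax1, title="Daily Sales Trend")
ax1.set_ylabel("Revenue")

# Bar chart: top products by units sold.
units_by_product.head(10).plot(kind="bar", ax=ax2, title="Top Products by Units Sold")
ax2.set_ylabel("Units")

plt.tight_layout()
plt.show()
```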
Conclusion
This coffee shop sales analysis project demonstrates how to collect, process, and analyze sales data to gain valuable insights. By understanding sales trends, popular products, and customer behavior, coffee shop owners can make informed decisions to enhance their operations and profitability. Implementing data-driven strategies can lead to better inventory management, targeted marketing campaigns, and improved customer satisfaction.
In the ever-evolving landscape of data science, staying ahead of the curve requires a keen understanding of the skills that are driving the industry forward. As we move further into 2024, several key competencies have emerged as critical for data scientists. These skills not only enhance individual capabilities but also ensure that organizations can leverage data effectively to drive decision-making and innovation. Here are the five data science skills you can't ignore in 2024:
1. Advanced Machine Learning and AI
Machine learning (ML) and artificial intelligence (AI) continue to be at the forefront of data science. As these technologies evolve, the demand for advanced expertise in this area has skyrocketed. Understanding complex algorithms, neural networks, and deep learning frameworks is crucial.
Deep Learning: Mastery of deep learning frameworks such as TensorFlow and PyTorch is essential. Deep learning, a subset of machine learning, focuses on neural networks with many layers (deep neural networks). These are particularly effective in tasks such as image and speech recognition, natural language processing, and complex pattern recognition.
Natural Language Processing (NLP): With the explosion of unstructured data from sources like social media, customer reviews, and other text-heavy formats, NLP has become a vital skill. Understanding NLP techniques such as sentiment analysis, entity recognition, and language generation is critical for extracting meaningful insights from text data.
Model Optimization: Beyond building models, optimizing them for performance and efficiency is key. Techniques like hyperparameter tuning, cross-validation, and deployment-ready solutions ensure that ML models are both robust and scalable.
2. Data Engineering
Data engineering is the backbone of data science, ensuring that data is collected, stored, and processed efficiently. With the volume of data growing exponentially, the role of data engineers has become more crucial than ever.
Big Data Technologies: Proficiency in big data tools such as Hadoop, Spark, and Kafka is vital. These technologies enable the processing and analysis of large datasets that traditional databases cannot handle.
Data Warehousing Solutions: Understanding cloud-based data warehousing solutions like Amazon Redshift, Google BigQuery, and Snowflake is important. These platforms offer scalable, flexible, and cost-effective data storage and processing solutions.
ETL Processes: Extract, Transform, Load (ETL) processes are fundamental in preparing data for analysis. Knowledge of ETL tools like Apache NiFi, Talend, and Informatica ensures that data is clean, reliable, and ready for use.
3. Data Visualization and Storytelling
Data visualization and storytelling are about transforming data into actionable insights. The ability to communicate complex information in a clear and compelling way is invaluable.
Visualization Tools: Proficiency in tools such as Tableau, Power BI, and D3.js is essential. These tools help create interactive and intuitive visual representations of data.
Design Principles: Understanding design principles and best practices for visual communication ensures that visualizations are not only aesthetically pleasing but also effective in conveying the intended message.
Storytelling Techniques: Beyond visualization, storytelling involves crafting a narrative that contextualizes data insights. This skill is critical for engaging stakeholders and driving data-driven decision-making.
4. Cloud Computing and Data Management
Cloud computing has revolutionized the way data is stored, processed, and analyzed. Familiarity with cloud platforms and data management strategies is a must for modern data scientists.
Cloud Platforms: Expertise in platforms like AWS, Google Cloud, and Azure is crucial. These platforms offer a range of services from data storage and processing to machine learning and AI capabilities.
Data Security and Governance: Understanding data security protocols and governance frameworks ensures that data is handled responsibly. This includes knowledge of GDPR, CCPA, and other regulatory requirements.
Scalable Solutions: Implementing scalable solutions that can handle growing data volumes without compromising performance is essential. This involves using distributed computing and parallel processing techniques.
5. Domain Expertise and Business Acumen
While technical skills are paramount, domain expertise and business acumen are equally important. Understanding the specific industry and business context in which data science is applied can significantly enhance the impact of data-driven solutions.
Industry Knowledge: Gaining expertise in specific industries such as finance, healthcare, or retail allows data scientists to tailor their approaches to the unique challenges and opportunities within those sectors.
Problem-Solving Skills: The ability to translate business problems into data science problems and vice versa is crucial. This requires a deep understanding of both the technical and business aspects of a project.
Communication Skills: Effectively communicating findings and recommendations to non-technical stakeholders ensures that data insights are acted upon. This involves simplifying complex concepts and focusing on the business value of data science initiatives.
Conclusion
As we navigate through 2024, the data science landscape will continue to evolve, driven by advancements in technology and changing business needs. By mastering these five key skillsโadvanced machine learning and AI, data engineering, data visualization and storytelling, cloud computing and data management, and domain expertise and business acumenโdata scientists can position themselves at the cutting edge of the industry. These competencies not only enhance individual careers but also empower organizations to harness the full potential of their data, driving innovation, efficiency, and growth.
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. The success of a data science project hinges on following best practices that ensure efficiency, accuracy, and reproducibility. Here are eight best practices that every data scientist should adhere to:
1. Define the Problem Clearly
Importance:
Establishing a clear understanding of the problem sets the direction for the entire project.
Helps in identifying the goals, requirements, and constraints of the project.
Steps:
Collaborate with stakeholders to gather detailed requirements.
Formulate the problem statement as a specific question or hypothesis.
Identify the metrics for success.
2. Data Collection and Cleaning
Importance:
High-quality data is the foundation of any data science project.
Cleaning the data ensures that the analysis is accurate and reliable.
Steps:
Collect data from reliable sources.
Handle missing values and outliers.
Ensure data consistency and accuracy through validation checks.
Document the data cleaning process for reproducibility.
3. Exploratory Data Analysis (EDA)
Importance:
EDA helps in understanding the underlying patterns and relationships in the data.
It guides feature selection and model selection.
Steps:
Use statistical summaries and visualizations to explore the data.
Identify key variables and their distributions.
Detect anomalies and patterns that may influence the modeling process.
4. Feature Engineering
Importance:
Feature engineering can significantly improve the performance of machine learning models.
It involves creating new features from existing data to better represent the underlying problem.
Steps:
Generate new features using domain knowledge.
Transform features to improve their predictive power.
Select the most relevant features using techniques like correlation analysis and feature importance.
5. Model Selection and Evaluation
Importance:
Choosing the right model and evaluation metrics is crucial for the success of the project.
Different models and metrics may be suitable for different types of problems.
Steps:
Experiment with various algorithms and techniques.
Use cross-validation to assess model performance.
Choose evaluation metrics that align with the business objectives (e.g., accuracy, precision, recall, F1 score).
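As an illustration of these steps, the following sketch uses synthetic data and scikit-learn to compare two candidate models with 5-fold cross-validation on the F1 score; the dataset and model choices are placeholders for a real project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real project dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, model in [("logistic_regression", LogisticRegression(max_iter=1000)),
                    ("gradient_boosting", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```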
6. Model Training and Tuning
Importance:
Training the model with optimal hyperparameters ensures the best possible performance.
Proper tuning avoids overfitting and underfitting.
Steps:
Split the data into training and validation sets.
Use techniques like grid search or random search for hyperparameter tuning.
Monitor training and validation performance to detect overfitting.
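A minimal grid-search sketch, again on synthetic data, shows how these steps fit together; the parameter grid and model are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out F1:", round(search.score(X_val, y_val), 3))
```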
7. Model Deployment and Monitoring
Importance:
Deploying the model in a production environment allows it to provide real-time predictions.
Continuous monitoring ensures that the model remains accurate and relevant over time.
Steps:
Use tools and frameworks that support scalable deployment (e.g., Docker, Kubernetes).
Implement monitoring to track model performance and detect drift.
Set up a feedback loop to update the model with new data.
8. Documentation and Reproducibility
Importance:
Documentation ensures that the project can be understood and replicated by others.
Reproducibility is essential for validating results and maintaining trust in the findings.
Steps:
Document the entire workflow, including data sources, preprocessing steps, and model parameters.
Use version control systems (e.g., Git) to track changes in code and data.
Share code, data, and results in a structured format to facilitate collaboration.
Conclusion
Adhering to these best practices in data science helps ensure that projects are executed efficiently, results are reliable, and insights are actionable. By defining the problem clearly, collecting and cleaning data meticulously, conducting thorough exploratory data analysis, engineering features effectively, selecting and evaluating models appropriately, training and tuning models carefully, deploying and monitoring models rigorously, and maintaining comprehensive documentation, data scientists can maximize the impact of their work and contribute valuable insights to their organizations.
ChatGPT, a large language model developed by OpenAI, is an incredibly versatile tool that can assist data scientists in various stages of their workflow. Here's a comprehensive guide on how you can leverage ChatGPT in your data science projects.
1. Data Understanding and Exploration
a. Data Interpretation:
Data Summarization: ChatGPT can provide summaries of data by reading descriptions, metadata, and sample data points. This is useful for understanding the context of the data.
Statistical Insights: It can offer insights into basic statistics like mean, median, mode, standard deviation, and more, helping you understand the distribution of your data.
b. Exploratory Data Analysis (EDA):
EDA Techniques: ChatGPT can suggest various EDA techniques such as plotting histograms, scatter plots, box plots, and more.
Insights from Visualizations: Although ChatGPT cannot create visualizations directly, it can suggest tools and libraries (like Matplotlib, Seaborn, Plotly) and interpret the results of your plots.
2. Data Cleaning and Preprocessing
a. Identifying Issues:
Missing Values: ChatGPT can provide strategies to handle missing values, such as imputation techniques or removal strategies.
Outliers Detection: It can suggest methods to detect and handle outliers, such as Z-score, IQR, or visualization techniques.
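For instance, a small IQR-based helper of the kind ChatGPT might suggest could look like this; the function name and sample values are hypothetical:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is a clear outlier
print(values[iqr_outliers(values)])
```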
b. Data Transformation:
Normalization and Scaling: It can explain when and why to apply normalization or scaling and how to use libraries like Scikit-learn for these transformations.
Encoding Categorical Variables: ChatGPT can guide on different encoding techniques like one-hot encoding, label encoding, and when to use each.
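A typical scikit-learn pattern it might point you to combines scaling and one-hot encoding in a single ColumnTransformer; the toy DataFrame below is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 61_000],
    "city": ["Austin", "Denver", "Austin", "Seattle"],
})

preprocess = ColumnTransformer([
    ("scale_numeric", StandardScaler(), ["age", "income"]),
    ("encode_categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocess.fit_transform(data)
print(X.shape)   # 4 rows: 2 scaled numeric columns + 3 one-hot city columns
```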
3. Feature Engineering
a. Creating New Features:
Feature Creation: ChatGPT can help brainstorm new features that might be useful for your model, such as polynomial features, interaction terms, or domain-specific features.
Dimensionality Reduction: It can explain techniques like PCA (Principal Component Analysis) and t-SNE for reducing the number of features while retaining essential information.
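As a quick illustration of dimensionality reduction, a PCA sketch on the well-known Iris dataset might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)   # variance retained by each component
```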
b. Feature Selection:
Selection Techniques: ChatGPT can suggest techniques for feature selection like Recursive Feature Elimination (RFE), feature importance from tree-based models, or correlation analysis.
Interpreting Results: It can help interpret the results of feature selection techniques to decide which features to retain.
4. Model Building and Evaluation
a. Choosing Algorithms:
Algorithm Selection: ChatGPT can recommend different machine learning algorithms based on the problem type (regression, classification, clustering) and dataset characteristics.
Hyperparameter Tuning: It can provide insights into hyperparameters for various algorithms and suggest strategies like Grid Search, Random Search, or Bayesian Optimization for tuning them.
b. Model Training and Evaluation:
Training Models: ChatGPT can guide through the process of training models using popular libraries like Scikit-learn, TensorFlow, and PyTorch.
Evaluation Metrics: It can explain different evaluation metrics (accuracy, precision, recall, F1 score, ROC-AUC for classification; RMSE, MAE for regression) and help interpret the results.
5. Model Deployment and Monitoring
a. Deployment Strategies:
Deployment Options: ChatGPT can suggest various deployment options, such as Flask/Django for creating APIs, using cloud services like AWS, Google Cloud, or Azure for scalable deployments.
Containerization: It can explain the benefits of using Docker for containerizing your models and provide guidance on creating Docker images.
b. Monitoring and Maintenance:
Monitoring Tools: ChatGPT can recommend tools for monitoring model performance in production, such as Prometheus, Grafana, or custom logging solutions.
Model Retraining: It can suggest strategies for maintaining and retraining models as new data comes in, ensuring your models remain accurate over time.
6. Automating Workflows
a. Pipeline Automation:
Pipeline Tools: ChatGPT can introduce tools for automating data pipelines like Apache Airflow, Prefect, or Luigi.
CI/CD for ML: It can explain the concepts of Continuous Integration and Continuous Deployment (CI/CD) in the context of machine learning and suggest tools like Jenkins, GitHub Actions, or GitLab CI.
7. Learning and Staying Updated
a. Educational Resources:
Books and Courses: ChatGPT can recommend books, online courses, and tutorials to help you deepen your knowledge in data science.
Research Papers: It can provide summaries and explanations of recent research papers in machine learning and data science.
b. Community and Forums:
Discussion Platforms: ChatGPT can point you to forums and communities like Stack Overflow, Reddit (r/datascience, r/machinelearning), and specialized Slack or Discord groups for networking and problem-solving.
Conclusion
ChatGPT is a powerful assistant for data scientists, offering support across the entire data science lifecycle. From initial data exploration to deploying and monitoring models, ChatGPT can provide valuable insights, suggest tools and techniques, and help troubleshoot issues, making your data science projects more efficient and effective. By integrating ChatGPT into your workflow, you can enhance your productivity, stay updated with the latest advancements, and ultimately, deliver better data-driven solutions.
The demand for data analysts has been on a steady rise as businesses increasingly rely on data-driven decision-making. Freelance data analysts, in particular, are in high demand due to the flexibility they offer to companies. Becoming a freelance data analyst in 2024 requires a combination of technical skills, business acumen, and effective self-marketing. This essay provides a detailed guide on how to embark on this career path, covering essential skills, tools, strategies for finding clients, and tips for building a successful freelance business.
Essential Skills for Freelance Data Analysts
Technical Proficiency
Statistical Analysis: Understanding statistical methods and being able to apply them is crucial. Tools like R and Python (with libraries such as Pandas, NumPy, and SciPy) are essential.
Data Visualization: Proficiency in data visualization tools like Tableau, Power BI, or D3.js helps in presenting data insights effectively.
Database Management: Knowledge of SQL and NoSQL databases for data extraction, manipulation, and management is fundamental.
Machine Learning: Familiarity with machine learning techniques and tools like Scikit-Learn, TensorFlow, or PyTorch can set you apart.
Soft Skills
Communication: The ability to explain complex data insights in a simple and concise manner to stakeholders who may not have a technical background.
Problem-Solving: Critical thinking and the ability to solve problems creatively using data.
Time Management: Managing multiple projects and meeting deadlines is crucial in a freelance setting.
Business Acumen
Understanding Business Context: Knowing how to apply data insights to solve business problems and drive decisions.
Marketing and Sales: Skills in self-promotion, networking, and sales to attract and retain clients.
Building Your Skill Set
Education and Certification
Formal Education: A degree in data science, statistics, computer science, or a related field can be beneficial.
Online Courses and Bootcamps: Platforms like Coursera, Udacity, and DataCamp offer specialized courses and certifications in data analysis and related fields.
Certifications: Consider certifications like Microsoft Certified: Data Analyst Associate, Google Data Analytics Professional Certificate, or IBM Data Science Professional Certificate.
Practical Experience
Projects: Work on personal or open-source projects to build a portfolio.
Internships: Gain practical experience through internships or volunteer work.
Setting Up as a Freelance Data Analyst
Creating a Portfolio
Showcase Your Work: Include detailed case studies of projects you've worked on, highlighting your role, the problem, your approach, and the results.
GitHub and Personal Website: Host your code and projects on GitHub, and create a professional website to showcase your portfolio and provide a point of contact for potential clients.
Tools and Resources
Freelance Platforms: Register on platforms like Upwork, Freelancer, and Toptal to find freelance opportunities.
Professional Network: Leverage LinkedIn and professional associations like the Data Science Association to network and find job leads.
Finding Clients and Building a Client Base
Marketing Your Services
Online Presence: Maintain an active online presence through a blog, LinkedIn posts, and participating in forums and online communities related to data science.
Content Marketing: Publish articles, case studies, and tutorials to demonstrate your expertise and attract potential clients.
Networking
Professional Events: Attend industry conferences, webinars, and local meetups to network with potential clients and other professionals.
Referrals: Ask satisfied clients for referrals and testimonials to build credibility and attract new clients.
Pricing Your Services
Research Market Rates: Understand the going rates for freelance data analysts in your region and set competitive prices.
Flexible Pricing Models: Offer different pricing models, such as hourly rates, project-based pricing, or retainer agreements, to suit the needs of various clients.
Managing Your Freelance Business
Project Management
Tools: Use project management tools like Trello, Asana, or Jira to organize tasks, manage deadlines, and collaborate with clients.
Communication: Maintain clear and regular communication with clients to manage expectations and ensure project alignment.
Financial Management
Accounting Software: Utilize accounting software like QuickBooks or FreshBooks to track income, expenses, and manage invoices.
Tax Planning: Understand your tax obligations as a freelancer and set aside money for taxes. Consider hiring an accountant to manage your finances.
Staying Updated and Continuous Learning
Ongoing Education
Workshops and Seminars: Attend workshops and seminars to stay updated on the latest trends and technologies in data analysis.
Online Courses: Continuously update your skills through online courses and certifications.
Community Involvement
Join Data Science Communities: Participate in data science communities, both online and offline, to stay connected with industry developments and network with peers.
Conclusion
Becoming a successful freelance data analyst in 2024 involves a mix of technical skills, business savvy, and effective self-marketing. By continuously improving your skills, building a strong portfolio, and networking effectively, you can establish a thriving freelance career in data analysis. The flexibility and variety that come with freelancing can offer a rewarding career path for those willing to invest the effort and adapt to the evolving demands of the data industry.
With the advent of advanced AI models like ChatGPT, opportunities to create revenue streams through AI-driven solutions have expanded significantly. This guide provides detailed strategies and 20 practical examples of how you can leverage ChatGPT to generate income.
1. Content Creation
Example: Blog Writing Service
Description: Use ChatGPT to generate high-quality blog posts for clients. Topics can range from technology and finance to lifestyle and travel.
Implementation: Market your services on platforms like Upwork or Fiverr. Offer custom content creation based on client specifications.
Example: E-book Writing
Description: Create e-books on popular topics by using ChatGPT to generate content.
Implementation: Write comprehensive guides or stories, format them professionally, and sell on Amazon Kindle Direct Publishing.
2. Customer Support
Example: Automated Customer Support for E-commerce
Description: Implement ChatGPT to handle customer inquiries, complaints, and FAQs.
Implementation: Integrate ChatGPT with an e-commerce platform to provide 24/7 customer support, reducing the need for a large support team.
3. Educational Services
Example: Online Tutoring
Description: Offer tutoring services in various subjects, with ChatGPT providing explanations and answering student questions.
Implementation: Use platforms like Teachable or Udemy to create courses supplemented by ChatGPT-powered Q&A sessions.
Example: Language Learning
Description: Develop a language learning app where ChatGPT acts as a conversation partner to help users practice new languages.
Implementation: Create an interactive app and charge a subscription fee for premium features.
4. Virtual Assistance
Example: Personal Assistant Services
Description: Provide virtual personal assistant services to busy professionals, using ChatGPT to manage schedules, emails, and reminders.
Implementation: Market the service to small business owners and executives who need help with day-to-day tasks.
5. Social Media Management
Example: Social Media Content Creation
Description: Use ChatGPT to create engaging social media posts for businesses and influencers.
Implementation: Offer packages for different types of content (e.g., daily posts, weekly blogs) and manage accounts for clients.
6. Market Research
Example: Competitive Analysis Reports
Description: Generate detailed competitive analysis reports using ChatGPT to gather and summarize market data.
Implementation: Sell these reports to businesses looking to gain an edge over their competitors.
7. Creative Writing
Example: Script Writing for YouTube Creators
Description: Write scripts for YouTube videos on various topics.
Implementation: Partner with YouTube creators to provide them with engaging scripts and help them grow their channels.
Example: Ghostwriting
Description: Offer ghostwriting services for books, articles, or speeches.
Implementation: Market yourself to authors, executives, and public figures who need high-quality written material.
8. Consulting Services
Example: Business Strategy Consulting
Description: Use ChatGPT to provide insights and strategic advice for businesses.
Implementation: Offer consulting services in areas like marketing, operations, and growth strategies.
9. Entertainment
Example: Interactive Storytelling
Description: Create interactive stories or games where users can choose their adventure paths.
Implementation: Develop a web or mobile app and charge for premium content or in-game purchases.
10. Healthcare Support
Example: Symptom Checker
Description: Develop a chatbot that helps users understand potential health issues based on their symptoms.
Implementation: Partner with healthcare providers to offer this as a service on their websites.
11. Financial Advice
Example: Personal Finance Management
Description: Create a chatbot that provides personalized financial advice and budgeting tips.
Implementation: Offer this as a subscription-based service to individuals seeking to improve their financial health.
12. Real Estate
Example: Property Recommendations
Description: Develop a chatbot that helps users find real estate properties based on their preferences.
Implementation: Partner with real estate agencies to integrate this tool into their websites.
13. Travel Planning
Example: Travel Itinerary Planning
Description: Offer personalized travel itineraries and recommendations.
Implementation: Create a subscription-based app or service for frequent travelers.
14. Event Planning
Example: Event Coordination
Description: Use ChatGPT to assist in planning and coordinating events, from weddings to corporate functions.
Implementation: Market your services to event planners and companies.
15. Legal Advice
Example: Legal Document Drafting
Description: Provide services for drafting legal documents, such as contracts and wills.
Implementation: Offer a subscription service or charge per document.
16. Technical Support
Example: IT Support Chatbot
Description: Develop a chatbot that provides technical support for software and hardware issues.
Implementation: Partner with IT service companies to offer this as a value-added service.
17. Gaming
Example: Game Development Assistance
Description: Use ChatGPT to generate game dialogues, storylines, and character backgrounds.
Implementation: Partner with game developers to streamline the creative process.
18. Nonprofit Organizations
Example: Fundraising Campaigns
Description: Use ChatGPT to create compelling fundraising content and manage donor communications.
Implementation: Offer your services to nonprofits to help them increase their fundraising efforts.
19. Research Assistance
Example: Academic Research Support
Description: Assist researchers by summarizing articles, generating hypotheses, and organizing references.
Implementation: Market your services to academic institutions and independent researchers.
20. Personal Coaching
Example: Life Coaching
Description: Provide life coaching sessions with ChatGPT offering advice and motivational content.
Implementation: Create a subscription-based service or offer one-on-one sessions.
By leveraging the capabilities of ChatGPT, you can tap into a wide range of industries and create multiple revenue streams. The key is to identify areas where ChatGPT can add value and then market your services effectively.
The pursuit of financial success often conjures images of high-stakes investments, volatile markets, and daring entrepreneurial ventures. However, making substantial money does not necessarily require assuming significant risks. Through a strategic approach that emphasizes steady growth, diversification, and informed decision-making, one can achieve financial prosperity while minimizing exposure to potential losses. Here are some effective strategies to make big money without taking big risks:
1. Diversification of Investments
Diversification is a foundational principle in risk management. By spreading investments across various asset classes such as stocks, bonds, real estate, and mutual funds, you can mitigate the impact of a poor performance in any single investment. For instance, while stocks can offer high returns, they can be volatile. Balancing them with bonds, which are generally more stable, can help smooth out overall portfolio performance. Additionally, investing in real estate provides a tangible asset that can generate rental income and appreciate over time.
2. Long-Term Investment in Index Funds and ETFs
Index funds and exchange-traded funds (ETFs) are investment vehicles that track the performance of a market index. These funds offer broad market exposure, low operating expenses, and a passive management style. Investing in index funds and ETFs can yield significant returns over the long term due to the compounding effect. They are considered less risky than individual stocks because they represent a diversified portfolio of companies. This strategy reduces the likelihood of substantial losses, as the overall market tends to grow over time despite short-term fluctuations.
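To see why compounding matters, here is a small, purely illustrative sketch of the calculation; the starting balance, contribution, return, and horizon are arbitrary example numbers, not projections:

```python
def future_value(principal, annual_return, years, annual_contribution=0.0):
    # Compound a starting balance plus yearly contributions at a fixed average return.
    balance = principal
    for _ in range(years):
        balance = balance * (1 + annual_return) + annual_contribution
    return balance

# Example inputs: $5,000 start, $3,000 added each year, 7% average return, 30 years.
print(round(future_value(5_000, 0.07, 30, 3_000), 2))
```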
3. Building a Strong Emergency Fund
An emergency fund acts as a financial safety net, providing liquidity in times of unexpected expenses or economic downturns. By having three to six months' worth of living expenses saved in a readily accessible account, you can avoid liquidating investments at inopportune times. This financial cushion allows you to stay the course with your long-term investment strategy, thereby minimizing risk and enhancing the potential for growth.
4. Investing in Personal Development and Skills
Investing in yourself is one of the most reliable ways to increase your earning potential without taking significant financial risks. Pursuing higher education, obtaining professional certifications, and developing new skills can lead to better job opportunities and higher income. The knowledge and skills acquired can provide a competitive edge in the job market and open doors to lucrative career advancements or entrepreneurial ventures with a strong foundation.
5. Starting a Side Business
Starting a side business can be a low-risk way to increase your income. Unlike quitting your job to start a business, a side hustle allows you to maintain a steady paycheck while exploring entrepreneurial interests. The key is to start small, leverage existing skills, and gradually scale up. With careful planning and minimal upfront investment, a side business can grow into a significant source of income without exposing you to the financial risks associated with full-time entrepreneurship.
6. Real Estate Investment through Rental Properties
Real estate is a tangible asset that historically appreciates over time. Investing in rental properties can provide a steady stream of passive income while the property itself increases in value. By carefully selecting properties in growing areas and maintaining them well, you can minimize risks. Additionally, utilizing property management services can help handle the operational aspects, reducing the time and effort required from the investor.
7. Leveraging Tax-Advantaged Accounts
Maximizing contributions to tax-advantaged accounts such as 401(k)s, IRAs, and Health Savings Accounts (HSAs) can enhance your financial growth with minimal risk. These accounts offer tax benefits that can significantly boost your savings over time. For instance, contributions to a traditional 401(k) are tax-deductible, reducing your taxable income, while the investments grow tax-deferred until withdrawal.
8. Staying Informed and Adapting to Market Conditions
Staying informed about market trends, economic conditions, and investment opportunities is crucial for making prudent financial decisions. Continuous education and a proactive approach allow you to adjust your strategies in response to changing conditions, thereby minimizing risks. Utilizing financial advisors and leveraging technology for investment management can also provide valuable insights and enhance decision-making.
Conclusion
Making big money without taking big risks is not only possible but also a prudent approach to financial success. By diversifying investments, focusing on long-term growth, building a strong financial foundation, investing in personal development, and making informed decisions, you can achieve substantial financial gains with minimized risk. The key lies in strategic planning, continuous learning, and disciplined execution, ensuring that your financial journey is both prosperous and secure.
Power Query is a powerful tool for manipulating and cleaning data, and it offers various features for managing dates. Here are some essential steps and techniques for handling date formats:
1. Data Type Conversion:
When you import data into Power Query, ensure that date columns have the correct data type. Sometimes Power Query's automatic detection gets it wrong, so verify that all columns are correctly recognized as dates.
To change a specific column into a date format, you have several options:
Click the data type icon in the column header and select "Date."
Select the column, then click Transform > Data Type > Date from the Ribbon.
Right-click on the column header and choose Change Type > Date.
You can also modify the applied data type directly in the M code to ensure proper recognition.
2. Extracting Additional Information:
From a date column, you can extract various details using Power Query functions. These include:
Year
Days in the month
Week of the year
Day name
Day of the year
3. Custom Formatting:
To format dates in a specific way, you can use the Date.ToText function. It accepts a date value and optional parameters for formatting and culture settings.
Combine Date.ToText with custom format strings to achieve precise and varied date formats in a single line of code.
4. Common Formats:
If you're dealing with common formats like DD/MM/YYYY, MM/DD/YYYY, or YYYY-MM-DD, you can easily change the format:
Import your data into Power Query.
Select the date column to be formatted.
Right-click and choose Change Type > Date.
Select the desired predefined format (e.g., DD/MM/YYYY) and click OK.
Remember, mastering date formatting in Power Query can significantly simplify your data processing tasks. Feel free to explore more advanced scenarios and create custom formats tailored to your needs!
The realms of physics and data science may seem distinct at first glance, but they share a common foundation in analytical thinking, problem-solving, and quantitative analysis. Physicists are trained to decipher complex systems, model phenomena, and handle large datasetsโall skills that are incredibly valuable in data science. As the demand for data scientists continues to grow across various industries, many physicists find themselves well-positioned to make a career transition into this exciting field. This guide outlines the steps and considerations for physicists aiming to transition into data science.
Understanding the Overlap
Physics and data science intersect in several key areas:
Mathematical Modeling: Both fields require strong skills in mathematics and the ability to build models that represent real-world phenomena.
Statistical Analysis: Understanding statistical methods is crucial for analyzing experimental data in physics and for extracting insights from datasets in data science.
Computational Skills: Proficiency in programming and computational tools is essential in both domains for solving complex problems.
Key Skills to Develop
While physicists already possess a strong analytical background, transitioning to data science requires acquiring specific skills and knowledge:
Programming Languages: Proficiency in programming languages such as Python and R is essential. These languages are widely used for data analysis, machine learning, and data visualization.
Data Manipulation and Cleaning: Learning how to preprocess and clean data using libraries like pandas (Python) or dplyr (R) is fundamental.
Machine Learning: Familiarity with machine learning algorithms and frameworks (e.g., scikit-learn, TensorFlow, PyTorch) is crucial for developing predictive models.
Data Visualization: Tools like Matplotlib, Seaborn, and Tableau help in visualizing data and presenting findings clearly.
Database Management: Understanding SQL and NoSQL databases is important for efficiently storing and retrieving large datasets.
Educational Pathways
Several educational resources can help bridge the gap between physics and data science:
Online Courses and Certifications: Platforms like Coursera, edX, and Udacity offer specialized courses and certifications in data science, machine learning, and artificial intelligence.
Bootcamps: Intensive data science bootcamps provide hands-on experience and often include career support and networking opportunities.
Graduate Programs: Enrolling in a master’s program in data science or a related field can provide a structured learning environment and credential.
Gaining Practical Experience
Hands-on experience is critical for a successful transition:
Projects: Undertake personal or open-source projects that involve data analysis, machine learning, and data visualization to build a portfolio.
Internships: Seek internships or part-time roles in data science to gain industry experience and apply theoretical knowledge to real-world problems.
Competitions: Participate in data science competitions on platforms like Kaggle to solve challenging problems and improve your skills.
Networking and Community Engagement
Building a professional network and engaging with the data science community can provide valuable insights and opportunities:
Meetups and Conferences: Attend data science meetups, workshops, and conferences to learn from experts and network with professionals in the field.
Online Communities: Join online forums and communities such as Reddit's r/datascience, Stack Overflow, and LinkedIn groups to seek advice, share knowledge, and stay updated with industry trends.
Mentorship: Find a mentor in the data science field who can provide guidance, feedback, and support throughout your transition.
Tailoring Your Resume and Job Search
Effectively marketing your skills and experience is crucial when applying for data science roles:
Highlight Transferable Skills: Emphasize your analytical skills, problem-solving abilities, and experience with data in your resume and cover letter.
Showcase Projects and Experience: Include relevant projects, internships, and any practical experience that demonstrates your proficiency in data science tools and techniques.
Tailor Applications: Customize your resume and cover letter for each job application to align with the specific requirements and keywords of the job posting.
Conclusion
Transitioning from physics to data science is a feasible and rewarding career move that leverages your existing analytical skills and quantitative background. By developing new competencies in programming, machine learning, and data analysis, gaining practical experience, and actively engaging with the data science community, you can successfully navigate this transition and thrive in the burgeoning field of data science. The journey requires dedication, continuous learning, and a proactive approach to building your skillset and professional network, but the potential for growth and impact in this dynamic field is substantial.
In the contemporary world, artificial intelligence (AI) is revolutionizing how we approach self-improvement. Leveraging AI, we can enhance various aspects of our daily lives, from mental health and productivity to learning new skills and maintaining physical wellness. Here are ten AI tools that can significantly contribute to your self-improvement journey when used daily.
1. Headspace: Meditation and Mindfulness
Meditation is a powerful tool for reducing stress and enhancing mental clarity. Headspace offers guided meditation sessions, mindfulness exercises, and sleep aids. This AI-driven app personalizes your meditation experience, helping you to cultivate mindfulness and manage stress effectively. Daily use can lead to improved focus, emotional health, and overall well-being.
2. Grammarly: Writing Enhancement
Effective communication is key in both personal and professional settings. Grammarly uses AI to enhance your writing by checking for grammar mistakes, suggesting style improvements, and even adjusting tone. Whether you’re drafting emails, reports, or creative pieces, Grammarly ensures your writing is clear, correct, and engaging, making it an indispensable tool for daily use.
3. MyFitnessPal: Nutrition and Fitness Tracking
Maintaining a healthy lifestyle requires awareness of your dietary and exercise habits. MyFitnessPal offers a comprehensive platform for tracking your caloric intake and physical activity. With its extensive food database and personalized fitness plans, this AI tool helps you set and achieve your health goals. Daily logging can lead to better nutrition choices and improved physical fitness.
4. Lumosity: Brain Training
Cognitive health is as important as physical health. Lumosity provides a suite of brain games designed to improve memory, attention, and problem-solving skills. By engaging in these personalized training programs daily, you can enhance your cognitive abilities, making it easier to handle complex tasks and improve mental agility.
5. Duolingo: Language Learning
Learning a new language opens up a world of opportunities and enhances cognitive skills. Duolingo uses AI to create interactive, gamified lessons tailored to your learning pace. Daily practice with Duolingo can significantly improve your language skills, aiding in better communication and cultural understanding.
6. RescueTime: Productivity and Time Management
In an age of digital distractions, managing time effectively is crucial. RescueTime tracks how you spend your time on digital devices, providing detailed reports and insights. By identifying productivity patterns and potential distractions, RescueTime helps you optimize your time, ensuring you stay focused on your goals.
7. Habitica: Habit Building
Building and maintaining good habits can be challenging. Habitica turns habit formation into a game, rewarding you for completing tasks and establishing positive routines. This AI-driven tool makes habit-building fun and engaging, encouraging you to stick to your goals through daily tracking and rewards.
8. Elevate: Cognitive Skills Improvement
Elevate offers personalized brain training programs aimed at improving critical thinking, language skills, and math proficiency. With daily exercises designed to challenge and engage, Elevate helps you sharpen your cognitive skills, making it an excellent tool for continuous self-improvement.
9. Noom: Weight Loss and Health Coaching
Achieving and maintaining a healthy weight involves more than just diet and exercise. Noom provides personalized coaching, meal plans, and psychological tips to foster sustainable habit changes. Using Noom daily can guide you towards healthier lifestyle choices, promoting long-term weight management and well-being.
10. Sleep Cycle: Sleep Tracking and Improvement
Quality sleep is fundamental to overall health. Sleep Cycle analyzes your sleep patterns and uses a smart alarm clock to wake you during your lightest sleep phase, ensuring you feel refreshed. By reviewing your sleep data and making necessary adjustments, Sleep Cycle helps improve sleep quality, contributing to better daily functioning.
Integrating AI Tools into Your Daily Routine
To maximize the benefits of these AI tools, integrate them seamlessly into your daily routine:
Morning: Start with a Headspace meditation session and review your Sleep Cycle data.
Throughout the Day: Use MyFitnessPal to track meals and exercise. Engage with Duolingo during breaks to practice a new language.
Work and Study: Improve your writing with Grammarly and monitor productivity with RescueTime. Take short cognitive breaks with Lumosity or Elevate.
Evening: Reflect on your habits and tasks with Habitica and plan for the next day. Wind down with a sleep story or guided meditation from Headspace.
By incorporating these AI tools into your daily life, you can significantly enhance your mental, physical, and cognitive well-being. The personalized and adaptive nature of AI ensures that your self-improvement journey is tailored to your unique needs and goals, making the process more effective and enjoyable.
Breaking into the field of data analysis can be both exciting and daunting. However, with the right approach, even beginners can achieve a significant hourly wage. Here's a step-by-step guide on how you can make $30 per hour as a beginner data analyst.
1. Acquire Essential Skills
a. Online Courses
Start by taking online courses that cover the basics of data analysis. Websites like Coursera, Udemy, and edX offer courses on:
Excel: Learn data manipulation and basic analysis.
SQL: Master database querying (a minimal example follows this list).
Python: Gain proficiency in data manipulation libraries like pandas and NumPy.
Data Visualization: Get comfortable with tools like Tableau or Power BI.
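As a taste of the SQL skill listed above, here is a small, self-contained sketch using Python's built-in sqlite3 module; the table and numbers are invented purely for illustration.

```python
import sqlite3

# In-memory database so the example runs without setup; the data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("North", 80.0), ("South", 200.0), ("South", 50.0)],
)

# A typical analyst query: total and average sales per region, largest total first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total, AVG(amount) AS avg_amount "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()

for region, total, avg_amount in rows:
    print(f"{region}: total={total:.2f}, average={avg_amount:.2f}")

conn.close()
```

The same GROUP BY pattern carries over directly to production databases such as PostgreSQL or MySQL.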
b. Practical Projects
Engage in hands-on projects to apply what you've learned. Many courses offer project-based learning, which is invaluable. Build a portfolio of your work to showcase your skills.
2. Gain Practical Experience
a. Personal Projects
Work on personal projects that interest you. These could involve analyzing public datasets available on platforms like Kaggle. Document your process and results to add to your portfolio.
b. Volunteer Work
Offer your skills to non-profits or small businesses that might not have the budget for professional data analysis. This provides real-world experience and builds your resume.
3. Build a Strong Online Presence
a. LinkedIn
Create a professional LinkedIn profile highlighting your skills, projects, and any volunteer work. Join LinkedIn groups related to data analysis to network with professionals in the field.
b. Portfolio Website
Consider building a personal website to host your portfolio. Include detailed descriptions of your projects, methodologies, and the tools you used.
4. Networking
a. Attend Meetups and Webinars
Join local meetups and online webinars related to data analysis. Networking can lead to job opportunities and valuable insights from experienced professionals.
b. Online Communities
Participate in online communities like Reddit's r/datascience, Stack Overflow, and Data Science Central. Engage in discussions, ask questions, and share your knowledge.
5. Freelance Platforms
a. Create Profiles
Sign up on freelance platforms like Upwork, Freelancer, and Fiverr. Create a detailed profile showcasing your skills, experience, and projects.
b. Start Small
Initially, accept lower-paying jobs to build your reputation. Focus on delivering high-quality work and getting positive reviews.
c. Gradually Increase Rates
As you gain experience and positive feedback, gradually increase your rates. Highlight your successful projects and satisfied clients to justify your rate increase.
6. Job Hunting
a. Tailor Applications
Apply to entry-level data analyst positions on job boards like Indeed, Glassdoor, and DataJobs. Tailor your resume and cover letter to each job, emphasizing your skills and relevant experience.
b. Internships
Consider applying for internships that offer practical experience and the possibility of full-time employment. Internships can be a stepping stone to higher-paying roles.
7. Continuous Learning
a. Stay Updated
The field of data analysis is always evolving. Stay updated with the latest tools and techniques by following industry blogs, taking advanced courses, and participating in webinars.
b. Certifications
Consider obtaining certifications from recognized institutions. Certifications in SQL, Python, or data visualization tools can add credibility to your profile.
Conclusion
Making $30 per hour as a beginner data analyst is achievable with dedication and strategic planning. By acquiring essential skills, gaining practical experience, building a strong online presence, networking, leveraging freelance platforms, and continuously learning, you can position yourself for success in this field. Remember, persistence and a willingness to learn are key to advancing your career and achieving your financial goals.
Working from home has become increasingly popular, offering flexibility, comfort, and the potential for significant income. Here are ten work-at-home jobs that can help you earn $100 a day or more:
1. Freelance Writing
Freelance writing is a versatile and accessible job for those with strong writing skills. Many companies and websites need content for blogs, articles, and marketing materials. Rates vary, but experienced writers can easily earn $100 a day by completing a few assignments.
How to Get Started:
Create a portfolio of writing samples.
Join freelance platforms like Upwork, Fiverr, or Freelancer.
Network with potential clients on LinkedIn and social media.
2. Virtual Assistant
Virtual assistants provide administrative support to businesses and entrepreneurs. Tasks can include managing emails, scheduling appointments, and social media management. Depending on the complexity and volume of work, virtual assistants can earn $15-$50 per hour.
How to Get Started:
Highlight your administrative and organizational skills in your resume.
Register on platforms like Zirtual, Time Etc, and Belay.
Offer your services to small businesses and entrepreneurs.
3. Online Tutoring
Online tutoring is an excellent option for those with expertise in a particular subject. Tutors can teach students of all ages in areas such as math, science, languages, and test preparation. Rates can range from $15 to $60 per hour, depending on the subject and level of expertise.
How to Get Started:
Identify your area of expertise and gather relevant certifications.
Join tutoring platforms like VIPKid, Chegg Tutors, and Tutor.com.
Market your services through social media and educational forums.
4. Graphic Design
Graphic designers create visual content for websites, advertisements, logos, and more. Skilled designers can charge $25-$100 per hour, making it possible to earn $100 a day with just a few hours of work.
How to Get Started:
Build a portfolio showcasing your design work.
Join design platforms like 99designs, Dribbble, and Behance.
Offer your services on freelance marketplaces.
5. Transcription Services
Transcription involves converting audio or video recordings into written text. Transcriptionists typically earn $15-$30 per hour of work, and an experienced transcriptionist can finish an hour of audio in roughly two hours, so a few audio hours per day is enough to reach the $100 mark.
How to Get Started:
Practice transcribing to improve speed and accuracy.
Join transcription platforms like Rev, TranscribeMe, and Scribie.
Invest in good headphones and transcription software.
6. Social Media Management
Social media managers handle the social media presence of businesses, including content creation, posting, and interaction with followers. Depending on the client and scope of work, rates can range from $20 to $50 per hour.
How to Get Started:
Develop a strong understanding of various social media platforms.
Create a portfolio showcasing successful social media campaigns.
Offer your services to small businesses and start-ups.
7. Online Customer Support
Customer support representatives assist customers via phone, email, or chat. Many companies hire remote customer support agents, and the pay typically ranges from $12 to $20 per hour.
How to Get Started:
Highlight your customer service experience and skills in your resume.
Apply for remote customer support positions on job boards like Indeed, Remote.co, and FlexJobs.
Ensure you have a quiet workspace and reliable internet connection.
8. E-commerce
Running an e-commerce store through platforms like Etsy, eBay, or Amazon can be highly profitable. Selling handmade crafts, vintage items, or even drop-shipped products can easily generate $100 a day with the right strategy.
How to Get Started:
Choose a niche and source or create products.
Set up your online store on platforms like Etsy, eBay, or Amazon.
Market your store through social media and online advertising.
9. Affiliate Marketing
Affiliate marketers promote products or services and earn a commission for each sale made through their referral link. With effective marketing strategies, affiliates can earn $100 or more per day.
How to Get Started:
Choose a niche and research affiliate programs related to it.
Create a blog or social media presence to promote products.
Join affiliate networks like Amazon Associates, ShareASale, and Commission Junction.
10. Online Coaching or Consulting
If you have expertise in a specific field, offering coaching or consulting services can be highly lucrative. Coaches and consultants can charge anywhere from $50 to $200 per hour, easily reaching $100 a day with a couple of sessions.
How to Get Started:
Identify your niche and gather relevant certifications.
Create a professional website to showcase your services.
Promote your services through networking and social media.
Conclusion
Working from home offers numerous opportunities to earn a substantial income. By leveraging your skills and expertise, you can find a work-at-home job that suits your lifestyle and financial goals. Whether you choose freelance writing, virtual assistance, or online tutoring, the potential to earn $100 a day or more is within your reach.
In today's digital era, there are numerous opportunities to make money online. This article provides an overview of 60 websites, categorized by type, and explains how each can help you earn money from the comfort of your home.
Freelancing Platforms
Upwork: A versatile freelancing platform for services like writing, graphic design, and programming
Fiverr: A marketplace for freelance services starting at $5, including digital marketing and video editing
Freelancer: Connects freelancers with clients for various services, from software development to administrative support
Toptal: A platform for top-tier freelancers, especially in software development, design, and finance
Guru: A freelance marketplace for professionals across multiple industries
PeoplePerHour: Connects freelancers with businesses needing project-based work, particularly in tech and design
99designs: A design-focused platform for graphic designers to participate in contests and work with clients
Remote and Flexible Job Boards
FlexJobs: A job board for remote and flexible jobs across various industries
SimplyHired: A job search engine that aggregates listings, including freelance and remote work opportunities
Microtasking and Survey Sites
Amazon Mechanical Turk: A microtasking platform for completing small tasks like data entry and surveys
Swagbucks: Earn points for taking surveys, watching videos, and shopping online
InboxDollars: Pays users to take surveys, watch videos, and read emails
SurveyJunkie: Earn money by participating in market research surveys
Vindale Research: Get paid for completing online surveys and participating in product testing
UserTesting: Provides payments for testing websites and apps and giving feedback
Respondent: Connects researchers with participants for studies and surveys
Pinecone Research: A survey site that offers product testing opportunities and rewards
Toluna: Earn points by taking surveys and testing products, redeemable for rewards
MyPoints: Rewards users for online activities like shopping and taking surveys
Cashback and Reward Apps
Rakuten: Provides cashback for shopping online through their portal
Ibotta: A cashback app for groceries and other purchases by scanning receipts
Dosh: Earn cashback for shopping, dining, and booking hotels
Shopkick: Earn rewards for walking into stores, scanning items, and making purchases
Honey: Save money with coupon codes and earn rewards for online shopping
Selling and Reselling Platforms
Poshmark: Sell new and used clothing and accessories
eBay: An online marketplace for buying and selling a wide range of items
Etsy: Marketplace for handmade, vintage, and unique goods
Decluttr: Sell old electronics, games, and DVDs with instant valuations and free shipping
Gazelle: Sell used electronics like smartphones and tablets
BookScouter: Compares prices from book buyback vendors to sell textbooks
ThredUp: An online consignment store for secondhand clothes
Print-on-Demand and Custom Products
Zazzle: Design and sell custom products like T-shirts, mugs, and phone cases
CafePress: Create and sell custom products, earning money from each purchase
Redbubble: Sell your artwork on various products, from apparel to home decor
Teespring: Create and sell custom T-shirts and other merchandise without upfront costs
Printful: A print-on-demand drop shipping service for custom products
Society6: Sell your art on custom-made products like prints and phone cases
Self-Publishing
Blurb: Tools for self-publishing and selling books, including photo books and magazines
Amazon Kindle Direct Publishing: Self-publish e-books and sell them on Amazon's Kindle Store
CreateSpace: Self-publish print books, now integrated with Kindle Direct Publishing
ACX: Create and sell audiobooks by connecting with narrators and producers
Crowdfunding and Membership Platforms
Patreon: Crowdfunding platform where creators earn money from fans through subscriptions
Kickstarter: Fund creative projects through crowdfunding, offering rewards to backers
Indiegogo: Supports a wide range of projects, from technology to arts, through crowdfunding
GoFundMe: A fundraising platform for personal causes
Content Creation Platforms
YouTube: Monetize videos through ads, sponsorships, and channel memberships
Twitch: Stream live content and earn through ads, subscriptions, and donations
TikTok: Monetize short videos through brand partnerships and the TikTok Creator Fund
Instagram: Earn money through sponsored posts, brand partnerships, and product sales
Facebook: Various monetization options, including ads, partnerships, and marketplace sales
Snapchat: Earn through Snap Ads, brand partnerships, and creating engaging content
Pinterest: Drive traffic to products and earn through affiliate links and sponsored pins
Medium: Earn money through the Partner Program by publishing articles
Quora: Monetize by asking questions and engaging in the Quora Partner Program
Online Teaching and Tutoring
Skillshare: Earn money by teaching online courses on various topics
Udemy: Create and sell online courses, earning from student enrollments
Coursera: Partner with universities to offer online courses and earn based on enrollments
Teachable: An all-in-one platform for creating, marketing, and selling online courses
Thinkific: Similar to Teachable, allows instructors to build and sell online courses
Wyzant: Tutor students online and in person, setting your own rates and schedule
Conclusion
These 60 websites provide diverse opportunities to make money online, catering to various skills, interests, and levels of commitment. Whether you are a freelancer, a creative artist, a writer, or someone looking to monetize everyday activities, there is a platform to help you generate income. By leveraging these resources, individuals can find flexible, remote, and often lucrative ways to supplement their income or even build full-time careers.
Investing $100 to potentially generate $1,000 in passive income involves strategic planning and leveraging opportunities that offer high returns with relatively low initial investment. Here are some ideas that anyone can start:
1. Dividend Stocks
Investing in dividend-paying stocks can provide a steady stream of passive income. Start by researching companies with a strong track record of dividend payments. Use your $100 to buy shares of these companies. Reinvesting dividends can compound your returns over time; a small numeric sketch follows the steps below.
Steps:
Open a brokerage account (many have no minimum deposit requirements).
Research and select dividend-paying stocks.
Purchase shares and opt for a dividend reinvestment plan (DRIP).
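To show roughly how reinvested dividends compound, here is a simplified sketch; the 3% yield, 7% price growth, 20-year horizon, and $100 starting amount are illustrative assumptions, not forecasts or financial advice.

```python
# Simplified compounding sketch; all rates and the horizon are illustrative assumptions.
initial_investment = 100.0   # dollars
dividend_yield = 0.03        # assumed 3% annual dividend yield
price_growth = 0.07          # assumed 7% annual share-price growth
years = 20

value_with_drip = initial_investment
value_without_drip = initial_investment
cash_dividends = 0.0

for _ in range(years):
    # With a DRIP, dividends buy more shares, so they compound along with the price.
    value_with_drip *= 1 + price_growth + dividend_yield
    # Without reinvestment, dividends pile up as uninvested cash.
    cash_dividends += value_without_drip * dividend_yield
    value_without_drip *= 1 + price_growth

print(f"With DRIP:    ${value_with_drip:,.2f}")
print(f"Without DRIP: ${value_without_drip + cash_dividends:,.2f}")
```

Even starting from $100, the gap between the two results after a couple of decades is the reason the steps above suggest opting into a DRIP.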
2. Peer-to-Peer Lending
Peer-to-peer (P2P) lending platforms allow you to lend money to individuals or small businesses in exchange for interest payments. Your $100 can be divided among several borrowers to diversify risk.
Steps:
Sign up on a reputable P2P lending platform (e.g., LendingClub, Prosper).
Deposit your $100 and choose loans to fund.
Earn interest on repayments.
3. High-Interest Savings Accounts or CDs
While not as high-yielding as other investments, high-interest savings accounts or certificates of deposit (CDs) offer a safe way to earn interest on your money.
Steps:
Research banks offering the best interest rates.
Open an account and deposit your $100.
Let the interest compound over time.
4. Invest in a Blog or Website
Starting a blog or website can generate passive income through advertising, affiliate marketing, and selling digital products or services. Initial costs can be kept low.
Steps:
Purchase a domain name and hosting (around $50-$100 for the first year).
Create content focused on a niche.
Monetize through ads, affiliate links, or selling digital products.
5. E-books or Online Courses
If you have expertise in a particular area, you can write an e-book or create an online course. These digital products can generate passive income over time.
Steps:
Use free or low-cost platforms like Amazon Kindle Direct Publishing or Udemy.
Create and upload your content.
Market your product to drive sales.
6. Invest in a REIT
Real Estate Investment Trusts (REITs) allow you to invest in real estate without buying property. REITs often pay high dividends.
Steps:
Open a brokerage account.
Research and select a REIT with a strong dividend history.
Purchase shares and reinvest dividends.
7. Micro-Investing Apps
Micro-investing apps like Acorns or Stash allow you to invest small amounts of money into diversified portfolios, making it easy to start with just $100.
Steps:
Download and sign up for a micro-investing app.
Link your bank account and deposit your $100.
Choose an investment portfolio and let the app manage your investments.
8. Cryptocurrency Investments
While riskier, investing in cryptocurrencies can potentially yield high returns. Allocate a small portion of your $100 to cryptocurrencies and hold for long-term growth.
Steps:
Open an account on a cryptocurrency exchange (e.g., Coinbase, Binance).
Purchase a diversified mix of cryptocurrencies.
Hold and monitor your investment.
9. Cashback and Reward Programs
Spending $100 through cashback and reward programs earns back a percentage of each purchase, and consistently routing routine expenses through these programs can add up to meaningful returns over time.
Steps:
Sign up for cashback and reward programs.
Use the programs for routine purchases.
Reinvest the earned rewards or cashback.
Conclusion
While $100 is a modest amount, starting with small investments can teach valuable lessons in managing and growing money. Diversify your investments to spread risk and increase the potential for returns. Remember, building passive income often requires time and patience, so remain committed to your strategy.
In the fast-paced world of affiliate marketing, finding the right programs can be the key to unlocking quick money.
Here are seven top-tier affiliate programs renowned for their potential to deliver rapid returns:
1. Amazon Associates:
As the largest online retailer globally, Amazon Associates stands as a cornerstone in affiliate marketing. With a vast selection of products spanning numerous categories, affiliates can tap into Amazon’s immense customer base and capitalize on its trusted reputation. With competitive commission rates and a user-friendly platform, Amazon Associates offers affiliates a reliable way to earn quick money through product referrals.
2. ClickBank:
Specializing in digital products such as e-books, courses, and software, ClickBank boasts some of the highest commission rates in the industry, often exceeding 50%. This generous commission structure, coupled with ClickBank’s extensive marketplace and robust tracking system, empowers affiliates to earn substantial income from promoting digital products to their audience.
3. ShareASale:
Catering to a wide range of industries and niches, ShareASale is a popular affiliate network that connects affiliates with merchants offering diverse products and services. With its intuitive interface and comprehensive reporting tools, ShareASale provides affiliates with the resources they need to identify high-converting offers and maximize their earnings potential.
4. CJ Affiliate (formerly Commission Junction):
With a network of thousands of advertisers, CJ Affiliate offers affiliates access to a vast array of affiliate programs across various verticals. Known for its reliable tracking technology and timely payments, CJ Affiliate provides affiliates with a trusted platform to monetize their online presence and generate quick money through affiliate marketing.
5. Rakuten Marketing:
Formerly known as Rakuten LinkShare, Rakuten Marketing is a global affiliate network that connects affiliates with top brands and advertisers. With its extensive network of merchants and robust reporting tools, Rakuten Marketing enables affiliates to optimize their promotional efforts and maximize their earnings potential.
6. eBay Partner Network:
Leveraging the popularity of one of the world’s largest online marketplaces, the eBay Partner Network allows affiliates to earn commissions by driving traffic and sales to eBay’s vast inventory of products. With its competitive commission rates and access to real-time performance data, the eBay Partner Network offers affiliates a lucrative opportunity to monetize their audience and earn quick money through affiliate marketing.
7. Shopify Affiliate Program:
Targeting entrepreneurs and businesses, the Shopify Affiliate Program allows affiliates to earn commissions by referring merchants to Shopify’s e-commerce platform. With its user-friendly interface and robust features, Shopify provides merchants with everything they need to start and grow their online store, making it an attractive option for affiliates looking to earn quick money by promoting e-commerce solutions.
In conclusion, these seven affiliate programs represent some of the best opportunities for affiliates to make quick money through affiliate marketing. Whether it’s through established platforms like Amazon Associates and ClickBank or affiliate networks like ShareASale and CJ Affiliate, affiliates have a wealth of options at their disposal to monetize their online presence and achieve financial success in a relatively short timeframe.
ChatGPT is a versatile tool that can significantly enhance productivity. Here are some compelling ways to use it:
Simplify Research: Instead of spending hours on Google, ChatGPT can summarize articles, provide insights, and help you find relevant information quickly.
Draft Emails: Need to compose an email? ChatGPT can assist by suggesting content, improving clarity, and ensuring your message is effective.
Summarize Long Documents: Whether it's reports, research papers, or lengthy articles, ChatGPT can create concise summaries, saving you time and effort.
Marketing Materials: Generate engaging content for blogs, articles, and social media. ChatGPT crafts compelling copy that resonates with your audience.
Coding Snippets and Troubleshooting: ChatGPT assists with writing code, debugging, and understanding complex syntax. It's like having a coding buddy!
Customer Service: Automate responses to common queries, freeing up your team to focus on more critical tasks.
Create Study Guides: ChatGPT can organize information into study materials, making exam preparation efficient.
Fresh Content Generation: Whether you're a writer, marketer, or blogger, ChatGPT can spark creativity and provide fresh ideas.
Remember, while ChatGPT is powerful, always verify critical information and use it ethically. Happy productivity!
For aspiring video creators hoping to turn their passion into a career, YouTube often appears as an ideal platform. The allure of sudden fame and financial success is strong, fueled by stories of YouTubers who have made millions of dollars. However, beneath the glitter and gloss lie some harsh realities that aspiring YouTubers must confront:
Fluctuating Ad Revenue: Even after reaching the monetization thresholds, ad revenue remains highly variable. Factors like ad rates, audience engagement, and seasonality affect earnings. For most creators, it's not a reliable income source.
Limited Revenue Streams: Relying solely on ad revenue isn't sustainable. Diversifying income sources through affiliate marketing, merchandise sales, sponsorships, and other channels is essential.
Oversaturated Market: YouTube is flooded with content across practically every category. Standing out and building a sizable audience can be incredibly challenging when millions of creators are vying for attention.
Monetization Thresholds: To be eligible for ad revenue, a channel must meet specific requirements, including having 1,000 subscribers and 4,000 watch hours in the last 12 months. Achieving these milestones can take months or even years.
Burnout and Mental Health: The constant pressure to produce content, meet viewer expectations, and navigate the platform's ups and downs can negatively impact creators' mental health. Burnout is a genuine concern.
Time and Effort Investment: Producing high-quality content for YouTube demands significant time, effort, and attention. Contrary to popular belief, it's often a full-time profession, from planning and filming to editing and promotion.
Competition and Copycats: Many content creators fall into the trap of imitating trends or styles to replicate successful material. Unfortunately, this lack of uniqueness adds to the intense competition and saturation.
Constant Algorithm Changes: The ever-evolving YouTube algorithm significantly impacts a channel's reach and visibility. Adapting to these changes and staying relevant is an ongoing struggle, as what works today may not work tomorrow.
Remember, the illusion of easy money on YouTube often clashes with the complex realities faced by creators. It's a journey filled with challenges, but for those who persevere, the rewards can be significant.
Earning large monthly profits on YouTube is an appealing prospect. Some channels bring in as much as $10,000 per month, and much of that income can be generated with relatively modest ongoing effort by following the strategies discussed below.
1. Curate Creative Commons License Videos:
Find existing videos related to your niche that have a Creative Commons License. Compile and post them on your channel, giving proper attribution to the original creators. This allows you to start without creating videos from scratch.
Tips:
* Niche Down: Choose a specific topic you're passionate about.
* Track Results: Use YouTube Studio to monitor estimated monthly revenue and adjust your approach.
2. Channel Memberships:
Offer exclusive content and perks to paid members. This provides a consistent revenue stream.
3. Affiliate Marketing:
Promote products using affiliate links in your video descriptions. You earn commissions without producing videos.
4. Audio Podcasts on YouTube:
Create a podcast channel where you post audio content. Tap into the audience that prefers listening over watching.
5. Selling Merchandise:
Use your YouTube platform to sell branded merchandise directly to your viewers.
6. YouTube Premium Revenue:
Benefit from YouTube Premium subscribers who watch your content without ads.
7. YouTube Consultancy:
Share your expertise by offering YouTube strategy consultancy services.
8. Super Chat in Live Streams:
Encourage viewers to purchase Super Chat messages during your live streams for an interactive way to boost income.
Remember, you can tailor these strategies to your interests and skills. Whether you're a budding content creator, affiliate marketer, or simply want to explore alternative content formats, there's plenty of opportunity to turn your YouTube channel into a money-making machine!
Starting and growing a YouTube channel on a low budget is an exciting venture.
Let's dive into the details of how you can achieve this:
1. Start with an Idea
Before anything else, define your niche. What topics or content are you passionate about? Consider your interests, skills, and what you can offer to your potential audience. Having a clear idea will guide your content creation.
2. Value Content Over Equipment
Remember that audiences tune in for what you have to say, not the fancy equipment. While good production quality matters, it's not the sole determinant of success. Use what you have and focus on creating engaging, valuable content.
3. Don't Overthink the Results
Don't get caught up in perfectionism. Start creating, even if you don't have top-tier gear. Your early videos might not be flawless, but consistency matters more. Learn and improve along the way.
4. Keep Records of Your Spending
Even on a budget, some expenses are necessary. Prioritize wisely. Here are some essentials:
Camera Options:
Your Smartphone: Most smartphones have decent cameras. Experiment with features like slow motion and 4K recording.
Webcam: While not ideal, webcams can work for basic videos.
Audio Equipment:
Lavalier Microphone: Affordable and effective for clear audio.
Desktop USB Microphone: A step up from built-in laptop mics.
Lighting:
Natural Light: Position yourself near a window during daylight hours.
Affordable Lighting Options: Consider inexpensive studio lights or use your computer monitor.
5. Be Authentic
Your personality is your biggest asset. Connect with your audience by being genuine and relatable. Authenticity builds loyal subscribers.
Remember, YouTube success isn't solely about equipment; it's about delivering captivating content. So, start today, create consistently, and enjoy the journey!
Descript uses AI to transcribe, edit, and mix both audio and video content. It's particularly useful for podcast conversions and streamlining content creation.
VidIQ is a comprehensive toolset designed to help creators, brands, and marketers understand their audience, navigate the YouTube algorithm, and grow their channels. Key features include:
Keyword Research: Find the most searched keywords in your niche to optimize video metadata.
Competitor Analysis: Analyze successful strategies used by competitors.
Trend Alerts: Stay informed about trending topics in your niche.
Video SEO Score: Get an SEO score for your videos and suggestions for improvement.
Channel Audit Tool: Receive a detailed report on your channel's performance.
Productivity Tools: Bulk edit video descriptions, tags, annotations, and more.
AI Tools: Features like Daily Video Ideas, Title Generator, Description Generator, and YouTube Channel Name Generator leverage AI to enhance content creation.
TubeBuddy is a popular browser extension and mobile app that integrates directly with YouTube's website. It offers various automation features, including topic ideas, trends, title and tag generation, and more. It's a valuable tool for optimizing your channels and videos.
HeyGen is an AI-powered video generator that allows you to create studio-quality videos using AI-generated avatars and voices. Whether you're a professional or a beginner, HeyGen makes video creation effortless and efficient. Here's how it works:
Choose an Avatar:
Select from more than 100 AI avatars representing various ethnicities, ages, and styles.
You can even create your own custom avatar if you prefer.
Select a Voice:
HeyGen offers 300+ voices in different styles and languages.
These voices are generated by AI, infusing human-like intonation and inflections with exceptional accuracy.
Start with a Template or Create from Scratch:
Pick from an extensive array of ready-to-use templates for various scenarios.
Alternatively, begin with a clean slate and create your video from scratch.
Record Your Script or Use AI-Generated Text:
Type, speak, copy and paste, or use HeyGen's AI to generate your script.
Effortlessly produce personalized outreach videos, content marketing videos, product marketing videos, and more.
Features for Scale:
Video Translator: Translate your videos seamlessly into other languages while maintaining your natural speaking style.
API Integration: Integrate HeyGen's AI capabilities into your product programmatically.
Veed is an AI-powered video editor that simplifies video editing directly in your browser. It offers features like auto-generated subtitles, text formatting, stock library access, screen recording, voice translations, and avatar creation.
This next tool is an AI-powered chapter generator designed to simplify and streamline content organization for YouTube creators. By generating timestamped chapters automatically, it aims to enhance the viewer experience, increase watch time, and drive channel growth. Creators simply paste the URL of the video they want chapters for, and the tool generates them. It's like having your very own virtual assistant for video editing!
Pikzels bills itself as the world's first AI thumbnail generator, transforming your ideas into eye-catching YouTube thumbnails in under 30 seconds. Here's how it works:
FaceSwap: Upload a picture of yourself, and the AI swaps the original face for yours, ensuring your audience instantly recognizes you.
Instant Thumbnails: Transform your ideas into captivating thumbnails within seconds.
Powered by AI: Experience fully automated thumbnail designs with Pikzels AI.
Generate from Links: Simply paste a link to a video's thumbnail you like, and the AI recreates it.
Upcoming Features: Subscribers get early access to features like AI ideation and adding text to thumbnails.
Remember that while these tools can be incredibly helpful, creating engaging and valuable content remains essential for long-term channel growth. Happy YouTubing!
Stepping into the entrepreneurial arena, you’re armed with dreams and the drive to make them a reality. Yet, the landscape of small business ownership is fraught with unexpected challenges that test your resilience. Being prepared is not just advantageous; it’s crucial for navigating through these trials and emerging stronger. In this article, we will explore essential strategies to construct a resilient safety net that bolsters your small business’s stability and growth.
Laying the Financial Foundation
The journey to financial resilience begins with crafting a meticulous budget. This foundational step is vital for a thorough understanding of your financial inflows and outflows, enabling effective management of cash flow and resource allocation. Adherence to this budget fosters a discipline that is indispensable in avoiding financial missteps and ensuring your business remains on solid ground.
Building a Buffer with an Emergency Fund
An emergency fund acts as a financial lifeline during unforeseen circumstances. By setting aside a reserve to cover unexpected expenses or to provide support during revenue downturns, you afford your business a buffer against financial shocks. This strategic reserve not only offers peace of mind but also ensures the continuity of your operations, regardless of the challenges encountered.
Enhancing Protection with a Home Warranty
For entrepreneurs operating from home, adding a home warranty to your insurance coverage provides an additional safety layer. This warranty covers the repair or replacement costs of critical systems and appliances, mitigating the financial impact of unexpected failures. Get started now with integrating a home warranty into your business plan so you can ensure uninterrupted operations, safeguarding your livelihood against unforeseen disruptions.
Financial Goal Setting
Setting specific financial goals is a critical step toward securing your business’s future. Whether aiming to expand your offerings, grow your market presence, or hit specific revenue targets, having concrete objectives provides direction and motivation. Developing a strategic plan to achieve these goals is instrumental in driving your business forward, ensuring each step taken is aligned with your overarching vision.
Prudent Use of Company Credit
Company credit cards, when used judiciously, serve as a powerful tool in managing your business’s finances. They facilitate timely expense management and offer an opportunity to build a positive credit history. However, the discipline to pay off balances promptly each month is crucial to avoid the pitfalls of debt accumulation, ensuring credit remains an asset rather than a liability.
Keeping Informed About Tax Regulations
Staying informed about tax regulations is imperative for minimizing liabilities and maximizing potential savings. A deep understanding of tax laws allows you to navigate the complex tax landscape effectively, ensuring you leverage every opportunity to benefit your business financially. Engaging with tax professionals or utilizing online resources are proactive steps in staying ahead of tax obligations and optimizing your financial strategy.
Ensuring Financial Integrity through Audits
Regular financial audits are essential for protecting your business's financial well-being, providing valuable insight into inefficiencies, risks, and areas for improvement. They enable timely adjustments to financial strategy and keep spending aligned with your business goals. Although the word "audit" can provoke apprehension, the practice upholds transparency and accountability, and it lets you identify and address potential issues before they escalate, fostering long-term stability and growth.
Fortifying your small business with a comprehensive safety net is a proactive approach to securing its longevity and prosperity. By implementing the strategies outlined above, you equip your business to withstand the vicissitudes of the entrepreneurial world, ensuring it not only survives but thrives. Take the initiative today to reinforce your business’s defenses, laying the groundwork for a resilient and successful future.
Discover how Data World Consulting Group can transform your data science journey and digital marketing strategies.
There is no doubt that a professional intro plays a major role in a YouTube video's overall success: it shapes the viewer's first impression, helps retain the audience, and quickly introduces your brand and the services you provide.
Here are six steps to help you create an effective intro:
Preparing the content by writing the script: Plan your intro in advance. Write a concise script that introduces your channel, topic, and what viewers can expect. A well-prepared script ensures a smooth delivery.
Get used to appearing in front of the camera relaxed: Practice makes perfect! Familiarize yourself with your camera or recording device. Relax, be natural, and avoid appearing stiff. Authenticity resonates with viewers.
Learn basic video editing: Basic editing skills are essential. Learn how to trim clips, add transitions, and incorporate text overlays. Tools like Placeit, InVideo, or VideoHive can simplify this process.
Well-chosen effects and transitions attract attention, make the video more enjoyable to watch, and help convey your message to the viewer simply and clearly.
Keep the scene natural and appealing: Your intro should set the tone for your video. Use captivating visuals, such as eye-catching graphics or footage related to your content. Consider using tools like Canva to create visually appealing elements.
A natural, unforced scene is a key reason viewers keep watching, and it helps the channel retain followers.
Make sure the sound is clear: Don't underestimate the importance of audio! Clear and crisp sound enhances the overall quality of your video. Invest in a decent microphone and ensure your voice or background music is well-balanced.
No matter how good the content and on-camera performance are, poor sound will undermine the video; the combination of clear audio and strong image quality is fundamental to its success.
Choose the appropriate music for the content: Background music sets the mood. Choose music that aligns with your content, whether it's upbeat, dramatic, or calming. Remember to use royalty-free tracks to avoid copyright issues.
In conclusion: Remember, your intro should be concise (usually under 10 seconds) and leave viewers eager to see more!
When digital data reigns supreme, small business owners must confront the significant challenge of safeguarding sensitive customer information. This responsibility, crucial for sustaining trust and profitability, requires a well-thought-out strategy and proactive measures. This guide from Data World Consulting Group delves into actionable steps aimed at strengthening data security, providing a solid defense against the constantly changing landscape of cyber threats.
Digital File Management and Robust Password Practices
Transitioning to digital files not only modernizes your data storage but also enhances security. It’s imperative to safeguard these digital assets with strong, complex passwords. Creating unique passwords for different files and regularly updating them can significantly reduce the risk of unauthorized access. This practice serves as a first line of defense, ensuring that sensitive information remains protected from potential breaches.
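As one concrete way to act on the strong-password advice above, Python's standard-library secrets module can generate random passwords or passphrases; the lengths and the tiny word list here are arbitrary choices for illustration, and a dedicated password manager is usually the more practical everyday tool.

```python
import secrets
import string

def random_password(length: int = 16) -> str:
    """Build a random password from letters, digits, and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))

def random_passphrase(words: list[str], count: int = 5) -> str:
    """Build a passphrase by sampling words at random; more words means more strength."""
    return "-".join(secrets.choice(words) for _ in range(count))

if __name__ == "__main__":
    print(random_password())
    # A real word list (for example, a diceware list) is far larger; this one is illustrative.
    sample_words = ["orbit", "copper", "violet", "harbor", "monsoon", "prairie", "ember", "quartz"]
    print(random_passphrase(sample_words))
```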
Invest in Advanced IT Education
Grasping the intricacies of information technology, encompassing key areas such as logic, architecture, data structures, and artificial intelligence, is crucial for navigating today's digital landscape. Enhancing your expertise in these domains, perhaps by pursuing an online degree in computer science, empowers you to devise and execute robust data security protocols. This deepened understanding not only allows you to foresee and mitigate potential vulnerabilities but also equips you to develop adaptive strategies that safeguard your data against the ever-evolving nature of cyber threats.
Implement Essential Cybersecurity Tools
Firewall and antivirus software act as fundamental barriers against cyber threats. These tools monitor and control incoming and outgoing network traffic based on predetermined security rules, offering a primary defense against unauthorized access. Regular updates and maintenance of these systems are crucial in ensuring they remain effective against the latest cyber threats.
Establish a Dedicated IT Department
Having a specialized IT department brings focused expertise to the management and security of your digital assets. These professionals stay abreast of the latest cybersecurity trends and threats, ensuring that your business’s data is protected with the most current and effective strategies. Their expertise is invaluable in both preventing data breaches and responding effectively if one occurs.
Prioritize Trustworthy Staff Recruitment
Employees are a critical factor in upholding a secure data environment. Recruiting individuals who demonstrate high levels of integrity and responsibility is key to ensuring that your data is managed with the highest level of attention and care. Enhanced security measures, such as comprehensive background checks and consistent training in data security, elevate the trustworthiness and capability of your team in protecting sensitive information. Additionally, fostering a culture of security awareness among staff contributes to a vigilant and proactive approach to data protection.
Develop an Efficient Filing System
Maintaining a meticulously structured filing system plays a pivotal role in reducing risks tied to data management. Such a system enhances the efficiency of retrieving information while simultaneously diminishing the likelihood of inadvertent data breaches.
Through careful labeling and secure storage of data, you guarantee that sensitive information remains within the reach of only those who are authorized, thus strengthening your data security framework. This methodical organization also aids in tracking data access and modifications, providing an additional layer of security and oversight.
The path to robust data security for small business owners is an ongoing and challenging endeavor. By integrating digital solutions, investing in IT education, utilizing strategic cybersecurity tools, and focusing on the recruitment of trustworthy staff, you establish a formidable shield against data breaches. This comprehensive approach does more than just protect your customers’ information; it lays a solid foundation for the long-term success and reputation of your business.
Technology has revolutionized the way that businesses operate, but it has also made them more susceptible to data breaches and other risks. Data governance is one way that companies can protect their data and ensure that it is being used properly.
This article shared by Data World Consulting Group will provide an overview of what data governance is and how it can benefit small businesses. Implementing effective data governance practices can not only safeguard sensitive information but also enhance trust with customers and comply with regulatory requirements.
Define Its Role and Importance
Data governance is the process of establishing policies, procedures, and standards for managing data within an organization. It involves defining who has access to certain types of data, as well as how it should be collected, stored, and used. Data governance helps organizations ensure that their data is secure and up-to-date, while also protecting them from potential liabilities associated with improper use or storage of customer information.
Impact of Data on Risk Mitigation
Data governance helps reduce the risk profile of a business by ensuring that sensitive information is protected and stored properly. It also reduces the chances of a data breach by limiting who has access to certain types of data and requiring security controls such as encryption and regular backups. By implementing data governance policies, businesses can be sure that they are protecting their customers' information as well as their own assets. Additionally, effective data governance enhances transparency and accountability, building trust with stakeholders.
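To illustrate the encryption-and-backups point above, here is a minimal sketch using the third-party cryptography package (installed with pip install cryptography); the file names are placeholders, and in practice the key must be stored separately and securely (for example in a secrets manager), never alongside the backup itself.

```python
from cryptography.fernet import Fernet

# Generate a key once and keep it somewhere safe; anyone holding the key can decrypt the backup.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt a hypothetical customer export before backing it up.
with open("customers.csv", "rb") as f:          # placeholder file name
    encrypted = fernet.encrypt(f.read())

with open("customers.csv.enc", "wb") as f:
    f.write(encrypted)

# Restoring from the backup is the reverse operation.
with open("customers.csv.enc", "rb") as f:
    restored = fernet.decrypt(f.read())
```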
The Role of Digital CRM Tools
Data governance empowers businesses to gain deeper insights into their customers using advanced digital tools like CRM (Customer Relationship Management) software and options related to customer data management. By leveraging this software, businesses can effortlessly monitor customer interactions, leading to more personalized marketing campaigns and enhanced communication. Moreover, this software enables businesses to analyze customer behavior, facilitating informed decision-making for future strategies.
Guiding the Construction of Business Strategies
Data governance not only aids businesses in gaining a deeper understanding of their customers but also offers invaluable insights into constructing strategies that optimize efficiency and profitability. By analyzing customer behavior patterns through various methods like surveys or A/B testing, businesses can devise more effective strategies for targeting specific audiences and launching new products or services. With the right data governance framework in place, companies can ensure the privacy and security of customer data, fostering trust and loyalty among their clientele.
Advancing Stakeholder Awareness and Consent
Data governance helps improve understanding between stakeholders by providing clear guidelines on how different departments should handle various types of information within the organization. This level of understanding leads to increased acceptance among stakeholders, which ultimately leads to greater collaboration between teams when tackling problems or developing new strategies together. In addition, effective data governance enhances data security and mitigates risks associated with data breaches.
Upsides of Enhanced Departmental Collaboration
When stakeholders have a greater understanding of each other’s roles within the organization through proper data governance practices, they are able to collaborate more effectively on projects involving multiple departments. This collaboration not only increases efficiency but also allows departments to leverage each other’s strengths to produce higher-quality results. By working together, they can drive innovation and achieve shared objectives, fostering a culture of success within the organization.
Data governance is an important tool for small businesses that want to protect themselves from the liabilities associated with improper use or storage of customer information, while also improving collaboration across departments for greater efficiency and long-term profitability. That combination makes it an essential ingredient of modern-day success for aspiring entrepreneurs.
The rise of e-commerce in recent years has been nothing short of astounding. With more and more people using digital platforms to shop, business owners are racing to keep up with the demand. But what separates the successful ones from the rest? It’s the ability to adapt and leverage technology to their advantage. In this article, we’ll explore how you can revolutionize your e-commerce operations through digital technology.
From artificial intelligence to blockchain, the Data World Consulting Group covers the top strategies for staying ahead of the game.
Harness the Power of AI to Increase Efficiency
AI has been a buzzword in the tech industry for a while now, and with good reason. By training algorithms to identify patterns and behaviors, companies can gain insights into their customers' preferences and deliver personalized experiences.
For e-commerce businesses, this can mean anything from recommending products based on previous purchases to using chatbots for customer service. By investing in AI, you can not only improve your customers' experience but also increase your sales and revenue. As you search for a robust automation and AI solution, you should take a look at this generative AI tool.
Add Augmented Reality (AR)
Harvard Business Review notes that augmented reality (AR) is another technology that's gaining traction in the e-commerce space. AR allows customers to visualize products more interactively, giving them a sense of what they're purchasing before they hit "buy". Think of it as a virtual try-on for clothing or a 3D model of furniture in your living room. This not only enhances the customer experience but also reduces the chances of returns and increases customer satisfaction.
Enhance Your Customers’ Mobile Experience
With more than 50% of internet traffic coming from mobile devices, it's essential to optimize your e-commerce site for mobile users. This means ensuring that your site is mobile-friendly, easy to navigate, and fast to load. You should also consider investing in mobile apps to provide a more seamless experience for your customers. Apps can allow for push notifications, personalized recommendations, and an easy checkout process.
Invest in a 3D Design Tool
Bringing new products to the market can be a costly and time-consuming process. Investing in a 3D design tool is an affordable option for businesses looking to bring new products to market efficiently. With the help of 3D design software, companies can easily create and visualize their product ideas in a digital space before moving on to the manufacturing process. This allows for faster iteration and prototyping, ultimately leading to a faster time to market. The cost of a 3D design tool is often outweighed by the benefits it provides in terms of increased efficiency and speed.
Achieve Optimal Supply Chain Efficiency
Optimizing your supply chain can be a game-changer for your e-commerce business. By using automated systems and data analytics, you can reduce costs, save time, and improve efficiency. This could include using sensors to track inventory levels, using predictive analytics to forecast demand, or using automated drones to deliver products.
Use Chatbots to Improve Customer Service
Chatbots have become increasingly popular in recent years, with many e-commerce businesses using them to improve customer service. By using natural language processing and AI, chatbots can provide personalized recommendations, answer customer questions, and resolve issues. This not only improves the customer experience but also frees up your staff to focus on higher-level tasks.
Capitalize on Blockchain Technology
Finally, as Business News Daily points out, blockchain technology is another area that e-commerce businesses should consider investing in. Blockchain provides a tamper-proof and transparent ledger of transactions, making it ideal for managing supply chains and tracking product authenticity. This technology can also be used for secure payments and protecting customer privacy.
There are many ways that e-commerce businesses can revolutionize their operations through digital technology. By embracing AI, AR, mobile, 3D design, supply chain optimization, chatbots, and blockchain, you can enhance the customer experience, reduce costs, and stay ahead of the competition.
However, it's important to remember that technology is not a silver bullet; it should be used strategically and in conjunction with a strong business strategy. By leveraging the power of technology, e-commerce businesses can thrive in the digital age and build a loyal customer base.
The Data World Consulting Group offers solutions related to data issues and digital marketing. Contact us today to learn more!
The data set in our project represents hotel reservation information for a city.
The reservation information includes the booking date, the length of stay, the number of guests (classified as adults, children, and babies), and the number of parking spaces required.
* The import and data reading stage
At this point we import the packages and libraries needed for data analysis and visualization.
We can then read the data set
and display it as follows:
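As a minimal sketch of this stage (the library choices and the file name hotel_bookings.csv are assumptions, not details from the original project):

```python
# Import the analysis and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the reservation data set (file name is an assumption)
df = pd.read_csv("hotel_bookings.csv")

# Display the first rows and the overall size to get a feel for the data
print(df.head())
print(df.shape)
```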
* The data preparation stage includes the following steps (a compact sketch follows this list):
1. Handling Missing Values:
Four columns contain missing values; to deal with them we must understand the context of the data, as shown in the following figure:
2. Convert column values:
We replace anomalous values identified through further analysis.
3. Change data types:
Some columns are still stored as strings and need to be converted to the appropriate types.
4. Handling duplicates:
We have to remove the duplicate rows; to find out how many there are, we run the following code.
5. Create new columns by combining other columns:
6. Drop unnecessary columns
We drop the original columns because they were only used to create the new ones.
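A compact sketch of these preparation steps; the column names used here (children, agent, company, reservation_status_date, adults, babies, and the stay-length columns) are assumptions and should be adjusted to the actual data set:

```python
# 1. Handle missing values based on the context of each column
df["children"] = df["children"].fillna(0)
df["agent"] = df["agent"].fillna(0)
df["company"] = df["company"].fillna(0)

# 2. Replace or drop anomalous values found during inspection
df = df[df["adults"] > 0]  # e.g. drop bookings with zero adults

# 3. Convert columns still stored as strings to proper types
df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"])
df["children"] = df["children"].astype(int)

# 4. Remove duplicate rows
print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# 5. Create new columns by combining existing ones
df["total_guests"] = df["adults"] + df["children"] + df["babies"]
df["total_nights"] = df["stays_in_week_nights"] + df["stays_in_weekend_nights"]

# 6. Drop the columns that were only used to build the new ones
df = df.drop(columns=["stays_in_week_nights", "stays_in_weekend_nights"])
```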
* Descriptive analysis and correlations:
We can call describe() to return a statistical summary of the data in the DataFrame.
We will use this output to perform the statistical analysis.
Correlation heatmap
We now build a heatmap that shows the strength of the relationships between the numerical variables.
We'll touch on using this map for EDA later.
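A minimal sketch of the heatmap step (using seaborn and Matplotlib, which is an assumption about the tooling):

```python
# Compute pairwise correlations between the numerical columns
corr = df.select_dtypes(include="number").corr()

# Draw the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap of numerical variables")
plt.show()
```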
* Exploratory data analysis:
To keep the EDA on track, it is best to follow these steps:
After the data preparation process, we export the file to CSV and then import it into Tableau for visualization later.
Looking at the correlation map raises several questions about the relationships between features.
We will use the map and the visualizations to formulate the following questions:
From the data set, we selected three main elements: Booking, hotel, and customer
Booking:
1. What is the big picture for booking rooms throughout the year and month?
2. What are the best booking channels?
3. Do guests include meals with their reservations?
hotel:
4. Which hotels are the most popular and how many bookings do they have during the year?
5. Compare the two hotels by customer group.
6. Compare the two hotels by customer type.
customers:
7. What are the types of customer requests when staying in different room types?
8. Which market segments have the most repeat guests and the longest total stays?
9. What is the impact of the presence of children on the parents' decision to order meals and on the length of stay?
10. For children and babies, what is their preferred type of room?
* Visualization and conclusion stage:
This is the visualization stage, carried out in Tableau.
1. What is the big picture for booking rooms throughout the year and month?
We'll look at a three-year period in this scenario.
A large number of bookings end in check-out, but a large percentage of the rooms are cancelled.
The number of rooms that were booked but whose customers did not show up is also very large.
Room reservations are classified by months:
We will notice that bookings in 2016 were at their peak, especially between the months of April and July
2. What are the best booking channels?
It shows that the direct channel dominates the hotel booking channels.
Viewed over time, some channels, such as GDS, did not prove effective for hotel reservations.
3. Will the reservation requester include meals with the reservation menu?
As expected, the number of meals increases with the number of booked nights: July and August see large numbers of meals and booked rooms, after which the numbers decline rapidly.
4. Which hotels are the most popular and how many bookings do they have during the year?
We are processing reservations for two hotels, City Hotel and Resort Hotel
Both hotels started booking around 2015
In comparison, we find that the City Hotel had approximately 19,000 reservations in 2016.
On the other hand, we find that the Resort Hotel had 12,200 reservations in the same year
5. Compare those hotels in the customer group.
The proportion of reservations for adults is ten times higher than for the children's group and thirty times higher than for the infant group.
The same ratio holds at the Resort Hotel.
6. Compare those hotels on customer type.
The main client type is Transient, followed by the Transient-Party client type, and then the contract client type
The results show that the Resort Hotel has a higher share of the Contract customer type, with a total of 8,182, while the City Hotel recorded only 2,390.
The Group customer type is omitted.
7. What are the types of customer requests when staying in different room types?
The percentage of parking-space requests rises in proportion to the percentage of special requests submitted by customers.
We also notice more guests in room types D and A.
Since these two room types are the most common, the volume of requests for them is correspondingly high.
8. Knowing the highest frequency of guests and the highest length of stay.
The following chart shows the number of repeat guests and total stays, aggregated by market segment.
The corporate segment has the highest number of repeat guests at 1,445, but their total number of nights is low. Meanwhile, 579 online guests booked the hotel again, with a total stay of 103,554 nights.
9. What is the impact of the presence of children on the parents’ decision to order meals and the length of stay?
It is clear that the presence of children has a direct impact on the parents' decision to order meals and on the duration of stay. Families with children tend to request additional meals but stay for shorter periods, as we can see in the figure.
10. For children and babies, what is their preferred type of room?
Considering that room types G, F, and A are common for children, and G, D, and A are common for babies, we conclude that rooms G and A are the most suitable for visitors with children and babies, while rooms H, E, and B are excluded from the preferred rooms for these guests.
Thus, we have completed our project and learned about the most important points that must be taken into account when undertaking any project of this kind
The product improvement process requires knowing the right strategy to achieve that goal.
In this article, we will use the Spotify application as a case study to explain these strategies.
What is Spotify?
Spotify is an application that gives audio lovers easy, on-demand access to high-quality digital music, audiobooks, and podcasts. Its advantages include suggestions that match the listener's interests and curated collections of music and podcasts.
The application relies on several techniques, such as data analysis, to provide the best service to users and to continuously build a clear picture of their preferred content, which helps it keep making appropriate suggestions and build a huge music library.
It also allows artists to develop their craft by encouraging and supporting them; as their performance improves, their fan base grows, and this is the basic foundation of a sound product-growth strategy.
Use of the application has clearly grown on a large scale, and to study the strategy behind this growth we need to clarify several points:
* Custom recommendations:
This point focuses on understanding user behavior and search queries by using machine learning algorithms to analyze listening patterns, which gives users a reason to keep coming back.
* Social features:
Sharing playlists and following friends on the platform is an important social feature that increases user engagement through interaction with peers.
* Gamification:
Leaderboards, challenges, and badges create a spirit of competition, which keeps users on the application for longer periods and thus increases engagement.
* Exclusive offers:
The application avoids the stale content users are used to seeing on other platforms and attracts more users by offering fresh, exclusive material.
* Flexibility in use
The application provides an easy-to-use interface, which also motivates users to spend more time on it and increases engagement.
* Collaboration with celebrities
This helps reach a wider audience and increase user engagement thanks to the popularity of celebrities, especially those most active on social media.
* Podcast summary feature
This feature lets users revisit podcast content after the broadcast has ended through a concise summary in PDF format, so they don't have to listen to the entire episode.
* Enhance post-broadcast interaction
The user can interact after the broadcast by making inquiries or comments, which also contributes to the expansion of participation on the application
These strategies contribute to the growth of this application, and with its continuity, it is expected that the growth will increase at a good rate in the near future
Projecting these strategies onto any product, we conclude that the basic factors of growth intersect at a few key issues, the most important of which is continuously improving the service to attract as many users as possible; those users form the strong base on which the producer relies to spread the product.
With rapid scientific progress, learning formats have become broader and more diverse, and continuous learning is an essential cornerstone for anyone who wants to develop the skills their work requires.
Professional development is therefore one of the pillars of advancement for any job or profession, from the individual to the institution, all the way up to companies of every size.
The same applies to data science and all the sciences and specializations derived from it: a data scientist who develops their skills and experience, and keeps pace with continuous developments and updates, raises their value and scientific level.
Experience and skill in data science and analysis can be gained from several sources, including training courses, but those courses must come from reliable sources offering correct information and high quality. Below is a list of free virtual training programs provided by leading data science companies, along with their registration links.
* KPMG Data Analytics Internship
This company is a member of the family of major accounting firms that provide valuable learning content. Its program focuses on simplifying how to work with big data and how to carry out effective data analyses.
The next program comes from a global management consulting company with offices in many countries and headquarters in Boston, known as one of the highest-ranked consulting firms in the world. It is famous for creating many management analysis methods, including the growth-share matrix and the experience curve.
The TATA Group includes many companies providing energy, engineering, and information systems services, in addition to training programs related to data science, especially around solving problems and working through them to reach the best results.
This course will enable you to learn about the day-to-day work of the Data Science team at British Airways. You will learn how they extract data from customer reviews and create predictive models.
Similar to the previous company, during this course, you will be allowed to enter the daily work world of the American company Cognizant, allowing you to virtually complete the tasks of the artificial intelligence team and gain experience and skill
This training program allows you to learn about the ability of data to penetrate individuals and organizations. This program is provided by Quantium, a leading company in data science and technology, by creating decision support tools, generating insights, and developing data sets
These courses are an opportunity to see how major companies approach data science and different analysis techniques; they let you work with them virtually to build experience and broaden your skills.
We have noted in previous articles that a job in data science is the dream of many people these days, and it requires real effort to build strong experience and knowledge because competition for these jobs is high.
The most important pillar of that expertise is not just knowing the tools; a data scientist also needs a comprehensive grasp of the main concepts and techniques so they can apply them later according to the requirements of the work at hand.
In this article, we will provide a comprehensive guide for beginners who are about to learn data science
Let's first learn about the concept of data science
Data science in a simplified way is the integration of a group of sciences such as mathematics, statistics and programming that work together to obtain useful insights when dealing with data.
Many related sciences branch out from data science, and the following sciences are the most common, including:
Machine learning, data analysis, business intelligence, statistics, mathematics, and other widely used sciences.
Using these techniques, data science is applied in several areas, including:
Language translation and text analytics, image classification, remote sensing, and health services management.
The three most common roles in data science
Data analyst: analyzes data to generate better insights for business decisions
Data scientist: extracts useful information from big data
Data architect / data engineer: designs and manages data pipelines
What are the best ways to learn data science?
Learning data science has a particular quality: the deeper you go, the more horizons open up in front of you, and you realize how much there is still to learn. So diversify your learning sources, for example by taking online training courses and choosing suitable certificates. There are other approaches that we will discuss later.
* Know the basic concepts
Knowing the essential tools and software a data scientist uses, as well as the main techniques, is one of the most important things to learn.
Learning a programming language is the most important pillar for starting the journey. Python (or any language of your choice) should be learned to the point of proficiency, and reading articles on programming basics and practicing writing code helps consolidate what you learn.
* Learning through the implementation of projects
This method is the best for learning, as it will introduce you to the work environment in data science. As you implement projects, you will have clear visions, and you will have your own style in deducing options and exploring appropriate solutions.
The implementation of projects requires conducting many searches and carrying out relevant studies. It is advised to start with simple projects that suit your level as a beginner, and with continuous repetition and good follow-up, you will find yourself starting to learn broader concepts to move on to implementing more complex projects, thus increasing your experience and skills.
What are the most important points that a beginner data scientist should learn?
You must choose a field in which you specialize in data science, and accordingly we mention several concepts that you must learn and master
1. Comprehensive knowledge
Stay aware of the real world around you by following news relevant to your field and keeping up with updates and new technologies. By connecting current events to your data science studies, you can get the maximum benefit from what is happening around you.
2. Mathematics and Statistics
Mathematics
* Linear algebra: a branch that is useful in machine learning because data sets and model parameters are represented as matrices, which are a basic pillar of machine learning
* Probability: this branch of mathematics is useful for predicting the unknown outcomes of a particular event
* Calculus: used to work with derivatives and integrals of functions, which appear throughout deep learning and machine learning
Statistics
* Descriptive statistics: includes measures such as the mean, median, trimmed statistics, and weighted statistics; this is the first stage of analyzing quantitative data, often presented as charts and graphs
* Inferential statistics: includes A/B tests, hypothesis tests, p-values, and alpha levels for analyzing the collected data
3. Dealing with databases
When talking about data engineering, we should mention the intersection between a data scientist and a data engineer, where pipelines are created for all data from several sources and stored in a single data warehouse.
As a beginner, it is recommended to learn SQL first and then pick up one RDBMS, such as MySQL, and one NoSQL database.
4. Python and its libraries
It is the most widely used programming language for later use in data analytics due to its simplicity in terms of building code and organizing sentences, and it has many libraries such as NumPy, Pandas, Matplotlib, and Scikit-Learn.
This allows the data scientist to use data more effectively
There are courses for beginners in Python on Udemy or Coursera that can be used to learn the principles of Python
5. Data cleaning
It is a time-consuming task for beginners, but it must be implemented in order to obtain good data analysis resulting from clean data.
For a detailed explanation of data cleaning, you can read a comprehensive article through this link Click here
6. Exploratory data analysis
This type of analysis is meant to detect anomalies in the data and test hypotheses with the help of statistics and graphs
As a beginner, you can use Python to perform EDA according to the following steps (a small sketch follows the list)
Data collection: It involves gathering, measuring, and analyzing accurate data from multiple sources in order to find a solution to a specific problem
Data cleaning: Troubleshoot incorrect data
Univariate analysis: an analysis based on a single variable, without addressing complex relationships; it aims to describe the data and identify existing patterns
Bivariate analysis: compares two variables to determine how the features affect each other and to identify possible causes
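As referenced above, here is a small sketch of these EDA steps in Python; the file name data.csv and the columns age and fare are hypothetical placeholders:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data collection: load the raw data from its source
df = pd.read_csv("data.csv")

# Data cleaning: remove duplicates and rows missing the key columns
df = df.drop_duplicates()
df = df.dropna(subset=["age", "fare"])

# Univariate analysis: describe one variable at a time
print(df["age"].describe())
sns.histplot(df["age"])
plt.show()

# Bivariate analysis: compare two variables to see how they relate
print(df[["age", "fare"]].corr())
sns.scatterplot(data=df, x="age", y="fare")
plt.show()
```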
7. Visualization
Visualization is one of the most important pillars of any data analysis project: it is the technique that ultimately makes the data clear to see, and getting effective results depends on choosing the right set of visualizations for the different types of data.
Types of visualizations:
Histogram
Bar chart
Bubble chart
Radar chart
Waterfall chart
Pie chart
Line chart
Area chart
Tree map
Scatter plot
Box plot
The most important visualization tools:
Tableau: the most popular tool for data visualization, producing polished analysis results at the required speed
Power BI: an interactive program developed by Microsoft that is often used in business intelligence
Google Charts: widely used by the analyst community for its ready-made graphical visualizations
Jupyter: this web-based application makes it convenient to create and share documents that combine code and visualizations
In short, visualization is the process of presenting data visually so that insights can be seen without wading through all the raw information.
I hope I have succeeded in identifying the most important points that help a beginner in data science find their feet and grow into a data scientist who keeps developing and refining their skills.
It is certain that many of you, dear readers, have knowledge of other important points that I did not mention. Share them with us in the comments, Thank you.
Data sets often contain errors or inconsistencies, especially when collected from multiple sources. In these cases, it is necessary to organize that data, correct errors, remove redundant entries, work to organize and format data, and exclude outliers. These procedures are called data cleaning.
The purpose of data cleaning
This process aims to detect any defect in the data and deal with it from the beginning, thus avoiding wasting time spent on arriving at incorrect results
In other words, early detection and fixing of errors leads to correct results
This fully applies to data analysis. Going with clean and formatted data enables analysts to save time and get the best results.
Here is an example showing the stages of data cleaning:
In this example we used Jupyter Notebook to run Python code inside Visual Studio Code
This stage aims to identify the data structure in terms of type and distribution in order to detect errors and imbalances in the data
This step prints the first and last 10 entries of the data set so you can get a feel for its structure; for example, the first entries can be displayed with df.head(10).
We notice some NaN entries in the choice_description column
and a dollar sign in the item_price column
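A minimal sketch of this inspection step (the file name orders.csv is an assumption):

```python
import pandas as pd

# Load the data set (file name is an assumption)
df = pd.read_csv("orders.csv")

# Inspect the first and last 10 entries
print(df.head(10))
print(df.tail(10))
```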
B. Data types of columns
You must now determine what type of data is in each column
In the following code, we define the column names and data types in an organized and coordinated manner
The output is:
The third stage: data cleaning
a. Change the data type
If the work requires converting data types, this is done after inspecting the data.
In our example, item_price includes a dollar sign; we can remove the sign and convert the column to float64 because it contains decimal numbers.
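A sketch of that conversion in pandas, assuming the DataFrame is named df:

```python
# Remove the dollar sign and convert item_price to float64
df["item_price"] = (
    df["item_price"]
    .str.replace("$", "", regex=False)
    .astype("float64")
)
print(df["item_price"].dtype)  # float64
```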
b. Missing or empty values
The stage of searching for missing values in the data set comes:
The output is:
In the output above, a null value is represented by True and a non-null value by False. We then count the null entries per column with a sum, since we cannot inspect every value in the table by eye.
This shows which columns contain null values and how many entries are empty; the choice_description column is the one with empty entries, 1,246 of them.
We can also check for null values column by column and count them, as in the following image.
In our example, only one column contains null values.
It is important to calculate the percentage of missing values in each column because, especially with large data sets, empty values may appear in several columns.
The output is:
Here the choice_description column is about 27% missing. That does not justify deleting the whole column, since it is below the roughly 70% threshold at which dropping a column is usually preferable.
Another approach to dealing with missing values when cleaning data is to depend on the type of data and the defect to be addressed
To clarify further, we take the "choice_description" column; to understand the problem, we examine the unique entries in this column to find more options for handling it.
Now we check how many null values choice_description contains.
Since the missing values correspond to the customer's choice, we can assume these customers did not give details of their order.
We therefore replace the null values with "Regular Order".
The output is:
Now let's make sure that no null values remain.
By replacing the null values with a description, we got rid of all the missing values and improved our data.
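A compact sketch of this missing-value handling in pandas, continuing with the df DataFrame:

```python
# Count null entries per column
print(df.isnull().sum())

# Percentage of missing values in each column
print(df.isnull().mean() * 100)

# The missing choices correspond to customers who gave no details,
# so replace them with a default label
df["choice_description"] = df["choice_description"].fillna("Regular Order")

# Confirm that no null values remain (should print 0)
print(df.isnull().sum().sum())
```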
c. Remove duplicates
Now we check the number of duplicate entries and then remove them. A row is only deleted as a duplicate if it matches another row in every column; if at least one value differs, it is kept.
We can check by running the code
The output is:
We will now delete duplicate entries
As a precautionary step we will make sure that there are no duplicate entries again
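A sketch of the duplicate check and removal:

```python
# Count fully duplicated rows
print("duplicates:", df.duplicated().sum())

# Remove them
df = df.drop_duplicates()

# Precautionary check: should print 0
print("duplicates after cleaning:", df.duplicated().sum())
```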
d. Delete extra spaces
That is, getting rid of useless extra spaces between letters and words.
This task can be carried out with any of the following (a small sketch follows the list):
String processing functions
regular expressions
Data cleaning tools
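A small sketch using string functions and a regular expression; the item_name column is an assumption for illustration:

```python
# String functions: trim leading and trailing spaces in all text columns
text_cols = df.select_dtypes(include="object").columns
for col in text_cols:
    df[col] = df[col].str.strip()

# Regular expression: collapse repeated internal spaces into a single space
# (item_name is a hypothetical text column)
df["item_name"] = df["item_name"].str.replace(r"\s+", " ", regex=True)
```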
Fourth stage: data export
This step involves exporting the clean data keeping in mind that in our example we are working on a narrow and simplified scale
This code writes the cleaned data to a new CSV file named cleaned_data.csv in the same path as our Python script; the file name and path can be changed as required.
The argument index=False tells pandas not to include row index numbers in the exported data.
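In pandas form, that export step is simply:

```python
# Export the cleaned data; index=False omits the row index numbers
df.to_csv("cleaned_data.csv", index=False)
```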
Fifth stage: data visualization using Tableau
We have reached the end of the data cleaning journey: the clean data is exported for visualization and is now ready for easy analysis.
A data science employee often struggles during the first period in a new job: some chaos, instability, lack of organization, and perhaps difficulty adapting and confusion, especially in the early days. The new employee must overcome these obstacles, which in my opinion are a normal part of the first steps toward success and growth.
What we will discuss in this article is how to create the right conditions for building a successful team in the data science job
Co-workers are the environment that helps each member of the group progress and develop. As a junior employee, a colleague who was hired shortly before you can answer your beginner questions, and soon you will grasp the basics of how the job works. As your level grows, you can start absorbing experience and skills from more senior, competent colleagues, until you become the experienced employee who can discuss deeper and more precise topics with the manager; managers generally favor employees who offer suggestions and drive effective discussion by sharing valuable opinions and workable solutions.
You will likely agree that in any successful working community, whether a company, an institution, or the private sector, the basis of success lies in a spirit of cooperation and goodwill among team members at all levels.
For the onboarding stage, especially the first month on the job, we recommend asking a lot of questions; it is an ideal period for absorbing information, setting priorities, and learning the vocabulary. The following guidelines help:
1. Be sure to join the guidance units provided by your company, which are dedicated to guiding new employees, as they are capable of informing you of the company’s policy and approach in terms of privacy, security and ethics, and you will also be able to request comprehensive guides for what you need.
2. Keep learning about the team's work so you can keep up with it, through continuous communication with your manager. Try to make suggestions that move the company's work forward, and find out what challenges the company faces so you can build a plan, based on your skills and approach, for overcoming problems and meeting those challenges.
3. If there is no internal repository for publishing analytics work, take the opportunity to collect examples and create one; such repositories become very valuable to the team and to future employees. Also get to know the company's previous work and projects, from before you were hired, to get an idea of how upcoming projects will run.
4. Try to stay abreast of current issues in the company by joining e-mail subscriptions and other chat platforms. Joining these channels, getting to know their users, and sharing ideas and experiences with them helps you gain more experience.
5. Make sure to introduce yourself in front of your manager and your colleagues through the meetings, and try briefly to present some of your work and projects that you have undertaken and the solutions that you presented during the implementation of the projects. It will increase their confidence in you.
We have already explained in a previous article how to build a business portfolio in the field of data science. For information, click here
6. It is necessary to know the main contacts in the company so you should request a list of contacts from your manager or colleagues
Start building your own data science ecosystem
1. The first step is to prepare your computer with login and remote access information, download the software your work needs, get technical support, and don't forget the necessary equipment and devices.
2. It is very important to request access to information as soon as possible after you are hired, since provisioning takes time and the time factor matters here. Take the initiative to ask your manager and the people in charge about the data sets you need access to, and request a list of the sites and tools you may need in your work.
3. Definitely don't forget to download the software that your team relies on for day-to-day work, such as programming languages and data visualization tools.
4. Domain understanding: this is essential to help you make sure the data is interpreted correctly when doing analysis or using a machine learning model.
The proficiency stage
After completing the preparation stage, you must establish yourself and prove your competence by following these steps:
1. Start your career journey by getting to know your colleagues and introducing yourself to them, such as asking your manager to work with them by appointing you to the team. Share your opinions and experiences with them, even if they are modest. This will help them determine the level of interaction with you and will help build a spirit of cooperation and participation among team members.
2. First impressions matter: they are formed in your colleagues' minds from the first meeting, both at the level of your character and your technical level. In terms of character, people generally warm to someone humble, friendly, and tolerant, and seek out their friendship. At the technical level, when your colleagues find you someone who loves to cooperate and share ideas and experience, you will stand out as a model of a capable employee.
3. Let others know about the nature of your work and your main mission in the company, and keep them up to date with your work style and achievements, such as placing links in newsletters and presenting them to the team
Finally..
I believe that by following these steps it is possible to get through the most difficult period of starting a new job; these are the points that came to mind on the subject.
My friends, if you think there are things we did not mention that could help in building a successful work team, share them with us in the comments so we can put what we have read into practice and build a small team whose members exchange information and experience. Thank you.
1. Linear regression
This term refers to a statistical analysis that models the relationship between two continuous variables, one independent and one dependent.
It is used to find the best-fitting line through a set of data points, which in turn supports future predictions.
The simple linear regression equation is as follows:
y = b0 + b1*x
y is the dependent variable
x represents the independent variable
b0 represents the y-intercept (the point of intersection of the y-axis with the line)
b1 represents the slope of the line
Using the method of least squares, we obtain the best-fitting line, i.e. the line that minimizes the sum of the squared differences between the actual and predicted values of y.
We can also extend linear regression to several independent variables; it is then called multiple linear regression, whose equation is as follows:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
x1, x2, …, xn represent the independent variables
b1, b2, ..., bn represent the corresponding coefficients
As mentioned above, linear regression is useful for obtaining future predictions, as is the case when predicting stock prices or determining future sales of a specific product, and this is done by making predictions about the dependent variable
However, the regression model can be inaccurate when there are outliers that do not follow the general direction of the data.
To handle outliers in linear regression, the common options are:
- Removing outliers from the data set before training the model
- Reducing the effect of outliers by applying a transform, such as taking the log of the data
- Using robust regression methods such as RANSAC or Theil-Sen, which mitigate the negative impact of outliers more effectively than ordinary linear regression
However, it cannot be denied that linear regression is an effective and commonly used statistical method
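A small sketch of ordinary and robust linear regression with scikit-learn, on synthetic data invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data: y = 2x + 1 plus noise, with a few injected outliers
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=100)
y[:5] += 30  # outliers

# Ordinary least squares: minimizes the sum of squared errors
ols = LinearRegression().fit(X, y)
print("OLS slope:", ols.coef_[0], "intercept:", ols.intercept_)

# RANSAC: fits on inliers only, reducing the impact of the outliers
ransac = RANSACRegressor().fit(X, y)
print("RANSAC slope:", ransac.estimator_.coef_[0])
```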
2. Logistic regression
It is a statistical method used to predict a binary outcome (one of two options) from one or more independent variables. This type of regression is used for classification tasks such as predicting customer behavior.
Logistic regression is based on a sigmoid function that maps the input variables to a probability between 0 and 1, which is then used to predict the outcome.
Logistic regression is represented by the following equation:
P(y = 1|x) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bn*xn))
P(y = 1|x) represents the probability that the outcome of y is 1 compared to the input variables x
b0 represents the intercept
b1, b2, …, bn represent the coefficients of the input variables x1, x2, …, xn
By training the model on a data set with an optimization algorithm, the coefficients are determined; they are then used to make predictions by plugging in new data and computing the probability that the result is 1.
In the following diagram we see the logistic regression model
By examining the previous diagram, we find that the input variables x1 and x2 were used to predict the result y, which has two options.
This regression is tasked with assigning the input variables to a probability that will determine in the future the shape of the expectation of the outcome
The coefficients b1 and b2 are determined by training the model on a data set and setting the threshold to 0.5.
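A minimal logistic regression sketch with scikit-learn, on synthetic data invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data with two input features x1 and x2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("coefficients b1, b2:", model.coef_[0])
print("intercept b0:", model.intercept_[0])

# predict_proba gives P(y = 1 | x); the default decision threshold is 0.5
new_point = np.array([[0.4, -0.1]])
print("P(y=1):", model.predict_proba(new_point)[0, 1])
print("predicted class:", model.predict(new_point)[0])
```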
3. Support Vector Machines (SVMs)
SVM is a powerful algorithm for both classification and regression. It divides data points into different categories by finding the optimal hyperplane with maximum margin. SVMs have been successfully applied in various fields, including image recognition, text classification, and bioinformatics.
SVMs are especially useful when the data cannot be separated by a straight line: a kernel function can map the data into a higher-dimensional space, making it easier to detect nonlinear boundaries.
SVMs use memory efficiently, since they only need to store the support vectors rather than the entire data set, and they remain effective in high-dimensional spaces, even when the number of features exceeds the number of samples.
The technique is relatively robust to outliers because it depends only on the support vectors.
However, one of the drawbacks of this technique is that it is sensitive to kernel function selection, and it is not effective for large data sets, as its training time is often very long.
4. Decision Trees:
Decision trees are multi-pronged algorithms that build a tree-like model of decisions and their possible outcomes. By asking a series of questions, decision trees classify data into categories or predict continuous values. They are common in areas such as finance, customer segmentation, and manufacturing
So, it is a tree-like diagram, where each internal node forms a decision point, while each leaf node represents a prediction.
To explain how the decision tree works:
The process of building the tree begins with selecting the root node so that it is easy to sort the data into different categories, then the data is iteratively divided into subgroups based on the values of the input features in order to find a classification formula that facilitates the sorting of the different data or required values
The decision tree diagram is easy to understand as it enables the user to create a well-defined visualization that allows the correct and beneficial decision-making
However, it should be noted that the deeper the decision tree and the more leaves it has, the greater the risk of overfitting the data, and this is one of the drawbacks of decision trees.
If we want to talk about other negative aspects, it must be noted that the decision tree is often sensitive to the order of the input features, and this leads to different tree diagrams, and on the other hand, the final tree may not give the best result.
5. Random Forest:
The random forest is a group learning method that combines many decision trees to improve prediction accuracy. Each tree is built on a random subset of the training data and features. Random forests are effective for classification and regression tasks, finding applications in areas such as finance, healthcare, and bioinformatics.
Random forests are used if the data in a single decision tree is subject to overfitting, thus improving the model with greater accuracy
The forest is formed using the bootstrapping technique, which generates multiple decision trees.
Bootstrapping is a statistical method based on randomly sampling data points, with replacement, from the original data set. The result is multiple data sets, each containing a different mix of data points, which are later used to train the individual decision trees.
Random forest allows to improve overall model performance by reducing the correlation between trees within a random forest because it relies on using a random subset of features for each tree and this method is called “random subspace”.
One of the drawbacks of a random forest is the higher computational cost of training and predictions as the number of trees in a forest increases
A random forest is also less interpretable than a single decision tree, but it compensates by being less prone to overfitting and better able to handle high-dimensional data sets.
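A short random forest sketch with scikit-learn, using a small public data set purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Small public data set used only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample and a random
# subset of features; their predictions are then combined
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```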
6. Naive Bayes
Naive Bayes is a probability algorithm based on Bayes' theorem with the assumption of independence between features. Despite its simplicity, Naive Bayes performs well in many real-world applications, such as spam filtering, sentiment analysis, and document classification.
Based on Bayes' theorem, the probability of a particular class is calculated from the values of the input features.
There are different types of probability distributions when implementing the Naive Bayes algorithm, depending on the type of data
Among them:
Gaussian: for continuous data
Multinomial: for discrete data
Bernoulli: for binary data
Turning to the advantages of using this algorithm, we can say that it enjoys its simplicity and quality in terms of its need for less training data compared to other algorithms, and it is also characterized by the ability to deal with missing data.
On the negative side, the algorithm depends on the assumption of independence between features, which often does not hold in real-world data.
It is also negatively affected when new data contains feature values that differ from those seen in the training set, which reduces its performance and efficiency.
7. KNN
KNN is a non-parametric algorithm that classifies new data points based on their proximity to the labeled examples in the training set. It is widely used in pattern recognition and recommendation systems.
KNN can handle classification and regression tasks.
That is, it relies on the assumption that similar data points lie close to each other.
After choosing the value of k, the data is split into training and test sets. To make a prediction for a new input, the distance between the input and every data point in the training set is calculated, the k nearest points are selected, and the prediction is then formed from that set of nearest neighbors.
8. K-means
The working principle of this algorithm is based on the random selection of k centroids
So that k represents the number of clusters we want to create and then each data point is mapped to the cluster that was closest to the central point
So it is an algorithm that relies on grouping similar data points together and it is based on distance so that distances are calculated to assign a point to a group
This algorithm is used in many market segmentation, image compression and many other widely used applications
The downside of this algorithm is that its assumptions for data sets often do not match the real world
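A minimal K-means sketch with scikit-learn on synthetic points invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2-D clusters
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

# k = 3 centroids are chosen, then each point is assigned to the nearest one
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("first labels:", kmeans.labels_[:10])
```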
9. Dimensional reduction algorithms
These algorithms aim to reduce the number of features in a data set while preserving the essential information; the technique is called "dimensionality reduction".
Like many dimension reduction algorithms, this algorithm makes data visualization easy and simple.
Examples include Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA),
and t-distributed Stochastic Neighbor Embedding (t-SNE).
We will explain each one separately.
* Principal Component Analysis (PCA): It is a linear pattern of dimension reduction. Principal components can be defined as a set of correlated variables that have been orthogonally transformed into uncorrelated linear variables. Its aim is to identify patterns in the data and reduce its dimensions while preserving the necessary information.
* Linear Discrimination Analysis (LDA): is a supervised dimensionality reduction pattern used to obtain the most discriminating features of the sorting and classifying function
* t-distributed Stochastic Neighbor Embedding (t-SNE): a well-proven nonlinear dimensionality reduction technique for visualizing high-dimensional data, producing a low-dimensional representation that preserves the structure of the data.
The downside of the dimension reduction technique is that some necessary information may be lost during the dimension reduction process
It is also necessary to know the type of data and the task to be performed in order to choose the dimension reduction technique, so the process of determining the appropriate number of dimensions to keep may be somewhat difficult.
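A minimal PCA sketch with scikit-learn, using a small public data set for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto 2 uncorrelated principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```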
10. Gradient boosting and AdaBoost algorithms
They are two algorithms used in classification and regression functions and they are widely used in machine learning
The working principle of these two algorithms is based on forming an effective model by collecting several weak models
Gradient boosting:
It builds the model in stages, starting by fitting a simple model (such as a decision tree) to the data and then correcting the errors of the previous models by adding further models; each added model is fitted to the negative gradient of the loss function with respect to the predictions of the previous model.
In this way, the final output of the model is the result of assembling the individual models
AdaBoost:
AdaBoost is short for Adaptive Boosting. It is similar to gradient boosting in that it builds the model in a forward, stage-wise manner, but it differs by focusing on improving weak models through re-weighting the training data at each iteration: examples the previous model got wrong receive higher weights, so they are more likely to be emphasized in the next iteration, until a final weighted combination of all the individual models is produced. Both algorithms can handle a wide range of numerical and categorical data, and they are robust to outliers and missing values, which is why they are used in many practical applications.
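A short sketch of both algorithms with scikit-learn, on a small public data set used only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting: each new tree corrects the errors of the ensemble so far
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# AdaBoost: misclassified samples receive higher weights at each iteration
ada = AdaBoostClassifier(random_state=0).fit(X_train, y_train)

print("gradient boosting accuracy:", accuracy_score(y_test, gb.predict(X_test)))
print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))
```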
Structured Query Language (SQL) is the standard query language for relational databases. This language is simple and easy to understand, but moving to an advanced level in data analysis requires mastering the advanced techniques of this language.
And when we talk about the techniques that need to be learned to move to an advanced level, we are talking about a system of functions and features that allow you to perform complex tasks on data such as joining, aggregation, subqueries, window functions, and other functions that can deal with big data to obtain effective and accurate results.
Some vivid examples of using advanced SQL techniques
* Window functions
With this technique you can perform calculations across a set of rows that are related to the current row.
For example, if we have a table with the following columns:
order_id, customer_id, order_date and order_amount
It is required to calculate the running total of sales for each individual customer, ordered by order date.
SUM can be used to perform this task
To calculate the running total for each individual customer, the SUM function is applied to the order_amount column, partitioned by the customer_id column.
ORDER BY indicates that the rows are ordered by order date within each partition.
The clause:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
indicates that each calculation uses the window running from the first row of the partition up to the current row.
The result of the query is a table with the same columns as the orders table, plus a column called run_total, which holds the running total of sales for each customer, ordered by order date as mentioned.
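A runnable sketch of this running-total query, here using Python's built-in sqlite3 module and assuming an SQLite build with window-function support (3.25 or later); the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER, customer_id INTEGER,
    order_date TEXT, order_amount REAL)""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 1, "2023-01-01", 100.0), (2, 1, "2023-01-05", 50.0),
     (3, 2, "2023-01-02", 80.0), (4, 2, "2023-01-07", 20.0)],
)

# Running total of sales per customer, ordered by order date
query = """
SELECT order_id, customer_id, order_date, order_amount,
       SUM(order_amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS run_total
FROM orders
ORDER BY customer_id, order_date;
"""
for row in conn.execute(query):
    print(row)
```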
* Common Table Expressions (CTEs)
CTEs let you define a named result set that can be referenced later in the same SQL statement.
For example: We have a table, let’s call it the Employees table, composed of the following columns:
employee_id, employee_name, department_id and salary
What is required is to calculate the average salary for each department, then search for employees with a higher salary than the average salary of the department to which they belong.
CTE can be used to perform two queries, the first to calculate the average salary of each department and the second to search for employees with higher salaries than the average salary of the department
We note here that the task has been divided into two phases to facilitate the query
The first stage is calculating the average salary for each department
The second stage is to find the employees whose salary is higher than the average salary of the department to which they belong
In the first part, a CTE called department_avg_salary calculates the average salary for each department, using the AVG function and a GROUP BY clause that groups employees by the department they belong to.
In the main query, the department_avg_salary CTE is used as if it were a table: it is joined to the employees table on the department_id column, and a WHERE condition then filters the result to the employees whose salary is higher than the average salary of their department.
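A runnable sketch of this CTE pattern with sqlite3; the names and salaries are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    employee_id INTEGER, employee_name TEXT,
    department_id INTEGER, salary REAL)""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Amal", 10, 5000), (2, "Basel", 10, 7000),
     (3, "Carla", 20, 4000), (4, "Dani", 20, 6500)],
)

# The CTE computes the average salary per department; the main query keeps
# only employees earning more than their department's average
query = """
WITH department_avg_salary AS (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.employee_name, e.department_id, e.salary
FROM employees AS e
JOIN department_avg_salary AS d
  ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
"""
for row in conn.execute(query):
    print(row)
```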
* Aggregate functions
Aggregate functions can be defined briefly as functions whose task is to perform an arithmetic operation on a group of values to derive a result in the form of a single value, such as performing arithmetic operations in a table on several rows or columns in order to obtain a useful data summary.
In fact, the use of aggregate functions is a real advantage in the SQL language, as it makes queries in it more easy and accurate.
The functions SUM, AVG, MIN, MAX, and COUNT are the most used in SQL
To be clear: we have a sales table composed of the following columns
sale_id, product_id, sale_date, sale_amount, and region
It is required to calculate the total sales and the average sales for each product separately, and then determine the best-selling product in each region
This is done by following the following steps
We have to sort the sales by product and region, calculate the total and average sales, then discover the best-selling product in the region by using aggregate functions
In this example, we use the aggregate functions AVG and SUM together with the RANK window function
We will explain the task of each of them separately
The AVG function calculates the average sale amount per product and region
The SUM function calculates the total sale amount for each product and region, grouped using the GROUP BY statement
The RANK function ranks the products to find the best-selling product in each region
The OVER clause defines the window used for the ranking
The PARTITION BY statement specifies the column used to divide the data (region)
The ORDER BY statement sorts each region by the total sale value of each product in descending order
The result of the query has the following columns:
product_id, region, total_sale_amount, avg_sale_amount, and rank
The rank column shows each product's position within its region by total sale value, so the best-selling product in each region ranks first.
The uses of aggregate functions vary according to the tasks assigned to them. For example, you can calculate records, calculate maximum values, and other tasks.
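A runnable sketch of this aggregation-plus-ranking query with sqlite3 (window-function support assumed); the sample sales rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER, product_id INTEGER, sale_date TEXT,
    sale_amount REAL, region TEXT)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "2023-01-01", 100, "North"), (2, 2, "2023-01-02", 300, "North"),
     (3, 1, "2023-01-03", 150, "South"), (4, 2, "2023-01-04", 50, "South"),
     (5, 1, "2023-01-05", 200, "South")],
)

# Total and average sales per product and region, plus each product's rank
# within its region by total sales (rank 1 = best seller)
query = """
SELECT product_id, region,
       SUM(sale_amount) AS total_sale_amount,
       AVG(sale_amount) AS avg_sale_amount,
       RANK() OVER (
           PARTITION BY region
           ORDER BY SUM(sale_amount) DESC
       ) AS sale_rank
FROM sales
GROUP BY product_id, region;
"""
for row in conn.execute(query):
    print(row)
```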
* Pivot tables
Pivot tables contain data extracted from larger tables so it can be analyzed more easily; they convert data from rows to columns so the data can be displayed in a more organized way.
These tables are built using the PIVOT operator, whose task is to group the data by a specific column and then show the results as a formatted table.
To clarify
The PIVOT operator in the previous image is used to define the data axis by product_id plus columns per product and rows per customer
The SUM function calculates the total quantity of each product required by each customer
The p subquery extracts the necessary columns from the orders table
Then the PIVOT is run on the subquery in conjunction with the SUM function to find out the total quantity of each product ordered by each customer
The FOR statement is tasked with specifying the pivot column product_id in our example
The IN statement specifies target values ( [1], [2], [3], [4], [5] )
The pivot table appears as a result of a query for the total quantity ordered by each customer in the form of columns for each product and rows for each customer
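The PIVOT operator described above is specific to certain database engines such as SQL Server; as an equivalent illustration of the same reshaping, here is a pandas pivot_table sketch on invented sample data:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "product_id":  [1, 2, 1, 3, 2],
    "quantity":    [2, 1, 4, 3, 5],
})

# One row per customer, one column per product, summing the ordered quantity
pivot = orders.pivot_table(
    index="customer_id",
    columns="product_id",
    values="quantity",
    aggfunc="sum",
    fill_value=0,
)
print(pivot)
```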
* Subqueries
Subqueries are nested queries that retrieve data from one or more tables; their results are used by the main query, and they can return a single value, a single row, or a set of rows.
A subquery is written inside parentheses and can appear in various places within an SQL statement, such as the SELECT list, the FROM clause, the WHERE clause, or the HAVING clause.
To be clear: we have two tables
The first table is the employees table consisting of the following columns
employee_id, first_name, last_name, department_id
The second table is the payroll table and consists of the following columns
employee_id, salary, salary_date
It is required to know the highest paid employees in each department
We can find the highest salary in each department using a subquery, and then join the result to the employees and payroll tables to extract the names of the employees who earn that salary.
The subquery is executed first and returns a result set containing the highest salary in each department; the main query then joins the employees and payroll tables to that result to extract the names of the highest-paid employees in each department.
To perform this join, an INNER JOIN is used to connect the employees and payroll tables, with the employee_id column as the join key.
The subquery is joined to the main query on the department_id column.
The salary column is then used to match the highest salary in each department; a sketch of the full query follows.
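The query itself is not shown in the original, so the following is only one possible way to write it, using the table and column names from the text:

SELECT e.first_name, e.last_name, e.department_id, p.salary
FROM employees e
INNER JOIN payroll p ON p.employee_id = e.employee_id
INNER JOIN (
    SELECT e2.department_id, MAX(p2.salary) AS max_salary
    FROM employees e2
    INNER JOIN payroll p2 ON p2.employee_id = e2.employee_id
    GROUP BY e2.department_id
) m ON m.department_id = e.department_id AND p.salary = m.max_salary;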
The result appears in the form of a table containing the names of the highest paid employees in each department along with the department ID and salary
* Cross Joins
A cross join is a type of join that returns the Cartesian product of two or more tables. It uses no join condition; instead, every row of one table is combined with every row of the other, so the result is a table consisting of all possible combinations of rows from both tables.
This operation is useful in certain circumstances, such as performing a calculation that requires every available combination of values from a set of tables, or generating test data.
For clarity we have two tables
The first table is the customers table and it consists of columns
customer_id, customer_name, and city
The second table is the orders table and it consists of columns
order_id, customer_id, and order_date
The requirement is to know the total number of orders for each customer in each city.
This is done by creating a result set that pairs each customer with each city, and then joining that result to the orders table to count the orders for each combination.
A cross join is used to produce the result set that pairs every customer with every city.
The main query then joins the result of the cross join with the orders table.
Important notes:
A LEFT JOIN should be used here so that customers remain visible in the result even if they did not place any orders.
To ensure that the order count for each customer is reported against the customer's own city, a WHERE clause filters the cross-join rows to those whose city matches the city in which the customer resides.
To group the result by the customer's ID, name, and city, we use the GROUP BY clause.
To calculate the number of orders per customer in each city, we use the COUNT() function; putting these pieces together gives a query like the sketch below.
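The exact query is not reproduced in the original; one way to assemble the pieces described above (deriving the list of cities from the customers table is an assumption) is:

SELECT c.customer_id, c.customer_name, ci.city, COUNT(o.order_id) AS total_orders
FROM customers c
CROSS JOIN (SELECT DISTINCT city FROM customers) ci
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE ci.city = c.city
GROUP BY c.customer_id, c.customer_name, ci.city;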
The result is finally shown in the form of a table containing the total number of orders for each customer in each city
* Temporary tables
These tables are relied upon to store intermediate results in memory or on disk, use them for the duration of the work, and then discard them automatically.
This type of table is also used to break large, complex queries into smaller parts that are easier to process.
The CREATE TEMPORARY TABLE statement is used to create temporary tables.
The SELECT, INSERT, UPDATE, and DELETE commands work on these tables just as they do on regular tables, which helps reduce large, complex data into something easier to process.
For clarity, we have a sales table consisting of the following columns
date, product, category, sales_amount
It is required to create a report showing the total sales for each category for each month over the past year
We can address this issue through the following actions
The first goal is to obtain the total sales for each category. This is done by creating a temporary table that includes a summary of sales data for each month, and then linking it to the sales table.
This is done by following these steps
Create the temporary table using the CREATE TEMPORARY TABLE statement
A temporary table named Monthly_sales_summary is created with three columns:
month, category, and total_sales
The month column is of type DATE
category column of type VARCHAR (50)
total_sales column of type DECIMAL(10,2)
Using the INSERT INTO statement, we populate the temporary table with the summarized data.
To truncate the date column to the month level and group the sales data by month and category, we use the DATE_TRUNC function.
We then insert the result of this query into the monthly_sales_summary table, which now contains a summary of sales data for each month separately; the statements are sketched below.
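The statements themselves are not shown in the original; a sketch in PostgreSQL-style SQL, using the names from the text, might look like this:

CREATE TEMPORARY TABLE monthly_sales_summary (
    month DATE,
    category VARCHAR(50),
    total_sales DECIMAL(10,2)
);

INSERT INTO monthly_sales_summary (month, category, total_sales)
SELECT DATE_TRUNC('month', date)::date AS month,
       category,
       SUM(sales_amount) AS total_sales
FROM sales
GROUP BY 1, 2;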
To get the total sales for each category, we can join the temporary table with the sales table.
The sales table is joined with the monthly_sales_summary table on the category and month columns.
The month, category, and total_sales columns are selected from the temporary table.
To restrict the result to last year's sales data, we use the WHERE clause.
To sort the result by category and month, we use the ORDER BY clause; a sketch of the report query follows.
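Again, the exact query is not reproduced in the original; one possible reading of the steps above (interpreting "last year" as the previous calendar year) is:

SELECT DISTINCT m.month, m.category, m.total_sales
FROM monthly_sales_summary m
JOIN sales s
  ON s.category = m.category
 AND DATE_TRUNC('month', s.date)::date = m.month
WHERE m.month >= DATE_TRUNC('year', CURRENT_DATE) - INTERVAL '1 year'
  AND m.month <  DATE_TRUNC('year', CURRENT_DATE)
ORDER BY m.category, m.month;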
The result of the query appears in the form of a table containing the total sales for each category for each month of the previous year
* Materialized Views
The purpose of materialized views is to improve the performance of frequently executed queries: the results of a query are computed in advance and stored as an actual table, so they can be read without re-running the underlying operations on the original tables.
They are used to speed up complex queries in data storage and business intelligence applications, which shortens report preparation time and makes dashboards more efficient.
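The statements from the example are not shown in the original; a sketch in PostgreSQL syntax, reusing the names from the running example, might look like this:

CREATE MATERIALIZED VIEW monthly_sales_summary AS
SELECT DATE_TRUNC('month', date)::date AS month,
       category,
       SUM(sales_amount) AS total_sales
FROM sales
GROUP BY 1, 2;

-- refresh manually when the underlying data changes
REFRESH MATERIALIZED VIEW monthly_sales_summary;

-- query it like any other table
SELECT category, month, total_sales
FROM monthly_sales_summary
ORDER BY category, month;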
In the example, a materialized view named monthly_sales_summary is created.
This view contains a summary of sales data for each category for each month.
The SELECT statement that defines the view is what produces the stored result.
Although materialized views resemble tables stored on disk, they are not necessarily kept up to date automatically when the underlying data changes; depending on the database system they may be refreshed on a schedule, and they can always be refreshed manually using
REFRESH MATERIALIZED VIEW
Once created, you can query the materialized view just like any other table.
In the example, the category, month, and total_sales columns are selected from the materialized view monthly_sales_summary, and the result is sorted by category and month.
As we mentioned earlier, the materialized view approach can save a lot of query time, since it allows you to pre-calculate and store summary data.
In conclusion:
Remember, dear reader, that keeping up with the accelerating pace of technology is very important; knowing the latest technologies and skills makes things easier for you and raises your level, whether in programming and data analysis or in any other field.
I hope you found this useful. Please share this information and support the blog so that we can continue to provide new content; we would be pleased to read your opinions in the comments. Thank you.
SQL is a powerful programming language dedicated to data in relational databases. It is a language that has existed for decades and is relied upon by many large companies around the world. Data analysts use it to access, read, process and analyze data saved in the database to form a comprehensive view that helps make the right decisions.
We will discuss in detail how this tool is used, in terms of its capabilities for querying databases, and we will also cover the types of data analysis.
Data analysis
Companies of all sizes and specializations seek to advance and grow, and their primary goal in doing so is to satisfy customers and provide them with the best services. By expanding the customer base, the company grows and thrives. For that reason, most companies examine, clean, transform, and model data to extract valuable information that helps in making critical decisions; this process is called data analysis.
Types of data analysis
Data analysis is classified according to the type of data and the purpose of the analysis:
Descriptive analysis:
This is the foundational analysis on which the other types are based. It is the simplest, and therefore the most widely used across business activities today. Descriptive analysis extracts trends from raw data and gives a view of events as they happened; it provides the initial answer to "what happened?" by summarizing past data, and it is usually presented in the form of a dashboard.
Diagnostic analysis:
This is the step that immediately follows, digging deeper into the question "What happened?" by asking another question: "Why did it happen?" Diagnostic analysis completes the work of descriptive analysis by taking its initial readings and interpreting them in greater depth, looking for more correlations in the data until patterns of behavior begin to emerge. A useful side effect is that if problems arise during the work, you now have enough data related to the problem, so the solution becomes easier and rework is avoided.
Predictive analytics:
Complementing the two previous analyses, predictive analytics, as its name suggests, produces probabilities and forecasts about future events based on historical data together with the current variables. It answers the third question: "What might happen in the future?"
This type of analysis helps companies make more accurate and effective decisions
Prescriptive analysis:
This is the furthest reach of data analysis: it is not limited to forecasting, but proposes options for acting on the results of the previous analyses and determines the steps to take when a potential problem arises or when forming a plan to develop the business. It relies on advanced techniques such as machine learning algorithms, especially when dealing with huge amounts of data.
So this analysis answers the question "What should we do next?", which defines the general direction of the company's business plan.
What are the advantages of SQL when used in data analysis?
* Easy and uncomplicated language
* Speed in query processing
* Ability to retrieve large amounts of data from different databases
* Rich documentation available to analysts
How SQL is used in data analysis
Temporary tables
Temporary tables in SQL are tables created for a temporary task that persist for a limited period or for the duration of a session; they store and process intermediate results and support the same join, select, and update techniques as regular tables.
Grouping as required (GROUP BY)
For example, this clause is used to count the number of employees in each department or to total a department's salaries; it extracts summary data based on different groups, over one column or several.
Aggregate functions
Their task is to perform an arithmetic operation on a set of values to produce a single value.
String functions and operations
SQL string operators and functions perform tasks such as pattern matching, working with sequences of characters, changing the case of a string, and other text operations.
Date and time operations
SQL offers many kinds of date and time functions, such as:
SYSUTCDATETIME()
CURRENT_TIMESTAMP
GETDATE()
DAY()
MONTH()
YEAR()
DATEFROMPARTS()
DATETIME2FROMPARTS()
TIMEFROMPARTS()
DATEDIFF()
DATEADD()
ISDATE()
and others; these are used to work with date and time values.
Views and indexing
Indexes are stored in the database itself, and indexing a view helps speed up work and improves the performance of the queries and applications that use it.
Join:
This statement combines different tables in a database using primary and foreign keys.
The different types of JOIN in SQL, described in terms of a left table and a right table, are as follows (a syntax sketch appears after the list):
(INNER) JOIN: Returns records that contain identical values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table and matching records from the right table
RIGHT (OUTER) JOIN : Returns all records from the right table and matching records from the left table
FULL (OUTER) JOIN : Returns all records when there is a match in the left or right table
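As a hedged illustration of the syntax, reusing the customers and orders tables from the earlier example:

SELECT c.customer_name, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

Replacing LEFT JOIN with INNER JOIN, RIGHT JOIN, or FULL JOIN changes which unmatched rows are kept, as described above; a CROSS JOIN takes no ON condition at all.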
Window functions
They operate over a set of rows and return one value per row of the underlying query, which helps keep queries as simple as possible.
Nested queries
A nested query is a query inside another query; the result of the inner query is used by the outer query.
Data analysis tools:
SQL: the standard language used to communicate with relational databases; it plays a major role in retrieving the required information.
Python: a versatile programming language that is very popular in technology and programming, and one no data analyst can do without. It emphasizes readability, so it is not considered a complex language, and it supports many different kinds of analysis.
R: its tasks and features are not much different from Python's, except that it specializes in statistical analysis of data.
Microsoft Excel: the world's best-known spreadsheet program, with many features ranging from tables and calculations to the standard charting functions used in data analysis.
Tableau: intended for creating visualizations and interactive dashboards without requiring deep coding expertise, which makes it a strong tool for business data analysis.
In conclusion
We have tried to put in your hands, dear reader, the essentials of the SQL language.
If you think there is information about this language that we did not mention, share it with us in the comments so we can exchange knowledge for everyone's benefit. Thank you.
Although the job market in data science requires skill and experience, having limited experience, or even none at all, does not prevent you from getting a data science job. How is that done? That is what we will discuss in this article.
Recent years have seen great interest in the development of data science of all kinds, driven by the big data generated by smart devices, the diversity of computing resources such as cloud computing, and, on the other hand, major advances in algorithms.
Add to that the diversity of the labor market for data science, which spans the health, transportation, and industrial sectors, in addition to academic, environmental, security, and other activities,
and the diversity of areas that branch out from data science itself, such as data analysis, predictive analysis, machine learning, deep learning, data visualization, and others.
All of these factors have increased the demand for data scientists, who enjoy a variety of employment options, including:
Data Scientist, Data Analyst, Predictive Analyst, Business Analyst, AI Writer, Data Visualizer, Data Engineer
So we are going to give you a set of tips that will help you get a job in data science
1. Learn key skills:
It is necessary to learn the basic principles of data science by following good online training courses, and preferably by obtaining a university degree. These skills include:
Problem Solving, Decision Making, Programming (Python or R), Statistics, Mathematics (Linear Algebra and Calculus), Machine Learning, Deep Learning, Data Visualization, Report Writing
Mastering these skills will increase your chance of getting a job in data science
2. Learn about data science libraries:
The most famous of these libraries:
NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn, Tensorflow, Keras
along with other libraries worth getting to know.
3. Stay up to date with new developments:
One may think that once he gets the job he no longer needs to keep up with new developments and technologies in this field, but that view is simply wrong. Staying abreast of developments in data science increases a learner's skills and experience, because interrupting your learning is the first enemy of progress and distinction.
4. Specialization in a specific field:
This applies especially to those who do not yet have the comprehensive experience that would qualify them for a data science job. Deep mastery of one of the sub-fields, such as machine learning or deep learning, is an effective weapon in the hands of its holder.
5. Self-training on practical experiences:
This advice applies particularly to learning and developing machine learning algorithms. After the learning stage comes the stage of writing code whose algorithmic outputs are produced from real data; this paves the way for you to modify code, produce new outputs, and make comparisons and analyses.
6. Take notes
Recording notes and everything you have learned will help you retrieve information when you need to refer to it; over time it will grow into a blog you can benefit from in the future and use to build your own brand.
7. Follow online training courses
Courses are widely available on the Internet, but be sure to follow reliable ones led by trainers with real standing in the field.
Start by learning the principles of data science, machine learning, deep learning, and other technologies.
I recommend courses offered by well-known platforms such as Coursera, which grant degrees in cooperation with some of the best universities in the world. It is not necessary to pay for courses for a novice learner to start developing his skills; the free courses are sufficient in such cases.
8. Support your CV with a professional certificate
Continuing from the previous paragraph, you can obtain a certificate after following a paid course. This certificate is an official document indicating your level of experience and skill.
9. Create a community of data scientists
It is one of the things that increase your chances of being accepted into a job in data science
The following platforms are fertile environments for building a community of data scientists
LinkedIn: A scientific community is built by creating and sharing data science posts on the platform
Medium: Through it, you can create a blog related to data science and build an information network
Kaggle: Through it, you can participate in data science competitions and build a network
10. Completion of projects in accordance with the requirements of the potential job
Complete projects related to the field you prefer to apply for. For example, if you want to apply for a job in data visualization, you should implement projects related to data visualization.
11. Start your career at a low job level
Working at lower job levels does not require a beginner to have a lot of experience, and as you gain more experience you can look for a higher-level job; for the inexperienced, the right start is a smaller-scale work environment.
12. Build a distinguished resume
Building a distinguished CV makes a positive impression on hiring decision makers, and thus supports your chances of getting the job.
A resume can be called distinguished if it has the factors we mentioned in a previous article; you can review them in detail here: How to write a killer resume and ace the interview.
With the development of data analysis tools and software, users of Tableau visualizations can save time and effort by taking advantage of the integration between ChatGPT and Tableau, automating processing with more flexibility.
How is that done? This is what we will explain in our article today. Let's get started.
As we mentioned, the process will be done using the ChatGPT application. What is the concept of this application?
We will not go into the complex technical details of how this application works, as that is not our topic today and we may devote a detailed explanation to it in the coming days; what interests us now is what serves our topic, namely the integration with Tableau.
ChatGPT is a conversational bot based on artificial intelligence, with impressive capabilities for holding conversations and responding to questions in language that resembles a natural human reaction. You can use it for a wide variety of tasks, including data visualization, which is our focus today.
First we need to install the OpenAI client library as a first step to start using ChatGPT, and then authenticate our credentials from JavaScript by entering a short snippet of code, sketched below.
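The original snippet is not reproduced here; a minimal sketch using the official openai package for Node.js (the environment variable name is an assumption) might look like this:

// npm install openai
import OpenAI from "openai";

// read the API key from an environment variable rather than hard-coding it
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });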
Once this process is complete, we can use ChatGPT to create visualizations in Tableau.
Why is ChatGPT integration important to Tableau functionality?
In short, this integration matters because it lets us answer difficult questions and queries easily, in natural language, and then visualize those answers in Tableau.
Through this integration we can also create interactive dashboards that help users find answers to their queries in a timely manner; being able to identify patterns and outliers in their data at high speed makes reaching sound decisions easier.
Now let’s learn how to integrate ChatGPT with Tableau
This is done by carrying out the following stages
Step 1: Connect Tableau to your data source
This is done by selecting the Connect button in the upper left corner of the Tableau interface and then selecting the data source
Step 2: Install and configure TabPy
TabPy is a Python package that allows us to use Python scripts in Tableau
First, install TabPy by entering the install command in the terminal, sketched below.
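The exact command is not shown in the original; TabPy is normally installed from PyPI, so the install step is typically:

pip install tabpy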
After completing the installation of TabPy, we proceed to configure it to work with Tableau, and this is done by running TabPy with the following command in the terminal
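Again, the exact command is not reproduced in the original; the TabPy server is normally started by running its console command, after which it listens on port 9004 by default:

tabpy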
Step 3: Install and configure the ChatGPT API
The ChatGPT API is a REST interface
At this stage we install and prepare the ChatGPT API. To be able to interact with the ChatGPT model, we install the OpenAI client library by entering a command in the terminal window, sketched below.
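The command itself is not shown in the original; the OpenAI Python client is installed from PyPI:

pip install openai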
Then we set up authentication: we obtain an API key by subscribing with OpenAI, and configure it in Python with a few lines of code, sketched below.
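The original code is not reproduced here; a minimal sketch using the classic openai package interface (the environment variable name is an assumption) is:

import os
import openai

# authenticate the OpenAI client with your API key
openai.api_key = os.getenv("OPENAI_API_KEY")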
Create integration between ChatGPT and Tableau (Python)
After successfully completing the previous steps, it remains to create the ChatGPT integration with Tableau
This is done by following these steps:
Step 1: Create a Python function that calls the ChatGPT API
The function's job here is to return ChatGPT's response to the queries passed into it.
This is what the following example shows
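The original example is not shown here; a hedged sketch consistent with the description (the function name and model are assumptions, using the classic openai package) is:

import openai

def chatgpt_query(prompt):
    # send the prompt to the ChatGPT model and return the text of the reply
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]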
Step 2: Use TabPy to register a Python function
This means registering the Python function with TabPy so that it can be used from Tableau; with the TabPy server running, this can be done from Python as sketched below.
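The original command is not reproduced here; one way to deploy the function with TabPy's client library is:

from tabpy.tabpy_tools.client import Client

client = Client("http://localhost:9004/")
# make chatgpt_query callable from Tableau calculated fields
client.deploy("chatgpt_query", chatgpt_query, "Returns a ChatGPT response for a prompt", override=True)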
Depending on your setup, TabPy may also use a configuration file to which a few settings are added (the exact lines are not reproduced here).
Save the file, then start (or restart) TabPy so the deployed function becomes available.
Step 3: Use the Python function in Tableau
To do this, we open a new workbook in Tableau and do the following:
1. Drag the "Text" object onto the dashboard.
2. Click on the text object, choose "Edit text", and in the dialog box type the formula that calls the deployed function (the exact formula is not reproduced in the original).
3. Click OK to close the text edit box.
4. Drag the Parameter object onto the dashboard.
5. In the "Create Parameter" dialog box, set the data type to "String", set the allowable values to "All", set the current value to an empty string, then click OK.
6. Right-click the Parameter object and select Show Parameter Control.
7. Type a query into the parameter's input text box and press Enter.
8. The reply from ChatGPT is displayed in the "Text" object; this is how ChatGPT and Tableau are wired together.
The integration may seem a tiring process at first, but doing it repeatedly, even on a small scale, will develop your skills and your ability to process data flexibly and quickly, and will help you troubleshoot and address problems more effectively than before.
Create visualizations:
Using ChatGPT:
The first thing we need to do is provide ChatGPT with the data to be visualized; after it receives the data, passed as a list or a table, it will create the visualization according to the requests given to it.
The JavaScript sketch below shows how such a visualization request might be written.
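The original code is not reproduced here; a hedged sketch using the openai Node.js package (the prompt wording, model, and variable names are assumptions based on the description that follows; client is the OpenAI client created earlier) is:

// build the request we want ChatGPT to answer
const prompt = "Create a bar chart of total sales by location for the data provided.";

// send the prompt to the model (inside an async context)
const completion = await client.completions.create({
  model: "gpt-3.5-turbo-instruct",
  prompt: prompt,
  max_tokens: 500,
});

// keep the generated output so it can be shown in Tableau later
const message = completion.choices[0].text;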
In this code we use the OpenAI API to generate a bar chart of sales by location.
We pass the request to ChatGPT through the prompt variable, we call the client.completions.create function to create the visualization, and at the end the resulting output, stored in the message variable, can be displayed in Tableau.
Customizing the resulting visualizations
We can customize the resulting visualizations to our requirements, changing the visualization type, size, and color style, by providing ChatGPT with additional parameters and instructions.
We can do this with JavaScript code along the lines of the sketch below.
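Again, the original code is not shown; a hedged sketch (the prompt wording and parameter values are assumptions matching the description that follows):

// ask for a specific chart type, colour and size
const prompt =
  "Create a line chart of quarterly earnings, drawn in blue, sized to fit a dashboard tile.";

const completion = await client.completions.create({
  model: "gpt-3.5-turbo-instruct",
  prompt: prompt,
  max_tokens: 500,
  temperature: 0.2, // keep the output predictable
});

const message = completion.choices[0].text;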
Keep in mind that experimenting with different parameters is a powerful way to create engaging and innovative visuals.
What we did in the previous code is create a line chart of quarterly earnings drawn in blue.
We passed our request to ChatGPT through the prompt variable,
and we specified the visualization style, so we get a line chart in blue, sized as requested.
Show the visuals in Tableau
As a reminder, all of the stages and procedures above are about creating a visualization using ChatGPT.
But we promised in this article that we would show the visualization in Tableau.
Well, don't worry, we're not done yet; let's go.
The first thing we have to do is copy the resulting visualization from the message variable and paste it into Tableau, by following these steps:
โข Create a new worksheet in Tableau
โข Select “Text” from the “Marks” section.
โข Paste the visualization copied from the message variable into the text box and adjust the size of the text box to fit the visualization
โข Congratulations.. The visualization has finally appeared in Tableau
At the end of today's article, allow me to anticipate things and gladly answer some questions that readers are likely to have.
Question 1: Are there free versions of ChatGPT?
Answer: Yes, there are free versions; although their capabilities are more limited, they are often sufficient.
Question 2: Can we integrate ChatGPT with visualization tools other than Tableau?
Answer: Yes, and this is done by following the same steps that we followed above
Question 3: Does ChatGPT give accurate answers?
Answer: Its answers are generally very accurate when the prompt and the information provided are entered correctly, though they should still be reviewed.
In the end, I hope you have found valuable information in this article as a data analyst seeking continuous development in your work. A successful person, my friend, is the one who does his work accurately and as quickly as possible.
If you found it useful, please share it with friends and support us by following the blog. We are honored to have you with us. Welcome.
Data engineering today enjoys a great deal of interest and unprecedented demand; many believe it will be the most important science of the near future and will occupy a prominent place within the family of data sciences, and some even consider data engineering the future of artificial intelligence.
This science derives its importance from being, so to speak, the backbone of data: the data infrastructure on which data science in all its branches depends.
Therefore, and because good data engineering projects are scarce, we put in your hands five projects that will help you build a strong portfolio and raise your chances when applying for any job related to data science.
Before moving on to the list of projects, please share this information and follow the blog to support us in continuing to provide useful content; we would be pleased to see your opinions and experiences in the comments. Thanks.
Let’s get to know the five projects:
1. Surfline Dashboard
In this project you will collect data from the Surfline API via a pipeline and export a CSV file to Amazon S3.
The goal of this project is to end up with a nice dashboard showing the data; to that end, the latest file is loaded into S3 and eventually fed into a Postgres data warehouse.
The second project requires creating, designing, and managing a data pipeline that extracts data from Crinacle's headphone and in-ear-monitor databases and ends with data feeding a Metabase dashboard.
In it you will learn AWS S3, Redshift, RDS, the data transformation tool dbt, and streaming.
The third project aims to provide users with real-time financial data on a solid foundation.
You will build and implement a data architecture that handles big data in real time, with streaming data pipelines based on the FinnHub.io WebSocket API, which is used for real-time data handling.
You will learn, for example:
Apache Kafka, Spark, Cassandra, Kubernetes and Grafana
The fourth project teaches you the main principles of Airflow and the skills of creating a data pipeline.
In a big data environment, the concept of a data pipeline is automatically associated with data engineering, and mastering data engineering goes hand in hand with mastering data pipeline skills.
5. Youtube data engineering project from start to finish
Frankly, this project is highly beneficial, so do not hold back from enriching your knowledge of data engineering. Besides learning how to understand and address problems, you will implement a complete data engineering project; the implementation will take you about three hours.
You will follow the trainer's instructions step by step, with the important points and necessary details highlighted.
Integrating ChatGPT into Power BI can simply change the way traditional data analysis and business intelligence development is done.
Through this integration it is also possible to obtain more effective reports for making decisive, timely decisions.
To get the desired benefit from these features, you must first develop your Power BI skills and then bring ChatGPT into your workflow; this is very simple, and with a few clicks you can get results and find solutions more quickly and effectively.
That is what we will explain in this article, including how this technology can also help you with DAX queries.
First, why should we integrate ChatGPT into Power BI?
Power BI is one of the most important data visualization and analysis tools, as its users know from their daily work with data, but when dealing with large data sets, working with DAX queries becomes more difficult.
When ChatGPT is integrated into that workflow, getting answers becomes faster and more accurate, so the pace of your work increases and becomes more flexible; ChatGPT also helps complete many other tasks, such as finding and fixing glitches, calculating metrics, and building complex calculations.
Turning to how to integrate ChatGPT into Power BI: we will call the ChatGPT API and use it in conjunction with Power BI's custom visuals feature.
This is done by following these steps:
1. Subscribe to the OpenAI API key: You must first obtain an API key to access the ChatGPT API
5. Create a new custom visual project using the Power BI command-line tools (pbiviz).
6. Open Terminal or Command Prompt and run the command that scaffolds the project, sketched below:
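The exact command is not shown in the original; with the powerbi-visuals-tools package installed globally, a new custom visual project is typically created with something like this (the project name is an assumption):

npm install -g powerbi-visuals-tools
pbiviz new ChatGPTVisual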
API call: in your new custom visual project, modify the src/visual.ts file to include the code necessary to make API calls to ChatGPT.
To make HTTP requests you will need a library such as axios, installed by running:
npm install axios
Then modify the src/visual.ts file, adding the necessary imports and a helper that calls the ChatGPT API.
Call the API in the visual's update function: modify the update method in src/visual.ts so that it calls the ChatGPT API and shows the result, e.g. using a text element to display the response from ChatGPT; a sketch follows.
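The original code is not reproduced here; a hedged TypeScript sketch of the idea (the helper name, model, and how the reply is displayed are assumptions, and a real visual should not hard-code its API key):

import axios from "axios";

// helper that sends a prompt to the OpenAI chat completions REST endpoint
async function askChatGPT(prompt: string, apiKey: string): Promise<string> {
    const response = await axios.post(
        "https://api.openai.com/v1/chat/completions",
        {
            model: "gpt-3.5-turbo",
            messages: [{ role: "user", content: prompt }],
        },
        { headers: { Authorization: `Bearer ${apiKey}` } }
    );
    return response.data.choices[0].message.content;
}

// inside the visual class, the update method can then show the reply in a text element, e.g.:
//   askChatGPT(userPrompt, apiKey).then(reply => { this.textNode.textContent = reply; });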
Import the custom visual after compiling it: once the code is complete, package the visual by running pbiviz package in the terminal; this creates a .pbiviz file in the dist folder.
In Power BI, import the custom visual by selecting the ellipsis (…) in the Visualizations pane, clicking the "Import from file" option, and selecting the generated .pbiviz file.
Add the visual to a Power BI report by selecting it from the Visualizations pane.
In the following example, we demonstrate how to obtain a DAX query by describing what we need to ChatGPT in plain language.
Then take a look at a DAX expression that produces the requested result; both are sketched below.
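The original prompt and expression are not reproduced here; a hedged, illustrative pair (the table and column names are hypothetical) might be:

Prompt: "Write a DAX measure that returns the total sales amount from the Sales table."

DAX measure:
Total Sales = SUM ( Sales[SalesAmount] )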
With this Power BI integration, you get near-instant answers that help speed up your workflow.
Typical DAX questions you can put to ChatGPT include writing a new measure, rewriting a slow calculation, or explaining what an existing expression does.
Moreover, if you run into an error message, ChatGPT can help diagnose and fix the bugs in your DAX expressions; as we mentioned at the beginning of the article, finding and fixing DAX bugs is one of the valuable tasks it helps with.
ChatGPT also saves a lot of time and effort when dealing with huge data sets, because you can use the AI chatbot visual when creating complex DAX expressions instead of writing each calculation manually.
My professional friends, this benefit is dedicated to you.. You deserve it
Now we'll go over a very important topic: how to integrate ChatGPT with Power BI using Python.
This is done by implementing the following steps:
Enable Python in Power BI Desktop
This is done by following these steps:
1. Install Python on your computer. If you do not have a copy of Python on your computer, you can get it from the official website: https://www.python.org/downloads/
2. Then you have to install the Python Compatibility feature in Power BI Desktop
3. Go to Power BI Desktop and follow the following path:
File -> Options and settings -> Options -> Python scripting
Then check the "Python scripting" box and click "OK".
With this, you have achieved compatibility for Python scripting in Power BI Desktop
4. After completing the previous step, you will have to set the Python path in Power BI Desktop
5. Perform the following steps:
File -> Options and settings -> Options -> Python scripting
Click โDetectโ to automatically detect the Python installation path instead of choosing it manually
6. After executing the previous step, restart Power BI Desktop for the new changes to take effect
Now you have to install the following Python libraries:
openai is the library that provides access to the ChatGPT models
pandas is the library used to create and manipulate dataframes
pyodbc is the library that handles the connection to a Power BI data source
You can install these libraries using pip by running the following command in terminal:
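The command itself is not shown in the original; installing all three from PyPI looks like:

pip install openai pandas pyodbc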
We are now at the stage of validating and setting up the OpenAI API
โข Create an OpenAI account and own an API key
โข Install the OpenAI Python library
โข Set OPENAI_API_KEY to your API key
โข By running the following Python code, you can authenticate and configure the OpenAI API
Define a function that queries the ChatGPT model and returns the response, sketched below:
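The original function body is not shown; a hedged sketch consistent with the description that follows (the model choice is an assumption, using the classic openai interface):

def query_chatgpt(prompt):
    # send the prompt to the ChatGPT model and return the reply text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]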
The query_chatgpt function takes a prompt as input, sends it to the ChatGPT model, and then returns the response.
Connect to a Power BI data source using pyodbc:
โข Write a Power Query function that calls the query_chatgpt function, which returns the response in tabular form
โข Deploy your Python script as a data farm in Power BI
โข Go to Power BI Desktop and select the “Home” tab
โข Click on ‘Transform Data’ and choose:
New Source -> Python Script
Paste the Python script, click OK, then Close & Apply
Use the ChatGPT data source in your own Power BI report
โข Go to the Report tab
โข Click on Get Data, then More
โข Select the data source “Python Script” and click Connect
โข Enter the subject to be sent to the ChatGPT form
โข Finally, the response will appear as a table in the Power BI report Finally, be sure to enter the actual values for your environment rather than the elements in the code
Artificial intelligence with ChatGPT + Power BI
1. Learn the basics.
Learning the basics revolves around one key axis: learning a programming language, which is the cornerstone of learning data science, and Python is the most appropriate option at the beginning. I do not mean to neglect the importance of other programming languages, each of which has its own function and importance, so some may not agree with me: for some, the best option is SQL, and those who need data visualization may find it necessary to learn R. What everyone agrees on is that programming languages, with their different functions, usually complement one another.
In general, at the beginning of the learning journey, I do not recommend splitting your attention by learning more than one language, so that boredom or frustration does not creep in at a time when you most need focus and the desire to learn.
2. Let those around you know that you are studying data science.
During your journey of learning data science you are in real need of support and encouragement. Letting those around you know that you are studying data science may prompt many of them to offer help and support, especially peers who are themselves willing to learn this kind of science.
Everyone knowing about your studies may open up learning horizons that contribute greatly to raising your expertise and skills, giving you a depth of knowledge you would not have reached learning on your own.
3. Market yourself as a data scientist.
When you reach the level of a good data scientist, you will find employment opportunities open to you. When you apply for a data science job, define your goal clearly and employ everything you have learned to show your skills and experience in handling the problems and solutions a data scientist usually faces during his career. Present everything you have, present your projects and discuss them, and impress the interviewers with confidence in yourself, your knowledge, and your expertise; then you will be the focus of their attention, win their admiration, and increase your chances of success and acceptance.
From my point of view, these were the most important factors that help build a data scientist without a degree in data, and there are no doubt other factors that contribute to refining expertise and skills. Share the factors you think achieve this so we can discuss them together. I wish you luck and success.
According to statistics published on the Internet, thousands of master's degrees related to data science and artificial intelligence are offered all over the world, and we often see promotional advertisements from universities about the importance of data science and the necessity of obtaining these certificates.
In this article, we will try to highlight the things that must be taken into consideration before obtaining a master’s degree in data science
What is your goal of obtaining a master’s degree?
In other words, what advantages will you get with a master’s degree in data science?
The motives for pursuing a master's degree differ from one person to another, but if we look broadly at what the majority of students want, the goal can be summed up in several points:
Discipline and responsibility: a person's self-learning journey is often undisciplined and lacks coordination and organization, so studying for a master's degree lays out a specific, organized educational path and thereby gives you a measure of organization and responsibility.
Effective rapid learning: your desire to obtain a master's degree will strengthen your motivation to learn and to acquire experiences and skills that you might not gain during your normal learning journey.
Professional competence: as a data scientist with high competence and sufficient experience, you have great opportunities to get a good job in data science if you are not yet employed; if you are already employed, the prospects open up for a promotion that matches your new scientific level and raises your standing.
Scientific curiosity: no matter how much experience and knowledge you have in artificial intelligence, be certain that there are still topics and skills for you to discover; do not let your interests stop at a certain limit, for you still have a lot to learn.
In view of these motives, it may seem that every data scientist must pursue this degree. In fact that is not the case, or at least it is not an absolute necessity: a master's is, in the end, an advanced degree that undoubtedly gives its holder an advantage in data science, and especially in artificial intelligence, but that does not mean that someone without a master's degree in data science is unqualified to be successful and expert. Not necessarily, because every hardworking person has a share of success.
Is a master’s degree enough to achieve your goals as a data scientist?
To answer this question accurately, we must understand something very important. Whatever the level of your academic degrees, and in any field, whether a master's, a doctorate, or another degree, we can never neglect the experience factor. Without experience and personal skill in a given field, degrees alone cannot take their holder to advanced stages of his specialty, because experience is what guides good judgment, especially in the difficult situations and problems encountered during a scientific and practical career. Some situations require prior experience with a type of problem that was never covered in the master's program, and such experience is not acquired overnight; it forms through an accumulation of experiences: finding solutions, acting well, and learning from mistakes, for it is well known that he who does not make mistakes does not learn. Experience sometimes comes after a decisive decision or a bold step, and an experienced person holds a treasure that holders of higher degrees may sometimes lack: the ability to seize the slimmest opportunity and turn it into a strong, successful project.
With all of the above, we conclude that obtaining a master’s degree is a good thing and becomes a strength factor if it is supported by sufficient experience. These two elements, if available together, undoubtedly constitute a data scientist with a high level of competence and skill.
Does time help achieve goals enough?
The time factor is one of the main contributors to achieving the desired goal. There is no doubt that structured, full-time study helps you absorb the largest possible amount of information at a reasonable speed, but some of that material is only indirectly related to data science, such as papers on the social sciences of the Internet and the design of questionnaires, so a master's student in data science does not always invest his time optimally. On the other hand, the smaller amount of time spent on informal study generally does not lead to the academic credential that a master's degree gives you.
Is there an alternative to a master’s degree?
From what we have covered so far, a question arises: can we say that someone without a master's degree is unqualified to be a capable, professional data scientist, with fewer opportunities than a master's degree holder?
In fact, this statement is not absolute, despite the prevailing custom that holders of a masterโs degree are preferred over those who do not hold a masterโs degree, and holders of a doctorate are preferred over holders of masterโs degrees, and so on.
Of course, obtaining more certificates requires more years of study, perhaps up to seven, and then comes the striking fact that three years of experience, especially in an application for a job in artificial intelligence, may outweigh all of those long years of study.
So as not to confuse matters and leave the reader hesitant about the information presented, it can be said that a PhD holder remains the focus of potential employers' attention because, in my opinion, he would not have reached that point without the experience needed to earn that degree.
Does the financial return of the master’s degree holder compensate for what he spent on the learning journey?
Many people who obtained a master's degree were shocked that their job salary did not meet their aspirations, and so fell into the misconception that the money they spent on the degree cannot be recouped through career progression, at least in the short term.
In this case the solution is preventive rather than curative, and it lies in managing the study period wisely. Instead of spending freely on full-time study, it is possible to study part-time while keeping your job and salary, which is the first element of good spending management; scholarships can also cover a good portion of the tuition fees.
Make sure to get a good source of information in learning:
The name of a university or educational institution, however well known, does not necessarily indicate that it is a good source of knowledge; what determines the quality of these institutions is how well students engage with the courses and how graduates perform. All you have to do is search for reviews and official statistics on any course offered by any institution that provides this type of study; this increases your chances of finding a leading institution that will give you a sound education.
Are these courses compatible with your scientific level?
Continuing from the previous paragraph, and while we are on the subject of choosing courses well, you should also check whether a course's content and style match your level, as a course may cover beginner topics that more experienced learners find very simple.
This actually happened when a major university opened a training program that brought together mixed academic groups with an intensive programming course, which made the course boring and a waste of time for some.
After making sure you follow courses that suit your level, do not forget to check whether these courses lead graduates to job opportunities based on what was studied, and whether those jobs are online, full time, or part time.
Is studying data science the best option for you?
Being content with what one undertakes, whether it is study or work, is an important factor in the success of this project. No person can be creative in any field unless he is completely convinced of what he is doing.
For many people, the option of studying for a master's degree is a chance to postpone decisions about what to do with their lives, but in that case a valuable course of study turns into a great waste of time, when practical experience would instead expand their skills and knowledge in data science and artificial intelligence.
Data science is a multi-disciplinary field with many branches and ramifications, all of them valuable; they open wide horizons of knowledge and experience that carry students toward their aspirations and put their goals within sight, so that reaching them is only a matter of time.
In the end, dear reader, we hope you found benefit and enjoyment in this article; do not forget to share your opinion on it with us, with our best wishes for your success.
Machine learning is the science of our times, and the demand for learning it is increasing rapidly and significantly.
In this article, we will shed light on the best way to learn machine learning skills so that the learner can invest them in the future in developing scientific research worldwide.
Therefore, we must first mention the concept of machine learning in a nutshell
Machine learning is the practice of feeding data into a computer so that it can develop and improve over time, by building statistical models and algorithms that let computer systems operate without explicit instructions.
Machine learning map:
The first stage: learning the programming language
In this stage it is preferable to learn Python, as it is the most powerful and popular choice thanks to libraries such as Pandas, NumPy, and Scikit-learn, which specialize in machine learning, statistics, and mathematics.
The second stage: learning linear algebra
Linear algebra is a branch of mathematics that deals with linear transformations and with matrices and vectors.
Learning linear algebra is a crucial step forward in the journey of studying machine learning
The third stage: learning the basic libraries of Python
While there are other Python libraries, these three (NumPy, Pandas, and Scikit-learn) are considered the most efficient for applying machine learning techniques.
The fourth stage: learning machine learning algorithms
As an applicant interviewing in data science and related fields, you may notice that success rates seem low compared to the number of applicants, and that the questions get harder in the later stages of the interview, especially when machine learning comes up. The questions may seem difficult at first, and failure to answer, usually the result of confusion, typically leads to the applicant being rejected.
Anyone who avoids falling into this trap can turn previous stumbles into strengths that help him get through the interview with ease, because he has become fully aware of the level and style of the difficult questions.
Of course, not every applicant wants to wait until after a failure to learn the level of the questions and answer them in another interview; and we can set aside the small group of applicants who are fully prepared for any kind of question, for whom machine learning is a specialty they handle professionally, so questions that are a stumbling block for others are easy for them. In this article we will therefore address five questions that are classed as difficult in machine learning interviews. Understanding these questions, which touch on basic concepts in machine learning, will undoubtedly put the applicant in a position of strength when tested with them.
Question 1: What is the difference between XGBoost and Gradient Boosting?
The obvious answer may seem to be that XGBoost is an optimized implementation of gradient boosting, and that answer is not wrong, but the interviewer is trying to draw out the applicant's skills through an answer that shows he is a professional data scientist.
So the expected answer will be as follows:
XGBoost has built-in regularization, which helps control overfitting
XGBoost has built-in handling of missing values through a mechanism called sparsity awareness
It computes similarity scores based on the gradients
It does a great deal to speed up calculations
It parallelizes the search for (variable, threshold) splits on huge data sets using the weighted quantile sketch technique
Question 2: What are the best uses for regression evaluation scales?
the answer :
Evaluation criteria used in regression:
R² is very common for regression; it expresses the percentage of variance in the target that is explained by the independent variables
MSE: the mean squared error loss function
RMSE: the root mean squared error
MAPE: the mean absolute percentage error, often the most appropriate measure for business activity, since it reports the error as a percentage of the actual values
How do you choose between MSE and RMSE?
Use RMSE when you want the error on the same scale as the actual values
Use MSE when you want to work on the squared scale
Question 3: How can overfitting be controlled using cross-validation?
the answer :
It is important to know that cross-validation enables you to detect overfitting, but not, by itself, to control it.
In order to be able to control it, we must do the following:
Feature selection and engineering
If the algorithm is linear, handle the outliers
Hyperparameter tuning
Early stopping
Regularization
Getting as much data as possible
Question 4: What are precision and recall?
Let's say that out of 18 predicted fraud incidents, 12 were actually fraud, and that these captured 80% of all fraud incidents. Calculate precision and recall.
Answer: Build the confusion matrix counts from these numbers:
Precision = TP/(TP + FP) = 12/18 = 0.66
Recall = TP/(TP + FN) = 12/15 = 0.8
If your knowledge of this subject is superficial, you will feel confused.
On the contrary, if you are well versed, you will find that the answer is already in the question.
Recall: what percentage of the actual positives were correctly predicted? 80% = 0.8
Precision: how accurate were the predictions? Out of 18 predictions, 12 were correct, so 12/18 = 0.66.
Note that TN is not asked about here and is not even needed for either recall or precision.
Question 5: What are the differences between Bagging and Boosting?
Bagging:
A large number of decision trees are built independently, and their outputs are combined to obtain the final prediction
Each tree is built on the actual values of the dependent variable
May give poor results on random data sets
Boosting:
Trees are built sequentially, with each new tree depending on the prediction residuals of the previous one
Trees are fit on the residuals
Can work well even on noisy or difficult datasets because it keeps focusing on the misclassified samples
Based on the points above, you can choose between the two approaches for a given job
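A compact sketch contrasting the two approaches (scikit-learn assumed; BaggingClassifier uses decision trees by default, while GradientBoostingClassifier builds trees sequentially on the residuals):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, flip_y=0.05, random_state=1)

# Bagging: many trees trained independently on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=1)   # default base estimator is a decision tree

# Boosting: trees trained sequentially, each one fit to the errors of the ones before it.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=1)

for name, model in (("bagging ", bagging), ("boosting", boosting)):
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```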
Excel is a program whose features help the user analyze data easily, thanks to the many formulas and functions it provides for calculations, text and date handling, and lookup tasks. In this article we will go through a set of these functions.
1. CONCATENATE
This formula is considered one of the most useful in data analysis despite how easy it is to work with. Its task is to take dates, text, numbers, and other data spread across several cells and merge them into one cell.
SYNTAX = CONCATENATE (text1, text2, [text3], …)
Concatenate multiple cell values
The simple CONCATENATE formula for the values of two cells A2 and B2 is as follows:
= CONCATENATE (A2, B2)
The values will be combined without any delimiter; to separate them with a space, add " " as an argument:
=CONCATENATE(A3, " ", B3)
Concatenate a text string with a computed value
You can also combine a literal string with a computed value, as in this example that returns the current date:
=CONCATENATE("Today is ", TEXT(TODAY(), "dd-mmm-yy"))
Keep the following in mind when working with CONCATENATE:
In all cases, the result of the CONCATENATE function is a text string, even if all the source values are numbers
Make sure every argument passed to CONCATENATE is valid
If an argument is not valid, the formula returns the #VALUE! error
2. Len()
This function returns the number of characters in a cell. It is useful when working with text that has a character limit, or when comparing the lengths of a group of product codes or numbers.
SYNTAX = LEN (text)
3. Days()
This function is used to calculate the number of days between two dates
SYNTAX = DAYS (end_date, start_date)
4. Networkdays()
It is a date-and-time function in Excel, often used by finance and accounting departments to exclude weekends when calculating employees' wages from their actual working days, or to count the total number of working days for a specific project
SYNTAX = NETWORKDAYS (start_date, end_date, [holidays])
Sumifs()
One of the most common formulas in Excel and among the most important for data analysts is =SUMIFS, a conditional version of =SUM: it adds up values only when they satisfy the conditions you specify
SYNTAX = SUMIFS (sum_range, criteria_range1, criteria1, …)
Countifs()
It is an important tool in data analysis and is similar to SUMIFS: it counts the number of values that satisfy the given conditions, but it does not need a sum range
SYNTAX = COUNTIFS (criteria_range1, criteria1, [criteria_range2, criteria2], …)
8. Counta()
Its job is to count the cells in a range that are not empty, which lets you, as a data analyst, discover gaps in a data set without having to restructure it.
SYNTAX = COUNTA (value1, [value2], …)
9. Vlookup()
The V stands for Vertical: VLOOKUP searches for a value in the leftmost column of a table and returns a value from the same row in a column you specify
We will explain the arguments of the VLOOKUP function:
– lookup_value: the value to look up in the first column of the table
– table_array: the table from which the value is to be retrieved
– col_index: the number of the column in the table from which to return the value
– range_lookup (optional): TRUE or omitted = approximate match (the default); FALSE = exact match
The following table will explain the use of VLOOKUP
Cell A11 contains the lookup value
A2:E7 is the table array
3 is the column index with the information for the sections
0 (FALSE) is the range_lookup argument, requesting an exact match
If you press the Enter key, it will return “Marketing”, which indicates that Stuart works in the marketing department
10. Hlookup()
Here the H stands for Horizontal: HLOOKUP searches for one or more values in the top row of a table, then returns a value from a row you specify in the same column. It comes in handy when the values you are matching against sit in the first row of the spreadsheet and you need to look a certain number of rows down. Its arguments are:
table – the table from which you need to retrieve data
row_index – the number of the row from which to return the data
range_lookup – controls exact versus approximate matching; if it is omitted, the default is an approximate match
In our next example, we’ll search for the city Jenson is from using Hlookup.
The search value shown in H23 is Jenson
G1: M5 is the table array
4 is the row index number
0 (FALSE) requests an exact match
Pressing Enter returns "New York".
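For analysts who also work in Python, a pandas merge plays the same role as VLOOKUP. This is a small sketch with a hypothetical employee table mirroring the examples above:

```python
import pandas as pd

# Hypothetical tables echoing the lookup examples: an employee table and a query.
employees = pd.DataFrame({
    "name": ["Stuart", "Jenson", "Amina"],
    "department": ["Marketing", "Sales", "Finance"],
    "city": ["Boston", "New York", "Chicago"],
})
lookups = pd.DataFrame({"name": ["Stuart"]})

# A left merge is the VLOOKUP analogue: match on the key column,
# then pull whatever columns you need from the same row.
result = lookups.merge(employees[["name", "department"]], on="name", how="left")
print(result)   # Stuart -> Marketing
```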
In conclusion
We can see from the above how effective Excel is for analyzing data. By learning its formulas and functions, you can make your work easier and save a great deal of time and effort.
Did you know that you can build your own online business from nothing in record time?
Yes, my friend, this is possible with artificial intelligence, using ChatGPT and free tools.
It comes down to creating a website and developing a suitable working strategy, and I am not talking here about creating a blog or an online store, but about the approach we will explain in this article.
The basic idea of this project is to offer people something useful. You do not need to produce content or write articles; the value rests mainly on real data.
So you have to think about and search for the things that a large group of users is looking for and cares about, and build a database around them within the framework of your own work and interests. For example, if you work in digital marketing, you could offer things related to digital marketing; since much of the work in this field is organizing e-mail content for marketing campaigns, you could help those people by building something like Boostctr.io.
Boostctr.io is an easy-to-use site that contains tested topics along with some supporting information. Building a site like this starts with obtaining the code for the frontend and backend and pasting it into Visual Studio Code.
Creating the site itself is very simple: you can use HTML, CSS, and JavaScript for the frontend, and an ASP.NET Core API with a LiteDB database for the backend.
You can then simply add an advertising link or ad copy, run an ad server, or sell advertisements directly.
Now we come to how best to monetize this kind of site.
The best way to monetize it is to get people to discover the site and explore its content, so you attract more potential visitors and can sell advertising space; you can also earn from selling topics, content, and e-mail marketing tools, and add a premium membership that gives members access to the full database or to more daily records.
Do you think this idea is more feasible than building an online store, or does its simplicity make it a lower tier of online business than a store?
Share your opinion, and suggest any idea that could be added as valuable content to this type of site. We are waiting for your thoughts in the comments.
We will go over the roadmap for newcomers to data analysis in 2023, supported by links to tools, tutorials, and online courses.
The primary job of data analysts within any company is to study customer data thoroughly in order to serve customers better and to produce statistics that tell service providers how customers are likely to behave.
Data Analyst Roadmap for 2023
Learning programming is the first step of the data analysis journey, and knowledge of computer science, especially databases and SQL, also helps. Along the way we will mention the resources you need to become a data analyst.
This map is your guide to learning the skills of a successful data analyst in 2023. It covers the basic learning stages in a simple, understandable way. If you think other tools should be added to this map, we would be glad to hear from you in the comments; your opinion matters to us.
Now we will discuss the important resources mentioned in this map:
1. Learn Python
There is no doubt that learning Python is the ideal start to the data analysis journey. Writing code in this language is an essential pillar of data analysis jobs: the main data analysis and visualization packages integrate fully with Python, and the language has a huge user community that helps you find solutions to the professional problems you run into. There is also a large number of Python courses online; here we recommend the Python specialization on Coursera, which can take you to an intermediate level within three months at most.
Coursera also offers a very useful course for Python beginners: it starts from the basics of the language, then moves on to working with the web and interacting with databases from Python.
Once you have learned Python you have come a long way toward learning data analysis, and we can move on to the other things to learn next.
2. Data visualization and processing
It is essential for a data analyst to be comfortable with data visualization, since your work requires converting raw data into charts to make it clearer.
Therefore, you should learn the main visualization and data-processing libraries; we will cover some of them and explain how the tools and features differ from one library to another.
Numpy Library
This library is built around arrays (matrices) and fast arithmetic operations on them; it is widely used among data analysts and is recommended as a starting point.
Pandas Library
Dedicated to importing and manipulating data; you need it to clean and analyze your data.
Matplotlib library
This library is open source and the most popular among data analysts, so it has a large user community you can turn to when you hit problems, and it offers a huge number of chart types to work with.
Seaborn Library
It differs from Matplotlib in providing a wide range of plot styles that can be customized to suit your requirements, and it is easy to learn.
Tableau
Tableau is a standalone visualization tool rather than a Python library: just import your data, then unleash your imagination and start customizing your visualizations, since it lets you visualize data without learning any programming language.
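A tiny end-to-end sketch of how the Python libraries above fit together (toy data; numpy, pandas, matplotlib, and seaborn are assumed to be installed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical monthly revenue figures.
df = pd.DataFrame({
    "month": pd.date_range("2023-01-01", periods=6, freq="MS"),
    "revenue": np.random.default_rng(0).integers(100, 200, size=6),
})
df["revenue_growth"] = df["revenue"].pct_change()   # pandas: derive a new column

sns.lineplot(data=df, x="month", y="revenue")       # seaborn: quick styled chart
plt.title("Monthly revenue (toy data)")
plt.tight_layout()
plt.show()
```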
3. Learn statistics:
One of the things that most improves a data analyst's employment prospects is statistics skills. The importance of learning statistics lies in working with large amounts of data in depth, since the predictions you make and the decisions you support must rest on sound statistical results.
We recommend the beginner statistics course offered on the Coursera platform, which starts from the basics of sampling, distributions, probability, regression, and so on.
Conclusion:
Have you noticed how simple this roadmap is? You can rely on it to become an experienced data analyst. Of course, we cannot limit learning to Python alone; you can learn other languages such as R, but it is generally agreed that Python is ideal for data analysis, without neglecting the value of the other languages.
We hope this article has offered ideas that benefit data analysts. Do not forget to share in the comments any ideas you think would add more value to this map; we are waiting for you.
As a beginner in programming, you are bound to make some mistakes; that is a normal part of any new start in a field. Like the other sciences that serve as a gateway to modern technology, programming is a craft that must be properly mastered, which means learning to avoid the mistakes novice programmers often make. We will highlight those mistakes in this article:
1) Haste and lack of concentration in writing the code:
You cannot obtain correct, accurate code that works for small and large applications unless it has been planned with focus and care. Preparing code involves stages that each deserve attention, in order: thinking, research, planning, writing, verification, and modification where necessary.
Programming is not just typing code; it is a craft that requires skill and creativity grounded in logic.
2) Not preparing an appropriate plan before starting to write code:
The absence of a general plan for the code you are about to write is one of the main causes of getting scattered. At the same time, do not over-plan: you do not need an elaborate model plan that consumes your time and effort; a simplified idea that lets you start correctly is enough. That does not mean you will never have to change the plan during the work, but at least you have laid a foundation you can rely on, whether to continue as-is or to amend it when necessary.
Following this approach to planning makes it easier to respond to the requirements of the situation, such as adding or removing features you had not thought of at first, or fixing a defect somewhere. It also teaches you to stay flexible in programming and ready to deal with any unexpected circumstance.
3) Neglecting code quality:
Code quality is one of the most important pillars of writing correct code. Code is good when it is clear and readable; otherwise it turns into stale, hard-to-maintain code.
Moreover, clarity is the best way to produce code that runs correctly, and that is the programmer's primary task.
Any defect in the simplest things can prevent code from working properly or being understood; for example, inconsistent indentation or inconsistent capitalization of identifiers can stop code from running in some languages and makes it confusing everywhere else.
Long lines are also hard to read, so you should avoid exceeding roughly 80 characters per line of code.
To avoid such errors you can use the linting and formatting tools available for JavaScript, which fix what can be fixed automatically and spare you mazes that are hard to get out of.
The best way to maintain code quality is to know the most common mistakes and work to avoid them, including the following (see the sketch after this list):
• Files or functions that are too long; breaking long code into many smaller parts makes it easier to test each one separately
• Variable names that are too short or not specific enough to be clear
• Raw numbers and strings left unexplained; to avoid this, put such values in a constant and give it an appropriate name
• Wasting time on simple problems that can be handled with a little skill and a suitable, well-known shortcut
• Neglecting alternatives that improve readability, such as leaning too heavily on conditional logic
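A small Python sketch (the names are purely illustrative) of two of these points: giving a raw number a descriptive constant and splitting logic into small functions:

```python
# Before: a raw number buried in the logic, and one function doing several jobs.
def price_with_tax_before(prices):
    return [round(p * 1.2, 2) for p in prices]   # what is 1.2 supposed to mean?

# After: the value gets a name, and each small function does one thing.
VAT_RATE = 0.20          # named constant instead of an unexplained 1.2

def apply_vat(price: float) -> float:
    return price * (1 + VAT_RATE)

def price_with_tax(prices):
    return [round(apply_vat(p), 2) for p in prices]

print(price_with_tax([10.0, 25.5]))
```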
4) Haste to use the first solution:
This happens when a novice programmer, searching for a way out of a problem, rushes to use the first solution that comes to mind without considering the complications it will cause; those complications can hinder correct programming and lead to failure, so the first solution is not necessarily the right one.
It is better to explore several solutions and choose the most appropriate one. An important point here: if you cannot come up with several solutions to a problem, you most likely have not identified the problem accurately.
A programmer's skill shows in choosing the simplest solution that addresses the problem, not in fleeing to the first solution found just to be rid of the problem immediately.
5) Sticking to the idea of the first solution:
Avoid clinging to the first solution, even if letting go costs more effort. When you doubt the correctness of a solution, quickly discard the bad code, go back, and try to understand the problem again more precisely. Remember that the skill is in reaching a simple solution that makes the right decisions easy. Source-control tools such as Git also make it safe to throw code away and try again.
6) Relying on Google:
Beginner programmers often turn to Google for problems they hit while writing code. The problem they are facing has usually been faced by many before them, so a solution is often already out there, and that genuinely saves some search time. But have you considered whether that line of code will keep working in your particular situation? Be very careful not to use any line of code you do not fully understand, even if it appears to solve your problem.
7) Not using encapsulation:
Encapsulation protects the state inside an application by hiding internal properties while still allowing them to be used safely. It is useful, for example, for changing the internals of a function or class without affecting other parts of the program. Neglecting encapsulation often makes systems difficult to maintain.
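A minimal Python sketch of the idea (the Account class is purely illustrative): the balance is kept internal and can only be read, or changed through a checked method:

```python
class Account:
    """Keeps its balance internal and exposes a safe, read-only view of it."""

    def __init__(self, opening_balance: float) -> None:
        self._balance = opening_balance          # internal detail, not touched directly

    @property
    def balance(self) -> float:                  # read access without exposing the field
        return self._balance

    def deposit(self, amount: float) -> None:    # all changes go through checked methods
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

acct = Account(100.0)
acct.deposit(25.0)
print(acct.balance)      # 125.0
```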
8) Wrong view of the future:
A programmer should have foresight and consider the possibilities of each next step when writing code; that is useful for thinking through edge cases. But be careful not to let this foresight push you into implementing anticipated needs, writing code you do not need now on the assumption that you might need it later. Stay as consistent as possible with the coding you actually need today.
9) Use wrong data structure:
Knowing the strengths and weaknesses of the data structures a programming language offers is a sign of the programmer's skill and experience. A couple of practical examples illustrate the point (see the sketch after this section):
In JavaScript, the most used list structure is the array, and the most used map structure is the object.
To manage a list of records where each record has an identifier you search by, use maps (objects) instead of lists (arrays); plain lists are the better choice when the goal is simply to push values onto the end.
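The original examples are JavaScript; the same principle in Python looks like this (a dict for keyed lookup, a list for appended values; the data is made up):

```python
# Records identified by an id: a dict (map) gives direct lookup by key.
users_by_id = {
    "u1": {"name": "Sara"},
    "u2": {"name": "Omar"},
}
print(users_by_id["u2"]["name"])      # O(1) lookup, no scanning

# Values that are simply appended and iterated: a plain list is the right fit.
event_log = []
event_log.append("login")
event_log.append("purchase")
print(event_log)
```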
10) Turn your code into a mess:
Whenever there is code that causes defects and inconsistencies, deal with it immediately and clean up the resulting mess, as in the following cases:
Duplicate code: this occurs when the same logic is copied and pasted into several places, which leads to defects and inconsistencies when one copy changes and the others do not.
Neglecting the configuration file: if a certain value is used in different places in the code, that value belongs in the configuration file, and any new value of the same kind added later should go there too.
Unnecessary conditional statements (if): conditional statements express logic with at least two branches, and unimportant conditions should be avoided while keeping the code readable. Expanding a function with sub-logic hanging off an if statement, at the expense of folding yet another task into it, causes unnecessary clutter and should be avoided as much as possible. To make the point about the if statement concrete, consider the code sketched below.
The problem is not only in the isOdd-style check itself; the more obvious problem is that the if statement is unnecessary, because the equivalent code can simply return the boolean expression directly, as shown below.
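A Python sketch of the same idea, mirroring the isOdd function mentioned above:

```python
# Redundant: the if/else just restates the boolean expression.
def is_odd_verbose(n: int) -> bool:
    if n % 2 == 1:
        return True
    else:
        return False

# Equivalent and clearer: return the expression directly.
def is_odd(n: int) -> bool:
    return n % 2 == 1

print(is_odd_verbose(7), is_odd(7))   # True True
```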
11) Commenting on things that are already understandable:
Even if it seems difficult at first, avoid as much as possible commenting on things that are already clear and obvious; you can usually replace such comments with well-named elements added to the code.
The sketch after this subsection contrasts code weighed down with redundant comments against the same code rewritten with descriptive names.
Descriptive names are more effective than unimportant comments.
That said, this rule should not be generalized to all of programming: there are cases where clarity is incomplete without comments. In those cases, structure your comments to explain why the code exists rather than restating what it does. Even those who like writing comments are advised to avoid stating the obvious.
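A short Python illustration of the point: the first version repeats itself in a comment, the second says the same thing through its names:

```python
# Noisy version: the comment repeats what the code already says.
def total(prices):
    t = 0
    for p in prices:
        t += p  # add the price to the total
    return t

# Self-describing version: good names make the comment unnecessary.
def total_order_value(item_prices):
    return sum(item_prices)

print(total([2, 3]), total_order_value([2, 3]))   # 5 5
```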
12) Don’t include tests in your code:
Some programmers think they do not need to write tests for their code and mostly test their programs manually, perhaps out of excessive confidence that written tests are unnecessary. Manual testing is not a bad thing in itself; in fact, understanding how you would test something manually is the starting point for automating that test.
If you run an interaction test against one of your applications by hand and want to perform the same interaction automatically next time, you have to go back to the code editor and add instructions for it.
Remember that your memory may fail you when it comes to re-running every successful check after each change, so hand that task to the computer. Better still, write or at least sketch your checks before writing the code. Test-driven development (TDD) is not for everyone, but it positively influences your style and guides you toward better design. A minimal example of an automated check follows.
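As a minimal example of what an automated check can look like, here is a tiny pytest-style test file (pytest is assumed to be installed; the function under test is defined inline so the file is self-contained):

```python
# test_example.py -- run with `pytest test_example.py`

def is_odd(n: int) -> bool:
    return n % 2 == 1

def test_is_odd_detects_odd_numbers():
    assert is_odd(7) is True

def test_is_odd_rejects_even_numbers():
    assert is_odd(4) is False
```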
13) Assuming the task is going well:
Consider a function that implements sumOddValues, adding up the odd numbers in a collection. Does it have an error?
Such code is easy to leave incomplete: it may handle certain cases properly while still having many problems, including the following.
First problem: empty or null input is not handled.
Calling the function without arguments raises an error that exposes the function's internals.
There are two ways to deal with this kind of error:
• The details of how your function is implemented should not be shown to its users
• If the error is caused by incorrect use rather than a bug in the function, make that clear: throw an explicit exception whose message tells the user what went wrong
Better yet, you can avoid the error entirely by programming your function to treat empty or null input as an empty collection and ignore it.
Second problem: invalid inputs are not handled.
Consider what the function throws if it is called with an object, a string, or a plain integer instead of an array.
array.reduce is a function, but whatever we pass in, say 42, gets bound to the parameter we named array inside the function, so the error reads that 42.reduce is not a function.
An error message phrased in terms of the caller's mistake, rather than in terms of an internal variable, would be far more useful.
The two problems above are basic mistakes that should be avoided almost by reflex. There are also cases that require more thought and planning, such as what happens if the input contains negative values.
If negative odd numbers are not meant to be included, the function should have been named sumPositiveOddNumbers so that the mismatch never arises.
Third problem: not all the correct cases are tested, because some exceptional cases are forgotten.
For example, the number 2 can end up included in the sum even though it should not be.
This happens because reduce, when called without an initial value, uses the first element of the collection as the initial accumulator, which in this example is the number 2. The solution is to pass reduce a second argument to be used as the initial value of the accumulator.
This is where testing proves necessary: you would most likely have discovered the problem while writing the code if tests had been included alongside the other work.
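The original function was JavaScript; here is a Python sketch of the corrected logic, addressing the three problems above: None input, non-numeric entries, and the missing initial accumulator for reduce:

```python
from functools import reduce

def sum_odd_values(numbers):
    if numbers is None:                      # problem 1: handle null/None input explicitly
        return 0
    # problem 2: skip entries that are not integers instead of crashing on them
    odds = [n for n in numbers if isinstance(n, int) and n % 2 == 1]
    # problem 3: pass 0 as the initial accumulator so the first element
    # is never silently used as the starting value.
    return reduce(lambda acc, n: acc + n, odds, 0)

print(sum_odd_values([2, 1, 3, 4, 5]))   # 9, and the leading 2 is not included
print(sum_odd_values(None))              # 0
```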
14) Excessive trust in existing code
Some snippets may look useful to novice programmers, who then reuse them without question. Sometimes that code is simply bad, or was written that way only because its developer was forced into it, and reusing it blindly causes problems for beginners. It helps a great deal when developers who expect beginners to read their code leave a comment explaining why it is written the way it is.
Therefore, as a beginner, treat any code you want to reuse from elsewhere with suspicion until you understand what it does and why it exists, so you avoid mistakes you could easily have done without.
15) Excessive devotion to "best practices"
Although best practices carry that name, they do not always deserve it. This becomes a problem when a novice programmer devotes most of their attention to following best practices, or at least the methods they believe to be best practices, while ignoring cases that require acting differently from the usual rules. Those situations are a challenge that only good judgment will get you through, and that judgment is developed precisely by dealing with such circumstances.
16) The obsession with performance
To get rid of the fear of writing poor code, be careful from the start: give every line proper attention and draw on the knowledge and skills that keep you from making mistakes. But this concern with improving performance before you even begin should not be exaggerated. Good judgment is what tells you whether a situation really calls for performance work up front, or whether optimizing at that point would be an unjustified waste of time and effort.
17) Not putting the user experience first
One of the marks of a successful programmer is always putting themselves in the user's place and looking at the application they designed or developed from the user's point of view. Keeping that perspective in mind for every feature you add helps a lot in getting better results.
18) Disregard for users’ experience by developers
Every programmer has preferred methods and tools for programming; some are good, some less so, and some are bad. In general, a tool's quality depends on where it is used: the same tool can be good in one context and bad in another.
Novice programmers often prefer whatever tools are most widely talked about, regardless of how useful they are for the work at hand. To move to a higher level of experience, a programmer must start selecting tools based on how well they address the specific functions that need them in the first place. That builds open-mindedness and good judgment, and cures a problem many suffer from: clinging to familiar tools in every situation.
19) Data problems caused by code errors
Data is the basic pillar on which programs are built; a program is essentially an interface for entering new information or removing old information. The smallest error in the code can therefore cause unexpected damage to the data. Some novice programmers fall into this when they rely on code they believe has passed validation and assume a broken check does not matter. The problem gets worse when the program keeps writing bad data that nobody understood from the beginning, and the damage accumulates until it reaches a point where the correct state can no longer be restored. To avoid this, use multiple layers of data validation, or at the very least the constraints your database offers, which come into play when you add tables and columns (a minimal sketch follows the list below):
A NOT NULL constraint applied to a column means that null values are rejected for that column: the field must always have a value
A UNIQUE constraint applied to a column means that duplicate values are rejected across the entire table, which is ideal for columns such as a username or e-mail address
A CHECK constraint is a custom expression that must evaluate to true for the data to be accepted, ideal for a percentage column that should hold only integers from 0 to 100
A PRIMARY KEY constraint gives each table in the database a key that identifies its records, which means the column values are both non-null and unique
A FOREIGN KEY constraint requires the column values to match values recorded in a column of another table, usually that table's primary key
One of the common data-integrity problems beginners run into is mishandling transactions: if a group of related operations changes the same data, they must be wrapped in a transaction that can be rolled back if any one of them fails.
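A minimal sketch of both ideas using Python's built-in sqlite3 module (table and column names are invented for illustration): the constraints above are declared in the schema, and the related inserts run inside one transaction that rolls back on failure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")        # SQLite needs this to enforce foreign keys

conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,                          -- unique, non-null identifier
    email    TEXT NOT NULL UNIQUE,                         -- NOT NULL + UNIQUE for user e-mails
    discount INTEGER CHECK (discount BETWEEN 0 AND 100)    -- CHECK for a 0-100 percentage
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id)          -- FOREIGN KEY to users
);
""")

# Related changes wrapped in one transaction: commits on success, rolls back on error.
try:
    with conn:
        conn.execute("INSERT INTO users (email, discount) VALUES (?, ?)", ("a@example.com", 10))
        conn.execute("INSERT INTO orders (user_id) VALUES (?)", (1,))
except sqlite3.IntegrityError as exc:
    print("rejected by a constraint:", exc)
```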
20) Reinventing the wheel
In the world of software, things change quickly, and services and requirements appear faster than any team can keep up with. Software "wheels" (libraries and ready-made components) are part of that changing landscape, so you may not find exactly what you need in an existing one, and inventing a new wheel can seem inevitable. In most cases, though, if an existing, standard wheel meets your needs, it is best not to design a new one.
There are many software components available online, and you can usually try before you buy, pick what fits your needs, inspect the internal design, and often use them for free.
21) The negative idea of code reviews
Beginner programmers often take a negative attitude toward code reviews, seeing them as personal criticism. As a beginner, if you hold this attitude you should change it completely and make the most of code reviews, because they are your opportunity to learn and gain experience; everything new you learn there will have practical value for you in this field.
Looked at more broadly, a review can also go the other way: the reviewer may be the one making a mistake and you may be the one doing the correcting. Either way you are facing an opportunity to teach and to learn, which is itself something to be proud of as a programmer on the way to professionalism.
22) Rule out the idea of using source control
One of the pitfalls some novice programmers fall into is underestimating the power of a source-control system, perhaps because they believe source control is only for sharing their changes with others and building on them. It goes far beyond that: commit messages communicate what you implemented and why, and they help reviewers, and anyone who reads your code later, understand how it reached its current state.
Another benefit of source control is features such as selective staging, restoring individual files, stashing, resetting, amending commits, and many other tools that are valuable for your day-to-day coding flow.
23) Overusing shared state
Shared state is a source of problems and should be avoided as much as possible, or at the very least kept to a minimum: the more global its scope, the worse the shared state becomes. Keep new state in the narrowest possible scope and make sure it does not leak upward.
24) Not treating mistakes as useful
Many people hate seeing the small red error messages that appear while programming, but in fact errors are how you learn: they expose the kinds of glitches that happen even to professional programmers, so that you can remedy them in the future.
25) Continuous and prolonged exhaustion
Novice programmers often feel they must complete the work assigned to them at any cost and as soon as possible, which drives them to work for long stretches and forget that they need rest. Long periods of sitting and concentrating cause fatigue, and after many hours of work a programmer often reaches a point where they can no longer think through even the simplest things and stand there helpless. Taking a break is necessary to restore mental energy and balance.
With scientific and technological progress, and especially the rapid, remarkable development of data science and analytics, a data analyst needs solid experience to become the focus of attention for companies that rely on data analysis in their operations. That expertise does not appear overnight: data scientists spend a long time, put in double the effort, and seize even the smallest opportunities to gain knowledge before reaching the level of data analyst or data engineer.
Analysis is the process of finding the most appropriate way to frame problems and process data in order to solve them.
So we must touch on some ways to improve your data analysis skills:
Evaluate your skills:
Some numbers and results can deceive you after a marketing campaign: you might see a conversion rate of 50%, for example, and then be shocked that the number of potential customers behind it is small, so the percentage does not mean the goal was achieved at the required level.
Any percentage depends on both its numerator and its denominator, and either can be adjusted to suit the story being told: the numerator can be inflated when the aim is to look good, or the denominator shrunk when it is not. Always check what is actually being counted on each side of the ratio.
Measuring growth rate and expectations:
Rely on a chart line that tracks the growth rate and checks the validity of your expectations against it. As time passes, sustaining a steady increase in the growth rate becomes harder, and a single percentage chosen to embody performance can lose touch with the actual value of the work.
The 80/20 rule
The basic principle of this rule is to focus on the small share of work that produces roughly 80% of the results, and to manage that share in a way that keeps performance developing and under flexible control. The rule can also serve as a starting point for trimming the budget spent on a project.
Bring the MECE approach into your analysis
MECE (mutually exclusive, collectively exhaustive) is a systematic way of breaking problems down, aimed at cutting out the sprawling calculations that consume a lot of time and effort
Three MECE-style trees are commonly used:
* Problem tree:
Its benefit lies in breaking thorny, complex problems into pieces, which makes them easier to solve. To put it simply, it can mean analyzing user behavior according to classifications such as age, profession, or gender
* Decision tree:
It lays out decisions and their potential outcomes in a graphical chart, making it easier to weigh the relative negatives and positives of each decision, estimate the commercial value of new plans, and then prioritize and order them
* Probability tree:
It differs from the problem tree in that it organizes hypotheses in more depth and gives more direct results by attaching likelihoods to each branch
Cohorts as a measure of quality:
Cohorts are groups that share a certain characteristic, such as their start date. They support accurate analysis because you can track how long each group keeps using your applications and websites (a minimal sketch follows).
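A small pandas sketch of a cohort/retention table built from a made-up activity log (each user's cohort is the month they first appeared):

```python
import pandas as pd

# Hypothetical activity log: one row per user per month in which they were active.
events = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "b", "c"],
    "month": pd.to_datetime(["2023-01-01", "2023-02-01",
                             "2023-01-01", "2023-02-01", "2023-03-01",
                             "2023-02-01"]),
})

# Cohort = the month each user first appeared.
events["cohort"] = events.groupby("user")["month"].transform("min")
# Months elapsed since the cohort month, from plain year/month arithmetic.
events["period"] = ((events["month"].dt.year - events["cohort"].dt.year) * 12
                    + (events["month"].dt.month - events["cohort"].dt.month))

# Rows = cohorts, columns = months since joining, values = users still active.
retention = (events.groupby(["cohort", "period"])["user"]
                   .nunique()
                   .unstack(fill_value=0))
print(retention)
```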
Avoid making false statements:
Before starting any analysis, verify the quality of the data sets: monitor and organize the statistics that describe the data, exclude outliers, and work only with sound data. You can confirm the final results by comparing the values you obtain with a similar analysis.
The Internet includes an endless number of websites across various disciplines and fields, with different content and topics, and a rapidly growing share of them rely on artificial intelligence.
This has made using the Internet more useful and easier for users everywhere.
In today's article we will talk about 12 websites, all of which rely on artificial intelligence to automate various tasks, and which can be used to create distinctive content in record time.
An important tool for owners of commercial activities and for-profit organizations: it lets them track the behavior of competing companies, pull information from their websites, and follow market movements. It also suggests potential customers by tracking interests that may match your services, and it is free for everyone.
This site specializes in creating attractive designs by means of artificial intelligence. Its strength is that anyone, design expert or not, can use it to produce beautiful designs with one click, and the content it generates can mix images, graphics, and text.
This site is well suited to developing public-speaking skills: its tools let you hear your own voice with high accuracy, so you recognize your weaknesses and strengths as a speaker. In other words, it lets you listen to your voice and speaking style as if you were a member of the audience.
The site also includes videos that show how body language helps communicate an idea to the audience while speaking.
This site lets its users convert audio files, video clips, and live recordings into text that can be edited or used as subtitles.
All you have to do is enter the name of the file to convert and the location where you want to save it; the conversion then starts within a set time frame, with the ability to preview the result while it runs.
Its drawbacks are that it does not support every file type, and it cannot convert several files at once: you have to convert them one after another, starting a new file only after the previous one has finished.
This site stands out for finding search results precisely, offering an immediate answer to your questions while excluding guesses and loose suggestions from the results.
Once you enter the words or phrases you want to search for, it searches within the topic at hand, and you then choose the most appropriate result based on the description returned for each hit.
It saves time and effort, and its easy, simple interface makes browsing and searching straightforward.
This site makes it easy to transcribe notes online without losing focus from jumping between paragraphs. Users can also record audio directly and have it converted into text, which helps listeners grasp the meaning of a clip and makes it easier to share information between users.
The story of this site sounds unbelievable: from plain text you can create professional video clips. The system embodies the user's chosen persona with animated figures in several different languages, and you can add sound and music effects to give your clip more character and excitement.
For all the professionalism and sophistication of the video-creation features this site offers, its use is not limited to professionals; anyone can use it very easily to design videos driven by artificial intelligence.
A site dedicated to designing memes: users can choose from a set of templates or generate one on demand with an AI-powered meme creator. It is enough to add text and images to make a meme more professional with one click, then publish the result on social media; work that catches the eye of people looking for unique, striking ideas can in turn increase your sales and profits.
Another standout site converts text into speech, with extra features such as studio-quality recordings, a choice of voice, transcription of audio back into text, and many additional free features that will impress you once you explore the site.
This site focuses on creating distinctive brands; you can also use its ready-made designs to explore ideas and draft logos, so you can settle on the colors and titles that best fit your design.
Earlier in this article we covered sites that convert audio into text; this site does the opposite: it turns text into speech so close to a human voice that the listener will think a person is reading. That makes it useful for building audio libraries, with control over the voices used.
Using it is smooth and simple: you upload the text file and the site converts it into clear, accurate audio.
It is also a gateway to earning money, since texts rendered as high-quality audio recordings can be sold to people interested in buying audiobooks.
With the remarkable development and rapid growth of the data science and analytics community, demand for this kind of expertise keeps rising, and the family of data scientists keeps expanding at every level, from beginners to seasoned experts and everyone in between. Companies are keen to hire employees qualified by their experience in dealing with data, which is the backbone of the overall system on which the general strategy of any company, institution, or body is built.
Hiring standards may differ from one company to another, but the main goal is the same: to hire a well-rounded data employee. Most companies and institutions run into situations, circumstances, and sometimes problems that require good judgment at the right moment, and for them an expert employee, together with their colleagues, forms the cornerstone the company depends on to maintain its existence and keep moving forward.
We will go through the criteria on which companies base their hiring of employees and managers specializing in data science:
The first stage: submitting applications
We can consider submitting an employment application as the first step; applications are then screened and scrutinized by the hiring staff against several conditions:
1. The applicant should meet the conditions and qualifications required in the job advertisement
2. The application should contain all the information the hiring manager needs to know about the applicant, including their skills, experience, and the achievements from their previous work.
3. The application should state the applicant's ability to be present at any location the company specifies, according to its requirements
4. The applicant should seek a recommendation from a member of the company if they know someone there personally, as this helps the applicant gain the recruiters' confidence to some extent.
5. The applicant should be realistic when estimating the expected monthly salary, taking into account their level of experience and their competence for the position they would assume in the company.
6. The applicant should state their ability to work the hours the company deems appropriate, while taking care to apply at the best time, which is usually during the graduation seasons when new batches of graduates enter the market.
The second stage: CV checking
Assuming the application is approved, the next stage is checking the applicant's CV, on the basis of which a decision is made: either the applicant is eligible to attend an interview, or the application is rejected and the applicant who does not meet the conditions is excluded.
Here are some points related to this step:
1. Recruiters prefer to see the date of graduation in detail on the resume because this helps them sort based on that date and thus makes it easier for them to make the appropriate decision that depends on the position they need.
2. Employment officials also prefer to get acquainted with the applicant’s skills and experience, in addition to the achievements he achieved in his previous job, as they contribute to improving the chances of acceptance and candidacy for an interview.
3. The applicant should make sure in his CV to arrange his works and projects related to the required and announced specializations first, from newest to oldest.
4. Avoid using flowery words that are useless, and only mention the appropriate words.
5. The applicant should be keen to mention the qualifications he possesses, especially with regard to his ability to continuously initiate the development of work and suggest additions that would raise the level of performance in general.
The third stage: the initial test (online interview)
One advantage of this type of test is that the body language that helps interviewers form an initial idea of the applicant's personality in traditional face-to-face interviews is absent, so the interviewers' attention is focused on verbal cues in the remote interview.
In this context, several things are recommended for the applicant to follow during the interview:
1. This interview method serves introverted people in particular, as the fear of a direct meeting for the applicant here is greatly reduced, which provides greater comfort when answering and reduces confusion.
2. You will draw the interviewer’s attention in a positive way if you conclude the interview by asking about the nature of the tasks that may be entrusted to you if your appointment is approved, the level of the work team, and other questions that indicate that you are interested and excited to join them.
3. Avoid, as much as possible, talking about your virtues in a way that shows that you are arrogant, i.e. show some humility when you mention your skills and experience.
4. When you present your answers, watch whether the interviewer is taking notes while you speak; when they stop writing, take it as a sign they are satisfied with your response, and do not prolong the answer more than necessary
5. Experience shows that applicants start the interview enthusiastic, speaking fluently and confidently, but that tone gradually fades over time until it is almost gone by the end, so beware of falling into this trap and keep the steadiness you started with.
The fourth stage: the pivotal one (panel and coding)
It is almost the most important stage, because it largely determines whether the applicant is accepted or rejected.
Here the interviewers put together a panel that tests the applicants' ability to handle difficulties and pressure by probing the technical skills they use to solve problems, often drawing on realistic situations that companies actually face in their work. Applicants are therefore usually asked to prepare work at home or to write code during the interview.
Therefore, we recommend several points that must be taken into account in order to overcome this transitional stage:
1. Focus while presenting the task assigned to you
2. Don't keep your attention only on the person asking the questions; try to interact with everyone on the panel
3. Support your answers with various examples within the framework of stories that explain your skills and experience
4. Be prepared to answer any question that may be asked of you related to the skill mentioned in the job description
5. Form an accurate picture of each member of the panel so you can deal with each of them in the way that suits each case.
Stage 5: Recommended behaviors after the interview
This stage is not obligatory, but it reflects a positive impression of you to the interviewer. If you want to improve your chances of earning the interviewer's goodwill and leaving a good impression, follow these steps:
1. Do not forget to thank the interviewer for their time, briefly and simply
2. Never end your current employment on the assumption that you passed the interview for the new job until you are absolutely sure you have actually been hired, so that you do not lose both.
3. Keep what happened in the interview confidential and do not publish any information about the company, unless you encounter fraud or deception; in that case, warn others so it is not repeated with applicants after you.
This was a quick overview of the steps and stages recruiters follow when selecting applicants. Once you know the criteria they use, it becomes much easier to plan ahead for a task that is often a source of anxiety for every data science job applicant.
Data has become the driving nerve of global commerce at every level, from large companies down to small for-profit projects. Data science has accordingly become the science of the age, data analysis skills receive the largest share of attention from businesses and professional events, and international seminars and conferences devoted to data are held regularly. In this article we will talk about the best conferences in the field of data science.
How to find the best conference:
Any place or organization that presents important ideas and information related to data science and its analysis is the destination for those looking for knowledge and experience in this field. Therefore, scientific conferences that discuss data science are considered one of the important outlets for them to gain valuable information.
Strata is considered one of the largest conferences concerned with data science topics and its analytics in all its branches. Many are keen to attend the scientific events held by Strata due to the exchange of experiences and communication with technicians and senior data scientists and gaining experience from them.
The role of data science conferences cannot be neglected, especially for people about to take a new job in data science: there they get advice and guidance that will benefit them at the start of their professional lives in data science and analytics.
So we will learn about the three best conferences for data science and analytics.
Top Three Data Science Conferences:
Data science employees are among the highest paid, which increases the demand for such jobs; landing one is the dream of every data scientist, so learners spare no time or effort to gain any information that deepens their experience, broadens their skills, and improves their chances when applying. For those who want to strengthen their credentials in data science, the data science community's vote points to three conferences as the best in data science and analytics: 1- ODSC East, 2- Strata, 3- KDD.
At these conferences you hear about many experiences others have gone through and about success stories from their professional lives. Experts describe the obstacles and difficulties they faced on their journey through data science and analytics and the methods they used to address those problems, so attendees gain experience and learn the best paths to becoming seasoned data scientists.
The same applies to industry: conferences are among the best places to learn data science, so anyone who gets the opportunity to attend one should make the most of it by absorbing as much information as possible.
How do you invest your presence in the data science conference?
Fortunately, these conferences are held throughout the year, but it may still be hard to find exactly what you need at a given event, so you must know how to choose the conference that will give you the benefit you are after.
For beginners in data science, Strata + Hadoop World is recommended as a way to keep up with modern technologies and the latest developments. Experts in data science tend to recommend the KDD conferences, particularly for visual analytics, while those who want more specialized skills or exposure to innovation should look at the Data Science Unconference or the Analytics Summit.
Once we talk about data science conferences, you should know that the choice will not be easy: the knowledge you hope to gain has to justify the money you spend to attend.
Finally, it should be noted that there are many other data science conferences, but the ones mentioned here are regarded as the de facto standard in the data science world.
In this article, we will show the similarities and differences between business intelligence and data analysis, with a brief overview of each.
Let us start with data analysis, which broadly represents data science: the process of extracting useful information from a data set that is examined and processed with a specific technique, in order to arrive at results that support the measures needed to keep a business, government institution, scientific body, or educational sector running optimally.
Data analytics provides highly efficient techniques for developing the commercial system as a whole, such as improving buying and selling processes and identifying the best-selling products and customer behavior, based on the data produced by the analysis. It comes in two broad flavors:
Confirmatory Data Analysis (CDA), which relies on statistics to test the validity of hypotheses about a data set, and Exploratory Data Analysis (EDA), which explores the data to discover patterns and decide which models and features suit it.
Based on the above, we can identify four types of data analysis:
Descriptive analytics: describes, based on facts, what happened in the past; event A occurred and then event B occurred
Diagnostic analytics: focuses on why those facts occurred; B did not just follow A, rather C caused B to happen
Predictive analytics: makes predictions about the future based on historical data; because B happened because of C, we expect B to happen again in the future whenever C happens
Prescriptive analytics: directs executive actions toward a specific goal; to prevent B from happening, we must take action Z
Business intelligence, by contrast, covers the plans and techniques that companies and institutions adopt to turn business-related data into sound decisions. It supports many forms of data and allows organizations to automate data collection and analysis, which makes every task easier to complete with the least possible time and effort.
To extract key information, business intelligence depends on the enterprise data warehouse (EDW): the main store of primary databases collected from several sources and integrated into a central system that the company uses to generate reports and build the analyses that, in turn, lead to the right actions.
Based on the aforementioned, we can determine the course of the procedures that make up business intelligence according to the following:
Collecting and converting data from different sources:
Business intelligence tools collect both structured and ad hoc data from various sources, then organize and classify it according to the company's strategy before placing it in the central data store, where it can easily be used later for analysis and exploration.
Determine paths and recommendations:
Business intelligence techniques include an extensive data-profiling layer, which makes the forecasting process, and the proposals and solutions it produces, more accurate and effective.
Presentation of the results in the form of graphic visualizations:
The data visualization process is one of the techniques that has proven effective in understanding the content of the results and sharing them with others. It is a process on which business intelligence relies heavily due to the availability of charts and graphs that enable business owners to form a more comprehensive and accurate view of the results presented.
Take the appropriate measures according to the data generated in a timely manner:
This step is usually carried out by comparing previous results with current ones across the business as a whole, which makes it easier for owners to take the necessary measures, make adjustments in record time, and build a sound base for future plans.
Differences between business intelligence and data analysis:
We must first touch on how the EDW data warehouse is structured.
The data warehouse is the basic environment for storing multi-source data so that it can be worked with later; it has no connection at all to the database system used for daily transactions. Companies and institutions use the warehouse to generate timely insights, solutions, and suggestions for specific practical issues.
Because the data stored in the warehouse comes from multiple sources and is processed online, it must be extracted from those sources, shaped to fit the company's strategy, and then loaded for OLAP (online analytical processing). An Operational Data Store (ODS), which the text notes has a longer retention period than OLAP, is used to prepare operational and commercial reports.
To put it simply, a data mart is a miniature version of the data warehouse that focuses on one functional area, such as sales, production, or promotion plans, and is maintained by a specialized branch of the overall system.
There is no doubt that a job in data science is a dream for many students of what is often called the science of the era: it lets them apply the skills the role demands and tackle the problems and obstacles that block the workflow, whether on their own or alongside colleagues who exchange experience and skills. That is why data scientists everywhere aspire to it.
That is the bright side of working in data science. On the other side, statistics indicate that large numbers of data scientists, especially machine learning specialists, spend a long time searching for new jobs.
In this article, we shed light on the most prominent reasons that drive many data scientists to look for new jobs; we have chosen four:
1. Colliding with a reality contrary to expectations:
Data scientists often start out believing the job is about solving the obstacles they face with machine learning algorithms whose rich capabilities benefit the business as a whole, but they collide with a different reality. A company may, for instance, hire people regardless of whether they have real experience in artificial intelligence, favoring young hires over seasoned experts. A newcomer may then find that the problems in front of him require techniques he has not yet mastered rather than the machine learning algorithms he knows, and that his handling of databases and analytical reporting falls short of the required level, which breeds dissatisfaction among managers toward the data scientist in the role.
For this reason, specialists offer useful guidance to novice data scientists to help them avoid such predicaments: choose an environment that matches your technical level, for example by looking for companies whose needs fit the skills you currently have, and grow those skills through continuous practice, since a beginner is not yet ready for every challenge that demands high efficiency and speed.
Novices are also advised not to apply to companies that do not place machine learning among their core strategies, because that will hold back the growth of a data scientist who aspires to reach the level of competence that leads to a better job.
2. The right person in the right place:
Hiring decision-makers need to form a positive impression of you when you apply, because that raises your chances of being prioritized for acceptance. That impression forms when they discover, through the projects you present, the skills they actually need, especially your way of handling a real-life problem faced by a specific group of people. The impression you leave tells them how much they need your services and how well your skills fit the company's general direction.
3. You are a data scientist who is able to handle all types of data:
To recruiters and interviewers you are "a data scientist", and from their point of view that means you can handle every kind of data, including databases, analytical reporting, and preparing the appropriate reports.
Even your co-workers will assume that you can handle all the data analytics tools, big data, and everything related to machine learning and artificial intelligence.
So once a company hires you, it assumes you are an expert in all of these areas. Be clear from the start: tell them which skills you have genuinely mastered and which you still need to refine, so there is no gap between what they expect you to deliver and what you can actually do. Some companies publish specific requirements for applicants precisely so that hiring managers can select people who meet those conditions and can operate as effective members of a team of data science experts.
When work is built on exchanging experience and genuine cooperation across specializations, the results show: professionalism becomes visible across the whole working environment, and users ultimately benefit from a product that is useful and comfortable to deal with.
For example, a data scientist who is an expert in machine learning techniques works best as part of an integrated system that uses time and effort optimally; the opposite, a single specialty working in isolation from a team with diverse experience, costs significant time and effort and hurts the workflow.
4. Integrated work among team members:
Some companies, however, push employees to build their own projects alone rather than drawing on diverse experience: any employee can write plenty of code to solve a specific problem or produce analytical charts, and if that consumes a lot of time it does not matter to them. Large companies are the opposite: time matters enormously, so they rely on integrated teams to accomplish complex tasks, and diversity of specialization is essential because they are in a constant race against the clock.
Applying this, choosing the right type of company is a fundamental pillar of how well you adapt to its environment. Expertise in a specific field, inside a company that values diverse experience, lets you work comfortably within your specialization and integrate with the rest of the team's competencies, so you avoid the trap of overload and burnout that eventually sends you searching for a new job all over again.
From all of the above, we conclude that choosing the right job goes a long way toward providing a comfortable work environment in which an employee can fully apply his skills and develop his experience, avoiding the constant hunt for something better. Psychological stability and comfort at work are the key to success and creativity at every level, so do not sell yourself short; choose carefully, with our best wishes.
While there is no secret formula to success, many thriving businesses do attempt to follow a few standard best practices to help them stay in the fast lane. Digital marketing is one area that many companies are focusing on because they see the value of concentrating their efforts online.
A consultation with DATA World will ensure that you stay a step ahead with proven data science and mentoring services.
Keeping on top of technology
It’s safe to say businesses can’t succeed without relying on technology to a large degree.
Data visualization is a rising software platform that more and more businesses are using to communicate better.
Keeping the lines of communication open is especially vital in this digital age when more and more people are working remotely.
Focusing on self-improvement
Business owners realize the importance of self-improvement. Hence, the reason why many seasoned entrepreneurs still take it upon themselves to continue upskilling themselves.
A business degree is always useful to have if you want to enhance the skills you already have. Try this to see why an online degree in business can help you push further.
A mentor can help you reach your goals much quicker than you might do on your own.
Networking with the right people can also broaden your horizons.
Staying with the plan
You will most probably have derived a plan right at the beginning of your business venture.
A S.W.O.T analysis can help to identify your strengths as well as your weaknesses, your opportunities, and your threats, so you don’t get caught off guard by anything you weren’t expecting.
Best business practices might seem like a complex formula to follow. Reminding yourself to take that course or a degree can help to enhance your focus on the strategic elements of growing your business even more.
To define this data set: Netflix is a media and video broadcasting platform that includes a large number of movies and TV shows, and according to statistics, its subscribers exceeded 200 million subscribers in 2021 from all over the world.
In this case, the tabular dataset consists of lists of all the movies and TV shows available on Netflix, plus information about actors, directors, audience ratings, and other information.
Here are some important ideas:
* Content available in different countries
* Choose similar content by matching attributes related to the text
* Finding valuable and interesting content by analyzing the network of actors and directors
* A comparison of the most popular broadcasts in recent years (movies – TV shows) on the Netflix platform.
Real or fake: predicting fraudulent job postings:
This dataset includes 18,000 job postings, of which about 800 are fake. The data consists of text plus descriptive information about each job, and it can be used to build classification models that flag the fraudulent postings.
The dataset can be used to answer the following questions:
* Build a classification model based on the text features to determine whether a job posting is real or fraudulent.
* Identify the words and phrases most indicative of deception and tune the model around them.
* Determine the characteristics of similar jobs.
* Perform exploratory data analysis on the data set to surface useful insights.
In this example, the datasets contain player data, with each player's abilities and skills, from FIFA 15 through FIFA 22 ("players_22.csv"). The data supports several kinds of comparisons of specific players across the eight versions of the FIFA game.
The following are available analytical models:
* A comprehensive comparison between Messi and Ronaldo (career statistics side by side, and how their skills change over time).
* The budget needed to build a team that can compete at the European level, in a situation where the budget does not stretch to buying star players for the full starting eleven.
* Analyzing the most efficient n% of players (say, the top 5%) to see how core attributes such as pace, agility, and ball control evolve across game versions. As a concrete example, the top 5% of players in FIFA 20 are faster and more agile than in FIFA 15; and if a larger share of the top 5% earn high ball-control ratings, we can conclude that the game weights skill and technique more heavily than the physical side.
Specifically, the dataset contains:
* The URL of each player's profile page.
* The URL of the player's face image, along with club or national team logos.
* Information about the player, such as nationality, the team he plays for, date of birth, salary, and others.
* Statistics of the player's skills, covering attack, defense, goalkeeping, and other abilities.
* Every player included in FIFA versions 15 through 22.
* More than 100 features.
* The position the player plays and his role in the club and the national team.
A bookstore's success depends largely on purchasing the right books, in the right quantities, at the right time. In this context, one of the leading industry events in the book and library world runs a competition that helps booksellers stay competitive in the market.
The competition is therefore to predict purchase quantities for a clearly defined portfolio of titles at each location, using simulated data.
The task:
Being competitive requires forecasting purchase quantities of eight titles for 2,418 different locations. To build the model, simulated purchasing data from an additional 2,349 locations is available, with all data covering a limited time period.
The data:
There are two auxiliary files available to solve the problem:
Densely populated areas tend to have more supermarkets, which creates commercial competition among them; that competition is good for the market and contributes to economic growth in general.
In today's research we look at a data set representing ninety days of sales from three branches of a supermarket company. It was chosen because its data lends itself easily to predictive analysis.
Classification data:
Invoice ID: This is an identification number for the sales invoice
Branch: Super Center branch (out of three branches indicated by symbols A, B and C).
City: the city in which the branch is located
Customer type: classifies customers as members (loyalty-card holders) or normal customers (non-members)
Gender: the gender of the customer
Product line: the general category of the item, such as food and beverages, sports and travel, electronic accessories, home and lifestyle accessories, fashion, and others
Unit price: the price of the product in US dollars
Quantity: the number of units the customer purchased
Tax: a 5% tax added to the purchase value
Total price: the total price including tax
Date: the date of purchase (the period between May and July of 2019)
Time: the time of purchase (from 9 am to 8 pm)
Payment: the payment method used by the customer, one of three options (cash, credit card, or e-wallet)
COGS: the cost of goods sold
Gross margin percentage: the gross margin as a percentage
Gross income: the total income
Rating: the customer's rating of the shopping experience, on a scale from 1 to 10
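Since the tax and total fields are derived from the unit price and quantity, they can be re-checked with a few lines of pandas. This is a minimal sketch, assuming a hypothetical file name ("supermarket_sales.csv") and column names such as "Unit price", "Quantity", "Tax 5%", and "Total"; adjust them to the actual schema.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the real dataset schema.
df = pd.read_csv("supermarket_sales.csv")

# Re-derive the 5% tax and the tax-inclusive total from unit price and quantity.
subtotal = df["Unit price"] * df["Quantity"]
df["tax_check"] = subtotal * 0.05
df["total_check"] = subtotal * 1.05

# Compare the recomputed values against the columns shipped with the dataset.
print((df["tax_check"] - df["Tax 5%"]).abs().max())
print((df["total_check"] - df["Total"]).abs().max())
```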
6. Control fraudulent procedures related to credit cards:
Detecting fraud in credit card transactions is critically important for card companies, so that customers are not charged for products they never purchased.
The data set covers credit card transactions made over two days in September 2013. Only a small number of fraudulent transactions were identified among the thousands of legitimate ones, so the data set is highly imbalanced: fraud accounts for only 0.172% of all transactions.
The principal components, the features V1 through V28, were obtained with a PCA transformation and form the numeric input variables. The features that were not transformed are Amount and Time: Amount is the transaction value, and Time is the number of seconds elapsed since the first transaction in the data set. The Class attribute depends on the status of the transaction: it takes the value 1 for fraud and 0 for a valid transaction.
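A first practical step with such an imbalanced data set is simply to measure the imbalance. A minimal sketch, assuming the usual file name "creditcard.csv" for this dataset:

```python
import pandas as pd

# Assumed file name; columns are V1..V28, Time, Amount, and Class as described above.
df = pd.read_csv("creditcard.csv")

# Class = 1 marks fraud, Class = 0 marks a valid transaction.
counts = df["Class"].value_counts()
fraud_rate = counts.get(1, 0) / len(df) * 100

print(counts)
print(f"Fraud share: {fraud_rate:.3f}% of all transactions")  # roughly 0.172% in the source data
```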
7. The 50 most famous fast food chains in America:
Fast food is food sold in a restaurant or shop, made from frozen or pre-cooked ingredients and served in packaging designed for immediate takeaway orders. It is produced in large quantities with speed of preparation and delivery in mind; according to 2018 statistics, fast food production was worth hundreds of billions of dollars worldwide.
Hamburger outlets such as McDonald's are the most common and most sought-after in the world, alongside other fast food outlets that assemble orders on demand from basic ingredients prepared in advance in large quantities.
It can be available in the form of kiosks, mobile cars, or quick service restaurants.
Content :
In our case, the data set is a study of the 50 largest restaurant chains in America for 2021, and its main fields are:
fast food chain name, U.S. sales in millions of dollars, average sales per unit in thousands of dollars, number of franchised stores, number of company stores, and total number of units for 2021.
The vertical format of the dataset:
• Fast-Food Chains – the name of the fast food chain
• U.S. Systemwide Sales (Millions – U.S. Dollars) – systemwide sales, in millions of dollars
• Average Sales per Unit (Thousands – U.S. Dollars) – average sales per unit, in thousands of dollars
• Franchised Stores – the number of franchised stores
• Company Stores – the number of company-owned stores
• 2021 Total Units – the total number of units in 2021
• Total Change in Units from 2020 – the change in total units compared with the previous year, 2020
You will have in your hands sales data for a number of Wal-Mart stores spread across several regions. Each store contains several departments, and your task is to forecast the sales of each department in each store.
In addition, Wal-Mart runs many promotional markdown events throughout the year, especially around the major official holidays, and the weeks containing those holidays are weighted five times higher in the evaluation than ordinary weeks. Part of the challenge is that there is no complete historical data for these markdowns.
stores.csv:
This file includes anonymized data on forty-five stores, indicating the type and size of each store.
train.csv
This is the historical training data file, covering the period from 5/2/2010 to 1/11/2012.
It contains the following fields:
• Store – the store number
• Dept – the department number
• Date – the week
• Weekly_Sales – sales of a specific department in a particular store
• IsHoliday – is it a holiday week or not
test.csv
This file is identical to train.csv except that the weekly sales are withheld: you must forecast sales for each combination of store, department, and date in this file.
features.csv
This file contains additional information about the store, the department, and regional activity on the given dates, with the following fields:
• Store – the store number
• Date – the week
• Temperature – the average temperature in the area
• Fuel_Price – the price of fuel in the region
• MarkDown1-5 – anonymized data on the promotional markdowns that Wal-Mart runs
• CPI – the consumer price index
• Unemployment – the unemployment rate
• IsHoliday – is it a holiday week or not
For reference, the four holidays fall within the following weeks in the data set, noting that not all holidays appear in the data.
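Before modelling, the three files described above are usually joined into one frame. A minimal sketch, assuming the file names and columns listed above and the usual pandas merge keys:

```python
import pandas as pd

# Assumed file names from the challenge described above.
stores = pd.read_csv("stores.csv")        # Store, Type, Size
train = pd.read_csv("train.csv")          # Store, Dept, Date, Weekly_Sales, IsHoliday
features = pd.read_csv("features.csv")    # Store, Date, Temperature, Fuel_Price, MarkDown1-5, CPI, Unemployment, IsHoliday

# Join the weekly sales with store metadata and the external features on store and week.
df = (
    train
    .merge(stores, on="Store", how="left")
    .merge(features, on=["Store", "Date", "IsHoliday"], how="left")
)

# Holiday weeks are weighted more heavily in the evaluation, so inspect them separately.
print(df.groupby("IsHoliday")["Weekly_Sales"].mean())
```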
For every beginner in data analysis, here are the simple steps for collecting, cleaning, and analyzing data:
For data collection, we wrote a Python script to crawl LinkedIn job postings and gathered all the necessary data, focusing on three regions: Africa, Canada, and America.
Fields:
* Designation: the job title
* Company: the name of the company
* Description: a description of the job and the company
* On-site / remote
* The employee's workplace (location)
* Salary: the salary of the position
* The company's website
* Criteria: employment requirements such as experience and the nature of the work
We will take reviews of some fifty electronic products from online stores such as Amazon and Best Buy.
Datafiniti provides a data set containing the review text, dates, locations, ratings, and related metadata. It is a huge data set, so we will look at the best way to use it and get the most out of it.
The value of this data lies in understanding the consumer's opinion of the purchasing experience. To clarify, consider the following points:
* What are the main uses of electronic products?
* Determine the link between ratings and positive reviews.
* How good is the variety of online brands?
What does Datafiniti do?
It provides direct access to web data by collecting it from a large number of websites to build comprehensive databases covering businesses, products, and properties.
Data analytics provide key insights and information to support your business planning, growth, and operational efficiencies. Marketing campaigns, product development, and customer recruitment and retention are critical business activities that benefit when customer relationship management (CRM) data analytics are used to understand trends, reveal subtle patterns, and identify new opportunities and leads.
This article illustrates several ways in which successful business operations gain a competitive advantage with a comprehensive CRM data analytics approach.
Baseline CRM Data Analytics
Data on business sales, marketing, and customers are the foundation for your business operations and strategy. Successful use of CRM software appropriate for your business size and type is key to collecting and analyzing these data. Baseline CRM data analytics are descriptive and diagnostic and are developed automatically from a wide range of sales and customer service performance data.
A good place to start is to relate marketing and product inventory statistics to your customer demographics, experience, behavior, preferences, and sentiments. Important presale data inputs include website click compilations, chat summaries, and social media tracking information. Post-sale metrics incorporate customer satisfaction and tracking data, such as additional purchases, spending pattern changes, and customer churn rates.
Needless to say, this data is valued not just by you and your company, but to others operating for nefarious purposes. Cyber criminals pose a huge and continuous risk, and the more data thatโs collected online the bigger the risk of sensitive data being hacked and stolen. As important as gathering customer data, you should protect your business from cyberattacks like malware, viruses and worms, ransomware, and man-in-the-middle attacks. Seek out a reputable IT security company that can help plug any holes in your security and monitor your systems 24/7.
Traditional business analytics analyze average sales and market segments, but CRM data analytics go much deeper to reveal subtle patterns, map long-term customer and product value, and create market predictions. Sales reports document product life cycles and predict future profitability and volume. CRM.org explains that customer life cycle data analytics provide insights to improve customer loyalty and impact. Geographic CRM analytics map customer locations, behavior, and experience to make distribution networks and territory management visually dynamic and easier to plan and execute. Baseline CRM data analytics are a proven commodity in servicing, retaining, and understanding existing customers.
If you plan to make upgrades to your CRM system, it may be expensive for a small business. If your business lacks the necessary financial history to qualify for business loans, you may be forced to explore personal loan options. Before doing so, be sure to check your credit report for irregularities. A ding on your credit history that catches you unaware may scuttle your planned upgrades.
Looking to the Future with Business Process Management and Automation Tools
Advanced CRM data analytics really shine in understanding your target audience’s personality, intentions, and likely behaviors. When integrated with automation tools and business process management (BPM), processes across the organization can be implemented to improve and optimize many aspects of BPM, including new process workflows.
By improving the efficiency of CRM processes, BPM can help businesses save time and money while also improving the quality of their customer service. In addition, BPM can help businesses to better understand their customers’ needs and expectations, leading to improved customer satisfaction. If youโre incorporating BPM for managing your digital processes, itโs important to constantly monitor its effectiveness and act on this information to make improvements.
A forward-looking analysis is needed to guide and shape new marketing campaigns, generate customer leads, and acquire new customers. Market segmentation, targeted content, and personalized messaging are all enhanced with knowledge gleaned from your existing customer database and mapped or projected into the future. Predicting customer characteristics and decision-making processes support your strategy for customer engagement and conversion of leads.
Analyze the factors that led to new customer acquisition and study feedback to learn what worked to pull them in. Wharton School of the University of Pennsylvania notes that sophisticated analytics use big data and artificial intelligence tools to understand where the market is heading and predict emerging market segments and new customer profiles.
Get the most out of your existing customer database by using these tools to sift through detailed, fine-grained website cookie tracking and large-scale patterns hidden in consumer behavior databases. Advanced CRM data analytics dashboards integrate diverse sources of information to help you shape marketing campaigns, product development, and product placement efforts. Use a risk management approach to mitigate any reputational and regulatory issues associated with potential algorithm bias and data privacy concerns.
Be sure to have a plan for how your content integrates with your CRM system. You can learn more here about how to create engaging content for your website. A high-quality content strategy can help boost your businessโs profile and customer engagement.
Resources and Planning
Improve your business strategy and operations using baseline and advanced CRM data analytics. Understand how CRM analytics fits into a larger-scope BPM, as well as the importance of cybersecurity, and use all the different business tools at your disposal. This will help you gain insight and information on future market trends to guide marketing plans, customer retention, and customer recruitment initiatives.
Today we will learn to create attractive and valuable bar charts with a simple set of code backed by some experience and technical skill.
There is no doubt that mastering the design of graphic visualizations is an important factor for any data scientist, so in this article we will learn about the most important procedures necessary to complete these designs using Python (Matplotlib & Seaborn).
Dataset:
In today's walkthrough we use a data set of Pokemons, chosen for the diversity of its characteristics.
It contains continuous variables (Pokemons have defense, attack, and other combat stats),
categorical variables (species, name, and generation),
and boolean ones (legendary or not), so we have a varied mix of fields to chart.
To load this data set directly, the core code for our example looks like the snippet below:
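The original post shows this step as a screenshot; the following is a hedged reconstruction, assuming a hypothetical file name ("pokemon.csv") and the usual column names of this dataset:

```python
import pandas as pd

# Assumed file name; typical columns include Name, Type 1, Attack, Defense, Legendary.
df = pd.read_csv("pokemon.csv")

print(df.shape)
print(df.head())
print(df.dtypes)
```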
Knowing the purpose of the analysis is the first step in designing strong visualizations: we start from the questions we want the data to answer.
Our data set can answer many possible questions; building an excellent chart starts with a question about a categorical value, such as the type of Pokemon:
In our example presented in this research, the most appropriate question to be answered is:
What types of Pokemons have the highest attack values?
To prepare the answer, we start by shaping the data and creating a first "raw" bar chart: we aggregate with groupby and plot the result with Seaborn, as in the sketch below.
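A minimal sketch of that first chart, assuming the "Type 1" and "Attack" column names from the loading step above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("pokemon.csv")  # assumed file and column names, as above

# Average attack value per Pokemon type -- the raw, unsorted first attempt.
attack_by_type = df.groupby("Type 1", as_index=False)["Attack"].mean()

sns.barplot(data=attack_by_type, x="Attack", y="Type 1")
plt.show()
```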
Looking at the resulting chart, it is clear that it does not yet answer the question reliably: it gives no precise indication of which type of Pokemon attacks hardest.
To reach an accurate answer, we sort the data in ascending or descending order and limit the number of items shown; keeping the top ten, for example, removes the noise and makes the chart more organized and useful, as sketched below.
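A hedged sketch of the sorted, top-ten version, under the same column-name assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("pokemon.csv")  # assumed file name, as above

# Sort the mean attack per type in descending order and keep only the top ten.
top10 = (
    df.groupby("Type 1", as_index=False)["Attack"].mean()
      .sort_values("Attack", ascending=False)
      .head(10)
)

sns.barplot(data=top10, x="Attack", y="Type 1")
plt.show()
```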
With the layout tidied up, we should not neglect color. Here that means choosing a single color: a chart draws its value from appropriate color choices, and scattering many colors dilutes it. A few extra lines of code also let us add a title, change the font size, and adjust the figure size.
We can pick the exact color we want using its hex code.
Here is one way the code could be written:
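The original code is shown as an image; this is a hedged reconstruction that continues from the top10 frame above (file and column names remain assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("pokemon.csv")  # assumed file and column names
top10 = (
    df.groupby("Type 1", as_index=False)["Attack"].mean()
      .sort_values("Attack", ascending=False)
      .head(10)
)

fig, ax = plt.subplots(figsize=(9, 5))                 # adjust the image size
sns.barplot(data=top10, x="Attack", y="Type 1",
            color="#1f77b4", ax=ax)                    # one hex color for every bar
ax.set_title("Pokemon types with the highest average attack",
             fontsize=14)                              # title and font size
plt.show()
```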
We can see that the result is becoming more organized, and we are close to an accurate answer about which type of Pokemon is the strongest attacker. The resized dimensions and a descriptive title that draws the reader's attention add further quality to the visualization.
Despite the quality achieved so far, the chart can still be made cleaner and more precise by removing redundant information. In our chart each axis carries a label that repeats what the title already says, so that repetition adds nothing.
The direction of reading also carries meaning and helps the reader orient themselves before reading the data itself. The common convention is that visualizations are read from left to right and from top to bottom, so whatever sits along that path is read first; this is known as the Z pattern.
Applying this pattern to our chart, we move the title to the left so that it is read first, and shift the X axis to the top for the same reason.
The code could look like this:
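A hedged reconstruction of the final version, again under the same assumptions about file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("pokemon.csv")  # assumed file name
top10 = (
    df.groupby("Type 1", as_index=False)["Attack"].mean()
      .sort_values("Attack", ascending=False)
      .head(10)
)

fig, ax = plt.subplots(figsize=(9, 5))
sns.barplot(data=top10, x="Attack", y="Type 1", color="#1f77b4", ax=ax)

# Left-align the title so it is read first (the "Z" reading pattern).
ax.set_title("Pokemon types with the highest average attack", loc="left", fontsize=14)

# Move the x axis ticks and label to the top of the plot.
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")

# Drop the redundant axis labels already conveyed by the title.
ax.set_xlabel("")
ax.set_ylabel("")

plt.show()
```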
With that, we have an ordered, easy-to-read visualization, and we can say we have reached the goal of building a well-designed bar chart.
There is no business, project, or commercial activity that does not need analysis or statistics, even on a small scale, whether to track buying and selling, customer interaction, which products are in demand, the reasons behind profit and loss, or any other element of the business.
As commercial techniques such as marketing, selling, and buying have evolved, it has become necessary to analyze the data behind a project in more depth and to acquire the experience needed to run advanced statistical analyses that yield more accurate and effective results.
Start by learning to code:
If you have no background in the basics of programming, starting from scratch with writing code will naturally feel difficult, but that mindset does not suit anyone who aspires to be a data scientist: determination and persistence are essential to getting started, however complicated it looks at first. It also helps enormously to learn alongside someone with real programming experience who can point you down the right path, draw your attention to mistakes, and show you how to avoid them. For a beginner, the best first language is probably Python, which is excellent for data analysis thanks to the many features it offers for handling different types of data.
Learn programming:
1. Codecademy platform:
The Codecademy platform is a great place to start learning programming, and Python is the best choice for beginning data analysis.
The platform's advantage lies in several things, including that it lets you write code directly in the browser, which few other platforms offer. That means that if something goes wrong with the code you wrote, you know the error is in the code itself and not the result of a misconfigured program you had to install on your computer.
The smooth progression and flexible transition between learning stages is also very comfortable for beginners and takes some of the fear out of learning to program.
Helpfully, the courses on this platform are free, of high quality, and a very good starting point for new learners.
Learn to analyze data:
2. Coursera Data Science Specialization from Johns Hopkins:
The free version of the Coursera Data Science Specialization gives learners a token certificate that is not officially accredited, but its real value is the confidence it gives you as a data science learner: it prepares you to demonstrate the skills you acquired in the course when facing technical interviews.
This series also teaches the R language, which is excellent for statistical analysis and is the language preferred by academics; most analysts in companies and in public and private bodies, however, prefer Python for data analysis.
It is clear from the style of these Python courses that they are aimed at software engineers who want to move into data science, so they assume you already have strong programming skills.
What distinguishes the Coursera data science track is that it starts from the beginning: it explains the main principles of how data science works, particularly programming in R, and establishes the broader concepts of data technology, analysis, and machine learning, so you can start using code to analyze data in complete comfort, which gives you more motivation to finish the courses.
Learn to query databases:
3. Stanford Online Course
In fact, the Coursera data science track does not include SQL in its curriculum, so it is advisable to turn to Stanford's free online platform to learn SQL on your own. The platform is run by professional instructors who use simple, varied teaching examples.
Learning SQL is very important for data scientists because it is how data is extracted from databases; once you have completed the Stanford SQL course, you are in a position to apply for a job in data science.
Consolidate what you have learned:
4. edx Principles of Data Analysis:
For anyone studying data science, the edX principles of data analysis courses are a good way to learn the fundamentals; more importantly, reviewing those principles and concepts helps each learner consolidate the information picked up in earlier courses.
One of the most important elements of learning well is training under different instructors: the learner picks up a range of skills and becomes able to approach processing and analysis in several ways, which makes the eventual move to machine learning and advanced statistics much easier.
Applying to a job in data science:
It is fair to say that sufficient experience and the required technical skills improve your chances of passing the final interview and landing a suitable data science job. You are exactly the person hiring managers are looking for: their basic requirement is someone whose capabilities raise the technical and commercial level of the company, drawing on the experience you gained in your courses and in the practical work they will hear about at the interview. They know very well that your store of knowledge and experience is a treasure they should not pass up.
This transition is an important stage in your academic and working life: you are now the data scientist everyone is looking for, so make sure to choose a company that will open new horizons of success and continuous growth. In the end, the difficulties and challenges at the start of the programming journey should not become an obstacle that leaves you frustrated after a few failed attempts. On the contrary, treat every bump as a chance to look for solutions that sharpen your expertise; you only learn by making mistakes, and you only get up after falling. Once you pass the stage of fear and begin to gain confidence, your motivation will grow, along with your desire to finish the path that leads to the goal you aspire to.
Today we discuss the basic concepts data analysts rely on in their day-to-day work in data science, and we walk through the main stages of a project using examples from the VBO Bootcamp / Miuul project.
1. Forming an idea of the problem to be addressed:
The first thing a data scientist does when tackling any professional problem is to understand the problem to be solved, and then understand the benefit that solving it brings to the institution he works for.
A correct understanding of the type of problem, or the nature of the work required, helps determine the most appropriate approach, and that judgment sharpens with experience and practice. In our example we will see different solutions built with two different mechanisms.
The data set used:
The data used in this project was collected to determine the budget needed to attract as many customers as possible, classify those customers, and tailor advertising programs to their requirements. We therefore used regression to estimate the budget and clustering to segment the customers.
The value of this strategy lies in being able to set production levels based on the profit rates we expect to reach.
2- Determine the type of data we deal with
In order to carry out this stage accurately, it requires knowledge of several points:
A. What is the type of correlation between the data in our example?
B. What is the primary origin of this data?
C. Are there any null values in this data?
D. Is there a defect in the data?
E. Is there a specific time for the origin of this data?
F. What are the meanings of the columns in the data set?
Working with a Kaggle data set makes identifying the data types all the more necessary for accurate results; a minimal exploratory sketch follows the checklist below.
* Familiarize yourself with the documentation of the data's primary source; that is how you spot outliers and empty records, if any.
* Verify all the variables (categorical and numerical) that are relevant to the project's data.
* Check the identified numerical variables for outliers, if any.
* Identify the categories that appear frequently in the data and those that hardly appear, by exploring the categorical variables.
* Analyze the correlation between variables to see how they affect each other; this helps us keep, during selection, the variable with the highest correlation with the dependent variable.
* Form a general idea of the characteristics of each element of the project.
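A minimal sketch of those checks with pandas, assuming a placeholder file name ("dataset.csv"); the IQR rule used for outliers is one common convention, not something the project prescribes:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder file name for the project data

# Missing / null values per column.
print(df.isna().sum())

# Split columns into categorical and numerical variables.
cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(include="number").columns

# Frequency of each category: spot classes that dominate or barely appear.
for col in cat_cols:
    print(df[col].value_counts(normalize=True).head())

# Simple IQR rule to flag potential outliers in the numerical variables.
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
    print(col, len(outliers))

# Correlation between numerical variables (keep the ones most related to the target).
print(df[num_cols].corr())
```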
Here is a practical example from the aggregation we ran on data describing the relationship between producer and consumer in a specific population unit and one of the shops located in that area.
The results show that: STORE_SALES = UNIT_SALES * SRP
At first glance the meaning of this relationship may not be obvious, so it is worth searching online to confirm that the aggregation is correct.
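A quick sanity check of that relationship can be done in pandas. The file and column names below (lowercased versions of the formula above) are assumptions:

```python
import pandas as pd

df = pd.read_csv("project_data.csv")  # hypothetical file name for the project data

# If store_sales really equals unit_sales * SRP, the difference should be close to zero.
diff = (df["store_sales"] - df["unit_sales"] * df["SRP"]).abs()
print(diff.max())
```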
3- Data Preprocessing
In our example, the chart made clear that there were no outliers or null records in the data, but we did remove a duplicate column detected in the table.
Inspecting the correlations showed that several fields are strongly related to one another (a sketch for reproducing the correlation matrix follows the list below):
Grocery_sqft x Meat_sqft → high negative correlation
Store_sales x Store_cost → high positive correlation
Store_sales x SRP → high positive correlation
Gross_weight x Net_weight → high positive correlation
Salad_bar x Prepared_food x Coffee_bar x Video_store x Florist → moderate positive correlation
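A hedged sketch for reproducing such a correlation overview as a heatmap, assuming the placeholder file name used earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dataset.csv")  # placeholder for the project data

# Correlation matrix over the numerical columns, drawn as a heatmap.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, annot=False)
plt.show()
```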
4. Data Engineering :
It is essential to understand the problems faced by the organization you work in: you need to create added value from the data, build key indicators, and handle the other necessary tasks.
The main goal of our project is to determine the budget needed to acquire customers, which lets us estimate an appropriate future budget at the lowest possible cost.
We created a number of new variables with one-hot encoding.
First we need to convert the categorical variable values into numeric values so the algorithms can use them, as shown below:
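A minimal one-hot encoding sketch with pandas; the column names ("country", "profession", mentioned later in the text) and the file name are assumptions:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder for the project data

# One-hot encode low-cardinality categorical columns such as country or profession.
# drop_first avoids keeping a fully redundant dummy column.
df = pd.get_dummies(df, columns=["country", "profession"], drop_first=True)

print(df.head())
```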
We also obtained new columns by splitting columns that hold more than one value, as in the case of the arguments column.
Here we can see which media channels are used most and directly affect the cost variable.
Motivational words used in promotional offers, such as "today" and "weekend" and other words urging the user to buy within a certain period, were extracted from the promotion field into a new column.
We also note that the columns passed through one-hot encoding are those with only a few distinct values, such as country and profession.
5. Standardization (scaling):
This step is needed so that no single variable dominates the data and so training converges in the shortest possible time.
We used the StandardScaler model because our data did not contain outliers.
If the data had contained outliers, the RobustScaler model would be the recommended choice.
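A minimal sketch of both options with scikit-learn, under the same placeholder file name; only the numerical columns are scaled:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

df = pd.read_csv("dataset.csv")                      # placeholder for the project data
num_cols = df.select_dtypes(include="number").columns

# StandardScaler: mean 0, standard deviation 1 -- fine when there are no extreme outliers.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# If the data did contain outliers, the median/IQR-based RobustScaler would be the safer choice:
# df[num_cols] = RobustScaler().fit_transform(df[num_cols])
```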
6. Estimation:
We then fit and evaluated several different machine learning models and tuned their hyperparameters; before that we had excluded weakly correlated variables so that training takes less time.
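The original project does not specify which models were tuned, so the following is only an illustrative, hedged sketch of hyperparameter tuning with a random forest regressor; the file name and the "cost" target column are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical setup: the prepared, numeric feature frame with a "cost" target.
df = pd.read_csv("prepared_dataset.csv")
X = df.drop(columns=["cost"])
y = df["cost"]

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(search.best_params_)   # best hyperparameter combination found
print(-search.best_score_)   # cross-validated RMSE of that combination
```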
7. Clustering:
The second track of our project is acquiring customers and keeping them as repeat customers, so we segmented the customers and estimated the value needed to do so.
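The project does not name the clustering algorithm; a k-means sketch with five clusters is an assumption based on the five segments described below, and the file and column handling are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features; the project below works with five customer segments.
customers = pd.read_csv("customers.csv")
features = customers.select_dtypes(include="number")

scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
customers["segment"] = kmeans.fit_predict(scaled)

print(customers["segment"].value_counts())
```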
A chart in the original project illustrates the resulting customer segments.
8- Graphic representation:
Data loses its value if we do not handle it properly. Successful analysis is built on describing the data correctly, and the best way to achieve that is to visualize it.
In our project we built a dashboard with MicroStrategy.
Project elements:
Store sales by type and cost: the purpose is to determine sales value and cost based on the type of store.
Stores location map: this map shows the distribution of stores within the city.
Customer chart: a map showing the classification of customers by country.
Distribution of customers by brand: using a word-cloud model, we can count customers' brands.
Media channels and the annual average (AVG): after running the marketing offers, we could determine the appropriate membership and the audience that makes that membership profitable.
Customer segmentation: shown using a scatter chart.
Based on the five resulting groups, you can now work with each one closely and form strategies appropriate to the plans of the company you work for.
Here are examples of the plans we created based on the ratio between spend and financial return:
High cost, high return: we spend large amounts to attract customers, and that spend comes back with ample profit. By extension, we can identify the channel that generates the most contacts and exploit it while trimming spend wherever possible.
High cost, low return: we spend a large amount to attract customers, but the return is low. This can happen for several reasons, including that customers do not find what they need in the store.
Low cost, low return: we spend very little to acquire customers, but we may be attracting a narrow audience that prefers a specific, low-margin subset of our products. The best strategy here is to build a marketing campaign around those preferred products, guided by statistics on the quantities and types of items in demand.
Low cost, high return: this is the case where we reach customers quickly and cheaply and they bring in large profits; marketing pushes aimed at this segment pay off.
Medium cost, low return: we spend to acquire customers, but the return is low because the store does not carry enough of what they want; some targeted statistics can identify and fix the gap.
This program lets you learn how to work with data online at a pace that matches how well you absorb and engage with the material, from non-coding fundamentals all the way to data science and machine learning.
DataCamp Learning Strategy:
• Complete learning: you must complete the interactive courses
• Continuous training: dealing with daily problems continuously
• Practical application: search for the most prominent problems on the ground and work to address them
• Evaluate yourself: identify your weaknesses and work to rectify them, identify your strengths and strive to develop them
The platform pairs each lesson with short interactive exercises and practical applications of the skills you have just learned.
After learning and acquiring sufficient skill, you can start working as follows:
You might start out as a data scientist and then move into data analysis; mastering the earlier skills will qualify you to enter the world of machine learning, then move on to data engineering, and then work as a statistician and programmer.
Several studies of beginners looking for their first data science job have noted that most of them struggle even to land an interview, despite the fact that demand for roles in this field keeps growing and the search is far easier for people who already have experience. Continued research points to two main factors behind the problem:
* The first problem is weakness in presenting yourself as a data science specialist.
* The second is the difficulty of being found as a data scientist: modern recruiting systems search for new hires using pre-programmed, automated screening techniques.
Here are three tips that will increase your chances of finding a suitable job in data science:
First tip: Create your own business portfolio:
There is no doubt that applications for data science vacancies keep increasing. According to one hiring manager at a growing company, the company received more than 40 resumes from applicants in a single month; if that is the volume at a growing company, imagine the giant companies that receive far more resumes every day.
Given those numbers, the applicants who get the job are the ones who stood out from their peers, typically by demonstrating a portfolio that covers data science, statistics, programming languages, machine learning, and the other related sciences.
Second tip: Use appropriate words when describing your experiences:
As mentioned at the start of the article, being found depends on automated searches, and how easily those systems find you depends on the keywords you choose when presenting yourself. If you are proficient in a programming language such as Python, simply mentioning it on your LinkedIn profile or in your CV creates a real opportunity to be found, and you gain an edge by listing more than one language on job platforms and in applications. Choosing encouraging, motivated phrasing that signals your level of experience in data science also helps draw the attention of recruiters looking for energetic, enthusiastic candidates.
Third Tip: Demonstrate high competence in problem-solving:
After completing the previous steps, you need to present yourself as a data scientist who can handle the problems that get in the way of the work and who brings his own methods to any emergency. Do this through projects in which you took a specific real-world problem and proposed a solution in a simple, scientific, practical way; that is what proves to a potential manager that you are distinguished and experienced, and it raises your chances of getting a suitable job.
People view a data scientist differently. Those outside the field see him as the super-intelligent person who can deal with any scientific issue, however difficult. Other data scientists know very well that a data scientist is simply someone who solves problems using data. In a business, for example, the data scientist's task is to apply what he has learned to avoid losses and to find everything that would increase the profits of the company he works for, through the models he delivers.
Within his specialization he naturally reaches for algorithms, programming languages, and all the techniques he has mastered, and then presents solutions and suggestions in a form the recipient can easily understand.
The situation is different when presenting your skills to recruiters. There you are expected to cite the numbers and results behind the solutions you delivered in previous roles, and to remember that you are in front of data science professionals who will weigh every word you say as a true measure of your experience and skill.
Expanding on your approach to the different areas of data science, and watching how the recruiters react to what you present, is a strong signal of whether you will ultimately be accepted, because that information becomes, in their eyes, the technical value they are looking for.
Some may find these tips insufficient to create a real opportunity for the dream job, but in practice, sticking to them and acting on them, whether by presenting a model portfolio or choosing expressive keywords that signal your skills, plays a major role in getting you onto a distinguished data science team.
You should create a file that archives your work and expertise so others can see your level and your skills. This is an important step for a data scientist, because a portfolio is how data scientists communicate what they can do: assess your skills, determine your technical level, and embody that in concrete work you can bring out when needed. The stronger your archive, the better your chances of being given priority in any field you apply to as a beginner in data science.
Below we explain why building a portfolio of several projects matters, along with several tips to help you as an applicant.
The importance of the portfolio comes from the fact that experience weighs heavily when applying for a data science job: broadly speaking, the more years of experience, the better the chances of being hired, and a certificate alone is not enough if it is not backed by experience.
That experience is built by following intensive courses from experienced sources in the field, and then by building a portfolio of real-world projects that apply what you have learned; this step matters enormously for getting the job you want.
The projects you take on should focus on data science skills and on handling data sets in general. You can publish them for public use on GitHub, and do not forget to write summaries of the results you obtained.
Your projects and work will be the focus of attention of other data scientists, and it will be your window through which they see your skills in data science, and it will be your opportunity for recruiters to see your potential
Data science projects:
There are many free ways to start data science projects online.
Once you have learned the statistical rules and principles behind data science, creating your own projects becomes easier and more flexible, and you will find your experience growing noticeably.
As you build data science projects, you will notice that you need to learn programming, apply statistical analysis techniques, propose solutions, and build data visualizations to reach the best results.
On the importance of data science projects, and of building a portfolio from them, a few points stand out:
Practical experience: building a data science project raises your ambitions; working on it builds confidence in yourself and in what you have achieved.
A forum of data science experts and specialists: you can exchange experience and skills with experts in the field by being present on platforms such as Kaggle, Stack Overflow, and Reddit, which serve as meeting places for data scientists.
Open source contributions: if data scientists who view your portfolio find your projects well designed, they may invite you to contribute to open source.
Training: the projects in your portfolio will likely be valuable material when you look for practical projects to practice on.
Job opportunities: showcasing outstanding projects in a polished portfolio is one of the strongest factors in landing a good data science job.
It is worth taking a comprehensive look at the foundations of learning data science, because they are what allow you to build projects that give your portfolio real technical value.
As with any profession, mastery requires understanding the details. The same applies to data science: to master a specialization you must invest your time fully in research, in learning, and in handling different types of data.
From our research into the best ways to present your work in a portfolio that demonstrates your experience, several points stand out:
* Project quality: as a beginner, you are not expected to start with difficult, complex projects above the level of what you have learned.
One of the most important things to do before starting is to define the project, its objective, and how it could benefit users, given your abilities and the tools available to you. Remember that as a beginner you are still learning the basic principles, so a project with undefined goals is doomed to fail. Answering the following questions lays the groundwork for defining the project's objectives properly:
What type of problem are you addressing?
What benefits will your analyses provide?
What skills will you gain from the experience?
And always remember: implementing projects has little value without sufficient understanding, and in turn you cannot prove your skills or demonstrate your expertise except by implementing projects; the two complement each other as you learn data science.
Portfolio of projects and files:
Documenting your projects matters a great deal: good documentation elevates a project into the category of successful work, and that depends largely on the quality of the code in terms of clarity and coherence.
Below is an example of what clean, well-documented Python code can look like.
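This is only a minimal sketch of the idea, not the original listing: a short, documented function with one clear job. The file name and column names ("sales.csv", "category", "revenue") are hypothetical placeholders.
import pandas as pd
def summarize_sales(csv_path: str) -> pd.DataFrame:
    """Load a sales CSV and return the mean revenue per product category."""
    df = pd.read_csv(csv_path)
    # Drop rows with missing revenue so the averages are not distorted.
    df = df.dropna(subset=["revenue"])
    # Group by category and compute the mean revenue for each group.
    return df.groupby("category")["revenue"].mean().reset_index()
if __name__ == "__main__":
    # Hypothetical file name; replace it with your own dataset.
    print(summarize_sales("sales.csv"))
The point is not the analysis itself but the shape: a descriptive name, a docstring, and comments that explain each step.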
The quality of your portfolio signals your skill and how smoothly you handle technical work; from the perspective of hiring managers and technical leads, it is the evidence of experience that everyone is looking for.
You can also write an article explaining what you did during the work, and make it easy to reach by creating a repository that contains the project you spent so long completing, with links to the basic ideas and concepts on which it was built.
In short, organization and ease of access are two key factors in shaping a successful project that, together with others, forms a professional portfolio.
Now we come to the publishing stage:
One of the most important factors in publishing successfully is learning to write good code, which means striking the right balance between the code you include and the code you should leave out. Studying educational books on Python, which programmers work hard to master, helps here, and with more reading and research your coding experience will grow significantly.
GitHub is one of the most suitable platforms for hosting Jupyter notebooks and presenting your projects, since it makes it easy to add information and material meant to be reused and shared. As you work, take care to show others how well you can simplify complex concepts.
We can now recommend three steps to creating a professional portfolio:
When building your portfolio, be careful to avoid the stereotypical. Many data scientists across many platforms already have portfolios, so a distinctive, professional portfolio sets you apart and makes your work stand out in a space full of broadly similar examples.
These tips will help you excel in creating your own professional business portfolio:
* Become a member of Kaggle:
Why join Kaggle? Simply because it is a huge community of data scientists at every level. Through it you can exchange experience and advice, find and publish datasets, and take part in skills challenges related to data science, gaining experience and broadening your abilities.
It is worth noting that employers do look at Kaggle profiles, and they know very well that your chance of landing a junior data science job scales with the technical level of your profile.
In addition, the platform offers valuable free material on machine learning, plenty of ways to interact on the site, and a smooth, flexible way to communicate with the people responsible for selecting candidates.
Datasets:
As we recommended earlier, build projects that solve practical problems taken from real life. Kaggle is ideal for learning how to approach this kind of problem: using the realistic datasets the platform provides, you can create a distinctive project that pushes you toward excellence and continuous improvement.
Competitions:
Google and other companies organize Kaggle competitions, which usually run for about three months and offer substantial cash prizes. Seizing the opportunity to participate gives a strong impression of your skill and efficiency in handling the problems that get in the way of real work.
Make sure to use GitHub regularly:
GitHub keeps your work visible to your followers, so they can stay up to date with what you are building and achieving.
GitHub also hosts the repositories of the major data science libraries and holds a huge amount of varied software resources.
An active, continuous presence on GitHub keeps you in regular contact with your peers, so collaboration and the exchange of experience remain open, especially when your profile is strong.
You can also create a website with GitHub Pages, which lets you host your blog and portfolio for free.
Write down what you learned:
A distinctive style of presenting your analyses and visualizations will matter to learners, who will become an audience for articles they find valuable, built on what you have learned.
Do not stop there: it is better to publish your articles, with direct links, on Medium and Dev.to.
Finally:
The appeal of your portfolio depends on the value of its content, from your specialization to your practical skills and projects.
Others will then be drawn to view your portfolio and will come to see your content as genuinely useful.
In this article, we will highlight some of the best graphic visualizations for the year 2022 related to specific events that took place during this year.
1. Most popular websites since 1993:
This visualization compares the most popular websites since 1993. Remarkably, Yahoo still held a high position in the ranking at the start of 2022.
2. How long it takes a hacker to crack your password in 2022:
Many websites now require passwords built from a varied mix of characters rather than numbers alone. The visualization above shows how much time an attacker would need to crack your password in the current year.
The value of this type of visualization is that it relies mainly on a distribution of colors indicating the different times required to break a password.
3. High prices of basic materials:
It is worth noting that the rise in the general level of prices and the continuous, growing demand for materials are among the consequences of the war between Russia and Ukraine. The visualization above shows the impact of inflation on the prices of everyday staples such as fuel, coffee, and wheat.
This type of chart can be thought of as a set of bars whose heights rise and fall over time in varying proportions.
4. The most famous fast food chains in the world:
The visualization above shows the 50 most popular fast food chains, ranked by the number of restaurants in America, with the classification based on the size and category of each chain. It shows that McDonald's is more popular than the other chains worldwide. This type of visualization is called an organization chart; it is meant to present hierarchical data according to a specific classification.
5. NATO versus Russia:
One of the most prominent events of this year is the Russian war on Ukraine. Through the graph representing the balance of power between Russia and NATO, you can get acquainted with the real information related to this issue.
This diagram is composed of a set of illustrations that convey the idea of the visualization to the viewer in an attractive, understandable way.
6. Fields of study in educational institutions:
The visualization above compares the most and least popular fields of study in American colleges. It shows that demand for sciences related to technology, engineering, and mathematics is growing rapidly, while demand for the arts and history remains low.
7. Most used web browsers over the last 28 years:
The visualization above shows the most used web browsers over the past 28 years, with Google Chrome holding the largest share of use relative to the rest.
This visualization is based on segments of a circular chart that grow and shrink over time, similar to the bar version, but it distinguishes proportions more precisely, independent of absolute numbers.
8. The most spoken languages in the world:
This visualization is simple but valuable: a bar chart identifying the most widely used languages in the world.
As the chart shows, English ranks first worldwide, followed by Mandarin and then Hindi (a short Python sketch of this kind of bar chart appears after this list).
9. School shooting incidents:
This visualization presents statistics on school shooting incidents in a number of countries over certain periods. The chart shows that the United States recorded the highest rate of this type of incident compared to the other countries.
10. A further rise in prices and wages:
In addition to the inflation affecting everyday staples, wages have also taken their share of the negative impact; it is well known that as inflation rises, the value of the US dollar falls compared with earlier periods.
This visualization is a chart showing how wage growth has varied compared with inflation from several years ago to the present.
With that, we have presented some of the best graphic visualizations of the most important events of 2022. They are useful models of different forms of charting, based on classification, sorting, and statistics, and you can draw on them whenever you need to build a visualization of your own.
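As a rough illustration of how the simpler charts above can be reproduced in Python, here is a short matplotlib sketch of a bar chart like the one for the most spoken languages; the speaker counts are approximate placeholder figures, not the values from the original visualization.
import matplotlib.pyplot as plt
# Approximate, illustrative figures only (not the data behind the original chart).
languages = ["English", "Mandarin", "Hindi", "Spanish", "French"]
speakers_millions = [1500, 1100, 600, 550, 300]
plt.figure(figsize=(8, 4))
plt.bar(languages, speakers_millions, color="steelblue")
plt.ylabel("Speakers (millions)")
plt.title("Most spoken languages (illustrative figures)")
plt.tight_layout()
plt.show()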
Articles, books, and online courses help you, as a beginner in data science, to raise your level to some extent, but on their own they do not give you the experience that professionals have, and they add little formal weight to your resume. There are, however, accredited courses that will put you on employers' radar and strengthen your chances when applying for any data science job. We will look at them closely, in the following order:
1- IBM Data Science Professional Certificate
This is the typical course for getting a strong start in learning data science. On one hand it is free, and therefore suitable for those who cannot afford paid certificates; on the other, it gives the learner the kind of grounding that builds confidence, since the company behind the certificate carries real weight in the field.
The course is flexible: it takes the trainee from the basics of machine learning and the principles of Python, from writing code to understanding and applying machine learning algorithms and other essentials for building a solid base of knowledge, all within a training period that experts put at no more than three months. You then sit an exam that you must pass to earn the certification.
2- Microsoft Certified: Azure Data Scientist Associate
You may find similarities between this course and the first, but it carries weight because it is backed by one of the world's major technology companies. Studying it gives you the chance to consolidate and extend what you learned in the first course, at a more advanced level.
The course teaches you to run your own models on the Azure cloud, and the training strengthens your skills in managing training costs, which matter a great deal to data science practitioners: running a huge network on your own equipment cannot succeed unless you fully understand how to invest resources appropriately for the job.
3- DASCA's Senior Data Scientist certification
Having passed the previous two certificates, you now face the hardest challenge: proving your competence at the level of a professional data scientist. This certificate is issued by DASCA in the United States, which alone is reason enough to give it your full attention.
The course is aimed at those with around four years of experience in data science, and it trains you on real-world models. Despite the effort involved, it is worth it, because the certificate qualifies you to apply for senior data scientist roles that pay well.
Although this certificate is not free, it opens a wide space of comprehensive, advanced knowledge in data science, and given that work based on it commands a high wage, as noted above, that is reason enough to commit to the experience.
Conclusion:
Once you complete these courses, you are unlikely to need others, and you can be sure you will attract the attention of business owners looking for experienced, highly capable employees. Mastering these courses and earning the certificates above will make your chances far stronger than those of peers without them; once the certificates appear in your CV, know that you are a leading candidate for the kind of job that many people in this field dream of.
One of the most important foundations of a successful online business is the set of skills and experience you bring, along with a personality and working style that raise the odds of success. This runs contrary to the common belief that pouring a lot of money into a project is the cornerstone of starting a commercial venture.
That is what we will discuss here: how to build your business without spending money, while consuming the least possible time and effort.
We will suggest six ideas to help you start a successful online business:
1. Selling digital content:
This type of project matters because of how widely digital content circulates and how strong demand is. Millions of people consume digital content in all its forms, whether videos, films, audio clips, music, or e-books.
Digital content can be a product bought and sold on its own, or a good sold alongside the main services offered by an individual or company.
Trading in digital content is popular with design and innovation pioneers: its flexibility lies in producing something once and selling it repeatedly, and in the ability of buyer and seller to deal remotely. Its success, however, depends on your skill in creating eye-catching content and impressive designs, because the electronic market is saturated with digital content. Standing out with a distinctive style is what will improve your competitive position as you enter online content projects.
The importance of digital content is highlighted in several points, the most important of which are:
• High margins: the revenue from digital content is close to pure profit, because there are no ongoing production costs for the goods sold.
• A promising future: with rapid growth, and statistics suggesting the value of the digital content market will keep rising in the coming years, you have significant opportunities to develop and grow your brand.
• Convenience: you can create free content that helps grow your personal channels, including your email list, and you can also earn by selling the rights to your distinctive digital designs.
• Automation: you can deliver your digital content with minimal involvement.
None of these advantages rules out the obstacles digital content producers may face, including:
• You may find it hard to reach your target market when customers can find free samples of your services, so you must keep producing more professional work that builds your brand.
• The risk of theft and piracy: choose software that helps protect your products so you can avoid these problems.
2. Financial support (crowdfunding):
The project creator opens an account on one of the funding platforms, and the contributions of backers to that account are collected over a specific period chosen by the project owner.
You can also share your project on Facebook after uploading it from your phone. From your project page you can judge the best date to launch your campaign and release your product.
It is worth noting that you should not hesitate to show appreciation to your project's supporters, for example by offering them material rewards.
When looking for customers, make sure the project is genuinely attractive enough to draw support, and survey your potential customers about your product or service so you can reinforce strengths and correct weaknesses. Search engines can also help you discover what attracts people, sparks their interest, and meets their evolving needs.
3. Building a virtual educational platform that brings in profit:
What guarantees a successful work plan, with the ability to keep improving along it, is building a distinctive educational platform, and that depends on two main pillars:
• Leadership capable of dealing with the various obstacles and difficulties facing the virtual team behind this kind of platform; appropriate actions and decisive decisions create a kind of wisdom in solving any problem.
• One of the biggest obstacles to success in this kind of project is postponing today's work until tomorrow; a successful strategy rests on doing the right work at the right time.
These platforms matter because a large number of learners turn to them when individual training is available.
Among the characteristics of success in managing these platforms by leaders is the availability of several factors:
• Continuity in putting forward everything important and useful
• Effective interaction
• Diversity and freshness, keeping up with everything new
• Avoiding technical faults of all kinds
• Skill and flexibility in dealing with others
4. Providing web hosting services:
This service covers providing a domain, site hosting, and development; once you have a computer, you can start this project.
The project is profitable because website hosting is one of the most common needs today, and demand for it keeps rising significantly. The service, whether provided by a company or an individual, includes storage, email accounts, and databases, along with a management interface for the site owner.
The success of this project depends on your online presence: the larger the audience familiar with your website, the better your chances of winning clients.
From the above, here are five simple steps to start a hosting business:
1. Set up your website and define the value of your services and channels.
2. Choose a web hosting brand and your target audiences, picking a simple, easy company name.
3. Develop and expand the line of business around your hosting.
4. Pay close attention to advertising the services you provide, highlighting your offers and features through online campaigns as well as printed materials, and start with friends.
Caring for customers is no less important than the previous items: avoid stalling customers if a technical problem on your side causes a fault, and resolve any emergency touching customers' finances quickly and in a way that reassures them.
5. Selling subscription services:
A study by specialists in e-commerce showed that e-commerce is growing dramatically and rapidly.
Companies that offer customers an online subscription service see lower costs thanks to repeat purchases of the required products, which keeps the business relationship between producer and customer alive.
Among the factors that make a subscription business worth serious consideration:
• Predictable revenue.
• Lower customer-acquisition costs.
• Customers keen to keep doing business with you.
• Flexibility in selling products.
• Cash always on hand.
In the event of starting a business based on subscription, the following items must be noted:
• Give customers constant attention in order to retain them.
• Keep reminding customers of the value of the product or service you provide.
• Follow a sound marketing plan.
• Keep looking for suitable offers.
• Gather customers who are ready to subscribe.
• Set a free trial period.
6. Earn money by reviewing products:
Some brand owners rely on people who follow their commercial products, for example to give feedback on new products that have not yet reached the market, in exchange for compensation such as:
• Money
• Merchandise
• Gift vouchers
To start a project like this, you need to create a blog where you offer your product review services, and you can also use Amazon Mechanical Turk.
If you decide to join their team as a reviewer, brand owners will treat you as a customer: through your reviews of their goods they learn how consumers think, which makes it easier for them to improve or develop the product based on your assessment.
The basic principle is that accepting such a job depends on passing an eligibility test, with the company evaluating how you review their product, provided you are one of its consumers.
Given all of the above, working under a strong, established brand does not prevent us, especially once we have gained experience, from founding our own business, whose success is determined by the extent of our desire and ambition.
With the rapid development of information technology in general and communications in particular, software companies continuously release smart services and modern applications that touch the details of our daily lives: applications for measuring blood sugar or tracking calories burned, for example, and other programs that offer guidance on users' physical and mental health.
These applications build an information system tied to each user personally; used correctly, they give accurate results. We will look at how these services affect users and how far they can be directed and invested to serve our daily needs, whether health-related or related to the tools we use constantly.
Sources:
As these applications collect our data, that data is used to make our lives more enjoyable and comfortable.
Here we will analyze the structure of the data, starting with two columns: the first containing the data sources and the second the resulting information.
Smart devices that connect our bodies, our behavior, our projects, and the Internet turn us into digital-physical elements, and they have become a focus of attention worldwide. We will call these tools "devices".
Outputs:
Imagine an application that records your sleep times and analyzes them to work out your optimal schedule, then sets an alarm to wake you in the morning; another that measures your breathing; another that analyzes your heart rate from skin color. All of these services are delivered through "apps".
Key technologies are built for these kinds of applications to handle their common tasks, so that developers and programmers can use them to reach the devices that produce the data feeding the applications. These are called "APIs".
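As a purely hypothetical illustration of the idea, the short Python sketch below shows an application reading one measurement from such a device API over HTTP; the URL and the response fields are invented for this example.
import requests
# Hypothetical endpoint and response format, for illustration only.
response = requests.get("https://api.example-device.com/v1/heart-rate")
response.raise_for_status()
reading = response.json()  # e.g. {"bpm": 72, "measured_at": "2022-05-01T08:00:00Z"}
print("Latest heart rate:", reading["bpm"], "bpm")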
Some companies use application users' information for advertising: they analyze our daily needs and basic requirements and derive models from them that yield higher-value advertising material.
The process of relying on a source of information and analyzing its data can be called "business".
Some corporate research is based on mining valuable information from the ocean of user data so it can be invested in fields such as medicine or marketing. We will call this process "research".
In the end, we cannot conclude from the above that user information is exploited only for advertising; it is clear that some companies strive to provide users with a genuinely useful service, which strengthens trust between producer and consumer, in what we will call "experience".
Here the difference becomes clear between those who act as data sources and those who turn data into outputs. To clarify, consider the evidence on the ground:
The question is: are you, as a user, ready to hand your digital information to a company so it can be used for something valuable and useful to you?
With this clearer view of the data structure, it seems likely that the future of technology will lead us to link sources with outputs, so that each of us can use our personal information to create something more useful and more valuable in easing our daily lives.
In this simple tutorial, we’ll explain One-Hot encoding with Python and R.
This model recognizes numeric values โโonly as inputs. In order for our model to work with data sets, we must encode them, as we will explain later.
What is One-hot encoding?
This encoding converts groups of data represented by words, letters, or symbols into numeric values: a set of positions filled with ones and zeros, one position per group or category.
Each observation gets a 1 in the position of its category and a 0 everywhere else.
We will illustrate the One-hot encoding process with a practical example in Python and R:
Using Python
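The listing here is a minimal sketch with pandas; the small table and its single categorical column are invented for the example.
import pandas as pd
# A small categorical column: each fruit is one category.
df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "cherry"]})
# get_dummies creates one binary column per category: the row's own
# category gets 1 and every other category column gets 0.
encoded = pd.get_dummies(df, columns=["fruit"], dtype=int)
print(encoded)
Algorithms that expect purely numeric input, such as linear models, can then consume the encoded columns directly.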
Using R
So what is the significance of this encoding?
When important datasets contain categorical variables that must be fed into a model that accepts only numeric inputs, as many algorithms do, one-hot encoding is the best option.
Questions are often raised about what advice someone applying for a first job in data science should follow. The field rarely offers on-the-job training for newcomers, since most data science teams are busy with their own varied work, so a new employee must largely work independently from the start.
What follows is a set of guidelines that, if followed, will ease the dread and improve the chances of success.
Technical expertise:
You need several skills in order to build confidence and make a strong start when applying for a job, including:
• Confidence handling programming languages
• Solid data analysis skills
• An understanding of machine learning algorithms
• Positive communication with others
With these skills in place, you are on the right path. It is a good start, and we will go into some of the important details more deeply below.
Create an archive of your work and skills (Portfolio):
When a data science position opens and applicants flood in, it is hard for recruiters to distinguish the most qualified among them. This is where a personal portfolio matters: it signals the applicant's level of experience and skill and improves their chances of being shortlisted.
This does not mean you must build advanced or complex projects. Once you find yourself working comfortably with data science techniques, it is enough to present simplified projects, such as a predictive model built around a topic you researched. Kaggle is the most suitable place to learn how to build simple projects: it contains valuable learning material created by data science experts, through which you will pick up the basic concepts and useful techniques needed for a first project, and by adding your own skills to what you learn there you can build a clear, solid foundation for your career path.
Your portfolio becomes rich and valuable through steady practice on data science projects, and with it comes the expertise to solve the problems you will meet while implementing them. Practical application is always the essence of learning; theory alone is not enough, and standing out among your peers comes from continuous practice and accumulating experience.
You can deepen your knowledge and sharpen your skills with DrivenData, which poses real-world problems and invites you to take on challenges and search for solutions that benefit the environment and people. Other platforms bring together expert data scientists who tackle challenges with positive social impact; working alongside them is a rich source of the experience you need.
The stage after completing your projects is creating a free website to display the portfolio, which does not require much time or effort; we will explain how later in this article.
Writing about your work:
The HR team usually handles the preliminary screening for data roles, and its members do not have deep knowledge of data science techniques, so choosing the right description of your portfolio is the key that gives them a general idea of your project.
Recruitment team members do their best to select candidates carefully, to avoid mistakes and to complete their task smoothly.
A blog is the perfect place to write about your work, and Kaggle is a good environment for documenting your projects.
Organizing an outstanding CV:
It is essential to have a resume organized according to clear criteria, using one of the attractive templates widely available on the web; we covered how to organize a professional resume in a previous article.
Share experiences and skills:
Most companies rely on the expertise of their own employees, but consulting specialized data scientists lends a professional character to a company's business and analytics, and you can quite simply build a strong network through several channels:
Meetup: It allows you to create an account that enables you to communicate with people within your surroundings.
Events: Through which you can explore different scientific events, especially data science. These events create a suitable environment to get to know people with common interests.
Conferences: It includes many conferences, especially those that have a distinctive educational nature, in addition to being a forum for communication between data scientists.
Mentor: none of the above replaces having a mentor to teach and support you in your career. One of the deeper benefits is access to your mentor's network; the more contact you have with data science experts, the more experience and skill you will gain.
Make your start at smaller, growing companies:
Since most major companies depend on a stable workforce, it is hard to find openings there without enough experience, so your chance lies in finding a job at a growing company. That will be the starting point that one day leads to your dream of joining a professional team at a major company.
The benefit of joining a growing company is not limited to that; there are several other advantages:
• In a smaller company you have more opportunity to interact with your superiors, so they see your work and get to know your skills up close.
• Your chance to learn new things is greater, thanks to the variety of roles and tasks.
• In a growing company you will have greater opportunities for promotion.
Take care to deal with all types of data:
Constant, continuous work with different kinds of data builds technical experience, such as the skills of a data analyst and many other tasks you can master in the near term. It makes you a data scientist who is not held back by difficulties, whose expertise and efficiency set them apart and earn the attention and confidence of business leaders.
In conclusion, data science offers many of the opportunities you dream of, but you should spare no effort or time in learning, and you should not pass up anything that raises the level of your experience and skills; the rewards are abundant.
The volume of data tends to get more attention than its accuracy or type from those responsible for collecting and analyzing it; this emphasis is the most prominent feature of how work is organized in the big data space.
Yet that emphasis has not spared some companies from technical errors in their marketing databases. Strikingly, statistics recorded a very large percentage of such gaps in the records of one of the largest companies in the world.
Some of the pitfalls that were observed were highlighted as follows:
• Insufficient knowledge of industry information
• Missing recorded information on revenues
• Inattention to employee records
• Neglecting to record customers' job titles
Perhaps the points above make us reconsider the assumption we started with, and remind us that anyone who works with data should pay more attention to its quality and accuracy than to its volume if they want to reach the desired goal and grow the business.
This is reflected in several reasons, the most important of which are:
Attention to sales:
When salespeople are armed with plenty of accurate, correct data, they can use their full potential and experience to win as many active customers as possible, and they avoid wasting time searching for workarounds to the obstacles in their way. The same applies to marketing staff: it is not acceptable for a salesperson to look up a customer's number or email and discover it is missing from the contacts database. Attention to accuracy spares the team such mistakes and lets them focus on convincing as many customers as possible to buy a product or service, doing their part to the fullest.
According to marketing experts' reports, email, mobile, and search engine optimization are the channels where big data has the clearest impact on their marketing systems.
Focus on the important points of the target group:
Building on that, sound and accurate data plays a major role in letting marketing staff show their competence and apply their experience and judgment along the right track, for example by quickly studying each customer's record so they can craft well-considered messages that match the target customer's interests.
Avoid wasting time and money:
Disorganized data slows salespeople down: instead of investing their time in organizing the marketing plan and preparing and sending promotional messages, they spend it hunting for ways to reach customers. Sound data is how you avoid falling into a cycle of confusion, wasted time, and everything else that blocks the workflow.
Good Sales Leadership Increases Profits:
The deep knowledge that comes from handling clean data well gives the team the experience and foresight to deal with commercial activity of every kind, especially understanding transaction volumes, market requirements, selecting projects with reliable returns and economic feasibility, and forecasting sales and revenues.
We conclude from the above:
A large amount of data is pointless if it is not organized and consistent; once it is, that huge amount of organized, clean data becomes a mainstay for the company and its staff, and the main pillar for developing any business activity and achieving the required results efficiently.
How to write a killer resume and ace the interview
Landing a job in data analysis is the biggest goal for practitioners of the field, yet even with the necessary experience and the skills that make them highly qualified analysts, the dread of the job interview remains an obsession that causes confusion and sometimes keeps applicants from getting through the interview smoothly. Applicants must overcome that dread and tension by staying confident during the interview, which leaves a positive impression on the examiners and increases the chances of acceptance.
Beyond that, one of the important factors in increasing employment opportunities is organizing a CV that impresses whoever reads it.
So we can now say that the most important factors of success in the interview are:
• Applying for a suitable job that matches your skills and experience.
• Organizing a CV that is attractive in both form and content.
We will elaborate on each of these factors separately:
Choosing the right job and applying for it:
After the long effort it took to reach your current level of competence and experience and to define your career path clearly, you should crown it all by choosing an appropriate job to apply for, focusing on several points:
• Look for a job that matches your experience and skills, as this will help you stand out in your career.
• If you already work at a company and want to move to a new role, try to move within the company itself. Your familiarity with the work environment and your colleagues makes you the natural first candidate for a higher-level internal position.
• Use well-known job sites to learn about available opportunities, such as craigslist.com, LinkedIn.com, incrunchdata.com, and dice.com, all of which carry many job postings.
Distinguished CV organization:
Having chosen the right job, you face the next challenge: organizing a distinguished CV that impresses the reader and leaves a good impression on the interviewers. Excellence in CV writing means giving an accurate account of your work and experience, with dates and supporting certificates, while following several points, including:
• Do not just state your strengths as a data analyst; give clear, tangible evidence and practical examples, such as telling your professional story in all its stages concisely and understandably, including the impact of your projects on the businesses you were part of.
• Talk briefly about the ways you could contribute to the prospective role and suggest solutions to some plausible problems; this builds the confidence of those responsible for hiring and shows them you would be a valuable addition to their team. Strong support for the written content, plus distinctive formatting such as bold headings and key paragraphs, goes a long way toward drawing attention to your skills and experience.
• Choose phrases that present you as a skilled, professional data analyst and spark the interest of the assessors. Avoid empty expressions and jargon; replace them with practical examples of the innovations and solutions you delivered in your projects and the impact they had on overcoming problems.
Acing the interview
After your CV has been admired and accepted, you are heading to the interview:
You are now at a pivotal point that will shape your professional future, so do not hold back in preparing well for the real, decisive test. A few guidelines:
• Knowing the details of the business, how revenue moves, and the strategies the company follows keeps you aware of its general policy, which makes it easier to give useful answers that satisfy the interviewer within the company's professional context.
• In the interview you may face difficult questions you did not expect. Rehearsing your story thoroughly is part of good preparation for this kind of question, so you can answer without showing tension or confusion, which are the main enemies of a successful interview. Remember that self-confidence is your chief ally; arm yourself with your skills and technical knowledge and demonstrate them through practical explanation in front of the panel.
• Be careful to show genuine interest and an eagerness to join the company's staff, and show your readiness to face the challenges holding the company back, putting all your experience at its disposal. Taking a general problem, breaking it into parts, and treating each part separately will leave a positive impression, show that you are a skilled analyst, and improve your chances.
• Arrive at the interview on time; lateness and indifference become the first negative impression of you. Beware of arrogance and exaggerated pride in your skills. Good manners and good interaction, along with care for your appearance, leave a good impression. At the end of the interview, do not forget to thank the panel for their time and let them see a serious desire to work at the company.
Thus, good preparation for the job interview gives the applicant a dose of self-confidence that can remove the dread imposed by the atmosphere of tests and interviews in general.
Recently, people have been flocking to study data science, and this science has become the most popular and sought-after science in the last two years.
The demand for higher degrees in data science has spread widely and rapidly, online training courses have become abundantly available, and earning data science certificates has become ever more popular, as on Datacamp, Udemy, and Coursera, allowing learners to enter the field accurately and proficiently.
However, this noise has begun to fade among skeptics who question how long demand for this type of science will last.
Some statistics point to a shrinking of the huge halo that surrounded data science compared with past years, and treat data science as a passing event that will disappear, to be replaced by a newer, more advanced science.
In some articles, these statistics were used to urge researchers learning data technology to work in data engineering, framed as the science that will continue data science in a more advanced form.
One researcher, who speaks with great passion about the continuity of data science as one of the most important sciences of the era, says his ongoing research produced a preliminary picture of data science workers, especially beginners, who are scattered and confused about whether it is worth continuing in this science.
Amid this confusion about whether data science can keep up its previous pace, there are three questions we must answer, which may be the way to replace doubt with certainty:
1) Will data engineering become an inevitable alternative to data science and thus data engineer becomes more in demand than data scientist?
2) With its rapid development, will machine learning technologies take the place of the data scientist?
3) Amid this rapid quantitative and qualitative development in the data space, is getting a job in data science still as attainable and as important as it was?
Comparing data science and data engineering:
The researcher mentioned above continues: after continuous, diligent research and several comparisons between those who expect data engineering to dominate data science in the near future and those who see data science as the main pillar for handling data of all kinds, it turns out that neither field is less important than the other. In other words, we cannot claim that data engineering is a replacement for data science.
This conclusion started from observing how companies, especially large ones, rely on data engineers to handle different types of data and prepare it for optimal use.
Then comes the role of data scientists and analysts, who turn that data into a profitable asset through which these companies reach the desired result.
Yet for all the importance of data scientists in creating that profitable value, they alone could not cope with the huge amount of raw data flowing in over a short time. The two roles complement each other, and each has its own mission.
This line of research raises an important question that cannot be overlooked: can automation take over the role of data scientists?
Answering it means assessing the effectiveness of the tools companies adopt for building predictive models, and whether those tools can do a data scientist's work. For example, can a platform like DataRobot help analysts produce predictive models the way data scientists do, without hands-on machine learning expertise?
Looking closely at the effectiveness of this particular tool, two points emerge:
1) The tool is very flexible to use, especially when importing data in all its formats and handling it with ease.
2) The tool can sift through a branching set of options to arrive at a final result with high accuracy, which saves time and effort.
Even with these capabilities, machine learning tooling cannot, in the long run, complete the work without the expertise of data scientists: tasks such as weighting features and the other preparation steps that make the work possible cannot simply be neglected.
Each stage of data processing has its own function, and this is what data scientists provide: detailing, sorting, and organizing data according to the data and the requirements. Hence the essential role of human judgment when working with these technologies, and hence why it is difficult to automate a large share of data scientists' jobs.
All of this confirms that combining human expertise with software that speeds up routine tasks is what makes the work complete and indivisible, so neither can replace the other.
Which brings us to the most important question of this piece: is there still demand for data scientists?
Statistics for 2020 suggest that a single person generates data at a rate of about 1.7 megabytes per second.
Data plays an effective role in developing industry in all its forms, including, for example, tracking marketing operations: through data points we can improve the marketing process, reach better targeting plans, and monitor how the audience interacts with the marketing material.
A data analyst cannot perform all of these tasks alone; automated and software techniques play a major role in carrying them out, but they cannot erase the role of the analyst and the practical experience needed to complete the required work. What distinguishes a data scientist is practical skill, which is a completely different thing from studying data science in theory.
Practical experience is the basis for working with data. The point of theoretical knowledge is to apply it on the ground, to handle every eventuality, and to find solutions to the obstacles a data scientist meets in the course of the work; someone with those skills has a scientific and practical value that cannot be ignored.
We conclude that no amount of progress in information technology can cancel data science, so talk of this science starting to disappear is unfounded.
We have seen that companies still rely on data science experts to find solutions and overcome obstacles that machine learning cannot handle alone, and beyond that, no automated technology can take over the role of a data scientist with their expertise and skills.
According to estimates from the Small Business Administration, more than 627,000 new businesses are opened every year. One of the most challenging aspects of starting a new business is figuring out how to fund it. Fortunately, grants and programs exist to help new business owners get started. Read on for some tips, courtesy of Data World.
Government Grants
The federal government offers thousands of grants for companies with a variety of backgrounds. A good place to begin your search for government grants is the Grants.gov website. In addition to the various grant programs offered by the federal government, many state and local governments have their own programs.
Small Business Innovation Research Program
The SBIR provides grants to small businesses interested in contributing to federal research and development that has the potential for future commercialization. This highly competitive, awards-based program aims to assist businesses with achieving technological innovation and scientific excellence. To qualify, your company must be a for-profit company that is more than 50% controlled and owned by citizens or permanent residents of the United States and has no more than 500 employees. The SBIR website offers a series of courses that include information about the program and how to apply.
U.S. Department of Commerce Minority Business Development Agency
The MBDA offers grants and loans to help minority-owned businesses. You can find out more information about available grants and application procedures by contacting your state or local MBDA Business Center.
The United States Economic Development Administration
The EDA is part of the U.S. Department of Commerce and funds businesses that support national and regional economic development. Examples of businesses that can apply include construction, technical assistance, planning, higher education, and research and evaluation. Funding opportunities and deadlines change. You can find the latest information on the website.
Corporate Small Business Grants
Many large companies offer small-business grants as a philanthropic effort. Some of these grants are only for nonprofit businesses, but for-profit ventures can also qualify for some programs. One example is the FedEx Small Business Grant Contest. This annual contest awards $250,000 to 12 small businesses. U.S.-based for-profit companies with fewer than 100 employees are eligible to apply after six months in operation.
Members of the National Association for the Self-Employed can apply on the NASE website for monthly grants up to $4,000. Applications are reviewed in April, July, October and January. Grants are approved based on need, use and the potential impact of the grant on the business.
Handling Other Administrative Details Like Forming an LLC
In addition to finding funding, there are a variety of administrative details you must take care of to legally operate your business. Choosing what type of legal entity to operate your business under is one such task.
Organizing as a limited liability company can save you money on taxes, save you time on paperwork, provide greater flexibility and protect your personal assets from claims by business creditors. The regulations vary by state, so it can be useful to utilize a formation service to make sure you get all the details correct. These services are familiar with the rules and regulations and can save you from having to do the LLC registration legwork yourself. They are also usually less expensive than hiring an attorney.
These are just a few of the resources available to entrepreneurs. Your local chamber of commerce, small business administration office and any professional organizations you belong to are good resources for additional funding information.
At this point, we will apply our neural network to a working model and verify its correctness, now that we have completed the Python code for the forward and backward passes.
It is worth noting that the neural network must learn the appropriate weights for this task on its own.
Training the neural network for 1,500 iterations, we notice that the value of the loss gradually decreases with each iteration, as shown in the graph, which is in line with the algorithm described above.
The final prediction results after 1,500 iterations are as follows:
Predictions after 1,500 training iterations
Comparing the predictions with the real values, we find that they agree with only a slight difference. This means the training of the neural network succeeded thanks to the feedforward and backpropagation algorithm.
Having measured the errors and deviations in the predicted values, we must adjust the weights and biases appropriately using the derivative of the loss function with respect to them, which gives the slope of the function in the sense of calculus.
Gradient Descent Algorithm
If we know the value of the derivative, we can update the weights and biases by raising or lowering them accordingly. However, we cannot compute the derivative of the loss function directly with respect to the weights and biases, because they do not appear explicitly in the loss equation, so we need the chain rule to reach the solution.
This mathematical expression may look somewhat complicated, but it is the only way to reach the correct solution. For simplicity, we have shown the partial derivative for a single-layer Neural Network. Once we have this result, we can add the backpropagation step to the Python code for our case.
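The original Python listing is not reproduced in this copy. The sketch below is a minimal, illustrative version of such a backpropagation step for a two-layer network with a sum-of-squares loss; the function and variable names are assumptions, not the article's original code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    # derivative of the sigmoid expressed in terms of its output s = sigmoid(x)
    return s * (1.0 - s)

def backprop_step(x, y, weights1, weights2, learning_rate=1.0):
    # One gradient-descent update for a two-layer network with sum-of-squares loss.
    # The chain rule gives the derivative of the loss with respect to each weight
    # matrix, even though the weights do not appear directly in the loss expression.

    # forward pass (biases assumed zero, as in the simplified case above)
    layer1 = sigmoid(np.dot(x, weights1))
    output = sigmoid(np.dot(layer1, weights2))

    # backward pass: chain rule applied layer by layer
    d_output = 2 * (y - output) * sigmoid_derivative(output)
    d_weights2 = np.dot(layer1.T, d_output)
    d_weights1 = np.dot(x.T, np.dot(d_output, weights2.T) * sigmoid_derivative(layer1))

    # move the weights along the slope of the loss
    weights1 = weights1 + learning_rate * d_weights1
    weights2 = weights2 + learning_rate * d_weights2
    return weights1, weights2

Each call performs one forward pass and one chain-rule-based weight update; the 1,500-iteration experiment described above simply repeats this step in a loop.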
The following video tutorial by 3Blue1Brown gives a detailed explanation of the chain rule in backpropagation and how calculus is applied.
In the previous article, we talked about the concept of a Neural Network and how adjusting its weights and biases to get more accurate results depends largely on defining a Loss function.
The loss is measured with the sum-of-squares error, a statistical measure of how far the data set's predictions deviate from the real values.
The sum-of-squares error adds up the differences between the predicted values and the real values, squaring each difference so that only its magnitude, not its sign, counts.
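The equation graphic is not reproduced in this copy; the standard sum-of-squares error over n predictions, which is what the text describes, is:

\mathrm{SSE} \;=\; \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}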
Through this measure we can find values for the weights and biases that keep the loss as small as possible, avoiding the problems that would otherwise prevent us from reaching correct results.
The general idea of this system can be summarized by analogy with the brain: a mathematical mechanism in which several inputs determine the structure of the required outputs.
Accordingly, we can define the components of a Neural Network:
• an input layer x
• an arbitrary number of hidden layers
• an output layer ŷ
• a set of weights w and biases b between the layers
• an activation function for the hidden layers, here the sigmoid σ
When counting the number of layers in a Neural Network, the input layer is usually ignored, as shown in this two-layer Neural Network architecture diagram:
It is easy to create a Neural Network in Python:
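The original code listing does not survive in this copy; a minimal sketch of what such a class might look like (the layer size and names are illustrative assumptions) is:

import numpy as np

class NeuralNetwork:
    # A minimal two-layer network: one hidden layer and one output layer.
    def __init__(self, x, y):
        self.input = x                                          # input layer
        self.weights1 = np.random.rand(self.input.shape[1], 4)  # input -> hidden (4 units, an arbitrary choice)
        self.weights2 = np.random.rand(4, 1)                    # hidden -> output
        self.y = y                                              # true values
        self.output = np.zeros(y.shape)                         # predictions y-hat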
Neural Network Training:
The value of ŷ for a simple two-layer Neural Network is derived by the following equation:
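(The equation image is not reproduced here; for a two-layer network with sigmoid activation σ, weights W1 and W2, and biases b1 and b2, the standard form is:)

\hat{y} \;=\; \sigma\!\left(W_2\,\sigma\!\left(W_1 x + b_1\right) + b_2\right)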
It is clear from this equation that the weights w and the biases b are the only variables affecting the output ŷ; carefully tuning their values is what determines the quality of the predictions, and this tuning process is what we call training the Neural Network.
We can divide each iteration of the training process into the following stages:
โข The stage of calculating the value of the outputs, defined as: feedforward
โข The stage of updating the values โโof w and b, defined as: backpropagation
This is what the sequence diagram shows:
feedforward
The diagram above shows that feedforward is just a simple calculation; the output of a two-layer Neural Network is:
Adding the basic feedforward step to the Python code for our case, and assuming for simplicity that all biases are zero, we get:
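The original listing is not preserved here; a self-contained sketch of such a feedforward pass (names and shapes are illustrative assumptions) is:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, weights1, weights2):
    # Forward pass of a two-layer network, with all biases assumed to be zero.
    layer1 = sigmoid(np.dot(x, weights1))        # hidden layer activations
    output = sigmoid(np.dot(layer1, weights2))   # predicted value y-hat
    return output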
However, we still need a way to judge how good our predictions are. That is what the Loss function provides, and we will look at it in the next article.
This is a free, open source ETL-style tool that makes data integration simpler and more effective: its job is to organize and coordinate raw, unstructured information and transform it into data that is ready for practical analysis. It makes it easy to develop and manage its applications, and it includes a central data repository with metadata handling, which makes it well suited to running all kinds of analysis with high efficiency and accuracy.
This is a free, open source NoSQL database and an ideal tool for analyzing big data at scale. Its design helps avoid errors during the analysis process, which leads to accurate and more effective results.
The main features of this tool are summarized in the following points:
โข Its properties are somewhat similar to SQL, including the query language.
โข Provides a wide display area, especially for writing operations.
โข The ability to spread securely because it is not restricted to a central server.
โข Easy system of data.
โข The ability to replicate patterns and the flexibility of modification and coordination.
This tool handles ETL workloads and many types of data, since it is built around processing the core database set. It offers strong security and flexible data transformation, and it includes a REST API. All of these features and capabilities make Xplenty a platform that gives big data analysts high efficiency and complete flexibility.
This is one of the most important database management systems: an open source, column-oriented analytics tool developed at Yandex that lets its users run analytical queries over large, well-organized data sets in a short time.
It is one of the distinguished tools for working with big data and is preferred by many analysts for general analytical workloads alongside tools such as Presto, Spark, and Impala. It handles column-oriented databases with flexible control over primary keys and procedures for deleting unnecessary data, as is the case in InfluxDB.
ClickHouse is based on its own dialect of SQL and includes many extensions: advanced formatting functions, data models, nested data structures, URL-handling functions, probabilistic algorithms, various mechanisms for working with dictionaries, schemas for ingesting from Apache Kafka, aggregation functions, saved visualizations with their formatting, and much more.
Airflow is an effective tool for building and maturing analysis pipelines, since Airflow workflows are defined as Python code.
5. Apache Parquet
Apache Parquet is a column-oriented storage format for big data, designed for the Hadoop ecosystem. It stores data in compressed form and applies new encodings at the column level as they become useful. Parquet is a popular choice among big data analysts and is used with Spark, Kafka, and Hadoop.
Spark is an open source tool that is highly efficient at analyzing big data thanks to its distributed, in-memory computing model, which speeds up processing and gives faster, more effective results.
Spark is a natural environment for many big data professionals, including giant companies such as eBay, Yahoo, and Amazon, because it provides many functions used in analysis, such as iterative algorithms and stream processing. It builds on the Hadoop ecosystem as a more advanced successor to MapReduce.
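As a rough illustration only, not from the original article, here is what a small PySpark job might look like; the file name and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# start (or reuse) a local Spark session
spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

# read a (hypothetical) CSV file into a distributed DataFrame
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# a simple aggregation executed in parallel across the cluster or local cores
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()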
Superset is a data visualization technology built with the help of a group of other components. It is well suited to designing dashboards, and it supports user authentication through OAuth, OpenID, or LDAP. It works with most SQL-speaking data sources and integrates fully with Apache ECharts.
Many large companies such as Netflix, Airbnb, Twitter, and Lyft rely on Superset to analyze their products, and it is also used alongside MediaWiki.
Hadoop is an integrated, free and open source set of programs, programming interfaces, and supporting technologies specialized in dealing with big data.
This framework consists of four main modules:
YARN, the technology dedicated to resource management and job scheduling.
HDFS, a distributed file system built to run on commodity hardware.
Hadoop Common, the shared libraries that allow the other modules to work with HDFS.
MapReduce, a programming model for parallel computation originally introduced by Google.
There are many tools for big data analysis from different software vendors such as Microsoft, IBM, and Oracle, and they are widely used by analysts of this type of data, especially the open source programs that the largest companies rely on to analyze their products. There are also free tools, such as Apache Hadoop, that come out of the Apache ecosystem.
In upcoming articles, we will discuss each of these big data analysis tools separately.
Recently, with the advance of science and technology, many questions have been raised about techniques for dealing with big data. Through them we can predict customer behavior, manage resources, expand sales, head off the emergencies that hinder any business, and control fraud, in addition to making the daily transactions of many people more flexible and easy.
The term "big data" refers to data sets made up of huge numbers of rows of raw, loosely related records about a topic, and to the techniques that can handle many queries on such data at the same time.
Several years ago, big data was not considered important enough, to the point that even some data science professionals lacked a clear understanding of how to deal with the structure of this type of data.
Big data is about more than the data itself:
The concept of big data is not limited to the data itself; it extends to the strategies for dealing with that data. The core aim is to find an effective mechanism for processing the random mass of information generated by the activity of any government agency or commercial company, whatever its volume, so that technicians and specialists can find the best ways to organize it and convert it into useful data that removes the obstacles to the smooth functioning of that activity.
Moreover, under this newer understanding of big data, it is seen as the best way to move beyond the traditional pattern of relationships and transactions toward machine learning techniques and their branches, which is why big data technicians and specialists now receive more attention and support than programming specialists and data scientists in general. Dealing with this large amount of data of all kinds leads to accurate and effective analysis, and to the right strategies for investing time and effort at the lowest cost, in the service of the commercial or industrial activity of major international companies.
As a working example, consider a company planning large-scale advertising campaigns, or a company planning to evaluate its sales. The best way to implement these strategies, which fall under the name of business intelligence, is to use big data as the model solution, because of the more accurate and professional techniques this type of analysis provides.
Dealing with big data happens through several steps, and data preparation is one of the most important foundations of the analysis process; it consumes the most time of the whole integrated data analysis pipeline.
Data collection:
The data is collected as a first stage by dedicated tools from multiple sources and then stored in a file in its original form, without any changes to its properties, because any change or transformation of the information costs it some of its features and thus reduces the quality of the analysis.
Data selection:
To explain the concept of data selection, consider an illustrative example: a promotional plan offered to customers for SIM products to be sold before the start of the school season, based on analysis of the previous year's sales alongside forecasts that take the surrounding developments and variables into account.
Here comes the role of data analysts in identifying the subsets of the overall data set that can be relied on to produce good results.
Cleaning the raw data:
This step includes filtering and processing data that is unstructured, badly formatted, or contains errors, eliminating any duplicates, and shaping it into the useful, required form.
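As an illustrative sketch, not part of the original article, the cleaning step might look like this in Python with pandas; the file and column names are hypothetical:

import pandas as pd

# load the raw, unmodified data collected in the first stage (hypothetical file)
raw = pd.read_csv("raw_data.csv")

# remove exact duplicate rows
clean = raw.drop_duplicates()

# drop rows whose key fields are missing, and normalise an inconsistent text column
clean = clean.dropna(subset=["customer_id", "amount"])
clean["city"] = clean["city"].str.strip().str.title()

# keep the cleaned copy separate so the original stays untouched
clean.to_csv("clean_data.csv", index=False)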
Data Enhancement and Integration:
Data is supplemented from local or other data sources (databases or information systems) and aggregated when calculating new values. For example, a games company might collect and analyze the documents its games produce to gain insight into usage behavior and customer preferences, so that it can plan new features that increase the likelihood of sales and drive the growth of its business.
Data formatting:
Sometimes the data needs to be reformatted without modifying its values, such as sorting it under a specific numbering and encoding, shortening long terms, and removing unnecessary punctuation marks in text cells.
Activating the predictive features:
At this stage, derived features are built and fed into the machine learning pipeline, where they are used to raise the efficiency of the learning algorithm and are then consumed by the predictive models.
Creating an analytic model:
Since a model is a way of seeing the data, this step means creating an analytical model to predict the required variable. Classification, for example, means grouping items with similar characteristics into subgroups according to certain criteria.
To make this concrete, we can segment customers based on their behavior (sports enthusiasts, vegetarians, and so on) using tools designed for this purpose (such as IBM SPSS) on top of the underlying databases.
In practice, models with machine learning components are used to project current analyses into the future so they can be compared with reality and with other samples.
In general, this type of analysis requires analysts to devise a different approach for each data set, because the data is often chaotic, having never been well organized and coordinated. Cluster analysis and machine learning therefore have to work with the variables created by the existing situation, and analysts invent new approaches by writing more effective code, which also contributes to finding and fixing errors.
As a final step, dashboards and charts can be built, because by this point the data has been reduced to an amount small enough for graphic representation.
The ideal tool for text finding and sentiment analysis
Experience Level : Beginner to Intermediate
This tool is designed specifically for qualitative data such as interviews, open-ended survey questions, and comments on social media. It also makes it possible to perform complex tasks such as sentiment analysis, especially for people with no experience of programming.
ATLAS.ti program has a number of features, including:
โข Sentiment Analysis
โข wordlist
โข Word cloud
โข Synonyms
โข Entity recognition
โข Display Features
โข Find texts
โข Sorting by name, adjective, and others.
This tool lets its users upload images and videos for multimedia analysis, and it works fully with geo-data and maps.
Its main drawback is that sentiment analysis is available in only four languages: German, English, Spanish, and Portuguese. Its monthly subscription also starts at $35 for non-commercial use.
Some examples of how ATLAS.ti is used:
1- Observing the feelings of participants in a social experiment after they watch videos: they express those feelings in writing or drawing, which captures their impressions, behavior, and how much the viewing affected them. ATLAS.ti is the most appropriate tool for carrying out this task well.
2- Organizing data: this tool makes it easy to search texts without resorting to programming, which suits people who would rather not work in Python; for example, if you need to find certain passages in the transcripts of a series of interviews you have previously conducted.
We conclude from the above:
Data analysts usually prefer to use several tools, each with a specific task within an integrated data analysis workflow, depending on the kind of data involved. Some analysts need Excel, ATLAS.ti, and SPSS for data analysis in the social sciences; others need Excel, Polymer Search, and Akkio, as is the case for digital marketers. What makes working with all these tools easier is the availability of free trial versions, for when an analyst cannot yet tell exactly which tool suits a given kind of data.
Ideal for creating interactive graphs, dashboards, and data processing.
Experience Level: Beginner to Intermediate
Power BI is an alternative to Tableau, with the advantage that its BI suite offers some of the widest options for data visualization and charting.
Writing code is not required, but it offers the relatively powerful DAX language for users who prefer to code.
In addition, it is flexible for data processing and cleaning, integrates easily with other Microsoft products, and works with R and Python for building models.
A Power BI subscription starts at $9.99 per month, so it is less expensive than comparable programs.
We can conclude that both Tableau and Power BI are suitable for business intelligence, but Power BI stands out as better for data processing and less costly, as mentioned above.
The ideal tool for creating dashboards, interactive charts, and master data cleaning
Experience Level : Beginner to Intermediate.
Tableau is the perfect choice for designing elegant, polished infographics, and it can create information dashboards without any need to write code. It lets data analysts share that data with people who have little experience with technology, and its interactive dashboards let them follow the information easily and completely.
Its disadvantages are that, despite its analysis abilities, it is not efficient at processing messy data that needs thorough cleaning (Python and R are usually the better options for that task), and it mostly targets large companies, with prices starting from 70 dollars per month.
To illustrate Tableau in practice: every data analyst, and data scientists in general, must send results and reports to executives, and those reports need to be attractive, interactive, customizable, and easy for others to access. With its BI features you can create charts and visualizations, easily join multiple tables, and drill into and analyze data with complete flexibility by dragging and dropping.
Tableau saves a lot of time and effort when creating interactive dashboards, avoiding the complex programming and lost time of Matplotlib / Seaborn / Plotly while still getting accurate results quickly.
Optimum tool for linear and logistic regression, cluster analysis, t-tests, MANOVA, ANOVA
Experience Level: Intermediate
This tool is used by professionals in the social sciences and education, as well as in government, retail, and market research, and it works mainly through a point-and-click interface.
The advantage of SPSS is that it handles a variety of data types with a variety of regression types and statistical tests, so it expects its users to be familiar with detailed hypothesis-testing statistics such as ANOVAs and MANOVAs.
The main disadvantage of SPSS is its high cost, starting at $99 per month.
Practical examples of using SPSS:
Comparing sample groups:
Suppose a researcher in psychology or sociology is conducting a study that requires comparing samples drawn from certain segments of a society. You would have two groups, an experimental group and a control group, and the t-test tells you whether there is a statistically significant difference between them, judged against the p-value threshold you specify.
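SPSS runs this through menus; purely for illustration, the same comparison can be sketched in Python with scipy (the numbers below are invented):

from scipy import stats

# made-up scores for an experimental group and a control group
experimental = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7]
control      = [10.2, 11.1, 10.8, 11.5, 10.9, 11.3]

# independent two-sample t-test
t_stat, p_value = stats.ttest_ind(experimental, control)

# compare the p-value with the significance level chosen in advance (e.g. 0.05)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("statistically significant difference" if p_value < 0.05 else "no significant difference")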
Multivariate analysis:
This type of analysis looks at the differences between groups across several variables at the same time. In our previous example, studying a particular segment gives more accurate results when age, ethnicity, and similar differences are taken into account.
We conclude that SPSS users prefer it because their research leans on the statistical evidence drawn from data analysis in their own specialties. It sits in the middle ground between beginner tools such as Polymer Search and Excel and the more advanced programming languages such as Python and R.
The ideal tool in advanced academic statistical analysis, big data and machine learning
Experience Level: Advanced
R stands out for its efficiency in very advanced statistical analysis (academic-level work), especially exploratory data analysis (EDA), and this is where it beats Python, even though the two have nearly the same functional range, particularly for processing large data sets.
R is designed to perform advanced statistical analysis with high accuracy, and tools built for a specific purpose usually perform those tasks more accurately than general-purpose tools.
Compared with Python, running a common statistical analysis in R is simple and direct. In Python the same task means finding the right library, learning how it works, and then writing code, spending time and effort on steps you simply do not need in R.
In the end, R and Python share almost the same functional range. Python has the advantage when it comes to building production applications, but it is not as efficient as R for advanced academic statistics.
Optimized tool: for dealing with big data, machine learning, automation and application development.
Experience Level: Advanced
Python is the most widely used programming language among data scientists and analysts because it is open source, has a huge range of libraries, performs well, and its reference implementation is written in C, which means low-level processing of bytes and bits that would otherwise take a long time can be done faster and more easily.
As mentioned earlier, Python is open source and its ecosystem contains on the order of 200,000 packages, including data analysis packages such as Plotly, Seaborn, and Matplotlib; you can find libraries for practically any area of data analysis.
Key Features of Python in Data Analysis, Machine Learning, and Automation:
โข Great ease in dealing with small data and in performing complex calculations.
โข Super speed in the processing of huge data.
โข Save a lot of time in automating information.
Despite all these advantages that Python has, it is not without some drawbacks, most notably its ineffectiveness for mobile applications on the one hand, and its learning period to serve the purpose for which it is used, which is considered long compared to other tools on the other hand.
Python application examples:
Automation: you can analyze several groups of data using tools such as Excel, but that takes a great deal of time and effort, since each group has to be analyzed manually and separately. Analyzing the same groups with Python is more flexible and faster, and roughly 15 lines of code can accomplish the task properly.
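A rough sketch of such a short script (the folder and column names are hypothetical):

import glob
import pandas as pd

summaries = []
# analyse every CSV file in a (hypothetical) folder instead of opening each one in Excel
for path in glob.glob("data/*.csv"):
    df = pd.read_csv(path)
    summaries.append({
        "file": path,
        "rows": len(df),
        "total_sales": df["sales"].sum(),     # assumes a 'sales' column
        "average_sale": df["sales"].mean(),
    })

# one combined report for all groups
report = pd.DataFrame(summaries)
print(report)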
Cleaning data: if, for example, you have lost sponsored links from a TV show's data, you can recover them in two stages: first write code to detect the missing links, then write code to restore them.
Exploratory data analysis:
You can understand the distribution of your data and visualize it by building an interactive profile of the data set with a few lines of code, in a short time, using the Python module Pandas Profiling.
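A minimal example of the Pandas Profiling module mentioned above; the data file is hypothetical, and newer releases of the package ship under the name ydata-profiling:

import pandas as pd
from pandas_profiling import ProfileReport  # newer versions: from ydata_profiling import ProfileReport

df = pd.read_csv("customers.csv")           # hypothetical data set

# build an interactive exploratory report: distributions, correlations, missing values
profile = ProfileReport(df, title="Exploratory Data Analysis")
profile.to_file("eda_report.html")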
These characteristics are what have made Python the language most widely used and wanted by data analysts, and the best fit for data science and its related fields.
The ultimate tool for checking and processing big data
Experience Level: Intermediate
SQL is a programming language for querying and manipulating data.
It performs many of the same tasks as Excel, but it is far better at dealing with large data sets, which cuts data processing time dramatically compared to Excel, and it can store data in compact files.
The one area where Excel still beats SQL is how easy it is to learn and to handle everyday tasks.
The main job of SQL is editing and querying big data.
For example, if you have a very large number of posts on Instagram and want to edit or sort those posts, SQL lets you do it with simple statements and straightforward steps.
It also plays an effective role in joining data sets together: you can rely on SQL to combine several spreadsheet files, each containing a number of fields, into one result smoothly and flexibly, avoiding the complications, difficulty, and lost time that the same task costs you in Excel.
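To keep the examples in Python, the sketch below runs that kind of SQL join through the standard sqlite3 module; the tables and columns are invented for illustration:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# two small, invented tables standing in for separate spreadsheet files
cur.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Amal"), (2, "Omar")])
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (1, 35.5), (2, 80.0)])

# the join itself: combine the two data sets and aggregate in one statement
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())   # [('Amal', 155.5), ('Omar', 80.0)]
con.close()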
The ideal tool for predictive analytics, sales and marketing
The principle of this tool is artificial intelligence. After you upload your data to Akkio, you select the variable you want to predict, and Akkio builds a neural network around that variable, using 80% of the data for training and 20% for validation.
The most important thing about Akkio is that it is not limited to prediction: it also classifies the results accurately, and with a few clicks you can publish the model as a web application.
Its disadvantages are that it only handles tabular data, it does not support image or audio files, and its price starts at $50 per month.
To illustrate how this tool can serve your business, suppose you run an online store and email promotions: you could use Akkio to build models that forecast which customers will buy. It is a good tool for users who do not have the technical experience to get started with predictive analytics any other way.
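Akkio handles that split automatically; as an illustration of the same 80/20 principle in plain Python with scikit-learn (the data set and column names are hypothetical, and this is not Akkio's API):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("leads.csv")                       # hypothetical data set
X = df.drop(columns=["converted"])                  # predictor columns (assumed numeric)
y = df["converted"]                                 # the variable we want to predict

# 80% of the rows for training, 20% held back for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))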
This is one of the easiest data analysis tools to use, yet one of the best for analyzing and presenting sales and marketing data, and it is practical for business intelligence work.
Experience Level: Beginner
You enter your data on the tool's website, and it converts that data into an interactive web application in which you can perform a set of analysis tasks, including:
1- Interactive pivot tables:
You can ask questions about the data quickly and smoothly, and sort the input and output data by clicking through the instructions for the operation.
2- Automatic explanation:
The tool presents several suggestions about the data to help you get the best results, such as proposed summaries and highlighted anomalies; for example, it can suggest the best options for an effective digital marketing strategy based on data about budget size, target group, and so on.
3- Interactive visualizations:
This tool lets its users explore ideas and patterns in the data and create interactive dashboards, offering many chart types: bar, bubble, scatter, and heat maps. Because the visualizations are interactive, working with them is easier and more precise, especially when sorting and filtering the data.
Among its strengths are automatic discovery and splitting of matrices, along with features and analysis techniques that other programs such as Excel cannot perform.
Despite all these advantages, the tool has some drawbacks, most notably that it cannot handle very large data sets, it loses accuracy on more complex analysis, and it does not offer many types of charts and graphs.
As mentioned above, this tool gives digital marketers several advantages; for example, it can act on your behalf in running Facebook ads or any PPC campaign and finding the most effective audience groups.
After you enter the data from your search for the target audience, it sorts the results from best to worst, reviews them for you, and provides information on the secondary values produced by that search.
Business intelligence: Polymer turns your spreadsheet into a dashboard where you can create infographics and share them with executives or clients via a URL that is easy and flexible to access.
In short, this tool is optimal for non-technical people and beginners because of the features and techniques it offers, and the efficiency with which it accomplishes the required tasks.
Excel is ideal for producing graphs and charts and for analyzing and storing data; if it is not available in your workplace, Google Sheets can stand in for it because the two are very similar. Excel lets you create charts and graphs very smoothly, offering several chart types, including pie charts, box plots, bar charts, scatter charts, and more, so both beginners and intermediate users can work with the program and take advantage of its capabilities.
You can also customize the colors according to what your work requires, in addition to various options such as controlling the size to display the results on the web with appropriate accuracy.
Excel provides a set of capabilities and techniques that allow you to control and change the data as you want, which makes Excel the ideal program for data analysis through several utilities, filters and mathematical operations within the program.
You can get even more out of the program by learning its native programming language, VBA, which is recommended for anyone who spends a long time working in Excel.
However, Excel's main problem is that it cannot process and analyze very large or more complex data sets, and it is not the preferred choice for statistical analysis.
Here are samples that illustrate some of the features of working on Excel:
Data manipulation:
Excel contains several options for dealing with specific parts of the data, such as deleting specific positions of characters in each cell, or dividing a column into several columns, and so on.
Calculations:
Let's say you have e-commerce data about sales of certain products and you constantly need calculations on it; Excel gives you several options for performing those calculations.
Bivariate analysis:
Excel is helpful here because it offers all the chart types you need to analyze structured univariate or bivariate data.
Pivot Tables:
This tool enables you to create quick and easy pivot tables to get answers to common questions.
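For comparison, and purely as an illustration outside Excel, the same idea in Python with pandas (invented example data):

import pandas as pd

# a small, invented sales table
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [100, 150, 80, 120, 60],
})

# the pandas equivalent of an Excel pivot table: total amount per region and product
pivot = pd.pivot_table(sales, values="amount", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)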
In conclusion, Excel is an important tool, with techniques and options that help data analysts carry out everything their projects require.
Data analysis tools are all designed to serve the purpose they are used for, but choosing the most appropriate one is difficult because some tools have very similar capabilities. So we will discuss how to select the best tool for the type of analysis you are doing, covering the tools most commonly used by beginners and professionals in data science.
Factors that help you reach the selection of the appropriate tool for data analysis:
โข Determining the budget and the size of the work cadre in your company.
โข Knowing the volume of data that we will analyze.
โข Knowing the type of data to be analyzed and whether it needs classification or not.
โข Knowing if the analysis require certain types of perceptions?
โข Determining the function of the entity we are dealing with.
Content list:
We will mention the best and most appropriate option for each of the analysis techniques:
โข For sciences and academia: SPSS
โข For qualitative data analysis: ATLAS.ti
โข To query big data: SQL
โข To create graphs and edit data: Excel
โข For non-technical users: Polymer Search
โข To build prediction models: Akkio
โข For automation and machine learning: Python
โข For advanced statistical analysis: R
โข Reporting and Intelligence Techniques: Tableau
โข A cheaper alternative to intelligence techniques: Power BI
We will discuss later each of these options separately.
Digital marketing is a key factor in the success of any company, especially for small business owners. Its importance comes from several things: it brings in more potential customers through the wide reach of the sites covered by digital advertising campaigns; it lets you choose an audience already interested in the product or service; and its costs are low compared to traditional advertising. The features and techniques that social media provides let an advertiser reach the goals of a campaign, if they are used well, and digital channels act as the link between advertiser and customer, including search engines, content, e-mail, video clips, illustrations, advertising text messages, and e-books.
Anyone running a small business should therefore adopt digital marketing to promote their products or services, and beyond that should keep developing how they buy, sell, and deal with customers to keep pace with the worldwide development of the electronic market.
The importance of digital marketing lies in several points, which we will list with illustrative examples, including:
Optimizing content published on the Internet:
The type and form of the advertised content play a major role in attracting people to it, so the advertiser must choose content that fits the purpose, whether advertising copy, illustrations, expressive images, or videos, and must produce attractive content that communicates the idea of the advertised product in an innovative way: an advertising phrase set beside an eye-catching picture or creative design, or a video that discusses the most important issues or problems of a particular group of people and helps that group find appropriate solutions, and so on.
The role of social networking sites in digital marketing:
Because so many Internet users turn to social networking sites, especially when looking for specific products or services, these platforms provide services and technologies that help users find what they are looking for faster, by showing posts and advertisements matched to the interests users declared when they signed up. In turn, this is an opportunity for the advertiser to make the best use of those technologies to promote products or services and reach the largest possible number of customers.
Social media users have different goals: some use the platforms for entertainment or to gather news and information, some to showcase their achievements and work, others to learn, and some to promote their products and services. Whatever the goal, these platforms keep developing their systems to attract as many subscribers as possible using the latest technologies.
Through those systems, the developers and operators of social media work to give subscribers comfort and safety and everything that helps them achieve their goals, whatever those goals may be. Marketers know very well how important these platforms are for reaching their own goals, and they understand that interaction between the advertiser and the potential customer is essential, both for building trust between the two parties and for learning the strengths and weaknesses of a campaign from followers' comments and how much they engage with the advertising material.
Developing methods to reach customers:
Many Internet users around the world rely on modern search techniques to find information, products, and services online, most prominently through Google and Bing. These platforms created a new, advanced channel between advertiser and customer through search engines and effective methods for analyzing data and studying the stages of the promotional funnel. Their importance comes from the huge numbers of users who reach them from phones, computers, and tablets, which is why they now carry digital marketing campaigns for all kinds of activity: advertisements for tourist trips, scholarships, businesses, or any other service you offer in your online store or elsewhere. Search engine marketing is perhaps the best and fastest way to generate leads.
Digital Marketing Techniques:
Digital marketing techniques vary according to the advertised material, for example:
Google Analytics: this tool lets you analyze how your digital advertising earns money, monitor user activity in pay-per-click campaigns, track the flow of visitors to your website, and much more that you simply cannot see in traditional marketing campaigns.
Google Keyword Planner: it lets you plan around the keywords that users searching for a specific product actually type, which shows you your competitors and ways to outperform them. You can also discover new placements for promoting your products and services and, with them, earn more profit.
Rapid development of online communities:
Many marketers keep up continuous communication with their followers, as on a YouTube channel or a social media account, which creates a cohesive community around the work and the trade. Some people invest those communities and skills commercially, opening up ways to build a business and a successful, well-known brand for the audiences of these platforms, whether through a Facebook page, a blog, or a YouTube channel.
E-mail marketing is also an essential pillar of e-marketing. People and companies regularly check their inboxes for anything useful to them, and this is where the e-mail marketer comes in: presenting what they offer to those searching for value and quality in a particular service, by sending messages with valuable, useful content to the largest possible number of people and companies looking for the best services.
You must keep developing your digital marketing skills to ensure your business runs smoothly. Continued profit from your online business depends on your ability to manage social media efficiently, to market with content skillfully, to improve how your results appear in search engines, and to put all of that to the best use in the marketing process.
Learning skills:
No one acquires a skill and becomes highly qualified in a field without going through educational programs that lead to success and excellence, and without learning from failure before success. The same applies fully to your marketing project: you must learn everything that benefits your business from your own experience and the experience of others, drawing on experts and specialists in the field and making that an approach you follow while marketing your products or services.
Leadership skills:
Leadership means doing what your marketing plan requires, with the skills you pursue and within the possibilities offered by social media platforms and search engine technologies, whatever circumstances and obstacles you face. It also means not letting criticism damage your self-esteem or weaken your resolve; on the contrary, you turn it into a positive factor for correcting defects and shortcomings, so that together these qualities form a leadership personality that moves your business forward.
Money management:
The success of any business is not only about earning money and making profits; you must also know how to manage that money, saving it and investing it in developing and expanding your project, especially when working online. With good money management, as your profits increase you can allocate larger amounts to promotion, which helps your business grow further and faster.
Building an experienced team:
Building a team capable of helping you develop the digital marketing approach for your online business is an important factor in its success and long-term growth. That means selecting experienced people and employing each according to their specialization; combining those skills forms an integrated system of work that inevitably produces impressive results and raises the marketing process to the required level.
Decisive, quick decision making:
One of the most important marketing skills for the success and growth of an online business is speed in deciding how and when to run promotional campaigns, such as choosing the dates of offers and discounts on your products and the many other ideas that set you apart from competitors. Avoid hesitating over decisions that will develop your business and increase your profits.
Data analysis skills:
Business growth, and with it profit, depends to a large extent on your ability and skill in analyzing customer behavioral data and data about how the promotional process is going, and that analysis moves you further toward professionalism in digital marketing.
Time management skill:
Making the most of the time factor is one of the cornerstones of a smoothly running digital marketing process: know how to invest your time around the moments and events that match the content of your business.
What gives you a strong competitive position among your peers in the market, and ensures your business grows in record time, is persevering with your marketing methodology daily and continuously.
That requires several steps: set up the site, starting with choosing the domain name and hosting; define the type of business and the payment method your customers will use on the site; then follow a promotion plan suited to your products or services, using social media tools such as targeted Facebook advertising, pay-per-click, and content marketing strategies.
Using the Amazon platform:
Amazon is a widespread electronic marketplace. Once you use it to list your products, the people who run it handle the mechanics of the sales process in a way that works for both the seller and the buyer.
Self-hosted blog:
A blog is an online business that requires only a little money, backed by your skills in communicating with others; through it you can help people gain skills and share experience on a specific topic. A blog can earn money in several ways:
1- Promote services such as designing websites or offering graphics and designs for sale.
2- Affiliate marketing for another party that sells products, in which the blog plays the role of a commercial intermediary between seller and buyer in return for a commission.
3- Content-based advertising.
Offering services on freelance websites:
This method depends on displaying a set of samples of your work, whether graphics, designs, writing, or any service related to your marketing skills, in a profile you create on freelance websites and platforms specialized in publishing those services.
Having a marketable skill opens the way to starting an independent business from home, online, in line with your interests and experience, alongside a good financial return.
The more distinctive and professional the work you offer on these sites, the greater your chances of winning jobs there, because the quality of your work gives customers a positive image that encourages them to choose you amid the strong competition on these platforms.
Online lectures:
This method depends on running online training courses within your specializations and skills. If you are skilled at drawing, for example, you can run a series of online lessons explaining the principles and techniques of drawing, attracting a large number of followers interested in learning to draw.
The same applies to any other skill you choose to publish through specialized platforms or on your own website; with the right promotion, and quality in explanation and delivery, it will bring you appropriate profits.
Creating a YouTube channel:
YouTube is a suitable, fertile environment for offering and publishing your services to the largest possible group of people by creating your own channel. By launching an educational project in a specific field, for example, you film videos that attract followers, so that you can then earn from advertising and marketing on that channel.
Creating and developing mobile and web applications:
You can benefit from learning programming and acquiring skills in building mobile and web applications, which gives you several options for choosing the type of applications to build in line with market requirements and needs.
In this article, we will list the 6 most important options and techniques for starting any online business:
1- Rely on your own skills:
The skills you possess in any field, and the way you use them, are your weapon for moving toward a successful online business. Investing your strengths in drawing, cooking, selling, marketing, or even creating videos makes you useful in the eyes of followers and a magnet for people searching for the services you provide. All of these factors are your real capital when starting a successful project online.
2- Define your goals and make them your top priority:
Determining the main goals of working online is a major step that puts you on the right path in your career. Earning money is one of the most important goals for anyone doing business anywhere, at any time, but it does not rule out the desire to create a well-known, reputable brand that builds good business relationships with customers and makes qualified people want to join your team. Together these goals give you strong motivation to raise your business to a high position in the crowded online market.
3- Make a valuable transition online:
Moving into the electronic market is one of the important choices in building an integrated business and contributes greatly to increasing profits. It opens the way to spreading your products and services across a wider geographical area, and so to a wider circle of customers, whether by creating your own online store to display your goods or services or by listing them on other online sales platforms.
For example, a teacher who gives lessons to students at home can teach over the Internet instead, which increases the number of students and increases profits.
4- Make your hobbies and interests a profession:
If you are good at playing music or composing poems, or your hobby is cooking, drawing, or acting, then investing one of those hobbies and turning it into your work is the best way to start an online business, provided you back your interests with creativity and use all your skills to outdo your peers. That is the first step on the road to success; the opposite is also true, because trying to start an online business unrelated to your studies or interests is an attempt that tends to fail.
For example, a person who knows nothing about sports cannot build a business selling sports equipment or publishing exercise tutorials online, whereas a person with real talent and skill in drawing can start an online business by using that talent correctly, which is the best strategy for success on the Internet.
5- Start with little money:
There is no doubt that putting some money into your online project raises its chances of success, and the best use of that money is promotional and digital marketing campaigns that help the project grow, keep going, and spread: a good budget helps the business develop faster. But if money is not readily available, the business can still be built on the experience and skills you already have.
6- Allocate your working hours:
One strategy for starting an online business is to allocate working hours in proportion to the income. In other words, if you already work at a company and want to start an online business, part-time is the best option. As your online business expands and your profits increase, you can choose whether to leave the job or keep working full time with the help of a representative who has the skills and experience required to handle the work.
Next time we will review ideas for online businesses.
Drawing up a correct digital marketing strategy is one of the most important pillars of a sound promotional process. Choosing the right plan means making the best use of the features that every digital marketing channel offers, which will bring you great benefit and raise your business to the desired level.
If, on the contrary, the marketing plan is not professional enough, its chances of success shrink and the campaign is doomed to fail.
There is no doubt that most companies that are very popular and enjoy an excellent reputation owe that popularity and success to a well-chosen, strong marketing approach through which their business keeps growing.
What is the best strategy for e-marketing success?
Skill in using the available methods and tools to build business relationships with customers, and in handling product pricing and distribution correctly, forms the basic structure of a successful digital marketing strategy. That is what this article covers, so that you can reach professionalism as an advertiser who can lead social networking channels and employ them in the service of your advertising project.
In explaining digital marketing strategy, we will use the example of marketing through the Facebook platform, which plays a major role in spreading advertising content on a large scale.
Facebook offers many advertising tools that allow anyone to create a business page. A sensible plan is to use that page to pick the target audience and the type of posts that interest them; this is an exemplary step toward a successful promotion on Facebook. The features available on the page also let you link other accounts to the Facebook account, such as a website, which helps the audience find you easily, brings more visits, and so increases the opportunity for the business to grow.
A content marketing strategy also plays an effective role in business growth. For example, a toy store might publish a blog post about games that enhance children's creativity. Choosing an appropriate title that appears in the first search results for people looking for such games is an effective content marketing tactic. Sharing the post on social media can also attract more visitors to your product website and encourage them to sign up for emails about future activities.
Another strategy that contributes to business growth is Google My Business. This tool helps people easily find the location of a business, since the platform lets you classify the business by service lists, categories, and hours of operation on online maps such as Google Maps, and then includes the business in search listings so that searchers see specific products and services.
Finally, the most important way to increase visitor numbers is choosing appropriate promotional keywords that are used frequently and will appear in search engines for the huge number of Internet users looking for a specific business. For instance, if you sell women's clothing and dresses, then as the festive seasons approach, a title such as "Best Evening Dresses" appearing in Google ads can be very useful in attracting more people to your website.
Digital marketing is divided into eight main sections: social media marketing, email marketing, search engine marketing, pay-per-click advertising, content marketing, affiliate marketing, mobile marketing, and marketing analytics.
Relying on digital marketing has contributed greatly to the development of the marketing approach of companies and institutions, since it makes it possible to choose advertising designs and to control the quality of the target audience according to their interest in the advertised product.
Types of digital marketing:
Search Engine Optimization (SEO)
Having the site that contains the advertised products appear in the first Google search results is an important factor in reaching the largest group of customers looking for those products. Succeeding at this requires promoters to select the words and phrases most frequently used by searchers and to include them in the advertising content. Designing the website in an attractive, organized way and choosing its links to other sites also play a major role in getting the desired results from search engine marketing.
So marketers working this way must be fully aware of how search engines, especially Google, work and must understand their algorithms by following these steps (a small sketch of the first point follows the list):
A good site structure helps search engines crawl it fully, including the advertising content; this is done through correctly formatted sitemaps, links, and URLs.
Images should be accompanied by alternative text, and videos and audio by transcripts, so that the search engine can read the content of the site clearly.
Choosing appropriate keywords and search terms in the advertising content, and using frequently searched phrases concisely, is one of the mainstays of improving the site's ranking in search engine results.
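As a minimal sketch of the sitemap point above, here is one way a small sitemap.xml could be generated; the page URLs and file name are hypothetical examples, and real sites usually rely on their CMS or an SEO plugin for this.

```python
# Minimal sitemap generator (sketch). The URLs below are hypothetical examples.
from xml.etree.ElementTree import Element, SubElement, ElementTree

pages = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/blog/creative-toys",
]

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    url = SubElement(urlset, "url")
    SubElement(url, "loc").text = page

# Write the sitemap so search engines can crawl the whole site structure.
ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```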
Social media marketing:
This includes all promotion on social media platforms, and it is not limited to creating sponsored posts and replying to comments. It also covers full coordination of the automation and scheduling features these platforms offer, and especially the continuous follow-up needed to keep the promotional process moving forward.
Those in charge of social media promotion must stay in contact with the wider marketing staff and coordinate with them fully, so that messages and all elements of the advertising program stay consistent and organized across all channels, whether online or through other means of promotion.
Continuous follow-up of ad traffic, by measuring and evaluating the audience's interaction with the advertising posts, also plays a major role in correcting any defect and in reinforcing the positive points when the promotional process is going as required.
Social media platforms give you many options that are not limited to Twitter and Instagram, but also include other areas such as:
Google My Business, eBay, Facebook Messenger, and Marketplace.
With all these advantages the platforms offer you as an advertiser, the success of the promotion still depends on the type and form of the posts, on whether they draw attention and attract onlookers, and on the way the advertising copy is written, which cannot be overlooked as an essential element of the campaign.
Pay-per-click (PPC):
PPC refers to ads that appear at the top of search results, while browsing web pages and mobile applications, or before YouTube clips. It is an effective way to push your promotion into the advanced ranks of search results, and it differs from other methods in that you pay the advertising cost only when someone clicks on your ad.
The cost per click varies with the popularity of the keywords: it rises when many people search for and compete over those words, and falls as their number decreases.
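As a simple illustration of the pay-per-click model, here is a minimal sketch that estimates campaign spend from clicks and cost per click; every number is made up for the example.

```python
# Rough PPC spend estimate (sketch). All numbers are hypothetical.
impressions = 10_000        # how many times the ad was shown
click_through_rate = 0.03   # 3% of viewers clicked
cost_per_click = 0.45       # price paid for each click, in dollars

clicks = impressions * click_through_rate
total_spend = clicks * cost_per_click
cost_per_thousand_impressions = total_spend / impressions * 1000

print(f"Clicks: {clicks:.0f}, spend: ${total_spend:.2f}, eCPM: ${cost_per_thousand_impressions:.2f}")
```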
Email Marketing:
Email marketing is one of the important branches of content marketing. Email marketing experts have an integrated view of how to deal with people professionally and flexibly, and they have the skill to follow and analyze customer interaction by tracking how many people received the email, the click rate, the open rate, and mail traffic in general. There are several things the author of email advertisements should keep in mind, including (a short sketch of these metrics follows the list):
* Your personal stamp acts as an identity for your mail among customers and visitors and distinguishes you from others.
* Convincing the audience of the urgent need for the advertised product, with a strong offer that expires within a short period, brings the site a large turnout.
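Here is a minimal sketch of the open-rate and click-rate calculations mentioned above; the campaign counts are invented for the example, and in practice an email service provider reports these figures for you.

```python
# Basic email campaign metrics (sketch). The counts are hypothetical.
emails_sent = 5_000
emails_opened = 1_150
links_clicked = 230

open_rate = emails_opened / emails_sent
click_rate = links_clicked / emails_sent
click_to_open_rate = links_clicked / emails_opened  # of those who opened, how many clicked

print(f"Open rate: {open_rate:.1%}")
print(f"Click rate: {click_rate:.1%}")
print(f"Click-to-open rate: {click_to_open_rate:.1%}")
```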
Content Marketing:
This is a long-term marketing plan in which advertisers present marketing content in the form of videos, audio, texts, and podcasts that continue over time to attract the largest possible segment of the audience.
Mobile Marketing:
This type of marketing reaches the target audience through smartphones and tablets via text messages, websites, email, and social media, offering promotions or advertising content at a specific time and place.
Statistics show that a large group of consumers now spends far more time in smartphone applications than watching advertisements on TV.
Marketing Analytics:
This discipline lets you track the progress of the marketing process in great detail: user behavior, number of visits, click rate, how many times email messages are opened, and many other measures. It also requires those in charge of marketing to absorb a huge amount of analytical information and handle it with a certain professionalism.
Understanding the implications of that analysis helps you build a specific strategy: knowing the strengths and employing them to support advertising campaigns and raise them to the best possible performance, and knowing the flaws and weaknesses so they can be remedied and avoided.
Marketers have a good number of techniques for assessing the effectiveness of marketing operations; perhaps the best known is Google Analytics, a tool widely used for marketing analytics.
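To make the idea concrete, here is a minimal sketch that aggregates hypothetical per-channel results into the kind of summary an analytics tool would report; the data and column names are assumptions made for the example.

```python
# Summarize hypothetical campaign results per channel (sketch).
import pandas as pd

data = pd.DataFrame({
    "channel": ["email", "social", "ppc", "email", "social", "ppc"],
    "visits":  [400, 900, 650, 380, 1_100, 700],
    "signups": [32, 45, 52, 28, 61, 49],
})

summary = data.groupby("channel").sum()
summary["conversion_rate"] = summary["signups"] / summary["visits"]

print(summary.sort_values("conversion_rate", ascending=False))
```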
Affiliate Marketing:
This type of marketing depends on a business relationship that connects your company with specialized partners who lead the promotion according to well-thought-out, highly effective plans. They involve their own audience in your company's promotional posts and videos to attract a larger number of customers and expand your business within a short period.
Many platforms have adopted commission-based promotion, such as TikTok, Instagram, and YouTube, and according to statistical studies the number of platforms adopting this type of digital marketing is expected to increase.
Finally:
In the end, digital marketing is one of the most important fields, and everyone involved in promotion should learn to master its skills.
Digital marketing is the process of promoting and selling products and services through electronic channels such as social media, search engines, email, and mobile applications.
Digital marketing spans both online and offline marketing. Online marketing is built on seven main pillars:
1. Content Marketing
2. Search Engine Optimization (SEO)
3. Social Media Marketing (SMM)
4. Affiliate Marketing
5. Using Search Engine Marketing (SEM)
6. Email Marketing
7. Pay-per-click (PPC) advertising
In this article, we will highlight offline marketing.
Offline marketing falls into four main categories: enhanced offline marketing (electronic marketing), radio marketing, television marketing, and telemarketing. We will address each of them separately.
Electronic Marketing (Enhanced Offline Marketing):
The billboard, especially the electronic billboard, is one marketing medium that does not depend on the Internet. The success of advertising this way depends mainly on placement: a billboard set in a spot crowded with pedestrians attracts a larger number of customers and returns more positive results to the advertiser.
Detailed explanations through pictures and illustrations, which attract a good number of visitors to electronic displays in stores, are also an important factor in this kind of marketing plan.
From the above, we conclude that the success of offline marketing depends on attracting the largest segment of people.
TV Marketing:
According to television marketing statistics, the number of television viewers in the United States still provides the mass base on which television marketing relies, especially given the high percentage of subscribers to multichannel services.
Radio Marketing:
Studies indicate that the number of radio listeners in the United States has grown significantly in recent years, and spending on radio advertising has risen in parallel.
Radio advertising methods have also developed: the host of a program may open the show by reading your advertisement and promoting your product.
So, with the help of search engines, choose stations whose audience matches your advertised products. For example, if you own a sportswear and equipment company, look for a radio program whose listeners are young people and athletes. As a radio advertiser, you must also choose an advertisement that entertains rather than bores the listener, since radio lacks the visual element that draws people most strongly.
Mobile Marketing:
The number of people connecting to the Internet from smartphones now far exceeds the number connecting from computers. This wide spread of smart devices has pushed spending on mobile advertising well above desktop advertising expenditure.
Call and text:
Telemarketing depends on calling a person and trying to sell a product directly, but this method has limited effectiveness compared with marketing through social media.
Text messages play a better role in marketing because they are available on all phones and widely used. Offering discounts is an important way for any advertiser to attract the largest segment of customers, and you can send special offers and gifts to customers who have opted in to text messages. The text message service also lets you remind customers of appointments, announce the launch of a specific product, and confirm pickup dates, for example.
Finally:
Despite the heavy reliance on the Internet in modern marketing, which no advertiser can do without, traditional methods that need no Internet connection still exist, still achieve good results, and still help expand the circle of potential customers, especially as traditional media themselves evolve to make the most of digital tools.
Programmers and developers show great interest in Python, given that it is one of the most important and popular programming languages in technology, especially in contemporary fields such as data science and artificial intelligence and its branches.
Therefore, it is worth looking at the top eight questions you are likely to face in a Python interview.
1- What do you know about interpreted languages?
Hiring staff usually start the interview with basic questions about Python and ask for a brief explanation of its core concepts, for example that Python is an interpreted language: the interpreter executes source code directly rather than requiring it to be compiled to machine code ahead of time.
2- What are the benefits of Python?
This is one of the main interview questions; it reveals your understanding of Python and of why companies are replacing other programming languages such as JavaScript, C++, and R with it.
3- List the common data types in Python
Interviewers are likely to ask about the basic types and concepts used constantly in Python, including numeric types, strings, booleans, lists, tuples, sets, dictionaries, and so on.
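A minimal sketch of the built-in types an interviewer usually expects you to name:

```python
# Common built-in Python data types (sketch).
an_int = 42                          # int: whole numbers
a_float = 3.14                       # float: decimal numbers
a_bool = True                        # bool: True / False
a_string = "hello"                   # str: text
a_list = [1, 2, 3]                   # list: ordered, mutable
a_tuple = (1, 2, 3)                  # tuple: ordered, immutable
a_set = {1, 2, 3}                    # set: unordered, unique items
a_dict = {"name": "Ada", "age": 36}  # dict: key-value pairs

for value in (an_int, a_float, a_bool, a_string, a_list, a_tuple, a_set, a_dict):
    print(type(value).__name__)
```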
4- What are the basic differences between lists and tuples?
Your answer to this question reveals how well you understand the basic components of the language, such as lists and tuples and the difference between mutable and immutable objects.
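A minimal demonstration of the key difference, mutability:

```python
# Lists are mutable; tuples are immutable.
numbers_list = [1, 2, 3]
numbers_tuple = (1, 2, 3)

numbers_list[0] = 99         # fine: lists can be changed in place
print(numbers_list)          # [99, 2, 3]

try:
    numbers_tuple[0] = 99    # raises TypeError: tuples cannot be changed
except TypeError as error:
    print("Tuples are immutable:", error)

# Because tuples are immutable (and hashable), they can be used as dict keys.
lookup = {(1, 2): "a point"}
print(lookup[(1, 2)])
```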
5- What is __init__?
Some recruiters ask about the details of specific methods to test your knowledge of the language. The __init__ method is the initializer that Python calls automatically when a new object of a class is created, and it is usually where the object's attributes are set.
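A minimal example of a class with an __init__ initializer (the class and attribute names are invented for illustration):

```python
# __init__ runs automatically when a new object is created.
class Employee:
    def __init__(self, name, role):
        # Attributes are attached to the new instance here.
        self.name = name
        self.role = role

    def describe(self):
        return f"{self.name} works as a {self.role}"

worker = Employee("Sara", "data analyst")   # __init__ is called here
print(worker.describe())
```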
6- Explain the differences between .py and .pyc?
This is a general question in Python interviews. The interviewer wants to see whether the programmer understands that .py files contain human-readable source code, while .pyc files contain the bytecode the interpreter compiles and caches (under the __pycache__ folder) so that modules load faster on subsequent imports.
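A minimal way to see bytecode compilation in action, assuming a file named example.py exists in the current directory (normally Python does this step for you on import):

```python
# Compile a source file to bytecode explicitly.
import py_compile

# Produces a .pyc file under __pycache__/ next to the source file.
compiled_path = py_compile.compile("example.py")
print("Bytecode written to:", compiled_path)
```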
7- Describe Python namespaces.
This is one of the questions recruiters like to ask because namespaces are how Python keeps names mapped to the right objects. Being able to explain the built-in, global, and local namespaces, each essentially a dictionary from names to objects, is strong evidence of your proficiency in the language.
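A minimal look at the local and global namespaces, which Python exposes as dictionaries:

```python
# Namespaces map names to objects; globals() and locals() expose them as dicts.
greeting = "hello"            # lives in the module's global namespace

def shout():
    loud = greeting.upper()   # 'loud' lives in this function's local namespace
    print("local names:", list(locals().keys()))
    return loud

print(shout())                            # HELLO
print("greeting" in globals())            # True: defined at module level
print("loud" in globals())                # False: it only existed inside shout()
```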
8- What are all necessary Python keywords?
A basic but important question: any candidate should know Python's reserved keywords before the interview. There are 33 of them in older Python 3 releases and 35 in current versions (async and await were added), covering variable handling, control flow, and functional terms.
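The standard library can print the exact list for whichever Python version you are running:

```python
# The keyword module lists every reserved word for the running interpreter.
import keyword

print(len(keyword.kwlist), "keywords in this Python version")
print(keyword.kwlist)
print(keyword.iskeyword("lambda"))   # True
print(keyword.iskeyword("print"))    # False: print is a built-in function, not a keyword
```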
In this article, we will review the best data visualization books to help you raise your level and improve your performance in graphical representation.
1- The Data Visualization Sketchbook:
This book is a comprehensive guide to the rules of sketching and working with graphs, from the initial creation stage, through dashboard and slide design, all the way to completing the graph in an optimal way.
2- Storytelling with Data: A Data Visualization Guide for Business Professionals:
This book teaches the whole process of creating helpful visualizations from A to Z and how to draw the audience's attention to the main points of a visualization.
3- Effective Data Visualization: The Right Chart for the Right Data
This book stands out for its easy style and simple presentation of graphing concepts, focusing on Excel charts and graphs to communicate data findings very easily. It can also guide you toward successful visualizations and teach you how to choose the correct chart for your data.
4- Resonate: Present Visual Stories that Transform Audiences:
This book focuses on building memorable visualizations by putting all the elements together with suitable colors and specific criteria, so that you can present data findings to your audience in a distinctive, easy, and simple way.
5- Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks
Researchers lead the way in finding new methods and discoveries in all aspects of life, and this book is a guide that helps researchers present their findings better.
Finally:
Mastering mathematics and statistics, in addition to programming and graphical representation, will make you a professional in data science, and familiarity with visualization tools will let you get quick results with high efficiency.
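As a tiny, hedged illustration of how quickly a basic chart comes together in Python (the data is made up for the example):

```python
# A minimal bar chart with matplotlib; the numbers are hypothetical.
import matplotlib.pyplot as plt

channels = ["Email", "Social", "PPC", "Organic"]
visits = [380, 1100, 700, 1500]

plt.bar(channels, visits, color="steelblue")
plt.title("Monthly visits by channel")
plt.ylabel("Visits")
plt.tight_layout()
plt.show()
```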
To advance past the junior data scientist level, the key is to practice coding as much as you reasonably can so you stay on top of the field.
First: Python for Data Analysis is the ideal way to become more familiar with standard Python libraries such as NumPy and pandas, since you need these libraries for real-world data analysis and visualization. It is a complete manual that begins by reminding you how Python works and then investigates how to extract helpful insights from any data you may deal with as a data scientist.
Second: Python Data Science Handbook is an excellent guide to the standard Python libraries as well: NumPy, pandas, Matplotlib, Scikit-learn.
This book is a great reference for any data-related issue you may have as a data scientist: clean, transform, and manipulate data to discover what is behind the scenes.
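As a small, hedged illustration of the clean-and-transform workflow those libraries cover (the data and column names are invented):

```python
# Clean and summarize a tiny, hypothetical dataset with pandas.
import pandas as pd

raw = pd.DataFrame({
    "city": ["Riyadh", "Cairo", "riyadh", None],
    "sales": [120.0, 95.5, 80.0, 60.0],
})

cleaned = (
    raw.dropna(subset=["city"])                          # drop rows with missing city
       .assign(city=lambda d: d["city"].str.title())     # normalize capitalization
)

print(cleaned.groupby("city")["sales"].sum())
```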
Third: Python Machine Learning sits somewhere between intermediate and expert. It will appeal both to specialists and to readers who are somewhere in the middle.
It starts gently and then moves on to the latest advances in AI and machine learning.
It is a great read for any AI engineer or data scientist experimenting with machine learning algorithms!
Fourth: Hands-On Machine Learning with Scikit-Learn and TensorFlow (the second edition is out!) is a stunning reference for a mid-level data scientist.
This book covers all the basics (classification methods, dimensionality reduction) and then gets into neural networks and deep learning, using TensorFlow and Keras to build ML models.
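As a hedged, minimal taste of the kind of classification workflow such books start with (using a toy dataset bundled with scikit-learn):

```python
# Train and evaluate a simple classifier on scikit-learn's bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
```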
These are just some of the many important books for the intermediate level; if you know other books, please share them in the comments.
Data science is certainly one of the hottest job markets right now. Almost every organization has a data science position open or will open one soon. That means it is the ideal time to become a data scientist, or to sharpen your skills if you already are one and want to step up to more senior positions. To help with that, I will recommend the most valuable books for building data science skills. Keep in mind that books are good and necessary, but roughly 70% of your data analysis skill comes from practicing and completing projects.
Data Science books for Beginners
1- If you are just beginning your data science journey, you should start with this book:
You do not need to know Python in advance; this book is very helpful for starting from scratch, as you get a brief training in Python, learn the basic math for data science, and become able to break down and analyze data.
2- If you are a beginner in machine learning, you will find this book very helpful:
This book will help you understand what skills you need to become a data scientist, how data scientists do their jobs, and how to land your first interview for that first position.
These are the most important books for beginners who have decided to become data scientists. Good luck, and please share in the comments any other valuable beginner books in data science you know about, so we can all exchange experience.
Data scientists are a blend of mathematicians, trend-spotters, and computer scientists. Their job is to work with huge amounts of data and carry out deeper investigation to discover trends and gain a more profound understanding of what it all implies.
To start a career in data science you need skills such as analysis, machine learning, statistics, and Hadoop, alongside softer skills: critical thinking, persuasive communication, good listening, and problem solving.
This is an industry with plenty of opportunities, so once you have the education and capabilities, the positions are waiting for you, both now and later on.
Data Scientist Job Market:
These days data is considered very valuable, and organizations use the insights data scientists uncover to stay one step ahead of their competition. Big names like Apple, Microsoft, Google, and Walmart, among other famous companies, have many job openings for data scientists.
The data science role was named the most promising career of 2019 and has ranked among the top 50 positions in the US.
How to take your first step?
The academic requirements for data science jobs are among the highest in the IT business: about 40% of these positions today expect you to hold a postgraduate degree. There are also many platforms that teach data science online, such as edX, Coursera, Data world workshops, and many others.
These courses let you learn the most in-demand skills and techniques data scientists use, such as Power BI, Hadoop, R, SAS, Python, AI, and more.
Have you started your career? Write in the comments which platform you think is the best for learning these skills.