我感到有點羞愧;作為一名分析師,我的工作核心之一就是撰寫 SQL。事實上,如果你把這個問題提給 ChatGPT,它會這樣回答。

分析師需要哪些技術技能?

ChatGPT 的回應:

數據庫管理: 理解用於查詢和管理數據庫的 SQL(結構化查詢語言)。了解關聯數據庫的概念很重要。

我必須承認,有時候我在撰寫 SQL 時也會感到有些頭痛,但我仍然可以自信地說我在這方面相當熟練。

然而,今天,我終於 lived long enough to see myself become a villain。我設計了這個提示,正是為了以後再也不需要寫任何 SQL。

你確定要和我一起走向這個深淵?

好吧… 提示送你

I have a table in BigQuery as such data schema:

event_date
event_timestamp
event_name
event_params
event_previous_timestamp
event_value_in_usd
event_bundle_sequence_id
event_server_timestamp_offset
user_id
user_pseudo_id
privacy_info
user_properties
user_first_touch_timestamp
user_ltv
device
geo
app_info
traffic_source
stream_id
platform
event_dimensions
ecommerce
items
collected_traffic_source
is_active_user
event_params
items.item_id
items.item_name
items.item_brand
items.item_variant
items.item_category
items.item_category2
items.item_category3
items.item_category4
items.item_category5
items.price_in_usd
items.price
items.quantity
items.item_revenue_in_usd
items.item_revenue
items.item_refund_in_usd
items.item_refund
items.coupon
items.affiliation
items.location_id
items.item_list_id
items.item_list_name
items.item_list_index
items.promotion_id
items.promotion_name
items.creative_name
items.creative_slot
privacy_info.analytics_storage
privacy_info.ads_storage
privacy_info.uses_transient_token
user_properties
user_ltv.revenue
user_ltv.currency
device.category
device.mobile_brand_name
device.mobile_model_name
device.mobile_marketing_name
device.mobile_os_hardware_model
device.operating_system
device.operating_system_version
device.vendor_id
device.advertising_id
device.language
device.is_limited_ad_tracking
device.time_zone_offset_seconds
device.browser
device.browser_version
device.web_info
device.web_info.browser
device.web_info.browser_version
device.web_info.hostname
traffic_source.name
traffic_source.medium
traffic_source.source
geo.continent
geo.country
geo.region
geo.city
geo.sub_continent
geo.metro
app_info.id
app_info.version
app_info.install_store
app_info.firebase_app_id
app_info.install_source
event_dimensions.hostname
ecommerce.total_item_quantity
ecommerce.purchase_revenue_in_usd
ecommerce.purchase_revenue
ecommerce.refund_value_in_usd
ecommerce.refund_value
ecommerce.shipping_value_in_usd
ecommerce.shipping_value
ecommerce.tax_value_in_usd
ecommerce.tax_value
ecommerce.unique_items
ecommerce.transaction_id
collected_traffic_source.manual_campaign_id
collected_traffic_source.manual_campaign_name
collected_traffic_source.manual_source
collected_traffic_source.manual_medium
collected_traffic_source.manual_term
collected_traffic_source.manual_content
collected_traffic_source.gclid
collected_traffic_source.dclid
collected_traffic_source.srsltid

under event_params, it is a nested structure with a key & value pair, where under value, there is 4 types, one event_params can     have many key & value pairs
structure as below: 
event_params.key
event_params.value
event_params.value.string_value
event_params.value.int_value
event_params.value.float_value
event_params.value.double_value

under user_properties, it is a nested structure with a key & value pair, where under value, there is 5 types, one user_properties can     have many key & value pairs
structure as below: 
user_properties.key
user_properties.value
user_properties.value.string_value
user_properties.value.int_value
user_properties.value.float_value
user_properties.value.double_value
user_properties.value.set_timestamp_micros

under items it is list of item_id in an array structure, each item_id can up to have 25 dimension , one items can have many item_id, the item_id and the rest of the dimension is in 1:1     relationship.

structure as below: 
items.item_id
items.item_name
items.item_brand
items.item_variant
items.item_category
items.item_category2
items.item_category3
items.item_category4
items.item_category5
items.price_in_usd
items.price
items.quantity
items.item_revenue_in_usd
items.item_revenue
items.item_refund_in_usd
items.item_refund
items.coupon
items.affiliation
items.location_id
items.item_list_id
items.item_list_name
items.item_list_index
items.promotion_id
items.promotion_name
items.creative_name
items.creative_slot

Finally, remember only the items, user_properties and event_params     is a nested data, the rest is non-nested.
Also, for each event_params, there could be multiple event_params.key, meaning if you used subquery in the WHERE condition, it is     likely that the other evet_params.key will not be accessible.
When selecting the event_name with specific event_params.key condition but pulling a different event_params.key within the same event, it is recommend to create a table to get those key value     pairs first, then query on that table to get the final output.

here is an example of getting of getting the key "source" from the     event_params and count the unique number of user_pseudo_id.
SELECT
(SELECT params.value.string_value FROM UNNEST (event_params) AS     params WHERE params.key = 'source') AS session_source,
  COUNT( DISTINCT user_pseudo_id) AS unique_users
FROM
  `YOUR_TABLE_*`
WHERE_TABLE_SUFFIX BETWEEN "20230101" AND "20230701"
GROUP BY
  session_source

Now, based on the above information, can you write me a standard SQL to get the event number of event with name page_view and the corresponding page_location in the event_params?  
I only want to get those event where the event_params .key is with source equal facebook.

welcome to the dark side

而你唯一要做的,便是把最後一個段落換成你的要求。

Now, based on the above information, can you write me a standard SQL to get the event number of event with name page_view and the corresponding page_location in the event_params?

不客氣,我們成功令分析師失業了。

測試

先不開玩笑,儘管這個提示很強大,但成功率一定不是百分百,個人經驗在 80-90%,所以你可能還需要稍微修改一下。

以下是我從 ChatGPT 獲得的範例:

拿到特定來源的頁面瀏覽數

提示

Now, based on the above information, can you write me a standard SQL to get the event number of event with name page_view and the corresponding page_location in the event_params? I only want to get those event where the event_params.key is with source equal facebook.

ChatGPT 回覆:

WITH EventsWithFacebookSource AS (
SELECT
    event_name,
    (SELECT params.value.string_value FROM UNNEST (event_params) AS params WHERE params.key = 'source') AS event_source,
    (SELECT params.value.string_value FROM UNNEST (event_params) AS params WHERE params.key = 'page_location') AS page_location
FROM
    `YOUR_TABLE_*`
WHERE
    _TABLE_SUFFIX BETWEEN "20230101" AND "20230701"
    AND event_name = 'page_view'
)
SELECT
    event_source,
    page_location,
    COUNT(*) AS event_count
FROM
    EventsWithFacebookSource
WHERE
    event_source = 'facebook'
GROUP BY
    event_source, page_location
ORDER BY
    event_count DESC;

不錯。

拿到所有來源以及計數

提示:

can you write me the SQL to list all source (source is from event_params.key) and the corresponding event count?

ChatGPT 回覆:

SELECT
    (SELECT params.value.string_value FROM UNNEST(event_params) AS params WHERE params.key = 'source') AS session_source,
    COUNT(*) AS event_count
FROM
    `YOUR_TABLE_*`
WHERE
    _TABLE_SUFFIX BETWEEN "20230101" AND "20230701"
GROUP BY
    session_source
ORDER BY
    event_count DESC;

也不錯。

拿特定來源頭10頁面瀏覽

提示:

Now, based on the above information, can you write me a standard SQL to get the top 10 page_location? page_location is a key under the string_value in the event page_view, I only want to get those event where the event_params.key is with source equal facebook.

ChatGPT 回覆:

WITH PageViews AS (
SELECT
    user_pseudo_id,
    (SELECT params.value.string_value FROM UNNEST (event_params) AS params WHERE params.key = 'source') AS event_source,
    (SELECT params.value.string_value FROM UNNEST (event_params) AS params WHERE params.key = 'page_location') AS page_location
FROM
    `YOUR_TABLE_*`
WHERE
    _TABLE_SUFFIX BETWEEN "20230101" AND "20230701"
    AND event_name = 'page_view'
)
SELECT
    page_location,
    COUNT(DISTINCT user_pseudo_id) AS unique_users
    FROM
    PageViews
WHERE
    event_source = 'facebook'
GROUP BY
    page_location
ORDER BY
    unique_users DESC
LIMIT
10;

不錯。

計算來源的工作階段以及發生特定事件的次數

提示:

Now, based on the above information, can you group the session by source, session is defined as distinctive ga_session_id under event_params.value.int_value, then calculate how many of those session have completed a ‘scroll’ event_name?

ChatGPT 回覆:

WITH SessionSources AS (
SELECT
    (SELECT params.value.string_value FROM UNNEST (event_params) AS params WHERE params.key = 'source') AS session_source,
    (SELECT params.value.int_value FROM UNNEST (event_params) AS params WHERE params.key = 'ga_session_id') AS session_id
FROM
    `YOUR_TABLE_*`
WHERE
    _TABLE_SUFFIX BETWEEN "20230101" AND "20230701"
)
SELECT
    session_source,
    COUNT(DISTINCT session_id) AS sessions_with_scroll_event
FROM
    SessionSources
WHERE
    session_id IN (
        SELECT
        (SELECT params.value.int_value FROM UNNEST (event_params) AS params WHERE params.key = 'ga_session_id') AS session_id
        FROM
        `YOUR_TABLE_*`
        WHERE
        _TABLE_SUFFIX BETWEEN "20230101" AND "20230701" AND
        event_name = 'scroll'
    )
GROUP BY
    session_source

還好,這樣只會有scroll的事件才會算進去,轉換率會變100%。

結語

運用 ChatGPT 來處理 SQL 確實能夠減少撰寫 SQL 所需的時間,而且它還有可能帶來更具創意且潛在效率更高的 SQL。(其中一個問題是,由 ChatGPT 生成的 SQL 往往不太讀者友好。)

總之,儘管技術知識和技能確實具有價值,但隨著時間的推移,它可能會貶值。雖然擁有技術技能仍然至關重要(正如在正確且準確輸入提示和修改回應的必要性所示),但理解「如何」並且尋找正確的解決方案變得越來越重要。

一名優秀的分析師意味著不僅僅擁有執行工作的技能集或成為某一特定平台的專家。我相信最優秀的分析師是那些能夠提出正確問題、找出真正的問題並隨後提供適切解決方案以產生有重大影響的人。